CHAPTER 04 Intermediate

NumPy, Pandas, and Data Handling

Updated: May 16, 2026

1. Introduction

A neural network is essentially a giant mathematical blender. It takes numbers in, multiplies them by other numbers, and spits numbers out. It cannot read an Excel file directly, and standard Python lists are too slow for the math involved. To prepare data for PyTorch, we must use specialized scientific libraries. NumPy provides blazing-fast multidimensional arrays for matrix math, while Pandas acts as a programmable spreadsheet to clean and organize real-world data. In this chapter, we will cover the data handling skills you need before touching a model.

2. Learning Objectives

By the end of this chapter, you will be able to:

Create and manipulate NumPy ndarrays.

Understand the shape and dimensions of data matrices.

Load datasets (like CSVs) using Pandas DataFrames.

Preprocess, filter, and clean missing data.

Transition data from Pandas to PyTorch.

3. NumPy Basics and NDArrays

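NumPy (Numerical Python) is implemented largely in C. It introduces the ndarray (N-Dimensional Array), which stores elements of a single type in contiguous memory and is often orders of magnitude faster than a standard Python list for numerical work.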

python

import numpy as np

# 1D Array (Vector) - e.g., a single row of data
vector = np.array([1, 2, 3, 4])

# 2D Array (Matrix) - e.g., a grayscale image or a spreadsheet
matrix = np.array([
    [10, 20],
    [30, 40],
    [50, 60]
])

print(f"Matrix Shape: {matrix.shape}")
# Output: Matrix Shape: (3, 2) -> 3 rows, 2 columns

*Understanding .shape is critical. Many of the errors beginners face in PyTorch are shape mismatch errors!*
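
To make the idea of shape more concrete, here is a minimal sketch that inspects and reshapes the same kind of matrix. The `weights` vector and the variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

# Same layout as the matrix above: 3 rows (samples), 2 columns (features)
matrix = np.array([
    [10, 20],
    [30, 40],
    [50, 60],
])

print(matrix.ndim)   # number of axes
print(matrix.shape)  # (rows, columns)
print(matrix.size)   # total number of elements

# reshape rearranges the same 6 elements into a new layout
flat = matrix.reshape(6)        # 1D vector of length 6
wide = matrix.reshape(2, 3)     # 2 rows, 3 columns
column = matrix.reshape(-1, 1)  # -1 lets NumPy infer the row count

# Matrix-vector multiplication works when the inner dimensions agree:
# (3, 2) @ (2,) gives a result of shape (3,)
weights = np.array([0.5, 0.5])  # hypothetical weights, one per feature
scores = matrix @ weights

# When the dimensions disagree, NumPy raises a ValueError
try:
    matrix @ np.array([1, 2, 3])  # (3, 2) @ (3,) does not line up
except ValueError as err:
    print("Shape mismatch:", err)
```

The last few lines show the kind of failure the note above warns about: the number of columns in the data has to match the length of whatever it is multiplied with.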

4. Vectorized Operations

In standard Python, if you want to multiply every item in a list by 5, you have to write a for loop (or a list comprehension). In NumPy, you use Vectorization: a single expression is applied to the entire array, and the looping happens in NumPy's compiled C backend instead of in Python.

python

prices = np.array([10, 20, 30])
# Multiply every element by 2 in a single operation
new_prices = prices * 2

print(new_prices) # Output: [20 40 60]
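
For comparison, the sketch below puts the plain Python version next to the vectorized one, using the "multiply by 5" case described above. The `quantities` array is made up for illustration.

```python
import numpy as np

prices = [10, 20, 30]

# Plain Python: each element is handled explicitly
looped = [p * 5 for p in prices]

# Pitfall: multiplying a list by an integer repeats the list instead
repeated = prices * 5  # a list of 15 items, not scaled prices

# NumPy: one expression covers the whole array
prices_arr = np.array(prices)
vectorized = prices_arr * 5

# Arithmetic between two arrays of the same shape is elementwise
quantities = np.array([3, 1, 2])  # hypothetical units sold
revenue = prices_arr * quantities
total = revenue.sum()

print(looped)
print(vectorized)
print(revenue, total)
```

Note the pitfall in the middle: `*` on a list means repetition, which is one more reason to convert to an array before doing math.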

5. Pandas Basics and DataFrames

While NumPy is great for pure math, it doesn't handle column names or missing text values well. Pandas is the ultimate data wrangling tool. Its core object is the DataFrame (a 2D table).

python

import pandas as pd

# Creating a DataFrame manually
data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 65000, 80000],
    "Purchased": [0, 1, 1]
}

df = pd.DataFrame(data)
print(df)
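
Once a DataFrame exists, a few attributes and selections cover most day-to-day inspection. The following is a small sketch on the same three-row table; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 65000, 80000],
    "Purchased": [0, 1, 1],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names
print(df.dtypes)         # data type of each column

# Single brackets with one name return a Series (one column)
ages = df["Age"]

# Double brackets with a list of names return a smaller DataFrame
subset = df[["Age", "Salary"]]

# Column-wise statistics are one method call away
mean_salary = df["Salary"].mean()
print(mean_salary)
```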

6. Reading CSV Files

In reality, you will load data from external files, usually CSVs downloaded from Kaggle or your company's database.

python

# Load a CSV file (Assuming 'customer_data.csv' exists)
# df = pd.read_csv("customer_data.csv")

# View the first 5 rows
print(df.head())

# View a statistical summary (mean, min, max)
print(df.describe())
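
Since 'customer_data.csv' may not exist on your machine, here is a self-contained sketch that feeds `pd.read_csv` an in-memory string instead of a file path. The CSV contents are invented for illustration; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a real file on disk
csv_text = """Age,Salary,Purchased
25,50000,0
30,65000,1
35,80000,1
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))     # first 2 rows (head() defaults to 5)
print(df.describe())  # count, mean, std, min, quartiles, max per column
```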

7. Data Preprocessing with Pandas

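Neural networks hate missing data (NaN values): any arithmetic involving a NaN produces another NaN, so a single missing cell can turn the network's output into NaN. We must use Pandas to clean the data before feeding it to PyTorch.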

python

# Check for missing values
print(df.isnull().sum())

# Drop any row that is missing data
df_clean = df.dropna()

# Alternatively, fill missing ages with the average age
mean_age = df['Age'].mean()
# Assign the result back; fillna returns a new Series
df['Age'] = df['Age'].fillna(mean_age)
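
The snippet above assumes `df` already contains gaps. As a runnable sketch, the example below builds a small table with missing values on purpose and applies the same steps, plus a simple row filter. The data and the age threshold are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing Salary
df = pd.DataFrame({
    "Age": [25, np.nan, 35, 40],
    "Salary": [50000, 65000, np.nan, 90000],
})

# Count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one NaN
dropped = df.dropna()

# Option 2: fill each column's gaps with that column's mean
filled = df.fillna(df.mean())

# Filtering: keep only rows where a condition holds
older = filled[filled["Age"] > 30]

print(dropped)
print(filled)
print(older)
```

Both `dropna` and `fillna` return new objects here, so the original `df` still contains its NaN values. Dropping rows is simpler but throws data away; filling keeps every row at the cost of inventing plausible values.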

8. Mini Project: Dataset Exploration Project

Let's extract exactly what a neural network needs: The Features (Inputs/X) and the Label (Output/y).

python

import pandas as pd

# Mock Data
df = pd.DataFrame({
    "Age": [22, 25, 47, 52, 46],
    "Credit_Score": [600, 650, 800, 750, 710],
    "Approved_Loan": [0, 0, 1, 1, 1] # 0 = No, 1 = Yes
})

# 1. Isolate the Features (X)
# Drop the target column to keep only the inputs
X = df.drop("Approved_Loan", axis=1)

# 2. Isolate the Target Label (y)
y = df["Approved_Loan"]

# 3. Convert Pandas to NumPy array (PyTorch prefers NumPy arrays for conversion)
X_array = X.to_numpy()
y_array = y.to_numpy()

print("Features Shape:", X_array.shape) # Output: Features Shape: (5, 2)

*In Chapter 6, we will learn how to turn this X_array directly into a PyTorch Tensor!*

9. Common Mistakes

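Confusing Pandas indexing: Trying to select the first row using df[0]. Square brackets on a DataFrame look up a column by label, so df[0] raises a KeyError unless a column is actually named 0. Use df.iloc[0] for integer-location based indexing.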

Feeding Pandas DataFrames directly into PyTorch: PyTorch layers do not understand Pandas DataFrames. Convert your DataFrame to a NumPy array with .to_numpy(), and then to a PyTorch Tensor, before training.
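
The indexing mistake is easier to see in code. Below is a minimal sketch using a hypothetical table with string row labels, so that position-based and label-based access are clearly different.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [22, 25, 47], "Credit_Score": [600, 650, 800]},
    index=["a", "b", "c"],  # hypothetical row labels
)

# iloc selects by integer position
first_row = df.iloc[0]

# loc selects by label
row_b = df.loc["b"]

# iloc also accepts (row position, column position)
cell = df.iloc[2, 1]

# Plain brackets look for a COLUMN named 0, which does not exist here
try:
    df[0]
except KeyError:
    print("df[0] failed: there is no column labeled 0")

# Converting to NumPy keeps the values and drops the labels
features = df.to_numpy()
print(features.shape)
```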

10. Best Practices

Always check shapes: Get into the habit of printing X.shape and y.shape before you build your neural network. The input layer of your network *must* match the number of columns in X.
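
One way to turn this habit into code is to unpack the shape and assert on it before any model is built. This is only a sketch on the mock loan data from the mini project; the float32 cast reflects PyTorch's default floating-point type and is an assumption about what your model will expect.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 25, 47, 52, 46],
    "Credit_Score": [600, 650, 800, 750, 710],
    "Approved_Loan": [0, 0, 1, 1, 1],
})

X = df.drop("Approved_Loan", axis=1).to_numpy()
y = df["Approved_Loan"].to_numpy()

# Unpack the shape: rows are samples, columns are features
n_samples, n_features = X.shape
print("X shape:", X.shape)
print("y shape:", y.shape)

# Every sample needs exactly one label
assert y.shape == (n_samples,), "X and y disagree on the number of samples"

# n_features is the size the network's input layer must accept
print("Input layer size:", n_features)

# Assumption: cast to float32 ahead of the later tensor conversion
X_float = X.astype(np.float32)
```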

11. Exercises

1. Create a NumPy array containing the numbers 1 through 5. Square every number in the array using a single mathematical operation (Vectorization).

2. If you load a Pandas DataFrame and df.shape returns (1000, 15), what does that mean in terms of rows and columns?

12. MCQ Quiz with Answers

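Question 1

Why are NumPy arrays preferred over standard Python lists for Deep Learning?

Answer: NumPy arrays support fast, vectorized math backed by C, whereas Python lists need slow element-by-element loops.

Question 2

In Pandas, what method is used to drop rows that contain missing (NaN) values?

Answer: dropna()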

13. Interview Questions

Q: Explain the difference in purpose between NumPy and Pandas in a Data Science workflow.

Q: What does the .shape attribute of a NumPy array tell you, and why is it critical when designing a Neural Network?

14. FAQs

Q: Does PyTorch have its own data structures?

A: Yes! PyTorch uses Tensors. Tensors behave much like NumPy arrays, but they can also run on GPUs and track the gradients needed for deep learning. (We cover Tensors in Chapter 6.)

15. Summary

NumPy and Pandas are the unsung heroes of Artificial Intelligence. While PyTorch gets all the glory for building the "brain", it is NumPy that provides the fast mathematical infrastructure, and Pandas that organizes the chaotic real-world data into clean matrices ready for consumption.

16. Next Chapter Recommendation

We have the tools. We have the data. Now, we need to understand the architecture of the brain we are trying to build. In Chapter 5: Understanding Neural Networks, we will dive into the theory of Artificial Neurons, Layers, and Backpropagation.
