Regression Models
CHAPTER 04 Intermediate

NumPy, Pandas, and Data Preparation

Updated: May 16, 2026

# CHAPTER 4

1. Introduction

Machine learning algorithms like Linear Regression are essentially giant mathematical formulas. They cannot read Excel files, and they cannot process standard Python lists efficiently. To prepare data for scikit-learn, we must use specialized scientific libraries. NumPy provides blazing-fast multidimensional arrays for matrix math, while Pandas acts as a programmable spreadsheet to load, clean, and organize real-world data. In this chapter, we will master data wrangling.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Create and manipulate NumPy ndarrays.
  • Load datasets (like CSVs) using Pandas DataFrames.
  • Explore data using .head(), .info(), and .describe().
  • Filter rows based on specific conditions.
  • Handle missing data (NaN) before training a model.

3. NumPy Basics

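NumPy (Numerical Python) introduces the ndarray (N-dimensional array). For numerical work it is often orders of magnitude faster than a standard Python list, because it stores values of a single type in one contiguous block of memory and runs its loops in compiled C code.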
python
import numpy as np

# 1D Array (Vector) - e.g., a single column of ages
ages = np.array([25, 30, 35, 40])

# 2D Array (Matrix) - e.g., a spreadsheet with rows and columns
data_matrix = np.array([
    [25, 50000], # Row 1: Age 25, Salary 50k
    [30, 60000], # Row 2: Age 30, Salary 60k
    [35, 75000]  # Row 3: Age 35, Salary 75k
])

print(f"Matrix Shape: {data_matrix.shape}")
# Output: Matrix Shape: (3, 2) -> 3 rows, 2 columns
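
To make the "manipulate" part of the learning objectives concrete, here is a minimal sketch of slicing, vectorized arithmetic, and aggregation on the same matrix. The 10% raise and the variable names are illustrative assumptions, not part of the chapter's dataset.

```python
import numpy as np

# Same illustrative age/salary matrix as in the example above
data_matrix = np.array([
    [25, 50000],
    [30, 60000],
    [35, 75000],
])

# Slicing: every row (:), second column (index 1) -> the salaries
salaries = data_matrix[:, 1]

# Vectorized arithmetic: one expression updates every element, no Python loop
# (the 10% raise is a made-up example)
raised = salaries * 1.10

# Aggregation along an axis: axis=0 collapses the rows, giving one mean per column
column_means = data_matrix.mean(axis=0)

print(salaries)
print(raised)
print(column_means)
```

With a plain Python list, each of these steps would need an explicit loop or list comprehension.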

4. Pandas Basics and DataFrames

While NumPy is great for pure math, it has no column names and is awkward with mixed data types or missing values. Pandas fills that gap for tabular data. Its core object is the DataFrame (a 2D table with labeled rows and columns).
python
import pandas as pd

# Creating a DataFrame manually
data = {
    "Square_Feet": [1500, 2000, 2500],
    "Bedrooms": [3, 4, 4],
    "Price": [300000, 450000, 500000]
}

df = pd.DataFrame(data)
print(df)

5. Reading CSV Files and Exploration

In reality, you will load data from external files, usually CSVs. Once loaded, you must explore it to understand its structure.
python
# Load a CSV file (Assuming 'housing_data.csv' exists in your folder)
# df = pd.read_csv("housing_data.csv")

# View the first 5 rows
print(df.head())

# View dataset information (column names, data types, and non-null counts)
# df.info() prints its report directly and returns None, so no print() is needed
df.info()

# View statistical summary (Mean, Standard Deviation, Min, Max of numerical columns)
print(df.describe())
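
If you do not have a CSV file at hand, the loading step can still be tried end to end. The sketch below assumes a hypothetical three-row CSV held in a string and reads it through io.StringIO, since pd.read_csv accepts a file-like object as well as a path. The contents stand in for 'housing_data.csv' and are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for 'housing_data.csv'
csv_text = """Square_Feet,Bedrooms,Price
1500,3,300000
2000,4,450000
2500,4,500000
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows (up to 5)
df.info()             # prints column names, dtypes, and non-null counts
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
```

Replacing io.StringIO(csv_text) with a real file path gives the usual on-disk workflow.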

6. Data Filtering

Pandas lets you select columns and filter rows with boolean conditions, much like a WHERE clause in SQL.
python
# Get a single column (Returns a Pandas "Series")
prices = df['Price']

# Filter rows: Find all houses with more than 3 bedrooms
large_houses = df[df['Bedrooms'] > 3]

# Multiple conditions: Houses > 3 bedrooms AND Price < 460000
# Wrap each condition in parentheses and combine them with & (not "and")
ideal_houses = df[(df['Bedrooms'] > 3) & (df['Price'] < 460000)]
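
It can help to see what the filter is doing under the hood: the condition produces a boolean mask with one True/False per row, and the DataFrame keeps only the True rows. The sketch below rebuilds the small housing table from earlier and also shows DataFrame.query, which expresses the same filter as a string, plus an OR condition. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Square_Feet": [1500, 2000, 2500],
    "Bedrooms": [3, 4, 4],
    "Price": [300000, 450000, 500000],
})

# The condition itself is a boolean Series: one True/False per row
mask = (df["Bedrooms"] > 3) & (df["Price"] < 460000)
ideal_houses = df[mask]

# The same filter written as a query string
ideal_query = df.query("Bedrooms > 3 and Price < 460000")

# OR uses |, and each condition still needs its own parentheses
small_or_cheap = df[(df["Square_Feet"] < 1800) | (df["Price"] < 400000)]

print(mask.tolist())
print(ideal_houses)
print(small_or_cheap)
```

Both forms return a new DataFrame and leave df unchanged.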

7. Handling Missing Data

Real-world data is messy. If a CSV has a blank cell, Pandas loads it as NaN (Not a Number). Most scikit-learn estimators, including LinearRegression, reject input containing NaN and raise an error, so missing values must be handled before training.
python
# Check exactly how many missing values are in each column
print(df.isnull().sum())

# Strategy 1: Drop any row that is missing data (Good if you have millions of rows)
df_clean = df.dropna()

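# Strategy 2: Imputation (Fill missing values with the column's average)
# This is often preferred when dropping rows would discard too much data
mean_sqft = df['Square_Feet'].mean()
# Assign the result back; fillna(..., inplace=True) on a selected column
# triggers a warning in recent pandas versions and may not update df
df['Square_Feet'] = df['Square_Feet'].fillna(mean_sqft)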
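
The two strategies can also be combined. The sketch below is one possible arrangement, using a hypothetical four-row table with two missing cells: rows with a missing target (Price) are dropped, and the missing feature value is then filled with the column mean on a copy. The data and the choice of which column to drop or fill are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical housing rows with two missing cells
raw = pd.DataFrame({
    "Square_Feet": [1500.0, np.nan, 2500.0, 2000.0],
    "Bedrooms": [3, 4, 4, 3],
    "Price": [300000.0, 450000.0, np.nan, 380000.0],
})

# Count missing values per column
missing_counts = raw.isnull().sum()

# Drop a row only when the target column is missing
has_target = raw.dropna(subset=["Price"])

# Fill the remaining gap in the feature column with that column's mean,
# working on a copy so that raw stays untouched
filled = has_target.copy()
filled["Square_Feet"] = filled["Square_Feet"].fillna(filled["Square_Feet"].mean())

print(missing_counts)
print(filled)
```

The mean is computed after the drop, so it reflects only the rows that will actually be used.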

8. Mini Project: Prepare Housing Dataset for ML

Let's extract exactly what a machine learning model needs: The Features (Inputs/X) and the Label (Output/y).
python
import pandas as pd

# Mock Data
df = pd.DataFrame({
    "Age_Years": [10, 5, 20, 1, 15],
    "Bedrooms": [3, 4, 2, 5, 3],
    "Price": [250000, 400000, 150000, 600000, 200000]
})

# 1. Isolate the Features (X)
# We drop the target column to keep ONLY the inputs
X = df.drop("Price", axis=1)

# 2. Isolate the Target Label (y)
# This is the single column we want to predict
y = df["Price"]

print("Features (X) shape:", X.shape) # Output: (5, 2)
print("Target (y) shape:", y.shape)   # Output: (5,)

*Your features and target are now separated and ready to be fed into a Regression model in the upcoming chapters!*

9. Common Mistakes

  • Confusing loc and iloc: To select rows by integer position, use iloc. Positions start at 0, so df.iloc[5] returns the sixth row. df.loc[5] instead looks for the row whose index label is 5, which may be a different row (or missing entirely) if the data was shuffled, sorted, or filtered.
  • Forgetting axis=1 when dropping columns: df.drop('Price') uses the default axis=0, so Pandas looks for a *row* labeled 'Price' and raises a KeyError. Pass axis=1, or write df.drop(columns='Price'), to drop a *column*.
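
Both mistakes are easy to reproduce on a tiny table. The sketch below uses a hypothetical DataFrame whose index labels (10, 20, 30) deliberately differ from the row positions (0, 1, 2), and catches the resulting errors so the script runs to the end.

```python
import pandas as pd

df = pd.DataFrame(
    {"Bedrooms": [3, 4, 2], "Price": [250000, 400000, 150000]},
    index=[10, 20, 30],  # labels deliberately differ from positions
)

by_position = df.iloc[0]  # first row, whatever its label is
by_label = df.loc[10]     # the row whose index label is 10

# No row is labeled 0, so loc raises a KeyError
try:
    df.loc[0]
    loc_failed = False
except KeyError:
    loc_failed = True

# With the default axis=0, drop looks for a row labeled "Price"
try:
    df.drop("Price")
    drop_failed = False
except KeyError:
    drop_failed = True

# Either of these drops the column instead
features = df.drop(columns="Price")  # same result as df.drop("Price", axis=1)

print(loc_failed, drop_failed)
print(features)
```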

10. Best Practices

  • Never overwrite your raw data: Always keep your original df intact. When cleaning or filtering, save it to a new variable (e.g., df_clean = df.dropna()) so you can backtrack if you make a mistake.

11. Exercises

  1. If you load a Pandas DataFrame and df.shape returns (1000, 15), what does that mean in terms of rows and columns?
  2. Write the Pandas code to fill all NaN values in a column named Salary with the number 0.

12. MCQ Quiz with Answers

Question 1

What is the primary difference in functionality between NumPy and Pandas?

Question 2

In Machine Learning terminology, when we separate our DataFrame into X and y, what does X represent?

13. Interview Questions

  • Q: What does the .shape attribute of a Pandas DataFrame tell you, and why is it critical to check this before training a model?
  • Q: If your dataset contains 5% missing values in a critical column, explain the pros and cons of using .dropna() versus using .fillna(mean).

14. FAQs

Q: Do I need to convert Pandas DataFrames into NumPy arrays before giving them to scikit-learn? A: In most cases, no. scikit-learn estimators accept Pandas DataFrames directly and convert them into NumPy arrays under the hood during training.
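
To get a feel for the conversion, you can do it by hand with to_numpy(). This is only a rough illustration using the mock data from the mini project. scikit-learn's own input validation does more than this single call.

```python
import numpy as np
import pandas as pd

# Mock data from the mini project
df = pd.DataFrame({
    "Age_Years": [10, 5, 20, 1, 15],
    "Bedrooms": [3, 4, 2, 5, 3],
    "Price": [250000, 400000, 150000, 600000, 200000],
})
X = df.drop("Price", axis=1)
y = df["Price"]

# Explicit conversion: column names and the index are dropped, values remain
X_array = X.to_numpy()
y_array = y.to_numpy()

print(type(X_array), X_array.shape)
print(type(y_array), y_array.shape)
```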

15. Summary

Data cleaning is often cited as the most time-consuming part of a data scientist's job. If you feed garbage data into even the best machine learning algorithm, it will output garbage predictions. By mastering Pandas DataFrames, handling missing (NaN) values, and isolating your X features from your y target, you make sure your data is clean and ready for a model.

16. Next Chapter Recommendation

We have the tools. We have the clean data. But how does a computer actually "learn" a trend from these numbers? In Chapter 5: Understanding Regression Fundamentals, we will dive into the core math, exploring regression lines, correlation, and the eternal battle of Bias vs Variance.
