CHAPTER 04
Intermediate
NumPy, Pandas, and Data Preparation
Updated: May 16, 2026
1. Introduction
Machine learning algorithms like Linear Regression are essentially giant mathematical formulas. They cannot read Excel files, and they cannot process standard Python lists efficiently. To prepare data for scikit-learn, we must use specialized scientific libraries. NumPy provides blazing-fast multidimensional arrays for matrix math, while Pandas acts as a programmable spreadsheet to load, clean, and organize real-world data. In this chapter, we will master data wrangling.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Create and manipulate NumPy ndarrays.
- Load datasets (like CSVs) using Pandas DataFrames.
- Explore data using .head(), .info(), and .describe().
- Filter rows based on specific conditions.
- Handle missing data (NaN) before training a model.
3. NumPy Basics
NumPy (Numerical Python) introduces the ndarray (N-dimensional array). For numerical work it is often orders of magnitude faster than a standard Python list, because its elements share a single data type and its operations run in compiled C code.
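As a minimal sketch of these ideas, the snippet below builds a 1-D and a 2-D array and applies arithmetic to every element at once. The variable names and numbers are illustrative, not taken from a real dataset.

```python
import numpy as np

# A 1-D array built from a Python list (values are illustrative).
prices = np.array([250_000, 310_000, 199_000, 420_000])

# Vectorized arithmetic: the division applies to every element at once,
# with no explicit Python loop.
prices_in_thousands = prices / 1000

# A 2-D array (matrix): 3 rows, 2 columns.
features = np.array([[1400, 3],
                     [1900, 4],
                     [1100, 2]])

print(prices_in_thousands)    # element-wise result
print(features.shape)         # (rows, columns)
print(features.mean(axis=0))  # one mean per column
```

Passing axis=0 to mean() collapses the rows, so the result holds one value per column.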
4. Pandas Basics and DataFrames
While NumPy is great for pure math, it has no built-in column names and is awkward for mixed text and numeric data. Pandas fills that gap as a data wrangling tool built on top of NumPy. Its core object is the DataFrame (a 2D table with labeled rows and columns).
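One common way to create a DataFrame is from a dictionary that maps column names to lists of values. The housing columns and values below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical housing records; column names are illustrative.
df = pd.DataFrame({
    "SquareFeet": [1400, 1900, 1100],
    "Bedrooms": [3, 4, 2],
    "City": ["Austin", "Denver", "Boise"],
    "Price": [250_000, 310_000, 199_000],
})

print(df)           # the whole table
print(df["Price"])  # one column, returned as a Series
print(df.dtypes)    # each column keeps its own data type
```

Unlike a NumPy array, the table can mix numeric columns with a text column such as City.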
5. Reading CSV Files and Exploration
In practice, you will load data from external files, usually CSVs. Once the data is loaded, you must explore it to understand its structure: column names, data types, and missing values.
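The sketch below loads CSV data and runs the usual first-look commands. To keep it self-contained, it reads from an in-memory string rather than a file on disk; with a real file you would pass its path to pd.read_csv instead. The file name in the comment and the column values are hypothetical.

```python
import io
import pandas as pd

# Stand-in for a file on disk so the example runs anywhere.
# With a real file: df = pd.read_csv("housing.csv")  (hypothetical file name)
csv_text = """SquareFeet,Bedrooms,Price
1400,3,250000
1900,4,310000
1100,2,
1600,3,275000
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows (up to 5 by default)
df.info()             # column names, non-null counts, dtypes (prints directly)
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
print(df.shape)       # (rows, columns)
```

The third record has an empty Price cell, so info() reports one fewer non-null value for that column.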
6. Data Filtering
Pandas lets you filter rows with boolean conditions, much like a WHERE clause in SQL.
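A minimal sketch of row filtering, assuming a small hypothetical housing table: a comparison on a column produces a boolean mask, and indexing with that mask keeps only the matching rows.

```python
import pandas as pd

df = pd.DataFrame({
    "SquareFeet": [1400, 1900, 1100, 1600],
    "Bedrooms": [3, 4, 2, 3],
    "Price": [250_000, 310_000, 199_000, 275_000],
})

# A comparison on a column yields a boolean Series (the mask).
is_expensive = df["Price"] > 240_000

# Indexing with the mask keeps only the rows where it is True.
expensive = df[is_expensive]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
large_and_affordable = df[(df["SquareFeet"] >= 1400) & (df["Price"] < 300_000)]

# The same filter written with DataFrame.query, which reads closer to SQL.
same_result = df.query("SquareFeet >= 1400 and Price < 300000")

print(expensive)
print(large_and_affordable)
```

Filtered rows keep their original index labels, which matters later when choosing between loc and iloc.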
7. Handling Missing Data
Real-world data is messy. If a CSV has a blank cell, Pandas loads it as NaN (Not a Number). If you feed NaN to scikit-learn, most estimators, including Linear Regression, will raise an error instead of training.
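Two common ways to deal with NaN are dropping the affected rows and filling the gaps with a substitute value. The sketch below shows both on a hypothetical table; filling with the column mean is one option among several, not a universal recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SquareFeet": [1400, 1900, np.nan, 1600],
    "Price": [250_000, np.nan, 199_000, 275_000],
})

# Count missing values per column.
print(df.isna().sum())

# Option 1: drop every row that contains at least one NaN.
df_dropped = df.dropna()

# Option 2: fill each column's gaps with that column's mean.
df_filled = df.fillna(df.mean())

print(df_dropped)
print(df_filled)
```

Dropping keeps only fully observed rows but shrinks the dataset; filling keeps every row but replaces unknown values with an estimate.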
8. Mini Project: Prepare Housing Dataset for ML
Let's extract exactly what a machine learning model needs: The Features (Inputs/X) and the Label (Output/y).
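The original listing for this project is not reproduced here, so the following is a sketch under stated assumptions: a hypothetical housing table with a Price column as the label and every other column as a feature.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data, including one missing price.
df = pd.DataFrame({
    "SquareFeet": [1400, 1900, 1100, 1600],
    "Bedrooms": [3, 4, 2, 3],
    "Price": [250_000, 310_000, np.nan, 275_000],
})

# Clean first, keeping the raw table untouched.
df_clean = df.dropna()

# Features (X): every column except the target.
X = df_clean.drop("Price", axis=1)

# Label (y): the single column we want to predict.
y = df_clean["Price"]

print(X.shape)  # (rows, feature columns)
print(y.shape)  # (rows,)
```

X and y must have the same number of rows, so that each feature row lines up with exactly one label.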
*Your data is now mathematically separated and ready to be fed into a Regression model in the upcoming chapters!*
9. Common Mistakes
- Confusing loc and iloc: If you want to select a row by its integer position, you must use df.iloc[5], which returns the row at position 5 (the sixth row, since positions start at 0). If you use df.loc[5], Pandas looks for a row whose index label is 5, which may be a different row if the data was shuffled or filtered.
- Forgetting axis=1 when dropping columns: When calling df.drop('Price'), Pandas will look for a *row* labeled 'Price' and raise a KeyError. You must specify axis=1, or use df.drop(columns='Price'), to tell it to drop a *column*.
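Both mistakes can be reproduced on a tiny table. In this sketch the index labels are deliberately out of order, so position and label point to different rows; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"Price": [100, 200, 300], "Rooms": [1, 2, 3]},
    index=[2, 0, 1],  # index labels deliberately out of order
)

by_position = df.iloc[0]  # first row in the table
by_label = df.loc[0]      # row whose index label is 0 (the second row here)

# Without axis=1, drop() searches the row labels and raises KeyError.
try:
    df.drop("Price")
    raised = False
except KeyError:
    raised = True

# Equivalent spelling: df.drop(columns="Price")
without_price = df.drop("Price", axis=1)

print(by_position["Price"], by_label["Price"], raised)
print(without_price)
```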
10. Best Practices
- Never overwrite your raw data: Always keep your original df intact. When cleaning or filtering, save the result to a new variable (e.g., df_clean = df.dropna()) so you can backtrack if you make a mistake.
11. Exercises
1. If you load a Pandas DataFrame and df.shape returns (1000, 15), what does that mean in terms of rows and columns?
2. Write the Pandas code to fill all NaN values in a column named Salary with the number 0.
12. MCQ Quiz with Answers
Question 1
What is the primary difference in functionality between NumPy and Pandas?
Question 2
In Machine Learning terminology, when we separate our DataFrame into X and y, what does X represent?
13. Interview Questions
- Q: What does the .shape attribute of a Pandas DataFrame tell you, and why is it critical to check this before training a model?
- Q: If your dataset contains 5% missing values in a critical column, explain the pros and cons of using .dropna() versus using .fillna(mean).
14. FAQs
Q: Do I need to convert Pandas DataFrames into NumPy arrays before giving them to scikit-learn?
A: Historically, this was common practice. However, modern versions of scikit-learn accept Pandas DataFrames directly and will seamlessly convert them into NumPy arrays under the hood during training!
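For cases where you do want the array yourself, the manual conversion is a single call. This sketch only illustrates DataFrame.to_numpy() on hypothetical data; it does not show what scikit-learn does internally.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"SquareFeet": [1400, 1900], "Bedrooms": [3, 4]})

# Column names are dropped; only the values are kept.
X_array = X.to_numpy()

print(type(X_array))  # a NumPy ndarray
print(X_array.shape)  # same (rows, columns) as the DataFrame
```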
15. Summary
Data cleaning is often said to make up around 80% of a Data Scientist's job. If you feed garbage data into a pristine machine learning algorithm, it will output garbage predictions. By mastering Pandas DataFrames, handling missing NaN values, and isolating your X features from your y target, you ensure your data is mathematically sound and algorithm-ready.