CHAPTER 04
Intermediate
NumPy and Pandas Essentials
Updated: May 16, 2026
1. Introduction
Machine learning algorithms do not work directly on Excel files or raw text; they operate on arrays of numbers. Standard Python lists are slow for this kind of work and lack the mathematical functions needed to process millions of numbers efficiently. To bridge this gap, we use two foundational libraries: NumPy (for fast numerical arrays) and Pandas (for manipulating tabular data such as spreadsheets). Scikit-learn is built on NumPy and accepts Pandas DataFrames as input. In this chapter, we will cover the essentials of both.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Create and manipulate NumPy arrays.
- Understand the concept of vectorization.
- Create Pandas DataFrames and Series.
- Load data from CSV files into Pandas.
- Explore, filter, and summarize data using Pandas.
3. NumPy Basics
NumPy (Numerical Python) provides the ndarray (N-dimensional array) object. Its core is implemented in C, which typically makes numerical operations much faster than the same work done on Python lists.
Creating Arrays:
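A minimal sketch of array creation. The values and variable names are illustrative assumptions, not data from the chapter:

```python
import numpy as np

# A 1-D array built from a Python list
vector = np.array([1, 2, 3, 4, 5])

# A 2-D array (matrix) built from a nested list: 2 rows, 3 columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Helper constructors for common patterns
zeros = np.zeros((2, 3))       # 2x3 array filled with 0.0
evens = np.arange(0, 10, 2)    # 0, 2, 4, 6, 8 (the stop value is excluded)

print(vector.shape)   # (5,)
print(matrix.shape)   # (2, 3)
print(matrix[0, 1])   # 2 (row 0, column 1)
```

The shape attribute reports the size of each dimension, which is usually the first thing to check when an array does not behave as expected.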
Array Operations (Vectorization): Unlike standard Python lists, NumPy arrays support arithmetic on the whole array at once, so you do not need to write a loop.
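As a rough illustration of vectorization, the sketch below applies arithmetic to whole arrays and compares it with the list-based equivalent. The prices and quantities are made-up values:

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])

# Arithmetic applies to every element at once; no explicit loop is needed
doubled = prices * 2             # [20., 40., 60.]
discounted = prices - 5          # [ 5., 15., 25.]

# Two arrays of the same shape combine element by element
quantities = np.array([1, 2, 3])
totals = prices * quantities     # [10., 40., 90.]

# The equivalent with plain Python lists needs a loop or comprehension
totals_list = [p * q for p, q in zip([10.0, 20.0, 30.0], [1, 2, 3])]

print(totals.sum())    # 140.0
print(prices.mean())   # 20.0
```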
4. Pandas Basics
While NumPy is great for math, its plain arrays lack labels such as column names. Pandas solves this by providing the DataFrame, which you can think of as a highly programmable Excel spreadsheet.
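One common way to build a DataFrame is from a dictionary of columns. The sketch below uses invented names and values; it also shows that a single column is a Series:

```python
import pandas as pd

# Build a DataFrame from a dictionary: keys become column names
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 32, 41],
    "City": ["London", "Paris", "London"],
})

# A single column is a Series (a labeled 1-D array)
ages = df["Age"]

print(df)
print(type(ages).__name__)   # Series
print(df.columns.tolist())   # ['Name', 'Age', 'City']
```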
5. Reading Data (CSV Files)
In the real world, you rarely type data manually. You load it from a CSV (Comma Separated Values) file.
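A minimal sketch of loading CSV data with pd.read_csv. To keep the example self-contained, an in-memory string stands in for a file on disk; the column names and rows are assumptions made for illustration:

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file you would pass its path,
# for example pd.read_csv("customers.csv")
csv_text = io.StringIO(
    "Name,City,Salary\n"
    "Alice,London,52000\n"
    "Bob,Paris,48000\n"
    "Charlie,London,61000\n"
)

df = pd.read_csv(csv_text)

print(df.shape)    # (3, 3)
print(df.dtypes)   # inferred type of each column
```

By default, read_csv treats the first line as the header row and infers a type for each column.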
6. Data Exploration
Before feeding data to Scikit-learn, you must understand it. Pandas provides several methods for this, such as head(), info(), and describe().
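The sketch below runs a typical first pass over a small, invented dataset with one missing value. The specific columns are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 32, None, 41],            # one missing value on purpose
    "City": ["London", "Paris", "London", "Berlin"],
})

print(df.head(2))        # first 2 rows (head() defaults to 5)
print(df.shape)          # (4, 3): rows, columns
df.info()                # column types and non-null counts
print(df.describe())     # summary statistics for numeric columns

# Count missing values per column and occurrences of each category
missing_per_column = df.isnull().sum()
city_counts = df["City"].value_counts()

print(missing_per_column)
print(city_counts)
```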
7. Data Filtering and Selection
Selecting specific columns (Features) or rows is a daily task in ML data preparation.
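A short sketch of column selection, boolean filtering, and loc/iloc access on an invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 32, 41, 29],
    "City": ["London", "Paris", "London", "Berlin"],
})

# Select one column (returns a Series) or several (returns a DataFrame)
names = df["Name"]
features = df[["Age", "City"]]

# Boolean filtering: keep rows where the condition is True
over_30 = df[df["Age"] > 30]

# Combine conditions with & (and) or | (or); wrap each one in parentheses
london_over_30 = df[(df["City"] == "London") & (df["Age"] > 30)]

# Label-based vs. position-based access to a single value
first_name_by_label = df.loc[0, "Name"]     # 'Alice'
first_name_by_position = df.iloc[0, 0]      # 'Alice'

print(over_30)
print(london_over_30)
```

The parentheses around each condition matter because & and | bind more tightly than comparison operators in Python.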
8. Mini Project: Analyze CSV Dataset
Task: Imagine you loaded a customers.csv file. You need to find the average salary of all customers living in "London".
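One possible solution, sketched under the assumption that the file has "City" and "Salary" columns. The rows below are invented, and an in-memory string stands in for customers.csv so the example runs on its own:

```python
import io
import pandas as pd

# Hypothetical contents of customers.csv; with the real file you would
# call pd.read_csv("customers.csv") instead
csv_text = io.StringIO(
    "Name,City,Salary\n"
    "Alice,London,50000\n"
    "Bob,Paris,48000\n"
    "Charlie,London,60000\n"
    "Dana,Berlin,55000\n"
)
customers = pd.read_csv(csv_text)

# Step 1: keep only the rows where City is "London"
london = customers[customers["City"] == "London"]

# Step 2: average the Salary column of the filtered rows
average_salary = london["Salary"].mean()

print(average_salary)   # 55000.0
```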
9. Common Mistakes
- Confusing NumPy indexing with Pandas indexing: NumPy uses purely positional integer indices, as in matrix[0, 1]. Pandas uses df.loc[] for label-based selection and df.iloc[] for integer-position selection. Mixing them up causes errors or silently selects the wrong rows.
- Forgetting to keep the result of a modification: When modifying a DataFrame (like dropping a column), Pandas usually returns a *new* copy and leaves the original unchanged. To keep the change, reassign the result with df = df.drop("City", axis=1), or pass inplace=True as in df.drop("City", axis=1, inplace=True).
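Both mistakes can be reproduced in a few lines. The sketch below uses an invented DataFrame with non-default row labels so that the difference between loc and iloc is visible:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "City": ["London", "Paris"]},
    index=[10, 20],   # non-default labels make loc vs. iloc visible
)

# loc looks up by label, iloc by integer position
by_label = df.loc[10, "Name"]      # 'Alice' (the row labeled 10)
by_position = df.iloc[1, 0]        # 'Bob' (second row, first column)

# drop returns a new DataFrame; the original keeps its columns
df.drop("City", axis=1)
columns_before = df.columns.tolist()
print(columns_before)              # ['Name', 'City']

# Reassigning keeps the result
df = df.drop("City", axis=1)
print(df.columns.tolist())         # ['Name']
```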
10. Best Practices
- Always check .shape and .info() immediately after loading a dataset. If you expect 10,000 rows but .shape reports 5,000, you have a data loading problem.
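This check can be made explicit right after loading. In the sketch below, EXPECTED_ROWS and the CSV contents are hypothetical placeholders for whatever your data source promises:

```python
import io
import pandas as pd

EXPECTED_ROWS = 3   # hypothetical: the row count you expect from the source

# In-memory stand-in for a real CSV file
csv_text = io.StringIO("Name,Salary\nAlice,50000\nBob,48000\nCharlie,60000\n")
df = pd.read_csv(csv_text)

print(df.shape)   # (rows, columns)
df.info()         # column types and non-null counts

# Flag a mismatch early instead of discovering it during training
rows_match = df.shape[0] == EXPECTED_ROWS
if not rows_match:
    print(f"Expected {EXPECTED_ROWS} rows, got {df.shape[0]}")
```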
11. Exercises
- 1. Create a NumPy array with numbers from 1 to 10. Multiply the entire array by 10 and print the result.
- 2. Create a Pandas DataFrame with columns "Product", "Price", and "In_Stock" (boolean). Filter the DataFrame to show only products that are in stock and cost more than $20.
12. MCQ Quiz with Answers
Question 1
What is the fundamental data structure used in Pandas to represent a 2-dimensional table of data?
Question 2
Why do Data Scientists use NumPy arrays instead of standard Python lists for mathematical operations?
13. Interview Questions
- Q: Explain the difference between a Pandas Series and a Pandas DataFrame.
- Q: How do you read a CSV file into a Pandas DataFrame and view the first 10 rows?