Scikit-learn Basics
CHAPTER 04 Intermediate

NumPy and Pandas Essentials

Updated: May 16, 2026

1. Introduction

Machine learning algorithms cannot work directly with Excel files or raw text; they operate on arrays of numbers. Plain Python lists are slow for large-scale numerical work and lack the built-in mathematical operations needed to process millions of values efficiently. To bridge this gap, we use two foundational libraries: NumPy (for fast numerical arrays) and Pandas (for manipulating tabular data such as spreadsheets). Scikit-learn is built on NumPy and accepts Pandas DataFrames as input. In this chapter, we will cover the essentials of both.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Create and manipulate NumPy arrays.
  • Understand the concept of vectorization.
  • Create Pandas DataFrames and Series.
  • Load data from CSV files into Pandas.
  • Explore, filter, and summarize data using Pandas.

3. NumPy Basics

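NumPy (Numerical Python) provides the ndarray (N-dimensional array) object. Its core is implemented in C and stores same-typed elements in a compact block of memory, which typically makes numerical operations much faster than the equivalent loops over Python lists.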

Creating Arrays:

python
import numpy as np

# Create a 1D array (a vector)
vector = np.array([1, 2, 3, 4, 5])

# Create a 2D array (a matrix) - This is how ML datasets are structured!
matrix = np.array([
    [10, 20],
    [30, 40],
    [50, 60]
])

print("Shape of matrix:", matrix.shape) 
# Output: Shape of matrix: (3, 2)  <- 3 rows, 2 columns
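
Beyond .shape, a few other attributes and indexing forms are useful right after creating an array. The following is a minimal sketch that rebuilds the same matrix; the variable name first_column is only illustrative.

```python
import numpy as np

matrix = np.array([
    [10, 20],
    [30, 40],
    [50, 60],
])

print(matrix.ndim)    # 2 -> number of dimensions
print(matrix.size)    # 6 -> total number of elements
print(matrix[0, 1])   # 20 -> row 0, column 1
print(matrix[1])      # [30 40] -> the whole second row

# A colon means "every index along this axis"
first_column = matrix[:, 0]
print(first_column)   # [10 30 50]
```

In ML terms, each row is usually one sample and each column one feature, so matrix[:, 0] pulls out a single feature for all samples.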

Array Operations (Vectorization): Unlike Python lists, NumPy arrays let you apply math to the entire array at once without writing a loop.

python
prices = np.array([10, 20, 30])
# Add $5 to all prices instantly
new_prices = prices + 5
print(new_prices) # Output: [15 25 35]

4. Pandas Basics

While NumPy is great for math, its arrays lack labels such as column names. Pandas solves this by providing the DataFrame, which you can think of as a highly programmable Excel spreadsheet.

python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    "Age": [25, 30, 35],
    "Salary": [50000, 65000, 80000],
    "City": ["New York", "London", "Paris"]
}

df = pd.DataFrame(data)
print(df)
# Output:
#    Age  Salary      City
# 0   25   50000  New York
# 1   30   65000    London
# 2   35   80000     Paris
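
The learning objectives also mention the Series, which is a single labeled column. The sketch below shows how a Series relates to a DataFrame, using the same data as above; the "Bonus" column and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 65000, 80000],
    "City": ["New York", "London", "Paris"],
})

# Selecting one column of a DataFrame gives a Series
ages = df["Age"]
print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series

# A Series can also be created directly (hypothetical bonus values)
bonus = pd.Series([1000, 1500, 2000], name="Bonus")

# Assigning a Series adds a column, matched up by row index
df["Bonus"] = bonus
print(df.shape)  # (3, 4)

# Numeric columns convert to a plain NumPy array when needed
numeric = df[["Age", "Salary"]].to_numpy()
print(numeric.shape)  # (3, 2)
```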

5. Reading Data (CSV Files)

In the real world, you rarely type data in manually. You usually load it from a file, most often a CSV (Comma-Separated Values) file.

python
# Load a CSV file into a DataFrame
df = pd.read_csv("housing_data.csv")

# View the first 5 rows
print(df.head())
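
Since housing_data.csv is not bundled with this chapter, the following sketch simulates a CSV file in memory with io.StringIO so it can run anywhere. The CSV content is hypothetical; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Age,Salary,City
25,50000,New York
30,65000,London
35,80000,Paris
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)     # (3, 3)
print(df.head(2))   # only the first 2 rows

# head(n) returns at most n rows; here the data has fewer than 10
print(len(df.head(10)))  # 3

# usecols loads only the columns you name
salaries = pd.read_csv(io.StringIO(csv_text), usecols=["City", "Salary"])
print(salaries.shape)  # (3, 2)
```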

6. Data Exploration

Before feeding data to Scikit-learn, you must understand it. Pandas provides several quick inspection tools for this:

python
# Print a summary of the dataset (row count, column names, dtypes, non-null counts)
df.info()

# Print statistical summaries of the numeric columns
# (count, mean, standard deviation, min, quartiles, max)
print(df.describe())

# Print the names of the columns
print(df.columns)
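
A few more one-liners often help during exploration. This sketch uses a small made-up DataFrame with one deliberately missing Age value; handling missing values properly is left to the later chapter, so here we only count them.

```python
import numpy as np
import pandas as pd

# Illustrative data; np.nan marks a missing value
df = pd.DataFrame({
    "Age": [25, 30, np.nan, 40],
    "Salary": [50000, 65000, 80000, 90000],
    "City": ["New York", "London", "Paris", "London"],
})

print(df.shape)  # (4, 3)

# Count missing values per column
missing = df.isna().sum()
print(missing["Age"])  # 1

# Count how often each category appears
city_counts = df["City"].value_counts()
print(city_counts["London"])  # 2

# Summarize a single column
print(df["Salary"].mean())  # 71250.0
```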

7. Data Filtering and Selection

Selecting specific columns (features) or rows is a daily task in ML data preparation. The examples below use the Age/Salary/City DataFrame from Section 4, not the housing data.

python
# Select a single column (returns a Pandas Series)
ages = df["Age"]

# Select multiple columns (returns a DataFrame)
subset = df[["Age", "Salary"]]

# Filter rows based on a condition (People older than 30)
older_people = df[df["Age"] > 30]

print(older_people)
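
Real filters usually combine several conditions. The sketch below rebuilds the small Age/Salary/City DataFrame and shows three common patterns; the thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35],
    "Salary": [50000, 65000, 80000],
    "City": ["New York", "London", "Paris"],
})

# Combine conditions with & (and) or | (or); wrap each one in parentheses
older_high_earners = df[(df["Age"] > 25) & (df["Salary"] > 70000)]
print(older_high_earners["City"].tolist())  # ['Paris']

# .loc picks rows by condition and a column by label in one step
cities = df.loc[df["Age"] >= 30, "City"]
print(cities.tolist())  # ['London', 'Paris']

# .isin keeps rows whose value appears in a list
european = df[df["City"].isin(["London", "Paris"])]
print(len(european))  # 2
```

The parentheses matter because & binds more tightly than comparison operators such as > in Python.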

8. Mini Project: Analyze CSV Dataset

Task: Imagine you loaded a customers.csv file. You need to find the average salary of all customers living in "London".

python
import pandas as pd

# 1. Load data
# df = pd.read_csv('customers.csv')

# Let's use a mock DataFrame like the earlier one, with an extra London row
data = {
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 65000, 80000, 90000],
    "City": ["New York", "London", "Paris", "London"],
}
df = pd.DataFrame(data)

# 2. Filter for London
london_customers = df[df["City"] == "London"]

# 3. Calculate the mean of the Salary column
average_salary = london_customers["Salary"].mean()

print(f"Average salary in London: ${average_salary}")
# Output: Average salary in London: $77500.0
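
If you need the average for every city rather than just one, groupby is a common alternative to filtering. This is a sketch of that variation on the same mock data, not a replacement for the approach above; the currency formatting is an optional extra.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 65000, 80000, 90000],
    "City": ["New York", "London", "Paris", "London"],
})

# Group rows by city, then average the Salary column within each group
avg_by_city = df.groupby("City")["Salary"].mean()
print(avg_by_city["London"])  # 77500.0

# Format with a thousands separator and two decimals
print(f"Average salary in London: ${avg_by_city['London']:,.2f}")
# Average salary in London: $77,500.00
```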

9. Common Mistakes

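  • Confusing NumPy indexing with Pandas indexing: NumPy uses integer positions, as in matrix[0, 1]. Pandas uses df.loc[] for label-based selection and df.iloc[] for integer-position selection. Mixing them up raises errors (for example, a KeyError) or silently selects the wrong rows.
  • Forgetting to keep the result of a modification: When you modify a DataFrame (for example, by dropping a column), Pandas by default returns a *new* DataFrame and leaves the original unchanged. To keep the change, reassign the result, as in df = df.drop("City", axis=1), or pass inplace=True, as in df.drop("City", axis=1, inplace=True). Reassignment is generally the clearer choice.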
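
Both pitfalls are easy to see in a few lines. In this sketch the string row labels "a", "b", "c" are an assumption added so that label-based and position-based selection visibly differ.

```python
import numpy as np
import pandas as pd

# NumPy: integer positions only
matrix = np.array([[10, 20], [30, 40], [50, 60]])
print(matrix[0, 1])  # 20

# Pandas: illustrative string labels for the rows
df = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "London", "Paris"]},
    index=["a", "b", "c"],
)
print(df.loc["b", "Age"])  # 30 -> row label "b", column label "Age"
print(df.iloc[1, 0])       # 30 -> second row, first column by position

# drop returns a new DataFrame; the original still has the column
without_city = df.drop(columns="City")
print(list(df.columns))            # ['Age', 'City']
print(list(without_city.columns))  # ['Age']

# Reassign to keep the change
df = df.drop(columns="City")
print(list(df.columns))            # ['Age']
```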

10. Best Practices

  • Always check .shape and .info() immediately after loading a dataset. If you expect 10,000 rows but .shape says 5,000, you have a data loading problem.
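
One way to make this check a habit is to compare the loaded row count against the number you expect and fail loudly on a mismatch. The sketch below assumes you know the expected count in advance; the in-memory CSV and the name expected_rows are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk
csv_text = "Age,Salary\n25,50000\n30,65000\n35,80000\n"
expected_rows = 3  # assumed to be known from the data source

df = pd.read_csv(io.StringIO(csv_text))

rows, cols = df.shape
print(f"Loaded {rows} rows and {cols} columns")  # Loaded 3 rows and 2 columns

# Stop early if the load looks incomplete
if rows != expected_rows:
    raise ValueError(f"Expected {expected_rows} rows, got {rows}")

# Column types and non-null counts
df.info()
```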

11. Exercises

  1. Create a NumPy array with numbers from 1 to 10. Multiply the entire array by 10 and print the result.
  2. Create a Pandas DataFrame with columns "Product", "Price", and "In_Stock" (boolean). Filter the DataFrame to show only products that are in stock and cost more than $20.
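
Try both exercises on your own first. For comparison, here is one possible solution sketch; the product names and prices are invented, and other approaches are equally valid.

```python
import numpy as np
import pandas as pd

# Exercise 1: np.arange excludes the stop value, so use 11 to reach 10
numbers = np.arange(1, 11)
scaled = numbers * 10
print(scaled)  # [ 10  20  30  40  50  60  70  80  90 100]

# Exercise 2: made-up product data
products = pd.DataFrame({
    "Product": ["Mouse", "Keyboard", "Monitor", "Cable"],
    "Price": [15, 45, 180, 25],
    "In_Stock": [True, True, False, True],
})

# A boolean column can be used directly as a condition
available = products[products["In_Stock"] & (products["Price"] > 20)]
print(available["Product"].tolist())  # ['Keyboard', 'Cable']
```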

12. MCQ Quiz with Answers

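Question 1

What is the fundamental data structure used in Pandas to represent a 2-dimensional table of data?

Answer: The DataFrame.

Question 2

Why do Data Scientists use NumPy arrays instead of standard Python lists for mathematical operations?

Answer: NumPy arrays support vectorized operations on the whole array at once, which is much faster than looping over a Python list.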

13. Interview Questions

  • Q: Explain the difference between a Pandas Series and a Pandas DataFrame.
  • Q: How do you read a CSV file into a Pandas DataFrame and view the first 10 rows?

14. FAQs

Q: Do I need to master every single Pandas function before learning Scikit-learn?
A: No! Pandas is massive. For Scikit-learn, you only need to know how to load data, select specific columns (Features and Targets), filter out bad rows, and handle missing values (which we cover in Chapter 6).
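
As an illustration of the "Features and Targets" step, the sketch below splits a small DataFrame into a feature table and a target column. Treating Salary as the target is an assumption made for this example only; X and y are the conventional variable names.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "Salary": [50000, 65000, 80000, 90000],
    "City": ["New York", "London", "Paris", "London"],
})

# Features: double brackets keep the result 2D (a DataFrame)
X = df[["Age"]]

# Target: single brackets give a 1D Series
y = df["Salary"]

print(X.shape)  # (4, 1)
print(y.shape)  # (4,)

# Both convert to NumPy arrays if plain arrays are needed
X_array = X.to_numpy()
y_array = y.to_numpy()
print(X_array.shape, y_array.shape)  # (4, 1) (4,)
```

The 2D features and 1D target shapes shown here are the layout Scikit-learn estimators generally expect.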

15. Summary

NumPy and Pandas are the engines that power data science in Python. NumPy provides the highly optimized, multidimensional arrays needed for complex math, while Pandas provides a user-friendly interface to load, explore, and filter real-world tabular data. Scikit-learn algorithms expect their inputs to be presented in these formats.

16. Next Chapter Recommendation

With our data manipulation tools ready, it's time to look at the big picture. In Chapter 5: Understanding Machine Learning Workflow, we will outline the exact step-by-step lifecycle of building an ML project from raw data to a deployed model.
