Skip to main content
Python for Data Science
CHAPTER 15 Beginner

Data Cleaning with Pandas

Updated: May 18, 2026
5 min read

# CHAPTER 15

Data Cleaning with Pandas

1. Chapter Introduction

In the real world, data is never clean. Users forget to fill out forms, systems glitch, and text is entered inconsistently. If you feed dirty data into a machine learning model, the model will output garbage. It is said that Data Scientists spend 80% of their time cleaning data. This chapter covers the essential Pandas functions for handling missing values (NaN), duplicates, and bad formatting.

2. Identifying Missing Data

In Pandas, missing data is represented as NaN (Not a Number). You must find where these NaNs are before you can fix them.

python
12345678910111213141516
import pandas as pd
import numpy as np

# A messy dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", "Dave", "Emma"],
    "Age": [25, np.nan, 35, 40, 25], # Bob is missing an age
    "Salary": [50000, 60000, np.nan, 80000, 50000]
}
df = pd.DataFrame(data)

# 1. Returns a True/False table showing where NaNs are
print(df.isna())

# 2. Returns a count of NaNs per column (CRITICAL FUNCTION)
print(df.isna().sum())

3. Handling Missing Data (Dropping vs. Filling)

You have two choices when dealing with missing data: delete the row entirely, or guess/fill in the missing value.

Option 1: Dropping Rows (dropna) Use this if the dataset is massive and the missing row is useless.

python
12
# Drops ANY row that contains at least one NaN
clean_df = df.dropna()

Option 2: Filling Values (fillna) Use this if you don't want to lose data. A common technique is to fill missing numbers with the column's average (mean).

python
12345678
# Calculate the average age
avg_age = df['Age'].mean()

# Fill missing ages with the average
df['Age'] = df['Age'].fillna(avg_age)

# Fill missing salaries with a static number (0)
df['Salary'] = df['Salary'].fillna(0)

4. Handling Duplicates

Sometimes databases record the exact same transaction twice.

python
12345
# Check for duplicate rows
print(df.duplicated().sum())

# Drop the duplicates (keeps the first occurrence, deletes the rest)
df = df.drop_duplicates()

5. String Manipulation for Cleaning

If a "Price" column contains $1,000, Pandas will treat it as a String (Text). You cannot calculate the average of a string. We must clean the text and convert the column type using .str accessor.

python
123456789101112
messy_prices = pd.Series(["$1,200.50", "$400.00", "Free"])

# 1. Use .str.replace to remove $ and commas
clean_prices = messy_prices.str.replace('$', '').str.replace(',', '')

# 2. Handle the edge case ("Free" -> "0")
clean_prices = clean_prices.replace("Free", "0")

# 3. Cast the column from String to Float using .astype()
final_prices = clean_prices.astype(float)

print(final_prices.mean()) # Now we can do math!

6. The inplace=True Parameter

By default, functions like dropna() return a *new* copy of the DataFrame and leave the original alone. If you want to modify the original DataFrame directly to save memory, use inplace=True.

python
12
# Modifies 'df' directly. No need to write df = df.dropna()
df.dropna(inplace=True) 

7. Mini Project: Customer Data Cleaner

python
1234567891011121314151617
# Raw messy data
raw = pd.DataFrame({
    "CustomerID": [1, 2, 2, 4], # Duplicate ID 2
    "Email": ["a@test.com", "b@test.com", "b@test.com", None], # Missing email
    "SignUp": ["2023-01-01", "2023/02/01", "2023/02/01", "Jan 5th 2023"] # Messy dates
})

# 1. Drop duplicates
clean = raw.drop_duplicates()

# 2. Fill missing emails with a placeholder
clean['Email'] = clean['Email'].fillna("unknown@email.com")

# 3. Standardize dates using pd.to_datetime (Pandas magic!)
clean['SignUp'] = pd.to_datetime(clean['SignUp'])

print(clean)

8. Common Mistakes

  • Forgetting to reassign or use inplace=True: If you run df.dropna() but then print df, the NaNs will still be there. You must do df = df.dropna() or df.dropna(inplace=True).
  • Filling with the wrong data type: If an Age column is numeric, and you do df['Age'].fillna("Unknown"), you have just mixed numbers and strings in the same column. Any future math operations on that column will crash.

9. MCQs

Question 1

What does NaN stand for in Pandas?

Question 2

Which command is best for checking exactly how many missing values exist in each column?

Question 3

Which method removes all rows that contain at least one NaN?

Question 4

If you want to replace missing values with the number 0, which method do you use?

Question 5

How do you remove identical duplicate rows from a DataFrame?

Question 6

What happens if you run df.dropna() without reassigning it?

Question 7

What parameter forces Pandas to modify the original DataFrame directly without returning a copy?

Question 8

To use string methods like .replace() on an entire Pandas column of text, what accessor must you prefix it with?

Question 9

Which method converts a column's data type (e.g., from String to Float)?

Question 10

What is a common best practice for filling missing numerical data without deleting the row?

10. Interview Questions

  • Q: Explain the pros and cons of using .dropna() vs .fillna() when dealing with missing values.
  • Q: You import a CSV where the "Revenue" column looks like "$1,500". Write the workflow to convert this to a usable float for analysis.

11. Summary

Data cleaning is mandatory. Identify missing data using df.isna().sum(). Decide whether to drop the rows (.dropna()) or impute the missing values with an average (.fillna()). Use .drop_duplicates() to clean redundant records. Finally, clean messy text strings using the .str accessor and convert them to proper numerical types using .astype() so you can perform mathematical analysis.

12. Next Chapter Recommendation

In Chapter 16: Data Analysis and Aggregation, we will take our clean data and ask business questions using grouping (groupby()) and Pivot Tables to uncover trends.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·