CHAPTER 15 Beginner

Data Cleaning with Pandas

Updated: May 18, 2026

5 min read

# CHAPTER 15

Data Cleaning with Pandas

1. Chapter Introduction

In the real world, data is never clean. Users forget to fill out forms, systems glitch, and text is entered inconsistently. If you feed dirty data into a machine learning model, the model will output garbage. It is said that Data Scientists spend 80% of their time cleaning data. This chapter covers the essential Pandas functions for handling missing values (NaN), duplicates, and bad formatting.

2. Identifying Missing Data

In Pandas, missing data is represented as NaN (Not a Number). You must find where these NaNs are before you can fix them.

python

12345678910111213141516

import pandas as pd
import numpy as np

# A messy dataset
data = {
    "Name": ["Alice", "Bob", "Charlie", "Dave", "Emma"],
    "Age": [25, np.nan, 35, 40, 25], # Bob is missing an age
    "Salary": [50000, 60000, np.nan, 80000, 50000]
}
df = pd.DataFrame(data)

# 1. Returns a True/False table showing where NaNs are
print(df.isna())

# 2. Returns a count of NaNs per column (CRITICAL FUNCTION)
print(df.isna().sum())

3. Handling Missing Data (Dropping vs. Filling)

You have two choices when dealing with missing data: delete the row entirely, or guess/fill in the missing value.

Option 1: Dropping Rows (dropna) Use this if the dataset is massive and the missing row is useless.

python

# Drops ANY row that contains at least one NaN
clean_df = df.dropna()

Option 2: Filling Values (fillna) Use this if you don't want to lose data. A common technique is to fill missing numbers with the column's average (mean).

python

12345678

# Calculate the average age
avg_age = df[&#039;Age'].mean()

# Fill missing ages with the average
df[&#039;Age'] = df['Age'].fillna(avg_age)

# Fill missing salaries with a static number (0)
df[&#039;Salary'] = df['Salary'].fillna(0)

4. Handling Duplicates

Sometimes databases record the exact same transaction twice.

python

12345

# Check for duplicate rows
print(df.duplicated().sum())

# Drop the duplicates (keeps the first occurrence, deletes the rest)
df = df.drop_duplicates()

5. String Manipulation for Cleaning

If a "Price" column contains $1,000, Pandas will treat it as a String (Text). You cannot calculate the average of a string. We must clean the text and convert the column type using .str accessor.

python

123456789101112

messy_prices = pd.Series(["$1,200.50", "$400.00", "Free"])

# 1. Use .str.replace to remove $ and commas
clean_prices = messy_prices.str.replace(&#039;$', '').str.replace(',', '')

# 2. Handle the edge case ("Free" -> "0")
clean_prices = clean_prices.replace("Free", "0")

# 3. Cast the column from String to Float using .astype()
final_prices = clean_prices.astype(float)

print(final_prices.mean()) # Now we can do math!

6. The `inplace=True` Parameter

By default, functions like dropna() return a *new* copy of the DataFrame and leave the original alone. If you want to modify the original DataFrame directly to save memory, use inplace=True.

python

# Modifies 'df' directly. No need to write df = df.dropna()
df.dropna(inplace=True)

7. Mini Project: Customer Data Cleaner

python

1234567891011121314151617

# Raw messy data
raw = pd.DataFrame({
    "CustomerID": [1, 2, 2, 4], # Duplicate ID 2
    "Email": ["a@test.com", "b@test.com", "b@test.com", None], # Missing email
    "SignUp": ["2023-01-01", "2023/02/01", "2023/02/01", "Jan 5th 2023"] # Messy dates
})

# 1. Drop duplicates
clean = raw.drop_duplicates()

# 2. Fill missing emails with a placeholder
clean[&#039;Email'] = clean['Email'].fillna("unknown@email.com")

# 3. Standardize dates using pd.to_datetime (Pandas magic!)
clean[&#039;SignUp'] = pd.to_datetime(clean['SignUp'])

print(clean)

8. Common Mistakes

Forgetting to reassign or use inplace=True: If you run df.dropna() but then print df, the NaNs will still be there. You must do df = df.dropna() or df.dropna(inplace=True).

Filling with the wrong data type: If an Age column is numeric, and you do df['Age'].fillna("Unknown"), you have just mixed numbers and strings in the same column. Any future math operations on that column will crash.

9. MCQs

Question 1

What does `NaN` stand for in Pandas?

Question 2

Which command is best for checking exactly how many missing values exist in each column?

Question 3

Which method removes all rows that contain at least one NaN?

Question 4

If you want to replace missing values with the number 0, which method do you use?

Question 5

How do you remove identical duplicate rows from a DataFrame?

Question 6

What happens if you run `df.dropna()` without reassigning it?

Question 7

What parameter forces Pandas to modify the original DataFrame directly without returning a copy?

Question 8

To use string methods like `.replace()` on an entire Pandas column of text, what accessor must you prefix it with?

Question 9

Which method converts a column's data type (e.g., from String to Float)?

Question 10

What is a common best practice for filling missing numerical data without deleting the row?

10. Interview Questions

Q: Explain the pros and cons of using .dropna() vs .fillna() when dealing with missing values.

Q: You import a CSV where the "Revenue" column looks like "$1,500". Write the workflow to convert this to a usable float for analysis.

11. Summary

Data cleaning is mandatory. Identify missing data using df.isna().sum(). Decide whether to drop the rows (.dropna()) or impute the missing values with an average (.fillna()). Use .drop_duplicates() to clean redundant records. Finally, clean messy text strings using the .str accessor and convert them to proper numerical types using .astype() so you can perform mathematical analysis.

12. Next Chapter Recommendation

In Chapter 16: Data Analysis and Aggregation, we will take our clean data and ask business questions using grouping (groupby()) and Pivot Tables to uncover trends.

Explore More

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

Data Cleaning with Pandas #

1. Chapter Introduction #

2. Identifying Missing Data #

3. Handling Missing Data (Dropping vs. Filling) #

4. Handling Duplicates #

5. String Manipulation for Cleaning #

6. The inplace=True Parameter #

7. Mini Project: Customer Data Cleaner #

8. Common Mistakes #

9. MCQs #

What does NaN stand for in Pandas?

Which command is best for checking exactly how many missing values exist in each column?

Which method removes all rows that contain at least one NaN?

If you want to replace missing values with the number 0, which method do you use?

How do you remove identical duplicate rows from a DataFrame?

What happens if you run df.dropna() without reassigning it?

What parameter forces Pandas to modify the original DataFrame directly without returning a copy?

To use string methods like .replace() on an entire Pandas column of text, what accessor must you prefix it with?

Which method converts a column's data type (e.g., from String to Float)?

What is a common best practice for filling missing numerical data without deleting the row?

10. Interview Questions #

11. Summary #

12. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

🧪 Related Labs 1

🎥 Related Videos 1

🗺️ Related Roadmaps 1

Send Feedback / Bug

Feedback Submitted!

Data Cleaning with Pandas

1. Chapter Introduction

2. Identifying Missing Data

3. Handling Missing Data (Dropping vs. Filling)

4. Handling Duplicates

5. String Manipulation for Cleaning

6. The `inplace=True` Parameter

7. Mini Project: Customer Data Cleaner

8. Common Mistakes

9. MCQs

What does `NaN` stand for in Pandas?

Which method removes all rows that contain at least one `NaN`?

What happens if you run `df.dropna()` without reassigning it?

To use string methods like `.replace()` on an entire Pandas column of text, what accessor must you prefix it with?

10. Interview Questions

11. Summary

12. Next Chapter Recommendation