Data Cleaning with Pandas
# CHAPTER 15
Data Cleaning with Pandas
1. Chapter Introduction
In the real world, data is never clean. Users forget to fill out forms, systems glitch, and text is entered inconsistently. If you feed dirty data into a machine learning model, the model will output garbage. It is said that Data Scientists spend 80% of their time cleaning data. This chapter covers the essential Pandas functions for handling missing values (NaN), duplicates, and bad formatting.
2. Identifying Missing Data
In Pandas, missing data is represented as NaN (Not a Number). You must find where these NaNs are before you can fix them.
3. Handling Missing Data (Dropping vs. Filling)
You have two choices when dealing with missing data: delete the row entirely, or guess/fill in the missing value.
Option 1: Dropping Rows (dropna)
Use this if the dataset is massive and the missing row is useless.
Option 2: Filling Values (fillna)
Use this if you don't want to lose data. A common technique is to fill missing numbers with the column's average (mean).
4. Handling Duplicates
Sometimes databases record the exact same transaction twice.
5. String Manipulation for Cleaning
If a "Price" column contains $1,000, Pandas will treat it as a String (Text). You cannot calculate the average of a string. We must clean the text and convert the column type using .str accessor.
6. The inplace=True Parameter
By default, functions like dropna() return a *new* copy of the DataFrame and leave the original alone. If you want to modify the original DataFrame directly to save memory, use inplace=True.
7. Mini Project: Customer Data Cleaner
8. Common Mistakes
-
Forgetting to reassign or use
inplace=True: If you rundf.dropna()but then printdf, the NaNs will still be there. You must dodf = df.dropna()ordf.dropna(inplace=True).
-
Filling with the wrong data type: If an
Agecolumn is numeric, and you dodf['Age'].fillna("Unknown"), you have just mixed numbers and strings in the same column. Any future math operations on that column will crash.
9. MCQs
What does NaN stand for in Pandas?
Which command is best for checking exactly how many missing values exist in each column?
Which method removes all rows that contain at least one NaN?
If you want to replace missing values with the number 0, which method do you use?
How do you remove identical duplicate rows from a DataFrame?
What happens if you run df.dropna() without reassigning it?
What parameter forces Pandas to modify the original DataFrame directly without returning a copy?
To use string methods like .replace() on an entire Pandas column of text, what accessor must you prefix it with?
Which method converts a column's data type (e.g., from String to Float)?
What is a common best practice for filling missing numerical data without deleting the row?
10. Interview Questions
-
Q: Explain the pros and cons of using
.dropna()vs.fillna()when dealing with missing values.
-
Q: You import a CSV where the "Revenue" column looks like
"$1,500". Write the workflow to convert this to a usable float for analysis.
11. Summary
Data cleaning is mandatory. Identify missing data usingdf.isna().sum(). Decide whether to drop the rows (.dropna()) or impute the missing values with an average (.fillna()). Use .drop_duplicates() to clean redundant records. Finally, clean messy text strings using the .str accessor and convert them to proper numerical types using .astype() so you can perform mathematical analysis.
12. Next Chapter Recommendation
In Chapter 16: Data Analysis and Aggregation, we will take our clean data and ask business questions using grouping (groupby()) and Pivot Tables to uncover trends.