CHAPTER 13
Beginner
Data Cleaning in Pandas
Updated: May 18, 2026
5 min read
1. Chapter Introduction
Real-world data is messy: inconsistent casing, extra spaces, wrong data types, duplicates, and formatting issues. Data cleaning (wrangling) is estimated to consume 60-80% of a data scientist's time. This chapter covers the key cleaning techniques.
2. Removing Duplicates
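As a minimal sketch of duplicate handling, the example below counts duplicates with duplicated() and removes them with drop_duplicates(). The DataFrame and its column names (Name, Email, Plan) are made up for illustration.

```python
import pandas as pd

# Hypothetical customer records; column names are assumptions for illustration.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ana", "Cara"],
    "Email": ["ana@x.com", "ben@x.com", "ana@x.com", "cara@x.com"],
    "Plan": ["free", "pro", "pro", "free"],
})

# duplicated() returns a boolean Series: True for rows that repeat an earlier row.
full_dupes = df.duplicated().sum()                    # compares every column
email_dupes = df.duplicated(subset=["Email"]).sum()   # compares Email only

# drop_duplicates() returns a new DataFrame; keep="first" is the default.
first_kept = df.drop_duplicates(subset=["Email"])
last_kept = df.drop_duplicates(subset=["Email"], keep="last")

print(full_dupes, email_dupes)
print(last_kept)
```

Here no row is a full duplicate because the two "Ana" rows differ in Plan, but they do share an Email, so subset=['Email'] treats them as duplicates. Which of the two survives depends on keep.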
3. Renaming and Standardizing Columns
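One common way to standardize column names is to chain .str methods on df.columns and then use rename() for individual fixes. The sketch below assumes a small, hypothetical DataFrame with inconsistent headers.

```python
import pandas as pd

# Hypothetical headers with stray spaces, mixed case, and punctuation.
df = pd.DataFrame({
    " Customer Name ": ["Ana"],
    "EMAIL": ["ana@x.com"],
    "Sign-Up Date": ["2024-01-05"],
})

# df.columns is an Index, so the .str accessor applies to every column label.
df.columns = (
    df.columns
    .str.strip()                                   # drop leading/trailing spaces
    .str.lower()                                   # consistent casing
    .str.replace(r"[^0-9a-z]+", "_", regex=True)   # spaces/punctuation -> underscore
)

# rename() returns a new DataFrame and only touches the labels listed in the mapping.
df = df.rename(columns={"sign_up_date": "signup_date"})

print(list(df.columns))
```

Because rename() returns a new object by default, the result has to be assigned back (or used directly) for the change to take effect.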
4. Fixing Data Types
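The following sketch shows three typical dtype fixes: pd.to_numeric for numbers stored as text, pd.to_datetime for dates, and astype() for columns known to be clean. The column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data where every column was read in as text.
df = pd.DataFrame({
    "age": ["34", "N/A", "27"],
    "signup_date": ["2024-01-05", "not a date", "2024-03-17"],
    "customer_id": ["101", "102", "103"],
})

# errors="coerce" turns unparseable values into NaN instead of raising.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Same idea for dates: unparseable values become NaT.
df["signup_date"] = pd.to_datetime(
    df["signup_date"], format="%Y-%m-%d", errors="coerce"
)

# astype() is fine when every value is known to be valid.
df["customer_id"] = df["customer_id"].astype(int)

print(df.dtypes)
```

Note that a numeric column containing NaN ends up as a float dtype, since NaN cannot be stored in a plain integer column.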
5. Standardizing String Values
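A short sketch of string standardization with the .str accessor and map(). The name and gender columns are hypothetical examples.

```python
import pandas as pd

# Hypothetical values with inconsistent spacing and casing.
df = pd.DataFrame({
    "name": ["  ana LOPEZ ", "BEN ong", " cara  diaz"],
    "gender": ["Male", "female ", "FEMALE"],
})

df["name"] = (
    df["name"]
    .str.strip()                            # remove leading/trailing whitespace
    .str.replace(r"\s+", " ", regex=True)   # collapse internal runs of whitespace
    .str.title()                            # capitalize the first letter of each word
)

# Normalize first, then map to short codes.
# Values missing from the mapping would become NaN.
df["gender"] = (
    df["gender"].str.strip().str.title().map({"Male": "M", "Female": "F"})
)

print(df)
```

Normalizing casing before map() matters here: without it, 'FEMALE' and 'female ' would not match any key in the dictionary and would come out as NaN.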
6. Mini Project: Customer Data Cleaner
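One possible shape for such a cleaner is sketched below. The function name clean_customers, the input columns (Name, Email, Phone, Age), and the sample rows are all assumptions made for illustration, not a fixed specification.

```python
import pandas as pd


def clean_customers(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a hypothetical customer table."""
    df = raw.copy()

    # 1. Standardize column names: " Name " -> "name".
    df.columns = (
        df.columns.str.strip().str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)
    )

    # 2. Clean string values.
    df["name"] = df["name"].str.strip().str.title()
    df["email"] = df["email"].str.strip().str.lower()
    df["phone"] = df["phone"].str.replace(r"[^0-9]", "", regex=True)

    # 3. Fix dtypes; invalid ages become NaN.
    df["age"] = pd.to_numeric(df["age"], errors="coerce")

    # 4. Drop duplicate customers by email, keeping the first occurrence.
    df = df.drop_duplicates(subset=["email"], keep="first")
    return df.reset_index(drop=True)


raw = pd.DataFrame({
    " Name ": ["  ana lopez", "BEN ONG ", "Ana Lopez"],
    "Email": ["Ana@X.com ", "ben@x.com", "ana@x.com"],
    "Phone": ["(555) 010-1234", "555.010.5678", "555 010 1234"],
    "Age": ["34", "unknown", "34"],
})

clean = clean_customers(raw)
print(clean)
```

In this sketch, deduplication runs after the email column has been normalized. That ordering is a deliberate choice: 'Ana@X.com ' and 'ana@x.com' only compare equal once casing and whitespace have been cleaned.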
7. Common Mistakes
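- str.replace without regex=True: When using regex patterns like [^0-9], you must pass regex=True. Since pandas 2.0 the default is regex=False, so the pattern would otherwise be matched as a literal string.
- astype() on strings with mixed content: Calling astype(int) on a column that contains values such as 'N/A' raises ValueError. Use pd.to_numeric(col, errors='coerce') to convert safely.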
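Both mistakes can be reproduced in a few lines. The sample values below are hypothetical, and regex=False is passed explicitly so that the behavior does not depend on the installed pandas version.

```python
import pandas as pd

prices = pd.Series(["$1,200", "$950"])

# With regex=False the pattern is a literal string, so nothing is replaced.
literal = prices.str.replace("[^0-9]", "", regex=False)

# With regex=True the character class removes every non-digit.
digits = prices.str.replace("[^0-9]", "", regex=True)

ages = pd.Series(["34", "N/A", "27"])

# astype(int) raises on the first value it cannot parse.
try:
    ages.astype(int)
    astype_failed = False
except ValueError:
    astype_failed = True

# to_numeric with errors="coerce" converts what it can and marks the rest NaN.
safe_ages = pd.to_numeric(ages, errors="coerce")

print(literal.tolist(), digits.tolist(), astype_failed)
print(safe_ages)
```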
8. MCQs
Question 1
df.drop_duplicates(subset=['Email']) removes rows where?
Question 2
str.strip() removes?
Question 3
pd.to_numeric(col, errors='coerce') converts invalid values to?
Question 4
str.title() converts?
Question 5
df.columns.str.lower() applies to?
Question 6
df.duplicated().sum() counts?
Question 7
str.replace('[^0-9]', '', regex=True) does?
Question 8
df.rename(columns={'old': 'new'}) affects?
Question 9
df.drop_duplicates(keep='last') keeps?
Question 10
map({'Male': 'M', 'Female': 'F'}) on a Series?
9. Interview Questions
- Q: How do you clean a messy phone number column in Pandas?
- Q: What is the difference between drop_duplicates() and duplicated()?
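One way to approach both questions is sketched below. The validity rule for phone numbers (10 digits, optionally preceded by a country code of 1) is an assumption made for this example. Real data may need different rules.

```python
import pandas as pd

# --- Phone numbers: strip formatting, then validate ---
phones = pd.Series(["(555) 010-1234", "+1 555.010.5678", "5550109", None])

# Keep digits only; missing values stay missing.
digits = phones.str.replace(r"[^0-9]", "", regex=True)

# Assumption: drop a leading country code "1" when 10 digits follow it.
digits = digits.str.replace(r"^1(?=\d{10}$)", "", regex=True)

# Assumption: anything that is not exactly 10 digits is treated as invalid.
is_valid = digits.str.fullmatch(r"\d{10}", na=False)
cleaned = digits.where(is_valid)   # invalid entries become missing

# --- duplicated() vs drop_duplicates() ---
emails = pd.Series(["a@x.com", "b@x.com", "a@x.com"])

mask = emails.duplicated()          # boolean mask flagging repeats
deduped = emails.drop_duplicates()  # the data with repeats removed

print(cleaned)
print(mask.tolist(), deduped.tolist())
```

In short, duplicated() only reports which rows are repeats, which is useful for inspecting or counting them, while drop_duplicates() returns the data with those rows removed. With default arguments, emails[~mask] and deduped contain the same values.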
10. Summary
Data cleaning follows a consistent workflow: remove duplicates → standardize column names → fix dtypes → clean string values → handle missing data. The .str accessor methods chain cleanly. Always use pd.to_numeric(col, errors='coerce') for safe numeric conversion.
11. Next Chapter Recommendation
In Chapter 14: Handling Missing Data, we tackle NaN values with isnull(), dropna(), fillna(), and interpolation strategies.