CHAPTER 13 Beginner

Data Cleaning in Pandas

Updated: May 18, 2026

5 min read

# CHAPTER 13

Data Cleaning in Pandas

1. Chapter Introduction

Real-world data is messy — inconsistent casing, extra spaces, wrong data types, duplicates, and formatting issues. Data cleaning (wrangling) is estimated to consume 60-80% of a data scientist's time. This chapter masters every key cleaning technique.

2. Removing Duplicates

python

1234567891011121314151617181920212223

import pandas as pd

df = pd.DataFrame({
    &#039;ID': [1, 2, 3, 2, 4, 3],
    &#039;Name': ['Alice', 'Bob', 'Carol', 'Bob', 'David', 'Carol'],
    &#039;Email': ['a@x.com', 'b@x.com', 'c@x.com', 'b@x.com', 'd@x.com', 'c@x.com'],
    &#039;Salary': [85000, 72000, 91000, 72000, 55000, 91000]
})

print("Original:", len(df), "rows")
print(df.duplicated())           # Boolean mask of duplicate rows
print(df.duplicated().sum())     # Count: 2

# Remove duplicates (keep first occurrence)
df_clean = df.drop_duplicates()
print("After drop_duplicates:", len(df_clean))

# Keep last occurrence
df_last = df.drop_duplicates(keep=&#039;last')

# Check duplicates based on specific columns
df_unique_email = df.drop_duplicates(subset=[&#039;Email'])
df_unique_id = df.drop_duplicates(subset=[&#039;ID'], keep='first')

3. Renaming and Standardizing Columns

python

12345678910111213141516171819202122

df = pd.DataFrame({
    &#039;  Employee Name  ': ['Alice', 'Bob', 'Carol'],
    &#039;DEPT': ['ENG', 'MKT', 'ENG'],
    &#039;salary_usd': [85000, 72000, 91000],
    &#039;YearsExp': [5, 3, 8]
})

# Strip whitespace from column names
df.columns = df.columns.str.strip()

# Standardize to snake_case lowercase
df.columns = df.columns.str.lower().str.replace(&#039; ', '_')

# Manual rename
df = df.rename(columns={
    &#039;employee_name': 'name',
    &#039;dept': 'department',
    &#039;yearsexp': 'experience_years'
})

print(df.columns.tolist())
# ['name', 'department', 'salary_usd', 'experience_years']

4. Fixing Data Types

python

12345678910111213141516171819

df = pd.DataFrame({
    &#039;ID': ['001', '002', '003'],          # Should be int
    &#039;Salary': ['85,000', '72,000', '91,000'],  # Should be float
    &#039;Date': ['2024-01-15', '2024-02-20', '2024-03-10'],
    &#039;Is_Active': ['True', 'False', 'True']  # Should be bool
})

# Fix numeric columns with comma formatting
df[&#039;Salary'] = df['Salary'].str.replace(',', '').astype(float)
df[&#039;ID'] = df['ID'].astype(int)

# Fix dates
df[&#039;Date'] = pd.to_datetime(df['Date'])

# Fix boolean
df[&#039;Is_Active'] = df['Is_Active'].map({'True': True, 'False': False})

print(df.dtypes)
print(df)

5. Standardizing String Values

python

1234567891011121314151617181920

df = pd.DataFrame({
    &#039;Name': ['  alice ', 'BOB', 'Carol  ', ' DAVID'],
    &#039;City': ['new york', 'NEW YORK', 'Los Angeles', 'los angeles'],
    &#039;Gender': ['Male', 'M', 'female', 'F']
})

# Strip whitespace and standardize case
df[&#039;Name'] = df['Name'].str.strip().str.title()
df[&#039;City'] = df['City'].str.strip().str.title()

# Normalize categorical values
gender_map = {&#039;Male': 'M', 'male': 'M', 'Female': 'F', 'female': 'F'}
df[&#039;Gender'] = df['Gender'].map(gender_map)

# Extract numbers from strings
df[&#039;Phone_Digits'] = df.get('Phone', pd.Series(['555-1234-X', '555-5678-Y'])) \
    if False else pd.Series([&#039;555-1234', '555-5678', '555-9012', '555-3456'])
df[&#039;Phone_Clean'] = df['Phone_Digits'].str.replace('[^0-9]', '', regex=True)

print(df)

6. Mini Project: Customer Data Cleaner

python

12345678910111213141516171819202122232425262728293031323334353637383940414243

import pandas as pd

# Simulate messy customer data
raw_data = {
    &#039;customer_id': ['C001', 'C002', 'C002', 'C003', 'C004', 'C003'],
    &#039;Name': ['  Alice Smith', 'BOB jones', 'BOB jones', 'carol White', 'DAVID BROWN', 'carol White'],
    &#039;Email': ['alice@gmail.com', 'BOB@GMAIL.COM', 'BOB@GMAIL.COM', 'carol@yahoo.com', 'david@hotmail.com', 'carol@yahoo.com'],
    &#039;Phone': ['(555) 123-4567', '555.234.5678', '555.234.5678', '5553456789', '+1-555-456-7890', '5553456789'],
    &#039;age': ['28', '35', '35', 'N/A', '42', 'N/A'],
    &#039;Salary': ['$55,000', '$72,000', '$72,000', '$48,000', '$91,000', '$48,000']
}
df = pd.DataFrame(raw_data)

print(f"Before: {df.shape[0]} rows")

# Step 1: Remove duplicates
df = df.drop_duplicates(subset=[&#039;customer_id'])
print(f"After dedup: {df.shape[0]} rows")

# Step 2: Standardize column names
df.columns = df.columns.str.lower().str.strip()

# Step 3: Clean Name
df[&#039;name'] = df['name'].str.strip().str.title()

# Step 4: Clean Email
df[&#039;email'] = df['email'].str.lower().str.strip()

# Step 5: Clean Phone — extract digits only, format as XXX-XXX-XXXX
df[&#039;phone_digits'] = df['phone'].str.replace('[^0-9]', '', regex=True).str[-10:]
df[&#039;phone_clean'] = df['phone_digits'].str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)

# Step 6: Clean Age
df[&#039;age'] = pd.to_numeric(df['age'], errors='coerce')  # N/A → NaN

# Step 7: Clean Salary
df[&#039;salary'] = df['salary'].str.replace('[$,]', '', regex=True).astype(float)

# Step 8: Final clean dataset
clean_df = df[[&#039;customer_id', 'name', 'email', 'phone_clean', 'age', 'salary']]
print("\nClean Customer Data:")
print(clean_df.to_string(index=False))
print(f"\nMissing values:\n{clean_df.isnull().sum()}")

7. Common Mistakes

str.replace without regex=True: When using regex patterns like [^0-9], you must pass regex=True.

astype() on strings with mixed content: 'N/A'.astype(int) throws ValueError. Use pd.tonumeric(errors='coerce') to convert safely.

8. MCQs

Question 1

`df.drop``duplicates(subset=['Email'])` removes rows where?

Question 2

`str.strip()` removes?

Question 3

`pd.tonumeric(col, errors='coerce')` converts invalid values to?

Question 4

str.title() converts?

Question 5

df.columns.str.lower() applies to?

Question 6

df.duplicated().sum() counts?

Question 7

str.replace('[^0-9]', '', regex=True) does?

Question 8

df.rename(columns={'old': 'new'}) affects?

Question 9

`df.drop``duplicates(keep='last')` keeps?

Question 10

`map({'Male': 'M', 'Female': 'F'})` on a Series?

9. Interview Questions

Q: How do you clean a messy phone number column in Pandas?

Q: What is the difference between dropduplicates() and duplicated()?

10. Summary
Data cleaning follows a consistent workflow: remove duplicates → standardize column names → fix dtypes → clean string values → handle missing data. str accessor methods chain cleanly. Always use pd.tonumeric(errors='coerce') for safe numeric conversion.

11. Next Chapter Recommendation

In Chapter 14: Handling Missing Data, we tackle NaN values with isnull(), dropna(), fillna(), and interpolation strategies.

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

Data Cleaning in Pandas #

1. Chapter Introduction #

2. Removing Duplicates #

3. Renaming and Standardizing Columns #

4. Fixing Data Types #

5. Standardizing String Values #

6. Mini Project: Customer Data Cleaner #

7. Common Mistakes #

8. MCQs #

df.dropduplicates(subset=['Email']) removes rows where?

str.strip() removes?

pd.tonumeric(col, errors='coerce') converts invalid values to?

str.title() converts?

df.columns.str.lower() applies to?

df.duplicated().sum() counts?

str.replace('[^0-9]', '', regex=True) does?

df.rename(columns={'old': 'new'}) affects?

df.dropduplicates(keep='last') keeps?

map({'Male': 'M', 'Female': 'F'}) on a Series?

9. Interview Questions #

10. Summary #

11. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

Send Feedback / Bug

Feedback Submitted!

Data Cleaning in Pandas

1. Chapter Introduction

2. Removing Duplicates

3. Renaming and Standardizing Columns

4. Fixing Data Types

5. Standardizing String Values

6. Mini Project: Customer Data Cleaner

7. Common Mistakes

8. MCQs

`df.drop``duplicates(subset=['Email'])` removes rows where?

`str.strip()` removes?

`pd.tonumeric(col, errors='coerce')` converts invalid values to?

`str.title()` converts?

`df.columns.str.lower()` applies to?

`df.duplicated().sum()` counts?

`str.replace('[^0-9]', '', regex=True)` does?

`df.rename(columns={'old': 'new'})` affects?

`df.drop``duplicates(keep='last')` keeps?

`map({'Male': 'M', 'Female': 'F'})` on a Series?

9. Interview Questions

10. Summary

11. Next Chapter Recommendation