Skip to main content
Pandas & NumPy
CHAPTER 13 Beginner

Data Cleaning in Pandas

Updated: May 18, 2026
5 min read

# CHAPTER 13

Data Cleaning in Pandas

1. Chapter Introduction

Real-world data is messy — inconsistent casing, extra spaces, wrong data types, duplicates, and formatting issues. Data cleaning (wrangling) is estimated to consume 60-80% of a data scientist's time. This chapter masters every key cleaning technique.

2. Removing Duplicates

python
1234567891011121314151617181920212223
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 2, 4, 3],
    'Name': ['Alice', 'Bob', 'Carol', 'Bob', 'David', 'Carol'],
    'Email': ['a@x.com', 'b@x.com', 'c@x.com', 'b@x.com', 'd@x.com', 'c@x.com'],
    'Salary': [85000, 72000, 91000, 72000, 55000, 91000]
})

print("Original:", len(df), "rows")
print(df.duplicated())           # Boolean mask of duplicate rows
print(df.duplicated().sum())     # Count: 2

# Remove duplicates (keep first occurrence)
df_clean = df.drop_duplicates()
print("After drop_duplicates:", len(df_clean))

# Keep last occurrence
df_last = df.drop_duplicates(keep='last')

# Check duplicates based on specific columns
df_unique_email = df.drop_duplicates(subset=['Email'])
df_unique_id = df.drop_duplicates(subset=['ID'], keep='first')

3. Renaming and Standardizing Columns

python
12345678910111213141516171819202122
df = pd.DataFrame({
    '  Employee Name  ': ['Alice', 'Bob', 'Carol'],
    'DEPT': ['ENG', 'MKT', 'ENG'],
    'salary_usd': [85000, 72000, 91000],
    'YearsExp': [5, 3, 8]
})

# Strip whitespace from column names
df.columns = df.columns.str.strip()

# Standardize to snake_case lowercase
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Manual rename
df = df.rename(columns={
    'employee_name': 'name',
    'dept': 'department',
    'yearsexp': 'experience_years'
})

print(df.columns.tolist())
# ['name', 'department', 'salary_usd', 'experience_years']

4. Fixing Data Types

python
12345678910111213141516171819
df = pd.DataFrame({
    'ID': ['001', '002', '003'],          # Should be int
    'Salary': ['85,000', '72,000', '91,000'],  # Should be float
    'Date': ['2024-01-15', '2024-02-20', '2024-03-10'],
    'Is_Active': ['True', 'False', 'True']  # Should be bool
})

# Fix numeric columns with comma formatting
df['Salary'] = df['Salary'].str.replace(',', '').astype(float)
df['ID'] = df['ID'].astype(int)

# Fix dates
df['Date'] = pd.to_datetime(df['Date'])

# Fix boolean
df['Is_Active'] = df['Is_Active'].map({'True': True, 'False': False})

print(df.dtypes)
print(df)

5. Standardizing String Values

python
1234567891011121314151617181920
df = pd.DataFrame({
    'Name': ['  alice ', 'BOB', 'Carol  ', ' DAVID'],
    'City': ['new york', 'NEW YORK', 'Los Angeles', 'los angeles'],
    'Gender': ['Male', 'M', 'female', 'F']
})

# Strip whitespace and standardize case
df['Name'] = df['Name'].str.strip().str.title()
df['City'] = df['City'].str.strip().str.title()

# Normalize categorical values
gender_map = {'Male': 'M', 'male': 'M', 'Female': 'F', 'female': 'F'}
df['Gender'] = df['Gender'].map(gender_map)

# Extract numbers from strings
df['Phone_Digits'] = df.get('Phone', pd.Series(['555-1234-X', '555-5678-Y'])) \
    if False else pd.Series(['555-1234', '555-5678', '555-9012', '555-3456'])
df['Phone_Clean'] = df['Phone_Digits'].str.replace('[^0-9]', '', regex=True)

print(df)

6. Mini Project: Customer Data Cleaner

python
12345678910111213141516171819202122232425262728293031323334353637383940414243
import pandas as pd

# Simulate messy customer data
raw_data = {
    'customer_id': ['C001', 'C002', 'C002', 'C003', 'C004', 'C003'],
    'Name': ['  Alice Smith', 'BOB jones', 'BOB jones', 'carol White', 'DAVID BROWN', 'carol White'],
    'Email': ['alice@gmail.com', 'BOB@GMAIL.COM', 'BOB@GMAIL.COM', 'carol@yahoo.com', 'david@hotmail.com', 'carol@yahoo.com'],
    'Phone': ['(555) 123-4567', '555.234.5678', '555.234.5678', '5553456789', '+1-555-456-7890', '5553456789'],
    'age': ['28', '35', '35', 'N/A', '42', 'N/A'],
    'Salary': ['$55,000', '$72,000', '$72,000', '$48,000', '$91,000', '$48,000']
}
df = pd.DataFrame(raw_data)

print(f"Before: {df.shape[0]} rows")

# Step 1: Remove duplicates
df = df.drop_duplicates(subset=['customer_id'])
print(f"After dedup: {df.shape[0]} rows")

# Step 2: Standardize column names
df.columns = df.columns.str.lower().str.strip()

# Step 3: Clean Name
df['name'] = df['name'].str.strip().str.title()

# Step 4: Clean Email
df['email'] = df['email'].str.lower().str.strip()

# Step 5: Clean Phone — extract digits only, format as XXX-XXX-XXXX
df['phone_digits'] = df['phone'].str.replace('[^0-9]', '', regex=True).str[-10:]
df['phone_clean'] = df['phone_digits'].str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)

# Step 6: Clean Age
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # N/A → NaN

# Step 7: Clean Salary
df['salary'] = df['salary'].str.replace('[$,]', '', regex=True).astype(float)

# Step 8: Final clean dataset
clean_df = df[['customer_id', 'name', 'email', 'phone_clean', 'age', 'salary']]
print("\nClean Customer Data:")
print(clean_df.to_string(index=False))
print(f"\nMissing values:\n{clean_df.isnull().sum()}")

7. Common Mistakes

  • str.replace without regex=True: When using regex patterns like [^0-9], you must pass regex=True.
  • astype() on strings with mixed content: 'N/A'.astype(int) throws ValueError. Use pd.tonumeric(errors='coerce') to convert safely.

8. MCQs

Question 1

df.dropduplicates(subset=['Email']) removes rows where?

Question 2

str.strip() removes?

Question 3

pd.tonumeric(col, errors='coerce') converts invalid values to?

Question 4

str.title() converts?

Question 5

df.columns.str.lower() applies to?

Question 6

df.duplicated().sum() counts?

Question 7

str.replace('[^0-9]', '', regex=True) does?

Question 8

df.rename(columns={'old': 'new'}) affects?

Question 9

df.dropduplicates(keep='last') keeps?

Question 10

map({'Male': 'M', 'Female': 'F'}) on a Series?

9. Interview Questions

  • Q: How do you clean a messy phone number column in Pandas?
  • Q: What is the difference between dropduplicates() and duplicated()?

10. Summary

Data cleaning follows a consistent workflow: remove duplicates → standardize column names → fix dtypes → clean string values → handle missing data. str accessor methods chain cleanly. Always use pd.to
numeric(errors='coerce') for safe numeric conversion.

11. Next Chapter Recommendation

In Chapter 14: Handling Missing Data, we tackle NaN values with isnull(), dropna(), fillna(), and interpolation strategies.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·