Skip to main content
Data Cleaning
CHAPTER 02 Beginner

Understanding Dirty Data

Updated: May 18, 2026
5 min read

# CHAPTER 2

Understanding Dirty Data

1. Chapter Introduction

Before you can fix data problems, you need to recognize them. Dirty data comes in predictable patterns — and each pattern requires a different fix. This chapter catalogs every dirty data type with real examples so you can instantly diagnose data quality issues in any dataset.

2. Taxonomy of Dirty Data

text
12345678910111213141516
DIRTY DATA TAXONOMY:

┌─────────────────────────────────────────────────────────────────┐
│                    DIRTY DATA TYPES                             │
├─────────────────────────────────────────────────────────────────┤
│ 1. MISSING DATA         → NULL, NaN, empty strings             │
│ 2. DUPLICATE RECORDS    → Identical or near-identical rows      │
│ 3. INVALID VALUES       → Ages=999, negative prices            │
│ 4. INCONSISTENT FORMAT  → "US", "USA", "United States"         │
│ 5. STRUCTURAL ERRORS    → Name in email column                  │
│ 6. OUTLIERS             → $1 salary, 200-year-old customer      │
│ 7. TYPE MISMATCHES      → "25" stored as string not integer     │
│ 8. CORRUPTED DATA       → "A1@#$%B" in numeric column          │9. SCHEMA VIOLATIONS    → Required field empty, wrong FK        │
│ 10. STALE/OUTDATED DATA → 2015 product catalog in 2024 system  │
└─────────────────────────────────────────────────────────────────┘

3. Missing Data — Types and Patterns

python
12345678910111213141516171819202122232425262728293031323334
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Types of Missing Data (from statistics theory):
# MCAR — Missing Completely At Random: no pattern, random sensor failure
# MAR  — Missing At Random: missing-ness related to OTHER observed variables
#         (higher-income people skip salary questions on forms)
# MNAR — Missing Not At Random: missing-ness related to the missing value itself
#         (very sick patients drop out of medical trials — sickest are missing)

# Generate a dirty customer dataset
np.random.seed(42)
n = 200
customers = pd.DataFrame({
    'customer_id': range(1, n+1),
    'name':        [f'Customer_{i}' if i % 15 != 0 else None for i in range(1, n+1)],
    'email':       [f'user{i}@domain.com' if i % 8 != 0 else '' for i in range(1, n+1)],
    'age':         [np.random.randint(18, 70) if i % 7 != 0 else np.nan for i in range(1, n+1)],
    'salary':      [round(np.random.uniform(30000, 120000), 2)
                    if i % 5 != 0 else np.nan for i in range(1, n+1)],
    'country':     [np.random.choice(['USA', 'US', 'United States', 'U.S.A', 'uk', 'UK', 'Canada', None])
                    for _ in range(n)]
})

print("=== MISSING VALUE ANALYSIS ===")
missing = customers.isnull() | (customers == '')
print(missing.sum())
print(f"\nMissing percentage per column:")
print((missing.mean() * 100).round(1))

# Visualize missing data pattern
print("\nMissing pattern (first 20 rows):")
print(missing.head(20).astype(int).to_string())

4. Duplicate Data

python
123456789101112131415161718192021
# ─── DUPLICATE DATA EXAMPLES ──────────────────────────
duplicate_data = pd.DataFrame({
    'id':    [1, 2, 3, 3, 4, 5, 5, 5, 6, 7],
    'name':  ['Alice', 'Bob', 'Carol', 'Carol', 'David', 'Eve', 'Eve', 'EVE', 'Frank', 'Grace'],
    'email': ['a@x.com', 'b@x.com', 'c@x.com', 'c@x.com',
               'd@x.com', 'e@x.com', 'e@x.com', 'E@x.com', 'f@x.com', 'g@x.com'],
    'score': [85, 90, 78, 78, 82, 95, 95, 95, 67, 88]
})

print("=== DUPLICATE ANALYSIS ===")
print(f"Total rows: {len(duplicate_data)}")
print(f"\nExact duplicates: {duplicate_data.duplicated().sum()}")
print(f"Duplicates by ID: {duplicate_data.duplicated(subset=['id']).sum()}")
print(f"Duplicates by email: {duplicate_data.duplicated(subset=['email'], keep=False).sum()}")

# Detect near-duplicates (different case)
duplicate_data['name_lower'] = duplicate_data['name'].str.lower()
duplicate_data['email_lower'] = duplicate_data['email'].str.lower()
near_dupes = duplicate_data.duplicated(subset=['email_lower'], keep=False)
print(f"\nNear-duplicates (case-insensitive email): {near_dupes.sum()}")
print(duplicate_data[near_dupes][['id','name','email']])

5. Invalid and Outlier Values

python
12345678910111213141516171819
# ─── INVALID VALUES ────────────────────────────────────
invalid_data = pd.DataFrame({
    'employee_id': range(1, 11),
    'name':       ['Alice','Bob','Carol','David','Eve','Frank','Grace','Heidi','Ivan','Judy'],
    'age':        [28, -5, 32, 999, 25, 30, 0, 150, 27, 35],         # Invalid: -5, 999, 0, 150
    'salary':     [75000, 82000, -5000, 95000, 60000, 0, 88000, 72000, 999999, 68000],  # Invalid: -5000, 0, 999999
    'email':      ['a@co.com','b@co.com','c@co.com','not-email',
                   'e@co.com','f@co.com','g@co.com','h@co.com','i@co.com','j@co.com'],
    'department': ['IT','HR','IT','Sales','IT','Unknown','HR','Sales','IT','Finance'],
    'hire_date':  ['2020-03-15','2023-07-01','2021-11-20','2099-01-01',
                   '2022-05-10','2024-01-15','2019-08-22','2023-12-01','2020-06-30','2024-03-10']
})

print("=== INVALID VALUE DETECTION ===")
print(f"Invalid ages (< 18 or > 80): {((invalid_data[&#039;age'] < 18) | (invalid_data['age'] > 80)).sum()}")
print(f"Invalid salaries (< 0 or > 500000): {((invalid_data[&#039;salary'] < 0) | (invalid_data['salary'] > 500000)).sum()}")
print(f"Invalid emails: {(~invalid_data[&#039;email'].str.contains('@', na=False)).sum()}")
print(f"Future hire dates: {(pd.to_datetime(invalid_data[&#039;hire_date'], errors='coerce') > pd.Timestamp.now()).sum()}")
print(f"Unknown departments: {(invalid_data[&#039;department'] == 'Unknown').sum()}")

6. Inconsistent Formatting

python
12345678910111213141516171819202122
# ─── INCONSISTENCY EXAMPLES ────────────────────────────
inconsistent = pd.DataFrame({
    &#039;id':      range(1, 12),
    &#039;country': ['USA', 'US', 'United States', 'U.S.A.', 'us', 'Usa',
                 &#039;UK', 'United Kingdom', 'u.k.', 'Canada', 'CAN'],
    &#039;phone':   ['+1-555-100-2000', '5551002001', '(555) 100-2002', '555.100.2003',
                &#039;+1 555 1002004', '5551002005', '555-100-2006', '15551002007',
                None, &#039;555 100 2008', '+1(555)1002009'],
    &#039;date':    ['2024-01-15', '01/15/2024', '15-Jan-2024', 'January 15, 2024',
                &#039;20240115', '2024/01/15', '15.01.2024', '01-15-2024',
                &#039;2024-01-15T10:30:00', 'Jan 15 2024', '1/15/24'],
    &#039;status':  ['Active', 'active', 'ACTIVE', 'A', '1', 'Yes', 'Y',
                 &#039;Inactive', 'INACTIVE', 'inactive', '0']
})

print("=== INCONSISTENCY ANALYSIS ===")
print(f"\nCountry unique values: {inconsistent[&#039;country'].nunique()}")
print(inconsistent[&#039;country'].value_counts())

print(f"\nStatus unique values: {inconsistent[&#039;status'].nunique()}")
print(inconsistent[&#039;status'].value_counts())
# All mean "Active" or "Inactive" — but stored 7 different ways!

7. Corrupted Data

python
1234567891011121314151617181920
# ─── CORRUPTED DATA EXAMPLES ──────────────────────────
corrupted = pd.DataFrame({
    &#039;product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    &#039;price':      ['29.99', '$45.00', '12,500', 'N/A', '##ERROR##'],
    &#039;quantity':   ['100', '250.5', '-10', 'NULL', 'many'],
    &#039;rating':     [4.5, 'five', '3.8', None, '9999'],     # Mixed types
    &#039;description':['Great product', 'Good\x00item', b'bytes'.decode('utf-8'),
                   &#039;Normal item', 'Item\nwith\nnewlines']
})

print("=== CORRUPTION DETECTION ===")
# Try to convert price to numeric
pd.to_numeric(corrupted[&#039;price'].str.replace('[$,]', '', regex=True), errors='coerce')
# N/A and ##ERROR## become NaN

# Check for non-numeric in rating
corrupted[&#039;rating_clean'] = pd.to_numeric(corrupted['rating'], errors='coerce')
print("Original rating:", corrupted[&#039;rating'].tolist())
print("Numeric rating: ", corrupted[&#039;rating_clean'].tolist())
print("Conversion failures:", corrupted[&#039;rating_clean'].isnull().sum())

8. Common Mistakes

  • Treating all missing data the same: An empty cell might mean "zero," "unknown," "not applicable," or "data entry error" — different meanings require different fixes. Always investigate WHY data is missing.
  • Confusing validation with cleaning: Validation tells you what's wrong; cleaning fixes it. They're separate steps. Combining them leads to poorly documented processes.

9. MCQs

Question 1

MCAR stands for?

Question 2

Near-duplicate detection handles?

Question 3

"USA", "U.S.A", "United States" is which dirty data type?

Question 4

pd.tonumeric(col, errors='coerce') converts errors to?

Question 5

MNAR (Missing Not At Random) means?

Question 6

A future hiredate of 2099 is which issue?

Question 7

df.duplicated(subset=['email']) detects duplicates based on?

Question 8

Corrupted data differs from invalid data by?

Question 9

Schema violations include?

Question 10

Empty string "" vs None vs NaN in pandas?

10. Interview Questions

  • Q: What is the difference between MCAR, MAR, and MNAR missing data?
  • Q: How do you detect near-duplicate records that differ only by case?

11. Summary

Dirty data has 10 major types: missing, duplicate, invalid, inconsistent format, structural errors, outliers, type mismatches, corrupted, schema violations, and stale data. Each type has distinct patterns: MCAR/MAR/MNAR for missing; exact/near/fuzzy for duplicates; impossible values vs statistical outliers. Always diagnose before cleaning.

12. Next Chapter Recommendation

In Chapter 3: Installing Python and Data Cleaning Tools, we set up the complete Python data cleaning environment with Pandas, NumPy, and Jupyter.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·