Data Cleaning
CHAPTER 01 Beginner

Introduction to Data Cleaning

Updated: May 18, 2026

1. Chapter Introduction

Data is the fuel of the modern economy — but raw data is almost always dirty. Missing values, inconsistent formats, duplicate records, and invalid entries are the rule, not the exception. Data cleaning is the systematic process of detecting and correcting these problems so that data is reliable enough to drive decisions, train ML models, and power analytics dashboards.

2. Learning Objectives

  • Define data cleaning and its role in the data science lifecycle.
  • Understand the 6 dimensions of data quality.
  • Map the complete data preprocessing workflow.
  • Identify real-world consequences of dirty data.
  • Complete a mini student dataset cleaning project.

3. What is Data Cleaning?

text
DATA CLEANING DEFINITION:
─────────────────────────────────────────────────────────────────
Data cleaning (also called data cleansing or data scrubbing) is
the process of detecting, diagnosing, and correcting or removing
corrupt, inaccurate, incomplete, or irrelevant records from a
dataset to improve data quality before analysis.

REAL-WORLD ANALOGY:
Think of data as raw vegetables from a farm:
  • Some vegetables are bruised       → Outliers
  • Some are missing                  → Missing values
  • Some are duplicates (two carrots) → Duplicate records
  • Some are labeled wrong            → Incorrect data types
  • Some have dirt on them            → Extra whitespace/formatting

Data cleaning = washing and preparing vegetables before cooking.
─────────────────────────────────────────────────────────────────
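
Each item in the analogy maps to a check you can run in pandas. The sketch below is one possible way to do it, using a small invented table: the `produce` DataFrame, its column names, and the 10 g to 5000 g plausibility range are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: one example of each problem from the analogy.
produce = pd.DataFrame({
    "item":     ["carrot", "carrot", " potato ", "onion", "leek", "pumpkin"],
    "weight_g": ["120", "120", "95", "n/a", None, "9000"],  # numbers stored as text
})

# Missing values: cells with no value at all.
n_missing = int(produce["weight_g"].isna().sum())

# Duplicate records: rows identical to an earlier row.
n_duplicates = int(produce.duplicated().sum())

# Extra whitespace: text that changes when stripped.
n_untrimmed = int((produce["item"] != produce["item"].str.strip()).sum())

# Incorrect data types: convert to numbers; unparseable text becomes NaN.
weight = pd.to_numeric(produce["weight_g"], errors="coerce")
n_not_numeric = int((weight.isna() & produce["weight_g"].notna()).sum())

# Outliers: values outside an assumed plausible range of 10 g to 5000 g.
n_outliers = int((weight.notna() & ~weight.between(10, 5000)).sum())

print(f"missing={n_missing} duplicates={n_duplicates} untrimmed={n_untrimmed} "
      f"not_numeric={n_not_numeric} outliers={n_outliers}")
```

A fixed range is used for the outlier check because statistical rules such as the IQR fence are unreliable on a handful of rows. On a real dataset, the range would come from domain knowledge.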

4. Why Data Cleaning Matters

text
"Garbage in, garbage out" — the most fundamental principle of data science.

REAL CONSEQUENCES OF DIRTY DATA:
┌─────────────────────────────────────────────────────────────────┐
│ Industry    │ Problem              │ Consequence                 │
├─────────────┼──────────────────────┼─────────────────────────────┤
│ Healthcare  │ Wrong drug dosage    │ Patient harm                │
│ Finance     │ Duplicate transaction│ Customer overcharged        │
│ E-commerce  │ Wrong address format │ Failed delivery             │
│ ML Model    │ Biased training data │ Discriminatory predictions  │
│ Analytics   │ Missing sales data   │ Wrong business decisions    │
│ Government  │ Census errors        │ Misallocated resources      │
└─────────────┴──────────────────────┴─────────────────────────────┘

COST: Gartner estimates poor data quality costs organizations
      an average of $12.9 million per year.
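
To make "garbage in, garbage out" concrete, here is a minimal sketch of the duplicate-transaction row in the table above. The transaction IDs and amounts are invented for illustration.

```python
import pandas as pd

# Hypothetical transactions; T3 was recorded twice by mistake.
tx = pd.DataFrame({
    "tx_id":  ["T1", "T2", "T3", "T3"],
    "amount": [40.0, 25.0, 100.0, 100.0],
})

# Summing the dirty data bills the duplicate a second time.
dirty_total = tx["amount"].sum()

# Keeping one row per transaction ID gives the amount actually owed.
clean_total = tx.drop_duplicates(subset="tx_id")["amount"].sum()

overcharge = dirty_total - clean_total
print(f"dirty={dirty_total} clean={clean_total} overcharge={overcharge}")
```

The calculation itself is correct in both cases. Only the input differs, which is why the error is easy to miss downstream.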

5. Data Quality Dimensions

text
6 DIMENSIONS OF DATA QUALITY:

1. COMPLETENESS — Are all required values present?
   Bad:  name="Alice", email=NULL, phone=NULL
   Good: name="Alice", email="alice@co.com", phone="+1-555-0100"

2. ACCURACY — Do values reflect reality?
   Bad:  age = 250  (impossible)
   Good: age = 25

3. CONSISTENCY — Is data uniform across systems?
   Bad:  USA / United States / U.S.A (3 formats for same thing)
   Good: United States (one format everywhere)

4. VALIDITY — Do values conform to defined rules?
   Bad:  email = "not-an-email"
   Good: email = "user@domain.com"

5. UNIQUENESS — Are there duplicate records?
   Bad:  Row 1 and Row 47 are identical customer records
   Good: Each customer appears exactly once

6. TIMELINESS — Is data current/relevant?
   Bad:  Using 2019 product prices in 2024 analysis
   Good: Data reflects the current state of reality
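
The six dimensions can be turned into a simple scorecard. The sketch below is illustrative only: the `customers` table, the analysis date `AS_OF`, the one-year staleness threshold, and the email pattern are all assumptions. The accuracy check is also only a plausibility proxy, since true accuracy needs a trusted reference to compare against.

```python
import pandas as pd

# Hypothetical customer table containing one problem per dimension.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email":       ["alice@co.com", None, None, "not-an-email"],
    "age":         [25, 250, 250, 31],
    "country":     ["United States", "USA", "USA", "U.S.A"],
    "updated":     pd.to_datetime(["2024-03-01", "2019-06-15",
                                   "2019-06-15", "2024-01-10"]),
})

AS_OF = pd.Timestamp("2024-06-01")               # assumed analysis date
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # deliberately simple rule

email = customers["email"]
age_days = AS_OF - customers["updated"]

report = {
    # 1. Completeness: required values that are absent.
    "missing_emails": int(email.isna().sum()),
    # 2. Accuracy (proxy): ages outside a plausible human range.
    "implausible_ages": int((~customers["age"].between(0, 120)).sum()),
    # 3. Consistency: number of spellings used for what is one country.
    "country_spellings": int(customers["country"].nunique()),
    # 4. Validity: present emails that break the format rule.
    "malformed_emails": int(
        (email.notna() & ~email.str.fullmatch(EMAIL_PATTERN, na=False)).sum()
    ),
    # 5. Uniqueness: rows identical to an earlier row.
    "duplicate_rows": int(customers.duplicated().sum()),
    # 6. Timeliness: rows not updated within the last 365 days.
    "stale_rows": int((age_days > pd.Timedelta(days=365)).sum()),
}

for check, count in report.items():
    print(f"{check}: {count}")
```

Note that the missing emails are counted under completeness and excluded from the validity check, so each problem is reported once.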

6. Data Cleaning Workflow

text
DATA CLEANING PIPELINE:

Raw Data (CSV, Excel, Database, API)
        │
        ▼
┌─────────────────────────────────────────────────┐
│  STEP 1: DATA PROFILING                          │
│  • Shape, types, missing values, distributions  │
│  • Tools: df.info(), df.describe(), df.head()   │
└─────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│  STEP 2: ISSUE DETECTION                         │
│  • Find NaN, outliers, duplicates, bad formats  │
│  • Tools: isnull(), duplicated(), value_counts()│
└─────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│  STEP 3: CLEANING & TRANSFORMATION               │
│  • Fix each issue type systematically           │
│  • fillna(), drop_duplicates(), str.strip()     │
└─────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│  STEP 4: VALIDATION                              │
│  • Verify fixes didn't introduce new problems   │
│  • Re-run profiling, check constraints          │
└─────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────┐
│  STEP 5: EXPORT & DOCUMENT                       │
│  • Save cleaned dataset                         │
│  • Document all changes for reproducibility     │
└─────────────────────────────────────────────────┘
        │
        ▼
Analytics-Ready Dataset ✅
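
One way to keep the five steps separate in code is to give each its own function. This is a minimal sketch, not a prescribed structure: the function names, the `orders` table, and its columns are hypothetical.

```python
import pandas as pd

def profile(df):
    """Step 1: summarize shape and missing values per column."""
    return {"shape": df.shape, "missing": df.isna().sum().to_dict()}

def detect_issues(df):
    """Step 2: count the problems we intend to fix."""
    return {
        "duplicates": int(df.duplicated().sum()),
        "missing": int(df.isna().sum().sum()),
        "untrimmed": int((df["city"] != df["city"].str.strip()).sum()),
    }

def clean(df):
    """Step 3: fix each issue type on a copy, leaving the input untouched."""
    out = df.drop_duplicates().copy()
    out["city"] = out["city"].str.strip()
    out["sales"] = out["sales"].fillna(out["sales"].median())
    return out.reset_index(drop=True)

# Hypothetical raw data: stray spaces, a duplicated row, missing sales.
orders = pd.DataFrame({
    "city":  [" Pune", "Delhi", "Delhi", "Mumbai "],
    "sales": [100.0, None, None, 300.0],
})

print(profile(orders))
before = detect_issues(orders)
cleaned = clean(orders)

# Step 4: validate by re-running detection on the cleaned data.
after = detect_issues(cleaned)

# Step 5: export; to_csv() with no path returns the CSV text.
csv_text = cleaned.to_csv(index=False)
print(before, after, sep="\n")
```

Re-running the same detection function after cleaning is a cheap way to confirm that the fixes worked and introduced no new problems.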

7. Mini Project: Clean a Student Dataset

python
import pandas as pd
import numpy as np

# ─── DIRTY STUDENT DATASET ────────────────────────────
raw_data = {
    'student_id': [1, 2, 3, 4, 5, 5, 6, 7, 8, 9],
    'name':       ['Alice', 'BOB', '  Carol  ', 'David', 'Eve',
                   'Eve', 'Frank', None, 'Heidi', 'Ivan'],
    'age':        [20, 21, 19, 250, 22, 22, 20, 21, -5, 23],
    'email':      ['alice@edu.com', 'bob@edu.com', 'carol@edu.com',
                   'david@edu.com', 'eve@edu.com', 'eve@edu.com',
                   'not-an-email', None, 'heidi@edu.com', 'ivan@edu.com'],
    'score':      [85, 92, 78, 88, 95, 95, 72, 80, None, 88],
    'grade':      ['B', 'A', 'C', 'B', 'A', 'A', 'C', 'B', None, 'B']
}
df = pd.DataFrame(raw_data)

print("=== RAW DATA ===")
print(df.to_string())
print(f"\nShape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nMissing Values:\n{df.isnull().sum()}")

# ─── STEP 1: PROFILE ──────────────────────────────────
print("\n=== DATA QUALITY ISSUES FOUND ===")
print(f"  Duplicates: {df.duplicated().sum()}")
print(f"  Missing values: {df.isnull().sum().sum()}")
print(f"  Invalid ages (< 0 or > 120): {((df['age'] < 0) | (df['age'] > 120)).sum()}")
# Missing emails count as invalid here because na=False treats them as non-matching
print(f"  Invalid emails: {(~df['email'].str.contains('@', na=False)).sum()}")

# ─── STEP 2: CLEAN ────────────────────────────────────
# Remove duplicates first so repeated rows do not skew the medians below
df = df.drop_duplicates(subset=['student_id']).reset_index(drop=True)

# Fix names: strip whitespace + title case
df['name'] = df['name'].str.strip().str.title()

# Fix invalid ages: values outside 0-120 become NaN, then get the median age
df['age'] = df['age'].where(df['age'].between(0, 120))
df['age'] = df['age'].fillna(df['age'].median())

# Handle missing names
df['name'] = df['name'].fillna('Unknown')

# Mark invalid emails (missing emails are flagged False, not filled in)
df['email_valid'] = df['email'].str.contains(r'^[\w.]+@[\w.]+\.[a-z]{2,}$', na=False)

# Impute missing scores
df['score'] = df['score'].fillna(df['score'].median())

# ─── STEP 3: VALIDATE ─────────────────────────────────
print("\n=== CLEANED DATA ===")
print(df.to_string())
print(f"\nShape after cleaning: {df.shape}")
print(f"Remaining missing: {df.isnull().sum().sum()}")

# ─── STEP 4: EXPORT ───────────────────────────────────
df.to_csv('cleaned_students.csv', index=False)
print("\n✅ Cleaned data saved to cleaned_students.csv")

Output:

=== DATA QUALITY ISSUES FOUND ===
  Duplicates: 1
  Missing values: 4
  Invalid ages (< 0 or > 120): 2
  Invalid emails: 2

=== CLEANED DATA ===
   student_id   name   age           email  score grade  email_valid
0           1  Alice  20.0   alice@edu.com   85.0     B         True
1           2    Bob  21.0     bob@edu.com   92.0     A         True
...
Shape after cleaning: (9, 7)
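Remaining missing: 2

The two values still missing are the email for student 7 and the grade for student 8, which the script leaves untouched.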

8. Common Mistakes

  • Cleaning data before understanding it: Always profile first. Blindly running dropna() drops every row that has at least one missing value, which can discard a large share of your data before you know why the values are missing.
  • Not documenting cleaning steps: In production, every cleaning decision must be logged — what was changed, why, and when. Undocumented cleaning is impossible to audit or reproduce.
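
Both mistakes can be illustrated with a short sketch. The `survey` data, the `log_step` helper, and the log fields are hypothetical; a production pipeline would more likely write to a proper logging or audit system.

```python
from datetime import datetime, timezone

import pandas as pd

# Hypothetical survey data with gaps in two different columns.
survey = pd.DataFrame({
    "respondent": [1, 2, 3, 4, 5],
    "age":        [34, None, 29, 41, None],
    "income":     [52000, 61000, None, 48000, 75000],
})

# Profile first: how many rows would a blanket dropna() cost?
rows_lost = len(survey) - len(survey.dropna())
print(f"dropna() would remove {rows_lost} of {len(survey)} rows")
print(survey.isna().sum())

cleaning_log = []

def log_step(action, column, rows_affected, reason):
    """Record what was changed, why, and when."""
    cleaning_log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "column": column,
        "rows_affected": rows_affected,
        "reason": reason,
    })

# Targeted fix instead of dropping rows: impute age with the median.
cleaned = survey.copy()
n_missing_age = int(cleaned["age"].isna().sum())
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
log_step("fillna(median)", "age", n_missing_age,
         "age missing for some respondents; imputing keeps their other answers")

print(cleaning_log)
```

Here the blanket `dropna()` would have discarded three of five respondents, while the targeted fix keeps every row and leaves an audit trail.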

9. Best Practices

text
✅ Always keep a copy of raw data — never overwrite the original
✅ Profile before cleaning (understand the problem first)
✅ Log every transformation step
✅ Validate after each major cleaning operation
✅ Involve domain experts for business rule validation
✅ Use version control for cleaning scripts
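
Two of these practices, preserving the raw data and validating after cleaning, can be sketched as follows. The `validate` function and its three rules (unique IDs, no missing scores, scores between 0 and 100) are assumptions chosen for illustration; real rules should come from domain experts.

```python
import pandas as pd

# Hypothetical raw data with a duplicated ID and missing scores.
raw = pd.DataFrame({
    "student_id": [1, 2, 2],
    "score":      [85.0, None, None],
})
raw_snapshot = raw.copy(deep=True)  # untouched copy kept for auditing

def validate(df):
    """Return a list of rule violations; an empty list means the data passed."""
    problems = []
    if not df["student_id"].is_unique:
        problems.append("duplicate student_id")
    if df["score"].isna().any():
        problems.append("missing score")
    if not df["score"].dropna().between(0, 100).all():
        problems.append("score outside 0-100")
    return problems

problems_before = validate(raw)

# Clean a copy so the original DataFrame is never modified.
cleaned = raw.drop_duplicates(subset="student_id").copy()
cleaned["score"] = cleaned["score"].fillna(cleaned["score"].median())

problems_after = validate(cleaned)
print("before:", problems_before)
print("after: ", problems_after)
print("raw data unchanged:", raw.equals(raw_snapshot))
```

Returning a list of problems rather than stopping at the first failure makes it easier to report every violation in one pass.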

10. MCQs

Question 1

Data cleaning is also known as?

Question 2

"Garbage in, garbage out" means?

Question 3

Which data quality dimension checks for duplicates?

Question 4

First step in the cleaning pipeline is?

Question 5

df.isnull().sum() counts?

Question 6

Gartner estimates poor data quality costs?

Question 7

Age value of 250 is a problem of?

Question 8

"USA" vs "United States" vs "U.S.A" violates which dimension?

Question 9

df.drop_duplicates() in pandas?

Question 10

Why keep original raw data?

11. Interview Questions

  • Q: What is the difference between data cleaning, data wrangling, and data preprocessing?
  • Q: Walk me through your data cleaning process for a new dataset.

12. FAQ

  • Q: Is data cleaning the same as ETL? A: No. ETL (Extract-Transform-Load) is a broader process. Data cleaning is typically part of the Transform step within ETL, but cleaning is also done interactively during analysis.
  • Q: How long does data cleaning take? A: Commonly cited industry figures put it at 60-80% of a data scientist's time, though estimates vary. On a Kaggle project, where the data is usually already curated, it is closer to 20-30% of total time.

13. Summary

Data cleaning transforms dirty, unreliable raw data into analysis-ready datasets. The 6 quality dimensions — completeness, accuracy, consistency, validity, uniqueness, timeliness — define what "clean" means. The cleaning pipeline: profile → detect issues → clean → validate → export. Always preserve raw data and document every change.

14. Next Chapter Recommendation

In Chapter 2: Understanding Dirty Data, we catalog every type of data problem with real examples from customer, sales, and healthcare datasets.
