CHAPTER 02
Beginner
Understanding Dirty Data
Updated: May 18, 2026
1. Chapter Introduction
Before you can fix data problems, you need to recognize them. Dirty data comes in predictable patterns, and each pattern requires a different fix. This chapter catalogs the main dirty data types with examples so you can quickly diagnose data quality issues in any dataset.
2. Taxonomy of Dirty Data
3. Missing Data — Types and Patterns
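Missing data is usually grouped into three mechanisms: MCAR (Missing Completely At Random, where missingness is unrelated to any value), MAR (Missing At Random, where missingness depends on other observed columns), and MNAR (Missing Not At Random, where missingness depends on the unobserved value itself). The sketch below is a minimal illustration, assuming a small made-up DataFrame whose column names are hypothetical. It counts missing values, shows that pandas does not treat an empty string as missing, and runs a rough check of whether missingness in one column varies with another.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [34, np.nan, 29, np.nan, 51],
    "income": [52000, 61000, None, 48000, None],
    "notes": ["ok", "", None, "call back", "   "],
})

# NaN and None both count as missing; "" and whitespace do not.
missing_counts = df.isna().sum()

# Flag blank or whitespace-only strings as missing as well.
notes_missing = df["notes"].isna() | df["notes"].str.strip().eq("")

# Rough MAR check: compare the income-missing rate between rows
# where age is known and rows where age is itself missing.
rate_age_known = df.loc[df["age"].notna(), "income"].isna().mean()
rate_age_missing = df.loc[df["age"].isna(), "income"].isna().mean()

print(missing_counts)
print("notes missing (incl. blanks):", int(notes_missing.sum()))
print("income missing rate:", rate_age_known, "vs", rate_age_missing)
```

A large gap between the two rates hints that the data is not MCAR, but it cannot distinguish MAR from MNAR on its own; that usually requires knowledge of how the data was collected.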
4. Duplicate Data
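Duplicates come in two flavors: exact copies of a row, and near-duplicates that differ only in case, whitespace, or similar noise. The following is a sketch under assumptions: the DataFrame and its `name` and `email` columns are invented for illustration, and lowercasing plus trimming is just one simple way to build a comparison key.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "name": ["Ana Silva", "Ana Silva", "ana silva ", "Ben Okoro"],
    "email": ["ana@example.com", "ana@example.com",
              "ANA@Example.com", "ben@example.com"],
})

# Exact duplicates: every column matches an earlier row.
exact_dupes = df.duplicated()

# Duplicates judged on the email column only (still case-sensitive).
email_dupes = df.duplicated(subset=["email"])

# Near-duplicates: compare a normalized key instead of the raw value.
email_key = df["email"].str.strip().str.lower()
near_dupes = email_key.duplicated()

# Keep the first occurrence of each normalized email.
deduped = df.loc[~near_dupes]
print(deduped)
```

Comparing on a normalized key leaves the original values untouched, so you can still review which variant to keep before dropping anything.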
5. Invalid and Outlier Values
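Invalid values break a rule (a non-numeric salary, a hire date in the future), while outliers are valid in type but implausibly far from the rest. As a minimal sketch, assuming made-up `salary` and `hire_date` columns, the code below coerces unparseable numbers to NaN with `pd.to_numeric(..., errors="coerce")`, flags future dates, and applies the common 1.5 × IQR rule, which is one heuristic among several.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "salary": ["52000", "61000", "N/A", "48000", "9900000"],
    "hire_date": ["2019-03-01", "2021-07-15", "2099-01-01",
                  "2020-11-30", "2018-05-20"],
})

# Values that cannot be parsed as numbers become NaN instead of raising.
salary = pd.to_numeric(df["salary"], errors="coerce")

# Parse dates, then flag any hire date later than today.
hire_date = pd.to_datetime(df["hire_date"], errors="coerce")
future_hire = hire_date > pd.Timestamp.today()

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
outlier = (salary < q1 - 1.5 * iqr) | (salary > q3 + 1.5 * iqr)

print(df.assign(salary_num=salary, future_hire=future_hire, outlier=outlier))
```

Flagging rather than deleting keeps the decision about each suspicious value separate from its detection.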
6. Inconsistent Formatting
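Inconsistent formatting means the same real-world value is written several ways, such as "USA", "U.S.A", and "United States". One possible approach, sketched here with a hypothetical lookup table, is to normalize each value to a simple key and map that key to a canonical label. Anything the table does not cover is left for manual review rather than guessed.

```python
import pandas as pd

country = pd.Series(["USA", "U.S.A", "United States", " usa ", "Canada"])

# Hypothetical lookup table; extend it as new variants appear.
CANONICAL = {
    "usa": "United States",
    "united states": "United States",
    "canada": "Canada",
}

# Normalize: trim whitespace, lowercase, and drop periods.
key = country.str.strip().str.lower().str.replace(".", "", regex=False)

# Map to the canonical label; unknown keys become NaN.
standardized = key.map(CANONICAL)

# Original values that still need a mapping.
unmapped = country[standardized.isna()]

print(standardized.tolist())
```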
7. Corrupted Data
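Corrupted data has been damaged in storage or transfer, for example by an encoding mismatch, so the stored value is no longer what was originally entered. The sketch below assumes a made-up series of names and shows three simple checks: the Unicode replacement character, stray control characters, and a tentative repair of UTF-8 text that was mistakenly decoded as Latin-1. The helper `try_fix_mojibake` is hypothetical, and the round trip it uses is a heuristic that should be reviewed before being applied to real data.

```python
import pandas as pd

# Hypothetical sample: a clean name, UTF-8 misread as Latin-1,
# a replacement character, an embedded NUL byte, and plain ASCII.
names = pd.Series(["Jos\u00e9", "Jos\u00c3\u00a9", "Zo\ufffd", "Ann\x00a", "Lee"])

# U+FFFD marks bytes that could not be decoded; the original is lost.
has_replacement_char = names.str.contains("\ufffd", regex=False)

# Control characters (other than tab, newline, carriage return).
has_control_char = names.str.contains(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", regex=True)


def try_fix_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up when the round trip succeeds."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # leave the value unchanged if it does not fit the pattern


repaired = names.map(try_fix_mojibake)
print(repaired.tolist())
```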
8. Common Mistakes
- Treating all missing data the same: An empty cell might mean "zero," "unknown," "not applicable," or "data entry error," and each meaning requires a different fix. Always investigate why the data is missing.
- Confusing validation with cleaning: Validation tells you what is wrong; cleaning fixes it. They are separate steps, and combining them leads to poorly documented processes.
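The separation between validation and cleaning can be sketched as two functions: one that only reports problems and one that returns a fixed copy. The function names `validate` and `clean`, the `email` column, and the specific rules are all assumptions chosen for illustration.

```python
import pandas as pd


def validate(df: pd.DataFrame) -> dict:
    """Report problems without modifying the data."""
    return {
        "missing_email": int(df["email"].isna().sum()),
        "duplicate_email": int(df["email"].duplicated().sum()),
    }


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a fixed copy; the input DataFrame is left untouched."""
    out = df.dropna(subset=["email"])
    return out.drop_duplicates(subset=["email"]).reset_index(drop=True)


# Hypothetical sample data.
raw = pd.DataFrame(
    {"email": ["a@example.com", None, "a@example.com", "b@example.com"]}
)

before = validate(raw)      # what is wrong
cleaned = clean(raw)        # fix it
after = validate(cleaned)   # confirm the fix

print(before, after)
```

Running the same validation before and after cleaning gives a simple record of what was found and what was fixed.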
9. MCQs
Question 1
MCAR stands for?
Question 2
Near-duplicate detection handles?
Question 3
"USA", "U.S.A", "United States" is which dirty data type?
Question 4
pd.to_numeric(col, errors='coerce') converts unparseable values to?
Question 5
MNAR (Missing Not At Random) means?
Question 6
A future hire_date of 2099 is which issue?
Question 7
df.duplicated(subset=['email']) detects duplicates based on?
Question 8
Corrupted data differs from invalid data by?
Question 9
Schema violations include?
Question 10
How do an empty string "", None, and NaN differ in pandas?
10. Interview Questions
- Q: What is the difference between MCAR, MAR, and MNAR missing data?
- Q: How do you detect near-duplicate records that differ only by case?