CHAPTER 14
Beginner
Handling Missing Data
Updated: May 18, 2026
1. Chapter Introduction
Missing data (NaN) is inevitable in real datasets, and how you handle it determines the quality of your analysis. This chapter covers the detection, removal, imputation, and interpolation strategies used by professional data scientists.
2. Detecting Missing Data
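A minimal sketch of detection with pandas. The sample DataFrame and its column names are made up for illustration; the calls isnull(), notnull(), sum(), mean(), and any() are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", None, "Dana"],
    "Age": [29, np.nan, 41, np.nan],
    "Salary": [52000.0, 61000.0, np.nan, 58000.0],
})

missing_mask = df.isnull()                # True wherever a value is missing
missing_per_column = df.isnull().sum()    # count of missing values per column
present_per_column = df.notnull().sum()   # count of non-missing values per column
missing_share = df.isnull().mean()        # fraction of missing values per column

# Rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]

print(missing_per_column)
```

Checking the per-column share first is usually a sensible habit, since it tells you whether a column is worth repairing at all.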
3. Dropping Missing Data
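The dropna() variants that appear in this chapter can be sketched as follows. The data is hypothetical; how, thresh, and subset are real dropna() parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; row 2 is entirely empty.
df = pd.DataFrame({
    "Name": ["Asha", "Ben", None, "Dana"],
    "Age": [29, np.nan, np.nan, 35],
    "City": ["Pune", None, None, "Oslo"],
    "Salary": [52000.0, 61000.0, np.nan, np.nan],
})

complete_rows = df.dropna()                   # drop rows with ANY missing value
not_all_missing = df.dropna(how="all")        # drop rows where EVERY value is missing
at_least_three = df.dropna(thresh=3)          # keep rows with at least 3 non-missing values
has_salary = df.dropna(subset=["Salary"])     # drop rows where Salary is missing

print(len(complete_rows), len(not_all_missing), len(at_least_three), len(has_salary))
```

None of these calls modifies df; each returns a new DataFrame, so the original stays available for comparison.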
4. Filling Missing Data (Imputation)
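A short sketch of the common fill strategies: mean, median, mode, constant, and forward fill. The DataFrame is hypothetical. Forward fill is written here as ffill(), because the fillna(method='ffill') spelling is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; Salary contains one extreme value (500000).
df = pd.DataFrame({
    "Age": [25.0, np.nan, 35.0, 40.0],
    "Salary": [50000.0, 52000.0, np.nan, 500000.0],
    "City": ["Pune", None, "Pune", "Oslo"],
    "Notes": [None, "ok", None, "ok"],
})

filled = df.copy()
filled["Age"] = df["Age"].fillna(df["Age"].mean())              # mean of the observed ages
filled["Salary"] = df["Salary"].fillna(df["Salary"].median())   # median resists the outlier
filled["City"] = df["City"].fillna(df["City"].mode()[0])        # most frequent category
filled["Notes"] = df["Notes"].fillna("unknown")                 # fixed constant

# Forward fill: carry the last observed value forward
forward_filled = df["Age"].ffill()

print(filled)
```

The median is used for Salary because a single extreme value would pull the mean far away from typical salaries.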
5. Interpolation (for Time Series)
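A minimal sketch of linear interpolation on a time series. The daily temperature readings are invented for illustration; interpolate(method="linear") is the standard pandas call and treats the observations as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with a two-day gap.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
temps = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=dates)

# Fill each gap along a straight line between its neighbouring known values
linear = temps.interpolate(method="linear")

print(linear)
```

Here the two missing days sit between 10.0 and 16.0, so they are filled with evenly spaced steps along that line.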
6. Missing Data Strategies by Data Type
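As a rough sketch, the strategies named elsewhere in this chapter could be mapped to column types like this. The helper impute_by_type, its parameters, and the 50% cutoff are assumptions made for illustration, not a fixed rule: rows with missing keys are dropped, very sparse columns are dropped, numeric columns get the median, and everything else gets the mode.

```python
import numpy as np
import pandas as pd

def impute_by_type(df, key_columns, max_missing_share=0.5):
    """Hypothetical helper: choose a missing-data strategy per column type."""
    # Keys and IDs are never filled; rows missing them are dropped.
    out = df.dropna(subset=key_columns).copy()
    # Assumed cutoff: drop columns that are mostly empty.
    too_sparse = out.columns[out.isnull().mean() > max_missing_share]
    out = out.drop(columns=too_sparse)
    for col in out.columns:
        if col in key_columns:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric: the median is robust to outliers
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical or text: most frequent value, if one exists
            modes = out[col].mode()
            if not modes.empty:
                out[col] = out[col].fillna(modes[0])
    return out

# Hypothetical customer data
df = pd.DataFrame({
    "CustomerID": [101, 102, None, 104],
    "Age": [30.0, np.nan, 45.0, 50.0],
    "City": ["Pune", "Pune", "Oslo", None],
    "Fax": [None, None, None, "555-0100"],
})

clean = impute_by_type(df, key_columns=["CustomerID"])
print(clean)
```

Time-series columns are left out of this helper on purpose; for those, interpolation is usually the better fit than a single fill value.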
7. Common Mistakes
- Imputing before train/test split: Calculate imputation statistics (mean or median) on the training data only, then apply those same values to the test data. Imputing on the full dataset lets test-set information leak into training (data leakage).
- Filling IDs or keys: Never fill missing customer IDs or primary keys — drop those rows instead.
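The first mistake above can be sketched as follows. The Salary values and the simple positional split are assumptions for illustration; any train/test splitter would serve the same purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the last row is an extreme value that lands in the test set.
df = pd.DataFrame({"Salary": [40.0, np.nan, 50.0, 60.0, np.nan, 1000.0]})

# Split first (a plain positional split, chosen only to keep the sketch short).
train = df.iloc[:4].copy()
test = df.iloc[4:].copy()

# Learn the statistic from the training rows only...
train_median = train["Salary"].median()

# ...then apply that same value to both sets.
train["Salary"] = train["Salary"].fillna(train_median)
test["Salary"] = test["Salary"].fillna(train_median)

# For contrast: a median computed on the full data also sees the test rows.
leaky_median = df["Salary"].median()

print(train_median, leaky_median)
```

The two medians differ because the full-data version has already seen the extreme test value, which is exactly the information a model should not have during training.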
8. MCQs
Question 1
df.isnull().sum() returns?
Question 2
df.dropna(how='all') drops rows where?
Question 3
df.dropna(thresh=3) keeps rows with?
Question 4
fillna(method='ffill') fills with?
Question 5
Best fill for outlier-prone numeric data?
Question 6
interpolate(method='linear') fills gaps using?
Question 7
df.notnull().sum() counts?
Question 8
Missing >50% in a column — best action?
Question 9
df.dropna(subset=['Salary']) drops rows where?
Question 10
Data leakage in imputation happens when?
9. Interview Questions
- Q: What is the difference between dropna() and fillna()?
- Q: What is data leakage in the context of missing value imputation?
10. Summary
Missing data handling comes down to four steps: detect with isnull(), remove with dropna(), impute with fillna() using the mean, median, mode, or a constant, and interpolate for time series. Never impute key or ID columns. Always split the data before computing imputation statistics to avoid leakage.