CHAPTER 14 Beginner

Handling Missing Data

Updated: May 18, 2026

1. Chapter Introduction

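Missing data is inevitable in real datasets. pandas represents it as NaN (or as None or NaT, depending on the column type), and how you handle it directly affects the quality of any analysis built on top of it. This chapter covers the detection, removal, imputation, and interpolation strategies that data scientists use in practice.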

2. Detecting Missing Data

python

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name':   ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age':    [25, np.nan, 35, np.nan, 28],
    'Salary': [55000, 72000, np.nan, 88000, 61000],
    'Dept':   ['Eng', 'Mkt', 'Eng', None, 'HR'],
    'Rating': [4.5, np.nan, 4.9, 3.5, np.nan]
})

# Detect missing values
print(df.isnull())          # Boolean DataFrame
print(df.isnull().sum())    # Count per column
print(df.isnull().sum().sum())  # Total missing: 7
print(df.notnull().sum())   # Count of non-null per column

# Percentage missing
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)
print("% Missing:\n", missing_pct)

# Rows with any missing
print(df[df.isnull().any(axis=1)])
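
The detection calls above can be combined into a single summary table. The following is a minimal sketch, not a standard pandas function: the helper name `missing_report` and the small `sample` frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column missing counts and percentages."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps, worst columns first
    report = report[report["n_missing"] > 0]
    return report.sort_values("n_missing", ascending=False)

# Illustrative data (not from the chapter's dataset)
sample = pd.DataFrame({
    "Age": [25, np.nan, 35, np.nan, 28],
    "Salary": [55000, 72000, np.nan, 88000, 61000],
    "Dept": ["Eng", "Mkt", "Eng", "HR", "HR"],
})

report = missing_report(sample)
print(report)
```

A table like this makes it easier to decide, column by column, whether to drop or impute.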

3. Dropping Missing Data

python

# Drop rows with ANY missing value
df_no_nulls = df.dropna()
print(f"After dropna(): {len(df_no_nulls)} rows")

# Drop rows only if ALL values are missing
df_all_null = df.dropna(how='all')

# Drop based on minimum non-null count
df_thresh = df.dropna(thresh=4)  # Keep rows with at least 4 non-null

# Drop based on specific columns
df_key = df.dropna(subset=['Name', 'Salary'])  # Drop rows where Name or Salary is null

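# Drop columns with more than 30% missing
# (round the threshold up: int() would truncate 3.5 to 3 and keep columns that are 40% missing)
threshold = len(df) * 0.7
df_col_drop = df.dropna(axis=1, thresh=int(np.ceil(threshold)))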
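
To see what each `dropna` variant keeps, it can help to run them on the chapter's sample data and compare the results. The sketch below rebuilds that DataFrame so it runs on its own. The helper `drop_sparse_columns` is a hypothetical name, and it selects columns by their missing fraction directly, which is one possible alternative to computing a `thresh` value by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name':   ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age':    [25, np.nan, 35, np.nan, 28],
    'Salary': [55000, 72000, np.nan, 88000, 61000],
    'Dept':   ['Eng', 'Mkt', 'Eng', None, 'HR'],
    'Rating': [4.5, np.nan, 4.9, 3.5, np.nan]
})

rows_any = len(df.dropna())                               # only fully complete rows survive
rows_thresh = len(df.dropna(thresh=4))                    # rows with at least 4 non-null values
rows_subset = len(df.dropna(subset=['Name', 'Salary']))   # only Name/Salary are checked

def drop_sparse_columns(frame, max_missing_frac=0.3):
    """Hypothetical helper: drop columns whose missing share exceeds the limit."""
    missing_frac = frame.isnull().mean()   # fraction of missing values per column
    return frame.loc[:, missing_frac <= max_missing_frac]

kept = drop_sparse_columns(df)
print(rows_any, rows_thresh, rows_subset)   # 1 2 4
print(list(kept.columns))                   # ['Name', 'Salary', 'Dept']
```

Only one of the five rows is fully complete, which shows how aggressive a plain `dropna()` can be on a small dataset.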

4. Filling Missing Data (Imputation)

python

# Fill with a constant
# (results go to new variables so df keeps its gaps for the examples that follow)
rating_zero = df['Rating'].fillna(0)
dept_unknown = df['Dept'].fillna('Unknown')

# Fill with statistical values
age_mean = df['Age'].fillna(df['Age'].mean())
salary_median = df['Salary'].fillna(df['Salary'].median())

# Fill with mode (most common value) for categorical
dept_mode = df['Dept'].fillna(df['Dept'].mode()[0])

# Forward fill (ffill): use previous row's value
# fillna(method='ffill') is deprecated since pandas 2.1; call ffill() directly
df_ffill = df.ffill()

# Backward fill (bfill): use next row's value
df_bfill = df.bfill()

# Fill different columns differently
df_filled = df.fillna({
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
    'Dept': 'Unknown',
    'Rating': df['Rating'].mean()
})
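
Forward and backward fill behave differently at the edges of a column, which is easy to miss. The short sketch below uses a made-up Series (an assumption for illustration) to show where each method leaves a gap.

```python
import numpy as np
import pandas as pd

# Illustrative series with a leading gap, a two-value gap, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # leading NaN has no previous value, so it stays NaN
backward = s.bfill()         # trailing NaN has no next value, so it stays NaN
capped = s.ffill(limit=1)    # fill at most one consecutive NaN per gap
both = s.ffill().bfill()     # forward fill, then backfill whatever is left

print(forward.tolist())   # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 1.0, 4.0, 4.0, 4.0, nan]
print(capped.tolist())    # [nan, 1.0, 1.0, nan, 4.0, 4.0]
print(both.tolist())      # [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]
```

Chaining the two fills removes every gap, but on ordered data the backfill step copies later values into earlier rows, so it is worth checking that this is acceptable for the analysis.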

5. Interpolation (for Time Series)

python

# Interpolation — estimate missing from surrounding values
ts = pd.Series([10, np.nan, np.nan, 40, 50, np.nan, 70])

print(ts.interpolate(method='linear'))
# [10, 20, 30, 40, 50, 60, 70] — fills gaps linearly

# Polynomial interpolation (requires SciPy to be installed)
print(ts.interpolate(method='polynomial', order=2))

# Limit: fill only N consecutive NaNs
print(ts.interpolate(limit=1))  # Fill only 1 consecutive NaN

# In DataFrames
df_ts = pd.DataFrame({'Price': [100, np.nan, np.nan, 130, np.nan, 160]})
df_ts['Price_filled'] = df_ts['Price'].interpolate()
print(df_ts)
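
The examples above use a default integer index, where `method='linear'` treats rows as equally spaced. When the observations carry unevenly spaced timestamps, `method='time'` weights the estimate by elapsed time instead. The sketch below uses made-up dates and prices (assumptions for illustration) to show the difference.

```python
import numpy as np
import pandas as pd

# Illustrative data: note the missing calendar day between Jan 2 and Jan 4
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"])
prices = pd.Series([10.0, np.nan, 40.0], index=idx)

by_position = prices.interpolate(method="linear")  # ignores the index: midpoint of 10 and 40
by_time = prices.interpolate(method="time")        # Jan 2 is one third of the way to Jan 4

print(by_position.iloc[1])  # 25.0
print(by_time.iloc[1])      # approximately 20.0
```

For irregularly sampled series the time-weighted estimate is usually the more defensible choice, since it respects how far apart the observations actually are.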

6. Missing Data Strategies by Data Type

text

Data Type             | Best Strategy
───────────────────────────────────────────────────────────
Numeric (continuous)  | Median (outlier-robust) or mean
Categorical           | Mode or 'Unknown' category
Time series           | Interpolation or ffill
Boolean               | False or mode
Text/String           | 'Unknown', 'N/A', or drop
Key/ID columns        | Always drop the rows; never impute IDs
High % missing (>50%) | Consider dropping the entire column
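
Part of the table above can be turned into a small routine that picks a fill strategy from each column's type. This is a simplified sketch under stated assumptions: the function name `impute_by_dtype`, its `id_cols` parameter, and the sample columns are all hypothetical, and it only covers the numeric, text, and key/ID rows of the table.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def impute_by_dtype(frame, id_cols=()):
    """Hypothetical helper: drop rows with missing IDs, then fill by column type."""
    # Key/ID columns: drop the rows, never impute
    out = frame.dropna(subset=list(id_cols)).copy() if id_cols else frame.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())  # outlier-robust default
        else:
            out[col] = out[col].fillna('Unknown')          # explicit placeholder category
    return out

# Illustrative data (not from the chapter's dataset)
customers = pd.DataFrame({
    'customer_id': [101, 102, np.nan, 104],
    'spend': [10.0, np.nan, 30.0, 100.0],
    'segment': ['a', None, 'b', 'a'],
})

result = impute_by_dtype(customers, id_cols=['customer_id'])
print(result)
```

The row with the missing `customer_id` is removed before any statistics are computed, so the median for `spend` is taken over the remaining rows only.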

7. Common Mistakes

Imputing before train/test split: Calculate imputation statistics (mean/median) on the training data only, then apply those same values to the test data. Computing them on the full dataset lets information from the test set influence training, which is data leakage.
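
The leakage mistake can be illustrated with plain pandas. In this minimal sketch the `income` column, its values, and the positional split are assumptions chosen to make the effect visible; a real project would typically use a proper splitting routine.

```python
import numpy as np
import pandas as pd

# Illustrative data: the outlier (1000.0) falls in the test portion
data = pd.DataFrame({"income": [30.0, 40.0, np.nan, 50.0, 1000.0, np.nan]})

# Split first (here: a simple positional split)
train = data.iloc[:4].copy()
test = data.iloc[4:].copy()

# Correct: learn the statistic from the training rows only
train_median = train["income"].median()
train["income"] = train["income"].fillna(train_median)
test["income"] = test["income"].fillna(train_median)

# Leaky: the statistic also sees the test rows
leaky_median = data["income"].median()

print(train_median)  # 40.0
print(leaky_median)  # 45.0
```

The two medians differ because the leaky version has already seen the test-set outlier, which is exactly the information a model should not have access to during training.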

Filling IDs or keys: Never fill missing customer IDs or primary keys — drop those rows instead.

8. MCQs

Question 1

`df.isnull().sum()` returns?

Question 2

`df.dropna(how='all')` drops rows where?

Question 3

`df.dropna(thresh=3)` keeps rows with?

Question 4

`fillna(method='ffill')` fills with?

Question 5

Best fill for outlier-prone numeric data?

Question 6

`interpolate(method='linear')` fills gaps using?

Question 7

`df.notnull().sum()` counts?

Question 8

Missing >50% in a column — best action?

Question 9

`df.dropna(subset=['Salary'])` drops rows where?

Question 10

Data leakage in imputation happens when?

9. Interview Questions

Q: What is the difference between dropna() and fillna()?

Q: What is data leakage in the context of missing value imputation?

10. Summary

Missing data handling: detect with isnull(), remove with dropna(), impute with fillna() using mean/median/mode/constant, or interpolate for time series. Never impute key/ID columns. Always split data before computing imputation statistics to avoid leakage.

11. Next Chapter Recommendation

In Chapter 15: Data Transformation and Manipulation, we sort, apply custom functions, map values, and build transformation pipelines.
