CHAPTER 14 Beginner

Handling Missing Data

Updated: May 18, 2026

1. Chapter Introduction

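Real datasets almost always contain missing values, which pandas represents as NaN (or None in object columns). How you handle them directly affects your results: dropping rows shrinks the sample, and a poorly chosen fill value can bias statistics. This chapter covers the detection, removal, imputation, and interpolation strategies used by professional data scientists.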
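
Before detecting anything, it helps to see why missing values need dedicated functions. The short sketch below (the variable names are illustrative, not from the chapter) shows that NaN never compares equal to itself, so an equality test cannot find it, while `isna()` catches both NaN and None.

```python
import numpy as np
import pandas as pd

# NaN is a float that never compares equal to anything, including itself,
# so `value == np.nan` cannot be used to find missing values.
nan_equals_itself = (np.nan == np.nan)

# isna() (an alias of isnull()) flags both NaN and None as missing.
values = pd.Series([1.0, np.nan, None])
missing_mask = values.isna()

print(nan_equals_itself)        # False
print(missing_mask.tolist())    # [False, True, True]
```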

2. Detecting Missing Data

python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name':   ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age':    [25, np.nan, 35, np.nan, 28],
    'Salary': [55000, 72000, np.nan, 88000, 61000],
    'Dept':   ['Eng', 'Mkt', 'Eng', None, 'HR'],
    'Rating': [4.5, np.nan, 4.9, 3.5, np.nan]
})

# Detect missing values
print(df.isnull())          # Boolean DataFrame
print(df.isnull().sum())    # Count per column
print(df.isnull().sum().sum())  # Total missing: 7
print(df.notnull().sum())   # Count of non-null per column

# Percentage missing
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)
print("% Missing:\n", missing_pct)

# Rows with any missing
print(df[df.isnull().any(axis=1)])
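
The count and percentage checks above are often combined into one reusable summary. The helper below is a minimal sketch of that idea; `missing_report` is a hypothetical name, not a pandas function, and the sample frame repeats the chapter's data so the block runs on its own.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and percentage per column, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        'n_missing': counts,
        'pct_missing': (counts / len(frame) * 100).round(1),
    })
    return report.sort_values('n_missing', ascending=False)

df = pd.DataFrame({
    'Name':   ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age':    [25, np.nan, 35, np.nan, 28],
    'Salary': [55000, 72000, np.nan, 88000, 61000],
    'Dept':   ['Eng', 'Mkt', 'Eng', None, 'HR'],
    'Rating': [4.5, np.nan, 4.9, 3.5, np.nan]
})

report = missing_report(df)
print(report)
```

On this data, Age and Rating each show 2 missing values (40%), and the other three columns show 1 each (20%).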

3. Dropping Missing Data

python
# Drop rows with ANY missing value
df_no_nulls = df.dropna()
print(f"After dropna(): {len(df_no_nulls)} rows")

# Drop rows only if ALL values are missing
df_all_null = df.dropna(how='all')

# Drop based on minimum non-null count
df_thresh = df.dropna(thresh=4)  # Keep rows with at least 4 non-null

# Drop based on specific columns
df_key = df.dropna(subset=['Name', 'Salary'])  # Drop rows where Name or Salary is null

# Drop columns with more than 30% missing
# isnull().mean() gives the fraction of missing values in each column
df_col_drop = df.loc[:, df.isnull().mean() <= 0.30]
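
To see how much data each option costs, the sketch below applies the same `dropna()` calls to the chapter's sample frame and compares the surviving row counts. The `row_counts` dictionary is only an illustrative way to collect the results.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name':   ['Alice', 'Bob', None, 'David', 'Eve'],
    'Age':    [25, np.nan, 35, np.nan, 28],
    'Salary': [55000, 72000, np.nan, 88000, 61000],
    'Dept':   ['Eng', 'Mkt', 'Eng', None, 'HR'],
    'Rating': [4.5, np.nan, 4.9, 3.5, np.nan]
})

# Number of rows that survive each dropna() variant
row_counts = {
    'original': len(df),
    'dropna()': len(df.dropna()),
    "how='all'": len(df.dropna(how='all')),
    'thresh=4': len(df.dropna(thresh=4)),
    "subset=['Name', 'Salary']": len(df.dropna(subset=['Name', 'Salary'])),
}

for option, n_rows in row_counts.items():
    print(f"{option:<28} {n_rows} rows")
```

Here the default `dropna()` keeps only 1 of 5 rows, while `thresh=4` keeps 2 and the `subset` version keeps 4, which is why the strictest option is rarely the right default on sparse data.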

4. Filling Missing Data (Imputation)

python
# Each example writes to a new variable so that df keeps its missing values
# and every strategy starts from the same data.

# Fill with a constant
rating_zero = df['Rating'].fillna(0)
dept_unknown = df['Dept'].fillna('Unknown')

# Fill with statistical values
age_mean = df['Age'].fillna(df['Age'].mean())
salary_median = df['Salary'].fillna(df['Salary'].median())

# Fill with mode (most common value) for categorical
dept_mode = df['Dept'].fillna(df['Dept'].mode()[0])

# Forward fill (ffill): use previous row's value
df_ffill = df.ffill()  # fillna(method='ffill') is deprecated in pandas 2.x

# Backward fill (bfill): use next row's value
df_bfill = df.bfill()  # fillna(method='bfill') is deprecated in pandas 2.x

# Fill different columns differently
df_filled = df.fillna({
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
    'Dept': 'Unknown',
    'Rating': df['Rating'].mean()
})
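
Forward and backward fill behave differently at the edges of the data, which is easy to miss on a full DataFrame. The sketch below uses a small made-up series (`readings` is illustrative) to show which gaps each method can and cannot fill, and how the `limit` argument caps how far a value is carried.

```python
import numpy as np
import pandas as pd

readings = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = readings.ffill()         # leading NaN has no earlier value, so it stays NaN
backward = readings.bfill()        # trailing NaN has no later value, so it stays NaN
capped = readings.ffill(limit=1)   # carry each value forward by at most one row

print(pd.DataFrame({
    'original': readings,
    'ffill': forward,
    'bfill': backward,
    'ffill(limit=1)': capped,
}))
```

If gaps remain after a directional fill, a common follow-up is to chain the two (`readings.ffill().bfill()`), at the cost of filling the first rows with values from the future.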

5. Interpolation (for Time Series)

python
# Interpolation: estimate missing values from the surrounding values
ts = pd.Series([10, np.nan, np.nan, 40, 50, np.nan, 70])

print(ts.interpolate(method='linear'))
# [10, 20, 30, 40, 50, 60, 70]: gaps are filled linearly

# Polynomial interpolation (requires SciPy to be installed)
print(ts.interpolate(method='polynomial', order=2))

# limit: fill at most N consecutive NaNs in each gap
print(ts.interpolate(limit=1))
# [10, 20, NaN, 40, 50, 60, 70]: the second NaN of the first gap stays missing

# In DataFrames
df_ts = pd.DataFrame({'Price': [100, np.nan, np.nan, 130, np.nan, 160]})
df_ts['Price_filled'] = df_ts['Price'].interpolate()
print(df_ts)
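
The examples above treat rows as evenly spaced. When a series has a DatetimeIndex with irregular gaps, `method='time'` weights the estimate by elapsed time instead. The sketch below uses made-up dates and prices to contrast the two methods.

```python
import numpy as np
import pandas as pd

# Irregular spacing: one day between the first two points, two days before the last
dates = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04'])
prices = pd.Series([10.0, np.nan, 40.0], index=dates)

by_position = prices.interpolate(method='linear')  # ignores the index, assumes equal spacing
by_time = prices.interpolate(method='time')        # weights by time elapsed between points

print(by_position.iloc[1])  # 25.0 (midpoint of 10 and 40)
print(by_time.iloc[1])      # 20.0 (one third of the way from 10 to 40)
```

The missing point sits one day into a three-day interval, so the time-aware estimate is 10 + (40 - 10) / 3 = 20 rather than the positional midpoint of 25.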

6. Missing Data Strategies by Data Type

text
Data Type              | Best Strategy
───────────────────────────────────────────────────────
Numeric (continuous)   | Median (outlier-robust) or mean
Categorical            | Mode or 'Unknown' category
Time series            | Interpolation or ffill
Boolean                | False or mode
Text/String            | 'Unknown', 'N/A', or drop
Key/ID columns         | Always DROP; never impute IDs
High % missing (>50%)  | Consider dropping entire column
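
The "outlier-robust" note on the median can be checked with a small experiment. In the sketch below, the salary figures are made up and include one extreme value, so the mean and median give very different fill values.

```python
import numpy as np
import pandas as pd

# One extreme salary among otherwise similar values
salaries = pd.Series([50_000, 52_000, 54_000, np.nan, 1_000_000])

mean_fill = salaries.mean()      # pulled upward by the outlier
median_fill = salaries.median()  # stays close to the typical values

print(f"mean:   {mean_fill:,.0f}")    # mean:   289,000
print(f"median: {median_fill:,.0f}")  # median: 53,000

filled = salaries.fillna(median_fill)
```

Filling with the mean here would assign a salary of 289,000, which is far above every typical value in the column. The median fill of 53,000 is much closer to what the missing row plausibly contained.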

7. Common Mistakes

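  • Imputing before the train/test split: Compute imputation statistics (mean, median) on the training data only, then apply them to the test data. Computing them on the full dataset lets test-set information leak into training (data leakage).
  • Filling IDs or keys: Never fill missing customer IDs or primary keys, because an imputed ID points to the wrong record or to no record at all. Drop those rows instead.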
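
The first mistake above can be avoided with a simple pattern: split first, compute the statistic on the training part, and reuse that value for the test part. The sketch below assumes a made-up `Age` column and a plain positional split for illustration; real projects usually split randomly.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Age': [25, np.nan, 35, 40, np.nan, 28, 31, np.nan]})

# Split BEFORE computing any statistic (first 6 rows train, last 2 rows test)
train = data.iloc[:6].copy()
test = data.iloc[6:].copy()

# The median comes from the training rows only
train_median = train['Age'].median()

# Apply the same training-derived value to both parts
train['Age'] = train['Age'].fillna(train_median)
test['Age'] = test['Age'].fillna(train_median)

print(train_median)           # 31.5
print(test['Age'].tolist())   # [31.0, 31.5]
```

The test rows never influence `train_median`, so any evaluation on `test` reflects what the model would see on genuinely new data.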

8. MCQs

Question 1

df.isnull().sum() returns?

Question 2

df.dropna(how='all') drops rows where?

Question 3

df.dropna(thresh=3) keeps rows with?

Question 4

fillna(method='ffill') fills with?

Question 5

Best fill for outlier-prone numeric data?

Question 6

interpolate(method='linear') fills gaps using?

Question 7

df.notnull().sum() counts?

Question 8

Missing >50% in a column — best action?

Question 9

df.dropna(subset=['Salary']) drops rows where?

Question 10

Data leakage in imputation happens when?

9. Interview Questions

  • Q: What is the difference between dropna() and fillna()?
  • Q: What is data leakage in the context of missing value imputation?

10. Summary

Missing data handling: detect with isnull(), remove with dropna(), impute with fillna() using mean/median/mode/constant, or interpolate for time series. Never impute key/ID columns. Always split data before computing imputation statistics to avoid leakage.

11. Next Chapter Recommendation

In Chapter 15: Data Transformation and Manipulation, we sort, apply custom functions, map values, and build transformation pipelines.
