Skip to main content
Data Cleaning
CHAPTER 06 Beginner

Handling Missing Values

Updated: May 18, 2026
5 min read

# CHAPTER 6

Handling Missing Values

1. Chapter Introduction

Missing data (represented as NaN, NULL, or NA) is the most frequent challenge in data cleaning. Sensors fail, users skip survey questions, and database joins leave gaps. How you handle these gaps determines the validity of your entire analysis. This chapter covers detection, removal, and imputation strategies using Pandas.

2. Detecting Missing Values

python
12345678910111213141516171819202122232425
import pandas as pd
import numpy as np

# Sample dataset with missing values
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', np.nan, 'David', 'Eve'],
    'age': [25, np.nan, 30, 22, np.nan],
    'salary': [50000, 60000, 75000, np.nan, 45000]
})

print("=== RAW DATA ===")
print(df)

# Check for missing values (returns boolean dataframe)
print("\n=== ISNULL() MATRIX ===")
print(df.isnull())

# Count missing values per column
print("\n=== MISSING COUNTS PER COLUMN ===")
print(df.isnull().sum())

# Calculate percentage of missing data
print("\n=== MISSING PERCENTAGE ===")
print((df.isnull().mean() * 100).round(2), "%")

3. Strategy 1: Dropping Missing Data (dropna)

Dropping data is the easiest but most destructive approach.

python
12345678910111213
# Drop ANY row that has at least one NaN
df_drop_any = df.dropna()
# Warning: If you have 50 columns, almost every row might have at least one NaN!

# Drop rows only if ALL columns are NaN
df_drop_all = df.dropna(how='all')

# Drop rows if they are missing data in specific crucial columns
df_drop_subset = df.dropna(subset=['id', 'name'])

# Drop columns that have too many missing values (e.g., > 50% missing)
threshold = len(df) * 0.5
df_drop_cols = df.dropna(axis=1, thresh=threshold)

4. Strategy 2: Filling Missing Data (fillna) / Imputation

Imputation means replacing missing values with estimated values.

python
123456789101112131415
# 1. Fill with a constant value
df['name_filled'] = df['name'].fillna('Unknown')
df['salary_zero'] = df['salary'].fillna(0)

# 2. Fill with Statistical Measures (Mean/Median)
# Mean is sensitive to outliers. Median is usually safer for skewed data like salary/age.
median_age = df['age'].median()
df['age_imputed'] = df['age'].fillna(median_age)

# 3. Forward Fill and Backward Fill (Great for Time Series)
ts_df = pd.DataFrame({'date': ['Jan', 'Feb', 'Mar', 'Apr'], 'stock_price': [100, np.nan, np.nan, 105]})
# Forward fill: propogates the last valid observation forward
ts_df['ffill'] = ts_df['stock_price'].ffill() 
# Backward fill: uses next valid observation to fill gap
ts_df['bfill'] = ts_df['stock_price'].bfill()

5. Advanced Imputation: GroupBy

Sometimes filling with a global average isn't accurate. For example, replacing a missing salary with the company average is bad if the employee is an intern. It's better to fill based on their department.

python
1234567891011
hr_df = pd.DataFrame({
    'dept': ['IT', 'IT', 'IT', 'HR', 'HR', 'HR'],
    'salary': [80000, np.nan, 85000, 50000, 52000, np.nan]
})

# Fill NaN salary with the mean salary of THAT SPECIFIC department
hr_df['salary_imputed'] = hr_df.groupby('dept')['salary'].transform(lambda x: x.fillna(x.mean()))

print("\n=== GROUPBY IMPUTATION ===")
print(hr_df)
# IT gets 82500, HR gets 51000

6. Mini Project: Customer Dataset Repair

python
1234567891011121314151617181920
# Dirty dataset
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'signup_date': ['2023-01-01', '2023-01-02', np.nan, '2023-01-04', '2023-01-05'],
    'country': ['USA', 'UK', np.nan, 'USA', np.nan],
    'lifetime_value': [500, np.nan, 1200, 300, np.nan]
})

# 1. Date missing? Forward fill (assuming sequential signups)
customers['signup_date'] = customers['signup_date'].ffill()

# 2. Country missing? Fill categorical with Mode (most frequent) or 'Unknown'
mode_country = customers['country'].mode()[0]
customers['country'] = customers['country'].fillna(mode_country)

# 3. LTV missing? Fill numeric with Median
customers['lifetime_value'] = customers['lifetime_value'].fillna(customers['lifetime_value'].median())

print("\n=== REPAIRED CUSTOMERS ===")
print(customers)

7. Common Mistakes

  • Filling NaNs before converting data types: If a column contains "Unknown" (string) and NaNs, converting it to numeric will fail. Handle the strings first, convert to numeric (which turns errors to NaNs), THEN fill the NaNs.
  • Using Mean for Skewed Data: Filling missing house prices or salaries with the mean will pull the data towards extreme outliers. Always use Median for right-skewed data.

8. MCQs

Question 1

Which Pandas function counts missing values per column?

Question 2

What does df.dropna() do by default?

Question 3

How do you drop rows where ONLY the 'email' column is missing?

Question 4

Replacing missing values with estimated values is called?

Question 5

Why is Median often preferred over Mean for imputing numeric data?

Question 6

What is Forward Fill (ffill) best used for?

Question 7

How do you fill NaNs in a specific column with the text "Unknown"?

Question 8

What does df.dropna(how='all') do?

Question 9

To impute missing categorical data (like Country), what statistical measure is most appropriate?

Question 10

Which is a consequence of simply dropping all rows with any NaNs?

9. Interview Questions

  • Q: Explain the difference between MCAR, MAR, and MNAR. How do they dictate your missing value strategy?
  • Q: You have a dataset where 70% of the 'secondary_phone' column is missing. What do you do?

10. Summary

Handling missing values is about making informed choices. isnull().sum() tells you the scope of the problem. Use dropna(subset=[...]) to remove records missing critical identifiers. Use fillna() to impute values: constant text for categories, median for skewed numbers, and ffill() for time series. Advanced imputation uses groupby to fill values based on related categories.

11. Next Chapter Recommendation

In Chapter 7: Removing Duplicate Data, we will learn how to identify and clean identical or near-identical records that skew analytics and frustrate customers.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·