Handling Missing Values
# CHAPTER 6
Handling Missing Values
1. Chapter Introduction
Missing data (represented asNaN, NULL, or NA) is the most frequent challenge in data cleaning. Sensors fail, users skip survey questions, and database joins leave gaps. How you handle these gaps determines the validity of your entire analysis. This chapter covers detection, removal, and imputation strategies using Pandas.
2. Detecting Missing Values
3. Strategy 1: Dropping Missing Data (dropna)
Dropping data is the easiest but most destructive approach.
4. Strategy 2: Filling Missing Data (fillna) / Imputation
Imputation means replacing missing values with estimated values.
5. Advanced Imputation: GroupBy
Sometimes filling with a global average isn't accurate. For example, replacing a missing salary with the company average is bad if the employee is an intern. It's better to fill based on their department.
6. Mini Project: Customer Dataset Repair
7. Common Mistakes
- Filling NaNs before converting data types: If a column contains "Unknown" (string) and NaNs, converting it to numeric will fail. Handle the strings first, convert to numeric (which turns errors to NaNs), THEN fill the NaNs.
- Using Mean for Skewed Data: Filling missing house prices or salaries with the mean will pull the data towards extreme outliers. Always use Median for right-skewed data.
8. MCQs
Which Pandas function counts missing values per column?
What does df.dropna() do by default?
How do you drop rows where ONLY the 'email' column is missing?
Replacing missing values with estimated values is called?
Why is Median often preferred over Mean for imputing numeric data?
What is Forward Fill (ffill) best used for?
How do you fill NaNs in a specific column with the text "Unknown"?
What does df.dropna(how='all') do?
To impute missing categorical data (like Country), what statistical measure is most appropriate?
Which is a consequence of simply dropping all rows with any NaNs?
9. Interview Questions
- Q: Explain the difference between MCAR, MAR, and MNAR. How do they dictate your missing value strategy?
- Q: You have a dataset where 70% of the 'secondary_phone' column is missing. What do you do?
10. Summary
Handling missing values is about making informed choices.isnull().sum() tells you the scope of the problem. Use dropna(subset=[...]) to remove records missing critical identifiers. Use fillna() to impute values: constant text for categories, median for skewed numbers, and ffill() for time series. Advanced imputation uses groupby to fill values based on related categories.