# CHAPTER 8
Detecting and Handling Outliers
1. Chapter Introduction
An outlier is an observation that lies an abnormal distance from other values in a dataset. If 99 people in a room earn $50,000 a year, and Elon Musk walks in, the *average* salary in the room becomes millions of dollars. Outliers distort statistical analyses and cripple machine learning models. This chapter teaches you how to detect them using robust statistical methods and handle them appropriately.
2. What Causes Outliers?
Outliers usually come from three sources:
1. Data Entry Errors: Typing 1000 instead of 10.00 (human error).
2. Measurement Errors: A faulty sensor spikes to 9999 for one second (system error).
3. Natural Extreme Variation: Fraudulent transactions, billionaires, or a viral marketing campaign (valid, but extreme data).
*Important:* You should only delete outliers if you are certain they are errors (Sources 1 & 2). Natural outliers (Source 3) often contain the most valuable insights (e.g., detecting credit card fraud!).
3. Method 1: The IQR (Interquartile Range) Method
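The IQR method is robust to extreme values because it relies on quartiles (the 25th and 75th percentiles, Q1 and Q3), not on the mean and standard deviation. The interquartile range is IQR = Q3 - Q1, and a value is flagged as an outlier if it falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. The same rule sets the whisker limits in a standard boxplot.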
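The IQR rule can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the helper name `iqr_bounds` and the salary figures are made up for the example, and `np.percentile` uses its default linear interpolation between data points.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up salaries in thousands of dollars; 900 is the planted extreme value.
salaries = np.array([48, 50, 50, 51, 52, 49, 50, 53, 47, 900])

lower, upper = iqr_bounds(salaries)
outliers = salaries[(salaries < lower) | (salaries > upper)]

print(f"fences: [{lower}, {upper}]")
print("outliers:", outliers)
```

Because the fences come from the quartiles, the value 900 has no influence on where they land; replacing it with 9,000 would give the same fences.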
4. Method 2: The Z-Score Method
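The Z-score tells you how many standard deviations a data point lies from the mean: z = (x - mean) / standard deviation. A common rule flags any point whose absolute Z-score exceeds 3. The method works best when your data is roughly normally distributed (bell curve), because that cutoff is calibrated to the shape of the normal distribution.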
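As a rough sketch of the Z-score rule, the snippet below flags readings more than 3 standard deviations from the mean. The helper name `z_scores` and the sensor readings are invented for illustration, and the population standard deviation (NumPy's default, `ddof=0`) is an assumption; some texts use the sample standard deviation instead.

```python
import numpy as np

def z_scores(values):
    """Z-score of each value, using the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up temperature readings with one faulty spike at the end.
readings = np.array([20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 20.4,
                     19.9, 20.1, 20.2, 19.8, 20.0, 20.3, 19.9, 20.1, 20.0, 99.9])

z = z_scores(readings)
flagged = readings[np.abs(z) > 3]  # common cutoff of 3 standard deviations

print("flagged:", flagged)
```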
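*Why IQR is often better than Z-Score:* Massive outliers pull the mean toward themselves and inflate the standard deviation, which shrinks every Z-score and can hide other outliers (an effect known as masking). The IQR is far less affected, because the quartiles depend on the middle of the data and not on how extreme the tails are.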
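The masking effect is easy to reproduce on a small made-up sample. In the sketch below, two extreme values inflate the standard deviation so much that neither reaches a Z-score of 3, while the IQR fences still catch both. The numbers are illustrative only.

```python
import numpy as np

# Ten ordinary values plus two extreme ones (made-up data).
values = np.array([10, 11, 12, 9, 10, 11, 10, 12, 9, 10, 500, 520], dtype=float)

# Z-score rule: the two extremes inflate the standard deviation themselves.
z = (values - values.mean()) / values.std()
z_flagged = values[np.abs(z) > 3]

# IQR rule: the fences depend only on the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_flagged = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("largest |z|:", np.abs(z).max())
print("Z-score flags:", z_flagged)
print("IQR flags:", iqr_flagged)
```

With only 12 points, the Z-score rule flags nothing here, which is exactly the failure mode described above.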
5. Handling Outliers
Once detected, how do you handle them?
1. Removal (Trimming): Remove the rows entirely. Best when you are certain they are data entry errors.
2. Capping (Winsorizing): Cap extreme values at a specific threshold. E.g., any salary above $150k is capped at $150k. This preserves the row for other column analyses.
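3. Transformation (Log Transformation): If the data is heavily right-skewed (like income), taking the logarithm compresses extreme values and brings the distribution closer to normal. The logarithm requires positive values; for data that contains zeros, log(1 + x) is a common substitute.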
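The three options above might look like this in pandas. It is a sketch under stated assumptions: the DataFrame, its column names, and the salary figures are invented, and the $150k cap from the capping example is reused as the trimming threshold purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data; the 1,000,000 entry plays the role of the outlier.
df = pd.DataFrame({
    "employee": list("ABCDEF"),
    "salary": [52_000, 61_000, 58_000, 75_000, 1_000_000, 64_000],
})
cap = 150_000  # illustrative threshold, as in the capping example

# 1. Removal (trimming): drop rows above the threshold.
trimmed = df[df["salary"] <= cap]

# 2. Capping (winsorizing): keep the row, replace the extreme value.
capped = df.copy()
capped["salary"] = np.where(capped["salary"] > cap, cap, capped["salary"])
# Equivalent pandas idiom: capped["salary"] = df["salary"].clip(upper=cap)

# 3. Log transformation: log1p computes log(1 + x), which is safe at zero.
logged = df.copy()
logged["log_salary"] = np.log1p(logged["salary"])

print(len(trimmed), "rows after trimming")
print(capped["salary"].tolist())
print(logged["log_salary"].round(2).tolist())
```

Trimming loses employee E entirely, capping keeps the row with an altered salary, and the log transform keeps every original value recoverable.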
6. Mini Project: Fraud Transaction Detector
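One way such a detector could start is sketched below. Everything in it is an assumption made for illustration: the function name `flag_suspicious`, the `txn_id` and `amount` columns, and the transaction amounts are all hypothetical. It applies the IQR fences to the transaction amount and, in line with the advice on natural outliers, flags suspicious rows for review instead of deleting them.

```python
import pandas as pd

def flag_suspicious(transactions, amount_col="amount", k=1.5):
    """Return a copy with a boolean 'suspicious' column based on IQR fences."""
    q1 = transactions[amount_col].quantile(0.25)
    q3 = transactions[amount_col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    out = transactions.copy()
    # between() is inclusive on both ends; anything outside the fences is flagged.
    out["suspicious"] = ~out[amount_col].between(lower, upper)
    return out

# Hypothetical transactions; txn_id 6 is the planted anomaly.
transactions = pd.DataFrame({
    "txn_id": range(1, 11),
    "amount": [23.5, 41.0, 18.2, 35.9, 27.4, 4999.0, 30.1, 22.8, 38.6, 25.0],
})

result = flag_suspicious(transactions)
print(result[result["suspicious"]])
```

A real detector would use more than the amount alone (time of day, merchant, location), but the flag-and-review pattern stays the same.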
7. Common Mistakes
- Deleting outliers without investigation: If you are building a fraud detection model, the outliers ARE the target variable. Deleting them destroys the exact phenomenon you are trying to predict.
- Using Z-score on highly skewed data: Applying Z-scores to raw income or population data tends to flag many valid points in the long tail as outliers. Use the IQR method, or log-transform the data first.
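The second mistake can be checked on synthetic data. The sketch below draws right-skewed "incomes" from a log-normal distribution (the parameters and the fixed seed are arbitrary choices for the example) and counts how many points the Z-score rule flags before and after a log transform.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
# Synthetic, right-skewed data standing in for incomes.
incomes = rng.lognormal(mean=10.5, sigma=1.0, size=10_000)

def count_z_outliers(values, threshold=3.0):
    """Number of points whose absolute Z-score exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return int((np.abs(z) > threshold).sum())

raw_flagged = count_z_outliers(incomes)          # Z-score on the skewed data
log_flagged = count_z_outliers(np.log(incomes))  # Z-score after log transform

print("flagged on raw data:", raw_flagged)
print("flagged after log:  ", log_flagged)
```

None of these points is an error, since they all come from the same distribution, yet the raw Z-score rule flags noticeably more of them than the log-transformed version.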
8. MCQs
What does IQR stand for?
What is the standard multiplier applied to the IQR to find outlier bounds?
A Z-score tells you how many _____ a value is from the mean?
What is a common threshold for identifying an outlier using Z-scores?
Replacing extreme outliers with the maximum acceptable threshold value is called?
Which method is more robust to extreme outliers?
If you apply a log transformation to highly right-skewed data, it generally:
Should you always delete outliers?
In the IQR formula, what does Q3 represent?
What does NumPy's np.where(condition, x, y) do?
9. Interview Questions
- Q: Explain the difference between the IQR method and the Z-score method for outlier detection. Which do you prefer and why?
- Q: If you detect outliers in a dataset containing housing prices, how do you decide whether to cap them, delete them, or leave them alone?