Detecting and Handling Outliers

Updated: May 18, 2026
5 min read

# CHAPTER 8

Detecting and Handling Outliers

1. Chapter Introduction

An outlier is an observation that lies unusually far from the other values in a dataset. If 99 people in a room each earn $50,000 a year and Elon Musk walks in, the *average* salary in the room jumps to millions of dollars, even though nobody's pay has changed. Outliers can distort summary statistics such as the mean and standard deviation, and they can degrade machine learning models that are sensitive to extreme values. This chapter teaches you how to detect them using robust statistical methods and handle them appropriately.
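
The salary example can be checked with a few lines of standard-library Python. This is a minimal sketch: the newcomer's income of $1,000,000,000 is a made-up figure chosen for illustration, not a real number.

```python
from statistics import mean, median

# 99 people who each earn $50,000 a year
salaries = [50_000] * 99

# Hypothetical income for the newcomer (an assumption for this example)
newcomer_income = 1_000_000_000
room = salaries + [newcomer_income]

mean_before = mean(salaries)
mean_after = mean(room)
median_after = median(room)

print(f"Mean before:  {mean_before:,.0f}")
print(f"Mean after:   {mean_after:,.0f}")
print(f"Median after: {median_after:,.0f}")
```

The mean moves from 50,000 to about 10 million, while the median stays at 50,000. That gap between a non-robust statistic and a robust one is the reason this chapter prefers quartile-based methods.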

2. What Causes Outliers?

Outliers usually come from three sources:

  1. Data Entry Errors: Typing 1000 instead of 10.00 (human error).
  2. Measurement Errors: A faulty sensor spikes to 9999 for one second (system error).
  3. Natural Extreme Variation: Fraudulent transactions, billionaires, or a viral marketing campaign (valid, but extreme data).

*Important:* You should only delete outliers if you are certain they are errors (Sources 1 & 2). Natural outliers (Source 3) often contain the most valuable insights (e.g., detecting credit card fraud!).
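
When an extreme value is a known error code rather than a real measurement, one option is to mark it as missing instead of dropping the row. The sketch below assumes a hypothetical temperature sensor that reports 9999 on failure; the readings and the `SENSOR_ERROR_CODE` name are invented for illustration.

```python
import pandas as pd

# Hypothetical sensor readings; 9999.0 is assumed to be the device's error code
readings = pd.Series([21.4, 21.6, 9999.0, 21.5, 21.7], name="temperature")
SENSOR_ERROR_CODE = 9999.0

# mask() replaces values where the condition is True with NaN,
# so the row survives but the bad reading no longer affects statistics
cleaned = readings.mask(readings == SENSOR_ERROR_CODE)

print(f"Mean with error code:    {readings.mean():.2f}")
print(f"Mean without error code: {cleaned.mean():.2f}")
```

Marking the value as missing keeps a record that a reading was attempted at that time, which a plain row deletion would lose.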

3. Method 1: The IQR (Interquartile Range) Method

The IQR method is robust to extreme values because it relies on quartiles (the 25th and 75th percentiles) rather than on the mean and standard deviation. It is the same rule that boxplots use to decide which points to draw as outliers.

python
import pandas as pd
import numpy as np

# Sample data with outliers
np.random.seed(42)
salaries = np.random.normal(50000, 10000, 100).tolist()
salaries.extend([500000, 1000000, 5000]) # Add extreme outliers
df = pd.DataFrame({'salary': salaries})

# 1. Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)

# 2. Calculate IQR
IQR = Q3 - Q1

# 3. Define the lower and upper bounds (standard is 1.5 * IQR)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound:.2f}")
print(f"Upper Bound: {upper_bound:.2f}")

# 4. Identify Outliers
outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
print(f"\nNumber of outliers detected: {len(outliers)}")
print(outliers)
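
The same steps can be wrapped in a small reusable helper and checked by hand on a tiny dataset. This is a sketch under stated assumptions: the function name `iqr_bounds`, its `k` parameter, and the sample values are all hypothetical, and the quartiles use pandas' default linear interpolation.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Ten illustrative values with one obvious extreme
values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

lower, upper = iqr_bounds(values)
flagged = values[(values < lower) | (values > upper)]

print(f"Bounds: [{lower}, {upper}]")
print(f"Flagged values: {flagged.tolist()}")
```

Here Q1 is 12 and Q3 is 13.75, so the IQR is 1.75 and the bounds are 9.375 and 16.375. Only the value 102 falls outside them.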

4. Method 2: The Z-Score Method

The Z-score tells you how many standard deviations a data point is from the mean. The usual cutoffs, such as a Z-score above 3, assume your data is roughly normally distributed (bell curve).

python
from scipy import stats

# Calculate Z-scores for the column
df['z_score'] = np.abs(stats.zscore(df['salary']))

# A common threshold for outliers is a Z-score > 3
# (Meaning the value is 3 standard deviations away from the mean)
threshold = 3
z_outliers = df[df['z_score'] > threshold]

print("\n=== Z-SCORE OUTLIERS ===")
print(z_outliers)
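
To see what the Z-score computes, it can help to write the formula out by hand: subtract the mean, then divide by the standard deviation. The sketch below uses made-up values. It also shows that the result depends on which standard deviation is used: `scipy.stats.zscore` defaults to the population version (`ddof=0`), while pandas' `Series.std()` defaults to the sample version (`ddof=1`).

```python
import numpy as np

# Illustrative values: mean is 5, population standard deviation is 2
values = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

mean = values.mean()
std_pop = values.std(ddof=0)     # population standard deviation
std_sample = values.std(ddof=1)  # sample standard deviation (slightly larger)

z_pop = (values - mean) / std_pop
z_sample = (values - mean) / std_sample

print("Z-scores (ddof=0):", z_pop)
print("Z-scores (ddof=1):", np.round(z_sample, 3))
```

With the population standard deviation, the largest value (9) sits exactly 2 standard deviations above the mean. On large datasets the two versions are nearly identical, but on small ones the choice can move a point across the threshold.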

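*Why IQR is often better than Z-Score:* Massive outliers inflate the mean and the standard deviation, which shrinks every Z-score and can hide other, smaller outliers (an effect known as masking). The IQR is far less affected, because the quartiles barely move unless a large share of the data (roughly a quarter or more) is extreme.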
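
The masking effect is easy to reproduce on a small, hand-built dataset. In this sketch the values are invented: twenty readings near 50, one moderate outlier (150), and one massive outlier (5000). The Z-scores are computed with the population standard deviation.

```python
import numpy as np
import pandas as pd

# Twenty ordinary values, one moderate outlier, one massive outlier
values = pd.Series([48, 49, 50, 51, 52] * 4 + [150, 5000], dtype=float)

# Z-score method: the massive outlier inflates the standard deviation
z_scores = np.abs((values - values.mean()) / values.std(ddof=0))
z_flagged = values[z_scores > 3].tolist()

# IQR method: the quartiles are computed from the middle of the data
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_flagged = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)].tolist()

print("Flagged by Z-score:", z_flagged)
print("Flagged by IQR:    ", iqr_flagged)
```

The value 5000 pushes the standard deviation above 1000, so 150 ends up with a Z-score near 0.1 and goes unnoticed. The IQR bounds stay close to the bulk of the data and catch both.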

5. Handling Outliers

Once detected, how do you handle them?

1. Removal (Trimming): Remove the rows entirely. Best when you are certain they are data entry errors.

python
# Keep only rows within the IQR bounds
df_trimmed = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]
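
A variation on trimming is to keep the removed rows in a separate table so that the deletion can be reviewed later. The sketch below is one possible way to do that, using `Series.between` (inclusive on both ends by default); the data and the names `kept` and `removed` are illustrative.

```python
import pandas as pd

# Small illustrative dataset with one extreme value
demo = pd.DataFrame({"salary": [48.0, 49.0, 50.0, 51.0, 52.0, 500.0]})

q1 = demo["salary"].quantile(0.25)
q3 = demo["salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask: True for rows inside the bounds
in_range = demo["salary"].between(lower, upper)

kept = demo[in_range]      # rows used for analysis
removed = demo[~in_range]  # rows set aside for review, not thrown away

print(f"Kept {len(kept)} rows, removed {len(removed)} rows")
print(removed)
```

Keeping `removed` around makes it possible to confirm afterwards that the dropped rows really were errors.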

2. Capping (Winsorizing): Cap extreme values at a specific threshold. E.g., any salary above $150k is capped at $150k. This preserves the row for other column analyses.

python
# Cap values at the upper and lower bounds
df['salary_capped'] = np.where(df['salary'] > upper_bound, upper_bound,
                       np.where(df['salary'] < lower_bound, lower_bound, df['salary']))
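
The nested `np.where` call can also be written with pandas' `Series.clip`, which some readers find easier to scan. The sketch below compares the two on made-up numbers; the bounds of 20 and 100 are arbitrary values chosen for illustration, not computed from the data.

```python
import numpy as np
import pandas as pd

salary = pd.Series([5, 40, 50, 60, 500], name="salary")

# Hypothetical bounds, fixed by hand for this example
lower_bound, upper_bound = 20, 100

# Approach 1: nested np.where, as in the chapter
capped_where = np.where(salary > upper_bound, upper_bound,
                        np.where(salary < lower_bound, lower_bound, salary))

# Approach 2: Series.clip does the same in one call
capped_clip = salary.clip(lower=lower_bound, upper=upper_bound)

print(capped_clip.tolist())
print("Same result:", bool((capped_clip.to_numpy() == capped_where).all()))
```

Both approaches leave in-range values untouched and move the two extremes to the nearest bound.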

3. Transformation (Log Transformation): If the data is heavily skewed (like income), taking the logarithm shrinks extreme values, making the data more normally distributed.

python
# log1p is log(1+x), safer because log(0) is undefined
df['salary_log'] = np.log1p(df['salary'])
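
One way to check whether the transformation helped is to compare skewness before and after, and to confirm that the original values can be recovered with `np.expm1`, the inverse of `np.log1p`. The incomes below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative incomes: five ordinary values and one very large one
income = pd.Series([20_000, 30_000, 40_000, 50_000, 60_000, 1_000_000], dtype=float)

income_log = np.log1p(income)

# Skewness near 0 means roughly symmetric; large positive means a long right tail
skew_raw = income.skew()
skew_log = income_log.skew()

# expm1 undoes log1p, so results can be reported in the original units
restored = np.expm1(income_log)

print(f"Skewness before: {skew_raw:.2f}")
print(f"Skewness after:  {skew_log:.2f}")
```

The skewness drops after the transformation, although with only six points and one extreme value it does not reach zero. Note that `np.log1p` only works for values greater than -1, so columns containing negative numbers need a different approach.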

6. Mini Project: Fraud Transaction Detector

python
# E-commerce transaction amounts
transactions = pd.DataFrame({
    'tx_id': range(1, 101),
    'amount': np.random.normal(50, 15, 100).tolist()
})
# Inject fraudulent large purchases
transactions.loc[10, 'amount'] = 1500
transactions.loc[55, 'amount'] = 2200

# Function to flag outliers using IQR
def flag_outliers(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    return series > upper_bound # Boolean mask

transactions['is_fraud_suspect'] = flag_outliers(transactions['amount'])

print("\n=== SUSPICIOUS TRANSACTIONS (OUTLIERS) ===")
print(transactions[transactions['is_fraud_suspect']])
# We don't delete these! We send them to the fraud team for review.
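
Because the fraudulent rows here were injected by hand, the detector can be checked against them. The sketch below is a self-contained variant of the mini project with a fixed random seed so that it is reproducible. The seed, the `k` parameter, and the names `caught_all` and `extra_flags` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Reproducible synthetic transactions (seed chosen arbitrarily)
rng = np.random.default_rng(0)
tx = pd.DataFrame({
    "tx_id": range(1, 101),
    "amount": rng.normal(50, 15, 100),
})

# Inject two large purchases at known row labels
injected_rows = [10, 55]
tx.loc[injected_rows, "amount"] = [1500.0, 2200.0]

def flag_outliers(series, k=1.5):
    """Flag values above Q3 + k * IQR (upper side only)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return series > q3 + k * (q3 - q1)

tx["is_fraud_suspect"] = flag_outliers(tx["amount"])

# Did we catch every injected row, and how many other rows were flagged too?
caught_all = bool(tx.loc[injected_rows, "is_fraud_suspect"].all())
extra_flags = int(tx["is_fraud_suspect"].sum()) - len(injected_rows)

print(f"All injected rows flagged: {caught_all}")
print(f"Other rows flagged: {extra_flags}")
```

Any additional flagged rows are ordinary purchases that happen to be unusually large. The rule finds statistically unusual amounts, not fraud, which is why the flagged rows go to a human for review.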

7. Common Mistakes

  • Deleting outliers without investigation: If you are building a fraud detection model, the outliers ARE the target variable. Deleting them destroys the exact phenomenon you are trying to predict.
  • Using Z-score on highly skewed data: Z-score thresholds assume a roughly symmetric, bell-shaped distribution. On income or population data, the long right tail means the method can mislabel valid values as outliers. Log-transform the data first, or use the IQR method.

8. MCQs

Question 1

What does IQR stand for?

Question 2

The standard multiplier for the IQR to find outlier bounds is?

Question 3

A Z-score tells you how many _____ a value is from the mean?

Question 4

What is a common threshold for identifying an outlier using Z-scores?

Question 5

Replacing extreme outliers with the maximum acceptable threshold value is called?

Question 6

Which method is more robust to extreme outliers?

Question 7

If you apply a log transformation to highly right-skewed data, it generally:

Question 8

Should you always delete outliers?

Question 9

In the IQR formula, Q3 represents the?

Question 10

np.where(condition, x, y) in NumPy does what?

9. Interview Questions

  • Q: Explain the difference between the IQR method and the Z-score method for outlier detection. Which do you prefer and why?
  • Q: If you detect outliers in a dataset containing housing prices, how do you decide whether to cap them, delete them, or leave them alone?

10. Summary

Outliers distort analytics. Detect them using statistical methods: IQR (robust, relies on quartiles) or Z-score (relies on the mean and standard deviation, works best on roughly normal data). Once detected, investigate their source. If they are errors, remove or cap them (Winsorization). If they are natural variations, consider leaving them or applying a log transformation to reduce their leverage on models.

11. Next Chapter Recommendation

In Chapter 9: String Cleaning and Text Processing, we move from numbers to text, learning how to wrangle messy, user-generated string data using regex and Pandas string methods.
