CHAPTER 14 Intermediate

Handling Imbalanced Datasets

Updated: May 16, 2026

1. Introduction

In standard tutorials, datasets are neatly balanced: 500 images of Dogs and 500 images of Cats. In the real world, data is rarely balanced. If you are building an AI to detect Credit Card Fraud, 99.9% of transactions may be Legitimate (Class 0) and only 0.1% Fraud (Class 1). This is an Imbalanced Dataset. If you feed it into a standard classifier unchanged, the model can score very high accuracy while learning almost nothing about the minority class. In this chapter, we will learn how to detect, measure, and fix severe class imbalances.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain why standard algorithms fail on imbalanced data.
  • Understand the "Accuracy Paradox".
  • Implement Oversampling and Undersampling.
  • Use Synthetic Minority Over-sampling Technique (SMOTE).
  • Adjust algorithm class weights.

3. The Accuracy Paradox

Imagine training a Logistic Regression model on the Fraud dataset (99.9% Legitimate, 0.1% Fraud). By default, the algorithm minimizes its average error over all rows, and every row counts equally, so the 99.9% Legitimate rows dominate the training signal. The model drifts toward a shortcut: *If I just predict "Legitimate" for every single transaction and ignore the features, I will achieve 99.9% Accuracy!*

Your model looks perfect on paper (99.9% accurate), but it is completely useless in reality because it caught 0 cases of actual Fraud. This is the Accuracy Paradox.
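
The paradox is easy to reproduce with a few lines of NumPy. The sketch below uses hypothetical labels (10,000 transactions, 10 of them Fraud) and a stand-in "model" that always predicts Legitimate; none of these numbers come from a real dataset.

```python
import numpy as np

# Hypothetical labels: 10,000 transactions, 10 of them Fraud (0.1%).
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A stand-in "model" that ignores the features and always predicts Legitimate (0).
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = (y_pred == y_true).mean()

# Recall on Fraud: fraction of real Fraud cases the model flagged.
fraud_caught = int(((y_pred == 1) & (y_true == 1)).sum())
recall = fraud_caught / int((y_true == 1).sum())

print(f"Accuracy: {accuracy:.3f}")            # Accuracy: 0.999
print(f"Fraud cases caught: {fraud_caught}")  # Fraud cases caught: 0
print(f"Recall on Fraud: {recall:.1f}")       # Recall on Fraud: 0.0
```

Accuracy rewards the 9,990 easy Legitimate rows, while recall on the Fraud class exposes that nothing useful was learned.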

4. Strategy 1: Algorithm Class Weights

The easiest way to fix imbalance is to tell the algorithm to penalize mistakes unequally. Setting class_weight='balanced' makes Scikit-learn weight each class inversely to its frequency, using n_samples / (n_classes * np.bincount(y)). On the 99.9% / 0.1% Fraud data this is roughly like saying: *"If you misclassify a Legitimate transaction, I will deduct 1 point. But if you miss a Fraud transaction, I will deduct about 1,000 points!"* The algorithm is forced to pay attention to the minority class.
python
from sklearn.linear_model import LogisticRegression

# The algorithm will mathematically weigh Class 1 much heavier!
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
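
To see what 'balanced' actually means in numbers, the weights can be computed by hand and compared with scikit-learn's `compute_class_weight` helper. This is a minimal sketch on hypothetical labels (9,990 Legitimate, 10 Fraud); the label counts are an assumption chosen to match the 99.9% / 0.1% example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 9,990 Legitimate (0) and 10 Fraud (1).
y = np.array([0] * 9_990 + [1] * 10)

# 'balanced' formula: n_samples / (n_classes * np.bincount(y))
manual = len(y) / (2 * np.bincount(y))

# The same weights computed by scikit-learn's helper.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

print(manual)                   # per-class weights: about 0.5 for class 0, 500 for class 1
print(weights)                  # matches the manual calculation
print(weights[1] / weights[0])  # a missed Fraud row costs about 999x more
```

The absolute weights matter less than their ratio: each Fraud row now pulls on the loss about as hard as 999 Legitimate rows.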

5. Strategy 2: Undersampling

If you have 10,000 Legitimate rows and 100 Fraud rows, Undersampling randomly deletes 9,900 Legitimate rows.
  • *Result:* You now have a perfectly balanced dataset of 100 Legitimate and 100 Fraud.
  • *The Problem:* You just deleted 9,900 rows of potentially valuable training data! Only use this if you have millions of rows.
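
Random undersampling can be sketched with plain NumPy indexing. The data below is randomly generated and purely hypothetical (10,000 Legitimate rows, 100 Fraud rows, 3 made-up features).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 10,000 Legitimate rows and 100 Fraud rows, 3 features.
X = rng.normal(size=(10_100, 3))
y = np.array([0] * 10_000 + [1] * 100)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep a random subset of the majority class, the same size as the minority.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

# Combine the kept rows and shuffle so the classes are interleaved.
keep = np.concatenate([kept_majority, minority_idx])
rng.shuffle(keep)

X_under, y_under = X[keep], y[keep]
print(np.bincount(y_under))  # 100 rows per class
print(X_under.shape)         # (200, 3): 9,900 rows were discarded
```

The imbalanced-learn library also offers a `RandomUnderSampler` class that wraps this idea behind the same `fit_resample` interface used by SMOTE.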

6. Strategy 3: Oversampling and SMOTE

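Instead of deleting data, we can create more of it! Simple oversampling just duplicates the 100 Fraud rows over and over until there are 10,000 of them. But the model then sees the same 100 examples repeatedly and can memorize them instead of learning a general pattern, which encourages Overfitting.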
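
A small NumPy sketch makes the weakness of duplication visible. The data is hypothetical and scaled down (1,000 Legitimate rows, 10 Fraud rows) to keep the output readable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1,000 Legitimate rows and 10 Fraud rows, 2 features.
X = rng.normal(size=(1_010, 2))
y = np.array([0] * 1_000 + [1] * 10)

minority_idx = np.flatnonzero(y == 1)
n_extra = int((y == 0).sum()) - len(minority_idx)

# Sample existing Fraud rows with replacement until the classes match.
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_over = np.vstack([X, X[extra_idx]])
y_over = np.concatenate([y, y[extra_idx]])

# The class counts are now equal...
class_counts = np.bincount(y_over)
# ...but the number of distinct Fraud rows has not changed.
n_distinct_fraud = len(np.unique(X_over[y_over == 1], axis=0))

print(class_counts)      # 1,000 rows per class
print(n_distinct_fraud)  # 10
```

The Fraud class has 1,000 rows but still only 10 distinct points, so the model gains no new information about what Fraud looks like.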

SMOTE (Synthetic Minority Over-sampling Technique) is a widely used alternative. Rather than duplicating rows, it takes an existing Fraud case, finds its k nearest Fraud neighbors (the KNN step), and creates a *brand new, synthetic* Fraud data point at a random position on the line segment between the case and one of those neighbors.
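
The core interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the `synthesize` helper, the sample points, and `k = 2` are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority (Fraud) samples with 2 features.
X_minority = np.array([[9.0, 1.0], [9.4, 1.2], [8.8, 0.9], [9.1, 1.5]])
k = 2  # number of nearest neighbors to consider (assumed value)

def synthesize(X_min, k, rng):
    """Create one synthetic point between a minority sample and a near neighbor."""
    # Pick a random minority sample.
    i = rng.integers(len(X_min))
    # Euclidean distance from sample i to every minority sample.
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    dist[i] = np.inf  # exclude the sample itself
    # Choose one of its k nearest minority neighbors at random.
    neighbors = np.argsort(dist)[:k]
    j = rng.choice(neighbors)
    # Move a random fraction of the way from sample i toward neighbor j.
    lam = rng.random()
    return X_min[i] + lam * (X_min[j] - X_min[i])

synthetic = np.array([synthesize(X_minority, k, rng) for _ in range(5)])
print(synthetic)  # 5 new points, each lying between two real Fraud samples
```

Because every synthetic point sits on a segment between two real minority samples, the new rows stay inside the region the Fraud class already occupies instead of being exact copies.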

python
# Note: SMOTE requires the 'imbalanced-learn' library
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
import numpy as np

# Imbalanced Data: 5 Majority (Class 0), 2 Minority (Class 1)
X_train = np.array([[1.1], [1.2], [1.3], [1.4], [1.5], [9.1], [9.2]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1])

# Initialize SMOTE
# k_neighbors=1 is used here because we only have 2 minority samples!
smote = SMOTE(k_neighbors=1, random_state=42)

# Generate synthetic data!
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

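# Check the new balance
unique, counts = np.unique(y_train_resampled, return_counts=True)
# .tolist() converts NumPy integers to plain Python ints so the dict prints cleanly
print(dict(zip(unique.tolist(), counts.tolist())))
# Output: {0: 5, 1: 5} -> The data is perfectly balanced!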

7. The Golden Rule of SMOTE

CRITICAL: You must split your data into Train/Test *BEFORE* applying SMOTE. Apply SMOTE only to the training data (X_train, y_train) to help the model learn. You must NEVER apply SMOTE to your test data (X_test). If you test your model on synthetic data, your evaluation metrics no longer reflect how the model performs on real transactions.
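
The order of operations can be sketched as follows. To keep the example dependent only on NumPy and scikit-learn, random oversampling with `sklearn.utils.resample` stands in for SMOTE here; with imbalanced-learn installed, `SMOTE(...).fit_resample(X_train, y_train)` would be called at the same step. The dataset is randomly generated and hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(42)

# Hypothetical data: 950 Legitimate rows, 50 Fraud rows, 4 features.
X = rng.normal(size=(1_000, 4))
y = np.array([0] * 950 + [1] * 50)

# 1. Split FIRST. stratify=y keeps the 5% Fraud rate in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Resample the TRAINING portion only.
#    (Stand-in for SMOTE: duplicate minority rows until the classes match.)
X_maj = X_train[y_train == 0]
X_min = X_train[y_train == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_train_bal = np.vstack([X_maj, X_min_up])
y_train_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

# 3. The test set is left untouched and keeps its real-world class ratio.
train_counts = np.bincount(y_train_bal)
test_counts = np.bincount(y_test)
print(train_counts)  # balanced: 760 rows per class
print(test_counts)   # still imbalanced: 190 Legitimate, 10 Fraud
```

The model is fitted on `X_train_bal` and `y_train_bal`, then evaluated on the untouched `X_test` and `y_test`.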

8. Common Mistakes

  • Trusting "Accuracy": As discussed, 99% accuracy on an imbalanced dataset proves very little, because a model that always predicts the majority class can score just as high. Use metrics that focus on the minority class (like Precision, Recall, and the F1-Score), which we will cover in depth in Chapter 16.
  • Applying SMOTE to the whole dataset: This causes Data Leakage. Synthetic points interpolated from rows that later land in the Test set end up in the Training set, resulting in artificially high scores that fail in production.

9. Best Practices

  • Combine Strategies: In highly skewed datasets, a common enterprise strategy is to lightly Undersample the majority class first (to reduce compute time), and then apply SMOTE to the minority class to achieve a 50/50 balance.

10. Exercises

  1. In a dataset trying to predict rare network intrusions (99.5% Normal traffic, 0.5% Intrusion), why is a model that predicts "Normal" 100% of the time mathematically "accurate" but useless?
  2. Explain the difference between simple Oversampling (duplication) and SMOTE.

11. MCQ Quiz with Answers

Question 1

What does the hyperparameter class_weight='balanced' do in Scikit-Learn models?

Question 2

When utilizing SMOTE to balance a dataset, at what point in the pipeline MUST it be applied?

12. Interview Questions

  • Q: Explain the "Accuracy Paradox" in the context of an imbalanced dataset, and suggest two algorithmic ways to overcome it.
  • Q: Why is it considered a critical error (Data Leakage) to apply SMOTE to your entire dataset before performing a Train/Test split?

13. FAQs

Q: Can I use SMOTE with text data (NLP)? A: It is generally not recommended. SMOTE operates on continuous geometric space. Generating a "synthetic" word frequency matrix often results in mathematical garbage that doesn't represent real human language. For NLP, rely on class_weight='balanced' or undersampling.

14. Summary

Real-world data is rarely balanced. By recognizing the trap of the Accuracy Paradox and proactively rebalancing, either by adjusting algorithm weights or by synthesizing new data with SMOTE, you help your classifier actually learn to identify the rare, critical events it was built to find.

15. Next Chapter Recommendation

We have heavily focused on Binary Classification (0 or 1). But what if we need to categorize news articles into Politics, Sports, Tech, or Entertainment? In Chapter 15: Multiclass and Multilabel Classification, we will expand our algorithms to handle complex, multi-category environments.
