Regression Models
CHAPTER 09 Intermediate

Data Preprocessing for Regression

Updated: May 16, 2026
6 min read

1. Introduction

Machine learning algorithms are sensitive to messy data. If your dataset contains extreme outliers, or if one feature is measured in thousands (Salary) while another is measured in single digits (Years of Experience), the fit can be distorted: outliers pull the regression line toward themselves, and mismatched scales slow down gradient-based training, make regularized models treat features unequally, and make the learned weights hard to compare. Data Preprocessing is the practice of cleaning, formatting, and scaling raw data so that the algorithm can learn from it efficiently.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Handle missing values using SimpleImputer.
  • Identify and mitigate Outliers.
  • Understand the mathematical need for Feature Scaling.
  • Implement Standardization (StandardScaler).
  • Implement Normalization (MinMaxScaler).

3. Handling Missing Values

In Chapter 4, we used Pandas to fill missing values (NaN). scikit-learn provides a more robust, pipeline-friendly way to do this using SimpleImputer.
python
import numpy as np
from sklearn.impute import SimpleImputer

# Dataset with a missing value (NaN)
X = np.array([[10], [20], [np.nan], [40], [50]])

# Create an imputer that replaces NaN with the average (mean) of the column
imputer = SimpleImputer(strategy='mean')

# Apply it to the data
X_clean = imputer.fit_transform(X)

print(X_clean)
# The NaN is replaced by 30 (the average of 10, 20, 40, 50)
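
The "pipeline-friendly" part is that the imputer remembers what it learned. The sketch below uses made-up arrays (`X_train` and `X_new` are illustrative names, not from the chapter's dataset): it fits on one array, then reuses the learned mean on new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data and new, unseen data
X_train = np.array([[10], [20], [np.nan], [40], [50]])
X_new = np.array([[np.nan], [35]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)  # learns the column mean from the training data only

# The learned fill value is stored on the fitted imputer
print(imputer.statistics_)

# Missing values in new data are filled with the training mean,
# not with a mean recomputed from X_new
X_new_clean = imputer.transform(X_new)
print(X_new_clean)
```

Because the fill value comes from `fit`, the same imputer can be applied to test or production data without recomputing anything.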

4. The Outlier Problem

An Outlier is a data point that is drastically different from the rest of the dataset. Imagine predicting house prices. Your data has 100 normal houses ranging from $200k to $500k, plus one 100-bedroom castle worth $50 million. Because Linear Regression minimizes the *squared* error, a point that sits far from the line contributes a disproportionately large penalty, so the Line of Best Fit tilts toward the castle and the predictions for the 100 normal houses get worse.

Solution: Identify outliers (using Box Plots or Z-scores) and investigate them. If a point is a data-entry error or lies outside the population you want to model, remove it from the training data. If it is a genuine observation, consider capping it or transforming the feature (for example, with a logarithm) instead of deleting it.
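
As a rough sketch of outlier detection, the example below rebuilds the castle scenario with synthetic prices and flags outliers in two ways: a Z-score cutoff and the 1.5 × IQR rule that box plots use for their whiskers. The cutoff of 3 and the factor of 1.5 are common conventions, not fixed rules.

```python
import numpy as np

# Synthetic data: 100 "normal" houses plus one $50M castle
prices = np.append(np.linspace(200_000, 500_000, 100), 50_000_000)

# Method 1: Z-score. Flag points more than 3 standard deviations from the mean.
z_scores = (prices - prices.mean()) / prices.std()
z_outliers = np.abs(z_scores) > 3

# Method 2: IQR rule (the whisker rule used by box plots).
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
iqr_outliers = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)

print("Flagged by Z-score:", prices[z_outliers])
print("Flagged by IQR:    ", prices[iqr_outliers])
```

Z-scores rely on the mean and standard deviation, which the outlier itself inflates, so on very small datasets the IQR rule tends to be the more dependable of the two.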

5. Why Do We Need Feature Scaling?

Look at these two features:
  • Bedrooms: Ranges from 1 to 5.
  • SquareFeet: Ranges from 1,000 to 5,000.

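In regression, the algorithm multiplies a weight by each input. Because SquareFeet values are about 1,000 times larger than Bedrooms values, its weight must be about 1,000 times smaller to have a comparable effect on the prediction. Plain least squares can absorb this difference, but it causes real problems elsewhere: gradient descent converges slowly or unstably, regularization (which penalizes large weights) treats the two features unequally, and the raw weights cannot be compared to judge which feature matters more. Feature Scaling puts all columns on a comparable numerical scale (e.g., between 0 and 1, or mean 0 and standard deviation 1) so the algorithm treats them fairly.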
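
One way to put a number on the problem is the condition number of the matrix XᵀX, which gradient-based solvers are sensitive to: the larger it is, the more lopsided the error surface and the slower the convergence. The sketch below uses a small made-up housing table and is an illustration, not a benchmark.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical housing features: [Bedrooms, SquareFeet]
X_raw = np.array([
    [1, 1200.0],
    [2, 1000.0],
    [3, 3500.0],
    [4, 2800.0],
    [5, 5000.0],
])

X_scaled = StandardScaler().fit_transform(X_raw)

# Condition number of X^T X: a large value means the error surface is
# stretched far more along one feature than the other.
cond_raw = np.linalg.cond(X_raw.T @ X_raw)
cond_scaled = np.linalg.cond(X_scaled.T @ X_scaled)

print(f"Condition number, raw features:    {cond_raw:,.0f}")
print(f"Condition number, scaled features: {cond_scaled:,.1f}")
```

The scaled features should produce a condition number that is smaller by several orders of magnitude, which is why scaled data usually trains faster and more stably.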

6. Standardization (Z-Score Scaling)

Standardization transforms each column so that it has a Mean of 0 and a Standard Deviation of 1: every value x becomes z = (x - mean) / std, using that column's own mean and standard deviation. It centers the data around zero. This is the usual default scaling choice for regression models.
python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Raw data: massive differences in scale
X = np.array([[1, 10000], 
              [2, 20000], 
              [3, 30000]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(np.round(X_scaled, 2))
# Output (rounded to 2 decimals):
# [[-1.22  -1.22]
#  [ 0.     0.  ]
#  [ 1.22   1.22]]
# Now both columns are on the exact same scale!
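
To check what StandardScaler does internally, the sketch below recomputes the Z-scores by hand with NumPy and compares them to the scaler's result. It assumes the same three-row array as above; `X_manual` is just an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 10000],
              [2, 20000],
              [3, 30000]], dtype=float)

# z = (x - mean) / std, computed per column (axis=0)
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)  # population std (ddof=0), the same convention StandardScaler uses
X_manual = (X - col_mean) / col_std

scaler = StandardScaler().fit(X)
print(scaler.mean_)   # column means learned during fit
print(scaler.scale_)  # column standard deviations learned during fit

# The manual result should match the scaler's output
print(np.allclose(X_manual, scaler.transform(X)))
```

The fitted `mean_` and `scale_` attributes are what the scaler reuses when it later transforms new data.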

7. Normalization (Min-Max Scaling)

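Normalization rescales each column to the range 0.0 to 1.0 using x_scaled = (x - min) / (max - min). The smallest value in the column becomes 0, the largest becomes 1, and everything else falls in between. These bounds hold for the data the scaler was fitted on; new values outside the original min/max will land outside the 0 to 1 range.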
python
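import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same raw data as in the Standardization example
X = np.array([[1, 10000],
              [2, 20000],
              [3, 30000]])

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)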

print(X_normalized)
# Output:
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
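
The min-max formula is simple enough to verify by hand. The sketch below repeats the calculation with NumPy, then transforms a hypothetical new row (`X_new`) whose values fall outside the range seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1, 10000],
                    [2, 20000],
                    [3, 30000]], dtype=float)

scaler = MinMaxScaler().fit(X_train)

# x_scaled = (x - min) / (max - min), computed per column
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_manual = (X_train - col_min) / (col_max - col_min)
print(np.allclose(X_manual, scaler.transform(X_train)))

# A new row outside the fitted range is not clipped by default:
# 4 is above the training max of 3, and 5000 is below the training min of 10000
X_new = np.array([[4, 5000]], dtype=float)
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)  # first value is above 1, second is below 0
```

If strict bounds matter, MinMaxScaler accepts `clip=True` in recent scikit-learn versions, which truncates transformed values to the target range.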

8. Standardization vs. Normalization

When should you use which?
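  • Use Standardization (StandardScaler): The usual default for Linear Regression, Logistic Regression, and Deep Learning. A single extreme value distorts it less than it distorts Normalization, where the min and max are set by the most extreme points, although neither method is truly robust to outliers.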
  • Use Normalization (MinMaxScaler): When you specifically need the data bounded between 0 and 1 (e.g., Image pixel data for Neural Networks).

9. Common Mistakes

  • Data Leakage: A critical mistake is scaling the *entire* dataset before splitting it into Train and Test sets. If you do this, information from the Test set "leaks" into the scaler's calculations (Mean/Max). You must fit_transform the Training data, and only transform the Test data.
  • Scaling the Target Variable (y): Usually, you only scale the Input Features (X). If you scale the target variable (y), the model predicts in scaled units (e.g., a salary of 0.85 instead of $85,000), and every prediction has to be converted back with the scaler's inverse_transform before anyone can read it.
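
A minimal sketch of the leakage-free order of operations follows. The data is synthetic and the variable names are illustrative; the key point is that `fit_transform` touches only the training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# 1. Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training mean/std on the test rows (no fitting here)
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)                # learned from X_train only
print(X_test_scaled.mean(axis=0))  # close to, but not exactly, 0
```

The test set's scaled mean is not exactly zero, and that is expected: the test rows were measured against statistics they had no part in computing.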

10. Best Practices

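  • Build Pipelines: Instead of manually calling imputer.fit(), then scaler.fit(), then model.fit(), chain the steps together with scikit-learn's Pipeline (from sklearn.pipeline import Pipeline). Calling fit on the pipeline fits every step on the training data only, which guards against Data Leakage (including inside cross-validation) and lets you save and deploy the preprocessing and the model as a single object.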
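
Here is one possible pipeline that chains the three tools from this chapter. The dataset is synthetic, and the step names ("imputer", "scaler", "regressor") are arbitrary labels chosen for this sketch.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a few missing values knocked out of the first column
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 4.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=200)
X[::25, 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # fill NaN with the training mean
    ("scaler", StandardScaler()),                 # standardize using training stats
    ("regressor", LinearRegression()),            # fit the regression last
])

# One call fits every step, in order, on the training data only
model.fit(X_train, y_train)

# predict/score push the test data through the already-fitted steps
predictions = model.predict(X_test)
r2 = model.score(X_test, y_test)
print(f"Test R^2: {r2:.3f}")
```

Because the whole chain is one estimator, it can be passed directly to tools such as cross_val_score, which refit the imputer and scaler inside each fold.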

11. Exercises

  1. If an age column has a minimum value of 20 and a maximum value of 60, what will the age 40 become after applying MinMaxScaler?
  2. Why does a massive outlier drastically change the slope of a Simple Linear Regression line?

12. MCQ Quiz with Answers

Question 1

Why is Feature Scaling (like Standardization) crucial before training a Multiple Linear Regression model?

Question 2

What is the fundamental difference between StandardScaler and MinMaxScaler?

13. Interview Questions

  • Q: Explain "Data Leakage" in the context of applying a StandardScaler. How do you prevent it?
  • Q: In Linear Regression, explain why an extreme outlier in the training data negatively impacts the "Line of Best Fit."

14. FAQs

Q: Do Decision Trees and Random Forests require feature scaling?
A: No. Tree-based algorithms (covered in Chapters 14 & 15) do not fit weights to feature values; they split the data on thresholds (e.g., "Is Age > 30?"). Rescaling a feature moves the threshold but does not change which rows fall on each side of it, so the resulting tree is effectively unaffected by scale differences.
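
A quick experiment can back this up. The sketch below fits the same decision tree on a tiny made-up housing table, once on raw features and once on standardized features, then compares the predictions. The data and the `max_depth=2` setting are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical features [Bedrooms, SquareFeet] and prices in $k
X = np.array([[1, 1200], [2, 1000], [3, 3500],
              [4, 2800], [5, 5000], [3, 2000]], dtype=float)
y = np.array([210, 190, 420, 380, 560, 300], dtype=float)

X_scaled = StandardScaler().fit_transform(X)

# Same hyperparameters and random_state for both trees
tree_raw = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)

# Scaling preserves the ordering of values within each column,
# so both trees should group the rows identically
print(np.allclose(pred_raw, pred_scaled))
```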

15. Summary

Data Preprocessing transforms raw, chaotic data into a clean numerical matrix. By replacing missing values, identifying outliers, and putting all features on a comparable scale through Standardization or Normalization, you give your regression model a much better chance of learning true patterns rather than artifacts of messy data.

16. Next Chapter Recommendation

Our data is scaled and clean. But what if we have 500 features, and half of them are useless? Or what if a feature is text (like "City")? In Chapter 10: Feature Engineering and Selection, we will learn how to create new, highly predictive columns and delete the noise.
