Data Transformation and Standardization
# CHAPTER 14
1. Chapter Introduction
Cleaning fixes *errors*. Transformation changes the *shape* and *scale* of clean data so algorithms can use it. If you feed a machine learning model a dataset where "Age" ranges from 18-80 and "Salary" ranges from 30,000-150,000, many algorithms (especially distance-based and gradient-based ones) will let Salary dominate simply because its numbers are thousands of times bigger, not because it matters more. This chapter teaches you how to scale numbers and encode text categories.
2. Standardization (Z-Score Scaling)
Standardization rescales data so it has a mean of 0 and a standard deviation of 1. It centers the data around zero. This is the preferred method for algorithms like Linear Regression, Logistic Regression, and SVMs.
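As a minimal sketch of the idea, each value is transformed as z = (x - mean) / standard deviation, computed per column. The example below assumes scikit-learn's StandardScaler and a small made-up DataFrame; the column names and values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; names and values are for illustration only.
df = pd.DataFrame({
    "Age": [18, 25, 40, 62, 80],
    "Salary": [30_000, 45_000, 70_000, 110_000, 150_000],
})

scaler = StandardScaler()

# fit_transform learns each column's mean and standard deviation,
# then applies (x - mean) / std to every value in that column.
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# StandardScaler uses the population standard deviation (ddof=0),
# so we check with ddof=0 rather than pandas' default of ddof=1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

After scaling, Age and Salary are on the same footing: both have mean 0 and standard deviation 1, so neither dominates because of its units.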
3. Normalization (Min-Max Scaling)
Min-Max scaling compresses all data into a strict range, usually between 0 and 1. This is preferred for Neural Networks and algorithms that don't assume a normal distribution (like K-Nearest Neighbors).
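A short sketch of Min-Max scaling, where each value becomes (x - min) / (max - min) within its column. It assumes scikit-learn's MinMaxScaler with its default range of 0 to 1; the data is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; names and values are for illustration only.
df = pd.DataFrame({
    "Age": [18, 25, 40, 62, 80],
    "Salary": [30_000, 45_000, 70_000, 110_000, 150_000],
})

scaler = MinMaxScaler()  # default feature_range is (0, 1)

# The smallest value in each column maps to 0 and the largest to 1.
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled)
```

Keep in mind that the 0 to 1 guarantee applies to the data the scaler was fitted on. New data with values outside the fitted minimum or maximum will land outside that range.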
4. Handling Skewed Data (Log Transformation)
As discussed in Chapter 8, financial data (salaries, prices) is usually heavily right-skewed. A log transformation squashes the massive outliers, making the distribution more bell-shaped.
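The effect can be sketched on synthetic data. The example below generates right-skewed "salaries" from a log-normal distribution (an assumption made purely for the demo, not real data) and compares skewness before and after the transform. It uses np.log1p, which computes log(1 + x) and therefore stays defined when a value is 0.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed data for illustration only (not real salaries).
rng = np.random.default_rng(0)
salaries = pd.Series(rng.lognormal(mean=11.0, sigma=0.8, size=1000), name="Salary")

# log1p(x) = log(1 + x); it compresses large values far more than small ones.
log_salaries = np.log1p(salaries)

print("Skew before:", salaries.skew())
print("Skew after: ", log_salaries.skew())

# np.expm1 reverses the transform when you need values in the original units.
restored = np.expm1(log_salaries)
```

A skewness close to 0 after the transform indicates a roughly symmetric distribution. Remember to reverse the transform (np.expm1) before reporting predictions in the original units.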
5. Encoding Categorical Data
Machine learning models only understand numbers. You cannot feed them a column that says "Red", "Green", "Blue". You must encode them.
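1. Label Encoding (For Ordinal Data) Ordinal data has an inherent order (e.g., Small, Medium, Large), so each category is mapped to an integer that preserves that order (e.g., Small=0, Medium=1, Large=2).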
2. One-Hot Encoding (For Nominal Data) Nominal data has NO order (e.g., Colors, Cities). If you assign Red=1, Green=2, Blue=3, the algorithm will assume Blue is "greater than" Red. To fix this, we create binary columns for each category.
*(Note: In ML pipelines, you typically convert the True/False booleans to 1/0 by appending .astype(int)).*
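Both encodings can be sketched with pandas alone. The DataFrame, column names, and the size_order mapping below are hypothetical. An explicit dictionary is used for the ordinal column rather than scikit-learn's LabelEncoder, because LabelEncoder assigns integers in sorted (alphabetical) order and would not preserve Small < Medium < Large.

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
df = pd.DataFrame({
    "size": ["Small", "Large", "Medium", "Small"],
    "color": ["Red", "Green", "Blue", "Red"],
})

# Ordinal data: map each category to an integer that preserves the order.
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Nominal data: one binary column per category, converted to 1/0.
color_dummies = pd.get_dummies(df["color"], prefix="color").astype(int)

encoded = pd.concat([df[["size_encoded"]], color_dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 across the color_ columns, so no artificial ordering between colors is introduced.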
6. The Dummy Variable Trap
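When you use One-Hot Encoding on a column with 3 categories (Red, Green, Blue), you actually only need 2 columns to represent all the information. If color_Blue is 0 and color_Green is 0, it *must* be Red.
Leaving all 3 columns in creates perfect multicollinearity: the three columns always sum to 1, which duplicates the model's intercept term. This makes the coefficients of an unregularized linear model impossible to estimate uniquely.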
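One common way to avoid the trap in pandas is the drop_first=True argument of pd.get_dummies, sketched below on made-up data. Note that drop_first removes the first category in sorted order, so in this sketch Blue (not Red) becomes the implied baseline.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
colors = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# All three dummy columns: every row sums to 1, which is the redundancy
# that causes perfect multicollinearity alongside an intercept.
full = pd.get_dummies(colors, columns=["color"]).astype(int)

# drop_first=True drops the first category in sorted order ("Blue").
# A row with color_Green = 0 and color_Red = 0 is therefore Blue.
reduced = pd.get_dummies(colors, columns=["color"], drop_first=True).astype(int)

print(full)
print(reduced)
```

No information is lost: the dropped category is still recoverable as the row where all remaining dummy columns are 0.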
7. Common Mistakes
- Fitting the scaler on the Test Set: In machine learning, you must fit the StandardScaler on your Training data only, and then use .transform() on your Test data. If you fit_transform() the entire dataset before splitting, information from the test set leaks into the training process (Data Leakage).
- One-Hot Encoding high-cardinality columns: Do not one-hot encode a user_id or city column that has 5,000 unique values. You will create a dataframe with 5,000 new columns, which can exhaust your memory.
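The leakage-safe order of operations for the first mistake can be sketched as follows: split first, fit the scaler on the training portion, then reuse the fitted scaler on the test portion. The data here is randomly generated for illustration, and the split size and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = rng.integers(0, 2, size=100)

# 1. Split BEFORE any scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()

# 2. Learn the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)

# 3. Apply the training statistics to the test data (no refitting).
X_test_scaled = scaler.transform(X_test)
```

The test set will generally not have a mean of exactly 0 after scaling. That is expected, because it was scaled with the training set's statistics rather than its own.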
8. MCQs
What is the goal of Standardization (Z-Score)?
Which scaler ensures all data falls precisely between 0 and 1?
Why is scaling numerical data important for machine learning?
A categorical column contains "Low", "Medium", "High". Which encoding is best?
A categorical column contains "Dog", "Cat", "Bird". Which encoding is best?
What Pandas function is used for One-Hot Encoding?
What is the Dummy Variable Trap?
How do you avoid the Dummy Variable Trap in Pandas?
Which transformation is best for handling heavily right-skewed data like income?
Data Leakage occurs when you:
9. Interview Questions
- Q: Explain the difference between Normalization (Min-Max) and Standardization (Z-score). When would you use one over the other?
- Q: What is the Dummy Variable Trap, and how do you avoid it when preparing data for a Linear Regression model?
10. Summary
Data transformation prepares clean data for algorithms. Use StandardScaler to center data around zero (best for regression/SVM). Use MinMaxScaler to squeeze data between 0 and 1 (best for neural networks/KNN). Categorical data must be converted to numbers: use Label Encoding for ordinal data (ranked) and One-Hot Encoding (pd.get_dummies(drop_first=True)) for nominal data (unranked).