Data Cleaning
CHAPTER 14 Beginner

Data Transformation and Standardization

Updated: May 18, 2026

1. Chapter Introduction

Cleaning fixes *errors*. Transformation changes the *shape* and *scale* of clean data so algorithms can understand it. If you feed a machine learning model a dataset where "Age" ranges from 18-80 and "Salary" ranges from 30,000-150,000, many algorithms (especially distance-based and gradient-based ones) will treat Salary as far more important simply because its numbers are thousands of times bigger. This chapter teaches you how to scale numbers and encode text categories.
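To make the scale problem concrete, here is a minimal sketch using three hypothetical people (the values are invented for illustration). It computes plain Euclidean distances on unscaled [age, salary] pairs, the kind of distance a K-Nearest Neighbors model would rely on.

```python
import numpy as np

# Hypothetical people as [age, salary]; values are made up for illustration
person_a = np.array([25, 50_000])
person_b = np.array([60, 51_000])   # 35 years older, salary differs by 1,000
person_c = np.array([26, 60_000])   # 1 year older, salary differs by 10,000

# Euclidean distance on the raw, unscaled features
dist_ab = np.linalg.norm(person_a - person_b)
dist_ac = np.linalg.norm(person_a - person_c)

# Fraction of the squared a-b distance that comes from the age gap
age_share_ab = (person_a[0] - person_b[0]) ** 2 / dist_ab ** 2

print("Distance a-b:", round(dist_ab, 1))
print("Distance a-c:", round(dist_ac, 1))
print("Share of a-b distance due to age:", round(age_share_ab, 4))
```

Even though person b is 35 years older than person a, the unscaled distance says b is much "closer" to a than c is, because the salary gap swamps the age gap.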

2. Standardization (Z-Score Scaling)

Standardization rescales each column so it has a mean of 0 and a standard deviation of 1, using z = (x - mean) / std. It centers the data around zero. This is the preferred method for algorithms like Linear Regression, Logistic Regression, and SVMs.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 80000, 110000, 150000]
})

print("=== RAW DATA ===")
print(df)

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
# (Returns a numpy array, so we wrap it back in a DataFrame)
scaled_data = scaler.fit_transform(df)
df_standardized = pd.DataFrame(scaled_data, columns=df.columns)

print("\n=== STANDARDIZED (Z-SCORE) ===")
print(df_standardized)
# Each column now has mean 0 and standard deviation 1
# (age spans about -1.41 to +1.41, salary about -1.10 to +1.65)
```
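The z-score can also be computed by hand, which helps show what the scaler is doing. The sketch below assumes the same toy values as the example above and uses a separate name, `df_demo`, so it can run on its own. One detail worth knowing: StandardScaler divides by the population standard deviation (ddof=0), while pandas' `.std()` defaults to the sample version (ddof=1).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_demo = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 80000, 110000, 150000]
})

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual_z = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

# The same computation through scikit-learn
sklearn_z = StandardScaler().fit_transform(df_demo)

print(manual_z.round(3))
print("Matches StandardScaler:", np.allclose(manual_z.to_numpy(), sklearn_z))
```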

3. Normalization (Min-Max Scaling)

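Min-Max scaling compresses each column into a fixed range, usually between 0 and 1, using (x - min) / (max - min). This is often preferred for Neural Networks and algorithms that don't assume a normal distribution (like K-Nearest Neighbors). Because the minimum and maximum define the range, a single extreme outlier can squeeze all the other values close together.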

```python
from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
minmax = MinMaxScaler()

# Fit and transform
normalized_data = minmax.fit_transform(df)
df_normalized = pd.DataFrame(normalized_data, columns=df.columns)

print("\n=== NORMALIZED (MIN-MAX) ===")
print(df_normalized)
# The lowest value becomes 0.0, the highest becomes 1.0
```
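As a rough check of the formula, the sketch below recomputes Min-Max scaling by hand on the same toy values (held in a separate `df_demo` so the block runs on its own). It also scales one hypothetical new row whose values lie above the fitted maximum, to show that the 0 to 1 range only holds for the data the scaler was fitted on.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_demo = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 80000, 110000, 150000]
})

# Manual min-max: (x - min) / (max - min), computed per column
col_min = df_demo.min()
col_max = df_demo.max()
manual_minmax = (df_demo - col_min) / (col_max - col_min)

# The same computation through scikit-learn
minmax_demo = MinMaxScaler()
sklearn_minmax = minmax_demo.fit_transform(df_demo)
print("Matches MinMaxScaler:", np.allclose(manual_minmax.to_numpy(), sklearn_minmax))

# A hypothetical new row above the fitted maximum lands outside [0, 1]
new_row = pd.DataFrame({'age': [50], 'salary': [200000]})
new_scaled = minmax_demo.transform(new_row)
print(new_scaled)
```

Here the new age of 50 maps to 1.25 and the new salary to 1.5, so downstream code should not assume transformed values are always inside the 0 to 1 range.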

4. Handling Skewed Data (Log Transformation)

As discussed in Chapter 8, financial data (salaries, prices) is usually heavily right-skewed. A log transformation compresses large values far more than small ones, which pulls in the long right tail and makes the distribution more bell-shaped.

```python
# Create skewed data: most incomes are modest, two are very large
skewed_df = pd.DataFrame({'income': [30_000, 40_000, 45_000, 50_000, 500_000, 1_200_000]})

# Apply log1p (log(1+x)) to handle potential zeros
skewed_df['log_income'] = np.log1p(skewed_df['income'])
print(skewed_df)
```
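One way to check whether the transformation helped is to compare skewness before and after. The sketch below is illustrative only: it uses the six income values from the example above, and with so few points the skewness numbers are rough. It also shows that `np.expm1` undoes `np.log1p`, which matters when predictions need to be converted back to the original units.

```python
import numpy as np
import pandas as pd

income = pd.Series([30_000, 40_000, 45_000, 50_000, 500_000, 1_200_000], name='income')
log_income = np.log1p(income)

# Skewness near 0 means roughly symmetric; large positive means a long right tail
skew_before = income.skew()
skew_after = log_income.skew()
print("Skewness before log1p:", round(skew_before, 2))
print("Skewness after log1p: ", round(skew_after, 2))

# np.expm1 is the inverse of np.log1p, so the original values can be recovered
recovered = np.expm1(log_income)
print("Round trip recovers the original:", np.allclose(recovered, income))
```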

5. Encoding Categorical Data

Machine learning models only understand numbers. You cannot feed them a column that says "Red", "Green", "Blue". You must encode them.

1. Label Encoding (For Ordinal Data)

Ordinal data has an inherent order (e.g., Small, Medium, Large), so each category can be mapped to an integer that preserves that order.

```python
sizes = pd.DataFrame({'size': ['Medium', 'Large', 'Small', 'Medium']})

# Create a mapping dictionary based on logical order
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}

# Map the values
sizes['size_encoded'] = sizes['size'].map(size_mapping)
print("\n=== LABEL ENCODING ===")
print(sizes)
```
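An alternative to a hand-written dictionary is an ordered pandas Categorical, sketched below under the assumption that the full list of sizes is known in advance. Note that `.cat.codes` starts at 0 rather than 1, and that a value missing from the declared categories becomes -1, whereas `.map()` with a dictionary would silently produce NaN for it.

```python
import pandas as pd

sizes_demo = pd.DataFrame({'size': ['Medium', 'Large', 'Small', 'Medium']})

# Declare the logical order explicitly, from smallest to largest
size_order = ['Small', 'Medium', 'Large']
sizes_demo['size'] = pd.Categorical(sizes_demo['size'], categories=size_order, ordered=True)

# .cat.codes yields 0-based integers that follow the declared order
sizes_demo['size_encoded'] = sizes_demo['size'].cat.codes
print(sizes_demo)

# Ordered categoricals also support comparisons directly
larger_than_small = sizes_demo['size'] > 'Small'
print(larger_than_small.tolist())
```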

2. One-Hot Encoding (For Nominal Data)

Nominal data has NO order (e.g., Colors, Cities). If you assign Red=1, Green=2, Blue=3, the algorithm will assume Blue is "greater than" Red. To fix this, we create binary columns for each category.

```python
colors = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red']})

# Pandas get_dummies performs One-Hot Encoding
colors_encoded = pd.get_dummies(colors, columns=['color'])

print("\n=== ONE-HOT ENCODING ===")
print(colors_encoded)
# Result:
#    color_Blue  color_Green  color_Red
# 0       False        False       True
# 1       False         True      False
# 2        True        False      False
# 3       False        False       True
```

*(Note: In ML pipelines, you typically convert the True/False booleans to 1/0, either by appending .astype(int) or by passing dtype=int to get_dummies.)*

6. The Dummy Variable Trap

When you use One-Hot Encoding on a column with 3 categories (Red, Green, Blue), you actually only need 2 columns to represent all the information. If color_Blue is 0 and color_Green is 0, it *must* be Red.

Leaving all 3 columns in creates perfect multicollinearity: the three columns always sum to 1, which duplicates the model's intercept. Unregularized linear models then cannot determine a unique set of coefficients.

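```python
# Drop the first category to avoid the Dummy Variable Trap
# (drop_first=True removes the first category in sorted order, here Blue,
# which becomes the implicit baseline; dtype=int gives 1/0 instead of True/False)
colors_safe = pd.get_dummies(colors, columns=['color'], drop_first=True, dtype=int)
print("\n=== ONE-HOT ENCODING (DROP FIRST) ===")
print(colors_safe)
```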
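The redundancy can be checked numerically. The sketch below builds the design matrix a linear model with an intercept would see, using a hypothetical helper `design_matrix` written for this illustration, and compares the number of columns with the matrix rank. When the rank is lower than the column count, at least one column is a linear combination of the others.

```python
import numpy as np
import pandas as pd

colors_demo = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red']})

def design_matrix(dummies):
    """Prepend an intercept column of ones, as a linear model with a bias term would."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy(dtype=float)])

full = design_matrix(pd.get_dummies(colors_demo, columns=['color'], dtype=int))
reduced = design_matrix(
    pd.get_dummies(colors_demo, columns=['color'], drop_first=True, dtype=int)
)

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)

print("All dummies: columns =", full.shape[1], "rank =", full_rank)
print("drop_first:  columns =", reduced.shape[1], "rank =", reduced_rank)
```

With all three dummy columns the matrix has 4 columns but rank 3. After `drop_first=True` it has 3 columns and rank 3, so the redundancy is gone.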

7. Common Mistakes

  • Fitting the scaler on the Test Set: In machine learning, you must fit the StandardScaler on your Training data only, and then use .transform() on your Test data. If you call fit_transform() on the entire dataset before splitting, the test set's mean and standard deviation leak into the training process (Data Leakage).
  • One-Hot Encoding high-cardinality columns: Do not one-hot encode a user_id or city column that has 5,000 unique values. You will create a DataFrame with 5,000 new columns, which can exhaust your memory.
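The first mistake above can be avoided with a split-then-fit pattern, sketched here on synthetic data. The random ages and salaries, the 80/20 split, and the seeds are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'age': rng.integers(18, 80, size=100),
    'salary': rng.integers(30_000, 150_000, size=100),
})

# Split first, so the scaler never sees the test rows
train, test = train_test_split(data, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from the training rows only
test_scaled = scaler.transform(test)        # reuse those statistics; never refit on test

print("Train means after scaling:", train_scaled.mean(axis=0).round(3))
print("Test means after scaling: ", test_scaled.mean(axis=0).round(3))
```

The training columns come out with a mean of 0, while the test columns are usually close to 0 but not exactly 0. That is expected: the test set is scaled with the training statistics rather than its own.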

8. MCQs

Question 1

What is the goal of Standardization (Z-Score)?

Question 2

Which scaler ensures all data falls precisely between 0 and 1?

Question 3

Why is scaling numerical data important for machine learning?

Question 4

A categorical column contains "Low", "Medium", "High". Which encoding is best?

Question 5

A categorical column contains "Dog", "Cat", "Bird". Which encoding is best?

Question 6

What Pandas function is used for One-Hot Encoding?

Question 7

What is the Dummy Variable Trap?

Question 8

How do you avoid the Dummy Variable Trap in Pandas?

Question 9

Which transformation is best for handling heavily right-skewed data like income?

Question 10

Data Leakage occurs when you:

9. Interview Questions

  • Q: Explain the difference between Normalization (Min-Max) and Standardization (Z-score). When would you use one over the other?
  • Q: What is the Dummy Variable Trap, and how do you avoid it when preparing data for a Linear Regression model?

10. Summary

Data transformation prepares clean data for algorithms. Use StandardScaler to center data around zero (best for regression/SVM). Use MinMaxScaler to squeeze data between 0 and 1 (best for neural networks/KNN). Categorical data must be converted to numbers: use Label Encoding for ordinal data (ranked) and One-Hot Encoding (pd.get_dummies(..., drop_first=True)) for nominal data (unranked).

11. Next Chapter Recommendation

In Chapter 15: Exploratory Data Analysis for Cleaning, we will learn how to use visual profiling and descriptive statistics to uncover hidden data quality issues that programmatic checks might miss.
