CHAPTER 14 Beginner

Data Transformation and Standardization

Updated: May 18, 2026

1. Chapter Introduction

Cleaning fixes *errors*. Transformation changes the *shape* and *scale* of clean data so algorithms can use it. If you feed a machine learning model a dataset where "Age" ranges from 18 to 80 and "Salary" ranges from 30,000 to 150,000, any algorithm that relies on distances or gradient magnitudes will let Salary dominate simply because its numbers are orders of magnitude bigger, not because it is more informative. This chapter teaches you how to scale numbers and encode text categories.
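To make the scale problem concrete, here is a minimal sketch using three made-up people (the names and values are hypothetical, chosen only for illustration). It compares Euclidean distances on the raw values, which is the kind of calculation a distance-based algorithm performs.

```python
import numpy as np

# Hypothetical people described as (age, salary); values are invented for illustration
alice = np.array([25, 50_000])
bob = np.array([60, 51_000])    # very different age, similar salary
carol = np.array([26, 60_000])  # similar age, different salary

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

dist_alice_bob = euclidean(alice, bob)
dist_alice_carol = euclidean(alice, carol)

print(f"Alice-Bob:   {dist_alice_bob:,.1f}")
print(f"Alice-Carol: {dist_alice_carol:,.1f}")
# On raw values Bob looks far closer to Alice than Carol does,
# even though Bob is 35 years older: the salary column dominates.
```

The 35-year age gap contributes almost nothing to the distance, which is why the features need to be brought onto comparable scales first.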

2. Standardization (Z-Score Scaling)

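Standardization rescales each column so it has a mean of 0 and a standard deviation of 1, by subtracting the column mean and dividing by the column standard deviation. It centers the data around zero. This is commonly the preferred method for algorithms like Linear Regression, Logistic Regression, and SVMs, especially when regularization or gradient-based optimization is involved.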

python

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 80000, 110000, 150000]
})

print("=== RAW DATA ===")
print(df)

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
# (Returns a numpy array, so we wrap it back in a DataFrame)
scaled_data = scaler.fit_transform(df)
df_standardized = pd.DataFrame(scaled_data, columns=df.columns)

print("\n=== STANDARDIZED (Z-SCORE) ===")
print(df_standardized)
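# Notice both columns now have mean 0 and standard deviation 1.
# Their ranges are comparable but not identical:
# age spans about -1.41 to +1.41, salary about -1.10 to +1.65.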
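As a cross-check, the same result can be reproduced by hand with the z-score formula, z = (x - mean) / std. The sketch below rebuilds the small age/salary table under a new name (`df_demo`, chosen here so the example is self-contained) and compares the manual calculation with StandardScaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_demo = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 80000, 110000, 150000],
})

# z = (x - mean) / std, computed column by column.
# ddof=0 selects the population standard deviation, which is what
# StandardScaler uses; pandas' own .std() defaults to ddof=1.
manual_z = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

sklearn_z = StandardScaler().fit_transform(df_demo)

print(manual_z.round(3))
print("Matches StandardScaler:", np.allclose(manual_z.to_numpy(), sklearn_z))
```

The `ddof=0` detail matters: with pandas' default `ddof=1` the manual numbers would differ slightly from the scaler's output.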

3. Normalization (Min-Max Scaling)

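Min-Max scaling compresses each column into a fixed range, usually 0 to 1. It is often preferred for Neural Networks and for algorithms that don't assume a normal distribution (like K-Nearest Neighbors). Because the range is defined by the minimum and maximum, a single extreme outlier can squeeze all the other values into a narrow band.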

python

from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
minmax = MinMaxScaler()

# Fit and transform
normalized_data = minmax.fit_transform(df)
df_normalized = pd.DataFrame(normalized_data, columns=df.columns)

print("\n=== NORMALIZED (MIN-MAX) ===")
print(df_normalized)
# The lowest value becomes 0.0, the highest becomes 1.0
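The formula behind Min-Max scaling is x_scaled = (x - min) / (max - min). The following sketch applies it by hand and then shows one consequence worth knowing: a value outside the range seen during fitting lands outside 0 to 1. The `train` and `new_rows` frames are hypothetical, introduced only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({'salary': [50000, 60000, 80000, 110000, 150000]})

# x_scaled = (x - min) / (max - min), computed per column
manual_scaled = (train - train.min()) / (train.max() - train.min())
print(manual_scaled['salary'].tolist())
# → [0.0, 0.1, 0.3, 0.6, 1.0]

# Fit on the training data, then transform a hypothetical unseen row
scaler = MinMaxScaler().fit(train)
new_rows = pd.DataFrame({'salary': [200000]})
scaled_new = scaler.transform(new_rows)

# 200000 is above the fitted maximum, so the result is above 1 (about 1.5)
print(scaled_new[0, 0])
```

If values outside the fitted range are a concern, MinMaxScaler accepts `clip=True` to cap transformed values at the range boundaries; whether clipping is appropriate depends on the model and the data.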

4. Handling Skewed Data (Log Transformation)

As discussed in Chapter 8, financial data (salaries, prices) is usually heavily right-skewed. A log transformation squashes the massive outliers, making the distribution more bell-shaped.

python

# Create skewed data: most incomes are modest, two are extreme
skewed_df = pd.DataFrame({'income': [30_000, 40_000, 45_000, 50_000, 500_000, 1_200_000]})

# Apply log1p (log(1+x)) to handle potential zeros
skewed_df['log_income'] = np.log1p(skewed_df['income'])
print(skewed_df)
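To see what the log transformation does to the shape of the data, the sketch below measures skewness before and after, and shows how to undo the transformation with `np.expm1`. The income values are illustrative, not real data.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([30_000, 40_000, 45_000, 50_000, 500_000, 1_200_000], name='income')

log_incomes = np.log1p(incomes)

# Series.skew() measures asymmetry: values near 0 indicate a roughly symmetric distribution
skew_before = incomes.skew()
skew_after = log_incomes.skew()
print("Skew before:", round(skew_before, 2))
print("Skew after: ", round(skew_after, 2))

# expm1 is the mathematical inverse of log1p, so values on the log scale
# (for example, model predictions) can be mapped back to the original units
recovered = np.expm1(log_incomes)
print("Round trip OK:", np.allclose(recovered, incomes))
```

With only six points the transformed data is still not a bell curve, but the skew shrinks noticeably, which is the effect the transformation is meant to have.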

5. Encoding Categorical Data

Most machine learning models only accept numeric input. You cannot feed them a column that says "Red", "Green", "Blue". You must encode it first.

1. Label Encoding (For Ordinal Data): Ordinal data has an inherent order (e.g., Small, Medium, Large), so each category is mapped to an integer that preserves that order.

python

sizes = pd.DataFrame({'size': ['Medium', 'Large', 'Small', 'Medium']})

# Create a mapping dictionary based on logical order
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}

# Map the values
sizes['size_encoded'] = sizes['size'].map(size_mapping)
print("\n=== LABEL ENCODING ===")
print(sizes)
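An alternative to a hand-written dictionary is an ordered pandas Categorical, which stores the order alongside the data. This is a sketch of that approach, not a replacement for the mapping above; the `sizes_demo` and `size_order` names are introduced here for illustration.

```python
import pandas as pd

sizes_demo = pd.DataFrame({'size': ['Medium', 'Large', 'Small', 'Medium']})

# Declare the order explicitly; categories are listed from lowest to highest
size_order = ['Small', 'Medium', 'Large']
sizes_demo['size'] = pd.Categorical(sizes_demo['size'], categories=size_order, ordered=True)

# .cat.codes gives 0-based integer codes that follow the declared order
sizes_demo['size_code'] = sizes_demo['size'].cat.codes
print(sizes_demo['size_code'].tolist())  # → [1, 2, 0, 1]

# Ordered categoricals also support comparisons against a category
is_bigger_than_small = (sizes_demo['size'] > 'Small').tolist()
print(is_bigger_than_small)  # → [True, True, False, True]
```

Note that the codes start at 0 rather than 1, and any value not listed in `size_order` becomes missing (code -1), so it is worth checking for unexpected categories before encoding.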

2. One-Hot Encoding (For Nominal Data): Nominal data has NO order (e.g., Colors, Cities). If you assign Red=1, Green=2, Blue=3, the algorithm will assume Blue is "greater than" Red and that Green sits halfway between them. To fix this, we create one binary column per category.

python

colors = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red']})

# Pandas get_dummies performs One-Hot Encoding
colors_encoded = pd.get_dummies(colors, columns=['color'])

print("\n=== ONE-HOT ENCODING ===")
print(colors_encoded)
# Result (pandas 2.x returns booleans; older versions return 0/1):
#    color_Blue  color_Green  color_Red
# 0       False        False       True
# 1       False         True      False
# 2        True        False      False
# 3       False        False       True

*(Note: In ML pipelines, you typically convert the True/False booleans to 1/0, either by appending .astype(int) to the result or by passing dtype=int to get_dummies.)*

6. The Dummy Variable Trap

When you use One-Hot Encoding on a column with 3 categories (Red, Green, Blue), you actually only need 2 columns to represent all the information. If color_Blue is 0 and color_Green is 0, it *must* be Red. Which category you drop does not matter; the dropped one simply becomes the baseline.

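Leaving all 3 columns in creates perfect multicollinearity: the three columns always sum to 1, which duplicates the intercept term, so an unregularized linear model has no unique set of coefficients.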

python

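# Drop the first category to avoid the Dummy Variable Trap.
# drop_first=True removes the first category in sorted order (here Blue),
# so a row with color_Green = 0 and color_Red = 0 is Blue.
colors_safe = pd.get_dummies(colors, columns=['color'], drop_first=True, dtype=int)
print("\n=== ONE-HOT ENCODING (DROP FIRST) ===")
print(colors_safe)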
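The multicollinearity can be checked numerically. The sketch below builds the matrix a linear model with an intercept would see, once with all three dummy columns and once with the first dropped, and compares the matrix rank with the number of columns. The `design_matrix` helper is a hypothetical name written for this illustration.

```python
import numpy as np
import pandas as pd

colors_demo = pd.DataFrame({'color': ['Red', 'Green', 'Blue', 'Red']})

def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy(dtype=float)])

all_dummies = pd.get_dummies(colors_demo, columns=['color'], dtype=int)
safe_dummies = pd.get_dummies(colors_demo, columns=['color'], drop_first=True, dtype=int)

X_all = design_matrix(all_dummies)    # 4 columns: intercept + 3 dummies
X_safe = design_matrix(safe_dummies)  # 3 columns: intercept + 2 dummies

# A matrix has full column rank only when rank == number of columns
rank_all = np.linalg.matrix_rank(X_all)
rank_safe = np.linalg.matrix_rank(X_safe)
print(X_all.shape[1], rank_all)    # → 4 3
print(X_safe.shape[1], rank_safe)  # → 3 3
```

With all three dummies the matrix has 4 columns but rank 3, meaning one column is redundant. After dropping a category the rank equals the column count, so the coefficients are uniquely determined.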

7. Common Mistakes

Fitting the scaler on the Test Set: In machine learning, you must fit the StandardScaler on your Training data only, and then use .transform() on your Test data. If you call fit_transform() on the entire dataset before splitting, the test set's mean and standard deviation leak into the training process (Data Leakage), and your evaluation scores become overly optimistic.

One-Hot Encoding high-cardinality columns: Do not one-hot encode a user_id or city column that has 5,000 unique values. You will create a dataframe with 5,000 new, mostly-zero columns, which wastes memory and can slow down or degrade model training.
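For the first mistake above, the leak-free order of operations is: split first, fit the scaler on the training rows, then reuse it on the test rows. This is a minimal sketch on synthetic data generated only for illustration; the variable names are not from the chapter.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, generated only for illustration
rng = np.random.default_rng(seed=0)
X = rng.normal(loc=50, scale=10, size=(100, 2))

# 1. Split BEFORE any scaling
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# 2. Learn mean/std from the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse those training statistics on the test rows
X_test_scaled = scaler.transform(X_test)

# The training set is centered at 0; the test set is only approximately centered,
# which is expected because its own statistics were never used.
print("Train means:", X_train_scaled.mean(axis=0).round(3))
print("Test means: ", X_test_scaled.mean(axis=0).round(3))
```

The test means will not be exactly zero, and that is the point: the test set is treated as unseen data, scaled with what was learned from training alone.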

8. MCQs

Question 1

What is the goal of Standardization (Z-Score)?

Question 2

Which scaler ensures all data falls precisely between 0 and 1?

Question 3

Why is scaling numerical data important for machine learning?

Question 4

A categorical column contains "Low", "Medium", "High". Which encoding is best?

Question 5

A categorical column contains "Dog", "Cat", "Bird". Which encoding is best?

Question 6

What Pandas function is used for One-Hot Encoding?

Question 7

What is the Dummy Variable Trap?

Question 8

How do you avoid the Dummy Variable Trap in Pandas?

Question 9

Which transformation is best for handling heavily right-skewed data like income?

Question 10

Data Leakage occurs when you:

9. Interview Questions

Q: Explain the difference between Normalization (Min-Max) and Standardization (Z-score). When would you use one over the other?

Q: What is the Dummy Variable Trap, and how do you avoid it when preparing data for a Linear Regression model?

10. Summary

Data transformation prepares clean data for algorithms. Use StandardScaler to center data around zero (best for regression/SVM). Use MinMaxScaler to squeeze data between 0 and 1 (best for neural networks/KNN). Categorical data must be converted to numbers: use Label Encoding for ordinal data (ranked) and One-Hot Encoding (pd.get_dummies(..., drop_first=True)) for nominal data (unranked).

11. Next Chapter Recommendation

In Chapter 15: Exploratory Data Analysis for Cleaning, we will learn how to use visual profiling and descriptive statistics to uncover hidden data quality issues that programmatic checks might miss.
