CHAPTER 26 Beginner

Preparing Data for Machine Learning

Updated: May 18, 2026

1. Chapter Introduction

Raw data can rarely be fed directly into ML models. Data preparation (feature engineering, encoding, scaling, and splitting) largely determines how well a model performs. Pandas and NumPy cover most of this preprocessing pipeline, and scikit-learn supplies the scalers and splitting utilities used in this chapter.

2. Feature Engineering

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve', 'Frank'],
    'Age': [28, 35, 24, 45, 31, 39],
    'Salary': [55000, 87000, 42000, 105000, 72000, 91000],
    'Experience': [3, 8, 1, 18, 6, 12],
    'Dept': ['Eng', 'Mkt', 'Eng', 'Mgt', 'HR', 'Eng'],
    'Last_Review': ['2024-06-15', '2023-11-30', '2024-09-01', '2023-08-15', '2024-03-20', '2024-01-10'],
    'Performance': ['Good', 'Excellent', 'Average', 'Excellent', 'Good', 'Excellent']
})

# Feature 1: Date-derived features
df['Last_Review'] = pd.to_datetime(df['Last_Review'])
# A fixed reference date keeps the example reproducible
df['Days_Since_Review'] = (pd.Timestamp('2024-12-01') - df['Last_Review']).dt.days
df['Months_Since_Review'] = df['Days_Since_Review'] // 30  # Approximation: 30-day months

# Feature 2: Ratio features
# clip(lower=1) treats 0 years of experience as 1 to avoid division by zero
df['Salary_Per_Year'] = df['Salary'] / df['Experience'].clip(lower=1)  # Salary per year of experience
df['Age_Experience_Ratio'] = df['Age'] / df['Experience'].clip(lower=1)

# Feature 3: Binning
# pd.cut uses fixed bin edges; pd.qcut builds equal-sized quantile groups
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100],
                         labels=['Young', 'Mid', 'Senior'])
df['Salary_Quartile'] = pd.qcut(df['Salary'], q=4,
                                labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Feature 4: Interaction features
df['Exp_Salary_Score'] = df['Experience'] * df['Salary'] / 100000

print(df[['Name', 'Days_Since_Review', 'Salary_Per_Year', 'Age_Group', 'Exp_Salary_Score']].round(2))
```
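The binning step depends on how `pd.cut` treats values that fall exactly on a bin edge. The sketch below uses a few hypothetical ages (not part of the chapter's dataset) to show the default right-closed behaviour next to `right=False`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([30, 31, 40, 41])

# Default: right-closed intervals (0, 30], (30, 40], (40, 100].
default_groups = pd.cut(ages, bins=[0, 30, 40, 100],
                        labels=['Young', 'Mid', 'Senior'])

# right=False: left-closed intervals [0, 30), [30, 40), [40, 100).
left_closed_groups = pd.cut(ages, bins=[0, 30, 40, 100],
                            labels=['Young', 'Mid', 'Senior'], right=False)

print(pd.DataFrame({'Age': ages,
                    'Default': default_groups,
                    'Right_False': left_closed_groups}))
```

With the default setting an age of exactly 30 is labelled Young, while `right=False` moves it into Mid, so it is worth deciding which edge each group should own before binning.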

3. Encoding Categorical Variables

```python
# sklearn equivalents; the examples below use pandas only
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Method 1: Label (ordinal) encoding, for variables with a natural order
perf_map = {'Average': 0, 'Good': 1, 'Excellent': 2}
df['Performance_Encoded'] = df['Performance'].map(perf_map)

# Method 2: One-Hot Encoding, for nominal variables (no order)
df_encoded = pd.get_dummies(df, columns=['Dept'], prefix='Dept', drop_first=False)
print("After OHE:")
dept_cols = [c for c in df_encoded.columns if c.startswith('Dept_')]
print(df_encoded[['Name'] + dept_cols])

# Method 3: Binary encoding (for high-cardinality features)
# Method 4: Target encoding (for high-cardinality features with a numeric target)
# Each Dept is replaced by the mean Salary of that department.
# Note: in a real pipeline, compute these means on the training set only;
# otherwise the target leaks into the features.
target_encode = df.groupby('Dept')['Salary'].mean()
df['Dept_Target_Encoded'] = df['Dept'].map(target_encode)
print("\nTarget encoding (Dept → mean Salary):")
print(df[['Name', 'Dept', 'Dept_Target_Encoded']])
```
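Binary encoding (Method 3) is named above without code. One way to picture it is the hand-rolled sketch below, which numbers the categories and then writes each number in binary digits, one digit per column. The helper `binary_encode` is hypothetical and written only for illustration; dedicated encoder libraries may number categories differently.

```python
import pandas as pd

def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Encode each category as the binary digits of its integer code."""
    # Codes start at 0 and follow sorted category order.
    # Missing values would receive code -1 and are not handled here.
    codes, categories = pd.factorize(series, sort=True)
    # Bits needed to represent the largest code (at least one column).
    n_bits = max(1, (len(categories) - 1).bit_length())
    # Most significant bit first; each bit becomes one 0/1 column.
    bits = {
        f'{prefix}_bit{i}': (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)

# Same Dept values as the chapter's example data.
dept = pd.Series(['Eng', 'Mkt', 'Eng', 'Mgt', 'HR', 'Eng'], name='Dept')
dept_binary = binary_encode(dept, prefix='Dept')
print(pd.concat([dept, dept_binary], axis=1))
```

Four departments fit in two columns here, where one-hot encoding needs four. The saving grows with cardinality, since the column count scales with the logarithm of the number of categories.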

4. Feature Scaling / Normalization

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df_ml = pd.DataFrame({
    'Age': [28, 35, 24, 45, 31, 39],
    'Salary': [55000, 87000, 42000, 105000, 72000, 91000],
    'Experience': [3, 8, 1, 18, 6, 12]
})

# Min-Max Scaling: maps each column to [0, 1]; best when there are no strong outliers
scaler_mm = MinMaxScaler()
df_minmax = pd.DataFrame(
    scaler_mm.fit_transform(df_ml),
    columns=[f'{c}_MinMax' for c in df_ml.columns]
)

# Standard Scaling (Z-score): mean=0, std=1; less distorted by outliers
# than Min-Max, though not immune to them
scaler_std = StandardScaler()
df_standard = pd.DataFrame(
    scaler_std.fit_transform(df_ml),
    columns=[f'{c}_Std' for c in df_ml.columns]
)

# Manual (no sklearn), using Pandas arithmetic
df_manual_minmax = (df_ml - df_ml.min()) / (df_ml.max() - df_ml.min())
# ddof=0 (population std) matches StandardScaler; the pandas default is ddof=1
df_manual_zscore = (df_ml - df_ml.mean()) / df_ml.std(ddof=0)

print("Original:")
print(df_ml)
print("\nMin-Max Scaled:")
print(df_minmax.round(3))
print("\nStandardized:")
print(df_standard.round(3))
```
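The comments above tie the choice of scaler to outliers. As a rough illustration, the sketch below scales a hypothetical salary column (values invented for this example) that contains one extreme value and compares how tightly the typical values end up packed under each method.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical salaries (in thousands) with one extreme value at the end.
salaries = np.array([[40.0], [50.0], [60.0], [70.0], [1000.0]])

minmax = MinMaxScaler().fit_transform(salaries).ravel()
zscore = StandardScaler().fit_transform(salaries).ravel()

print("Min-Max:", minmax.round(3))
print("Z-score:", zscore.round(3))

# Spread of the four typical values after each transformation.
print("Min-Max spread of typical values:", (minmax[:4].max() - minmax[:4].min()).round(3))
print("Z-score spread of typical values:", (zscore[:4].max() - zscore[:4].min()).round(3))
```

In both outputs the four typical salaries are squeezed into a narrow band because the outlier stretches the range and inflates the standard deviation. When outliers are severe, it is usually worth handling them (clipping, transforming, or removing) before choosing a scaler.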

5. Train-Test Split

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Build final feature matrix: Salary is the target, the rest are features
target = 'Salary'
features = ['Age', 'Experience', 'Days_Since_Review', 'Performance_Encoded']

# Keep only the needed columns (from our df) and drop rows with missing values
df_model = df[features + [target]].dropna()

X = df_model[features]
y = df_model[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,        # 20% test (rounded up to whole rows on small datasets)
    random_state=42,      # Reproducibility
    shuffle=True
)

print(f"Full dataset:  {len(X)} samples")
print(f"Training set:  {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test set:      {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")

# Scale AFTER the split: fit on train only, transform both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)    # Fit + transform on train
X_test_scaled  = scaler.transform(X_test)          # Transform only on test
# NEVER fit on test data: that is data leakage!
```
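The fit-on-train-only rule can also be enforced by wrapping the scaler and a model in a scikit-learn pipeline. The sketch below assumes a `LinearRegression` model purely for illustration (the chapter does not choose a model) and reuses the Age, Experience, and Salary values from the example data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Columns: Age, Experience. Target: Salary. Values from the chapter's data.
X = np.array([[28, 3], [35, 8], [24, 1], [45, 18], [31, 6], [39, 12]], dtype=float)
y = np.array([55000, 87000, 42000, 105000, 72000, 91000], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the data it receives (train only);
# predict() reuses those statistics on the test rows.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

fitted_scaler = model.named_steps['standardscaler']
print("Scaler means (from pipeline):", fitted_scaler.mean_)
print("Training set means          :", X_train.mean(axis=0))
print("Test predictions            :", model.predict(X_test).round(0))
```

Because the scaler only ever sees what is passed to `fit()`, the two printed mean vectors agree, and there is no separate `fit_transform` call that could accidentally be run on the full dataset.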

6. Common Mistakes

Scaling before split (data leakage): calling `fit_transform()` on the full dataset lets test data influence the scaling parameters. Always fit the scaler on the training data only, then apply `transform()` to the test data.

Label encoding nominal categories: using `LabelEncoder` on nominal categories (for example City: 0, 1, 2) implies an order that does not exist. Use `get_dummies()` (One-Hot Encoding) for unordered categories.
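The second mistake is easy to see on a small example. The sketch below uses a hypothetical City column (not part of the chapter's dataset) and compares the integers produced by `LabelEncoder` with the columns produced by `get_dummies()`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal column: the cities have no natural order.
cities = pd.Series(['Paris', 'Tokyo', 'Lima', 'Tokyo'], name='City')

# LabelEncoder numbers the classes in sorted order, so the integers
# carry an alphabetical ranking that means nothing to a model.
encoder = LabelEncoder()
city_codes = encoder.fit_transform(cities)
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# One-hot encoding gives each city its own 0/1 column instead.
city_dummies = pd.get_dummies(cities, prefix='City', dtype=int)
print(city_dummies)
```

A model that treats the integer codes as numbers would see Tokyo as "greater than" Paris, whereas the one-hot columns keep the cities unordered. scikit-learn's documentation also describes `LabelEncoder` as intended for target labels rather than input features.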

7. MCQs

Question 1

One-Hot Encoding is used for which kind of variable?

Question 2

Label Encoding is appropriate for which kind of variable?

Question 3

What does standard scaling (Z-score) produce?

Question 4

When does data leakage occur during scaling?

Question 5

What does `drop_first=True` in `get_dummies()` prevent?

Question 6

Min-Max scaling maps values to which range?

Question 7

What does `test_size=0.2` mean?

Question 8

What does target encoding map each category to?

Question 9

What does `.clip(lower=1)` prevent in the feature engineering example?

Question 10

What does `random_state=42` in `train_test_split()` ensure?

8. Interview Questions

Q: What is the difference between Min-Max scaling and standardization?

Q: What is data leakage and how does it occur in preprocessing?
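For the leakage question, scaling is not the only preprocessing step that can leak. Target encoding computed on the full dataset lets test-set targets shape a feature. The sketch below learns the Dept means from the training rows only. Falling back to the overall training mean for departments missing from the training set is an assumption made here, not something the chapter prescribes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same Dept and Salary values as the chapter's example data.
data = pd.DataFrame({
    'Dept': ['Eng', 'Mkt', 'Eng', 'Mgt', 'HR', 'Eng'],
    'Salary': [55000, 87000, 42000, 105000, 72000, 91000],
})

train, test = train_test_split(data, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Learn the mapping from training rows only.
dept_means = train.groupby('Dept')['Salary'].mean()
global_mean = train['Salary'].mean()  # Assumed fallback for unseen departments

train['Dept_Target_Encoded'] = train['Dept'].map(dept_means)
test['Dept_Target_Encoded'] = test['Dept'].map(dept_means).fillna(global_mean)

print(test[['Dept', 'Salary', 'Dept_Target_Encoded']])
```

The test rows receive values derived entirely from training salaries, which mirrors the fit-on-train, transform-on-test pattern used for the scaler.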

9. Summary

ML preprocessing pipeline: feature engineering (ratios, dates, bins) → encoding (OHE for nominal, label for ordinal, target for high-cardinality) → train/test split → scaling (MinMaxScaler when there are no strong outliers, StandardScaler otherwise), with the scaler fit on the training set only. Strict train/test separation prevents data leakage.

10. Next Chapter Recommendation

In Chapter 27: Real-World Data Science Projects, we build complete analysis systems on real datasets: sales, student performance, COVID, and e-commerce.
