CHAPTER 26 Beginner

Preparing Data for Machine Learning

Updated: May 18, 2026

1. Chapter Introduction

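Raw data can rarely be fed directly into ML models. Data preparation (feature engineering, encoding, scaling, and splitting) turns it into numeric, comparable features and strongly affects model accuracy. Pandas and NumPy cover most of this preprocessing pipeline, with scikit-learn supplying the scalers and splitting helpers used below.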

2. Feature Engineering

python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve', 'Frank'],
    'Age': [28, 35, 24, 45, 31, 39],
    'Salary': [55000, 87000, 42000, 105000, 72000, 91000],
    'Experience': [3, 8, 1, 18, 6, 12],
    'Dept': ['Eng', 'Mkt', 'Eng', 'Mgt', 'HR', 'Eng'],
    'Last_Review': ['2024-06-15', '2023-11-30', '2024-09-01', '2023-08-15', '2024-03-20', '2024-01-10'],
    'Performance': ['Good', 'Excellent', 'Average', 'Excellent', 'Good', 'Excellent']
})

# Feature 1: Date-derived features
df['Last_Review'] = pd.to_datetime(df['Last_Review'])
df['Days_Since_Review'] = (pd.Timestamp('2024-12-01') - df['Last_Review']).dt.days
df['Months_Since_Review'] = df['Days_Since_Review'] // 30  # Approximate: treats every month as 30 days

# Feature 2: Ratio features
df['Salary_Per_Year'] = df['Salary'] / df['Experience'].clip(lower=1)  # Avoid div by zero
df['Age_Experience_Ratio'] = df['Age'] / df['Experience'].clip(lower=1)

# Feature 3: Binning
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 30, 40, 100],
                          labels=['Young', 'Mid', 'Senior'])
df['Salary_Quartile'] = pd.qcut(df['Salary'], q=4,
                                  labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Feature 4: Interaction features
df['Exp_Salary_Score'] = df['Experience'] * df['Salary'] / 100000

print(df[['Name', 'Days_Since_Review', 'Salary_Per_Year', 'Age_Group', 'Exp_Salary_Score']].round(2))
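
To make the binning step more concrete, the sketch below contrasts `pd.cut` (fixed edges) with `pd.qcut` (quantile-based edges) on the same Age values. The names `age_group`, `age_half`, and the `Lower`/`Upper` labels are illustrative assumptions, not part of the chapter's code.

```python
import pandas as pd

ages = pd.Series([28, 35, 24, 45, 31, 39], name="Age")

# pd.cut: fixed bin edges. Intervals are right-inclusive by default,
# so the bins here are (0, 30], (30, 40], (40, 100].
age_group = pd.cut(ages, bins=[0, 30, 40, 100], labels=["Young", "Mid", "Senior"])

# pd.qcut: edges come from quantiles, so each bin holds roughly the same number of rows.
# The labels are hypothetical names chosen for this example.
age_half = pd.qcut(ages, q=2, labels=["Lower", "Upper"])

summary = pd.DataFrame({"Age": ages, "Age_Group": age_group, "Age_Half": age_half})
print(summary)
```

With `pd.cut` the bin sizes depend on where the data falls relative to the edges; with `pd.qcut` the edges move so that the bins stay balanced.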

3. Encoding Categorical Variables

python
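# Method 1: Label (ordinal) encoding, for categories with a natural order.
# An explicit mapping is used instead of sklearn's LabelEncoder, which assigns
# integers alphabetically (Average=0, Excellent=1, Good=2) and would lose the order.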
perf_map = {'Average': 0, 'Good': 1, 'Excellent': 2}
df['Performance_Encoded'] = df['Performance'].map(perf_map)

# Method 2: One-Hot Encoding (for nominal — no order)
df_encoded = pd.get_dummies(df, columns=['Dept'], prefix='Dept', drop_first=False)
print("After OHE:")
dept_cols = [c for c in df_encoded.columns if c.startswith('Dept_')]
print(df_encoded[['Name'] + dept_cols])

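# Method 3: Binary encoding (for high-cardinality features; not shown here)
# Method 4: Target encoding (for high-cardinality features with a numeric target)
# Here each Dept is replaced by its mean Salary. In a real pipeline, compute
# these means on the training split only; using all rows leaks the target.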
target_encode = df.groupby('Dept')['Salary'].mean()
df['Dept_Target_Encoded'] = df['Dept'].map(target_encode)
print("\nTarget encoding (Dept → mean Salary):")
print(df[['Name', 'Dept', 'Dept_Target_Encoded']])
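
Because target encoding uses the target itself, it is usually computed on the training rows only. The following is a minimal sketch under stated assumptions: the positional split (first four rows train, last two test), the `Dept_TE` column name, and the global-mean fallback for unseen categories are illustrative choices, not the chapter's method.

```python
import pandas as pd

df = pd.DataFrame({
    "Dept": ["Eng", "Mkt", "Eng", "Mgt", "HR", "Eng"],
    "Salary": [55000, 87000, 42000, 105000, 72000, 91000],
})

# Hypothetical split for illustration: first four rows train, last two test
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()

# Learn the Dept -> mean Salary mapping from the training rows only
dept_means = train.groupby("Dept")["Salary"].mean()
global_mean = train["Salary"].mean()

# Apply the mapping to both splits. Categories never seen in training
# (HR here) fall back to the global training mean.
train["Dept_TE"] = train["Dept"].map(dept_means)
test["Dept_TE"] = test["Dept"].map(dept_means).fillna(global_mean)

print(test)
```

The test rows never contribute to `dept_means`, so their Salary values cannot influence their own encoded feature.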

4. Feature Scaling / Normalization

python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df_ml = pd.DataFrame({
    'Age': [28, 35, 24, 45, 31, 39],
    'Salary': [55000, 87000, 42000, 105000, 72000, 91000],
    'Experience': [3, 8, 1, 18, 6, 12]
})

# Min-Max Scaling: maps each column to [0, 1]; very sensitive to outliers
scaler_mm = MinMaxScaler()
df_minmax = pd.DataFrame(
    scaler_mm.fit_transform(df_ml),
    columns=[f'{c}_MinMax' for c in df_ml.columns]
)

# Standard Scaling (Z-score): mean=0, std=1; less distorted by outliers than Min-Max, but not robust to them
scaler_std = StandardScaler()
df_standard = pd.DataFrame(
    scaler_std.fit_transform(df_ml),
    columns=[f'{c}_Std' for c in df_ml.columns]
)

# Manual (no sklearn), using Pandas only
df_manual_minmax = (df_ml - df_ml.min()) / (df_ml.max() - df_ml.min())
# ddof=0 matches StandardScaler, which divides by the population standard deviation
df_manual_zscore = (df_ml - df_ml.mean()) / df_ml.std(ddof=0)

print("Original:")
print(df_ml)
print("\nMin-Max Scaled:")
print(df_minmax.round(3))
print("\nStandardized:")
print(df_standard.round(3))
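
One detail worth checking when reproducing the scalers by hand: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below compares both variants against scikit-learn; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df_ml = pd.DataFrame({
    "Age": [28, 35, 24, 45, 31, 39],
    "Salary": [55000, 87000, 42000, 105000, 72000, 91000],
    "Experience": [3, 8, 1, 18, 6, 12],
})

sk_z = StandardScaler().fit_transform(df_ml)

# Population std (ddof=0): the convention StandardScaler follows
manual_z_pop = ((df_ml - df_ml.mean()) / df_ml.std(ddof=0)).to_numpy()
# Sample std (ddof=1): the pandas default
manual_z_sample = ((df_ml - df_ml.mean()) / df_ml.std()).to_numpy()

matches_population = np.allclose(sk_z, manual_z_pop)
matches_sample = np.allclose(sk_z, manual_z_sample)

# Min-max scaling has no such ambiguity
sk_mm = MinMaxScaler().fit_transform(df_ml)
manual_mm = ((df_ml - df_ml.min()) / (df_ml.max() - df_ml.min())).to_numpy()
matches_minmax = np.allclose(sk_mm, manual_mm)

print(matches_population, matches_sample, matches_minmax)
```

On a six-row dataset the two conventions differ by a factor of about 1.1, which is enough to make a manual z-score disagree visibly with the scikit-learn output.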

5. Train-Test Split

python
from sklearn.model_selection import train_test_split

# Columns used for modelling; Salary is included only so it can be split off as the target
model_columns = ['Age', 'Experience', 'Salary', 'Days_Since_Review', 'Performance_Encoded']

# Select those columns from our df and drop rows with missing values
df_model = df[model_columns].dropna()

X = df_model.drop(columns=['Salary'])
y = df_model['Salary']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,        # 20% test (rounded up: 2 of the 6 rows here)
    random_state=42,      # Reproducibility
    shuffle=True
)

print(f"Full dataset:  {len(X)} samples")
print(f"Training set:  {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test set:      {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")

# Scale AFTER split — fit on train only, transform both
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)    # Fit + transform on train
X_test_scaled  = scaler.transform(X_test)          # Transform only on test
# NEVER fit on test — that's data leakage!
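
A quick way to confirm that no test data leaked into the scaler is to compare its learned statistics with the training-set statistics. This is a small self-contained sketch; it rebuilds a reduced feature matrix (Age and Experience only) as an assumption, so it can run without the earlier feature-engineering code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    "Age": [28, 35, 24, 45, 31, 39],
    "Experience": [3, 8, 1, 18, 6, 12],
})
y = pd.Series([55000, 87000, 42000, 105000, 72000, 91000], name="Salary")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only
scaler = StandardScaler().fit(X_train)

# The learned means equal the training-set means, not the full-data means
uses_train_stats = np.allclose(scaler.mean_, X_train.mean().to_numpy())

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training columns are centred at 0; test columns generally are not, which is expected
train_means = X_train_scaled.mean(axis=0)
print(uses_train_stats, train_means, X_test_scaled.mean(axis=0))
```

The scaled test set not having exactly zero mean is not a bug: it reflects that the test rows were treated as unseen data.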

6. Common Mistakes

  • Scaling before split (data leakage): fit_transform() on the full dataset lets test data influence the scaling parameters. Always fit the scaler on training data only.
  • Label encoding nominal categories: Using LabelEncoder on nominal categories (City: 0, 1, 2) implies an order. Use get_dummies() (One-Hot Encoding) for unordered categories.
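
The second mistake can be seen directly in a short sketch. `LabelEncoder` assigns integers in alphabetical order of the class names, so the numbers carry no meaningful ranking; an explicit mapping or one-hot encoding avoids this. The Series names below are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

perf = pd.Series(["Good", "Excellent", "Average", "Excellent", "Good", "Excellent"])

# LabelEncoder sorts classes alphabetically: Average=0, Excellent=1, Good=2
le_codes = LabelEncoder().fit_transform(perf)

# An explicit mapping preserves the intended order: Average < Good < Excellent
perf_map = {"Average": 0, "Good": 1, "Excellent": 2}
ordinal_codes = perf.map(perf_map)

# For nominal data such as Dept, one-hot encoding avoids implying any order
dept = pd.Series(["Eng", "Mkt", "Eng", "Mgt", "HR", "Eng"], name="Dept")
dept_ohe = pd.get_dummies(dept, prefix="Dept")

# drop_first=True removes the first category column (Dept_Eng here)
dept_ohe_reduced = pd.get_dummies(dept, prefix="Dept", drop_first=True)

print(list(le_codes), ordinal_codes.tolist())
print(list(dept_ohe.columns), list(dept_ohe_reduced.columns))
```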

7. MCQs

Question 1

One-Hot Encoding is for?

Question 2

Label Encoding is appropriate for?

Question 3

Standard scaling (Z-score) produces?

Question 4

Data leakage in scaling happens when?

Question 5

drop_first=True in get_dummies prevents?

Question 6

Min-Max scaling maps values to?

Question 7

test_size=0.2 means?

Question 8

Target encoding maps?

Question 9

.clip(lower=1) in feature engineering prevents?

Question 10

random_state=42 in train_test_split ensures?

8. Interview Questions

  • Q: What is the difference between Min-Max scaling and standardization?
  • Q: What is data leakage and how does it occur in preprocessing?

9. Summary

ML preprocessing pipeline: feature engineering (ratios, dates, bins) → encoding (OHE for nominal, an ordered label mapping for ordinal, target for high-cardinality) → train/test split → scaling, with the scaler fit on the training set only (MinMaxScaler for a fixed [0, 1] range when there are no outliers; StandardScaler is less distorted by outliers but not immune to them). Strict train/test separation prevents data leakage.
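
The pipeline summarized above can be sketched end to end with scikit-learn's `Pipeline` and `ColumnTransformer`, which keep every fitted step restricted to the training data. This is one possible arrangement under stated assumptions: `LinearRegression` is only a placeholder model, and the reduced column set and step names (`num`, `cat`, `preprocess`, `regressor`) are illustrative, not taken from the chapter.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Age": [28, 35, 24, 45, 31, 39],
    "Experience": [3, 8, 1, 18, 6, 12],
    "Dept": ["Eng", "Mkt", "Eng", "Mgt", "HR", "Eng"],
    "Salary": [55000, 87000, 42000, 105000, 72000, 91000],
})

X = df[["Age", "Experience", "Dept"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale numeric columns and one-hot encode the nominal column.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Experience"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Dept"]),
])

# Placeholder model: any scikit-learn estimator could take this slot
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# fit() learns scaling and encoding parameters from the training rows only;
# predict() reuses them on the test rows without refitting.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

Six rows are far too few to train a meaningful model; the point of the sketch is only the order of operations.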

10. Next Chapter Recommendation

In Chapter 27: Real-World Data Science Projects, we build complete analysis systems on real datasets — sales, student performance, COVID, and ecommerce.
