CHAPTER 26
Beginner
Preparing Data for Machine Learning
Updated: May 18, 2026
1. Chapter Introduction
Raw data cannot be fed directly into ML models. Data preparation (feature engineering, encoding, scaling, and splitting) is a large part of what makes models accurate. Pandas and NumPy handle most of this preprocessing pipeline.
2. Feature Engineering
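As a minimal sketch of feature engineering with Pandas and NumPy, the example below derives a few new columns from a small, made-up customer table. The DataFrame and every column name in it (`total_spend`, `num_orders`, `signup_date`, and the derived columns) are assumptions for illustration only. The ratio feature uses `.clip(lower=1)` on the denominator so that a customer with zero orders does not cause a division by zero.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data; all column names are illustrative assumptions.
df = pd.DataFrame({
    "total_spend": [120.0, 0.0, 450.0, 80.0],
    "num_orders": [3, 0, 9, 2],
    "signup_date": pd.to_datetime(
        ["2024-01-15", "2024-03-02", "2023-11-20", "2024-06-30"]
    ),
})

# Ratio feature: clip the denominator to at least 1 to avoid division by zero.
df["avg_order_value"] = df["total_spend"] / df["num_orders"].clip(lower=1)

# Log transform to compress a right-skewed column; log1p handles zeros safely.
df["log_spend"] = np.log1p(df["total_spend"])

# Date parts often carry signal that the raw timestamp does not.
df["signup_month"] = df["signup_date"].dt.month
df["signup_dayofweek"] = df["signup_date"].dt.dayofweek  # Monday=0

# Binary flag derived from an existing column.
df["is_repeat_customer"] = (df["num_orders"] > 1).astype(int)

print(df)
```

Which derived features actually help depends on the dataset and the model, so treat these as typical patterns rather than a fixed recipe.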
3. Encoding Categorical Variables
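The sketch below illustrates three common encodings on a small, made-up table: one-hot encoding with `pd.get_dummies` for an unordered column, an explicit ordinal mapping for an ordered column, and a simple mean-based target encoding. The column names (`city`, `size`, `purchased`) and the category order are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Delhi"],
    "size": ["small", "large", "medium", "medium", "small", "large"],
    "purchased": [1, 0, 1, 1, 1, 0],
})

# One-hot encoding for a nominal (unordered) column.
# drop_first=True drops one dummy column, which is redundant given the
# others and would otherwise make the dummy columns perfectly collinear.
one_hot = pd.get_dummies(df["city"], prefix="city", drop_first=True, dtype=int)

# Ordinal encoding for a column whose categories have a real order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Target encoding: replace each category with the mean of the target for it.
# In practice, compute these means on the training split only.
city_means = df.groupby("city")["purchased"].mean()
df["city_target_enc"] = df["city"].map(city_means)

encoded = pd.concat([df, one_hot], axis=1)
print(encoded)
```

With `drop_first=True`, the first category in sorted order (`Delhi` here) becomes the implicit baseline: a row with zeros in every remaining dummy column belongs to it.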
4. Feature Scaling / Normalization
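To make the arithmetic behind the two usual scaling methods concrete, the sketch below applies them by hand with NumPy to a small, made-up feature matrix (the age and income values are assumptions). Min-Max scaling computes `(x - min) / (max - min)` per column and maps values to the range [0, 1]; standardization (Z-score) computes `(x - mean) / std` per column and yields mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical feature matrix: columns are age (years) and income.
X = np.array([
    [25.0, 30000.0],
    [35.0, 60000.0],
    [45.0, 90000.0],
    [55.0, 120000.0],
])

# Min-Max scaling: (x - min) / (max - min), maps each column to [0, 1].
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Standardization (Z-score): (x - mean) / std, per column.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)  # population standard deviation (ddof=0)
X_standard = (X - col_mean) / col_std

print(X_minmax)
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```

scikit-learn's `MinMaxScaler` and `StandardScaler` apply the same formulas and additionally remember the fitted statistics so they can be reused on new data.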
5. Train-Test Split
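A minimal sketch of a train-test split using scikit-learn's `train_test_split` follows. The data is randomly generated for illustration, and the use of `stratify=y` is an optional extra assumed here to keep the class ratio equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, balanced binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# test_size=0.2 holds out 20% of the rows for testing.
# random_state=42 makes the shuffle reproducible across runs.
# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

The specific value 42 has no special meaning; any fixed integer gives a repeatable split.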
6. Common Mistakes
- Scaling before split (data leakage): `fit_transform()` on the full dataset lets test data influence the scaling parameters. Always fit the scaler on training data only.
- Label encoding nominal categories: Using `LabelEncoder` on nominal categories (City: 0, 1, 2) implies an order. Use `get_dummies()` (One-Hot Encoding) for unordered categories.
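The first mistake can be avoided by splitting first and then fitting the scaler on the training rows only. The sketch below shows one way to do this with scikit-learn's `StandardScaler`; the data is randomly generated and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 200 samples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))

# Split BEFORE any scaling so the test rows stay unseen.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Fit on training data only: mean and std come from the training rows.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics on the test set; do not call fit here.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)
```

Because the test set is scaled with training statistics, its scaled mean will generally be close to, but not exactly, zero. That is expected and is not a sign of an error.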
7. MCQs
Question 1
One-Hot Encoding is for?
Question 2
Label Encoding is appropriate for?
Question 3
Standard scaling (Z-score) produces?
Question 4
Data leakage in scaling happens when?
Question 5
`drop_first=True` in `get_dummies` prevents?
Question 6
Min-Max scaling maps values to?
Question 7
`test_size=0.2` means?
Question 8
Target encoding maps?
Question 9
.clip(lower=1) in feature engineering prevents?
Question 10
`random_state=42` in `train_test_split` ensures?
8. Interview Questions
- Q: What is the difference between Min-Max scaling and standardization?
- Q: What is data leakage and how does it occur in preprocessing?