Train-Test Split and Cross Validation

Updated: May 16, 2026

# CHAPTER 8

1. Introduction

Imagine a student preparing for a final math exam. If the teacher hands over the exact exam paper the night before, the student will score 100%. But did the student actually *learn* math, or just memorize the answers? Given a brand-new math problem, they would likely fail. Machine learning models can behave the same way. If you train and evaluate a model on the same data, it can memorize that data and report a near-perfect score, yet fail miserably in the real world. To detect this, we must hide a portion of our data during training. In this chapter, we will master the train_test_split function and the concept of Cross-Validation.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Define Overfitting and Underfitting.
  • Use Scikit-learn's train_test_split to divide datasets.
  • Prevent Data Leakage during preprocessing.
  • Understand and implement K-Fold Cross-Validation.

3. Overfitting vs. Underfitting

  • Overfitting (Memorization): The model learns the training data *too well*, including the noise and outliers. It has a high training accuracy but a terrible testing accuracy on unseen data.
  • Underfitting (Incapable): The model is too simple to capture the underlying patterns in the data (like trying to fit a straight line through a U-shaped curve). It scores poorly on both training and testing data.
  • The Goal (Generalization): A model that learns the true underlying pattern and performs comparably well on both training and unseen testing data.
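
The three cases can be made concrete by comparing training and testing accuracy. The sketch below is an illustration under assumptions that are not part of this chapter: a synthetic dataset from make_classification with deliberately noisy labels, and decision trees of different depths standing in for a "too simple" model, a moderate model, and a "memorizing" model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration); flip_y adds label noise
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until it fits every training row
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(Xd_train, yd_train)
    results[depth] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    print(f"max_depth={depth}: train={results[depth][0]:.2f}, test={results[depth][1]:.2f}")
```

The unrestricted tree reaches 100% training accuracy by memorizing the noisy labels, so the gap between its training and testing scores is the signature of overfitting. The depth-1 tree cannot fit the training data nearly as well, which is what underfitting looks like.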

4. The Train-Test Split

The simplest way to evaluate a model is to split the dataset. We usually keep 80% of the data for Training, and hold back 20% for Testing.

Scikit-learn provides a one-line function for this:

python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assume df is our dataframe
X = df.drop('Target_Column', axis=1) # The Features (Inputs)
y = df['Target_Column'] # The Label (Answers)

# Split the data!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.20,     # 20% goes to the test set
    random_state=42     # Sets a seed so the random split is reproducible
)

print(f"Training rows: {X_train.shape[0]}, Testing rows: {X_test.shape[0]}")
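
To see what random_state does in practice, here is a small self-contained check. It assumes the built-in iris dataset in place of df, and it also shows the optional stratify argument of train_test_split, which this chapter does not otherwise cover, as one way to keep class proportions equal across the two sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)  # 150 rows, 3 classes of 50 rows each

# Two calls with the same seed produce exactly the same split
_, X_test_a, _, y_test_a = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=42
)
_, X_test_b, _, y_test_b = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=42
)
same_split = np.array_equal(X_test_a, X_test_b)
print("Same split with the same seed:", same_split)

# stratify=y keeps the class proportions identical in train and test
_, _, _, y_test_s = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=42, stratify=y_iris
)
print("Test class counts without stratify:", np.bincount(y_test_a))
print("Test class counts with stratify:   ", np.bincount(y_test_s))
```

Stratifying matters most when one class is rare, because a purely random split could leave very few examples of that class in the test set.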

5. The Golden Rule: Prevent Data Leakage

Data leakage occurs when information from the *Test set* accidentally leaks into the *Training set* during preprocessing. Critical Error: Running StandardScaler().fit_transform() on your *entire* dataset BEFORE splitting it. If you do this, the Scaler calculates its mean and standard deviation using rows that will end up in the Test set! The model has now indirectly "seen" the test data.

The Correct Workflow:

  1. Split the data FIRST.
  2. fit() the Scaler ONLY on X_train.
  3. transform() X_train.
  4. transform() X_test (using the scaling parameters learned from the training set).

python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit ONLY on training data!
X_train_scaled = scaler.fit_transform(X_train)

# ONLY transform test data. Do NOT fit!
X_test_scaled = scaler.transform(X_test)
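
A quick way to convince yourself that the scaler only learned from the training rows is to inspect its fitted attributes. This is a minimal sketch that assumes the iris dataset as a stand-in for your own data; mean_ is the attribute where a fitted StandardScaler stores the per-column means.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in dataset (assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuses those train statistics

# The stored means are the training means, not the full-dataset means
print("Scaler means:      ", scaler.mean_.round(3))
print("Training set means:", X_train.mean(axis=0).round(3))
print("Full dataset means:", X.mean(axis=0).round(3))

# Scaled training columns are centered at 0; scaled test columns generally are not
print("Scaled train means:", X_train_scaled.mean(axis=0).round(3))
print("Scaled test means: ", X_test_scaled.mean(axis=0).round(3))
```

The scaled test columns will usually not have a mean of exactly 0, and that is expected: the test set is being measured with the training set's ruler.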

6. The Problem with a Simple Split

What if your random 80/20 split was just incredibly "lucky" or "unlucky"? Maybe all the difficult examples ended up in the test set. Your accuracy score would not reflect reality. To get a highly reliable evaluation, professionals use Cross-Validation.
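
The effect of a "lucky" or "unlucky" split is easy to observe by repeating the split with different seeds. The sketch below assumes the iris dataset and a LogisticRegression model; make_pipeline is used only as shorthand for the fit-on-train, transform-on-test scaling steps from the previous section.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)  # stand-in dataset (assumption)

accuracies = []
for seed in range(10):  # ten different random 80/20 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_iris, y_iris, test_size=0.2, random_state=seed
    )
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    clf.fit(X_tr, y_tr)                       # scaler and model see train rows only
    accuracies.append(clf.score(X_te, y_te))  # accuracy on the held-out rows

print("Accuracy per seed:", np.round(accuracies, 3))
print(f"min={min(accuracies):.3f}, max={max(accuracies):.3f}")
```

Same data, same model, yet the reported accuracy usually changes from seed to seed. Any single number from this list could have been "your" score.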

7. K-Fold Cross-Validation

Instead of splitting the data once, K-Fold Cross-Validation splits the data multiple times (e.g., 5 times, known as K=5).
  1. It divides the dataset into 5 roughly equal chunks (Folds).
  2. It trains the model on Folds 1, 2, 3, 4 and tests on Fold 5.
  3. Then it trains on Folds 1, 2, 3, 5 and tests on Fold 4.
  4. It repeats this until every fold has been used as the test set exactly once.
  5. It averages the 5 accuracy scores to give you a more robust estimate of model performance than any single split.
python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# Run 5-fold cross validation on the ENTIRE dataset
scores = cross_val_score(model, X, y, cv=5)

print("Scores for each fold:", scores)
print("Average Accuracy: {:.2f}%".format(scores.mean() * 100))
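
To make the five steps visible, the loop that cross_val_score runs for you can be written out by hand with KFold. This is a sketch under assumptions: it uses the iris dataset, shuffles before folding because the iris rows are sorted by class, and re-fits a StandardScaler inside every fold so that no fold's test rows influence the scaling. Note that cross_val_score with cv=5 on a classifier uses stratified folds, so its per-fold numbers will not match this loop exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)  # stand-in dataset (assumption)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
times_tested = np.zeros(len(X_iris), dtype=int)  # how often each row is a test row

for fold, (train_idx, test_idx) in enumerate(kfold.split(X_iris), start=1):
    # Fit the scaler on this fold's training rows only
    scaler = StandardScaler().fit(X_iris[train_idx])

    model = LogisticRegression()
    model.fit(scaler.transform(X_iris[train_idx]), y_iris[train_idx])

    score = model.score(scaler.transform(X_iris[test_idx]), y_iris[test_idx])
    fold_scores.append(score)
    times_tested[test_idx] += 1
    print(f"Fold {fold}: tested on {len(test_idx)} rows, accuracy={score:.3f}")

print(f"Average Accuracy: {np.mean(fold_scores) * 100:.2f}%")
print("Every row tested exactly once:", bool((times_tested == 1).all()))
```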

8. Mini Project: Model Validation Workflow

Let's establish the professional workflow:
python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load Data
X, y = load_iris(return_X_y=True)

# 2. Split Data (Before any processing!)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Preprocess (Fit only on Train)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Train
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# 5. Evaluate on unseen Test data
predictions = model.predict(X_test_scaled)
acc = accuracy_score(y_test, predictions)
print(f"Final Test Accuracy: {acc * 100:.2f}%")

9. Common Mistakes

  • Fitting preprocessing tools on Test data: Again, calling .fit() (or .fit_transform()) on your test data leaks test-set statistics into your preprocessing and inflates your score. It is one of the most common evaluation mistakes.
  • Shuffling time-series data: train_test_split shuffles data randomly by default. However, if you are predicting stock prices, shuffling destroys the timeline and lets the model train on the future. For time-series, you must split chronologically (e.g., train on 2020-2022, test on 2023), for example by passing shuffle=False.
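
A chronological split can be sketched as follows. The monthly price series here is hypothetical (the values are made up for illustration), and TimeSeriesSplit is shown as one option scikit-learn offers for cross-validating ordered data; this chapter does not prescribe it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Hypothetical monthly series: 48 months covering 2020 through 2023
dates = pd.date_range("2020-01-01", periods=48, freq="MS")
prices = pd.DataFrame({"date": dates, "price": np.linspace(100.0, 147.0, 48)})

# shuffle=False keeps row order: the first 75% is train, the last 25% is test
train_df, test_df = train_test_split(prices, test_size=0.25, shuffle=False)
print("Train:", train_df["date"].min().date(), "to", train_df["date"].max().date())
print("Test: ", test_df["date"].min().date(), "to", test_df["date"].max().date())

# TimeSeriesSplit: every test fold comes strictly after its training rows
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(prices):
    print(f"train rows 0-{train_idx.max()}, test rows {test_idx.min()}-{test_idx.max()}")
```

The data must already be sorted by date for shuffle=False to produce a true chronological split.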

10. Best Practices

  • Validation Sets: For deep learning or complex hyperparameter tuning, splitting 80/20 isn't enough. You often split into Train (70%), Validation (15%), and Test (15%). You use the Validation set to tweak the model, and keep the Test set completely hidden until the very end of the project.
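
train_test_split only produces two pieces, so one common way to get a 70/15/15 split is to call it twice. The sketch below uses random placeholder data (an assumption) purely to show the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1000 rows, 3 features, binary labels
rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 3))
y_all = rng.integers(0, 2, size=1000)

# Step 1: keep 70% for training, set 30% aside
X_train, X_temp, y_train, y_temp = train_test_split(
    X_all, y_all, test_size=0.30, random_state=42
)

# Step 2: cut the 30% in half: 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)

print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")
```

Tune against X_val and y_val as often as you like, but score X_test and y_test only once, at the very end.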

11. Exercises

  1. Explain the analogy of the student and the math exam in the context of Overfitting.
  2. What does the random_state parameter do in the train_test_split function, and why is it useful?

12. MCQ Quiz with Answers

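Question 1

What is the result of Overfitting in a machine learning model?

Answer: High accuracy on the training data but poor accuracy on unseen testing data.

Question 2

When scaling data using StandardScaler, which method should you call on your testing data (X_test)?

Answer: transform() only, using the parameters learned from the training set. Never fit() or fit_transform().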

13. Interview Questions

  • Q: Explain K-Fold Cross-Validation and why it provides a more reliable metric than a simple Train-Test split.
  • Q: What is Data Leakage in machine learning, and how do you prevent it during preprocessing?

14. FAQs

Q: Should I use Train-Test split or Cross-Validation? A: Use both. Hold out a Test set first, then use Cross-Validation on the remaining training data during the development phase to compare algorithms and tune them. Once you have finalized your model design, evaluate it once on the held-out Test set to confirm it works on data it has never seen.
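
One way to combine the two techniques is sketched below. It assumes the iris dataset and a LogisticRegression model, and it wraps the scaler and model in make_pipeline so that cross_val_score re-fits the scaler inside every fold instead of scaling the whole training set up front.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in dataset (assumption)

# 1. Lock away a test set before doing anything else
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Development: cross-validate on the training portion only
candidate = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(candidate, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# 3. Final check: fit on all training rows, score the test set once
candidate.fit(X_train, y_train)
test_acc = candidate.score(X_test, y_test)
print(f"Held-out test accuracy: {test_acc:.3f}")
```

If you compare several models in step 2, only the winner should ever reach step 3.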

15. Summary

Evaluation is the only way to know if your model is learning or just memorizing. By strictly isolating our Test data before any preprocessing occurs, and by utilizing robust techniques like K-Fold Cross-Validation, we get a trustworthy estimate of how well our models will generalize to real-world scenarios.

16. Next Chapter Recommendation

Our data is clean, encoded, and properly split. It is finally time to do some real Machine Learning! In Chapter 9: Linear Regression in Scikit-learn, we will train our very first algorithm to predict continuous numbers, like housing prices.
