CHAPTER 18 Intermediate

Building ML Pipelines in Scikit-learn

Updated: May 16, 2026
6 min read

1. Introduction

Up to this point, our code has been a bit messy. We created an Imputer to fill missing values, then a Scaler to standardize the features, and finally a Classifier to make predictions. When new data arrives in production, you have to remember to run it through the exact same fitted Imputer and Scaler before feeding it to the Classifier. If you forget a step, the app may crash or, worse, silently return wrong predictions. Pipelines solve this. A Pipeline is a Scikit-learn object that bundles all your preprocessing steps and your model into one single, automated workflow.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Understand the importance of ML Pipelines.
  • Prevent Data Leakage using Pipelines.
  • Implement make_pipeline in Scikit-learn.
  • Combine Imputers, Scalers, and Models into one object.
  • Pass a Pipeline into GridSearchCV.

3. The Problem with Manual Preprocessing

Look at this standard workflow:
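```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumes X_train, X_test, and y_train already exist from an earlier train/test split
imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
model = LogisticRegression()

# 1. Impute
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# 2. Scale
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# 3. Train
model.fit(X_train_scaled, y_train)

# 4. Predict
predictions = model.predict(X_test_scaled)
```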

This is exhausting to write and maintain. What if you add a One-Hot Encoder? The code becomes a nightmare.

4. The Pipeline Solution

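A Pipeline chains these steps together. When you call .fit() on the Pipeline, it runs .fit_transform() on the Imputer, then .fit_transform() on the Scaler, and finally .fit() on the model. When you call .predict(), it runs only .transform() on the Imputer and the Scaler, reusing the values learned during fitting, before calling .predict() on the model.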
```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create the Pipeline
# Data flows through the steps in the order they are listed!
my_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression()
)

# Fit the entire pipeline in ONE line of code
my_pipeline.fit(X_train, y_train)

# Predict in ONE line of code
# The pipeline automatically imputes and scales the test data before predicting!
predictions = my_pipeline.predict(X_test)
```
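To check that the Pipeline really does the same work as the manual steps, the two routes can be run side by side. This is a minimal sketch, assuming synthetic data from make_classification with a few values blanked out; the dataset and the variable names are stand-ins, not part of the chapter's running example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; use your own X and y in practice)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # blank out roughly 5% of the values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual route: fit each preprocessor on the training data, reuse it on the test data
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
model = LogisticRegression()
X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_prepared = scaler.transform(imputer.transform(X_test))
model.fit(X_train_prepared, y_train)
manual_predictions = model.predict(X_test_prepared)

# Pipeline route: the same three steps bundled into one object
my_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
my_pipeline.fit(X_train, y_train)
pipeline_predictions = my_pipeline.predict(X_test)

print(np.array_equal(manual_predictions, pipeline_predictions))  # True
print(list(my_pipeline.named_steps))
# ['simpleimputer', 'standardscaler', 'logisticregression']
```

The second print shows the step names that make_pipeline generates automatically: the lowercase class name of each component.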

5. Preventing Data Leakage

Beyond cleaner code, Pipelines are essential for proper Cross-Validation. If you manually scale your data *before* running cross_val_score, you cause Data Leakage: the Scaler has already computed its mean and standard deviation from rows that later serve as the "hidden" validation folds. If you pass a Pipeline into cross_val_score, Scikit-learn does the split *first*, then fits the Scaler *only* on the training folds in every iteration and merely applies it to the held-out fold. This keeps the preprocessing leak-free.

You can even tune the hyperparameters of the components *inside* the pipeline using Grid Search! You just have to prefix the parameter name with the name of the step, followed by two underscores (__). With make_pipeline, the step name is the lowercase class name (for example, randomforestclassifier); with Pipeline(), it is whatever name you assigned.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Note: Using Pipeline() instead of make_pipeline() allows us to name the steps manually
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=42))
])

# Define grid. Notice the 'rf__' prefix telling it to tune the Random Forest, not the Scaler.
param_grid = {
    'rf__n_estimators': [50, 100],
    'rf__max_depth': [5, 10]
}

# Run Grid Search on the Pipeline!
# Assumes X_train and y_train already exist from an earlier train/test split
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy of the best parameter combination
print(f"Best Accuracy: {grid.best_score_:.2f}")
```
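Returning to the leakage point from the start of this section, the leaky and the leak-free evaluation can be written next to each other. This is a rough sketch on synthetic data (an assumption); on such a small, clean dataset the two scores may be nearly identical, so the point here is the structure of the code rather than the size of the gap.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Leaky: the scaler is fitted on ALL rows, including the rows that
# later act as validation folds inside cross_val_score
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_scaled_all, y, cv=5)

# Leak-free: the whole pipeline is refitted on each split, so the scaler
# only ever learns its mean and standard deviation from the training folds
pipe = make_pipeline(StandardScaler(), LogisticRegression())
clean_scores = cross_val_score(pipe, X, y, cv=5)

print(f"Leaky mean accuracy:     {leaky_scores.mean():.3f}")
print(f"Leak-free mean accuracy: {clean_scores.mean():.3f}")
```

The same reasoning applies to GridSearchCV: because the Scaler lives inside the Pipeline, it is refitted on the training folds for every parameter combination.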

7. Complex Pipelines (ColumnTransformer)

What if your dataset has numeric columns that need StandardScaler, and categorical (text-label) columns that need OneHotEncoder? You can't just pass the whole dataset to a single scaler. Scikit-learn provides ColumnTransformer. It lets you create two separate mini-pipelines (one for numeric columns, one for categorical columns), apply each to its own columns of the same DataFrame, and concatenate the results into one feature matrix. This is advanced, but it is a widely used pattern in production code.
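A ColumnTransformer setup along these lines might look like the sketch below. The DataFrame, its column names (age, income, city), and the labels are hypothetical toy values invented for illustration, and the choice of imputation strategies is an assumption.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing numeric values
df = pd.DataFrame({
    "age": [25, 32, None, 51, 46, 38],
    "income": [40_000, 52_000, 61_000, None, 88_000, 57_000],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Nice"],
})
y = [0, 0, 1, 1, 1, 0]

numeric_columns = ["age", "income"]
categorical_columns = ["city"]

# Mini-pipeline for numbers: fill gaps with the mean, then standardize
numeric_pipeline = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Mini-pipeline for categories: fill gaps with the most common label, then one-hot encode
# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising
categorical_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Route each group of columns to its own mini-pipeline
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_columns),
    ("cat", categorical_pipeline, categorical_columns),
])

# The ColumnTransformer is itself just one step in the full pipeline
full_pipeline = make_pipeline(preprocessor, LogisticRegression())
full_pipeline.fit(df, y)

# "Marseille" was never seen during fitting; the encoder ignores it
new_rows = pd.DataFrame({"age": [29], "income": [45_000], "city": ["Marseille"]})
prediction = full_pipeline.predict(new_rows)
print(prediction)
```

After preprocessing, each row has five features: two scaled numeric columns plus one indicator column per city seen during fitting.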

8. Common Mistakes

  • Putting the Model first: The order in the pipeline matters. Data flows from Step 1 to Step 2. If you put LogisticRegression() before StandardScaler(), the pipeline will raise an error, because every step before the last must be a transformer and the model doesn't transform data; it makes final predictions. The model must always be the *last* step in the pipeline.
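The mistake above can be reproduced in a few lines. This is a small sketch on synthetic data (an assumption); Scikit-learn is expected to reject the misordered steps with a TypeError, though whether that happens when the pipeline is built or when it is fitted can depend on the installed version, so both calls sit inside the try block.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

error_message = None
try:
    # Wrong order: the classifier sits in a slot reserved for transformers
    bad_pipeline = make_pipeline(LogisticRegression(), StandardScaler())
    bad_pipeline.fit(X, y)
except TypeError as exc:
    error_message = str(exc)

# Print whatever explanation the installed version gives for the rejection
print(error_message)
```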

9. Best Practices

  • Never deploy without a Pipeline: If you save just the trained model to a file, the software engineer building the web app won't know how the data was imputed or scaled. If you save a Pipeline to a file, the software engineer just passes raw user input to .predict(), and the Pipeline applies the same fitted preprocessing automatically.
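As a brief preview of saving a Pipeline, here is a sketch using joblib, the serialization library that Scikit-learn itself depends on. The file name pipeline.joblib and the synthetic data are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "pipeline.joblib"  # hypothetical file name
    joblib.dump(pipeline, model_path)           # scaler and model saved together
    restored_pipeline = joblib.load(model_path)  # what the web app would load

# The restored pipeline accepts raw, unscaled input just like the original
same_predictions = bool((restored_pipeline.predict(X) == pipeline.predict(X)).all())
print(same_predictions)  # True
```

Only load such files from sources you trust, and load them with the same Scikit-learn version that created them where possible.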

10. Exercises

  1. Write the code to create a pipeline using make_pipeline that contains a MinMaxScaler and a KNeighborsClassifier.
  2. Why does using a Pipeline prevent data leakage during Cross-Validation?

11. MCQ Quiz with Answers

Question 1

In a Scikit-learn Pipeline, what method is called on the final step (the algorithm) when pipeline.fit() is executed?

Question 2

What is the primary benefit of bundling preprocessing steps and a model into a single Pipeline?

12. Interview Questions

  • Q: Explain what a Scikit-learn Pipeline is and why it is considered a best practice for production code.
  • Q: How do you perform hyperparameter tuning via GridSearchCV on a model that is embedded inside a Pipeline?

13. FAQs

Q: Can I put custom Python functions into a Pipeline? A: Yes! Scikit-learn provides a FunctionTransformer that allows you to wrap any custom Python function (like a complex text-cleaning script) and place it seamlessly into the pipeline.
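For instance, a custom function can be wrapped roughly like this. The function clip_and_log and the toy arrays are hypothetical, chosen only to show the mechanics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def clip_and_log(values):
    """Hypothetical custom step: clip negatives to zero, then apply log(1 + x)."""
    return np.log1p(np.clip(values, 0, None))


# Toy data (an assumption): the target grows with the logarithm of the feature
X = np.array([[1.0], [10.0], [100.0], [1000.0], [10000.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

pipeline = make_pipeline(
    FunctionTransformer(clip_and_log),  # the custom function becomes a pipeline step
    StandardScaler(),
    LinearRegression(),
)
pipeline.fit(X, y)

# The first step can be used on its own: negatives are clipped, and log1p(0) is 0
transformed = pipeline[0].transform(np.array([[-5.0], [0.0]]))
print(transformed)
```

A wrapped function should be stateless. If a step needs to learn something from the training data, a full transformer class with fit and transform is the better fit.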

14. Summary

Pipelines represent the transition from "writing scripts" to "software engineering." By bundling imputers, scalers, and models into a single, unified object, you guarantee consistent preprocessing, guard against data leakage during evaluation, and create portable machine learning systems ready for the real world.

15. Next Chapter Recommendation

Our pipeline is in good shape. It takes raw data and outputs predictions in a single call. But right now, it only exists in our Jupyter Notebook. How do we put it on a server so the world can use it? In Chapter 19: Saving and Deploying Machine Learning Models, we will learn how to export our models and integrate them into web applications.
