CHAPTER 18
Intermediate
Building ML Pipelines in Scikit-learn
Updated: May 16, 2026
6 min read
1. Introduction
Up to this point, our code has been a bit messy. We created an Imputer to fill missing values, then a Scaler to standardize the features, and finally a Classifier to make predictions. When new data arrives in production, you have to remember to run it through the exact same fitted Imputer and Scaler before feeding it to the Classifier. If you forget a step, the model receives data in a form it was not trained on: the app may crash or, worse, quietly return wrong predictions. Pipelines solve this. A Pipeline is a Scikit-learn object that bundles all your preprocessing steps and your model into one single, automated workflow.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Understand the importance of ML Pipelines.
- Prevent Data Leakage using Pipelines.
- Implement make_pipeline in Scikit-learn.
- Combine Imputers, Scalers, and Models into one object.
- Pass a Pipeline into GridSearchCV.
3. The Problem with Manual Preprocessing
Look at this standard workflow:
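A minimal sketch of what such a manual workflow typically looks like. The synthetic dataset and the specific choices of SimpleImputer, StandardScaler, and LogisticRegression are assumptions made for illustration, not the chapter's original listing:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 200 rows, 3 features,
# with roughly 10% of the values replaced by NaN.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: fit the imputer on the training data, then transform both sets.
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Step 2: fit the scaler on the imputed training data, then transform both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Step 3: train the classifier and predict on the preprocessed test data.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
manual_predictions = model.predict(X_test_scaled)
```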
This is exhausting to write and maintain. What if you add a One-Hot Encoder? The code becomes a nightmare.
4. The Pipeline Solution
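A Pipeline chains these steps together. When you call .fit() on the Pipeline, it automatically runs .fit_transform() on the Imputer, then .fit_transform() on the Scaler, and finally .fit() on the model. When you later call .predict(), it runs only .transform() on the Imputer and the Scaler (nothing is refitted) before calling .predict() on the model.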
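The chaining described above can be sketched with make_pipeline. The synthetic data and the three chosen components (SimpleImputer, StandardScaler, LogisticRegression) are assumptions for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration) with about 10% missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One object holds the imputer, the scaler, and the model, in that order.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)

# fit() fits and applies each transformer in turn, then fits the model.
pipe.fit(X_train, y_train)

# predict() and score() take raw data; the fitted transformers are reused.
pipeline_predictions = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Note that the test data is passed in raw, NaN values included; no separate imputing or scaling call is needed.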
5. Preventing Data Leakage
Beyond cleaner code, Pipelines are essential for proper Cross-Validation. If you manually scale your data *before* running cross_val_score, you cause Data Leakage: the Scaler's mean and standard deviation were computed from every row, including the rows that later serve as the "hidden" test folds.
If you pass a Pipeline into cross_val_score, Scikit-learn does the split *first*, then fits the Scaler *only* on the training folds of each iteration and uses those fitted statistics to transform the held-out fold. Every preprocessing step inside the Pipeline is therefore evaluated without having seen the test fold.
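As a rough illustration, the leak-free setup amounts to handing the whole Pipeline to cross_val_score instead of pre-scaled data. The breast cancer dataset bundled with Scikit-learn and the choice of LogisticRegression are assumptions made for this sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset (assumption for illustration): 569 rows, 30 numeric features.
X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so it is refitted on the
# training folds of every split and never sees the held-out fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Pass the raw, unscaled X; the split happens before any scaling.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean accuracy over 5 folds: {scores.mean():.3f}")
```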
6. Mini Project: End-to-End ML Pipeline with Grid Search
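You can even tune the hyperparameters of the components *inside* the pipeline using Grid Search! You just have to prefix the parameter name with the name of the step, followed by two underscores (__). With make_pipeline, the step name is the lowercase class name of the component, so the C parameter of LogisticRegression becomes logisticregression__C.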
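A minimal sketch of such a mini project, assuming the breast cancer dataset bundled with Scikit-learn and an illustrative grid of values for the C parameter of LogisticRegression:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset (assumption for illustration).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# "<step name>__<parameter name>": make_pipeline names each step after
# its lowercased class, so LogisticRegression becomes "logisticregression".
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is cross-validated with the scaler refitted per fold.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", best_params)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Calling pipe.get_params().keys() lists every parameter name that can be used as a key in param_grid.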
7. Complex Pipelines (ColumnTransformer)
What if your dataset has numeric columns that need StandardScaler, and categorical (text) columns that need OneHotEncoder? You can't just pass the whole dataset to a single scaler.
Scikit-learn provides ColumnTransformer. It allows you to create two separate mini-pipelines (one for numbers, one for text), apply each one only to its own columns of the same DataFrame, and concatenate the results into a single feature matrix. This is advanced, but it is widely used in production code.
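One way this could look in practice is sketched below. The toy DataFrame, its column names (age, income, city), and the labels are hypothetical and exist only to make the example runnable:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40000, 52000, 81000, 90000, 61000, 45000],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Nice"],
})
y = [0, 0, 1, 1, 1, 0]

numeric_columns = ["age", "income"]
categorical_columns = ["city"]

# Mini-pipeline for the numeric columns: impute, then scale.
numeric_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Each entry is (name, transformer, columns it applies to).
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_columns),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
])

# The ColumnTransformer is itself just the first step of a larger pipeline.
full_pipeline = make_pipeline(preprocessor, LogisticRegression())
full_pipeline.fit(df, y)

# 2 scaled numeric columns + 3 one-hot city columns.
transformed = full_pipeline.named_steps["columntransformer"].transform(df)
predictions = full_pipeline.predict(df)
```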
8. Common Mistakes
- Putting the Model first: The order in the pipeline matters. Data flows from Step 1 to Step 2. If you put LogisticRegression() before StandardScaler(), the pipeline will fail because the model doesn't transform data; it makes final predictions. Every step except the last must be a transformer, so the model must always be the *last* step in the pipeline.
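The mistake above can be reproduced in a few lines. This sketch assumes a tiny hand-made dataset; in recent Scikit-learn releases the failure is expected to surface as a TypeError, though the exact message and the moment it is raised may differ between versions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny hand-made dataset (assumption for illustration).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

# Wrong order: the model is an intermediate step but has no transform().
error_message = None
try:
    bad_pipeline = make_pipeline(LogisticRegression(), StandardScaler())
    bad_pipeline.fit(X, y)
except TypeError as exc:
    error_message = str(exc)
print("Wrong order failed with:", error_message)

# Right order: transformers first, the model last.
good_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
good_pipeline.fit(X, y)
good_predictions = good_pipeline.predict(X)
```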
9. Best Practices
- Never deploy without a Pipeline: If you save just the trained model to a file, the software engineer building the web app won't know how the data was imputed and scaled. If you save the fitted Pipeline to a file, the software engineer just passes raw user input to .predict(), and the Pipeline applies the same preprocessing automatically.
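Saving and reloading a fitted Pipeline might look like the sketch below. It assumes joblib as the serialization library (a common choice for Scikit-learn objects), a temporary directory, and a hypothetical file name pipeline.joblib:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny hand-made dataset (assumption for illustration).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]])
y = np.array([0, 0, 1, 1])

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path works.
    model_path = os.path.join(tmp_dir, "pipeline.joblib")

    # Training side: the scaler and the model are saved together.
    joblib.dump(pipe, model_path)

    # Serving side: load one object and pass it raw input.
    loaded_pipe = joblib.load(model_path)

raw_input = np.array([[2.5, 2.5]])
prediction = loaded_pipe.predict(raw_input)
```

A file saved this way should be loaded with the same Scikit-learn version that produced it, and only from a trusted source, because loading it can execute arbitrary code.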
10. Exercises
- 1. Write the code to create a pipeline using make_pipeline that contains a MinMaxScaler and a KNeighborsClassifier.
- 2. Why does using a Pipeline prevent data leakage during Cross-Validation?
11. MCQ Quiz with Answers
Question 1
In a Scikit-learn Pipeline, what method is called on the final step (the algorithm) when pipeline.fit() is executed?
Question 2
What is the primary benefit of bundling preprocessing steps and a model into a single Pipeline?
12. Interview Questions
- Q: Explain what a Scikit-learn Pipeline is and why it is considered a best practice for production code.
- Q: How do you perform hyperparameter tuning via GridSearchCV on a model that is embedded inside a Pipeline?
13. FAQs
Q: Can I put custom Python functions into a Pipeline?
A: Yes! Scikit-learn provides a FunctionTransformer that allows you to wrap any custom Python function (like a complex text-cleaning script) and place it seamlessly into the pipeline.
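As a small illustration of the answer above, the sketch below wraps NumPy's log10 function in a FunctionTransformer. The data, which follows y = log10(x) exactly, and the choice of LinearRegression are assumptions made so that the effect is easy to see:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data (assumption for illustration): y is exactly log10 of x.
X = np.array([[1.0], [10.0], [100.0], [1000.0], [10000.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# Wrap a plain function so it behaves like any other pipeline step.
log_transformer = FunctionTransformer(np.log10)

pipe = make_pipeline(log_transformer, StandardScaler(), LinearRegression())
pipe.fit(X, y)

# The same log transform is applied automatically at prediction time.
prediction = pipe.predict(np.array([[100000.0]]))
print(prediction)
```

The wrapped function receives the whole feature array at once and should return an array with the same number of rows.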