Hyperparameter Tuning and GridSearchCV
# CHAPTER 17
1. Introduction
When you instantiate a model like RandomForestClassifier(), Scikit-learn uses default settings (e.g., 100 trees, no maximum depth). While these defaults are reasonable, they are rarely optimal for your specific dataset. These settings are called hyperparameters. Adjusting them is like a DJ turning the knobs on a mixing board to get the perfect sound. In this chapter, we will learn how to systematically test hundreds of different knob combinations to squeeze every ounce of accuracy out of our models.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Define Hyperparameters vs. Parameters.
- Understand the concept of Grid Search.
- Implement GridSearchCV in Scikit-learn.
- Implement RandomizedSearchCV for faster tuning.
- Optimize a model to prevent overfitting.
3. Parameters vs. Hyperparameters
- Parameters: These are the numbers the model *learns* during training (e.g., the slope and intercept in Linear Regression). You do not set these yourself; they are estimated from the data when you call fit().
- Hyperparameters: These are the settings *you* provide before training begins (e.g., K=5 in KNN, max_depth=3 in Decision Trees).
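The distinction can be seen directly on fitted estimators. The following is a minimal sketch; the toy data is made up for illustration and is not from this chapter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data lying exactly on the line y = 2x + 1 (illustrative assumption).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

# Parameters: estimated by fit() and exposed with a trailing underscore.
reg = LinearRegression().fit(X, y)
slope = reg.coef_[0]
intercept = reg.intercept_
print("learned slope:", slope, "learned intercept:", intercept)

# Hyperparameters: chosen by you in the constructor, before any training.
tree = DecisionTreeClassifier(max_depth=3)
print("chosen max_depth:", tree.get_params()["max_depth"])
```

Attributes that end in an underscore (coef_, intercept_) only exist after fit(), which is a handy way to tell learned parameters apart from constructor settings.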
4. What is Grid Search?
Suppose a Random Forest has two hyperparameters you want to tune:
- n_estimators (Number of trees): [50, 100, 200]
- max_depth (Depth of trees): [5, 10, None]
Grid Search creates a literal "grid" of all possible combinations (50 & 5, 50 & 10, 50 & None, 100 & 5, etc.), which is 3 × 3 = 9 combinations in this example. It trains a separate model for *every single combination* using Cross-Validation, evaluates them, and tells you which specific combination produced the highest average cross-validated score (accuracy by default for classifiers).
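To make the grid concrete, the sketch below enumerates the combinations for the two lists above using scikit-learn's ParameterGrid helper. The variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The two hyperparameter lists from the example above.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
}

# ParameterGrid expands the dictionary into every combination.
combinations = list(ParameterGrid(param_grid))
print(len(combinations))  # 3 values x 3 values = 9 combinations

for combo in combinations[:3]:
    print(combo)
```

With cv=5, each of these 9 combinations is trained and scored 5 times, so the cost of a grid grows quickly as you add values.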
5. Implementing GridSearchCV
Let's optimize a Random Forest model.
*Note: n_jobs=-1 tells Scikit-learn to use all the cores on your CPU to run the grid search in parallel, making it much faster!*
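A minimal sketch of such a search follows. It assumes scikit-learn's built-in breast cancer dataset as a stand-in for your own data, and reuses the grid from the previous section; the split sizes and random_state values are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in dataset (assumption): replace with your own X and y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The grid from the previous section: 3 x 3 = 9 combinations.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                # 5-fold cross-validation for each combination
    scoring="accuracy",
    n_jobs=-1,           # use all available CPU cores
)

# Fit on the training split only; the test split stays untouched.
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)
```

The full table of results for every combination is available in grid_search.cv_results_, which is useful when two settings score almost the same.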
6. Using the Best Model
You don't need to manually re-create the model with the best parameters. With the default refit=True, GridSearchCV retrains a model with the best combination on the full training data and stores it in best_estimator_.
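As a rough sketch of how the saved model can be used, the example below runs a small search and then evaluates the refitted model on held-out data. The dataset and the reduced grid are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in dataset (assumption): replace with your own X and y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    cv=5,
)
grid_search.fit(X_train, y_train)

# The winning combination, already retrained on all of X_train.
best_model = grid_search.best_estimator_

# Evaluate once on the test split, which the search never saw.
test_accuracy = best_model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)

# Calling predict on the search object delegates to best_estimator_.
predictions = grid_search.predict(X_test)
```

Because the search object forwards predict and score to the best model, you can often pass grid_search itself to downstream code.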
7. RandomizedSearchCV (The Faster Alternative)
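A minimal sketch of the idea behind this section: RandomizedSearchCV draws a fixed number of combinations, set by n_iter, from lists or distributions instead of enumerating a full grid. The dataset, the ranges, and n_iter=10 are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Stand-in dataset (assumption): replace with your own X and y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Lists are sampled uniformly; scipy distributions are sampled via rvs().
param_distributions = {
    "n_estimators": randint(20, 101),      # integers from 20 to 100
    "max_depth": [3, 5, 10, None],
    "min_samples_split": randint(2, 11),   # integers from 2 to 10
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,          # number of random combinations to try
    cv=3,
    random_state=42,    # makes the sampled combinations reproducible
)
random_search.fit(X_train, y_train)

print("Best parameters:", random_search.best_params_)
print("Best cross-validated score:", random_search.best_score_)
```

The cost is controlled by n_iter rather than by the size of the search space, so adding another hyperparameter does not multiply the running time.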
If you have 10 hyperparameters with 10 values each, Grid Search will try 10^10 = 10 billion combinations, each trained once per cross-validation fold. Your computer will melt. RandomizedSearchCV solves this. Instead of trying *every* combination, it tries a random sample of combinations (e.g., exactly 50 random combinations, set with n_iter=50). In practice, it often finds a set of hyperparameters that performs nearly as well as the Grid Search winner, in a fraction of the time.
8. Common Mistakes
- Leaking Data into the Grid Search: You should only pass your Training data (X_train, y_train) into the Grid Search. If you pass the entire dataset, the hyperparameters are chosen with knowledge of the Test data, defeating the purpose of a blind test.
- Grids that are too large: Trying n_estimators: [1, 2, 3... up to 1000]. Be strategic. Try large steps first, such as [10, 100, 1000]. If 100 wins, do a smaller grid around it, such as [80, 100, 120].
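Both mistakes can be avoided in one workflow. The sketch below is one possible way to do a coarse pass followed by a finer pass, fitting on the training split only. The synthetic data, the search_n_estimators helper, and the 80%/120% rule for the finer grid are illustrative assumptions, not a scikit-learn feature.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only to keep the sketch quick to run.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def search_n_estimators(values):
    """Hypothetical helper: grid-search n_estimators on the training split."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": values},
        cv=3,
    )
    search.fit(X_train, y_train)  # never X or X_test
    return search


# Pass 1: large steps.
coarse = search_n_estimators([10, 100, 1000])
best = coarse.best_params_["n_estimators"]

# Pass 2: a narrow grid around the coarse winner (e.g. 100 -> [80, 100, 120]).
fine_values = [best * 4 // 5, best, best * 6 // 5]
fine = search_n_estimators(fine_values)

print("Coarse winner:", best)
print("Fine grid:", fine_values, "winner:", fine.best_params_)

# The test split is used exactly once, at the very end.
print("Test accuracy:", fine.score(X_test, y_test))
```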
9. Best Practices
- Optimize for the right metric: By default, GridSearchCV uses the estimator's own score method, which is accuracy for classifiers. If you are working with imbalanced data, change the scoring parameter to scoring='f1' or scoring='roc_auc'.
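A short sketch of switching the metric, assuming a synthetic dataset with roughly a 90/10 class imbalance. The grid values are arbitrary and only serve to give the search something to compare.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced training data (assumption): about 10% positives.
X_train, y_train = make_classification(
    n_samples=400, weights=[0.9, 0.1], random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, None], "class_weight": [None, "balanced"]},
    scoring="f1",   # rank combinations by F1 instead of accuracy
    cv=3,
)
search.fit(X_train, y_train)

print("Best parameters by F1:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```

Swapping in scoring="roc_auc" works the same way; best_score_ is then reported in that metric.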
10. Exercises
- 1. What is the difference between a Parameter and a Hyperparameter in Scikit-learn?
- 2. If param_grid has 4 values for C, 3 values for gamma, and cv=5, exactly how many models will GridSearchCV train?
11. MCQ Quiz with Answers
What is the primary purpose of GridSearchCV?
If GridSearchCV is taking too long to run because the parameter grid is massive, which Scikit-learn function is the best alternative?
12. Interview Questions
- Q: Explain the difference between GridSearchCV and RandomizedSearchCV. When would you use one over the other?
- Q: Explain why tuning hyperparameters on the Test dataset is considered a bad practice.
13. FAQs
Q: How do I know which hyperparameters exist for a specific algorithm?
A: Check the Scikit-learn documentation! Every algorithm's page lists all its hyperparameters, what they do, and their default values.
14. Summary
Hyperparameter tuning is the final polish on a machine learning model. By using GridSearchCV and RandomizedSearchCV, we remove the guesswork from tuning and systematically select the settings that score best under cross-validation before we deploy our models.