CHAPTER 18
Intermediate
Hyperparameter Tuning and Optimization
Updated: May 16, 2026
1. Introduction
Building a model with `nn.Linear(10, 64)` and an Adam optimizer is easy. But *why* 64 neurons? Why not 128? Why Adam and not SGD? These choices are called Hyperparameters: settings you pick before training, as opposed to the Weights the model learns on its own. If your model is struggling to learn, simply throwing more data at it won't always fix the problem; you need to tune the engine. In this chapter, we will dissect the most critical hyperparameter, the Learning Rate, and learn how to systematically optimize our networks.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Explain what an Optimizer does mathematically.
- Compare SGD (Stochastic Gradient Descent) vs. Adam.
- Understand the critical importance of the Learning Rate.
- Implement Learning Rate Schedulers in PyTorch.
- Apply Regularization (Weight Decay) to prevent overfitting.
3. What is an Optimizer?
During training, the Loss Function measures the error, and Backpropagation computes the gradient of that error with respect to every Weight. The Optimizer is the engine that decides exactly *how* to update the network's Weights to reduce that error. Imagine you are blindfolded on a mountain, trying to walk down to the lowest valley (the minimum Loss). The gradient tells you which way is downhill; the Optimizer determines which direction you step, and how big of a step you take.
4. The Learning Rate (The Most Important Hyperparameter)
The "size of the step" the Optimizer takes is called the Learning Rate (LR).
- LR Too Large (e.g., `0.1` with Adam): You take massive leaps. You might jump completely over the valley and end up higher on the other side. The model fails to converge and the loss fluctuates wildly.
- LR Too Small (e.g., `0.000001`): You take microscopic baby steps. It will take you 10 years to reach the bottom. Training takes forever, or the model gets stuck in a small ditch (a local minimum).
- Just Right (often `0.001` or `0.0001` for Adam): You walk smoothly down to the lowest point. The right value depends on the optimizer, the model, and the data, so treat these numbers as starting points.
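The three cases above can be reproduced on a toy problem. The sketch below is a minimal illustration, assuming a made-up one-weight loss `f(w) = w**2` whose lowest point is at `w = 0`; the function name `gradient_descent` and the specific learning rates are chosen for this toy and are not values to reuse on a real network.

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimize the toy loss f(w) = w**2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2 at the current w
        w = w - lr * grad   # update rule: step against the gradient, scaled by lr
    return w

# Too large, too small, and reasonable for THIS toy loss only.
for lr in (1.1, 1e-6, 0.1):
    print(f"lr={lr:g}: w after 20 steps = {gradient_descent(lr):.6f}")
```

With the large rate, each step overshoots the minimum and `w` ends up farther from zero than where it started. With the tiny rate, `w` has barely moved after 20 steps. The middle rate brings `w` close to zero.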
5. Optimizers: SGD vs. Adam
In PyTorch, you import optimizers from `torch.optim`.
- SGD (Stochastic Gradient Descent): The classic algorithm. It is simple and reliable, but it can be slow and is sensitive to the Learning Rate you pick.
- Adam (Adaptive Moment Estimation): The most common default in modern deep learning. You still set a base Learning Rate, but Adam *adapts* the effective step size for every single weight during training, using running averages of each weight's recent gradients and squared gradients. This makes it less sensitive to the exact Learning Rate than SGD. Adam is a sensible baseline to start from.
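As a rough sketch of how the two optimizers are created and used, the example below builds two identical small networks and runs one training step with each. The layer sizes, the random data, the `momentum=0.9` setting, and the helper name `train_step` are illustrative assumptions, not recommendations from this chapter.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

def train_step(model, optimizer, x, y):
    """Run one forward/backward pass and one weight update; return the loss."""
    loss_fn = nn.MSELoss()
    optimizer.zero_grad()          # clear gradients left over from the last step
    loss = loss_fn(model(x), y)    # forward pass and loss
    loss.backward()                # backpropagation fills .grad on every weight
    optimizer.step()               # the optimizer applies its update rule
    return loss.item()

x = torch.randn(32, 10)   # made-up batch: 32 samples, 10 features
y = torch.randn(32, 1)

sgd_model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
adam_model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

# Same training code, different optimizer objects.
sgd = optim.SGD(sgd_model.parameters(), lr=0.01, momentum=0.9)
adam = optim.Adam(adam_model.parameters(), lr=0.001)

sgd_loss = train_step(sgd_model, sgd, x, y)
adam_loss = train_step(adam_model, adam, x, y)
print(f"SGD loss: {sgd_loss:.4f}  Adam loss: {adam_loss:.4f}")
```

The training loop is identical in both cases; only the optimizer object changes, which makes it cheap to try both on your own problem.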
6. Learning Rate Scheduling
Even with Adam, it is often best to force the learning rate to drop over time. You want big steps early in training to learn fast, and tiny steps at the end to "fine-tune" the details without overshooting. We use a Scheduler.
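One way to get this behavior is `torch.optim.lr_scheduler.StepLR`, which multiplies the learning rate by a fixed factor every few epochs. The sketch below is a minimal example; the 30 epochs, `step_size=10`, and `gamma=0.1` are assumed values for illustration, and the batch loop is replaced by a placeholder.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Multiply the learning rate by gamma once every step_size epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

lr_history = []
for epoch in range(30):
    # Record the learning rate that this epoch trains with.
    lr_history.append(optimizer.param_groups[0]["lr"])
    # ... the real batch loop (forward, backward, optimizer.step()) goes here ...
    optimizer.step()    # placeholder update so the call order matches real training
    scheduler.step()    # once per epoch, after the optimizer has stepped

print(lr_history[0], lr_history[10], lr_history[20])
```

In this setup the first ten epochs train at the starting rate, the next ten at one tenth of it, and the last ten at one hundredth.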
7. Regularization (Weight Decay)
If your model is overfitting (memorizing the training data instead of learning patterns), you can penalize the model for having weights that are too large. This is called L2 Regularization, or Weight Decay. In PyTorch, you add it directly to the optimizer through its `weight_decay` argument.
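To make the link between the penalty and the optimizer setting concrete, the sketch below takes one SGD step in two ways: once with `weight_decay` set on the optimizer, and once with the L2 penalty added to the loss by hand. The data, the `wd = 0.01` value, and the variable names are assumptions for illustration. The equivalence shown here holds for plain SGD; adaptive optimizers such as Adam treat the penalty differently.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
x, y = torch.randn(16, 10), torch.randn(16, 1)   # made-up data
loss_fn = nn.MSELoss()
wd = 0.01

model_a = nn.Linear(10, 1)
model_b = copy.deepcopy(model_a)   # identical starting weights

# Option A: let the optimizer apply the penalty.
opt_a = optim.SGD(model_a.parameters(), lr=0.1, weight_decay=wd)
opt_a.zero_grad()
loss_fn(model_a(x), y).backward()
opt_a.step()

# Option B: add (wd / 2) * sum(w**2) to the loss by hand.
opt_b = optim.SGD(model_b.parameters(), lr=0.1)
opt_b.zero_grad()
l2 = sum((p ** 2).sum() for p in model_b.parameters())
(loss_fn(model_b(x), y) + 0.5 * wd * l2).backward()
opt_b.step()

# Both routes should leave the two models with matching weights.
same = all(torch.allclose(pa, pb, atol=1e-6)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print("Same weights after one step:", same)
```

Note that `weight_decay` applies to every parameter handed to the optimizer, including biases, which is why the manual penalty sums over all of `model_b.parameters()`.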
8. Automated Hyperparameter Search
Should you use 32, 64, or 128 neurons? Don't guess manually. Professionals use libraries like Ray Tune or Optuna to automate this. You write a function that builds and trains your model, and the library runs it many times (for example, 50 trials) with different combinations of Learning Rates and Layer Sizes, then reports the best configuration it found. That result is the best of the trials it ran, not a guaranteed optimum.
9. Common Mistakes
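- Stepping the Scheduler at the wrong time: With an epoch-based scheduler such as `StepLR`, you must call `scheduler.step()` at the *end* of the Epoch loop, NOT inside the batch loop. Stepping it inside the batch loop makes the Learning Rate decay far too fast, shrinking toward zero within the first epoch. (Some schedulers, such as `OneCycleLR`, are designed to be stepped once per batch; check the documentation for the one you use.)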
- Changing too many things at once: If you change the architecture, the optimizer, and the batch size all at the same time, and accuracy drops, you don't know which change caused it. Treat tuning like a science experiment: change one variable at a time.
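The scheduler mistake is easy to see in numbers. The sketch below compares the learning rate left after a short run when an epoch-based scheduler is stepped once per epoch versus once per batch. The helper `final_lr`, the 3 epochs of 100 batches, and the `gamma=0.5` decay are hypothetical values chosen for the demonstration, and the real forward and backward passes are replaced by a bare `optimizer.step()`.

```python
import torch.nn as nn
import torch.optim as optim

def final_lr(step_per_batch, epochs=3, batches_per_epoch=100):
    """Return the learning rate left after a run with a StepLR scheduler."""
    model = nn.Linear(10, 1)
    optimizer = optim.SGD(model.parameters(), lr=0.1)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
    for _ in range(epochs):
        for _ in range(batches_per_epoch):
            optimizer.step()          # stands in for forward/backward/update
            if step_per_batch:
                scheduler.step()      # the mistake: decays on every batch
        if not step_per_batch:
            scheduler.step()          # intended: decays once per epoch
    return optimizer.param_groups[0]["lr"]

correct_lr = final_lr(step_per_batch=False)   # halved 3 times
wrong_lr = final_lr(step_per_batch=True)      # halved 300 times
print(f"per-epoch stepping: {correct_lr:g}   per-batch stepping: {wrong_lr:g}")
```

Stepped per epoch, the rate is halved three times and stays usable. Stepped per batch, it is halved 300 times and becomes so small that the weights effectively stop changing.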
10. Best Practices
- Use Logarithmic Scales for LR: When testing learning rates, don't test `0.01` and `0.02`. Test by orders of magnitude (Powers of 10): `0.1`, `0.01`, `0.001`, `0.0001`.
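A log-scale search can be set up in a few lines of plain Python. The sketch below builds the power-of-10 grid from the bullet above and also draws random candidates that are uniform on a log scale, which is roughly how search tools typically sample learning rates. The helper `sample_log_uniform` and the range `1e-4` to `1e-1` are assumptions for illustration.

```python
import math
import random

# A log-spaced grid: each candidate is 10x smaller than the previous one.
grid = [10.0 ** exponent for exponent in range(-1, -5, -1)]

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)   # fixed seed so the draws are reproducible
samples = [sample_log_uniform(1e-4, 1e-1, rng) for _ in range(5)]

print("grid:   ", grid)
print("samples:", [f"{s:.5f}" for s in samples])
```

Sampling on a log scale gives each order of magnitude an equal chance of being tried, whereas sampling uniformly between `0.0001` and `0.1` would put about 90% of the draws above `0.01`.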
11. Exercises
1. What happens to the training process if the Learning Rate is set astronomically high (e.g., `100.0`)?
2. Explain the difference between `optimizer.step()` and `scheduler.step()`.
12. MCQ Quiz with Answers
Question 1
Which of the following is considered a Hyperparameter rather than a Model Parameter?
Question 2
Why is the Adam optimizer generally preferred over standard Stochastic Gradient Descent (SGD) for beginners?
13. Interview Questions
- Q: Explain the metaphor of walking down a mountain in relation to the Optimizer and the Learning Rate.
- Q: What is the purpose of Weight Decay (L2 Regularization), and how is it implemented in PyTorch?
14. FAQs
Q: Is there an Optimizer better than Adam? A: AdamW (Adam with decoupled Weight Decay) is widely used for training large transformer models like GPT. If you want Weight Decay together with Adam, `torch.optim.AdamW` is usually the better choice in modern workflows.
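What "decoupled" means can be seen with a small experiment. The sketch below applies a single update to one weight while forcing its loss gradient to zero, so that only the weight decay term acts. The helper `one_step` and the values `lr=0.1` and `weight_decay=0.1` are assumptions chosen to make the difference visible, not typical training settings.

```python
import torch
import torch.optim as optim

def one_step(optimizer_cls):
    """Apply one update with a zero loss gradient, so only weight decay acts."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    optimizer = optimizer_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)   # pretend the loss gradient is exactly zero
    optimizer.step()
    return w.item()

adam_w_after = one_step(optim.Adam)     # decay is folded into the gradient
adamw_w_after = one_step(optim.AdamW)   # decay shrinks the weight directly
print(f"Adam: {adam_w_after:.4f}   AdamW: {adamw_w_after:.4f}")
```

With `Adam`, the decay term is added to the gradient and then rescaled by Adam's adaptive step, so the weight moves by roughly the full learning rate (from `1.0` to about `0.9`). With `AdamW`, the weight is simply multiplied by `1 - lr * weight_decay` (from `1.0` to about `0.99`), independent of the adaptive scaling.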