PyTorch Essentials
CHAPTER 18 Intermediate

Hyperparameter Tuning and Optimization

Updated: May 16, 2026
6 min read

1. Introduction

Building a model with nn.Linear(10, 64) and an Adam optimizer is easy. But *why* 64 neurons? Why not 128? Why Adam and not SGD? Choices like these, which you set before training rather than learn from the data, are called Hyperparameters. If your model is struggling to learn, throwing more data at it won't always fix the problem; you need to tune the engine. In this chapter, we will dissect the most critical hyperparameter, the Learning Rate, and learn how to systematically optimize our networks.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain what an Optimizer does mathematically.
  • Compare SGD (Stochastic Gradient Descent) vs. Adam.
  • Understand the critical importance of the Learning Rate.
  • Implement Learning Rate Schedulers in PyTorch.
  • Apply Regularization (Weight Decay) to prevent overfitting.

3. What is an Optimizer?

The Loss Function measures the error of the network's predictions. Backpropagation then computes the gradient of that loss with respect to every Weight. The Optimizer is the engine that decides exactly *how* to use those gradients to update the Weights and reduce the error. Imagine you are blindfolded on a mountain, trying to walk down to the lowest valley (the minimum loss). The gradient tells you which way is downhill; the Optimizer determines which direction you step and how big a step you take.
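
To make "how to update the Weights" concrete, here is a minimal sketch, assuming plain SGD with no momentum and a made-up one-weight loss. It compares a hand-written update (weight minus learning rate times gradient) with what optimizer.step() does. The tensor names and the toy loss are illustrative assumptions, not part of any real model.

```python
import torch
import torch.optim as optim

lr = 0.1

# Two copies of the same single weight, both starting at 5.0
w_manual = torch.tensor([5.0], requires_grad=True)
w_optim = torch.tensor([5.0], requires_grad=True)
optimizer = optim.SGD([w_optim], lr=lr)


def toy_loss(w):
    """Hypothetical loss with its minimum at w = 2 (for illustration only)."""
    return ((w - 2.0) ** 2).sum()


# Manual update: backward() fills w_manual.grad, then we step against the gradient
toy_loss(w_manual).backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The same update, performed by the optimizer
optimizer.zero_grad()
toy_loss(w_optim).backward()
optimizer.step()

print(w_manual.item(), w_optim.item())
```

Both weights move from 5.0 toward the minimum by the same amount, which is all that plain SGD does. Other optimizers differ only in how they turn the gradient into a step.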

4. The Learning Rate (The Most Important Hyperparameter)

The "size of the step" the Optimizer takes is called the Learning Rate (LR).
  • LR Too Large (e.g., 0.1 with Adam): You take massive leaps. You might jump completely over the valley and end up higher on the other side. The model often fails to converge and the loss fluctuates wildly or blows up.
  • LR Too Small (e.g., 0.000001): You take microscopic baby steps. Training takes far longer than necessary, and the model may stall on a plateau or in a small ditch (a local minimum).
  • Just Right (often around 0.001 or 0.0001 for Adam): You walk smoothly down to the lowest point. The exact value depends on the model, the data, and the optimizer, so treat these numbers as starting points.
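
The three cases above can be reproduced on paper-sized math. The sketch below is an assumption-laden toy: it runs gradient descent on f(w) = w², a function chosen only because its gradient (2w) is easy to follow. The learning rates that count as "too large" or "too small" here belong to this toy function and do not transfer to real networks.

```python
def descend(lr, steps=50, start=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad
    return w


too_large = descend(lr=1.1)    # every step overshoots and lands farther from 0
too_small = descend(lr=0.001)  # barely moves in 50 steps
reasonable = descend(lr=0.1)   # shrinks steadily toward the minimum at 0

print(f"lr=1.1   -> w = {too_large:.3e}")
print(f"lr=0.001 -> w = {too_small:.3e}")
print(f"lr=0.1   -> w = {reasonable:.3e}")
```

Each update multiplies w by (1 - 2·lr). With lr=1.1 that factor is -1.2, so the weight grows in magnitude on every step; with lr=0.001 it is 0.998, so progress is real but very slow.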

5. Optimizers: SGD vs. Adam

In PyTorch, you import optimizers from torch.optim.
  • SGD (Stochastic Gradient Descent): The classic algorithm. It is simple and reliable, but it applies the same Learning Rate to every weight, so it can be slow and is sensitive to the value you pick.
  • Adam (Adaptive Moment Estimation): The modern industry standard. You still set a base Learning Rate, but Adam scales the step for every single weight using running averages of that weight's gradients and squared gradients. Weights with consistently large gradients get smaller effective steps, and weights with small gradients get larger ones. This makes Adam far less sensitive to the exact Learning Rate, which is why it is a sensible default baseline.
```python
import torch.nn as nn
import torch.optim as optim

# Placeholder model so the snippet runs on its own
model = nn.Linear(10, 64)

# Standard Adam setup
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)

# Standard SGD setup (Momentum helps push it out of small ditches)
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```
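
To show what "adaptive" means in practice, here is a minimal sketch of one Adam update for a single scalar weight, written in plain Python rather than taken from PyTorch's source. The function name adam_step is hypothetical; the default values mirror the defaults of optim.Adam (lr=0.001, betas of 0.9 and 0.999, eps=1e-8).

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First step from w = 1.0, once with a tiny gradient and once with a huge one
w_small, _, _ = adam_step(1.0, grad=0.01, m=0.0, v=0.0, t=1)
w_large, _, _ = adam_step(1.0, grad=100.0, m=0.0, v=0.0, t=1)

step_small = 1.0 - w_small
step_large = 1.0 - w_large
print(step_small, step_large)  # both are close to lr = 0.001
```

The two gradients differ by a factor of 10,000, yet both weights move by roughly the base Learning Rate. Plain SGD would have moved them by amounts 10,000 times apart, which is why SGD needs more careful tuning.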

6. Learning Rate Scheduling

Even with Adam, it is often best to force the learning rate to drop over time. You want big steps early in training to learn fast, and tiny steps at the end to "fine-tune" the details without overshooting. We use a Scheduler.
```python
import torch.optim as optim

# Create Optimizer (model is assumed to be an existing nn.Module)
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Create Scheduler: multiply the LR by 0.1 (a 10x reduction) every 5 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(20):
    # ... Standard training steps (forward, backward, optimizer.step()) ...

    # CRITICAL: Step the scheduler at the END of the epoch, after optimizer.step()
    scheduler.step()

    # Print the learning rate now in effect (the next epoch will use it)
    current_lr = scheduler.get_last_lr()[0]
    print(f"Epoch {epoch} | Current LR: {current_lr}")
```
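
The loop above leaves the training steps elided. As a rough, self-contained sketch of how the pieces fit together, the version below fills them in with random synthetic data and a one-layer model. Both are assumptions made purely so the code runs; the list lr_per_epoch is a hypothetical helper for inspecting the schedule.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic regression data (stand-in for a real dataset)
inputs = torch.randn(64, 10)
targets = torch.randn(64, 1)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

lr_per_epoch = []
for epoch in range(20):
    # Record the LR this epoch trains with, before the scheduler changes it
    lr_per_epoch.append(optimizer.param_groups[0]["lr"])

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()   # update the weights first

    scheduler.step()   # then advance the schedule, once per epoch

print(lr_per_epoch[0], lr_per_epoch[5], lr_per_epoch[10], lr_per_epoch[15])
```

Epochs 0 to 4 train at 0.01, epochs 5 to 9 at roughly 0.001, and so on, dropping by a factor of 10 every 5 epochs.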

7. Regularization (Weight Decay)

If your model is overfitting (memorizing the training data instead of learning patterns), you can penalize the model for having weights that are too large. This is called L2 Regularization, commonly referred to as Weight Decay. In PyTorch, you don't add the penalty to the loss yourself; you pass weight_decay directly to the optimizer!
```python
# weight_decay adds a penalty for large weights, nudging the network toward simpler solutions
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
```
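
To see what the penalty does mechanically, the sketch below uses plain SGD instead of Adam, only because its update is simple enough to check by hand. With weight_decay set, the optimizer adds weight_decay times the weight to the gradient before stepping. The values of lr and wd are exaggerated assumptions chosen to make the effect visible.

```python
import torch
import torch.optim as optim

lr, wd = 0.1, 0.5  # exaggerated weight_decay so the effect is easy to see

w = torch.tensor([2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=lr, weight_decay=wd)

# A loss whose gradient with respect to w is a constant 1.0
loss = w.sum()
loss.backward()
optimizer.step()

# By hand: the optimizer uses (grad + wd * w) in place of grad
expected = 2.0 - lr * (1.0 + wd * 2.0)
print(w.item(), expected)
```

The extra wd * w term always points toward zero, so larger weights are pulled back harder on every step.
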

8. Automated Hyperparameter Tuning

Should you use 32, 64, or 128 neurons? Don't guess manually. Professionals use libraries like Ray Tune or Optuna to automate this. You write a function that builds and trains your model for a given configuration, and the library runs it many times (for example, 50 trials) with different combinations of Learning Rates and Layer Sizes, then reports the best configuration it found.
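
The idea behind these libraries can be sketched with plain random search in pure Python. This is not the Ray Tune or Optuna API, which add smarter search strategies and early stopping on top. The function toy_validation_loss is a hypothetical stand-in for "train the model and return its validation loss"; it is constructed to be lowest near a Learning Rate of 0.001 and 64 neurons.

```python
import math
import random


def toy_validation_loss(lr, hidden_size):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (math.log10(lr) + 3.0) ** 2 + (math.log2(hidden_size) - 6.0) ** 2


random.seed(0)
trials = []
for _ in range(50):
    lr = 10 ** random.uniform(-5, -1)           # log-uniform learning rate
    hidden_size = random.choice([32, 64, 128])  # candidate layer widths
    trials.append((toy_validation_loss(lr, hidden_size), lr, hidden_size))

# The tuple with the smallest loss is the best configuration found
best_loss, best_lr, best_hidden = min(trials)
print(f"best loss={best_loss:.4f} lr={best_lr:.5f} hidden={best_hidden}")
```

In a real search, each trial would run an actual training loop, so 50 trials cost 50 training runs. That cost is why dedicated libraries try to stop unpromising trials early.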

9. Common Mistakes

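  • Stepping the Scheduler at the wrong time: With an epoch-based scheduler such as StepLR, call scheduler.step() once at the *end* of each epoch, after optimizer.step(), NOT inside the batch loop. Stepping it inside the batch loop applies the decay once per batch, so your Learning Rate collapses toward zero within the first epoch. (Some schedulers, such as OneCycleLR, are designed to be stepped every batch; check the documentation of the one you use.)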
  • Changing too many things at once: If you change the architecture, the optimizer, and the batch size all at the same time, and accuracy drops, you don't know which change caused it. Treat tuning like a science experiment: change one variable at a time.
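
The scheduler mistake is easy to quantify. The sketch below uses the closed form of the StepLR schedule with the settings from the scheduling section (initial LR 0.01, step_size=5, gamma=0.1). The figure of 100 batches per epoch is an assumption for illustration.

```python
base_lr, gamma, step_size = 0.01, 0.1, 5
batches_per_epoch = 100  # assumed value for illustration


def step_lr(num_scheduler_steps):
    """Closed form of the StepLR schedule after a given number of step() calls."""
    return base_lr * gamma ** (num_scheduler_steps // step_size)


# Stepping once per epoch: after one epoch the scheduler has stepped once
lr_after_one_epoch_correct = step_lr(1)

# Stepping once per batch: after one epoch the scheduler has stepped 100 times
lr_after_one_epoch_wrong = step_lr(batches_per_epoch)

print(lr_after_one_epoch_correct)  # still 0.01
print(lr_after_one_epoch_wrong)    # about 1e-22
```

With per-batch stepping, the Learning Rate has been divided by 10 twenty times before the first epoch ends, so the weights effectively stop changing.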

10. Best Practices

  • Use Logarithmic Scales for LR: When testing learning rates, don't test 0.01 and 0.02. Test by orders of magnitude (Powers of 10): 0.1, 0.01, 0.001, 0.0001.
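
A short sketch of why the logarithmic grid is preferred: it compares four log-spaced candidates with four evenly spaced candidates over the same range. The cutoff of 0.005 for a "small" Learning Rate is an arbitrary assumption used only for counting.

```python
# Candidates spaced by powers of 10: 0.1, 0.01, 0.001, 0.0001
log_grid = [10.0 ** exponent for exponent in range(-1, -5, -1)]

# Four evenly spaced candidates covering the same range, for contrast
linear_grid = [0.0001 + i * (0.1 - 0.0001) / 3 for i in range(4)]

# Count how many candidates explore the small-LR region (below 0.005)
small_in_log = sum(lr < 0.005 for lr in log_grid)
small_in_linear = sum(lr < 0.005 for lr in linear_grid)

print(log_grid)
print(linear_grid)
print(small_in_log, small_in_linear)
```

The evenly spaced grid spends three of its four trials above 0.03, while the log grid spreads its trials across every order of magnitude.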

11. Exercises

  1. What happens to the training process if the Learning Rate is set astronomically high (e.g., 100.0)?
  2. Explain the difference between optimizer.step() and scheduler.step().

12. MCQ Quiz with Answers

Question 1

Which of the following is considered a Hyperparameter rather than a Model Parameter?

Question 2

Why is the Adam optimizer generally preferred over standard Stochastic Gradient Descent (SGD) for beginners?

13. Interview Questions

  • Q: Explain the metaphor of walking down a mountain in relation to the Optimizer and the Learning Rate.
  • Q: What is the purpose of Weight Decay (L2 Regularization), and how is it implemented in PyTorch?

14. FAQs

Q: Is there an Optimizer better than Adam? A: AdamW (Adam with decoupled Weight Decay) has become the standard choice for training massive transformer models like GPT. It is available as optim.AdamW and is highly recommended in modern workflows, especially whenever you use weight decay!
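
Switching to AdamW is a one-line change. The sketch below assumes a placeholder model with the same shape as the nn.Linear(10, 64) layer from the introduction; the lr and weight_decay values are illustrative, not tuned recommendations.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 64)  # placeholder model for illustration

# AdamW applies weight decay directly to the weights instead of
# folding it into the gradient, as Adam's weight_decay argument does
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

print(optimizer.defaults["weight_decay"])
```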

15. Summary

An AI is only as smart as its training mechanism. By understanding the critical role of the Learning Rate and the adaptive power of the Adam optimizer, we can ensure our models navigate the loss landscape efficiently. By leveraging tools like LR Schedulers and Weight Decay, we help our models converge reliably and generalize beyond the training data.

16. Next Chapter Recommendation

Our model is perfectly tuned, but is our code executing at maximum speed? In Chapter 19: Performance Optimization and GPU Training, we will cover the advanced techniques required to squeeze every ounce of compute out of your NVIDIA GPU.
