Kubernetes Autoscaling
# CHAPTER 12
1. Introduction
The ultimate promise of the Cloud is "Elasticity": the ability to scale resources up quickly when demand surges, and scale them back down to save money when demand drops. While you can manually type `kubectl scale deployment web --replicas=50`, relying on someone to do this at 3:00 AM during a traffic spike is unacceptable. In this chapter, we will master the Horizontal Pod Autoscaler (HPA), the Kubernetes robot that monitors metrics such as CPU usage and scales your applications autonomously.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Define the Horizontal Pod Autoscaler (HPA).
- Understand the difference between Horizontal and Vertical scaling.
- Configure CPU and Memory Resource Requests/Limits on a Pod.
- Enable the Metrics Server in Minikube.
- Deploy an HPA to autonomously scale a Deployment under load.
3. Beginner-Friendly Explanation
Imagine a checkout line at a grocery store.
- Manual Scaling: The store manager watches the line. If it gets too long, they manually open a new register. If they are busy in the back office, the line extends out the door and customers leave.
- The HPA (Autoscaling): You install a robot above the registers. You program it: "If the average line has more than 5 people, automatically open another register. If the average line drops below 2 people, close a register." The manager goes home and sleeps peacefully, knowing the store will dynamically adapt to any rush hour.
4. Horizontal vs. Vertical Scaling
- Horizontal Scaling (HPA): Adding *more* Pods. (Going from 2 Nginx Pods to 10 Nginx Pods). This is the standard for web applications and microservices.
- Vertical Scaling (VPA): Adding *more power* to an existing Pod. (Giving a MySQL database Pod 8 GB of RAM instead of 2 GB). This has traditionally required restarting the Pod and is generally used for workloads that are hard to replicate, such as monolithic databases.
5. The Prerequisite: Resource Limits and Metrics
The HPA robot cannot scale your Pods if it doesn't know how much CPU they are currently using!
Step 1: You must install the `metrics-server` (a cluster add-on that measures CPU/RAM).
Step 2: Your Deployment YAML *MUST* contain resources.requests. You have to tell Kubernetes what "100% CPU usage" looks like for your specific application before it can calculate a percentage.
Example Pod Resource Block:
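The resource block that belongs here appears to be missing from the text. As a minimal sketch, the shell snippet below writes a Deployment manifest whose container declares CPU requests and limits. The file name, the image, and the 200m/500m values are assumptions borrowed from the upstream Kubernetes HPA walkthrough, not values confirmed by this chapter.

```shell
# Write an example Deployment manifest that includes a resources block.
# Assumptions: file name, image, and CPU values are illustrative only.
cat > php-apache-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m   # baseline the HPA treats as 100% utilization
          limits:
            cpu: 500m   # ceiling; the container is throttled above this
EOF

# Apply it once a cluster is available:
# kubectl apply -f php-apache-deployment.yaml
```

With a 200m request, a Pod that is consuming 100m of CPU is reported at 50% utilization.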
6. Anatomy of an HPA YAML
Once resources are defined, you create the HPA object. This configuration states: "Monitor the `php-apache` deployment. Keep the average CPU usage at 50%. If it goes higher, spawn up to 10 Pods to handle the load. If it drops, scale down to a minimum of 1 Pod."
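The configuration described above could look roughly like the following. This is a sketch using the `autoscaling/v2` API; the file name and the HPA object name are assumptions, while the target, minimum, and maximum come from the description.

```shell
# Write an HPA manifest matching the description: target 50% CPU, 1 to 10 Pods.
# Assumption: the file name and metadata.name are illustrative.
cat > php-apache-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:            # the workload this HPA controls
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # percent of each Pod's CPU request
EOF

# Apply it once a cluster is available:
# kubectl apply -f php-apache-hpa.yaml
```

Note that `averageUtilization` is measured against the container's CPU request, not against the node's total CPU.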
7. Mini Project: Auto-Scale Under Load
Let's generate fake traffic and watch Kubernetes scale our infrastructure.
Step-by-Step Tutorial:
- 1. Critical: Enable the metrics server in Minikube:
- 2. Deploy a mathematical application that deliberately consumes CPU:
- 3. Instead of writing a YAML, let's create the HPA using a quick imperative command:
- 4. Verify the HPA is active:
*(Initially, TARGETS might say <unknown>/50%. Wait 60 seconds for the metrics server to gather data).*
- 5. The Stress Test: Open a *new* terminal window. Let's run a temporary busybox Pod to bombard our application with infinite HTTP requests, simulating a viral traffic spike:
- 6. Return to your original terminal and watch the HPA react:
*(Within 1-2 minutes, you will see the CPU spike to 200%+. The HPA will panic and automatically scale the REPLICAS from 1 -> 4 -> 8 -> 10 to absorb the load!)*
- 7. Stop the `load-generator` (Ctrl+C). After about 5 minutes, the HPA will notice the traffic is gone and gracefully scale the replicas back down to 1 to save resources.
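The commands for the steps above seem to have been dropped from the text. The sketch below reconstructs them as small shell functions, one per step, following the upstream Kubernetes HPA walkthrough. The function names are hypothetical helpers, and the manifest URL, the `busybox:1.28` image tag, and the `http://php-apache` Service address are assumptions taken from that walkthrough rather than from this chapter.

```shell
# Hypothetical helper functions, one per tutorial step. Defining them runs
# nothing; call each one by name in the terminal indicated.

# Step 1: enable the metrics server add-on in Minikube.
step1_enable_metrics() { minikube addons enable metrics-server; }

# Step 2: deploy the CPU-heavy demo app and its Service.
# Assumption: this is the example manifest from the upstream Kubernetes docs.
step2_deploy_app() {
  kubectl apply -f https://k8s.io/examples/application/php-apache.yaml
}

# Step 3: create the HPA imperatively (target 50% CPU, 1 to 10 Pods).
step3_create_hpa() {
  kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
}

# Step 4: verify the HPA exists and check its TARGETS column.
step4_check_hpa() { kubectl get hpa; }

# Step 5 (new terminal): loop HTTP requests against the app until Ctrl+C.
step5_generate_load() {
  kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never \
    -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"
}

# Step 6 (original terminal): watch the HPA react to the load.
step6_watch_hpa() { kubectl get hpa php-apache --watch; }
```

One way to use this is to save it as a file, `source` it in both terminals, and then run `step1_enable_metrics`, `step2_deploy_app`, and so on in order.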
8. Real-World Scenarios
A ticket sales website operates at minimal capacity (3 Pods) for weeks. Suddenly, tickets for a global pop star go on sale. Within 10 seconds, 100,000 users hit the site. The HPA detects the massive CPU spike and rapidly scales the frontend Deployment toward 200 Pods. However, the AWS worker nodes run out of capacity, leaving new Pods stuck in Pending! To solve this, DevOps engineers combine the HPA with the Cluster Autoscaler, a tool that automatically provisions *more EC2 instances* from AWS when the HPA needs more room for Pods.
9. Best Practices
- Scale-Down Cooldown: The HPA applies a "cooldown" period to scale-down decisions, known as the stabilization window (5 minutes by default). When traffic drops, the HPA will NOT scale down immediately. It waits to ensure the traffic drop wasn't just a temporary dip. This prevents "thrashing" (rapidly creating and destroying Pods over and over).
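In the `autoscaling/v2` API this cooldown is exposed as `behavior.scaleDown.stabilizationWindowSeconds`. As a hedged sketch, the snippet below writes a small patch that sets it explicitly to 300 seconds; the file name and the HPA name `php-apache` are assumptions for illustration.

```shell
# Write a merge patch that pins the scale-down stabilization window.
# Assumption: the file name and target HPA name are illustrative.
cat > hpa-scale-down-patch.yaml <<'EOF'
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before removing Pods
EOF

# Apply it to an existing HPA once a cluster is available:
# kubectl patch hpa php-apache --type merge --patch-file hpa-scale-down-patch.yaml
```

A longer window makes scale-down more conservative; a shorter one frees resources sooner at the cost of more churn.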
10. Common Mistakes
- Missing Resource Requests: The absolute most common reason an HPA fails to scale (showing `<unknown>/50%` forever) is because the developer forgot to define `resources.requests` in the Deployment YAML. If the HPA doesn't know the Pod's baseline CPU allowance, it cannot calculate the math required to scale it.
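To see why the request is indispensable, it helps to work through the arithmetic. The sketch below follows the formula given in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), where utilization is measured usage divided by the CPU request. The function names are hypothetical, and the sketch deliberately ignores the controller's tolerance band and the min/max replica clamps.

```shell
# utilization_percent USAGE_MILLICORES REQUEST_MILLICORES
# Usage as a percentage of the request. Without a request there is no
# denominator, which is why the HPA can only report <unknown>.
utilization_percent() {
  echo $(( $1 * 100 / $2 ))
}

# desired_replicas CURRENT_REPLICAS CURRENT_UTIL_PERCENT TARGET_UTIL_PERCENT
# Integer ceiling of current_replicas * current_util / target_util.
desired_replicas() {
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

utilization_percent 400 200   # prints 200 (400m used against a 200m request)
desired_replicas 1 200 50     # prints 4
desired_replicas 4 100 50     # prints 8
```

With a 50% target, one Pod running at 200% leads to 4 replicas; if those 4 still average 100%, the next calculation gives 8. A real HPA would then cap the result at its configured maximum.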
11. Exercises
- 1. What prerequisite cluster add-on is mathematically required for the Horizontal Pod Autoscaler to function?
- 2. Explain the difference between the Horizontal Pod Autoscaler (HPA) and the Cluster Autoscaler.
12. FAQs
Q: Can I scale based on memory (RAM) instead of CPU?
A: Yes. You can configure the HPA to scale if memory usage exceeds 70%. However, memory scaling is notoriously tricky because many programming languages (like Java) grab all available RAM and hold onto it even when idle (Garbage Collection delays), causing the HPA to scale up unnecessarily. CPU scaling is vastly more reliable.
13. Interview Questions
- Q: Explain the mechanical interaction between a Deployment's `resources.requests` definition and the Horizontal Pod Autoscaler's `averageUtilization` metric.
- Q: A production HPA is configured to scale on CPU utilization. During a load test, the CPU spikes, but the HPA fails to spawn new Pods. Describe your systematic troubleshooting process to identify the misconfiguration.