CHAPTER 12 Intermediate

Kubernetes Autoscaling

Updated: May 15, 2026
25 min read

1. Introduction

The central promise of the cloud is elasticity: the ability to scale resources up quickly when demand surges and back down when it drops, so you stop paying for idle capacity. You can scale by hand with kubectl scale deployment web --replicas=50, but relying on someone to run that command at 3:00 AM during a traffic spike is not a workable plan. In this chapter, we will master the Horizontal Pod Autoscaler (HPA), the Kubernetes controller that monitors metrics such as CPU usage and scales your applications automatically.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Define the Horizontal Pod Autoscaler (HPA).
  • Understand the difference between Horizontal and Vertical scaling.
  • Configure CPU and Memory Resource Requests/Limits on a Pod.
  • Enable the Metrics Server in Minikube.
  • Deploy an HPA to autonomously scale a Deployment under load.

3. Beginner-Friendly Explanation

Imagine a checkout line at a grocery store.
  • Manual Scaling: The store manager watches the line. If it gets too long, they manually open a new register. If they are busy in the back office, the line extends out the door and customers leave.
  • The HPA (Autoscaling): You install a robot above the registers. You program it: "If the average line has more than 5 people, automatically open another register. If the average line drops below 2 people, close a register." The manager goes home and sleeps peacefully, knowing the store will dynamically adapt to any rush hour.

4. Horizontal vs. Vertical Scaling

  • Horizontal Scaling (HPA): Adding *more* Pods. (Going from 2 Nginx Pods to 10 Nginx Pods). This is the standard for web applications and microservices.
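  • Vertical Scaling (VPA): Adding *more power* to an existing Pod. (Giving a MySQL database Pod 8GB of RAM instead of 2GB). The Vertical Pod Autoscaler has traditionally applied new values by recreating the Pod, so it is generally used for workloads that are hard to replicate, such as monolithic databases.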

5. The Prerequisite: Resource Requests and Metrics

The HPA cannot scale your Pods if it does not know how much CPU they are currently using, so two things must be in place first.
  • Step 1: Install the metrics-server, a cluster add-on that collects CPU and memory usage.
  • Step 2: Make sure your Deployment YAML contains resources.requests. The HPA expresses utilization as a percentage of the requested amount, so the request is what defines "100% CPU usage" for your specific application. Without it, there is nothing to calculate a percentage against.

Example Pod Resource Block:

```yaml
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        resources:
          requests:
            cpu: 200m # 200 milliCPU (20% of a single CPU core)
```
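
To make the percentage concrete, here is a minimal Python sketch of how utilization relates to the request. It assumes the simplified rule that average utilization is total usage divided by total requested CPU across the Pods. The function name and the sample readings are hypothetical, chosen only for illustration.

```python
def average_cpu_utilization(usage_millicores, request_millicores):
    """Return average CPU utilization as a percentage of requested CPU.

    usage_millicores: current CPU usage per Pod, in millicores.
    request_millicores: resources.requests.cpu per Pod, in millicores.
    """
    if len(usage_millicores) != len(request_millicores):
        raise ValueError("need one request per usage reading")
    total_request = sum(request_millicores)
    if total_request == 0:
        # No request means there is nothing to take a percentage of.
        raise ValueError("Pods have no CPU request; utilization is undefined")
    return 100 * sum(usage_millicores) / total_request


# One Pod requesting 200m and currently using 100m is at 50%.
print(average_cpu_utilization([100], [200]))            # 50.0
# Usage can exceed the request, so values above 100% are possible.
print(average_cpu_utilization([400, 400], [200, 200]))  # 200.0
```

Note that the percentage is relative to the request, not to the capacity of the node, which is why a Pod can report more than 100%.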

6. Anatomy of an HPA YAML

Once resource requests are defined, you create the HPA object. This configuration states: "Monitor the php-apache Deployment and keep average CPU utilization at 50% of the requested CPU. If utilization rises above that, add Pods, up to a maximum of 10. If it falls, remove Pods, down to a minimum of 1."

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```
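
The scaling decision behind this manifest can be sketched in a few lines of Python. The upstream documentation describes the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies that rule with the minReplicas and maxReplicas bounds from the manifest and a 10% tolerance, which is the documented default. The function name and sample readings are hypothetical, and the real controller handles more cases (missing metrics, Pods that are not yet ready) than this simplification.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale in proportion to current/target utilization."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Never go outside the minReplicas..maxReplicas range.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(1, 200, 50))  # 4: one Pod at 200% needs four Pods at 50%
print(desired_replicas(4, 52, 50))   # 4: a ratio of 1.04 is inside the tolerance band
print(desired_replicas(8, 70, 50))   # 10: ceil(11.2) = 12, capped at maxReplicas
print(desired_replicas(3, 10, 50))   # 1: ceil(0.6) = 1
```

Because the rule is proportional, a large overshoot produces a large jump in replicas rather than adding Pods one at a time.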

7. Mini Project: Auto-Scale Under Load

Let's generate fake traffic and watch Kubernetes scale our infrastructure.

Step-by-Step Tutorial:

  1. Critical: Enable the metrics server in Minikube:

```bash
minikube addons enable metrics-server
```
  2. Deploy a sample application that performs CPU-intensive math on every request:

```bash
kubectl apply -f https://k8s.io/examples/application/php-apache.yaml
```
  3. Instead of writing the YAML by hand, create the HPA with a quick imperative command:

```bash
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
```
  4. Verify the HPA is active:

```bash
kubectl get hpa
```

*(Initially, TARGETS might show <unknown>/50%. Give the metrics server about a minute to gather data.)*

  5. The Stress Test: Open a *new* terminal window. Run a temporary busybox Pod that sends a continuous stream of HTTP requests to the application, simulating a viral traffic spike:

```bash
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"
```
  6. Return to your original terminal and watch the HPA react:

```bash
kubectl get hpa -w
```

*(Within 1-2 minutes, you should see CPU utilization climb well above the 50% target, often past 200%. The HPA responds by raising REPLICAS in steps, for example 1 -> 4 -> 8 -> 10, until the load is absorbed or the maximum is reached. Exact numbers vary by machine.)*

  7. Stop the load-generator (Ctrl+C). After about 5 minutes (the default scale-down stabilization window), the HPA will scale the replicas back down to 1 to save resources.

8. Real-World Scenarios

A ticket sales website operates at minimal capacity (3 Pods) for weeks. Suddenly, tickets for a global pop star go on sale. Within 10 seconds, 100,000 users hit the site. Once the metrics reflect the CPU spike, the HPA raises the frontend Deployment's replica count toward 200 Pods. However, the AWS worker nodes run out of capacity, so many of the new Pods cannot be scheduled and stay Pending. To solve this, DevOps engineers combine the HPA with the Cluster Autoscaler, a tool that automatically provisions *additional EC2 instances* from AWS when Pods need more room than the existing nodes can offer.
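
A rough capacity estimate shows why the Pods in this scenario cannot all be scheduled. The Python sketch below assumes each Pod requests 200m of CPU (the value used earlier in this chapter) and that each worker node has 4000m of allocatable CPU. The node size is a hypothetical figure, and real scheduling also depends on memory, system Pods, and other constraints.

```python
import math


def nodes_needed(pod_count, pod_cpu_request_m, node_allocatable_cpu_m):
    """Estimate how many nodes are needed to fit pod_count Pods by CPU request."""
    pods_per_node = node_allocatable_cpu_m // pod_cpu_request_m
    if pods_per_node == 0:
        raise ValueError("a single Pod does not fit on one node")
    return math.ceil(pod_count / pods_per_node)


# 3 Pods fit on a single node, but 200 Pods need ten nodes of this size.
print(nodes_needed(3, 200, 4000))    # 1
print(nodes_needed(200, 200, 4000))  # 10
```

Until those extra nodes exist, the Pods the scheduler cannot place stay in the Pending state, which is the signal the Cluster Autoscaler reacts to.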

9. Best Practices

  • Scale Down Cooldown: The HPA applies a scale-down "stabilization window" (5 minutes by default). When traffic drops, the HPA will NOT scale down immediately. It waits to ensure the drop wasn't just a temporary dip, which prevents "thrashing" (rapidly creating and destroying Pods over and over). With the autoscaling/v2 API you can tune this per HPA through behavior.scaleDown.stabilizationWindowSeconds.
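
The cooldown idea can be sketched as follows. This is a simplified Python model, not the controller's actual code: it assumes the HPA remembers its recent replica recommendations and acts on the highest one seen within the window (300 seconds here). The class name and the timestamps are hypothetical.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keep recent recommendations and return the highest one in the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, desired_replicas)

    def recommend(self, now, desired_replicas):
        self._history.append((now, desired_replicas))
        cutoff = now - self.window_seconds
        # Forget recommendations that are older than the window.
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
print(stabilizer.recommend(now=0, desired_replicas=10))   # 10 (traffic spike)
print(stabilizer.recommend(now=60, desired_replicas=1))   # 10 (spike still inside the window)
print(stabilizer.recommend(now=299, desired_replicas=1))  # 10
print(stabilizer.recommend(now=301, desired_replicas=1))  # 1 (spike has aged out)
```

A short dip in traffic therefore never removes Pods, because an older, higher recommendation is still inside the window.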

10. Common Mistakes

  • Missing Resource Requests: One of the most common reasons an HPA fails to scale (showing <unknown>/50% indefinitely) is that the developer forgot to define resources.requests in the Deployment YAML. Utilization is measured as a percentage of the request, so without that baseline the HPA has nothing to compute against. If requests are present and the target is still <unknown>, check that the metrics server is installed and running.
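
A quick way to catch this mistake is to inspect the manifest before deploying it. The Python sketch below works on a Deployment that has already been loaded into a plain dictionary. The helper name and the log-shipper sidecar are hypothetical. It flags every container without a CPU request, since a CPU utilization target needs a request on each container in the Pod.

```python
def containers_missing_cpu_request(deployment):
    """Return the names of containers that define no resources.requests.cpu."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    missing = []
    for container in containers:
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        if "cpu" not in requests:
            missing.append(container["name"])
    return missing


deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "php-apache", "resources": {"requests": {"cpu": "200m"}}},
        {"name": "log-shipper"},  # hypothetical sidecar with no requests
    ]}}}
}
print(containers_missing_cpu_request(deployment))  # ['log-shipper']
```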

11. Exercises

  1. What prerequisite cluster add-on must be installed before the Horizontal Pod Autoscaler can read CPU and memory metrics?
  2. Explain the difference between the Horizontal Pod Autoscaler (HPA) and the Cluster Autoscaler.

12. FAQs

Q: Can I scale based on memory (RAM) instead of CPU?
A: Yes. You can configure the HPA to scale when average memory utilization exceeds a target such as 70% of the requested memory. However, memory scaling is notoriously tricky because many runtimes (the Java Virtual Machine is a common example) reserve heap and hold onto it even when the application is idle, so reported usage stays high and the HPA may scale up unnecessarily or never scale back down. For request-driven workloads, CPU is usually the more reliable signal.
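
The runaway effect can be illustrated with a small Python simulation. It assumes a simplified replica rule, ceil(replicas × current / target), and a hypothetical application whose Pods each sit at 90% memory utilization no matter how many replicas exist. All numbers are invented for illustration.

```python
import math


def simulate_memory_scaling(start_replicas, per_pod_utilization, target,
                            max_replicas, rounds):
    """Apply the replica rule repeatedly while per-Pod memory use stays flat."""
    history = [start_replicas]
    replicas = start_replicas
    for _ in range(rounds):
        wanted = math.ceil(replicas * per_pod_utilization / target)
        replicas = min(max_replicas, wanted)
        if replicas == history[-1]:
            break  # nothing changes any more
        history.append(replicas)
    return history


# Adding Pods does not lower per-Pod memory, so the HPA climbs to the maximum.
print(simulate_memory_scaling(2, 90, 70, max_replicas=10, rounds=10))
# [2, 3, 4, 6, 8, 10]
```

With CPU, spreading the same traffic over more Pods lowers per-Pod utilization and the loop settles. With memory that is held regardless of load, it does not.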

13. Interview Questions

  • Q: Explain the mechanical interaction between a Deployment's resources.requests definition and the Horizontal Pod Autoscaler's averageUtilization metric.
  • Q: A production HPA is configured to scale on CPU utilization. During a load test, the CPU spikes, but the HPA fails to spawn new Pods. Describe your systematic troubleshooting process to identify the misconfiguration.

14. Summary

In Chapter 12, we achieved the holy grail of cloud computing: Elasticity. We introduced the Horizontal Pod Autoscaler (HPA) as an autonomous controller that continuously monitors application metrics. We configured foundational Resource Requests, deployed the Metrics Server, and executed a live stress test, witnessing the HPA dynamically scale our Pod replicas up to absorb a traffic spike and gracefully scale them down to conserve resources when the crisis subsided.

15. Next Chapter Recommendation

Our applications are scaling beautifully, but visibility is low. We need enterprise dashboards to visualize our metrics and logs. Proceed to Chapter 13: Monitoring and Logging in Kubernetes.
