CHAPTER 13
Intermediate
Monitoring and Logging in Kubernetes
Updated: May 15, 2026
25 min read
1. Introduction
If a monolithic application running on a single server crashes, you SSH into the server and read /var/log/app.log. If a microservice architecture running across 500 Pods and 20 Nodes crashes, reading logs by hand, one Pod at a time, is simply not practical. Kubernetes demands centralized observability. In this chapter, we will master basic kubectl debugging tools, explore robust Health Checks, and introduce a widely used open-source observability stack: Prometheus and Grafana.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Retrieve real-time logs from standalone and multi-container Pods.
- Differentiate between Liveness, Readiness, and Startup Probes.
- Author YAML configurations for Health Checks.
- Understand the architectural purpose of Prometheus (Metrics).
- Understand the architectural purpose of Grafana (Visualization).
3. Beginner-Friendly Explanation
Imagine a massive hospital.
- kubectl logs (The Clipboard): You ask a specific nurse to read you the notes for exactly one patient. Good for a quick check, useless for understanding the health of the entire hospital.
- Health Probes (The Heart Monitor): A machine attached to every patient. If the patient's heart stops (Liveness Probe fails), the machine automatically pages a doctor to resuscitate them. If the patient is dizzy (Readiness Probe fails), the machine tells the receptionist to stop sending visitors to the room until they recover.
- Prometheus & Grafana (The Command Center): A massive wall of TVs in the director's office. Prometheus acts as the thousands of sensors gathering data (temperature, heart rates, blood pressure) from every room. Grafana takes that raw data and draws beautiful, color-coded graphs on the TVs so the director can instantly see if the hospital is failing.
4. Basic Logging (kubectl logs)
To troubleshoot a failing application, you must read its output.
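- Single Container Pod: kubectl logs <pod-name>
- Live Stream Logs: kubectl logs -f <pod-name> (follows the logs in real time).
- Multi-Container Pod: If a Pod has a web container and a sidecar container, specify which one you want to read with the -c flag: kubectl logs <pod-name> -c <container-name>. If you omit -c, kubectl either returns an error asking you to choose or falls back to a default container, depending on the kubectl version, so do not rely on the default.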
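To make the multi-container case concrete, here is a minimal sketch of a Pod with two containers. The Pod name, container names, and images (web-with-sidecar, web, sidecar, nginx, busybox) are illustrative assumptions and do not come from a real deployment.

```yaml
# Hypothetical two-container Pod used to illustrate the -c flag.
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar        # assumed name
spec:
  containers:
    - name: web                 # read with: kubectl logs web-with-sidecar -c web
      image: nginx:1.27
    - name: sidecar             # read with: kubectl logs web-with-sidecar -c sidecar
      image: busybox:1.36
      # Prints a line every 10 seconds so the sidecar has its own log stream.
      command: ["sh", "-c", "while true; do echo sidecar heartbeat; sleep 10; done"]
```

Each container writes to its own log stream, which is why kubectl needs the container name to know which stream to print.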
5. Health Checks (Probes)
A Pod with a Running status is not necessarily healthy. A Java application might be technically running, but stuck in a 10-minute deadlock, returning 500 Errors to customers. Kubernetes uses 3 Probes to test true health:
- 1. Startup Probe: Tests if the application has finished starting. Liveness and Readiness checks are held back until it succeeds, which protects slow starters such as a legacy application with a massive 2-minute boot sequence.
- 2. Readiness Probe: Tests if the Pod is ready to accept user traffic. If it fails (e.g., the Pod lost connection to the Database), the Pod stays alive, but the Service stops sending user traffic to it.
- 3. Liveness Probe: The ultimate test. If it fails, Kubernetes ruthlessly kills the container and restarts it in place, according to the Pod's restartPolicy. The Pod itself is not replaced.
6. Anatomy of a Health Probe YAML
You add probes directly into the container spec.
yaml
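As a minimal sketch, a container spec carrying all three probes might look like the manifest below. The Pod name, image, port, and the /healthz and /ready paths are placeholders chosen for illustration; substitute whatever endpoints your application actually exposes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                          # assumed name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      ports:
        - containerPort: 8080
      # Runs first; liveness and readiness wait until this succeeds.
      startupProbe:
        httpGet:
          path: /healthz                    # assumed endpoint
          port: 8080
        periodSeconds: 10
        failureThreshold: 30                # up to 30 x 10s = 300s to finish booting
      # Failure removes the Pod from Service endpoints; nothing is restarted.
      readinessProbe:
        httpGet:
          path: /ready                      # assumed endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
      # Failure causes the container to be killed and restarted.
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
```

Besides httpGet, a probe can also use exec (run a command inside the container) or tcpSocket (open a TCP connection), with the same timing fields.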
7. Introduction to Prometheus and Grafana
While kubectl is great for debugging one Pod, enterprises require cluster-wide metrics.
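- Prometheus: A monitoring system built around a time-series database. It "scrapes" metrics (CPU usage, HTTP 404 errors, network latency) from metrics endpoints exposed by Nodes and Pods across the cluster and stores them. The scrape interval is configurable (for example, every 10 seconds).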
- Grafana: A visualization tool. It connects to Prometheus and transforms the raw numbers into beautiful, interactive dashboards. It also handles alerting (e.g., "If CPU > 90% for 5 minutes, send a message to Slack").
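The scraping behaviour described above is driven by Prometheus's configuration file. The fragment below is a sketch, not a production setup: the job name is arbitrary, and the prometheus.io/scrape annotation is a common community convention rather than something Kubernetes or Prometheus enforces.

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 10s            # matches the 10-second example above
scrape_configs:
  - job_name: kubernetes-pods     # assumed job name
    kubernetes_sd_configs:
      - role: pod                 # discover every Pod through the Kubernetes API
    relabel_configs:
      # Keep only Pods that opt in with the annotation prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

Grafana is then pointed at Prometheus as a data source and queries the stored series to draw its dashboards.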
8. Mini Project: Monitor Kubernetes Application Health
Let's deploy a self-healing application using a Liveness Probe.
Step-by-Step Tutorial:
- 1. Create a file named liveness-pod.yaml. It should define a Pod named liveness-test whose container creates /tmp/healthy at startup, deletes it roughly half a minute later, and is checked by a Liveness Probe that runs cat /tmp/healthy.
- 2. Apply the YAML: kubectl apply -f liveness-pod.yaml
- 3. Immediately run kubectl describe pod liveness-test. Scroll to the "Events" section at the bottom. It will look normal.
- 4. Wait about 35 seconds. Run kubectl describe pod liveness-test again.
- 5. *The Magic:* Look at the Events. You will see a Warning: Unhealthy event. Kubernetes tried to cat /tmp/healthy, but the file was gone. A following event will say Normal: Killing. Kubernetes detected the simulated crash and automatically restarted the container to heal it!
- 6. Clean up: kubectl delete pod liveness-test
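Step 1 needs a manifest. The following is a minimal sketch of what liveness-pod.yaml could contain, written to be consistent with the events described in step 5. The busybox image, the container name, and the timing values are assumptions; only the Pod name liveness-test and the cat /tmp/healthy check come from the tutorial itself.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-test
spec:
  containers:
    - name: liveness              # assumed container name
      image: busybox:1.36         # assumed image
      args:
        - /bin/sh
        - -c
        # Healthy for 30 seconds, then simulate a crash by deleting the file.
        - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600
      livenessProbe:
        exec:
          command: ["cat", "/tmp/healthy"]   # exits non-zero once the file is gone
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 1       # assumed: restart on the first failed check
```

With the default failureThreshold of 3, the restart would instead follow three consecutive Unhealthy events, so the Killing event would appear a little later than the first warning.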
9. Real-World Scenarios
An e-commerce website uses a heavy Java backend. Sometimes, the Java Garbage Collector freezes the application for 45 seconds. Without a Readiness Probe, the Kubernetes Service continues sending customer checkout requests to the frozen Pod, causing the customers' browsers to time out and fail. By implementing a Readiness Probe (GET /api/status), Kubernetes detects the freeze, temporarily removes the Pod from the Service rotation, and routes the checkout requests to other healthy Pods until the Java app recovers.
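A Readiness Probe for this scenario could be sketched as follows. The /api/status path comes from the scenario; the port and all timing values are assumptions to be tuned against how quickly you want a frozen Pod taken out of rotation.

```yaml
# Fragment of a container spec; it belongs under spec.containers[] in a Pod or Deployment template.
readinessProbe:
  httpGet:
    path: /api/status
    port: 8080            # assumed application port
  periodSeconds: 5        # check every 5 seconds
  timeoutSeconds: 2       # a frozen JVM will not answer within 2 seconds
  failureThreshold: 2     # two consecutive failures mark the Pod as not ready
  successThreshold: 1     # one success puts it back into rotation
```

With these values a 45-second freeze would take the Pod out of rotation after roughly 10 seconds, and the first successful check after recovery would restore it. No restart is involved.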
10. Best Practices
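- Centralized Log Aggregation (EFK Stack): In production, when a Pod is deleted, evicted, or replaced, its logs disappear with it! (After a simple container restart, kubectl logs --previous can still show the prior run, but nothing older.) You must install a cluster-level logging architecture like Elasticsearch, Fluentd, and Kibana (EFK). Fluentd runs on every Node, typically as a DaemonSet, tails the log files of every container on that Node as they are written, and ships them to a permanent Elasticsearch database for historical debugging, with Kibana on top for searching and visualizing them.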
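Running one Fluentd agent per Node is usually done with a DaemonSet. The manifest below is a trimmed sketch under several assumptions: the logging namespace, the elasticsearch.logging.svc Service name, and the image tag are illustrative, and a real deployment would also need a ServiceAccount with RBAC permissions, resource limits, and tolerations.

```yaml
apiVersion: apps/v1
kind: DaemonSet                   # schedules one Pod on every Node
metadata:
  name: fluentd
  namespace: logging              # assumed namespace
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: elasticsearch.logging.svc   # assumed Service name
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            - name: varlog
              mountPath: /var/log # container log files live under this path on the Node
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```

Because the agent reads log files from the Node's filesystem and forwards them continuously, the records already sit in Elasticsearch by the time a Pod is deleted.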
11. Common Mistakes
- Aggressive Liveness Probes: If you set initialDelaySeconds: 1 and periodSeconds: 1 on a Liveness Probe for a heavy Spring Boot application that takes 30 seconds to boot up, the probe will start failing at second 1 and, after a few consecutive failures (failureThreshold, 3 by default), Kubernetes kills the container. It will try to boot again, fail the probe within seconds, and be killed again. The Pod will enter an infinite CrashLoopBackOff death spiral simply because it wasn't given enough time to turn on!
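One way to avoid this spiral is to put a Startup Probe in front of the Liveness Probe. The sketch below assumes the application exposes Spring Boot Actuator's /actuator/health endpoint on port 8080; both are assumptions, and the timing values are only a starting point.

```yaml
# Fragment of a container spec for a slow-booting application.
startupProbe:
  httpGet:
    path: /actuator/health    # assumed endpoint (requires Spring Boot Actuator)
    port: 8080                # assumed port
  periodSeconds: 5
  failureThreshold: 12        # 12 x 5s = 60s budget, double the 30s boot time
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3         # only evaluated after the startup probe has succeeded
```

The startup budget is periodSeconds multiplied by failureThreshold, so it can be made generous without slowing down how quickly a genuine hang is detected once the application is up.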
12. Exercises
- 1. Differentiate between the actions Kubernetes takes when a Readiness Probe fails versus when a Liveness Probe fails.
- 2. Why is centralized log aggregation (like the EFK stack) absolutely mandatory for debugging in a highly volatile Kubernetes environment?
13. FAQs
Q: Can I use Datadog or New Relic instead of Prometheus/Grafana?
A: Yes! Prometheus and Grafana are open-source and free, making them the default standard. Datadog and New Relic are highly expensive, proprietary SaaS products, but they are vastly easier to set up and offer incredible out-of-the-box Kubernetes visibility.
14. Interview Questions
- Q: Explain the necessity of the "Startup Probe" in modern Kubernetes versions. Why were Liveness and Readiness probes insufficient for handling legacy, slow-booting monolithic applications?
- Q: Describe the architectural flow of a centralized logging pipeline (e.g., Fluentd to Elasticsearch). How does this architecture solve the problem of ephemeral container filesystems?
15. Summary
In Chapter 13, we illuminated the black box of our cluster. We utilized kubectl logs for surgical, real-time debugging, and deployed intelligent Health Probes (Liveness and Readiness) to grant Kubernetes the autonomy to detect application freezes and execute self-healing protocols. Finally, we introduced the broader observability ecosystem, establishing the necessity of Prometheus, Grafana, and Log Aggregation to achieve enterprise-grade visibility across thousands of ephemeral Pods.
16. Next Chapter Recommendation
Our cluster is highly observable, but is it secure? If a developer has kubectl access, can they delete the entire production database? Proceed to Chapter 14: Kubernetes Security Best Practices.