CHAPTER 13 Intermediate

Monitoring and Logging in Kubernetes

Updated: May 15, 2026
25 min read

1. Introduction

If a monolithic application running on a single server crashes, you SSH into the server and read /var/log/app.log. If a microservice architecture running across 500 Pods and 20 Nodes crashes, reading logs by hand, one Pod at a time, is simply not practical. Kubernetes demands centralized observability. In this chapter, we will master basic kubectl debugging tools, explore robust Health Checks, and introduce the industry-standard observability stack: Prometheus and Grafana.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Retrieve real-time logs from standalone and multi-container Pods.
  • Differentiate between Liveness, Readiness, and Startup Probes.
  • Author YAML configurations for Health Checks.
  • Understand the architectural purpose of Prometheus (Metrics).
  • Understand the architectural purpose of Grafana (Visualization).

3. Beginner-Friendly Explanation

Imagine a massive hospital.
  • kubectl logs (The Clipboard): You ask a specific nurse to read you the notes for exactly one patient. Good for a quick check, useless for understanding the health of the entire hospital.
  • Health Probes (The Heart Monitor): A machine attached to every patient. If the patient's heart stops (Liveness Probe fails), the machine automatically pages a doctor to resuscitate them. If the patient is dizzy (Readiness Probe fails), the machine tells the receptionist to stop sending visitors to the room until they recover.
  • Prometheus & Grafana (The Command Center): A massive wall of TVs in the director's office. Prometheus acts as the thousands of sensors gathering data (temperature, heart rates, blood pressure) from every room. Grafana takes that raw data and draws beautiful, color-coded graphs on the TVs so the director can instantly see if the hospital is failing.

4. Basic Logging (kubectl logs)

To troubleshoot a failing application, you must read its output.
  • Single Container Pod: kubectl logs <pod-name>
  • Live Stream Logs: kubectl logs -f <pod-name> (Follows the logs in real-time).
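  • Multi-Container Pod: If a Pod has a web container and a sidecar container, specify which one you want to read: kubectl logs <pod-name> -c <container-name>. Without -c, recent kubectl versions fall back to a default container (normally the first one listed), which may not be the one you are debugging.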
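
To try the multi-container case, a small Pod along the following lines could be used. This is only a sketch: the Pod name, container names, and images are hypothetical and chosen for illustration.

```yaml
# Hypothetical two-container Pod for practicing "kubectl logs -c".
# All names and images here are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar
spec:
  containers:
  - name: web        # read with: kubectl logs web-with-sidecar -c web
    image: nginx
  - name: sidecar    # read with: kubectl logs web-with-sidecar -c sidecar
    image: busybox
    args:
    - /bin/sh
    - -c
    - while true; do echo "sidecar heartbeat"; sleep 10; done
```

With this Pod running, kubectl logs -f web-with-sidecar -c sidecar should stream one heartbeat line roughly every 10 seconds, and kubectl logs web-with-sidecar --all-containers=true prints the output of both containers together.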

5. Health Checks (Probes)

A Pod with a Running status is not necessarily healthy. A Java application might be technically running, but stuck in a 10-minute deadlock, returning 500 Errors to customers. Kubernetes uses 3 Probes to test true health:
  1. Startup Probe: Tests whether a slow-starting application (for example, a legacy app with a 2-minute boot sequence) has finished starting. Liveness and Readiness checks are held back until it succeeds.
  2. Readiness Probe: Tests if the Pod is ready to accept user traffic. If it fails (e.g., the Pod lost connection to the Database), the Pod stays alive, but the Service stops sending user traffic to it until the probe passes again.
  3. Liveness Probe: The last line of defense. If it fails repeatedly, the kubelet kills the container and restarts it according to the Pod's restartPolicy.

6. Anatomy of a Health Probe YAML

You add probes directly into the container spec.
```yaml
containers:
- name: my-api
  image: my-api:v1
  livenessProbe:
    httpGet:
      path: /health          # The kubelet sends an HTTP GET to this path
      port: 8080
    initialDelaySeconds: 15  # Wait 15s after the container starts before the first check
    periodSeconds: 20        # Check every 20 seconds
```
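
Assuming the same my-api container, one possible way to combine all three probe types is sketched below. The /ready path, the choice of a TCP check, and every threshold are illustrative assumptions, not values taken from a real application.

```yaml
# Container-spec fragment showing all three probe types together.
# Paths, ports, and thresholds are assumptions for illustration.
containers:
- name: my-api
  image: my-api:v1
  startupProbe:
    httpGet:
      path: /health
      port: 8080
    periodSeconds: 10
    failureThreshold: 12   # up to 12 x 10s = 120s allowed for booting
  readinessProbe:
    httpGet:
      path: /ready         # hypothetical endpoint that checks dependencies
      port: 8080
    periodSeconds: 5
  livenessProbe:
    tcpSocket:             # alternative mechanism: succeeds if the port accepts a connection
      port: 8080
    periodSeconds: 20
```

Because the startup probe gates the other two, the liveness probe here needs no initialDelaySeconds: it only begins once the application has reported that it finished booting.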

7. Introduction to Prometheus and Grafana

While kubectl is great for debugging one Pod, enterprises require cluster-wide metrics.
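  • Prometheus: A monitoring system built around a time-series database. It "scrapes" metrics (CPU usage, HTTP 404 errors, network latency) over HTTP from the Nodes and Pods that expose a metrics endpoint, at a configurable interval (10 seconds is a common choice; the default is 1 minute), and stores them.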
  • Grafana: A visualization tool. It connects to Prometheus and transforms the raw numbers into beautiful, interactive dashboards. It also handles alerting (e.g., "If CPU > 90% for 5 minutes, send a message to Slack").
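
To make the scraping idea concrete, a Prometheus configuration might look roughly like the sketch below. It assumes Prometheus runs inside the cluster with permission to list Pods, and that Pods opt in through a prometheus.io/scrape annotation, which is a widely used convention rather than something built into Kubernetes. The job name is hypothetical.

```yaml
# prometheus.yml sketch: discover Pods and scrape those that opt in.
global:
  scrape_interval: 10s          # how often each target is scraped
scrape_configs:
- job_name: kubernetes-pods     # hypothetical job name
  kubernetes_sd_configs:
  - role: pod                   # discover every Pod through the Kubernetes API
  relabel_configs:
  # Keep only Pods annotated with prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
```

Grafana would then be pointed at this Prometheus server as a data source, and dashboards would query the stored series.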

8. Mini Project: Monitor Kubernetes Application Health

Let's deploy a self-healing application using a Liveness Probe.

Step-by-Step Tutorial:

  1. Create a file named liveness-pod.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-test
spec:
  containers:
  - name: liveness-app
    image: busybox
    # The app creates a file, waits 30 seconds, then deletes the file (simulating a crash)
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -f /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
```

  2. Apply the YAML: kubectl apply -f liveness-pod.yaml
  3. Immediately run kubectl describe pod liveness-test. Scroll to the "Events" section at the bottom. It will look normal.
  4. Wait about 45 seconds. Run kubectl describe pod liveness-test again.
  5. *The Magic:* Look at the Events. You will see Warning Unhealthy events starting roughly 35 seconds in: Kubernetes tried to cat /tmp/healthy, but the file was gone. After three consecutive failures (the default failureThreshold), a Normal Killing event follows. Kubernetes detected the simulated crash and automatically killed and restarted the container to heal it! The RESTARTS column of kubectl get pod liveness-test goes up each time this happens.
  6. Clean up: kubectl delete pod liveness-test

9. Real-World Scenarios

An e-commerce website uses a heavy Java backend. Sometimes, the Java Garbage Collector freezes the application for 45 seconds. Without a Readiness Probe, the Kubernetes Service continues sending customer checkout requests to the frozen Pod, causing the customers' browsers to time out and fail. With a Readiness Probe (GET /api/status) in place, Kubernetes detects the freeze, temporarily removes the Pod from the Service rotation, and routes the checkout requests to other healthy Pods until the Java app recovers.
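
A readiness probe for this scenario could be sketched as follows. Only the /api/status path comes from the scenario; the container name, image, port, and timing values are assumptions.

```yaml
# Container-spec fragment for the checkout backend (names are hypothetical).
containers:
- name: checkout-backend
  image: checkout-backend:v1
  readinessProbe:
    httpGet:
      path: /api/status
      port: 8080            # assumed application port
    periodSeconds: 5        # check every 5 seconds
    timeoutSeconds: 2       # a frozen JVM will not answer within 2s
    failureThreshold: 3     # 3 misses in a row marks the Pod as not ready
    successThreshold: 1     # one success puts it back into rotation
```

With these values the Pod leaves the Service rotation roughly 15 seconds into a freeze (3 failed checks, 5 seconds apart) and returns after the first successful check once the pause ends. Tighter values react faster but are more likely to eject a Pod during a brief hiccup.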

10. Best Practices

  • Centralized Log Aggregation (EFK Stack): In production, when a Pod is deleted, evicted, or rescheduled to another Node, its logs disappear with it. (After a simple container restart, kubectl logs --previous can still show the prior run, but only while the Pod itself exists.) You should install a cluster-level logging architecture like Elasticsearch, Fluentd, and Kibana (EFK). Fluentd runs on every Node, continuously collects the container log files written there, and ships them to a permanent Elasticsearch database for historical debugging.
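
The "Fluentd on every Node" part is usually achieved with a DaemonSet. The skeleton below is a sketch under several assumptions: the namespace and image reference are placeholders, and the Fluentd configuration that points at Elasticsearch is omitted entirely.

```yaml
# DaemonSet skeleton: one log-collector Pod per Node.
# Namespace and image are placeholders, not real published values.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging                  # assumed namespace
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: example.com/fluentd-elasticsearch:1.0   # hypothetical image
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log              # container logs live under /var/log on the Node
```

Because the collector reads log files from the Node's filesystem rather than from the Pods themselves, log lines that have already been shipped survive in Elasticsearch after the Pod that wrote them is gone.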

11. Common Mistakes

  • Aggressive Liveness Probes: If you set initialDelaySeconds: 1 and periodSeconds: 1 on a Liveness Probe for a heavy Spring Boot application that takes 30 seconds to boot up, the probe starts checking at second 1, fails three times in a row (the default failureThreshold), and the kubelet kills the container within a few seconds. The app tries to boot again, fails the same checks, and is killed again. The Pod ends up in a CrashLoopBackOff death spiral simply because it wasn't given enough time to turn on! Give slow starters a generous initialDelaySeconds or, better, a Startup Probe.
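
One way to avoid this trap is sketched below. It assumes the application exposes Spring Boot Actuator's liveness endpoint on port 8080; the container name, image, and all timing values are illustrative.

```yaml
# Container-spec fragment: give a slow-booting app a startup budget.
# Endpoint, port, names, and thresholds are assumptions.
containers:
- name: spring-app
  image: spring-app:v1
  startupProbe:
    httpGet:
      path: /actuator/health/liveness
      port: 8080
    periodSeconds: 2
    failureThreshold: 30    # 30 x 2s = 60s budget, double the 30s boot time
  livenessProbe:
    httpGet:
      path: /actuator/health/liveness
      port: 8080
    periodSeconds: 10
    failureThreshold: 3     # restart only after about 30s of consecutive failures
```

The startup budget is periodSeconds multiplied by failureThreshold. Sizing it at roughly twice the normal boot time leaves headroom for slow Nodes without delaying liveness checks once the app is up.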

12. Exercises

  1. Differentiate between the actions Kubernetes takes when a Readiness Probe fails versus when a Liveness Probe fails.
  2. Why is centralized log aggregation (like the EFK stack) absolutely mandatory for debugging in a highly volatile Kubernetes environment?

13. FAQs

Q: Can I use Datadog or New Relic instead of Prometheus/Grafana? A: Yes. Prometheus and Grafana are open-source and free to run yourself, which has made them a common default. Datadog and New Relic are proprietary, paid SaaS products; they are generally easier to set up and offer strong out-of-the-box Kubernetes visibility.

14. Interview Questions

  • Q: Explain the necessity of the "Startup Probe" in modern Kubernetes versions. Why were Liveness and Readiness probes insufficient for handling legacy, slow-booting monolithic applications?
  • Q: Describe the architectural flow of a centralized logging pipeline (e.g., Fluentd to Elasticsearch). How does this architecture solve the problem of ephemeral container filesystems?

15. Summary

In Chapter 13, we opened up the black box of our cluster. We used kubectl logs for targeted, real-time debugging and configured Health Probes (Startup, Readiness, and Liveness) so that Kubernetes can detect frozen or unready applications and respond by restarting containers or withholding traffic. Finally, we introduced the broader observability ecosystem: Prometheus, Grafana, and log aggregation, which together provide visibility across thousands of ephemeral Pods.

16. Next Chapter Recommendation

Our cluster is highly observable, but is it secure? If a developer has kubectl access, can they delete the entire production database? Proceed to Chapter 14: Kubernetes Security Best Practices.
