CHAPTER 14
Beginner
Monitoring, Logging, and Observability
Updated: May 18, 2026
1. Chapter Introduction
In a monolith, debugging is comparatively simple: you SSH into the server and read the log file. In a microservices architecture, a single user request might travel through 15 different servers and 4 databases before failing. If you cannot trace that request, your system is a black box. Senior interviews reflect this: deploying code is only half the job, and the other half is proving you can operate it. This chapter covers the three pillars of Observability: Logs, Metrics, and Traces.
2. The Three Pillars of Observability
Observability is the ability to infer the internal state of a system by examining its external outputs. To achieve this, you need three things:
- 1. Logs: A discrete record of a specific event that happened at a specific time (e.g., `[ERROR] 2026-05-18 10:00:00 Database connection timed out`).
- 2. Metrics: Numeric data aggregated over time (e.g., CPU utilization is currently at 85%, API latency is 200 ms).
- 3. Traces: The end-to-end journey of a single request as it travels across multiple distributed microservices.
3. Centralized Distributed Logging
If you have 100 microservices running on 100 servers, you cannot manually check 100 local log files. You must centralize them. The ELK Stack (Elasticsearch, Logstash, Kibana) and Datadog are widely used options.
*The Flow:*
- 1. Every microservice prints logs to standard output (stdout) in JSON format.
- 2. A log agent (e.g., Logstash or Fluentd) running on every server collects these logs and ships them to a central store.
- 3. The logs are indexed in a search engine (Elasticsearch).
- 4. Engineers use a visual dashboard (Kibana) to search millions of logs in seconds (e.g., `Search: level="ERROR" AND service="Billing"`).
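The search in step 4 can be pictured as a field-by-field filter over JSON log lines. The sketch below is only an in-memory illustration of that idea, not how Elasticsearch or Kibana work internally; the `search_logs` function and the sample log lines are hypothetical.

```python
import json

def search_logs(lines, **filters):
    """Return parsed log records whose fields equal every given filter."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        if all(record.get(key) == value for key, value in filters.items()):
            matches.append(record)
    return matches

# Hypothetical log lines, as different services might print them to stdout.
sample_lines = [
    '{"level": "INFO", "service": "Billing", "message": "invoice created"}',
    '{"level": "ERROR", "service": "Billing", "message": "card declined"}',
    '{"level": "ERROR", "service": "Search", "message": "index timeout"}',
    'plain text line that cannot be filtered by field',
]

# Equivalent in spirit to: level="ERROR" AND service="Billing"
errors = search_logs(sample_lines, level="ERROR", service="Billing")
print(errors)
```

Note that the plain-text line is silently skipped: it has no fields to match against, which is one reason step 1 asks every service to emit JSON.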
4. Distributed Tracing (The Request ID)
*The Problem:* A user clicks "Checkout" and receives an error. Which of the 15 microservices involved caused the error?
*The Solution:* Distributed Tracing (e.g., Jaeger, OpenTelemetry).
- 1. When the user's request hits the API Gateway, the gateway generates a unique Correlation ID (Request ID).
- 2. The Gateway passes this ID in the HTTP headers to Service A.
- 3. Service A includes the ID in its logs, and passes it to Service B.
- 4. Service B includes the ID in its logs, and passes it to Service C.
Because every log line written for the request carries the same ID, searching the centralized logs for that ID reconstructs the request's full path and shows which service failed.
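The propagation steps above can be sketched in a few lines. This is a minimal illustration, not a tracing library: the header name `X-Correlation-ID`, the `handle` function, and the service names are assumptions made for the example, and real HTTP headers are case-insensitive, which this plain dict ignores.

```python
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; teams choose their own

def ensure_correlation_id(headers):
    """Return a copy of headers that carries a correlation ID, creating one if absent."""
    outgoing = dict(headers)
    if not outgoing.get(HEADER):
        outgoing[HEADER] = str(uuid.uuid4())
    return outgoing

def handle(service_name, headers, log):
    """Write a log entry tagged with the ID and return the headers to forward."""
    headers = ensure_correlation_id(headers)
    log.append({"service": service_name, "correlation_id": headers[HEADER]})
    return headers

log = []  # stands in for the centralized log store
headers = handle("gateway", {}, log)         # the gateway generates the ID
headers = handle("service-a", headers, log)  # downstream services reuse it
headers = handle("service-b", headers, log)

ids = {entry["correlation_id"] for entry in log}
print(len(ids))  # every entry shares one ID
```

Only the first hop creates an ID; every later hop finds it in the incoming headers and passes it on unchanged.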
5. Metrics Collection
Logs are verbose strings. You cannot efficiently calculate "Average CPU Usage over 30 days" by parsing strings. You need Metrics. Prometheus and Grafana are widely used for this.
- *Push vs. Pull:* Prometheus uses a pull model. Microservices expose a `/metrics` endpoint, and a centralized Prometheus server regularly "pulls" (scrapes) these numbers.
- *Dashboards:* Grafana connects to Prometheus to build real-time graphs of System Health (CPU, RAM, Network) and Business Health (Active Users, Orders per Minute).
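To make the pull model concrete, here is a hand-rolled sketch of what a service might return from its `/metrics` endpoint, using the Prometheus text exposition format (`# HELP`, `# TYPE`, then one sample per line). The `Counter` class below is a simplified stand-in written for this example, not the official `prometheus_client` library, which a real service would more likely use.

```python
class Counter:
    """A minimal monotonically increasing counter with optional labels."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.values = {}  # sorted label tuple -> count

    def inc(self, **labels):
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0) + 1

    def render(self):
        """Render the counter as Prometheus text exposition lines."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self.values.items()):
            if key:
                label_text = ",".join(f'{k}="{v}"' for k, v in key)
                lines.append(f"{self.name}{{{label_text}}} {value}")
            else:
                lines.append(f"{self.name} {value}")
        return "\n".join(lines) + "\n"

requests_total = Counter("http_requests_total", "Total HTTP requests handled.")
requests_total.inc(status="200")
requests_total.inc(status="200")
requests_total.inc(status="500")

# This string is what the scraper would receive in the response body.
body = requests_total.render()
print(body)
```

The service only keeps running totals in memory; turning them into rates and 30-day averages is left to the server that scrapes and stores them.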
6. Alerting Systems
Monitoring is useless if nobody looks at the dashboard. You must configure alerts that page an engineer (e.g., through PagerDuty).
*Rule:* Only alert on actionable symptoms affecting the user, not just raw metrics.
- *Bad Alert:* "CPU is at 90%." (If the system is auto-scaling correctly, 90% CPU is efficient, not a crisis.)
- *Good Alert:* "HTTP 500 Error Rate on Checkout API is above 5% for the last 5 minutes." (This directly impacts revenue. Wake up the engineer.)
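The "good alert" above can be expressed as a small rule over a sliding window. The sketch below is a simplified interpretation: it fires when the share of 5xx responses among all requests seen in the last 300 seconds exceeds 5%. The `ErrorRateAlert` class and its parameters are hypothetical, and real alerting tools evaluate such rules against stored metrics rather than in application code.

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the 5xx share of requests in a sliding window exceeds a threshold."""

    def __init__(self, threshold=0.05, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp, status_code):
        self.events.append((timestamp, status_code >= 500))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def should_fire(self, now):
        self._evict(now)
        if not self.events:
            return False  # no traffic, nothing to alert on
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events) > self.threshold

# 100 requests over 100 seconds, 10 of them failing: a 10% error rate.
failing = ErrorRateAlert()
for t in range(100):
    failing.record(t, 500 if t % 10 == 0 else 200)

# 100 requests with only 2 failures: a 2% error rate.
healthy = ErrorRateAlert()
for t in range(100):
    healthy.record(t, 500 if t % 50 == 0 else 200)

print(failing.should_fire(now=100), healthy.should_fire(now=100))
```

Using a rate over a window, rather than a single failed request, is what keeps the alert actionable: one stray 500 does not wake anyone up.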
7. Real-World Scenario: The Silent Database Failure
*Scenario:* A company deploys a new feature. Suddenly, the database CPU hits 100%. Users cannot log in.
*The Monolith Approach:* The engineer panics, SSHes into servers, and spends 3 hours reading text files to find the bad query.
*The Observability Approach:* PagerDuty immediately alerts the engineer: "Login Latency is > 5 seconds." The engineer opens Grafana and sees the database CPU spike. They click into the Distributed Tracing tool, sort by "Longest DB Query," and identify the exact SQL statement and the microservice that executed it. The faulty release is rolled back in 3 minutes.
8. Mini Project: Design an Observability Pipeline
Whiteboard a logging architecture for an E-Commerce site:
- 1. Microservices output JSON logs.
- 2. A Fluentd agent collects the logs.
- 3. The agent publishes the logs to a Kafka queue (which acts as a buffer if logging volume spikes during Black Friday).
- 4. Logstash pulls from Kafka, parses the data, and saves it to Elasticsearch.
- 5. Kibana reads from Elasticsearch for the UI.
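The role of the buffer in step 3 can be illustrated with a toy simulation. Here a `deque` stands in for the Kafka queue, `burst_sizes` is the number of log records produced per tick, and `indexer_capacity` is how many records the downstream indexer can process per tick. All names and numbers are invented for the example; this is not Kafka client code.

```python
from collections import deque

def run_pipeline(burst_sizes, indexer_capacity):
    """Simulate ticks; return the backlog after each tick and the total indexed."""
    buffer = deque()  # stands in for the Kafka queue
    indexed = 0
    backlog_history = []
    for tick, burst in enumerate(burst_sizes):
        # Producers (the log agents) append as fast as logs arrive.
        for seq in range(burst):
            buffer.append({"tick": tick, "seq": seq})
        # The consumer (the indexer) drains at its own fixed pace.
        for _ in range(min(indexer_capacity, len(buffer))):
            buffer.popleft()
            indexed += 1
        backlog_history.append(len(buffer))
    return backlog_history, indexed

# A spike of 500 records in tick 1, against an indexer that handles 150 per tick.
backlog, total_indexed = run_pipeline([50, 500, 50, 0, 0, 0], indexer_capacity=150)
print(backlog)        # [0, 350, 250, 100, 0, 0]
print(total_indexed)  # 600
```

The spike shows up as a temporary backlog rather than as dropped logs or an overloaded indexer, which is the design reason for placing a queue between collection and indexing.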
9. Common Mistakes
- Logging PII (Personally Identifiable Information) and secrets: NEVER log plaintext passwords, credit card numbers, or social security numbers. Logs are often visible to the entire engineering team, and logging this data can violate regulations such as GDPR and HIPAA.
- Over-Logging: Writing a "DEBUG" log for every line of code executed. In a system processing 10,000 requests per second, logging everything can cost more in storage fees than the actual application infrastructure.
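One common defense against the first mistake is to scrub records before they are written. The sketch below redacts fields by name and masks card-like digit runs inside string values. The field names in `SENSITIVE_KEYS` and the regular expression are assumptions chosen for illustration; a pattern like this will miss some formats and should not be treated as a complete compliance control.

```python
import re

SENSITIVE_KEYS = {"password", "card_number", "ssn"}  # hypothetical field names
# Matches runs of 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def scrub(record):
    """Return a copy of a log record with sensitive values redacted."""
    clean = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = CARD_PATTERN.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean

raw = {
    "event": "checkout",
    "user_id": 123,
    "password": "hunter2",
    "note": "paid with 4111 1111 1111 1111 today",
}
safe = scrub(raw)
print(safe)
```

Scrubbing at the point of logging, rather than downstream, means the sensitive value never reaches the agent, the queue, or the search index.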
10. Best Practices
- Structured Logging: Stop logging plain text like `User 123 logged in`. Log in JSON format: `{"event": "login", "user_id": 123, "status": "success"}`. JSON logs are much easier to query and index in Elasticsearch.
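A minimal way to get structured output from Python's standard `logging` module is a custom formatter that serializes each record as JSON. The sketch below assumes a convention of passing extra fields under a `fields` key via `extra=`; that convention and the `JsonFormatter` class are choices made for this example, not a standard. It writes to an in-memory stream so the result can be inspected, where a real service would write to stdout.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()  # a real service would use sys.stdout here
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("auth")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info("login", extra={"fields": {"user_id": 123, "status": "success"}})

line = stream.getvalue().strip()
print(line)
```

Because `user_id` and `status` are separate fields rather than words inside a sentence, the log store can index and filter on them directly.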
11. Exercises
- 1. Explain why a Correlation ID (Request ID) is absolutely mandatory in a Microservices architecture.
- 2. What is the difference between a Log and a Metric?
12. MCQs
Question 1
What are the "Three Pillars of Observability"?
Question 2
Why is "Centralized Logging" necessary in modern system design?
Question 3
What does "Distributed Tracing" achieve?
Question 4
What is a "Correlation ID" (or Request ID)?
Question 5
How do Metrics differ from Logs?
Question 6
What constitutes a "Good Alert" in a monitoring system?
Question 7
What is a massive security risk associated with application logging?
Question 8
Why is "Structured Logging" (logging in JSON format) heavily preferred over standard text logging?
Question 9
What role does Prometheus typically play in the observability stack?
Question 10
Why might a large company place a Kafka message queue between their microservices and their Logstash/Elasticsearch database?
14. Interview Questions
- Q: "We have 50 microservices. A user clicks 'Submit Payment' and gets a generic 500 Error. Describe exactly how you would architect the system so you could debug this error in under 2 minutes."
15. FAQs
- Q: Should frontend clients (Mobile/Web) send logs to the central server?