CHAPTER 14 Beginner

Monitoring, Logging, and Observability

Updated: May 18, 2026

1. Chapter Introduction

In a monolith, debugging is relatively simple: you SSH into the server and read the log file. In a microservices architecture, a single user request might travel through 15 different servers and 4 databases before failing. If you cannot trace that request, your system is a black box. In senior interviews, designing and deploying a system is only half the job; the other half is showing that you can operate it. This chapter covers the three pillars of observability: logs, metrics, and traces.

2. The Three Pillars of Observability

Observability is the ability to infer the internal state of a system from its external outputs. To achieve this, you need three things:
  1. Logs: A discrete record of a specific event that happened at a specific time (e.g., [ERROR] 2026-05-18 10:00:00 Database connection timed out).
  2. Metrics: Numeric measurements sampled or aggregated over time (e.g., CPU utilization is currently at 85%, API latency is 200ms).
  3. Traces: The end-to-end journey of a single request as it travels across multiple distributed microservices.

3. Centralized Distributed Logging

If you have 100 microservices running on 100 servers, you cannot manually check 100 local log files. You must centralize them. The ELK Stack (Elasticsearch, Logstash, Kibana) and Datadog are common choices.

*The Flow:*
  1. Every microservice prints logs to standard output (stdout) in JSON format.
  2. A lightweight agent (e.g., Filebeat, Fluent Bit, or Fluentd) running on every server collects these logs and ships them to a central pipeline.
  3. The logs are parsed (for example by Logstash) and indexed in a search engine (Elasticsearch).
  4. Engineers use a visual dashboard (Kibana) to search millions of logs in seconds (e.g., Search: level="ERROR" AND service="Billing").
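
Step 1 of the flow can be sketched with Python's standard `logging` module. This is a minimal sketch, not a prescribed setup: the `JsonFormatter` class, the `build_logger` helper, and the field names (`timestamp`, `level`, `service`, `message`) are assumptions made for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service name, e.g. "Billing"

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout for an agent to collect."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("Billing")
logger.error("Database connection timed out")
```

Because every line is a self-describing JSON object, the collecting agent can forward it unchanged and the search engine can index `level` and `service` as separate fields.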

4. Distributed Tracing (The Request ID)

*The Problem:* A user clicks "Checkout" and receives an error. Which of the 15 microservices involved caused the error?
*The Solution:* Distributed Tracing (e.g., Jaeger, OpenTelemetry).
  1. When the user's request hits the API Gateway, the gateway generates a unique Correlation ID (Request ID).
  2. The Gateway passes this ID in the HTTP headers to Service A.
  3. Service A includes the ID in its logs, and passes it to Service B.
  4. Service B includes the ID in its logs, and passes it to Service C.
*Result:* In your centralized logging dashboard, you can search for that exact Correlation ID and see the entire lifecycle of the request across all 15 servers, quickly identifying that Service C failed. Tracing tools build on the same idea: they propagate a trace ID and also record a timed span for each hop, so you can see where the time went as well as where the failure occurred.
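
The propagation steps above can be sketched in a few lines. This is a simplified, in-process illustration: the header name `X-Correlation-ID`, the service names, and the `handle` function are assumptions, and real services would forward the header over HTTP rather than through a function call. It also treats header names as case-sensitive dictionary keys, which real HTTP headers are not.

```python
import uuid

HEADER = "X-Correlation-ID"  # assumed header name; teams pick their own


def ensure_correlation_id(headers: dict) -> dict:
    """Gateway step: reuse an incoming ID or generate a new one."""
    out = dict(headers)
    out.setdefault(HEADER, str(uuid.uuid4()))
    return out


def handle(service: str, headers: dict, downstream: list, log: list) -> None:
    """Log with the ID, then forward the same headers to the next service."""
    cid = headers[HEADER]
    log.append({"service": service, "correlation_id": cid, "event": "handled"})
    if downstream:
        handle(downstream[0], headers, downstream[1:], log)


central_log = []  # stands in for the centralized log store
request_headers = ensure_correlation_id({"Accept": "application/json"})
handle("service-a", request_headers, ["service-b", "service-c"], central_log)

# Searching the central log by ID reconstructs the request's path.
cid = request_headers[HEADER]
trail = [e["service"] for e in central_log if e["correlation_id"] == cid]
print(trail)  # ['service-a', 'service-b', 'service-c']
```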

5. Metrics Collection

Logs are verbose strings. You cannot efficiently calculate "Average CPU Usage over 30 days" by parsing strings. You need Metrics. Prometheus and Grafana are widely used for this.
  • *Push vs. Pull:* Prometheus uses a pull model: each microservice exposes a /metrics endpoint, and a centralized Prometheus server regularly "pulls" (scrapes) these numbers. In a push model (e.g., StatsD), services send their metrics to a collector instead.
  • *Dashboards:* Grafana connects to Prometheus to build real-time graphs of System Health (CPU, RAM, Network) and Business Health (Active Users, Orders per Minute).
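
To make the pull model concrete, here is a hand-rolled sketch of what a /metrics response body might contain, using the Prometheus text exposition format. The `Counter` class and the `orders_total` metric are illustrative assumptions; a real service would normally use an official client library, and this sketch skips details such as escaping label values.

```python
from collections import defaultdict


class Counter:
    """A monotonically increasing counter, tracked per label combination."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self) -> str:
        """Return the text a scraper would read from /metrics."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


orders = Counter("orders_total", "Orders placed")  # hypothetical metric
orders.inc(status="success")
orders.inc(status="success")
orders.inc(status="failed")
print(orders.render())
```

The service only keeps running totals in memory; the scraper turns successive readings into rates and stores the time series, which is why a 30-day average is cheap to compute from metrics but expensive to compute from logs.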

6. Alerting Systems

Monitoring is useless if nobody looks at the dashboard. You must configure Alerts (e.g., PagerDuty). *Rule:* Only alert on actionable symptoms affecting the user, not just raw metrics.
  • *Bad Alert:* "CPU is at 90%." (If the system is auto-scaling correctly, 90% CPU is efficient, not a crisis).
  • *Good Alert:* "HTTP 500 Error Rate on Checkout API is above 5% for the last 5 minutes." (This directly impacts revenue. Wake up the engineer).
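
The "good alert" above can be expressed as a small rule evaluator. This is a sketch under one interpretation: it pages only when each of the last five one-minute buckets exceeds the 5% threshold, so a single bad minute does not wake anyone. The `MinuteBucket` type and `should_page` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MinuteBucket:
    requests: int
    errors_5xx: int

    @property
    def error_rate(self) -> float:
        return self.errors_5xx / self.requests if self.requests else 0.0


def should_page(buckets: list, threshold: float = 0.05, window: int = 5) -> bool:
    """Fire only when every one of the last `window` minutes breaches the threshold."""
    recent = buckets[-window:]
    if len(recent) < window:
        return False  # not enough data yet to call it sustained
    return all(b.error_rate > threshold for b in recent)


brief_spike = [MinuteBucket(1000, 10)] * 4 + [MinuteBucket(1000, 200)]
sustained = [MinuteBucket(1000, 80)] * 5
print(should_page(brief_spike))  # False
print(should_page(sustained))    # True
```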

7. Real-World Scenario: The Silent Database Failure

*Scenario:* A company deploys a new feature. Suddenly, the database CPU hits 100%. Users cannot log in.
*The Monolith Approach:* The engineer panics, SSHes into servers, and spends 3 hours reading text files to find the bad query.
*The Observability Approach:* PagerDuty immediately alerts the engineer: "Login Latency is > 5 seconds." The engineer opens Grafana and sees the database CPU spike. They click into the Distributed Tracing tool, sort by "Longest DB Query," and quickly identify the exact SQL statement and the microservice that executed it. The faulty release is rolled back in 3 minutes.
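
The "sort by longest DB query" step amounts to filtering spans that carry a database statement and ordering them by duration. The sketch below uses made-up sample spans; the `Span` fields and the SQL strings are assumptions for illustration, not output from a real tracing tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    service: str
    operation: str
    duration_ms: float
    db_statement: Optional[str] = None  # set only on database spans


def slowest_db_spans(spans: List[Span], limit: int = 1) -> List[Span]:
    """Return the longest-running database spans, slowest first."""
    db_spans = [s for s in spans if s.db_statement is not None]
    return sorted(db_spans, key=lambda s: s.duration_ms, reverse=True)[:limit]


# Made-up sample spans for illustration only.
spans = [
    Span("gateway", "POST /login", 5200.0),
    Span("auth-service", "db.query", 12.0,
         "SELECT id FROM sessions WHERE token = ?"),
    Span("profile-service", "db.query", 5100.0,
         "SELECT * FROM users WHERE LOWER(email) = ?"),
]

worst = slowest_db_spans(spans)[0]
print(worst.service, worst.db_statement)
```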

8. Mini Project: Design an Observability Pipeline

Whiteboard a logging architecture for an E-Commerce site:
  1. Microservices output JSON logs.
  2. A Fluentd agent collects the logs.
  3. Fluentd publishes the logs to a Kafka topic (acts as a buffer if logging volume spikes during Black Friday).
  4. Logstash pulls from Kafka, parses data, and saves to Elasticsearch.
  5. Kibana reads from Elasticsearch for the UI.
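
The value of the buffer in step 3 can be shown with a toy simulation. Here a `deque` stands in for the Kafka topic, and the arrival numbers and indexing capacity are arbitrary assumptions: during the spike the backlog grows, and it drains afterwards without any log being dropped.

```python
from collections import deque


def simulate(arrivals_per_tick: list, indexer_capacity: int):
    """Return (total indexed, backlog size after each tick)."""
    buffer = deque()  # stands in for a Kafka topic
    indexed = 0
    backlog_history = []
    for arrivals in arrivals_per_tick:
        buffer.extend(range(arrivals))  # the agent appends new log events
        # The indexer consumes at most `indexer_capacity` events per tick.
        for _ in range(min(indexer_capacity, len(buffer))):
            buffer.popleft()
            indexed += 1
        backlog_history.append(len(buffer))
    return indexed, backlog_history


# Two spike ticks (500 events) exceed the indexer's capacity of 200 per tick.
indexed, backlog = simulate([100, 100, 500, 500, 100, 100, 0, 0], 200)
print(indexed)  # 1400
print(backlog)  # [0, 0, 300, 600, 500, 400, 200, 0]
```

Without the buffer, the indexer would have to either reject events during the spike or be provisioned for peak load all year.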

9. Common Mistakes

  • Logging PII (Personally Identifiable Information) and secrets: NEVER log plaintext passwords, credit card numbers, or social security numbers. Logs are often visible to the entire engineering team, and logging this data can violate regulations and standards such as GDPR, HIPAA, and PCI DSS.
  • Over-Logging: Writing a "DEBUG" log for every line of code executed. In a system processing 10,000 requests per second, logging everything can cost more in storage fees (for example on AWS) than the actual application infrastructure.
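
One common guard against the first mistake is to redact sensitive fields before an event reaches the logger. The sketch below is an assumption-laden illustration: the `SENSITIVE_KEYS` names and the `redact` helper are hypothetical, and key-based redaction will not catch sensitive values embedded in free-text messages.

```python
SENSITIVE_KEYS = {"password", "card_number", "ssn"}  # assumed field names


def redact(event):
    """Return a copy of the event with sensitive fields masked, at any depth."""
    if isinstance(event, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(value)
            for key, value in event.items()
        }
    if isinstance(event, list):
        return [redact(item) for item in event]
    return event


event = {
    "event": "signup",
    "user_id": 123,
    "password": "hunter2",
    "payment": {"card_number": "4111111111111111", "currency": "USD"},
}
print(redact(event))
```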

10. Best Practices

  • Structured Logging: Stop logging plain text like User 123 logged in. Log in JSON format: {"event": "login", "user_id": 123, "status": "success"}. JSON logs are far easier to query and index in Elasticsearch because each field can be searched and aggregated on its own.
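
A short sketch of why structured logs are easier to query: once each line parses into fields, filtering and counting need no regular expressions. The sample lines reuse the field names from the example above; the extra events are invented for illustration.

```python
import json
from collections import Counter

raw_lines = [
    '{"event": "login", "user_id": 123, "status": "success"}',
    '{"event": "login", "user_id": 456, "status": "failure"}',
    '{"event": "checkout", "user_id": 123, "status": "success"}',
]

events = [json.loads(line) for line in raw_lines]

# Filter on exact field values instead of pattern-matching free text.
failed_logins = [
    e["user_id"]
    for e in events
    if e["event"] == "login" and e["status"] == "failure"
]
by_event = Counter(e["event"] for e in events)

print(failed_logins)      # [456]
print(by_event["login"])  # 2
```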

11. Exercises

  1. Explain why a Correlation ID (Request ID) is absolutely mandatory in a Microservices architecture.
  2. What is the difference between a Log and a Metric?

12. MCQs

Question 1

What are the "Three Pillars of Observability"?

Question 2

Why is "Centralized Logging" necessary in modern system design?

Question 3

What does "Distributed Tracing" achieve?

Question 4

What is a "Correlation ID" (or Request ID)?

Question 5

How do Metrics differ from Logs?

Question 6

What constitutes a "Good Alert" in a monitoring system?

Question 7

What is a massive security risk associated with application logging?

Question 8

Why is "Structured Logging" (logging in JSON format) heavily preferred over standard text logging?

Question 9

What role does Prometheus typically play in the observability stack?

Question 10

Why might a large company place a Kafka message queue between their microservices and their Logstash/Elasticsearch database?

14. Interview Questions

  • Q: "We have 50 microservices. A user clicks 'Submit Payment' and gets a generic 500 Error. Describe exactly how you would architect the system so you could debug this error in under 2 minutes."

15. FAQs

  • Q: Should frontend clients (Mobile/Web) send logs to the central server?
A: Yes, but handle with care. Mobile clients often experience network drops. Use a specialized service (like Sentry or Datadog RUM) to aggregate frontend crash reports and UI performance metrics.

16. Summary

Observability shows that your system is working and lets you fix it quickly when it isn't. Master the three pillars: centralize your JSON logs using the ELK stack, use Distributed Tracing with Correlation IDs to track requests across microservices, and collect Metrics via Prometheus/Grafana to visualize system health. Configure actionable alerts so that engineers are only woken up when the user experience is actually degraded.

17. Next Chapter Recommendation

We have built a highly scalable, secure, and observable REST architecture. But what if we are building a multiplayer game, a stock ticker, or a chat app? Plain request/response HTTP is a poor fit for those workloads. In Chapter 15: Designing Real-Time Systems, we will dive into WebSockets, Server-Sent Events, and real-time scalability.
