# CHAPTER 16
Monitoring and System Logs
1. Introduction
When a Linux server crashes in the middle of the night, it doesn't leave a sticky note explaining what happened. However, the operating system is obsessively paranoid; it quietly records hardware errors, software failures, and user logins in log files and in the systemd journal. The ability to read these digital black boxes is what separates a junior operator from a senior system administrator. In this chapter, we will master the diagnostic tools required to check the vitals of the machine. We will monitor RAM consumption with free, assess CPU load with uptime, read kernel messages with dmesg, and query the systemd logging engine using journalctl.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Check system load averages and time since boot using uptime.
- Monitor active memory (RAM) and swap space using free.
- Investigate hardware and driver failures using dmesg.
- Navigate traditional log files stored in /var/log (e.g., syslog, auth.log).
- Query and filter the modern centralized logging daemon using journalctl.
3. Checking System Vitals (uptime and free)
Before digging into complex text logs, you must check the basic vitals of the server.
1. The uptime Command (CPU Load):
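Running uptime prints the current time, how long the system has been running, the number of logged-in users, and the load average. The most critical part of this output is the Load Average. It shows the average number of tasks running or waiting to run over the last 1 minute, 5 minutes, and 15 minutes. On Linux this count also includes tasks blocked on disk I/O, so a high load can point to slow storage as well as a busy CPU.
- If you have a 1-core CPU, a load of 1.00 means the CPU is fully occupied.
- If the load is 5.00 on a 1-core CPU, the server is suffocating; roughly four tasks are waiting in line for every one that runs.
- On a multi-core machine, divide the load by the number of cores: a load of 4.00 on 4 cores is equivalent to 1.00 on 1 core.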
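The per-core reading of the load average can be sketched in a few lines of shell. This is a minimal illustration, not a monitoring tool: load_per_core is a hypothetical helper name, and the script assumes a Linux system where /proc/loadavg and nproc are available.

```shell
# Hypothetical helper: divides a load average by a core count.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Print the raw uptime line when the command is available.
if command -v uptime >/dev/null 2>&1; then
    uptime || true
fi

# Normalize the current 1-minute load by the number of cores (Linux only).
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    one_min=$(cut -d ' ' -f1 /proc/loadavg)
    echo "1-minute load per core: $(load_per_core "$one_min" "$(nproc)")"
fi
```

A result that stays near or above 1.00 per core suggests tasks are queuing; a brief spike in the 1-minute figure alone is usually less worrying than a high 15-minute figure.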
2. The free Command (RAM):
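When a database crashes unexpectedly, a common cause is that the server ran out of RAM and the kernel killed the process to recover memory.
Run free -m to see the numbers. The -m flag displays the output in megabytes (strictly, mebibytes) instead of the default kibibytes, which are harder to read. Look at the available column rather than free: Linux uses otherwise idle RAM for disk cache, so a low free value alone is not a problem. Pay close attention to the Swap row. Swap space is an overflow area on disk (a partition or a file) that the kernel uses to hold memory pages when physical RAM runs short. If your server is heavily using Swap, the machine will run far slower, because disk access is orders of magnitude slower than RAM.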
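As a rough sketch of how the Swap row can be read from a script, the snippet below pulls the "used" column out of free -m. The helper name swap_used_mb is hypothetical, and the parsing assumes the English column layout (Swap: total used free); translated locales may label the row differently.

```shell
# Hypothetical helper: reads `free -m` output on stdin and prints
# the "used" column (third field) of the Swap row.
swap_used_mb() {
    awk '$1 == "Swap:" { print $3 }'
}

# Show the full table, then just the swap figure, when free is installed.
if command -v free >/dev/null 2>&1; then
    free -m || true
    echo "Swap in use (MiB): $(free -m | swap_used_mb)"
fi
```

A value that keeps climbing between runs is the pattern to worry about; a small, stable amount of swap in use is common and usually harmless.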
4. Hardware and Boot Logs (dmesg)
If you plug a new USB drive into a server and nothing happens, how do you know if the kernel even detected it? You use the dmesg (diagnostic messages) command.
dmesg prints the kernel ring buffer: the lowest-level messages the Linux kernel logs about hardware detection, drivers, and boot. On many systems, reading it requires root privileges, so run it with sudo.
If a hard drive is physically dying, the I/O error messages (shown in red when color output is enabled) will appear here.
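A few typical dmesg invocations are sketched below. The flags shown (-T for readable timestamps, --level to filter by severity) belong to the util-linux version of dmesg; kernel_grep is a hypothetical helper added only for illustration, and the commands may print nothing unless run as root.

```shell
# Hypothetical helper: keeps only lines that mention a keyword, ignoring case.
kernel_grep() {
    grep -i -- "$1"
}

# Last 20 kernel messages with human-readable timestamps.
dmesg -T 2>/dev/null | tail -n 20

# Only messages logged at error or warning level.
dmesg --level=err,warn 2>/dev/null || echo "dmesg could not be read (try sudo)"

# Messages that mention USB, e.g. after plugging in a drive.
dmesg 2>/dev/null | kernel_grep usb || true
```

Running the last command right after plugging in a device shows whether the kernel noticed it at all, which separates a hardware problem from a mounting or permissions problem.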
5. Traditional Text Logs (/var/log)
Historically, Linux services and applications wrote their own plain-text files into the /var/log directory. You must know where these are:
- /var/log/syslog (Ubuntu) or /var/log/messages (CentOS): The general "junk drawer" of system events.
- /var/log/auth.log (Ubuntu) or /var/log/secure (CentOS): Records authentication events, including successful and failed SSH login attempts and sudo usage.
- *(You view these files using the cat, less, or tail -f tools we learned in Chapter 5.)*
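Because the file names differ between distributions, a script usually has to probe for whichever one exists. The sketch below does that with a hypothetical helper, first_readable, and then prints the most recent entries; reading these files normally requires root.

```shell
# Hypothetical helper: prints the first argument that is a readable file.
first_readable() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Pick the general system log for this distribution and show its tail.
syslog=$(first_readable /var/log/syslog /var/log/messages) || syslog=""
if [ -n "$syslog" ]; then
    tail -n 20 "$syslog"
fi

# To watch new entries arrive live (Ctrl+C to stop):
#   tail -f "$syslog"
```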
6. The Modern Engine (journalctl)
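Modern Linux distributions (anything using systemd) collect log messages from the kernel, services, and applications into a centralized, binary logging database called the journal. Many distributions still write the traditional text files alongside it. You cannot read the journal with cat. You must query it using the journalctl command.
journalctl is incredibly powerful because it lets you filter logs precisely: by service with -u, by time range with --since and --until, by severity with -p, and by boot with -b. The -f flag follows new entries live, like tail -f.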
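The filters combine freely, as the sketch below shows. The unit name nginx is only an example, and unit_errors is a hypothetical wrapper; the flags themselves (-u, -p, -b, -k, -n, --since, --no-pager) are standard journalctl options.

```shell
# Hypothetical wrapper: entries at priority "err" or worse for one unit
# since a given point in time.
unit_errors() {
    journalctl -u "$1" -p err --since "$2" --no-pager
}

if command -v journalctl >/dev/null 2>&1; then
    # Last 20 entries from the current boot.
    journalctl -b -n 20 --no-pager || true

    # Last 20 kernel messages at warning level or worse.
    journalctl -k -p warning -n 20 --no-pager || true

    # Errors from one service during the last hour.
    unit_errors nginx "1 hour ago" || true
fi

# To follow a service live, like tail -f (Ctrl+C to stop):
#   journalctl -u nginx -f
```

The --no-pager flag keeps the output from opening in less, which matters when the command runs inside a script rather than an interactive terminal.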
7. Diagrams/Visual Suggestions
*Visual Concept: The Logging Funnel* Draw three separate icons: A Web Server, the SSH Service, and the Hardware Kernel. Draw arrows from all three funneling into a central, glowing cylinder labeled systemd Journal.
Draw an arrow coming out of the bottom of the cylinder into a magnifying glass labeled journalctl (Query Tool).
This visualizes the modern shift from fragmented text files to centralized, queryable logging architecture.
8. Best Practices
- Persistent Journals: By default on some systems, the systemd journal is stored in /run/log/journal, which lives in RAM (volatile memory). This means if the server crashes and reboots, the log explaining *why* it crashed is erased! Administrators must ensure the directory /var/log/journal exists (or set Storage=persistent in /etc/systemd/journald.conf) so that systemd writes the logs permanently to disk, allowing post-crash forensic analysis.
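A quick way to check which mode a machine is in is to test for that directory. The sketch below assumes journald is running with its default Storage=auto setting, under which the presence of /var/log/journal decides where logs go; journal_storage is a hypothetical helper name.

```shell
# Hypothetical helper: reports "persistent" if the journal directory exists
# (default: /var/log/journal), otherwise "volatile".
journal_storage() {
    if [ -d "${1:-/var/log/journal}" ]; then
        echo persistent
    else
        echo volatile
    fi
}

echo "Journal storage looks: $(journal_storage)"

# To switch to persistent storage, run as root (left as comments so that
# nothing on the system changes by accident):
#   mkdir -p /var/log/journal
#   systemd-tmpfiles --create --prefix /var/log/journal
#   systemctl restart systemd-journald
```

After the switch, journalctl --list-boots should show more than one boot once the machine has been restarted.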
9. Common Mistakes
- Ignoring the -u flag in journalctl: Beginners often type journalctl and are overwhelmed by tens of thousands of lines of system noise, concluding that the tool is useless. journalctl is a database query tool. If you want to fix a broken NGINX web server, use the -u (Unit) flag: journalctl -u nginx. This filters out everything else and shows you only the messages logged by that service.
10. Mini Project: Forensic Security Audit
Let's see who is attacking your machine right now:
1. Type sudo less /var/log/auth.log (or /var/log/secure on CentOS).
2. Press Shift + G to instantly jump to the very bottom (most recent) part of the file.
3. Look for lines that say Failed password for invalid user. On an internet-facing server, these are usually automated bots attempting to brute-force your SSH port.
4. Now, let's use the modern tool. Type: sudo journalctl -u ssh --since "1 hour ago". (The unit is named ssh on Ubuntu and sshd on CentOS.)
5. You just queried the journal for every SSH event that occurred in the last 60 minutes.
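To go one step beyond eyeballing the file, the failed attempts can be grouped by source address. The following is a minimal sketch: top_failed_ips is a hypothetical helper, it matches IPv4 addresses only, and it assumes the usual OpenSSH message wording ("Failed password for ... from ADDRESS").

```shell
# Hypothetical helper: reads auth log lines on stdin and prints
# "count ip" for failed password attempts, busiest source first.
top_failed_ips() {
    grep 'Failed password' \
        | grep -oE 'from [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' \
        | awk '{ count[$2]++ } END { for (ip in count) print count[ip], ip }' \
        | sort -rn
}

# Summarize whichever auth log exists and is readable (usually needs root).
for log in /var/log/auth.log /var/log/secure; do
    if [ -r "$log" ]; then
        echo "Top sources of failed logins in $log:"
        top_failed_ips < "$log" | head -n 5
    fi
done
```

A handful of addresses responsible for most of the failures is the typical signature of an automated brute-force scan.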
11. Practice Exercises
1. Analyze the output of the uptime command. If a server has 4 CPU cores, and the 1-minute load average is 2.50, is the CPU currently overloaded?
2. Explain the functional difference between reading logs via less /var/log/syslog versus querying logs via journalctl.
12. MCQs with Answers
1. A server application is repeatedly crashing. You suspect the operating system is completely exhausting its physical RAM. Which command will instantly display the total, used, and available system memory in Megabytes?
Answer: free -m
2. You need to investigate hardware-level errors generated by the Linux kernel during the boot sequence regarding a failing network interface card. Which command isolates and displays these specific kernel ring buffer messages?
Answer: dmesg
13. Interview Questions
- Q: A web developer states the server is acting sluggish. You run the free -m command and notice that the "Swap" memory usage is steadily increasing. Explain what "Swap" memory physically is, and why its usage causes severe system latency.
- Q: Contrast the legacy /var/log directory structure with the modern systemd logging architecture. Why must an administrator use the journalctl command instead of cat or less when interacting with systemd logs?
- Q: Explain the significance of the three "Load Average" numbers provided by the uptime command. How do you interpret these numbers in relation to the physical number of CPU cores on the motherboard?
14. FAQs
Q: Do these massive log files eventually fill up the entire hard drive?
A: They would, but Linux uses an automated background utility called logrotate. On a schedule (typically daily or weekly, depending on the distribution's configuration), it renames the current syslog file, compresses older copies into small .gz files such as syslog.2.gz, and creates a brand new, empty syslog file. It automatically deletes the oldest copies once a configured number of rotations is reached, to prevent disk exhaustion.
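To see rotation at work on a given machine, you can count the rotated copies sitting next to a log. This is a small sketch with a hypothetical helper, count_rotated; the file names it finds depend entirely on the local logrotate configuration.

```shell
# Hypothetical helper: counts rotated copies of a log file,
# i.e. files in the same directory named <log>.<something>.
count_rotated() {
    find "$(dirname "$1")" -maxdepth 1 -name "$(basename "$1").*" 2>/dev/null \
        | wc -l | tr -d ' '
}

echo "Rotated copies of syslog: $(count_rotated /var/log/syslog)"

# To preview what logrotate would do without changing anything (run as root):
#   logrotate -d /etc/logrotate.conf
```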
15. Summary
In Chapter 16, we learned to interpret the diagnostic language of the operating system. We used uptime to read CPU load averages and free -m to monitor RAM and Swap saturation, establishing the baseline metrics for performance troubleshooting. We explored the hardware diagnostics of the kernel via dmesg and navigated the traditional plaintext security logs housed in /var/log. Most importantly, we bridged the gap to modern Linux administration by learning the journalctl query engine, allowing us to filter, slice, and extract critical forensic data from the centralized systemd journal.