Shell Scripting – Complete Beginner to Advanced Guide
CHAPTER 11 Intermediate

Text Processing Utilities

Updated: May 16, 2026
30 min read

1. Introduction

Linux and Unix systems generate a staggering volume of plaintext data. A busy web server can write a 2 GB access.log containing millions of lines every single day. If you need to know how many times a specific IP address attempted to log in, opening that file in a graphical text editor is impractical: many editors slow to a crawl or hang on files that large. Instead, a DevOps engineer relies on a suite of precise text processing tools built directly into the terminal environment. In this chapter, we will master three core Unix data extraction tools: grep (The Finder), cut (The Slicer), and sed (The Replacer). We will also touch upon awk, a powerful column-parsing language, and sort, uniq, and tr for final text organization, allowing you to extract needle-in-a-haystack intelligence from massive datasets.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Extract specific rows of text from massive files using grep.
  • Slice specific columns out of explicitly delimited files using cut.
  • Perform automated, inline find-and-replace text modifications using sed.
  • Extract and format complex tabular whitespace data using awk.
  • Translate and manipulate characters using tr.
  • Chain sort and uniq together to count duplicate log entries.

3. The Finder (grep)

If you need to find a specific word in a massive file, you use grep. It searches text line-by-line and prints only the lines containing the match.
```sh
# Print all lines in the auth.log that contain the word "Failed"
grep "Failed" /var/log/auth.log

# Pro-Tip: Add -v to INVERT the search (Show everything EXCEPT "Failed")
grep -v "Failed" /var/log/auth.log
```
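
Because the motivating question in the introduction is "how many times", a count is often more useful than the matching lines themselves. The following is a minimal sketch, assuming a small made-up sample log written to a temporary file (the log lines and IP addresses are hypothetical, and mktemp is assumed to be available):

```sh
# Build a tiny sample log in a temporary file (contents are made up for illustration)
log=$(mktemp)
cat > "$log" <<'EOF'
Failed password for root from 203.0.113.7
Accepted password for alice from 198.51.100.2
failed password for bob from 203.0.113.7
Failed password for admin from 192.0.2.44
EOF

# -c prints the number of matching lines instead of the lines themselves
failed_count=$(grep -c "Failed" "$log")

# -i ignores case, so the lowercase "failed" on line 3 is counted too
failed_any_case=$(grep -ci "failed" "$log")

# -n prefixes each matching line with its line number
numbered=$(grep -n "Accepted" "$log")

echo "$failed_count $failed_any_case"
echo "$numbered"
rm -f "$log"
```

Note that grep -c counts matching lines, not individual occurrences, so a line containing the word twice still counts once.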

4. The Slicer (cut)

If you have a file where data is strictly separated by a specific character (like a Comma-Separated Values .csv file, or the colon-separated /etc/passwd file), you use cut to extract a specific column. You must define the Delimiter (-d) and the Field you want (-f).
```sh
# Print ONLY the usernames (Column 1) from the password file.
# The delimiter is explicitly set to a colon (:).
# cut reads the file directly, so no cat is needed.
cut -d ':' -f 1 /etc/passwd
```
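
The same two flags work on comma-separated data, and -f accepts a list of fields. This is a small sketch, assuming hypothetical CSV rows held in a shell variable rather than a real file:

```sh
# Sample CSV data (hypothetical), fed to cut on standard input
csv='alice,admin,London
bob,developer,Berlin
carol,analyst,Madrid'

# -f 2 selects only the second comma-separated field
roles=$(printf '%s\n' "$csv" | cut -d ',' -f 2)

# -f 1,3 selects several fields; cut keeps the delimiter between them
name_city=$(printf '%s\n' "$csv" | cut -d ',' -f 1,3)

echo "$roles"
echo "$name_city"
```

cut splits on every delimiter character it sees, so it is only suitable for simple CSV files without quoted fields that contain commas.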

5. The Replacer (sed)

sed stands for Stream Editor. It is the command-line equivalent of the "Find and Replace" tool. The primary syntax looks cryptic at first: s/FIND/REPLACE/g. (The s means substitute, the g means global—replace every occurrence on the line, not just the first one).
```sh
# Pipe text into sed to change "apples" to "oranges"
echo "I like apples" | sed 's/apples/oranges/g'
# Output: I like oranges
```
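
The effect of the g flag is easiest to see on a line that contains the search text twice. A minimal sketch, using a made-up input string:

```sh
# Without g, only the first match on each line is replaced
first_only=$(echo "apples and apples" | sed 's/apples/oranges/')

# With g, every match on the line is replaced
all_matches=$(echo "apples and apples" | sed 's/apples/oranges/g')

echo "$first_only"
echo "$all_matches"
```

The first command leaves the second "apples" untouched, while the second command replaces both.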

*Pro-Tip (In-place Editing):* Normally sed just prints to the screen. If you use -i (or -i '' on macOS), sed will physically edit the target text file and save it permanently without opening it!

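```sh
# Permanently change Port 22 to 2222 in the SSH configuration.
# The pattern matches the whole line (with or without a leading #),
# so an existing "Port 2222" or "Port 220" is left alone.
sudo sed -i 's/^#\{0,1\}Port 22$/Port 2222/' /etc/ssh/sshd_config
```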

6. The Engine (awk)

awk is a complete text-processing language, and column extraction is its most common everyday use. It is more flexible than cut because its default field separator is any run of consecutive spaces or tabs. If you run ls -l, you get columns of file permissions, owners, and sizes padded with varying numbers of spaces, which cut cannot split reliably.
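```sh
# Print ONLY the file size (Column 5) and the file name (Column 9).
# NR > 1 skips the "total" summary line that ls -l prints first.
# Caveat: a name containing spaces is split across $9, $10, ..., so only its first word is printed.
ls -l | awk 'NR > 1 {print $5, $9}'
```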
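
Because awk is a language rather than a single-purpose filter, it can also filter rows, keep running totals, and use a different separator. The sketch below assumes a hypothetical three-column report (name, status, size in KB) and a made-up /etc/passwd style line:

```sh
# Hypothetical whitespace-separated report: name, status, size in KB
report='app.log    ok      120
db.log     error   300
web.log      ok    80'

# awk splits on runs of spaces or tabs by default, so uneven gaps do not matter
names=$(printf '%s\n' "$report" | awk '{print $1}')

# A condition before the braces filters rows: print the name only when column 2 is "error"
errors=$(printf '%s\n' "$report" | awk '$2 == "error" {print $1}')

# Variables persist across lines, so awk can total a column; END runs after the last line
total_kb=$(printf '%s\n' "$report" | awk '{sum += $3} END {print sum}')

# -F changes the field separator, here to a colon as in /etc/passwd
shell_field=$(echo 'alice:x:1000:1000::/home/alice:/bin/sh' | awk -F ':' '{print $7}')

echo "$names"
echo "$errors $total_kb $shell_field"
```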

7. The Organizer (sort and tr)

1. tr (Translate): Replaces or deletes individual characters.
```sh
# Convert all lowercase text to UPPERCASE
echo "hello world" | tr 'a-z' 'A-Z'
```

2. sort and uniq: If you extract 5,000 IP addresses from a log file, you will likely have hundreds of duplicates. sort arranges the lines in order so that identical entries sit next to each other, and uniq -c then collapses each group into a single line prefixed with its count.

```sh
# Sort the IPs, count duplicates, and sort by highest count (Numeric Reverse).
# sort reads the file directly, so no cat is needed.
sort ip_list.txt | uniq -c | sort -nr
```
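
The tools in this section combine naturally. As an illustration only (the sample sentence is made up), the sketch below uses tr to put one word per line and normalize case, then sort and uniq -c to count word frequencies:

```sh
text='the cat and the dog and THE bird'

# 1. tr -s ' ' '\n' : turn each run of spaces into a single newline (one word per line)
# 2. tr 'A-Z' 'a-z' : lowercase everything so "THE" and "the" are counted together
# 3. sort | uniq -c : group identical words and count each group
# 4. sort -nr       : largest count first
word_counts=$(echo "$text" | tr -s ' ' '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr)

# The most frequent word is on the first line; awk strips the padding uniq -c adds
top_word=$(echo "$word_counts" | head -n 1 | awk '{print $2, $1}')

echo "$word_counts"
echo "Most frequent: $top_word"
```

Swapping the sample sentence for a column of IP addresses turns this into the duplicate-counting pipeline described above.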

8. Diagrams/Visual Suggestions

*Visual Concept: The Text Processing Pipeline* Draw a large, messy paragraph of text representing raw logs. Arrow 1 points to a magnifying glass labeled grep "Error". The output shrinks to just 3 lines of text. Arrow 2 points to a meat cleaver labeled cut -d ',' -f 2. The output shrinks to just a single vertical column of words. Arrow 3 points to a sorting tray labeled sort | uniq -c. The final output is a neat, numbered list (3 Failed, 1 Success). This visually demonstrates the progressive refinement of raw data into actionable intelligence.

9. Best Practices

  • Never sed -i without a backup: The in-place stream editor is ruthlessly destructive. If you make a typo in your search pattern, it will permanently corrupt the file it is editing. Always back up configuration files (cp config config.bak) before executing an automated sed -i command inside a script.
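
One way to follow this practice is to let sed take the backup itself. The sketch below is an assumption-laden example, not a prescribed procedure: it works on a throwaway temporary file with made-up contents, and it relies on the -i.bak form (suffix attached directly to -i), which both GNU sed and the BSD sed shipped with macOS accept.

```sh
# Work on a throwaway file so the sketch never touches a real configuration
conf=$(mktemp)
printf 'Port 22\nPermitRootLogin no\n' > "$conf"

# Attaching a suffix to -i makes sed save the original as "$conf.bak" before editing
sed -i.bak 's/^Port 22$/Port 2222/' "$conf"

# Review the change; diff exits non-zero when files differ, which is expected here
diff "$conf.bak" "$conf" || true
new_port=$(grep '^Port' "$conf")

# If the edit turns out to be wrong, restoring is a single copy
cp "$conf.bak" "$conf"
restored_port=$(grep '^Port' "$conf")

echo "$new_port / $restored_port"
rm -f "$conf" "$conf.bak"
```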

10. Common Mistakes

  • Using uniq without sort: The uniq command is mechanically "dumb". It only deletes duplicates if they are sitting *exactly adjacent* to each other vertically. If "Admin" is on line 1, and "Admin" is on line 5, uniq will not delete the duplicate. You MUST pipe the data through sort first to group identical lines together, and then pipe it into uniq.
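
The difference is easy to reproduce. A minimal sketch, assuming a made-up list in which the duplicates are never adjacent:

```sh
names='Admin
Guest
Admin
Guest
Admin'

# uniq alone: no two identical lines are adjacent, so nothing is collapsed
unsorted_lines=$(printf '%s\n' "$names" | uniq | wc -l | tr -d ' ')

# sort first: identical lines become neighbours and uniq collapses them
sorted_lines=$(printf '%s\n' "$names" | sort | uniq | wc -l | tr -d ' ')

# With -c, each surviving line is prefixed by how many times it occurred
counts=$(printf '%s\n' "$names" | sort | uniq -c)

echo "without sort: $unsorted_lines lines, with sort: $sorted_lines lines"
echo "$counts"
```

Without sort, all five lines survive; with sort, only the two distinct names remain.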

11. Mini Project: Build an Access Log Analyzer

Let's build a DevOps security script that parses an Apache web server log to find the top attacker.
  1. nano log_analyzer.sh
  2. Write the code:
```sh
#!/bin/sh

# Path to the Apache access log (adjust if yours lives elsewhere)
LOG_FILE="/var/log/apache2/access.log"

if [ ! -f "$LOG_FILE" ]; then
    # Report the problem on stderr and exit with a non-zero status
    echo "No log file found. Skipping analysis." >&2
    exit 1
fi

echo "Top 5 IP Addresses hitting the server:"
echo "--------------------------------------"

# The Pipeline Breakdown:
# 1. awk '{print $1}': Extract the very first column (IP addresses)
# 2. sort: Group identical IPs together
# 3. uniq -c: Count the duplicates and collapse them into single lines
# 4. sort -nr: Sort them Numerically (n) in Reverse (r) so the biggest number is at top
# 5. head -n 5: Only display the top 5 lines

awk '{print $1}' "$LOG_FILE" | sort | uniq -c | sort -nr | head -n 5
```
  3. Run it with sh log_analyzer.sh. The same awk, sort, uniq, head pattern is a common starting point for real-world log analysis.
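
If no Apache log is available, the pipeline can be tried against sample data. The sketch below assumes a hypothetical six-line log in the usual access-log layout, written to a temporary file; the IP addresses come from ranges reserved for documentation and the requests are invented.

```sh
# Hypothetical sample log; every line is made up for illustration
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.7 - - [16/May/2026:10:00:01 +0000] "GET /login HTTP/1.1" 401 512
198.51.100.2 - - [16/May/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 1024
203.0.113.7 - - [16/May/2026:10:00:03 +0000] "GET /login HTTP/1.1" 401 512
192.0.2.44 - - [16/May/2026:10:00:04 +0000] "GET /about HTTP/1.1" 200 2048
203.0.113.7 - - [16/May/2026:10:00:05 +0000] "GET /login HTTP/1.1" 401 512
198.51.100.2 - - [16/May/2026:10:00:06 +0000] "GET /admin HTTP/1.1" 403 256
EOF

# Same pipeline as the script, reading from the sample file
top=$(awk '{print $1}' "$log" | sort | uniq -c | sort -nr | head -n 5)

# Reformat "count ip" into "ip count" for the busiest address
top_ip=$(echo "$top" | head -n 1 | awk '{print $2, $1}')

echo "$top"
echo "Busiest client: $top_ip"
rm -f "$log"
```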

12. Practice Exercises

  1. Differentiate the operational utility of the cut command versus the awk command. In a file where columns are separated by irregular amounts of whitespace (tabs and spaces mixed), which tool is superior?
  2. Explain the mandatory workflow requirement regarding the execution order of the sort and uniq commands.

13. MCQs with Answers

Question 1

An administrator needs to perform an automated "Find and Replace" within a configuration file using a Shell script, changing the word False to True. Which command accomplishes this inline modification permanently?

Answer: sed with the -i flag, using the substitution s/False/True/g on the configuration file.

Question 2

When parsing a .csv file where the data is explicitly delimited by commas, which command flag is utilized with the cut command to specify the comma as the delimiter?

Answer: The -d flag, as in cut -d ',' together with -f to choose the field.

14. Interview Questions

  • Q: A junior engineer runs cat access.log | uniq -c to count duplicate IP addresses, but the terminal output still shows hundreds of duplicated IPs scattered throughout the list. Explain the mechanical limitation of the uniq command that caused this failure, and provide the corrected command pipeline.
  • Q: Explain the syntax awk '{print $3}'. What is this command extracting, and how does awk fundamentally determine where data boundaries exist by default?
  • Q: You are writing an automated deployment script. You need to permanently change a database connection string inside a config.php file without opening a text editor. Walk me through the exact sed command syntax required to execute this modification securely.

15. FAQs

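Q: What is a Regular Expression (Regex)? A: grep, sed, and awk all support Regular Expressions. Regex is a compact syntax for matching patterns instead of literal words. For example, instead of searching for the exact IP 192.168.1.5, a pattern like ([0-9]{1,3}\.){3}[0-9]{1,3} tells grep -E to find anything shaped like an IPv4 address: four groups of one to three digits separated by dots. The -E flag enables the extended syntax; plain grep needs the parentheses and braces escaped with backslashes. The pattern checks shape only, so it also matches impossible addresses such as 999.999.999.999.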
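
As a short sketch of this idea, the command below pulls the address out of a made-up log line. It assumes a grep that supports the -o option, which is not required by POSIX but is provided by both GNU and BSD grep.

```sh
line='Failed password for root from 192.168.1.5 port 22'

# -E enables extended regex so {1,3} and ( ) work unescaped
# -o prints only the matched text instead of the whole line
# ([0-9]{1,3}\.){3} matches three "digits plus dot" groups, then a final digit group
ip=$(echo "$line" | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}')

echo "$ip"
```

The trailing "22" in the sample line is not reported because it is not part of a four-group dotted sequence.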

16. Summary

In Chapter 11, we tackled massive plaintext datasets. We used grep to pull matching lines out of large files, cut to slice strictly delimited files (colon- or comma-separated), and awk to extract columns from irregularly spaced output. We used the Stream Editor (sed) and its s/FIND/REPLACE/g syntax to make fast, automated in-place configuration changes, and tr to translate individual characters. Finally, we built multi-stage pipelines, chaining sort and uniq -c to distill millions of log entries into a ranked count of repeated events.

17. Next Chapter Recommendation

Your script can process static data brilliantly. Now, we must turn our attention to dynamic, live operations. We must learn how to monitor and control running programs. Proceed to Chapter 12: Process Management in Shell Scripts.
