Bash Scripting – Complete Beginner to Advanced Guide
CHAPTER 11 Intermediate

Text Processing Commands

Updated: May 16, 2026
30 min read

1. Introduction

Linux systems generate a staggering amount of plaintext data. A busy NGINX web server can produce a 1-gigabyte access.log containing around 5 million lines every single day. If your boss asks, "How many times did an IP from Russia try to log in yesterday?", opening that file in a graphical text editor is impractical: many editors slow to a crawl or run out of memory on files that size. Instead, a DevOps engineer relies on a suite of surgical text processing tools built directly into the terminal. In this chapter, we will work through the core tools of Linux data extraction: grep (The Finder), cut (The Slicer), and sed (The Replacer). We will also touch upon awk, a small programming language built for field-based text processing, and on sort and uniq for counting duplicates, so that you can extract needle-in-a-haystack intelligence from massive datasets.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Extract specific rows of text using grep and Regular Expressions.
  • Slice specific columns of text out of delimited files using cut.
  • Perform automated, inline find-and-replace text modifications using sed.
  • Extract and format complex tabular data using awk.
  • Chain sort and uniq together to count duplicate entries in log files.

3. The Slicer (cut)

If you have a CSV file (Comma Separated Values) or a file separated by colons (like /etc/passwd), and you only want to see the 1st column, you use cut. You must define the Delimiter (-d) and the Field you want (-f).
bash
# Print ONLY the usernames (Column 1) from the password file.
# The delimiter is a colon (:).
cut -d ':' -f 1 /etc/passwd
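
The same two flags work on comma-separated data, and -f accepts a list of fields. The following is a minimal sketch; the sample rows and the variable names (csv_data, name_and_city) are made up for illustration and are not part of the chapter's files.

```bash
#!/bin/bash
# Hypothetical sample data with three columns: name,role,city.
csv_data='alice,admin,Berlin
bob,developer,Lagos
carol,auditor,Lima'

# -d ',' sets the delimiter; -f 1,3 keeps fields 1 and 3.
# cut joins the selected fields with the same delimiter it split on.
name_and_city=$(printf '%s\n' "$csv_data" | cut -d ',' -f 1,3)

printf '%s\n' "$name_and_city"
```

With this input the script prints alice,Berlin, bob,Lagos, and carol,Lima on separate lines. Note that cut does not understand quoted CSV fields, so a value that itself contains a comma would be split in the wrong place.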

4. The Replacer (sed)

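sed stands for Stream Editor. It is the command-line version of "Find and Replace". The syntax looks cryptic at first: s/FIND/REPLACE/g. The s means substitute, and the g means global: replace every occurrence on each line, not just the first one on each line. FIND is treated as a regular expression, so characters such as . and * have special meaning.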
bash
# Echo a sentence, and pipe it into sed to change "apples" to "oranges"
echo "I like apples" | sed 's/apples/oranges/g'
# Output: I like oranges
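
To see what the g flag actually changes, it helps to run the same substitution with and without it on a line that contains the search word twice. This is a small sketch; the sample sentence and variable names are illustrative only.

```bash
#!/bin/bash
# Illustrative input: the search word appears twice on one line.
line="apples and more apples"

# Without g, sed replaces only the first match on each line.
first_only=$(printf '%s\n' "$line" | sed 's/apples/oranges/')

# With g, sed replaces every match on each line.
every_match=$(printf '%s\n' "$line" | sed 's/apples/oranges/g')

echo "$first_only"   # oranges and more apples
echo "$every_match"  # oranges and more oranges
```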

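*Pro-Tip: Using -i (In-place)*: Normally sed just prints the result to the screen and leaves the file untouched. If you add -i, sed writes the changes back to the file itself, and there is no undo. Appending a suffix (for example -i.bak) makes sed keep a copy of the original file under that suffix.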

bash
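# Change Port 22 to 2222 in the SSH config, keeping the original as sshd_config.bak.
# ^#? also matches the commented-out default; $ stops the pattern from matching inside "Port 2222".
sudo sed -E -i.bak 's/^#?Port 22$/Port 2222/' /etc/ssh/sshd_config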
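
Before touching a real configuration file, it can be worth previewing a substitution without -i. The sketch below runs against a throwaway temp file rather than the real sshd_config; the file contents and variable names are assumptions for illustration. It uses an anchored pattern because an unanchored "Port 22" would also match the first part of "Port 2222".

```bash
#!/bin/bash
# Throwaway stand-in for a config file (contents are hypothetical).
tmp_config=$(mktemp)
printf '%s\n' '#Port 22' 'PermitRootLogin no' > "$tmp_config"

# Without -i, sed writes the edited text to stdout and leaves the file alone.
# -E enables extended regex so that ? means "optional".
preview=$(sed -E 's/^#?Port 22$/Port 2222/' "$tmp_config")
printf '%s\n' "$preview"

# The file on disk still contains the original first line.
original_first_line=$(head -n 1 "$tmp_config")
printf 'On disk: %s\n' "$original_first_line"

rm -f "$tmp_config"
```

Once the preview looks right, the same expression can be rerun with -i (or -i.bak) to apply it.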

5. The Engine (awk)

awk is not just a command; it is a small programming language built around processing text one record (line) and one field (column) at a time. For column extraction it is more forgiving than cut because its default field separator is any run of spaces or tabs. If you run ls -l, you get columns of file permissions, owners, and sizes padded with a varying number of spaces.
bash
# Print ONLY the file size (Column 5) and the file name (Column 9).
# Caveat: a file name containing spaces is split across fields, so $9 holds only its first word.
ls -l | awk '{print $5, $9}'
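
The difference between the two tools shows up as soon as the padding between columns is uneven. This is a minimal sketch using invented status lines; the data and variable names are assumptions, not output from a real command. It also shows awk's -F option, which sets a custom field separator.

```bash
#!/bin/bash
# Hypothetical report with uneven runs of spaces between columns.
report='web01    running   512
db01  stopped 2048'

# awk treats any run of spaces or tabs as one separator.
awk_column=$(printf '%s\n' "$report" | awk '{print $2}')

# cut treats every single space as a separator, so field 2 here is
# the empty string between the first two spaces on each line.
cut_column=$(printf '%s\n' "$report" | cut -d ' ' -f 2)

# -F tells awk to split on a different separator, such as a colon.
name_and_uid=$(printf 'root:x:0:0\n' | awk -F ':' '{print $1, $3}')

printf 'awk: %s\n' "$awk_column"
printf 'cut: [%s]\n' "$cut_column"
printf 'with -F: %s\n' "$name_and_uid"
```

Here awk returns running and stopped, cut returns nothing useful, and the -F example prints root 0.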

6. The Organizer (sort and uniq)

If you extract 5,000 IP addresses from a log file, you will likely have hundreds of duplicates. You use sort to organize them alphabetically, and uniq to delete consecutive duplicates. If you use uniq -c (Count), it will actually tell you *how many times* that duplicate appeared!
bash
# Read a file of messy IPs, sort them, count the duplicates, and sort by the highest count
sort ip_list.txt | uniq -c | sort -nr
# Output: 450 192.168.1.5 (This IP attacked you 450 times!)
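
To try the counting pipeline without a real log, you can feed it a handful of addresses inline. The sketch below uses made-up IPs and a hypothetical variable named ranked. The trailing awk step is an addition that only strips the leading padding uniq -c puts in front of each count, since that padding varies between implementations.

```bash
#!/bin/bash
# Made-up sample: one address appears 3 times, one twice, one once.
ips='10.0.0.7
192.168.1.5
10.0.0.7
192.168.1.5
192.168.1.5
172.16.0.9'

# sort groups duplicates, uniq -c counts them, sort -nr ranks by count.
# awk '{print $1, $2}' normalizes the whitespace around the count.
ranked=$(printf '%s\n' "$ips" | sort | uniq -c | sort -nr | awk '{print $1, $2}')

printf '%s\n' "$ranked"
```

This prints 3 192.168.1.5, then 2 10.0.0.7, then 1 172.16.0.9.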

7. Diagrams/Visual Suggestions

*Visual Concept: The Text Processing Pipeline* Draw a large, messy paragraph of text. Arrow 1 points to a magnifying glass labeled grep "Error". The output shrinks to just 3 lines of text. Arrow 2 points to a meat cleaver labeled cut -d ',' -f 2. The output shrinks to just a single vertical column of words. Arrow 3 points to a sorting tray labeled sort | uniq -c. The final output is a neat list with numbers attached (3 Failed, 1 Success). This visualizes the progressive refinement of raw data into readable intelligence.

8. Best Practices

  • Never sed -i without a backup: The sed -i command is ruthlessly destructive. If you make a typo in your Regex, it will permanently corrupt the file it is editing. Always back up configuration files (cp config config.bak) before running a sed -i command in a script.
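
One way to build the backup into the edit itself is the suffix form of -i. The following sketch works in a temporary directory with an invented app.conf; the file name, its contents, and the variable names are assumptions for illustration.

```bash
#!/bin/bash
# Work in a scratch directory so no real config is touched.
work_dir=$(mktemp -d)
config="$work_dir/app.conf"
printf '%s\n' 'debug=False' 'verbose=False' > "$config"

# -i.bak edits app.conf in place and keeps the original as app.conf.bak.
sed -i.bak 's/^debug=False$/debug=True/' "$config"

edited=$(cat "$config")
backup=$(cat "$config.bak")

printf 'edited:\n%s\n\nbackup:\n%s\n' "$edited" "$backup"

# If the edit turned out to be wrong, this would restore the original:
#   cp "$config.bak" "$config"
rm -rf "$work_dir"
```

The edited file now starts with debug=True, while the .bak copy still holds both original lines.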

9. Common Mistakes

  • Using uniq without sort: The uniq command is notoriously "dumb". It only deletes duplicates if they are sitting *exactly next to each other* vertically. If "Apple" is on line 1, and "Apple" is on line 5, uniq will not delete the duplicate. You MUST pipe the data through sort first to group the duplicates together, and then pipe it into uniq.
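
The mistake is easy to reproduce with four lines of input. In this sketch (sample data and variable names are illustrative), the first two "Apple" lines are separated by another value, so uniq alone leaves one of the duplicates behind.

```bash
#!/bin/bash
# "Apple" appears on lines 1, 3, and 4; only lines 3 and 4 are adjacent.
fruits='Apple
Banana
Apple
Apple'

# uniq alone collapses only the adjacent pair.
unsorted_result=$(printf '%s\n' "$fruits" | uniq)

# Sorting first puts all identical lines next to each other.
sorted_result=$(printf '%s\n' "$fruits" | sort | uniq)

printf 'uniq only:\n%s\n\nsort then uniq:\n%s\n' "$unsorted_result" "$sorted_result"
```

The first result still lists Apple twice (Apple, Banana, Apple); the second contains just Apple and Banana.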

10. Mini Project: Build an Access Log Analyzer

Let's build a DevOps security script that parses an Apache web server log to find the top attacker.
  1. Create the script file: nano log_analyzer.sh
  2. Write the code:
bash
#!/bin/bash

# Pretend we have an apache access.log file
LOG_FILE="/var/log/apache2/access.log"

if [ ! -f "$LOG_FILE" ]; then
    echo "No log file found."
    exit 1
fi

echo "Top 5 IP Addresses hitting the server:"
echo "--------------------------------------"

# 1. awk '{print $1}' "$LOG_FILE": Read the file and extract the very first column
#    (the client IP address in Apache's default log formats)
# 2. sort: Group the identical IPs together
# 3. uniq -c: Count the duplicates and collapse them into single lines
# 4. sort -nr: Sort them Numerically (n) in Reverse (r) so the biggest number is at the top
# 5. head -n 5: Only show the top 5 lines

awk '{print $1}' "$LOG_FILE" | sort | uniq -c | sort -nr | head -n 5
  3. This pipeline pattern is a common first step in real-world log triage.
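
If you do not have an Apache log at hand, the pipeline can be exercised against a few synthetic lines. The sketch below is an assumption-laden variant, not the chapter's script: top_ips is a hypothetical helper, the log lines are invented (using documentation-reserved IP ranges), and the final awk step only trims the padding that uniq -c adds.

```bash
#!/bin/bash
# top_ips FILE [COUNT]: hypothetical helper wrapping the chapter's pipeline.
top_ips() {
    local log_file=$1
    local count=${2:-5}   # default to the top 5 when no count is given
    awk '{print $1}' "$log_file" | sort | uniq -c | sort -nr \
        | head -n "$count" | awk '{print $1, $2}'
}

# Invented log lines in a common-log-like layout.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
203.0.113.9 - - [16/May/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 512
198.51.100.4 - - [16/May/2026:10:00:02 +0000] "POST /login HTTP/1.1" 401 128
203.0.113.9 - - [16/May/2026:10:00:03 +0000] "POST /login HTTP/1.1" 401 128
203.0.113.9 - - [16/May/2026:10:00:04 +0000] "POST /login HTTP/1.1" 401 128
198.51.100.4 - - [16/May/2026:10:00:05 +0000] "GET /about HTTP/1.1" 200 256
192.0.2.77 - - [16/May/2026:10:00:06 +0000] "GET / HTTP/1.1" 200 512
EOF

top_two=$(top_ips "$sample_log" 2)
printf '%s\n' "$top_two"

rm -f "$sample_log"
```

With this sample the script prints 3 203.0.113.9 followed by 2 198.51.100.4. Wrapping the pipeline in a function keeps the file name and the number of results out of the pipeline itself, which makes it easier to test.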

11. Practice Exercises

  1. Differentiate the operational utility of the cut command versus the awk command. In a file where columns are separated by irregular amounts of whitespace (tabs and spaces), which tool is superior?
  2. Explain the mandatory workflow requirement regarding the execution order of the sort and uniq commands.

12. MCQs with Answers

Question 1

An administrator needs to perform an automated "Find and Replace" within a configuration file using a Bash script, changing the word False to True. Which command accomplishes this inline modification?

Answer: sed -i 's/False/True/g' followed by the name of the configuration file.

Question 2

When parsing a .csv file where the data is explicitly separated by commas, which command flag is utilized with the cut command to specify the comma as the delimiter?

Answer: -d, as in cut -d ',' combined with -f to choose the field.

13. Interview Questions

  • Q: A junior engineer runs cat access.log | uniq -c to count duplicate IP addresses, but the output still shows hundreds of duplicated IPs scattered throughout the list. Explain the mechanical limitation of the uniq command that caused this failure, and provide the corrected command pipeline.
  • Q: Explain the syntax awk '{print $3}'. What is this command extracting, and how does awk fundamentally determine where data boundaries exist by default?
  • Q: You are writing an automated deployment script. You need to permanently change a database connection string inside a config.php file without opening a text editor. Walk me through the exact sed command required to execute this modification safely.

14. FAQs

Q: What is a Regular Expression (Regex)?
A: All of these tools (grep, sed, awk) support Regex. Regex is a compact pattern-matching syntax for describing the shape of text instead of an exact string. For example, instead of searching for 192.168.1.5, an extended Regex like ([0-9]{1,3}\.){3}[0-9]{1,3} (used with grep -E) tells grep to find anything in the file that looks like an IPv4 address: four groups of one to three digits separated by dots. It checks only the shape, so it would also match an invalid address such as 999.999.999.999.
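
As a concrete illustration of the FAQ answer, the sketch below runs an IP-shaped pattern through grep on three invented log lines. The -E flag enables extended Regex syntax such as {1,3}; -o prints only the matched text instead of the whole line and is available in GNU and BSD grep. The sample lines and variable names are assumptions.

```bash
#!/bin/bash
# Invented log lines; only two of them contain an address.
log_lines='Failed login from 192.168.1.5 port 22
Accepted key for deploy
Failed login from 10.0.0.7 port 22'

# Four groups of 1-3 digits separated by literal dots.
# This checks the shape only, not whether each number is 0-255.
ip_pattern='([0-9]{1,3}\.){3}[0-9]{1,3}'

# Whole lines that contain something IP-shaped.
matching_lines=$(printf '%s\n' "$log_lines" | grep -E "$ip_pattern")

# Only the matched text itself.
extracted_ips=$(printf '%s\n' "$log_lines" | grep -E -o "$ip_pattern")

printf '%s\n' "$extracted_ips"
```

The extracted output is 192.168.1.5 and 10.0.0.7, one per line, ready to be piped into sort and uniq -c.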

15. Summary

In Chapter 11, we conquered the chaos of massive plaintext datasets. We deployed the cut command to surgically slice strictly delimited files (:, ,) and utilized the overarching intelligence of awk to extract columns from irregularly formatted outputs. We mastered the Stream Editor (sed), leveraging the s/FIND/REPLACE/g syntax to execute rapid, automated inline configuration changes. Finally, we engineered multi-stage forensic pipelines, chaining sort and uniq -c to distill millions of messy log entries into clean, actionable intelligence regarding duplicate events.

16. Next Chapter Recommendation

Your script can process data brilliantly, but it requires the user to open the script and change variables manually to alter its behavior. We must enable external arguments. Proceed to Chapter 12: Command-Line Arguments.
