Essential Linux Command-Line Text Processing Utilities
Core Text Processing Utilities
Much day-to-day work on Linux comes down to manipulating plain text, and three tools do most of it. They are often called the 'text processing triad': grep, sed, and awk.
- grep: Searches text using patterns defined by regular expressions.
- sed: A stream editor for filtering and transforming text.
- awk: A specialized language designed for text reporting and data formatting.
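As a rough illustration of how the three roles differ, the sketch below runs each tool over the same input. The passwd-style records are invented sample data, not taken from a real system.

```shell
# Invented passwd-style sample data (an assumption for illustration).
passwd_sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
  'alice:x:1000:1000:Alice:/home/alice:/bin/bash')

# grep selects: keep only the lines that end in /bin/bash.
bash_users=$(printf '%s\n' "$passwd_sample" | grep '/bin/bash$')

# sed transforms: rewrite the login shell on every line where it is /bin/bash.
rewritten=$(printf '%s\n' "$passwd_sample" | sed 's#/bin/bash$#/bin/zsh#')

# awk reports: split on ':' and print selected fields in a new layout.
report=$(printf '%s\n' "$passwd_sample" | awk -F: '{ printf "%s uid=%s\n", $1, $3 }')

printf '%s\n' "$report"
```

grep leaves the selected lines unchanged, sed edits them in passing, and awk treats each line as a record with fields.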
Grep: Pattern Matching and Extraction
The grep command takes its name from the ed editor command g/re/p (globally search for a regular expression and print the matching lines). It searches input files for lines matching a pattern and supports three matching modes:
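- grep: Basic Regular Expressions (BRE), the default.
- egrep (or grep -E): Extended Regular Expressions (ERE).
- fgrep (or grep -F): Fixed strings; the pattern is interpreted literally, without regex parsing.
Recent GNU grep releases warn that egrep and fgrep are obsolescent, so grep -E and grep -F are the safer spellings in new scripts.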
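The difference between the three modes is easiest to see by running one pattern through each. This is a minimal sketch; the two input lines are made up for the purpose.

```shell
# Made-up input: one line holds a literal "a+b", the other holds "aab".
sample=$(printf 'a+b\naab')

# BRE: an unescaped "+" is an ordinary character, so only the literal "a+b" matches.
bre=$(printf '%s\n' "$sample" | grep 'a+b')

# ERE: "+" means "one or more of the preceding item", so only "aab" matches.
ere=$(printf '%s\n' "$sample" | grep -E 'a+b')

# Fixed string: nothing is interpreted as a regex, so only the literal "a+b" matches.
fixed=$(printf '%s\n' "$sample" | grep -F 'a+b')

printf 'BRE: %s\nERE: %s\nFixed: %s\n' "$bre" "$ere" "$fixed"
```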
Common Command Options
- --color=auto: Highlights matching text.
- -i: Case-insensitive search.
- -v: Inverts the match, showing non-matching lines.
- -o: Outputs only the matched string, not the full line.
- -q: Quiet mode; returns exit status without output.
- -A n, -B n, -C n: Displays n lines After, Before, or Context (both) around the match.
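The options above can be tried against a small log. The log lines here are invented sample data, and the sketch assumes GNU grep for -o and -A.

```shell
# Invented log lines (an assumption for illustration).
log=$(printf '%s\n' 'INFO start' 'error: disk full' 'INFO retry' 'ERROR: timeout' 'INFO done')

# -i: match "error" in any letter case.
any_case=$(printf '%s\n' "$log" | grep -i 'error')

# -v: invert the match and keep the lines without "error".
inverted=$(printf '%s\n' "$log" | grep -i -v 'error')

# -o: print only the matched text, one match per output line.
only_match=$(printf '%s\n' "$log" | grep -i -o 'error')

# -q: print nothing; the exit status alone says whether a match exists.
if printf '%s\n' "$log" | grep -q 'timeout'; then
  found=yes
else
  found=no
fi

# -A 1: print each matching line plus the one line after it.
with_context=$(printf '%s\n' "$log" | grep -A 1 'disk full')

printf '%s\n' "$with_context"
```

Because -q communicates only through the exit status, it is the natural choice inside if and while conditions.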
Regular Expression Fundamentals
Regular expressions define the search pattern using metacharacters.
Character Matching:
- .: Matches any single character.
- []: Matches any single character within the brackets.
- [^]: Matches any single character NOT in the brackets.
- POSIX classes like [:digit:], [:alpha:], and [:space:] can be used inside brackets, for example [[:digit:]].
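A short sketch of the character-matching rules, run against five made-up three-character lines:

```shell
# Made-up input: five lines that differ only in the middle character.
lines=$(printf '%s\n' 'cat' 'cot' 'cut' 'c9t' 'c t')

# "." accepts any single character, so every line matches (-c prints the count).
dot_count=$(printf '%s\n' "$lines" | grep -c 'c.t')

# [ao] accepts exactly one of the listed characters.
in_set=$(printf '%s\n' "$lines" | grep 'c[ao]t')

# [^ao] accepts any single character except the listed ones.
not_in_set=$(printf '%s\n' "$lines" | grep 'c[^ao]t')

# A POSIX class goes inside the brackets, hence the doubled [[ ]].
digit=$(printf '%s\n' "$lines" | grep 'c[[:digit:]]t')

printf '%s\n' "$not_in_set"
```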
Quantifiers (BRE syntax):
- *: Matches the preceding item zero or more times.
- .*: Matches any sequence of characters, including an empty one.
- \?: Matches zero or one time (a GNU extension in BRE).
- \+: Matches one or more times (a GNU extension in BRE).
- \{m,n\}: Matches between m and n times.
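The quantifiers can be compared by applying each to the same word list. The words are invented, and \? and \+ assume GNU grep.

```shell
# Invented input: "g", then zero to three "o" characters, then "d".
words=$(printf '%s\n' 'gd' 'god' 'good' 'goood')

# o* : zero or more "o", so all four lines match.
star=$(printf '%s\n' "$words" | grep 'go*d')

# o\? : zero or one "o".
optional=$(printf '%s\n' "$words" | grep 'go\?d')

# o\+ : one or more "o".
plus=$(printf '%s\n' "$words" | grep 'go\+d')

# o\{2,3\} : between two and three "o".
interval=$(printf '%s\n' "$words" | grep 'go\{2,3\}d')

printf '%s\n' "$interval"
```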
Anchors:
- ^: Anchors to the start of the line.
- $: Anchors to the end of the line.
- \<: Anchors to the start of a word.
- \>: Anchors to the end of a word.
- \b: Matches at either edge of a word.
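Anchors constrain where a match may occur rather than what it contains. The sketch below uses made-up lines and assumes GNU grep for the word anchors.

```shell
# Made-up input containing "root" in different positions.
text=$(printf '%s\n' 'root login' 'chroot jail' 'the root' 'rooted')

# ^root : "root" must be the first thing on the line.
at_start=$(printf '%s\n' "$text" | grep '^root')

# root$ : "root" must be the last thing on the line.
at_end=$(printf '%s\n' "$text" | grep 'root$')

# \<root\> : "root" must stand alone as a whole word.
whole_word=$(printf '%s\n' "$text" | grep '\<root\>')

printf '%s\n' "$whole_word"
```

Note that "chroot jail" and "rooted" contain the string but fail the whole-word test.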
Practical Grep Examples
To display lines in /etc/passwd that do NOT end with /sbin/nologin:
grep -v "/sbin/nologin$" /etc/passwd
To find empty lines or lines containing only whitespace:
grep "^[[:space:]]*$" filename.txt
To match a complete word (e.g., 'root') using word boundaries:
grep "\<root\>" /etc/passwd
Grouping and Back References
Patterns can be grouped using \(\) in BRE. The matched content is stored in registers (\1, \2) for later reference.
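For example, to find lines where the same word occurs more than once, given a file repetition.txt with content:
Time after time.
Win win situation.
The command captures a word and then refers back to it. The -i option is needed because [a-z] does not match the capitalized first word in the C locale, and wrapping the back reference as \<\1\> keeps it from matching inside a longer word:
grep -i "\(\<[a-z]\+\>\).*\<\1\>" repetition.txt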
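A back reference matches the captured text anywhere it reappears, including inside a longer word. The sketch below compares an unbounded \1 with one wrapped in word anchors. It uses made-up lowercase lines and assumes GNU grep.

```shell
# Made-up lowercase input (an assumption for illustration).
phrases=$(printf '%s\n' 'time after time' 'a cat sat' 'win or lose')

# Unbounded \1: the word "a" is found again inside "cat", so line 2 also matches.
loose=$(printf '%s\n' "$phrases" | grep '\(\<[a-z]\+\>\).*\1')

# Bounded \<\1\>: the repeat must itself be a whole word, so only line 1 matches.
strict=$(printf '%s\n' "$phrases" | grep '\(\<[a-z]\+\>\).*\<\1\>')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```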
Egrep: Extended Regular Expressions
egrep simplifies syntax by removing the need to escape metacharacters. Quantifiers like +, ?, and {} are used without backslashes. It also introduces the logical OR operator |.
To search for lines starting with 'S' or 's' in /proc/meminfo:
egrep "^(s|S)" /proc/meminfo
To match numbers between 0 and 255 (useful for IP parsing logic). With no file argument, the command reads standard input:
egrep -o "\<([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\>"
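To see the 0 to 255 pattern at work, it can be fed a sample address on standard input. The address below is made up, grep -E stands in for egrep, and the word anchors assume GNU grep.

```shell
# The same 0-255 pattern, kept in a variable for reuse.
octet='\<([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\>'

# -o prints each in-range number on its own line.
octets=$(printf '192.168.1.254\n' | grep -E -o "$octet")

# 256 is out of range; the word anchors stop "25" or "6" from matching alone.
if printf '256\n' | grep -E -q "$octet"; then
  over_range=match
else
  over_range=no-match
fi

printf '%s\n' "$octets"
printf '256: %s\n' "$over_range"
```

This checks individual numbers only; validating a full dotted address would need the pattern repeated four times with literal dots between the groups.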
Supplementary Text Utilities
Beyond searching, Linux provides tools for cutting, sorting, and analyzing text streams.
cut
Extracts specific sections from each line of a file.
# Extract the 1st field using ':' as delimiter
cut -d':' -f1 /etc/passwd
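Field lists and ranges work the same way as a single field. A small sketch on one invented passwd-style record:

```shell
# Invented passwd-style record (an assumption for illustration).
record='alice:x:1001:1001:Alice Example:/home/alice:/bin/bash'

# A single field.
user=$(printf '%s\n' "$record" | cut -d':' -f1)

# A comma-separated list of fields; the delimiter is kept between them.
user_and_shell=$(printf '%s\n' "$record" | cut -d':' -f1,7)

# A range of fields.
ids=$(printf '%s\n' "$record" | cut -d':' -f3-4)

# -c selects by character position instead of by field.
first_five=$(printf '%s\n' "$record" | cut -c1-5)

printf '%s\n%s\n%s\n%s\n' "$user" "$user_and_shell" "$ids" "$first_five"
```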
sort
Sorts lines of text files. Common options include -n (numeric sort), -r (reverse order), -u (drop duplicate lines), -t (field separator), and -k (the field to use as the sort key).
# Sort numerically by the 3rd field only (-k3,3 ends the key at field 3; a bare -k3 runs to the end of the line)
sort -t':' -k3,3n /etc/passwd
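The effect of -n is clearest when the same key is sorted both ways. The records below are invented sample data.

```shell
# Invented records with a numeric third field and one duplicate line.
accounts=$(printf '%s\n' 'carol:x:100' 'alice:x:9' 'bob:x:25' 'alice:x:9')

# Without -n the key is compared as text, so "100" sorts before "25" and "9".
as_text=$(printf '%s\n' "$accounts" | sort -t':' -k3,3)

# With -n the key is compared as a number.
as_number=$(printf '%s\n' "$accounts" | sort -t':' -k3,3n)

# -r reverses the order and -u keeps one line per distinct key.
top_down=$(printf '%s\n' "$accounts" | sort -t':' -k3,3nr -u)

printf '%s\n' "$top_down"
```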
uniq
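Collapses adjacent duplicate lines into one. Because only neighboring lines are compared, the input is usually sorted first. The -c option prefixes each line with the number of times it occurred.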
sort data.log | uniq -c
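The following sketch shows why sorting first matters. The event names are made up, and awk is used only to strip the padding that uniq -c puts in front of each count.

```shell
# Made-up event stream in which duplicates are not all adjacent.
events=$(printf '%s\n' 'GET' 'POST' 'GET' 'GET' 'POST')

# uniq alone merges only neighboring duplicates, so GET and POST each survive twice.
unsorted=$(printf '%s\n' "$events" | uniq)

# Sorting first brings all duplicates together.
distinct=$(printf '%s\n' "$events" | sort | uniq)

# -c adds a count; awk reprints it as name=count without the leading spaces.
counts=$(printf '%s\n' "$events" | sort | uniq -c | awk '{ print $2 "=" $1 }')

printf '%s\n' "$counts"
```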
wc
Counts lines, words, and bytes.
# Count lines in a file
wc -l /etc/passwd
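Each count has its own option. A minimal sketch on two made-up lines, read from standard input so that no file name appears in the output:

```shell
# Two made-up lines: "a b" and "c d e".
text=$(printf 'a b\nc d e')

# -l counts newline characters, -w counts words, -c counts bytes.
line_count=$(printf '%s\n' "$text" | wc -l)
word_count=$(printf '%s\n' "$text" | wc -w)
byte_count=$(printf '%s\n' "$text" | wc -c)

printf 'lines=%d words=%d bytes=%d\n' "$line_count" "$word_count" "$byte_count"
```

Some implementations pad the number with leading spaces, which is why the sketch prints the values with %d rather than %s.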
diff
Compares files line by line.
diff original.file modified.file
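diff reports its result through the exit status as well as through its output: 0 when the files are identical, 1 when they differ. The sketch below creates two throwaway files in a temporary directory; the file contents are invented.

```shell
# Work in a throwaway directory with two invented files.
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/original.file"
printf 'alpha\nBETA\ngamma\n' > "$workdir/modified.file"

# Run diff inside "if" so a non-zero status is handled instead of treated as an error.
if diff "$workdir/original.file" "$workdir/modified.file" > "$workdir/changes.txt"; then
  status=same
else
  status=different
fi

# In the default format, "2c2" means line 2 was changed; "<" is the old text, ">" the new.
changes=$(cat "$workdir/changes.txt")
rm -r "$workdir"

printf '%s\n%s\n' "$status" "$changes"
```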