Fading Coder

Essential Linux Command-Line Text Processing Utilities

Tech May 12 3

Core Text Processing Utilities

Text manipulation is central to working on Linux systems, and three tools do most of the work. They are often called the 'text processing triad': grep, sed, and awk.

  • grep: Searches text using patterns defined by regular expressions.
  • sed: A stream editor for filtering and transforming text.
  • awk: A specialized language designed for text reporting and data formatting.

Grep: Pattern Matching and Extraction

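The grep command searches input files for lines that match a pattern and prints them. Its name comes from the ed editor command g/re/p (globally search for a regular expression and print). It supports three pattern modes; the -E and -F options are the preferred spellings, because recent GNU grep releases warn that the egrep and fgrep commands are obsolescent:

  • grep: Basic Regular Expressions (BRE), the default.
  • egrep (or grep -E): Extended Regular Expressions (ERE).
  • fgrep (or grep -F): Fixed strings; the pattern is taken literally, with no regex parsing.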
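
The difference between the three modes shows up as soon as a pattern contains a metacharacter. The following is a minimal sketch, assuming GNU grep; the sample lines and variable names are made up for illustration.

```shell
# Hypothetical sample input: three short lines, one with a literal plus sign.
sample=$(printf '%s\n' 'a+b' 'aab' 'ab')

# BRE (default): an unescaped + is a literal plus sign, so only "a+b" matches.
bre=$(printf '%s\n' "$sample" | grep 'a+b' | paste -sd, -)

# ERE: + means "one or more of the preceding item", so "aab" and "ab" match.
ere=$(printf '%s\n' "$sample" | grep -E 'a+b' | paste -sd, -)

# Fixed string: every character is literal, so again only "a+b" matches.
fixed=$(printf '%s\n' "$sample" | grep -F 'a+b' | paste -sd, -)

printf 'BRE:   %s\nERE:   %s\nFixed: %s\n' "$bre" "$ere" "$fixed"
```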

Common Command Options

  • --color=auto: Highlights matching text when the output goes to a terminal.
  • -i: Case-insensitive search.
  • -v: Inverts the match, showing non-matching lines.
  • -o: Outputs only the matched string, not the full line.
  • -q: Quiet mode; returns exit status without output.
  • -A n, -B n, -C n: Displays n lines of context after the match (-A), before it (-B), or on both sides (-C).
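
The options above can be tried on a few in-memory lines. This is only a sketch, assuming GNU grep; the log lines and variable names are hypothetical.

```shell
# Hypothetical log lines used only for this illustration.
log=$(printf '%s\n' 'INFO start' 'error: disk full' 'INFO retry' 'ERROR: timeout' 'INFO done')

# -i: match "error" regardless of case.
any_case=$(printf '%s\n' "$log" | grep -i 'error')

# -v: keep only the lines that do NOT contain "INFO".
not_info=$(printf '%s\n' "$log" | grep -v 'INFO')

# -o: print just the matched text instead of the whole line.
only_match=$(printf '%s\n' "$log" | grep -o 'disk [a-z]*')

# -q: print nothing; the exit status alone says whether a match exists.
if printf '%s\n' "$log" | grep -q 'timeout'; then found=yes; else found=no; fi

# -A 1: print each matching line plus one line of trailing context.
with_context=$(printf '%s\n' "$log" | grep -A 1 'disk full')

printf '%s\n' "$any_case" "$only_match" "$found" "$with_context"
```

Because -q produces no output, it is mostly useful inside if statements and other places where only the exit status matters.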

Regular Expression Fundamentals

Regular expressions define the search pattern using metacharacters.

Character Matching:

  • .: Matches any single character.
  • []: Matches any single character within the brackets.
  • [^]: Matches any single character NOT in the brackets.
  • POSIX classes such as [:digit:], [:alpha:], and [:space:] go inside a bracket expression, for example [[:digit:]].
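
A short sketch of the character-matching forms listed above, using made-up three-letter strings; the variable names are illustrative only.

```shell
# Hypothetical sample strings.
items=$(printf '%s\n' 'cat' 'cot' 'cut' 'c9t' 'ct')

# . matches exactly one character, so "ct" is the only line left out.
dot=$(printf '%s\n' "$items" | grep 'c.t' | paste -sd, -)

# [ao] matches a single "a" or "o".
in_set=$(printf '%s\n' "$items" | grep 'c[ao]t' | paste -sd, -)

# [^ao] matches a single character that is neither "a" nor "o".
not_in_set=$(printf '%s\n' "$items" | grep 'c[^ao]t' | paste -sd, -)

# A POSIX class sits inside its own bracket expression: [[:digit:]].
digit=$(printf '%s\n' "$items" | grep 'c[[:digit:]]t' | paste -sd, -)

printf '%s\n' "$dot" "$in_set" "$not_in_set" "$digit"
```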

Quantifiers (BRE syntax):

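  • *: Matches the preceding item zero or more times.
  • .*: Matches any sequence of characters, including an empty one.
  • \?: Matches the preceding item zero or one time (GNU extension in BRE).
  • \+: Matches the preceding item one or more times (GNU extension in BRE).
  • \{m,n\}: Matches the preceding item between m and n times.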
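
The quantifiers can be compared side by side on words that differ only in how many times one letter repeats. A minimal sketch, assuming GNU grep (needed for \? and \+ in BRE); the sample words are made up.

```shell
# Hypothetical sample words with zero to three "o" characters.
words=$(printf '%s\n' 'gd' 'god' 'good' 'goood')

# o* allows zero or more "o": every word matches.
star=$(printf '%s\n' "$words" | grep 'go*d' | paste -sd, -)

# o\? allows zero or one "o".
optional=$(printf '%s\n' "$words" | grep 'go\?d' | paste -sd, -)

# o\+ requires at least one "o".
plus=$(printf '%s\n' "$words" | grep 'go\+d' | paste -sd, -)

# o\{2,3\} requires two or three "o" characters.
range=$(printf '%s\n' "$words" | grep 'go\{2,3\}d' | paste -sd, -)

printf '%s\n' "$star" "$optional" "$plus" "$range"
```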

Anchors:

  • ^: Anchors to the start of the line.
  • $: Anchors to the end of the line.
  • \<: Anchors to the start of a word.
  • \>: Anchors to the end of a word.
  • \b: Anchors to either edge of a word (like \< and \>, a GNU extension).
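
The effect of each anchor is easiest to see when the same word appears in different positions. This sketch assumes GNU grep; the sample lines are hypothetical.

```shell
# Hypothetical lines that contain "root" at different positions.
text=$(printf '%s\n' 'root:x:0' 'chroot jail' 'the root' 'rooted')

# ^root: the line must begin with "root".
line_start=$(printf '%s\n' "$text" | grep '^root' | paste -sd, -)

# root$: the line must end with "root".
line_end=$(printf '%s\n' "$text" | grep 'root$' | paste -sd, -)

# \<root: "root" must begin a word, which excludes "chroot".
word_start=$(printf '%s\n' "$text" | grep '\<root' | paste -sd, -)

# root\>: "root" must end a word, which excludes "rooted".
word_end=$(printf '%s\n' "$text" | grep 'root\>' | paste -sd, -)

printf '%s\n' "$line_start" "$line_end" "$word_start" "$word_end"
```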

Practical Grep Examples

To display lines in /etc/passwd that do NOT end with /sbin/nologin:

grep -v "/sbin/nologin$" /etc/passwd

To find empty lines or lines containing only whitespace:

grep "^[[:space:]]*$" filename.txt

To match a complete word (e.g., 'root') using word boundaries:

grep "\<root\>" /etc/passwd

Grouping and Back References

In BRE, \( and \) group part of a pattern. The text matched by each group is captured and can be matched again later in the same pattern with a back reference: \1 for the first group, \2 for the second, and so on.

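For example, to find lines where a word is repeated on the same line, given a file repetition.txt with content:

Time after time.
Win win situation.

The command captures a whole word and then requires the same word to appear again. The -i option is needed because the repeated words differ in case, and the \< \> around \1 keep the back reference from matching inside a longer word:

grep -i "\(\<[a-z]\+\>\).*\<\1\>" repetition.txt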
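
A self-contained variant of the repeated-word search can be run without creating a file. This is a sketch, assuming GNU grep; it adds -i so that "Time" and "time" count as the same word, and wraps \1 in \< \> so the back reference only matches a whole word. The third sample line is invented as a negative case.

```shell
# The two lines from repetition.txt plus one hypothetical line with no repeat.
sample=$(printf '%s\n' 'Time after time.' 'Win win situation.' 'No repeats here.')

# Group 1 captures one whole word; \<\1\> requires that same word again,
# compared case-insensitively because of -i.
repeated=$(printf '%s\n' "$sample" | grep -i '\(\<[a-z]\+\>\).*\<\1\>')

printf '%s\n' "$repeated"
```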

Egrep: Extended Regular Expressions

egrep (equivalent to grep -E) simplifies the syntax: the quantifiers +, ?, and {}, as well as grouping parentheses (), are written without backslashes. Alternation with | ('match either side') also needs no escaping.

To search for lines starting with 'S' or 's' in /proc/meminfo:

egrep "^(s|S)" /proc/meminfo

To extract whole numbers between 0 and 255 (a building block for validating IPv4 octets); with no file argument, the command reads from standard input:

egrep -o "\<([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\>"
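
The 0 to 255 pattern can be checked against a line that mixes valid and out-of-range numbers. A minimal sketch, assuming GNU grep and using grep -E in place of egrep; the input line is made up.

```shell
# Hypothetical input: four valid octets and one out-of-range number.
line='10.0.255.7 and 300'

# -o prints each match on its own line; \< and \> force whole-number matches,
# so "300" is rejected instead of being reported as "30" or "3".
octets=$(printf '%s\n' "$line" \
  | grep -E -o '\<([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\>' \
  | paste -sd, -)

printf '%s\n' "$octets"
```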

Supplementary Text Utilities

Beyond searching, Linux provides tools for cutting, sorting, and analyzing text streams.

cut

Extracts specific sections from each line of a file.

# Extract the 1st field using ':' as delimiter
cut -d':' -f1 /etc/passwd
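
Field and character selection can be tried on passwd-style lines without touching the real file. The two records below are hypothetical, and the variable names are illustrative.

```shell
# Hypothetical passwd-style records.
records=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin')

# -f1: the first ':'-separated field (the user name).
names=$(printf '%s\n' "$records" | cut -d':' -f1 | paste -sd, -)

# -f1,7: several fields at once, still joined by the delimiter.
name_and_shell=$(printf '%s\n' "$records" | cut -d':' -f1,7)

# -c1-4: select by character position instead of by field.
first_chars=$(printf '%s\n' "$records" | cut -c1-4 | paste -sd, -)

printf '%s\n' "$names" "$name_and_shell" "$first_chars"
```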

sort

Sorts the lines of text files. Common options include -n (numeric sort), -r (reverse order), -u (drop duplicate lines), -t (field separator), and -k (sort key field).

# Sort numerically by the 3rd field only (-k3,3 stops the key at field 3)
sort -t':' -k3,3n /etc/passwd
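
The difference between a numeric and a lexical sort on the same key is worth seeing once. This sketch uses three made-up records with a numeric third field; the names are hypothetical.

```shell
# Hypothetical records: name, placeholder, numeric id.
users=$(printf '%s\n' 'carol:x:1000' 'alice:x:5' 'bob:x:33')

# Numeric sort on field 3 only: 5 < 33 < 1000.
by_id=$(printf '%s\n' "$users" | sort -t':' -k3,3n | paste -sd, -)

# Without -n the key is compared as text: "1000" < "33" < "5".
as_text=$(printf '%s\n' "$users" | sort -t':' -k3,3 | paste -sd, -)

# -r reverses the numeric order.
descending=$(printf '%s\n' "$users" | sort -t':' -k3,3nr | paste -sd, -)

printf '%s\n' "$by_id" "$as_text" "$descending"
```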

uniq

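Collapses adjacent duplicate lines. Because only neighboring lines are compared, the input is usually sorted first; -c prefixes each line with its number of occurrences.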

sort data.log | uniq -c
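
Why the sort step matters can be shown with a few made-up event names. In this sketch the sed call only strips the padding that uniq -c puts in front of each count.

```shell
# Hypothetical event stream with a non-adjacent duplicate ("login").
events=$(printf '%s\n' 'login' 'logout' 'login' 'login' 'error')

# Without sorting, only the two adjacent "login" lines are merged.
unsorted=$(printf '%s\n' "$events" | uniq | paste -sd, -)

# After sorting, all duplicates are adjacent, so the counts are complete.
counts=$(printf '%s\n' "$events" | sort | uniq -c | sed 's/^ *//')

printf '%s\n' "$unsorted" "$counts"
```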

wc

Counts lines, words, and bytes.

# Count lines in a file
wc -l /etc/passwd
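
The three counters can be compared on a small piece of text read from standard input. A minimal sketch; tr removes the padding that some wc implementations add around the number.

```shell
# Two lines, three words, fourteen bytes (including both newlines).
lines=$(printf 'one two\nthree\n' | wc -l | tr -d ' ')
words=$(printf 'one two\nthree\n' | wc -w | tr -d ' ')
bytes=$(printf 'one two\nthree\n' | wc -c | tr -d ' ')

printf 'lines=%s words=%s bytes=%s\n' "$lines" "$words" "$bytes"
```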

diff

Compares files line by line.

diff original.file modified.file
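
To see what the default output looks like, the comparison can be run on two throwaway files. This is a sketch only; the file contents are invented and the files live in a temporary directory that is removed afterwards.

```shell
# Two hypothetical files that differ only on their second line.
tmpdir=$(mktemp -d)
printf '%s\n' 'alpha' 'beta' 'gamma' > "$tmpdir/original.file"
printf '%s\n' 'alpha' 'BETA' 'gamma' > "$tmpdir/modified.file"

# diff exits with status 1 when the files differ, so "|| true" keeps a
# script running under "set -e".
changes=$(diff "$tmpdir/original.file" "$tmpdir/modified.file" || true)
printf '%s\n' "$changes"

rm -r "$tmpdir"
```

In the output, "2c2" means line 2 of the first file was changed into line 2 of the second; lines starting with "<" come from the first file and lines starting with ">" from the second.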
