Fading Coder


AWK Programming: Text Processing and Data Manipulation

Tech May 14 1

Introduction to AWK

AWK is one of the most powerful data processing utilities available in Linux and UNIX environments. Its name derives from the initials of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan. AWK is a programming language specifically designed for text processing and report generation. Its architecture makes it particularly suitable for handling text data organized in rows and columns. As a programming language environment, AWK provides features such as regular expression matching, flow control, operators, expressions, variables, and functions, drawing inspiration from the C language.

AWK Workflow

AWK processes input in the following steps:
1. Reads the next line of text from the specified data file.
2. Updates the values of AWK's built-in system variables, such as the field count variable NF, the record number variable NR, the record variable $0, and the individual field variables $1, $2, and so on.
3. Evaluates every pattern in the program in order and runs the action of each pattern that matches.
4. If there are still lines to read in the data file, returns to step 1 and repeats the process.
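The steps above can be observed on a small input. The following is a minimal sketch, assuming a hypothetical three-line sample held in a shell variable; it prints the built-in variables that AWK refreshes for each line it reads.

```shell
# Hypothetical three-line input with a varying number of fields per line.
sample='alpha beta
gamma
delta epsilon zeta'

# For each record, print the record number (NR), the field count (NF),
# the first field ($1), and the whole record ($0) in brackets.
workflow_out=$(printf '%s\n' "$sample" | awk '{ print NR, NF, $1, "[" $0 "]" }')
printf '%s\n' "$workflow_out"
```

Each output line corresponds to one pass through the read, update, execute cycle.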

Basic AWK Command Syntax

Every AWK statement consists of a pattern and an action:
- Pattern: A set of rules that test whether an input line should trigger the action.
- Action: The execution process containing statements, functions, and expressions.
In short, the pattern determines when and how the action is triggered, while the action performs the processing of the input line.

Format:

awk 'BEGIN{ commands } pattern{ commands } END{ commands }' [INPUTFILE…]
All three sections (BEGIN, pattern, and END) are optional.

BEGIN and END Patterns

The BEGIN pattern is a special built-in pattern that executes when the AWK program starts, before any data is read. Its action is executed only once. BEGIN is typically used for printing report headers or initializing variables. The END pattern is another special pattern that executes after the AWK program has processed all data and is about to exit. Like BEGIN, its action is executed only once during the program's lifecycle. END is generally used for final summaries or totals.
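As an illustration of the header and summary roles described above, here is a minimal sketch that assumes hypothetical name and score data. BEGIN prints a header, the main rule accumulates a running total, and END prints the total.

```shell
# Hypothetical name/score data.
scores='alice 90
bob 75
carol 85'

report=$(printf '%s\n' "$scores" | awk '
    BEGIN { print "name score" }        # runs once, before any input is read
    { total += $2; print $1, $2 }       # runs for every record
    END { print "total", total }        # runs once, after the last record
')
printf '%s\n' "$report"
```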

AWK Output

awk 'BEGIN{ commands } {print item1, item2, …} END{ commands }' [INPUTFILE…]
Items are separated by commas in the command, and print writes them separated by the output field separator OFS, which defaults to a single space. The output items can be strings, numbers, record fields (like $1), variables, or AWK expressions. Numbers are converted to strings before output.
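The effect of the comma is easiest to see side by side. This sketch assumes a hypothetical colon-separated line; it prints the same two fields with a comma, without a comma (plain concatenation), and with a custom OFS.

```shell
line='root:x:0'   # hypothetical input line

# Comma: items are joined with OFS (a single space by default).
with_comma=$(printf '%s\n' "$line" | awk -F: '{ print $1, $3 }')

# No comma: the two fields are concatenated into one item.
without_comma=$(printf '%s\n' "$line" | awk -F: '{ print $1 $3 }')

# Comma with OFS set to "-" on the command line.
custom_ofs=$(printf '%s\n' "$line" | awk -F: -v OFS='-' '{ print $1, $3 }')

printf '%s\n' "$with_comma" "$without_comma" "$custom_ofs"
```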

AWK Program Execution Methods

Executing AWK Programs via Command Line

awk '/^$/{print "This is a blank line"}' input_file

Executing AWK Scripts via Command

For programs with multiple statements, you can write them in a script file and execute them using the awk command with the -f option:
awk -f program-file data_file

Executing AWK Scripts Directly

You can execute AWK programs like shell scripts by specifying the interpreter and granting execute permissions:
#!/bin/awk -f
# AWK program code
Execute with:
./awk-script.awk data_file

Records and Fields

AWK views input files as structured data, defining each input line as a record and each string within a line as a field. Fields are separated by delimiters. By default, any run of spaces or tabs acts as a single delimiter, and other characters can be chosen instead. AWK uses the field operator $ to specify fields for actions. The field operator is followed by a number or variable to indicate field position. Fields in each record are numbered starting from 1, where $1 represents the first field and $0 represents the entire record.
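The splitting rules can be checked with two short inputs. This is a sketch on hypothetical data: the first command relies on the default whitespace splitting, and the second uses an explicit colon separator.

```shell
# Default splitting: leading blanks are ignored and runs of spaces or tabs
# count as a single separator.
default_split=$(printf '  one   two\tthree  \n' | awk '{ print NF ":" $1 ":" $2 ":" $3 }')

# With an explicit single-character separator, every occurrence counts,
# so the empty string between the two colons is a field of its own.
colon_split=$(printf 'a::b\n' | awk -F: '{ print NF ":" "[" $2 "]" }')

printf '%s\n' "$default_split" "$colon_split"
```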

Using the -F Parameter to Specify Field Delimiters

awk -F ":" '{print $1}' /etc/passwd

Changing Delimiters via the FS System Variable

The default delimiter is stored in the FS variable. You can modify it to change field separators:
awk 'BEGIN {FS=":"} {print $2}' /etc/passwd

NR, NF, and FILENAME Variables

- NR: Number of records (lines) read so far, counted across all input files
- NF: Number of fields in the current record
- FILENAME: Name of the file currently being processed
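To see these three variables together, the sketch below writes a hypothetical two-line file named demo.txt into a temporary directory and prints FILENAME, NR, and NF for each record. The file name and contents are assumptions made for the illustration.

```shell
# Create a throwaway directory holding a hypothetical input file.
tmpdir=$(mktemp -d)
printf 'a b c\nd e\n' > "$tmpdir/demo.txt"

# Print the file name, the record number, and the field count per record.
vars_out=$(cd "$tmpdir" && awk '{ print FILENAME, NR, NF }' demo.txt)
rm -rf "$tmpdir"

printf '%s\n' "$vars_out"
```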

AWK Variables

AWK supports variable operations including definition, reference, and computation. It also includes many built-in system variables. Variable names in AWK can contain letters, numbers, and underscores, but cannot start with a number. AWK variable names are case-sensitive. AWK variables can be of two types: string and numeric. When defining AWK variables, you don't need to specify the type—AWK automatically determines it based on context.
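The context-dependent typing can be sketched as follows. The variable names n and missing are hypothetical; the same variable is used once as a number and once as a string, and an unset variable is shown in both contexts.

```shell
typing=$(awk 'BEGIN {
    n = "3"                   # assigned as a string
    print n + 4               # arithmetic context: treated as the number 3
    print n "4"               # concatenation context: treated as the string "3"
    print missing + 0         # an unset variable is 0 in a numeric context
    print "[" missing "]"     # and the empty string in a string context
}')
printf '%s\n' "$typing"
```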

Built-in Variables

AWK provides several built-in variables for common operations:
awk -F ":" 'BEGIN{OFS="\t"} {print $1,$2}' /etc/passwd

User-defined Variables

Users can define their own variables for use in program code:
awk 'BEGIN {test="hello world"; print test}'

AWK Operators

AWK supports various operators and expressions commonly found in programming languages.

Arithmetic Operators

awk 'BEGIN{x=2;y=3; print x+y, x-y, x/y, x%y, x^y, x**y}'
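For reference, here is a sketch of the same arithmetic captured into a shell variable. It leaves out the ** form and uses only ^ for exponentiation, since ^ is the operator AWK implementations are generally expected to support.

```shell
# Addition, subtraction, division, modulo, and exponentiation with x=2, y=3.
arith=$(awk 'BEGIN { x = 2; y = 3; print x + y, x - y, x / y, x % y, x ^ y }')
printf '%s\n' "$arith"
```

The division result is not an integer, so print formats it with the default numeric output format, which keeps six significant digits.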

Assignment Operators

= += -= /= *= %= ^=
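A short sketch that chains the compound assignment operators on a hypothetical variable n, with the intermediate values noted in the comment:

```shell
# n: 10 -> 15 -> 12 -> 24 -> 6 -> 2 -> 8
assigned=$(awk 'BEGIN { n = 10; n += 5; n -= 3; n *= 2; n /= 4; n %= 4; n ^= 3; print n }')
printf '%s\n' "$assigned"
```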

Ternary Conditional Operator

expression ? value1 : value2
This operator returns value1 if the expression is true, otherwise value2.
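As a small illustration, the sketch below labels hypothetical scores as pass or fail with the ternary operator. The threshold of 60 is an assumption chosen for the example.

```shell
# Print each score followed by "pass" when it is at least 60, else "fail".
grades=$(printf '55\n80\n' | awk '{ print $1, ($1 >= 60 ? "pass" : "fail") }')
printf '%s\n' "$grades"
```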

Logical Operators

Symbols: && (AND), || (OR), ! (NOT)

Relational Operators

Symbols: >, <, >=, <=, ==, !=, ~ (matches), !~ (does not match)
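The match operators compare a value against a regular expression rather than a literal. This sketch assumes two hypothetical user:shell lines and tests the second field with ~ and !~.

```shell
# Field 2 is tested against the regular expression /bash$/.
shells=$(printf 'root:/bin/bash\ndaemon:/usr/sbin/nologin\n' | awk -F: '
    $2 ~ /bash$/  { print $1, "bash" }     # field 2 ends in "bash"
    $2 !~ /bash$/ { print $1, "other" }    # field 2 does not
')
printf '%s\n' "$shells"
```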

AWK Patterns

AWK supports various pattern types:

Relational Expressions

awk '$2>80 {print}' data_file

Regular Expressions

Like sed, AWK regular expressions must be enclosed between slashes (/regex/):
awk '/^l/ {print}' data_file

Mixed Patterns

AWK supports combining relational expressions, regular expressions, and logical operators (&&, ||, !):
awk '/^l/ && $2 > 80 {print}' data_file
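To make the combined patterns concrete, the sketch below runs them against hypothetical name and score data supplied through a shell variable instead of data_file.

```shell
data='lemon 90
lime 70
apple 95
pear 60'

# Starts with "l" AND score above 80.
both=$(printf '%s\n' "$data" | awk '/^l/ && $2 > 80 { print }')

# Starts with "l" OR score above 80.
either=$(printf '%s\n' "$data" | awk '/^l/ || $2 > 80 { print $1 }')

# Does NOT start with "l".
negated=$(printf '%s\n' "$data" | awk '!/^l/ { print $1 }')

printf '%s\n' "$both" "$either" "$negated"
```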

AWK Control Statements

if Statement

Similar to C language:
if (expression) {
    statements
} else {
    statements
}

for Loop

Similar to C language:
for (initialization; condition; increment) {
    statements
}

while Loop

while (expression) {
    statements
}

do-while Loop

do {
    statements
} while (expression)

break, continue, and next Statements

- break: Exits the current loop
- continue: Skips to the next iteration of the loop
- next: Stops processing the current record, skips any remaining patterns and actions, and reads the next line
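The three statements can be combined in one short program. This is a sketch on hypothetical input: a comment line is skipped with next, the value 2 is skipped with continue, and the loop stops at the value 4 with break.

```shell
flow=$(printf '# comment\n1 2 3 4 5\n' | awk '
    /^#/ { next }                       # skip comment lines entirely
    {
        for (i = 1; i <= NF; i++) {
            if ($i == 2) continue       # skip this field, keep looping
            if ($i == 4) break          # leave the loop for this record
            print $i
        }
    }
')
printf '%s\n' "$flow"
```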

exit Statement

Terminates the AWK program execution.
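Two details of exit are worth a sketch: when it is called from a main rule, AWK stops reading input but still runs the END action, and an optional expression becomes the exit status of the process. The input lines and the status value 3 below are hypothetical.

```shell
exit_status=0
exit_out=$(awk '
    $1 == "stop" { exit 3 }     # stop reading input, request status 3
    { print }
    END { print "done" }        # still runs after exit in a main rule
' <<'EOF'
a
b
stop
c
EOF
) || exit_status=$?

printf '%s\n' "$exit_out"
printf 'status: %s\n' "$exit_status"
```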

Formatted Output

Format

Similar to C language:
printf("format", output_list)

Format Specifiers

- %c: Character
- %d, %i: Decimal integer
- %u: Unsigned integer
- %f: Floating-point number
- %e, %E: Scientific notation
- %s: String
- %%: Display a percent sign

Format Modifiers

- N: A number that sets the minimum field width
- -: Left alignment within the field
- +: Display the sign of numeric values
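The modifiers are easier to read with visible field boundaries. In this sketch the square brackets are only there to show the padding; the values are hypothetical.

```shell
formatted=$(awk 'BEGIN {
    printf "[%5d]\n", 42        # width 5, right-aligned by default
    printf "[%-5d]\n", 42       # width 5, left-aligned
    printf "[%+d]\n", 42        # explicit sign
    printf "[%-8s]\n", "awk"    # string padded to width 8, left-aligned
}')
printf '%s\n' "$formatted"
```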

Example

awk 'BEGIN{printf("%-10s\t%-10s\n","name","score")} {printf("%-10s\t%-10s\n",$1,$2)}' data_file

AWK Arrays

Indexed Arrays

Indexed arrays use numbers as subscripts:
awk 'BEGIN{a[0]="a";a[1]="b";a[2]="c";print a[0],a[1],a[2]}'

Associative Arrays

Associative arrays use strings as subscripts:
awk 'BEGIN{a["one"]="first";a["two"]="second";print a["one"],a["two"]}'

Looping Through Arrays

For Loop with Index

Here n is a placeholder for the number of elements in the array:
for (i = 0; i < n; i++) {
    statements
}

For Loop with 'in' Operator

for (variable in array) {
    statements
}
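One practical difference between the two loop forms is ordering: AWK does not specify the order in which for (variable in array) visits subscripts, whereas an index loop visits them in numeric order. The sketch below uses the standard split() function, which is not covered above, to fill a hypothetical array with subscripts 1 through n and then walks it by index.

```shell
# split() stores the pieces in colors[1], colors[2], ... and returns the count.
ordered=$(awk 'BEGIN {
    n = split("red green blue", colors, " ")
    for (i = 1; i <= n; i++)
        print i, colors[i]
}')
printf '%s\n' "$ordered"
```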

Counting Occurrences with Arrays

Arrays can be used to count occurrences of values:
awk '{count[$1]++} END{for(i in count) {print i,"count:",count[i]}}' data_file
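Here is a sketch of the same counting idiom on hypothetical input. Because the traversal order of for (k in count) is unspecified, the result is piped through sort to make the output stable.

```shell
# Count how often each word appears in the first field, then sort the report.
counts=$(printf 'apple\npear\napple\nfig\napple\npear\n' |
    awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' |
    sort)
printf '%s\n' "$counts"
```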

Practical Examples

Displaying User UIDs

awk -F ":" '{print $3}' /etc/passwd

Displaying User UIDs with Header

awk -F ":" 'BEGIN{print "UserID"}{print $3}' /etc/passwd

Displaying Users with /bin/bash Shell

awk -F ":" '/bash$/ {print $1} END{print "End of list"}' /etc/passwd

Displaying Users with GID 0

awk -F ":" '$4==0{print $1}' /etc/passwd

Displaying Users with GID Greater Than 500

awk -F ":" '$4>500{print $1}' /etc/passwd

Displaying Usernames and UIDs with Custom Separator

awk -F ":" 'BEGIN{OFS="###"} {print $1,$3}' /etc/passwd

Displaying Last Field of /etc/passwd

awk -F ":" '{print $NF}' /etc/passwd

Numbering Lines in /etc/passwd

awk '{print NR,$0}' /etc/passwd

Numbering Lines Across Multiple Files

awk '{print FNR,$0}' /etc/passwd /etc/fstab
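FNR restarts at 1 for each input file, while NR keeps counting across all of them. The sketch below shows both side by side, assuming two hypothetical files created in a temporary directory.

```shell
# Create two throwaway input files.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\n' > "$tmpdir/second.txt"

# Print the file name, the global record number, the per-file record number,
# and the record itself.
numbering=$(cd "$tmpdir" && awk '{ print FILENAME, NR, FNR, $0 }' first.txt second.txt)
rm -rf "$tmpdir"

printf '%s\n' "$numbering"
```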

Using Custom Variables

awk -v var="example.com" 'BEGIN{print var}'

Formatted Output of User Information

awk -F ":" '{printf "%-15s %d %8i\n",$1,$3,$4}' /etc/passwd

Classifying Users

awk -F ":" '{if ($1=="root") printf "%-15s: %s\n", $1,"Admin"; else printf "%-15s: %s\n", $1, "Regular User"}' /etc/passwd

Counting Users with UID Greater Than 500

awk -F ":" -v count=0 '{if ($3>500) count++}END{print count}' /etc/passwd

Displaying Fields with 4+ Characters

awk -F ":" '{i=1;while (i<=NF) { if(length($i)>=4) {print $i}; i++ }}' /etc/passwd

Displaying First Three Fields with do-while

awk -F: '{i=1;do {print $i;i++}while(i<=3)}' /etc/passwd

Displaying First Three Fields with for

awk -F: '{for(i=1;i<=3;i++) print $i}' /etc/passwd

Counting Shell Types

awk -F ":" '$NF!~/^$/{shellCount[$NF]++}END{for(type in shellCount){printf "%-15s:%i\n",type,shellCount[type]}}' /etc/passwd

Displaying Users with Even UID

awk -F ":" '{if($3%2==1) next;{printf "%-15s%d\n",$1,$3}}' /etc/passwd

Counting TCP Connection States

netstat -ant | awk '/^tcp/ {++stateCount[$NF]} END {for(state in stateCount) print state, stateCount[state]}'

Custom Line Separators

awk -F ":" 'BEGIN{ORS="||||"}{print $0}' /etc/passwd
