Introduction to AWK
AWK is one of the most powerful data processing utilities available in Linux and UNIX environments. Its name derives from the initials of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan.
AWK is a programming language specifically designed for text processing and report generation. Its architecture makes it particularly suitable for handling text data organized in rows and columns.
As a programming language environment, AWK provides features such as regular expression matching, flow control, operators, expressions, variables, and functions, drawing inspiration from the C language.
AWK Workflow
AWK processes its input in the following steps:
1. Reads the next line of text from the specified data file.
2. Updates the values of AWK's built-in system variables, such as the field count variable NF, record number variable NR, record variable $0, and individual field variables $1, $2, etc.
3. Tests each pattern in the program in order and executes the action of every pattern that matches the line.
4. After all patterns and actions have been processed, if there are still lines to read in the data file, returns to step 1 and repeats the process.
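The loop described above can be observed directly. The following is a minimal sketch, assuming two made-up input lines supplied through printf instead of a real data file; it prints the variables that AWK refreshes for each record.

```shell
# Hypothetical two-line input; each line becomes one record.
out=$(printf 'alice 90 math\nbob 72\n' |
  awk '{ print "record " NR ": " NF " fields, first=" $1 ", whole=[" $0 "]" }')
echo "$out"
```

NR grows by one per line, while NF, $0, and $1 are recomputed each time a new line is read.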
Basic AWK Command Syntax
Every AWK rule consists of a pattern and an action:
- Pattern: A condition that tests whether an input line should trigger the action. If the pattern is omitted, the action runs for every line.
- Action: The statements, functions, and expressions to execute, enclosed in braces. If the action is omitted, the matching line is printed.
In short, the pattern determines when the action is triggered, while the action performs the processing of the input line.
Format:
awk 'BEGIN{ commands } pattern{ commands } END{ commands }' [INPUTFILE…]
All three sections (BEGIN, pattern, and END) are optional.
BEGIN and END Patterns
The BEGIN pattern is a special built-in pattern that executes when the AWK program starts but before any data is read. The corresponding action is executed only once. BEGIN is typically used for printing report headers or initializing variables.
The END pattern is another special pattern that executes when the AWK program has processed all data and is about to exit. Like BEGIN, its action is executed only once during the program's lifecycle. END is generally used for final summaries or totals.
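As an illustration of how the three sections cooperate, here is a small sketch that prints a header, echoes each record, and reports a total. The name/score data is invented for the example.

```shell
# Hypothetical name/score data supplied inline.
out=$(printf 'alice 90\nbob 72\n' |
  awk 'BEGIN { print "name score" }      # runs once, before any input
       { total += $2; print $1, $2 }     # runs for every record
       END { print "total", total }')    # runs once, after the last record
echo "$out"
```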
AWK Output
awk 'BEGIN{ commands } {print item1, item2, …} END{ commands }' [INPUTFILE…]
Items are separated by commas in the command; in the output they are separated by the value of the OFS variable, which is a single space by default. The output items can be strings, numbers, record fields (like $1), variables, or AWK expressions. Numbers are converted to strings before output.
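The effect of the comma is easiest to see side by side. This sketch, using one invented input line, prints the same two fields with a comma, without a comma, and after changing OFS (AWK's output field separator variable).

```shell
out=$(printf 'alice 90\n' | awk '{
  print $1, $2          # comma: items joined by OFS (a space by default)
  print $1 $2           # no comma: the strings are concatenated
  OFS = "-"
  print $1, $2          # same statement after changing OFS
}')
echo "$out"
```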
AWK Program Execution Methods
Executing AWK Programs via Command Line
awk '/^$/{print "This is a blank line"}' input_file
Executing AWK Scripts via Command
For programs with multiple statements, you can write them in a script file and execute them using the awk command with the -f option:
awk -f program-file data_file
Executing AWK Scripts Directly
You can execute AWK programs like shell scripts by specifying the interpreter and granting execute permissions:
#!/bin/awk -f
# AWK program code
Execute with:
./awk-script.awk data_file
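The script-file approach can be tried end to end with the sketch below. The file names sum.awk and scores.txt and the temporary directory are assumptions made for the example, not names from this article.

```shell
# Write a small program file into a temporary directory (paths are illustrative).
dir=$(mktemp -d)
cat > "$dir/sum.awk" <<'EOF'
# sum.awk: add up the second column
{ total += $2 }
END { print "total:", total }
EOF
printf 'alice 90\nbob 72\n' > "$dir/scores.txt"

# Run the program file against the data file with -f.
out=$(awk -f "$dir/sum.awk" "$dir/scores.txt")
echo "$out"
rm -r "$dir"
```

To run such a file directly, it would also need an interpreter line as its first line and execute permission (chmod +x). The location of the awk binary differs between systems, and `command -v awk` shows where it is installed.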
Records and Fields
AWK views input files as structured data, defining each input line as a record and each string within a line as a field. Fields are separated by delimiters (space, tab, or other symbols); by default, any run of spaces or tabs acts as a single delimiter.
AWK uses the field operator $ to specify fields for actions. The field operator is followed by a number or variable to indicate field position. Fields in each record are numbered starting from 1, where $1 represents the first field; $0 represents the entire record.
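A quick sketch of default field splitting, using an invented line that mixes leading spaces, repeated spaces, and a tab:

```shell
# Leading blanks are ignored and each run of spaces/tabs counts as one separator.
out=$(printf '  alice   90\tmath\n' | awk '{ print NF ": " $1 "|" $2 "|" $3 }')
echo "$out"
```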
Using the -F Parameter to Specify Field Delimiters
awk -F ":" '{print $1}' /etc/passwd
Changing Delimiters via the FS System Variable
The field delimiter is stored in the FS variable. You can modify it, typically in a BEGIN block so that it takes effect before the first line is read, to change the field separator:
awk 'BEGIN {FS=":"} {print $2}' /etc/passwd
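The two ways of setting the delimiter should behave identically. The sketch below checks this on a single line in the usual /etc/passwd layout, supplied inline so the result does not depend on the local system.

```shell
line='root:x:0:0:root:/root:/bin/bash'   # typical /etc/passwd layout
a=$(echo "$line" | awk -F ':' '{ print $1, $7 }')                # delimiter via -F
b=$(echo "$line" | awk 'BEGIN { FS = ":" } { print $1, $7 }')    # delimiter via FS
echo "$a"
echo "$b"
```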
NR, NF, and FILENAME Variables
- NR: Record number (number of lines processed)
- NF: Number of fields in the current record
- FILENAME: Name of the file being processed
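To see these three variables together, the following sketch reads two small files. The names one.txt and two.txt are made up for the example.

```shell
dir=$(mktemp -d)
printf 'a b\nc\n' > "$dir/one.txt"
printf 'd e f\n'  > "$dir/two.txt"

# NR keeps counting across files; NF is recomputed for every record.
out=$(cd "$dir" && awk '{ print FILENAME, NR, NF }' one.txt two.txt)
echo "$out"
rm -r "$dir"
```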
AWK Variables
AWK supports variable operations including definition, reference, and computation. It also includes many built-in system variables.
Variable names in AWK can contain letters, numbers, and underscores, but cannot start with a number. AWK variable names are case-sensitive.
AWK variables can be of two types: string and numeric. When defining AWK variables, you don't need to specify the type—AWK automatically determines it based on context.
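A short sketch of context-dependent typing. The variable names x and y are arbitrary, and y is deliberately never assigned.

```shell
out=$(awk 'BEGIN {
  x = "3"                  # assigned as a string
  print x + 1              # numeric context
  print x "1"              # string context (concatenation)
  print y + 0, "[" y "]"   # uninitialized: 0 as a number, "" as a string
}')
echo "$out"
```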
Built-in Variables
AWK provides several built-in variables for common operations. For example, OFS sets the separator that print places between comma-separated items:
awk -F ":" 'BEGIN{OFS="\t"} {print $1,$2}' /etc/passwd
User-defined Variables
Users can define their own variables for use in program code:
awk 'BEGIN {test="hello world"; print test}'
AWK Operators
AWK supports various operators and expressions commonly found in programming languages.
Arithmetic Operators
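awk 'BEGIN{x=2;y=3; print x+y, x-y, x*y, x/y, x%y, x^y, x**y}'
The ^ operator is the portable exponentiation operator; ** is accepted by gawk and some other implementations but is not part of POSIX AWK.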
Assignment Operators
= += -= /= *= %= ^=
Ternary Conditional Operator
expression ? value1 : value2
This operator returns value1 if the expression is true, otherwise value2.
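For instance, the ternary operator can label each record inline. The pass mark of 60 and the sample scores are assumptions made for this sketch.

```shell
out=$(printf 'alice 90\nbob 42\n' |
  awk '{ print $1, ($2 >= 60 ? "pass" : "fail") }')
echo "$out"
```

The parentheses around the conditional expression keep it from being confused with the rest of the print statement.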
Logical Operators
Symbols: && (AND), || (OR), ! (NOT)
Relational Operators
Symbols: >, <, >=, <=, ==, !=, ~ (matches), !~ (does not match)
AWK Patterns
AWK supports various pattern types:
Relational Expressions
awk '$2>80 {print}' data_file
Regular Expressions
Like sed, AWK regular expressions must be enclosed between slashes (/regex/):
awk '/^l/ {print}' data_file
Mixed Patterns
AWK supports combining relational expressions, regular expressions, and logical operators (&&, ||, !):
awk '/^l/ && $2 > 80 {print}' data_file
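To make the selection logic concrete, here is a sketch that applies two mixed patterns to three invented records. A rule with no action prints the matching line.

```shell
data='lisa 85
liam 70
mark 95'

# Name starts with "l" AND score above 80.
a=$(echo "$data" | awk '/^l/ && $2 > 80')
# Name does not start with "l" OR score below 75.
b=$(echo "$data" | awk '$1 !~ /^l/ || $2 < 75')
echo "$a"
echo "$b"
```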
AWK Control Statements
if Statement
Similar to C:
if (expression) {
statements
} else {
statements
}
for Loop
Similar to C:
for (initialization; condition; increment) {
statements
}
while Loop
while (expression) {
statements
}
do-while Loop
do {
statements
} while (expression)
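The practical difference between the loop forms is when the condition is tested. A minimal sketch with arbitrary starting values:

```shell
out=$(awk 'BEGIN {
  i = 5
  while (i < 3) { print "while:", i; i++ }        # condition false: body never runs
  do { print "do-while:", i; i++ } while (i < 3)  # body runs once before the test
  for (j = 1; j <= 3; j++) sum += j               # 1 + 2 + 3
  print "sum:", sum
}')
echo "$out"
```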
break, continue, and next Statements
- break: Exits the current loop
- continue: Skips to the next iteration of the loop
- next: Stops processing the current line, skipping any remaining rules, and reads the next line
exit Statement
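Stops reading input and terminates the AWK program. If the program has an END block, it still runs, unless exit is called from within END itself.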
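A sketch of next and exit working together on invented input, where lines starting with # are treated as comments and a line reading STOP acts as a hypothetical end marker:

```shell
out=$(printf '# comment\nalice 90\nbob 72\nSTOP\ncarol 88\n' | awk '
  /^#/    { next }            # skip comment lines; later rules do not see them
  /^STOP/ { exit }            # stop reading input; END still runs
          { n++; print $1 }
  END     { print "records printed:", n }')
echo "$out"
```

The record after STOP is never read, yet the END block still reports the count.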
Formatted Output
Format
Similar to C:
printf("format", output_list)
Format Specifiers
- %c: Character
- %d, %i: Decimal integer
- %u: Unsigned integer
- %f: Floating-point number
- %e, %E: Scientific notation
- %s: String
- %%: Display a percent sign
Format Modifiers
- N: Minimum field width, given as a number (for example, %10s)
- -: Left alignment
- +: Display numeric sign
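The modifiers are easier to read with visible delimiters. In this sketch the square brackets are only there to show the padding; the .2 precision in the last line is an extra modifier not listed above, included for comparison.

```shell
out=$(awk 'BEGIN {
  printf "[%5d]\n", 42        # width 5, right-aligned
  printf "[%-5d]\n", 42       # width 5, left-aligned
  printf "[%+d]\n", 42        # explicit sign
  printf "[%.2f]\n", 3.14159  # two digits after the decimal point
}')
echo "$out"
```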
Example
awk 'BEGIN{printf("%-10s\t%-10s\n","name","score")} {printf("%-10s\t%-10s\n",$1,$2)}' data_file
AWK Arrays
Indexed Arrays
Indexed arrays use numbers as subscripts:
awk 'BEGIN{a[0]="a";a[1]="b";a[2]="c";print a[0],a[1],a[2]}'
Associative Arrays
Associative arrays use strings as subscripts:
awk 'BEGIN{a["one"]="first";a["two"]="second";print a["one"],a["two"]}'
Looping Through Arrays
For Loop with Index
for (i = 0; i < length(array); i++) {
print array[i]
}
For Loop with 'in' Operator
for (variable in array) {
statements
}
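One caveat worth illustrating: the order in which the 'in' loop visits subscripts is not specified, so output that must be stable is usually sorted afterwards. A sketch with a made-up array:

```shell
out=$(awk 'BEGIN {
  a["one"] = "first"; a["two"] = "second"; a["three"] = "third"
  for (k in a) print k, a[k]      # visit order is implementation-dependent
}' | LC_ALL=C sort)               # sort to get a predictable order
echo "$out"
```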
Counting Occurrences with Arrays
Arrays can be used to count occurrences of values:
awk '{count[$1]++} END{for(i in count) {print i,"count:",count[i]}}' data_file
Practical Examples
Displaying User UIDs
awk -F ":" '{print $3}' /etc/passwd
Displaying User UIDs with Header
awk -F ":" 'BEGIN{print "UserID"}{print $3}' /etc/passwd
Displaying Users with /bin/bash Shell
awk -F ":" '/bash$/{print $1} END{print "End of list"}' /etc/passwd
Displaying Users with GID 0
awk -F ":" '$4==0{print $1}' /etc/passwd
Displaying Users with GID Greater Than 500
awk -F ":" '$4>500{print $1}' /etc/passwd
Displaying Usernames and UIDs with Custom Separator
awk -F ":" 'BEGIN{OFS="###"} {print $1,$3}' /etc/passwd
Displaying Last Field of /etc/passwd
awk -F ":" '{print $NF}' /etc/passwd
Numbering Lines in /etc/passwd
awk '{print NR,$0}' /etc/passwd
Numbering Lines Per File Across Multiple Files
awk '{print FNR,$0}' /etc/passwd /etc/fstab
Using Custom Variables
awk -v var="example.com" 'BEGIN{print var}'
Formatted Output of User Information
awk -F ":" '{printf "%-15s %d %8i\n",$1,$3,$4}' /etc/passwd
Classifying Users
awk -F ":" '{if ($1=="root") printf "%-15s: %s\n", $1,"Admin"; else printf "%-15s: %s\n", $1, "Regular User"}' /etc/passwd
Counting Users with UID Greater Than 500
awk -F ":" -v count=0 '{if ($3>500) count++}END{print count}' /etc/passwd
Displaying Fields with 4+ Characters
awk -F ":" '{i=1;while (i<=NF) { if(length($i)>=4) {print $i}; i++ }}' /etc/passwd
Displaying First Three Fields with do-while
awk -F: '{i=1;do {print $i;i++}while(i<=3)}' /etc/passwd
Displaying First Three Fields with for
awk -F: '{for(i=1;i<=3;i++) print $i}' /etc/passwd
Counting Shell Types
awk -F ":" '$NF!~/^$/{shellCount[$NF]++}END{for(type in shellCount){printf "%-15s:%i\n",type,shellCount[type]}}' /etc/passwd
Displaying Users with Even UID
awk -F ":" '{if($3%2==1) next;{printf "%-15s%d\n",$1,$3}}' /etc/passwd
Counting TCP Connection States
netstat -ant | awk '/^tcp/ {++stateCount[$NF]} END {for(state in stateCount) print state, stateCount[state]}'
Custom Line Separators
awk -F ":" 'BEGIN{ORS="||||"}{print $0}' /etc/passwd