Awk Programming

Awk is a powerful tool for processing delimiter-separated text on Linux and other Unix-like systems.

Typical Uses of AWK

Awk can be used in many different ways, such as:

Text processing
Producing formatted text reports
Performing arithmetic operations
Performing string operations

Program Structure

Syntax

awk [option] 'BEGIN{} { … body … } END{}'

When a pattern has no action, Awk's default action is to print the current line.

BEGIN block

The syntax of the BEGIN block is as follows:

BEGIN {commands}

The BEGIN block is executed once, at program start-up, before any input is read.
This makes it a good place to initialize variables. BEGIN is an AWK keyword and must therefore be written in upper case. This block is optional.

Body Block
The syntax of the body block is as follows:

/pattern/ {commands}

The body block applies AWK commands to the input. By default, AWK executes the commands on every input line.
We can restrict this to matching lines by providing a pattern. Note that there is no keyword for the body block.

END Block

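The syntax of the END block is as follows:

END {commands}

The END block is executed once, after the last input line has been processed. Like BEGIN, END is an AWK keyword, must be written in upper case, and is optional.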

Workflow

Awk first executes the commands in the BEGIN block. It then reads one line from the input stream and runs the body block commands whose patterns match that line.
When it has finished with a line it reads the next one, and so on until the input is exhausted. Finally, it executes the commands in the END block.
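The order of execution described above can be observed with a small sketch. The two input lines and the labels are made up for illustration; the program only assumes a POSIX awk.

```shell
# Feed two lines to awk and label each stage of the workflow.
out=$(printf 'first\nsecond\n' | awk '
  BEGIN { print "BEGIN: before any input" }   # runs once, before the first line
        { print "BODY: " $0 }                 # runs once per input line
  END   { print "END: read " NR " lines" }    # runs once, after the last line
')
printf '%s\n' "$out"
```

The BEGIN line is printed first, then one BODY line per input line, and the END line last.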

An Awk program is normally enclosed in single quotes ('') so that the shell passes it to awk unchanged.

Awk uses $0 to refer to the whole line that has just been read, and $1, $2, … to refer to the individual fields of that line.
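As a quick illustration of $0 and the numbered fields, the sketch below splits one sample line. The input line is an assumption, loosely modeled on the marks.txt rows used later in this document.

```shell
# $0 is the whole line; $1 and $3 are its first and third fields.
out=$(echo "Amit Physics 80" | awk '{
  print "whole line: " $0
  print "name: " $1
  print "score: " $3
}')
printf '%s\n' "$out"
```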

Example

(1) Print all lines

$ awk '{print $0}' file

(2) Print matching line

$ awk '/pattern/' file

The quotes can be omitted when the pattern contains no characters that are special to the shell:

$ awk /pattern/ file

(3) Count the number of matching lines

$ awk '/pattern/{++cnt} END{print "Count=", cnt + 0}' file

(4) Print the lines that are longer than 18 characters

$ awk 'length($0) > 18' file

(5) Access a shell variable in awk
By default, awk can only access its own variables. Use the -v option to pass a shell variable into awk:

$ myvar="something"
$ awk -v var="$myvar" '{print var}' file
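A value passed with -v can also be used in a pattern. The following is a minimal sketch; the shell variable min, the sample names, and the scores are all hypothetical.

```shell
# Pass a shell value into awk and use it as a numeric threshold.
min=85   # hypothetical shell variable
out=$(printf 'Amit 80\nRahul 90\nHari 89\n' | awk -v min="$min" '$2 > min { print $1 }')
printf '%s\n' "$out"   # prints Rahul and Hari, one per line
```

Because both $2 and the -v value look like numbers, awk compares them numerically rather than as strings.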

Internal variables

Awk provides a number of built-in variables. With gawk, running a program with the -d (--dump-variables) option writes a file named awkvars.out to the current directory;
this file lists the global variables, including the built-in ones, together with their final values.

The built-in variables fall into two categories:

1. Variables that control awk

FS Input field separator
OFS Output field separator
RS Input record separator
ORS Output record separator

2. Variables that convey information

ARGC Number of command-line arguments
ARGV Array that stores the command-line arguments
ENVIRON Associative array of environment variables
FILENAME Name of the current input file
FNR Record number within the current file; it is incremented each time a new record is read and starts again at 1 for each new file
NF Number of fields in the current input record
NR Total number of input records awk has processed so far
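ARGC and ARGV can be inspected from a BEGIN block. In this sketch the arguments one and two are placeholders; since the program has only a BEGIN block, awk never tries to open them as files. The loop starts at 1 because ARGV[0] holds the name of the awk program itself, which differs between implementations.

```shell
# Print the argument count and each argument after the program name.
out=$(awk 'BEGIN {
  print "ARGC = " ARGC
  for (i = 1; i < ARGC; i++)
    print "ARGV[" i "] = " ARGV[i]
}' one two)
printf '%s\n' "$out"
```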

(1) ENVIRON

$ awk 'BEGIN {print ENVIRON["USER"]}'

(2) FILENAME

$ awk 'END {print FILENAME}' marks.txt

(3) FS
Field separator of the input stream. The default is a single space, which makes awk split fields on runs of spaces and tabs. It can be changed with the -F option.

$ awk 'BEGIN {print "FS = " FS}' | cat -vte
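To show the effect of changing the separator, here is a small sketch that uses -F with a colon. The two input lines are invented, in the style of /etc/passwd entries.

```shell
# Split each line on ":" and print the first and third fields.
out=$(printf 'root:x:0:0\ndaemon:x:1:1\n' | awk -F: '{ print $1, $3 }')
printf '%s\n' "$out"
```

The same result can be obtained by setting FS = ":" in a BEGIN block.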

(4) NF

Number of fields in the current line.
e.g. Print lines that have more than 2 fields

$ awk 'NF > 2' <<< $'One Two\nOne Two Three\nOne Two Three Four'

(5) NR
Number of the current line, counted across all input.
e.g. Print the first 2 lines

$ awk 'NR < 3' <<< $'One Two\nOne Two Three\nOne Two Three Four'

(6) FNR
Number of the current line within the current file.
This is useful when awk is processing multiple files.
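The difference between NR and FNR only shows up with more than one input file. The sketch below creates two throwaway files in a temporary directory; the file names and contents are assumptions made for the example.

```shell
# Create two small files, then print NR and FNR for every line.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\n' > "$tmpdir/second.txt"
out=$(cd "$tmpdir" && awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' first.txt second.txt)
printf '%s\n' "$out"
rm -rf "$tmpdir"
```

NR keeps counting across both files, while FNR starts again at 1 when second.txt is opened.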

(7) OFS
Field separator of the output stream. The default is a single space.
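As a small sketch of OFS in action, the following sets it to a hyphen (an arbitrary choice) and prints three fields separated by commas in the print statement.

```shell
# The commas in print are replaced by OFS on output.
out=$(echo "One Two Three" | awk 'BEGIN { OFS = "-" } { print $1, $2, $3 }')
printf '%s\n' "$out"   # prints One-Two-Three
```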

(8) RS
Record separator of the input stream. The default is the newline character \n.
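Changing RS makes awk split its input into records on something other than newlines. This sketch assumes semicolon-separated input with no trailing newline.

```shell
# Treat ";" as the record separator and number each record.
out=$(printf 'One;Two;Three' | awk 'BEGIN { RS = ";" } { print NR ": " $0 }')
printf '%s\n' "$out"
```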

(9) ORS
Record separator of the output stream. The default is the newline character \n.
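ORS is appended after every print. A minimal sketch that joins three made-up input lines with commas:

```shell
# Each print ends with "," instead of a newline.
out=$(printf 'One\nTwo\nThree\n' | awk 'BEGIN { ORS = "," } { print }')
printf '%s\n' "$out"   # prints One,Two,Three,
```

Note the trailing comma: ORS is written after the last record as well.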

(10) RLENGTH
The length of the string matched by the most recent call to match()

$ awk 'BEGIN { if (match("One Two Three", "re")) { print RLENGTH } }'
2

(11) RSTART
The position at which the string matched by the most recent call to match() starts

$ awk 'BEGIN { if (match("One Two Three", "Thre")) { print RSTART } }'
9

(12) SUBSEP
The separator used to join multiple array subscripts into a single key. The default value is \034.

$ awk 'BEGIN { print "SUBSEP = " SUBSEP }' | cat -vte
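Because the default SUBSEP is a non-printing character, its role is easier to see with a visible one. In this sketch SUBSEP is set to a colon and the array name grid is hypothetical.

```shell
# A two-subscript element is stored under one key joined by SUBSEP.
out=$(awk 'BEGIN {
  SUBSEP = ":"
  grid[1, 2] = "x"             # stored under the single key "1:2"
  for (key in grid) print key
}')
printf '%s\n' "$out"   # prints 1:2
```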

Regular Expression Operators

'~' is true when the string on the left matches the regular expression on the right
'!~' is true when it does not match

$ awk '$0 ~ 9' marks.txt
2) Rahul   Maths    90
5) Hari    History  89

$ awk '$0 !~ 9' marks.txt
1) Amit     Physics   80
3) Shyam    Biology   87
4) Kedar    English   85
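The same operators can be applied to a single field instead of the whole line. The sketch below uses three invented rows in the style of marks.txt (without the leading numbering) and tests whether the third field starts with 9.

```shell
# Select names by whether the score in field 3 starts with "9".
data='Amit Physics 80
Rahul Maths 90
Hari History 89'
starts_with_9=$(printf '%s\n' "$data" | awk '$3 ~ /^9/ { print $1 }')
others=$(printf '%s\n' "$data" | awk '$3 !~ /^9/ { print $1 }')
printf '%s\n' "$starts_with_9"   # prints Rahul
printf '%s\n' "$others"          # prints Amit and Hari, one per line
```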

Note: characters that have a special meaning in regular expressions, such as [, ], and ., must be escaped with a backslash to be matched literally. This applies to gawk as well as to other awk implementations.

$ tail -n 40 /var/log/nginx/access.log | awk '$0 ~ /ip\[127\.0\.0\.1\]/'

Reference

https://www.chemie.fu-berlin.de/chemnet/use/info/gawk/gawk_11.html