An Introduction to AWK

Written by Scott Macomber, as one of the handouts for Unix tutorials taught
to incoming grad students in the Department of Geography and the Center for
Remote Sensing at Boston University.
First version October 1994. Revised many times since then.
Current Version: 2001.07.05

*******************************************************************************

--AWK--   "Search for and process a pattern in a file."

Awk is a strong tool for manipulating files which contain rows and columns
of data (or text). It treats files much like a spreadsheet. You can type
your awk commands on the command line, OR include them in a script which is
called from the command line, OR just put the whole thing into a shell
script.

USAGE:
% awk 'program statements' [file-list]      (SINGLE QUOTES ARE NECESSARY)
OR
% awk -f program-file [file-list]
OR
% ./script.name      (See --SHELL SCRIPTS-- at the end of this document)

(NOTE: awk is one of many similar programs. nawk is a similar but more
powerful program with some extra features. In the Center for Remote Sensing
(CRS) we link awk to nawk, so when you type awk, you really get nawk. This
occasionally leads to scripts that work for you but fail on a different
system; if a script of yours works for you but not for someone else, check
whether this is the cause.)

General Format:

AWK TREATS INPUT DATA AS A FILE CONTAINING 1 OR MORE ROWS & COLUMNS.
One can be imaginative about what constitutes a column. Columns do not have
to line up or anything; this paragraph can be thought of as 4 lines with a
varying number of columns. Columns are specified with $n, where n = the
column number.
For example, to select and print the first two columns of a file, but in
reverse order: {print $2,$1}

At the command line, type an operation to be performed:
% awk '{print $2,$1}' IN > OUT

OR, specify an awk command file containing the operations to perform:
% awk -f command.file IN > OUT
% cat command.file
{print $2,$1}

To print ALL columns, one can say: {print}, {print $0}, or {print $1,$2,..,$n}

NOTE: Whatever commands you use, they will ALL be applied, in order, to each
line before awk moves on to the next line, and they will be applied to EVERY
line in the input file (unless you specify otherwise).

*******************************************************************************

FIELD SEPARATORS:

In awk, columns are "fields" and rows are "records".
The default separator for fields in the INPUT is any continuous white space,
which includes any number of blank spaces and tabs. The default separator
for fields in the OUTPUT is a single space, so tabs and multiple white
spaces are reduced to a single space by default.

Other input and output field separators can be specified. For example:
{FS=","}     The input field separators are commas.
{FS=":"}     The input field separators are colons.
{OFS="\t"}   Use tabs as field separators in output.

Examples using the print command and this example datafile "IN".
Notice the formatting of the IN file compared to the different outputs.
% cat IN
1    100
2   2000
3     30
4      4
5     55

EXAMPLES:

% awk '{print $1,$2}' IN    # note extra white space removed
1 100
2 2000
3 30
4 4
5 55

% awk '{print $1$2}' IN     # notice, no comma means no white space
1100                        # between fields
22000
330
44
555

% awk '{print}' IN          # print alone prints everything AS IS
1    100
2   2000
3     30
4      4
5     55

% awk '{print $0}' IN       # $0 means all fields AS IS
1    100
2   2000
3     30
4      4
5     55

% awk '{print 9999,$0}' IN  # print 9999, then $0 means all fields as is
9999 1    100
9999 2   2000
9999 3     30
9999 4      4
9999 5     55

% awk '{print $1,"\t",$2}' IN    # note the tab is the field separator
1        100
2        2000
3        30
4        4
5        55

% awk 'BEGIN {OFS="\t"} {print $1,$2}' IN   # the tab is the field separator,
1       100                                 # BUT you don't have to type it
2       2000                                # between every column every time.
3       30
4       4
5       55

% awk '{print $1 ":" $2}' IN     # colon is the field separator; the spaces
1:100                            # around it in the program are irrelevant
2:2000
3:30
4:4
5:55

% awk '{printf $1,$2}' IN   # "print" automatically inserted newlines and
12345%                      # spaces, BUT with "printf" you must format
                            # everything yourself. NOTE that column 2 is
                            # not printed at all. (SEE "formatting" below)

*******************************************************************************

CONDITIONAL STATEMENTS: (IF-THEN type statements)

How can you set conditions? Generally, you do not use the actual words IF or
THEN; just state the condition.

% awk '($1 != 0) {print $1,$2}' IN > OUT

Means: IF column 1 does not equal zero, THEN print both columns.
(Note that the parentheses and the white space inside them are not
necessary, but add clarity, especially when there are multiple levels;
eg. awk '$1!=0{print $1,$2}' would work just as well.)

Example:
IN:
0 99
1 33
0 22
1 11
2 55
OUT:
1 33
1 11
2 55

A more complex example would be this command file:
% awk -f command-file IN > OUT
% cat command-file
$2 < 0                {print $1,0}
$2 >= 0 && $2 <= 255  {print $1,$2}
$2 > 255              {print $1,255}

That says the following:
If col 2 is less than zero, print col 1, and "0" for col 2.
If col 2 is 0 to 255, print col 1 and col 2.
If col 2 is greater than 255, print col 1, and "255" for col 2.

IN:
22 300
23 -60
24 200
OUT:
22 255
23 0
24 200

You COULD put all those commands on one command line, but if you make a
mistake, a script is clearer, and easier to evaluate and edit, as you can
see in this long, single-line version of the above:
% awk '$2<0{print $1,0} $2>=0&&$2<=255{print $1,$2} $2>255{print $1,255}' IN

Further, one can specify actions to take only on specific records (lines) by
specifying the record (line) number in the file with "NR".
For example, to skip the first line (perhaps a header):
% awk '( NR > 1 ) {print}'
or, to reduce a file by printing every other line:
% awk '((NR % 2) == 0) {print}'
(% is the modulus, or remainder. If the remainder is 0, then print.)

CAUTION: awk does NOT exit when a condition is matched. If the record
matches the condition, the instruction is executed; then awk tests the next
condition on the same record. So, if any record matches more than one of the
conditional statements, it will be processed every time it matches, and one
record from the input can create many records in the output. This can be bad
if you are not expecting it, but handy if you control it.

*******************************************************************************

CONDITIONAL STATEMENTS: (IF-ELSE type statements)

One drawback to the method above is that you must explicitly state all
possible conditions. Another way is to use if-else statements in awk.
Though less common, this method has advantages.

Embed the if-else statements in the inner AND outer curly brackets:
% awk '{ if ( $1==0 || $2==0 || $3==0 ) {print "0"} else {print $0} }' IN > OUT

IF column 1 or 2 or 3 is equal to zero, print a single 0; ELSE, print the
entire line.
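As a quick sanity check of the if-else form, here is a minimal,
self-contained run. The file name demo_in and the two-column test data are
invented for illustration; a two-field condition is used because in awk an
empty (missing) field compares equal to 0, so testing $3 on two-column input
would match every line.

```shell
# Minimal sketch of the if-else form on invented two-column data.
printf '0 99\n1 33\n5 0\n' > demo_in
awk '{ if ($1==0 || $2==0) {print "0"} else {print $0} }' demo_in
# prints:
# 0
# 1 33
# 0
rm demo_in
```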
*******************************************************************************

FOR LOOPS:

You can get awk to execute a command as many times as you like with a loop
using this format (notice both inside and outside brackets):

{for (a=1; a<=n; a++) {print $1,$2}}

where you supply n. Note that n can be a number, a variable, or a value in
a column.

{for (a=1; a<=99; a++) {print $1,$2}}
    # will print columns 1 and 2, 99 times.

BEGIN {X=something}
{for (a=1; a<=X; a++) {print $1,$2}}
    # will print col 1 and col 2, X times.

{for (a=1; a<=$3; a++) {print $1,$2}}
    # Will print col 1 and col 2 as many times as column 3 says to.
    # Useful for regressions.
    #
    # eg.
    # 4 5 3
    # 6 9 2    becomes
    #
    # 4 5
    # 4 5
    # 4 5
    # 6 9
    # 6 9

You CAN put a condition in front, just like any other statement:
% awk '(NR > 1) {for (a=1; a<=$3; a++) {print $1,$2}}'

*******************************************************************************

MATHEMATICAL EXPRESSIONS:

Operators can be used to apply mathematical expressions to the data.
Columns can be combined, or constants applied. For example:

% awk 'BEGIN {OFS="\t"} {print $1,$2,$3,(($1+$2+$3)/3)}' IN > OUT

will print out column 1, column 2, column 3, and the mean of the 3 columns.

IN:
10 15 20
0 20 20
5 5 9
OUT:
10      15      20      15
0       20      20      13.3333
5       5       9       6.33333

There are many other operators for awk, usually variants of addition,
subtraction, multiplication and division. Squares can be found by
multiplying a column value by itself: ($1*$1) = (col 1 squared). IF you are
using NAWK instead of AWK (in CRS, awk is actually a link to nawk), you can
use the caret to create a power relationship: ($1^2) (in this case,
parentheses ARE necessary).

EXPRESSIONS:
Expressions take on string or numeric values as appropriate, and are built
using the operators +, -, *, /, %, and concatenation (indicated by a blank).
The C operators ++, --, +=, -=, *=, /=, and %= are also available in
expressions.
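The C-style operators combine naturally with the arithmetic above. A small
sketch (the three input values are invented for the demo) keeps a running
sum with += and squares each value with the caret:

```shell
# Running sum with += and a square via the caret (works in nawk and any
# POSIX awk). Input values are invented for the demo.
printf '1\n2\n3\n' | awk '{ sum += $1; print $1, sum, ($1^2) }'
# prints:
# 1 1 1
# 2 3 4
# 3 6 9
```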
BUILT-IN FUNCTIONS:

LENGTH
The built-in function length returns the length of its argument taken as a
string, or of the whole line if no argument. Very handy for printing the
length of every line.
USAGE: {print length}

EXP, LOG, SQRT, INT
There are also built-in functions exp, log, sqrt, and int, where int
truncates its argument to an integer. PARENTHESES ARE REQUIRED.
USAGE: {print sqrt($1)}

substr(s, m, n)
Returns the n-character substring of s that begins at position m.

sprintf(format, expression, expression, ...)
Formats the expressions according to the printf format given by format, and
returns the resulting string.

*******************************************************************************

SPECIFYING INTEGER, FLOAT, STRING, ETC.:

Formatting output can be better controlled using a "printf" statement
instead of the "print" statement, but using "printf" REQUIRES you to format
the output.

awk's default output is decimal unless floating point numbers are
necessary, as in the example above. Then, awk will use as long a floating
point number as it needs. Sometimes it is desirable to specify the output
one wants, esp. to specify the number of significant digits in a floating
point number, or to force truncating or rounding to decimal numbers.

For example, this input file IN:
10000 12000 14000
60 90 120

Dividing each column by different numbers could go like this:
% awk '{print ($1/66),($2/6430),($3/627)}' IN > OUT
% cat OUT
151.515 1.86625 22.3285
0.909091 0.0139969 0.191388

Using OFS one can add tabs:
% awk '{OFS="\t"}{print ($1/66),($2/6430),($3/627)}' IN > OUT
% cat OUT
151.515 1.86625 22.3285
0.909091        0.0139969       0.191388

BUT, all those significant figures are messy and not necessary, AND the
different number of significant digits messes up the tabs.
Use printf to format 3 decimal places and separate with tabs:
% awk '{printf "%.3f\t%.3f\t%.3f\n", ($1/66),($2/6430),($3/627)}' IN > OUT
% cat OUT
151.515 1.866   22.329
0.909   0.014   0.191

By default, awk always lines up columns on the left. When numbers have
different numbers of digits to the left of the decimal point, they don't
line up well for our eyes. You can line them up by specifying the TOTAL
number of characters to use (including the decimal point) for each number;
eg. %7.3f will allow all numbers up to 999.999 to line up perfectly. A
larger number will still print, but will be shifted over compared to all
the others, so if you can predict the size of your numbers you can format
everything very nicely. Note that the decimal point itself is one of those
7 characters.

% awk '{printf "%7.3f\t%7.3f\t%7.3f\n", ($1/66),($2/6430),($3/627)}' IN > OUT
% cat OUT
151.515   1.866  22.329
  0.909   0.014   0.191

IMPORTANT DETAILS WHEN USING PRINTF:
1. printf goes inside the curly brackets.
2. After printf, formatting goes inside the double quotes.
3. After the final double quote, you must have a comma.
4. \n is required for newlines to be printed. Without it, printf will print
   one very long line.

OTHER VARIABLE TYPES:
%d   decimal
%f   float
%e   exponential
%g   (f or e, whichever is shorter)
%s   string (a string of characters)
%o   unsigned octal
%x   unsigned hexadecimal

GETTING MORE TRICKY:
You have a file with 1 or more columns and many rows, and you want to make
1 row of many columns:
% awk '{printf "%s\t", $0}' IN

*******************************************************************************

SUMS AND AVERAGES: and using VARIABLES

Running sums, total sums, and averages can all be done with a small bit of
imaginative finagling. Actually, one can write quite complex programs using
awk; this is just a simple example. This command computes sums and averages
by assigning values to variables (s and t) and increasing those values
after each line is read.
VARIABLES:
Variables are created and/or changed inside curly brackets. A letter or a
string (word) can name the variable, eg. a, b, count, prev.

{a == 1}      CONDITIONAL STATEMENT: IF a equals 1
{a = 1}       ASSIGNMENT: assigns the value 1 to the variable a
{a++}         increments a by 1 when each line is read.
{a += $1}     increments a by adding the value in column 1 to a when each
              line is read.
{a += 5}      increments a by 5 when each line is read.
{b = (a/2)}   b is whatever a is, divided by 2
{print a}     will print THE VALUE STORED IN a.
{print "a"}   will print the letter a
{print a/2}   will print the value stored in a divided by 2

Here is an example which converts a list of test scores to a formatted
output with grades summed and averaged:

% awk -f Command-file IN > OUT
% cat Command-file
BEGIN {printf "\n%s\t%s\t%s\t%s\n","subj.","grade1","grade2","avg. grade"}
{s += $1}                  # s is a variable for the running total of col 1.
{t += $2}                  # t is a variable for the running total of col 2.
{print NR, $0, ($1+$2)/2}  # NR is always the line number
END {printf "\n%s\t%d\t%d\n%s\t%.2f\t%.2f\t%.2f\n",\
"sum", s, t, "avg.", s/NR, t/NR, (s/NR+t/NR)/2}
# s/NR = the total of col 1, divided by the number of lines.
# t/NR = the total of col 2, divided by the number of lines.

% cat IN
99 100
72 73
88 64
96 83

% cat OUT

subj.   grade1  grade2  avg. grade
1 99 100 99.5
2 72 73 72.5
3 88 64 76
4 96 83 89.5

sum     355     320
avg.    88.75   80.00   84.38

NOTES:
1. BEGIN does something once at the beginning, BEFORE awk reads the first
   line of the input file. Good for creating headers.
2. END does something once at the end, after reading the entire file and
   executing all commands.
3. Check out the END line. Notice the backslash at the end of it? That
   shows awk that the next line is a continuation of this line.
4. CAUTION when using NR: it counts lines in the input file. If there were,
   for example, a blank and empty line anywhere (eg. at the end of this IN
   file), the sums would be divided by 5, not 4 (and give an inaccurate
   value for avg). If necessary, a more conservative approach is to define
   a counter variable which depends on a record not being blank, eg.
   (NF > 0) {count++}
   If the number of fields in a line is greater than 0, increase count by
   one.

*******************************************************************************

COMPARING PREVIOUS LINES:

Useful if you want to reduce the size of a file by printing lines only once
when they repeat in the original file.

This command says: "Print a line ONLY if the ENTIRE line is NOT identical
to the previous line."
% awk '$0 != prev {print; prev=$0}' IN > OUT

IN:
1 1
1 1
1 1
1 2
1 2
2 1
2 1
2 2
2 2
2 2
2 3
OUT:
1 1
1 2
2 1
2 2
2 3

Note that this is similar to the UNIQ command, described later, but more
flexible.

But, to print a line only if the value in col 1 is not identical to the
previous value in col 1 (print the line only the first time that the number
occurs in the column):
% awk '$1 != prev {print; prev=$1}' IN > OUT

IN:
1 1
1 2
1 3
1 99
2 100
2 99
3 88
3 3
OUT:
1 1
2 100
3 88

In this case, "prev" is a variable. IF col 1 is not equal to prev, THEN
print the line; THEN assign this new value to prev.

*******************************************************************************
*******************************************************************************

--SHELL SCRIPTS--   A very brief introduction.

Shell scripts are command files which can contain a series of instructions,
and which may contain variables. Like a program, a script must be made
"executable". This is done with the command:
chmod +x file-name

There are several good reasons to use shell scripts, among them:
1. They are often simpler to write than programs.
2. Automation. They will carry out instructions for you, when you need them
   to.
3. Sequence.
   Scripts are especially useful when one has several things to do which
   must be done in sequence, but you don't wish to sit around waiting for
   each step to be completed before executing the next.
4. Testing small changes. Scripts are useful when one wishes to do the same
   thing several times, but with small changes each time. This can be done
   through the use of variables in the command file. Variables in the file
   are expressed as $1 for the first variable, $2 for the second, etc. Note
   that in this case, $1 and $2 have a different meaning than the
   field/column notation inside an awk statement.
5. Historical: a script keeps a record of what you've been doing.

Misc Notes:
1. By convention, many of us in CRS use filenames beginning with "c.", so
   we end up with names like c.degrees.to.radians
2. The symbol # is used to "comment out" a line. In other words, that line
   will not be executed.
3. A script must be "executable" (you will see an x in the permissions when
   you list the file):
   % ls -Fla c.degrees.to.radians
   -rwxr-xr-x 1 scottm geog-grad 1574 May 15 1996 c.degrees.to.radians*
   If it is not executable, you can make it so with the chmod command:
   % chmod +x c.degrees.to.radians
4. The filename is the name used to execute the command. If a command file
   expects to receive variables, they are entered directly after the
   command, eg.
   % ./c.degrees.to.radians 75
   Note: the dot-slash specifies that you want to run the script that is in
   this directory. If you have "dot" in your path, you may not need it, in
   which case you only need the filename:
   % c.degrees.to.radians 75

To read the script, just use "cat" or "more" or "less":

###################################################################
% cat c.degrees.to.radians
#!/bin/sh
# USAGE: c.degrees.to.radians n (where n is degrees from 1 to 360)
# This script converts degrees to radians, which is necessary when
# you wish to obtain trigonometric values, such as sin and cos.
# Note that this script is for demonstration purposes, so it
# is verbose, and uses several methods to demonstrate
# similarities and differences.

# The first (and only) input is assigned to the variable "deg".
# The input is specified as $1
deg=$1
pi=3.1417

# First check that there is 1 and only 1 input,
# then check that that input is between 1 and 360.
#
# IF there is 1 input and it is >= 1 and <= 360,
# THEN do NOTHING (the ":");
# ELSE, if it does not meet these conditions,
# ECHO an error statement and EXIT.
if [ "$#" -eq 1 -a \( "$deg" -ge 1 -a "$deg" -le 360 \) ]
then
    :
else
    echo ""
    echo "Usage: c.degrees.to.radians n (where n is degrees from 1 to 360)"
    echo ""
    exit 1
fi

# Convert degrees to radians using awk.
# Note that awk needs input. There are several ways to provide it.
# Here is one: create a file with just a 1 in it, then use that as input.
echo "1" > 1

rad=`awk '{print ( ('$deg' * '$pi' ) / 180 ) }' 1`

echo ""
awk '{print "degrees = " '$deg' }' 1
awk '{print "radians = " '$rad' }' 1
echo " "
nawk '{print "sin of " '$deg' " = " sin('$rad') }' 1
echo ""
nawk '{printf "sin of %d = %.2f\n", '$deg', sin('$rad') }' 1
echo ""
nawk '{print "cos of " '$deg' " = " cos('$rad') }' 1
echo ""
nawk '{printf "cos of %d = %.2f\n", '$deg', cos('$rad') }' 1
echo ""
##################################################################

To execute (run) the file, type on the command line the filename and any
necessary arguments:

% c.degrees.to.radians 75

degrees = 75
radians = 1.30904
sin of 75 = 0.965937
sin of 75 = 0.97
cos of 75 = 0.258777
cos of 75 = 0.26
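As an aside, the temporary file "1" is only needed because the script feeds
awk line-oriented input. A sketch of an alternative, assuming an awk that
supports the -v option (nawk and later awks do): a BEGIN block runs before
any input is read, so no input file is required at all.

```shell
# Alternative sketch: pass the shell value in with -v and do all the work
# in a BEGIN block, so no temporary input file is needed.
deg=75
awk -v deg="$deg" 'BEGIN {
    pi = 3.1417
    rad = (deg * pi) / 180
    printf "degrees = %d\n", deg
    printf "radians = %.5f\n", rad
    printf "sin of %d = %.2f\n", deg, sin(rad)
    printf "cos of %d = %.2f\n", deg, cos(rad)
}'
# prints:
# degrees = 75
# radians = 1.30904
# sin of 75 = 0.97
# cos of 75 = 0.26
```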