An Introduction to AWK

Written by Scott Macomber, as one of the handouts for Unix tutorials taught
to incoming grad students in the Department of Geography and the Center for
Remote Sensing at Boston University.
First version October 1994. Revised many times since then.
Current Version: 2001.07.05

*******************************************************************************

--AWK--   "Search for and process a pattern in a file."

Awk is a strong tool for manipulating files which contain rows and columns
of data (or text). It treats files much like a spreadsheet. You can type
your awk commands on the command line, OR include them in a script which is
called from the command line, OR just put the whole thing into a shell
script.

USAGE:
% awk 'program statements' [file-list]      (SINGLE QUOTES ARE NECESSARY)
OR
% awk -f program-file [file-list]
OR
% ./script.name      (See --SHELL SCRIPTS-- at the end of this document)

(NOTE: awk is one of many similar programs. nawk is a similar but more
powerful program with some extra features. In the Center for Remote Sensing
(CRS) we link awk to nawk, so when you type awk, you really get nawk. This
occasionally leads to scripts that work for you but fail on a different
system; if a script of yours works for you but not for someone else, check
whether this is the cause.)

General Format:

AWK TREATS INPUT DATA AS A FILE CONTAINING 1 OR MORE ROWS & COLUMNS.
One can be imaginative about what constitutes a column. Columns do not have
to line up or anything; this paragraph can be thought of as 4 lines with a
varying number of columns. Columns are specified with $n, where n = the
column number.
For example, to select and print the first two columns of a file, but in
reverse order: {print $2,$1}

At the command line, type an operation to be performed:
% awk '{print $2,$1}' IN > OUT

OR, specify an awk command file containing the operations to perform:
% awk -f command.file IN > OUT
% cat command.file
{print $2,$1}

To print ALL columns, one can say: {print}, {print $0}, or {print $1,$2,..,$n}

NOTE: Whatever commands you use, they will ALL be applied, in order, to each
line before awk moves on to the next line, and they will be applied to EVERY
line in the input file (unless you specify otherwise).

*******************************************************************************

FIELD SEPARATORS:

In awk, columns are "fields" and rows are "records".
The default separator for fields in the INPUT is any continuous white space,
which includes any number of blank spaces and tabs. The default separator
for fields in the OUTPUT is a single space, so tabs and multiple white
spaces are reduced to a single space by default.

Other input and output field separators can be specified. For example:
{FS=","}     The input field separators are commas.
{FS=":"}     The input field separators are colons.
{OFS="\t"}   Use tabs as field separators in output.

Examples using the print command and this example datafile "IN".
Notice the formatting of the IN file compared to the different outputs.
% cat IN
1    100
2   2000
3     30
4      4
5     55

EXAMPLES:

% awk '{print $1,$2}' IN    # note extra white space removed
1 100
2 2000
3 30
4 4
5 55

% awk '{print $1$2}' IN     # notice, no comma means no white space
1100                        # between fields
22000
330
44
555

% awk '{print}' IN          # print alone prints everything AS IS
1    100
2   2000
3     30
4      4
5     55

% awk '{print $0}' IN       # $0 means all fields AS IS
1    100
2   2000
3     30
4      4
5     55

% awk '{print 9999,$0}' IN  # print 9999, then $0 means all fields as is
9999 1    100
9999 2   2000
9999 3     30
9999 4      4
9999 5     55

% awk '{print $1,"\t",$2}' IN    # note the tab is the field separator
1        100
2        2000
3        30
4        4
5        55

% awk 'BEGIN {OFS="\t"} {print $1,$2}' IN   # the tab is the field separator,
1       100                                 # BUT you don't have to type it
2       2000                                # between every column every time.
3       30
4       4
5       55

% awk '{print $1 ":" $2}' IN     # colon is the field separator; the spaces
1:100                            # around it in the program are irrelevant
2:2000
3:30
4:4
5:55

% awk '{printf $1,$2}' IN   # "print" automatically inserted newlines and
12345%                      # spaces, BUT with "printf" you must format
                            # everything yourself. NOTE that column 2 is
                            # not printed at all. (SEE "formatting" below)

*******************************************************************************

CONDITIONAL STATEMENTS: (IF-THEN type statements)

How can you set conditions? Generally, you do not use the actual words IF or
THEN; just state the condition.

% awk '($1 != 0) {print $1,$2}' IN > OUT

Means: IF column 1 does not equal zero, THEN print both columns.
(Note that the parentheses and the white space inside them are not
necessary, but add clarity, especially when there are multiple levels;
eg. awk '$1!=0{print $1,$2}' would work just as well.)

Example:
IN:
0 99
1 33
0 22
1 11
2 55
OUT:
1 33
1 11
2 55

A more complex example would be this command file:
% awk -f command-file IN > OUT
% cat command-file
$2 < 0                {print $1,0}
$2 >= 0 && $2 <= 255  {print $1,$2}
$2 > 255              {print $1,255}

That says the following:
If col 2 is less than zero, print col 1, and "0" for col 2.
If col 2 is 0 to 255, print col 1 and col 2.
If col 2 is greater than 255, print col 1, and "255" for col 2.

IN:
22 300
23 -60
24 200
OUT:
22 255
23 0
24 200

You COULD put all those commands on one command line, but if you make a
mistake, a script is clearer, and easier to evaluate and edit, as you can
see in this long, single-line version of the above:
% awk '$2<0{print $1,0} $2>=0&&$2<=255{print $1,$2} $2>255{print $1,255}' IN

Further, one can specify actions to take only on specific records (lines) by
specifying the record (line) number in the file with "NR".
For example, to skip the first line (perhaps a header):
% awk '( NR > 1 ) {print}'
or, to reduce a file by printing every other line:
% awk '((NR % 2) == 0) {print}'
(% is the modulus, or remainder. If the remainder is 0, then print.)

CAUTION: awk does NOT exit when a condition is matched. If the record
matches the condition, the instruction is executed; then awk tests the next
condition on the same record. So, if any record matches more than one of the
conditional statements, it will be processed every time it matches, and one
record from the input can create many records in the output. This can be bad
if you are not expecting it, but handy if you control it.

*******************************************************************************

CONDITIONAL STATEMENTS: (IF-ELSE type statements)

One drawback to the method above is that you must explicitly state all
possible conditions. Another way is to use if-else statements in awk.
Though less common, this method has advantages.

Embed the if-else statements in the inner AND outer curly brackets:
% awk '{ if ( $1==0 || $2==0 || $3==0 ) {print "0"} else {print $0} }' IN > OUT

IF column 1 or 2 or 3 is equal to zero, print a single 0; ELSE, print the
entire line.
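As a quick sanity check of the if-else form, here is a minimal,
self-contained run. The file name demo_in and the two-column test data are
invented for illustration; a two-field condition is used because in awk an
empty (missing) field compares equal to 0, so testing $3 on two-column input
would match every line.

```shell
# Minimal sketch of the if-else form on invented two-column data.
printf '0 99\n1 33\n5 0\n' > demo_in
awk '{ if ($1==0 || $2==0) {print "0"} else {print $0} }' demo_in
# prints:
# 0
# 1 33
# 0
rm demo_in
```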
*******************************************************************************

FOR LOOPS:

You can get awk to execute a command as many times as you like with a loop
using this format (notice both inside and outside brackets):

{for (a=1; a<=n; a++) {print $1,$2}}

where you supply n. Note that n can be a number, a variable, or a value in
a column.

{for (a=1; a<=99; a++) {print $1,$2}}
    # will print columns 1 and 2, 99 times.

BEGIN {X=something}
{for (a=1; a<=X; a++) {print $1,$2}}
    # will print col 1 and col 2, X times.

{for (a=1; a<=$3; a++) {print $1,$2}}
    # Will print col 1 and col 2 as many times as column 3 says to.
    # Useful for regressions.
    #
    # eg.
    # 4 5 3
    # 6 9 2    becomes
    #
    # 4 5
    # 4 5
    # 4 5
    # 6 9
    # 6 9

You CAN put a condition in front, just like any other statement:
% awk '(NR > 1) {for (a=1; a<=$3; a++) {print $1,$2}}'

*******************************************************************************

MATHEMATICAL EXPRESSIONS:

Operators can be used to apply mathematical expressions to the data.
Columns can be combined, or constants applied. For example:

% awk 'BEGIN {OFS="\t"} {print $1,$2,$3,(($1+$2+$3)/3)}' IN > OUT

will print out column 1, column 2, column 3, and the mean of the 3 columns.

IN:
10 15 20
0 20 20
5 5 9
OUT:
10      15      20      15
0       20      20      13.3333
5       5       9       6.33333

There are many other operators for awk, usually variants of addition,
subtraction, multiplication and division. Squares can be found by
multiplying a column value by itself: ($1*$1) = (col 1 squared). IF you are
using NAWK instead of AWK (in CRS, awk is actually a link to nawk), you can
use the caret to create a power relationship: ($1^2) (in this case,
parentheses ARE necessary).

EXPRESSIONS:
Expressions take on string or numeric values as appropriate, and are built
using the operators +, -, *, /, %, and concatenation (indicated by a blank).
The C operators ++, --, +=, -=, *=, /=, and %= are also available in
expressions.
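The C-style operators combine naturally with the arithmetic above. A small
sketch (the three input values are invented for the demo) keeps a running
sum with += and squares each value with the caret:

```shell
# Running sum with += and a square via the caret (works in nawk and any
# POSIX awk). Input values are invented for the demo.
printf '1\n2\n3\n' | awk '{ sum += $1; print $1, sum, ($1^2) }'
# prints:
# 1 1 1
# 2 3 4
# 3 6 9
```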
BUILT-IN FUNCTIONS:

LENGTH
The built-in function length returns the length of its argument taken as a
string, or of the whole line if no argument. Very handy for printing the
length of every line.
USAGE: {print length}

EXP, LOG, SQRT, INT
There are also built-in functions exp, log, sqrt, and int, where int
truncates its argument to an integer. PARENTHESES ARE REQUIRED.
USAGE: {print sqrt($1)}

substr(s, m, n)
Returns the n-character substring of s that begins at position m.

sprintf(format, expression, expression, ...)
Formats the expressions according to the printf format given by format, and
returns the resulting string.

*******************************************************************************

SPECIFYING INTEGER, FLOAT, STRING, ETC.:

Formatting output can be better controlled using a "printf" statement
instead of the "print" statement, but using "printf" REQUIRES you to format
the output.

awk's default output is decimal unless floating point numbers are
necessary, as in the example above. Then, awk will use as long a floating
point number as it needs. Sometimes it is desirable to specify the output
one wants, esp. to specify the number of significant digits in a floating
point number, or to force truncating or rounding to decimal numbers.

For example, this input file IN:
10000 12000 14000
60 90 120

Dividing each column by different numbers could go like this:
% awk '{print ($1/66),($2/6430),($3/627)}' IN > OUT
% cat OUT
151.515 1.86625 22.3285
0.909091 0.0139969 0.191388

Using OFS one can add tabs:
% awk '{OFS="\t"}{print ($1/66),($2/6430),($3/627)}' IN > OUT
% cat OUT
151.515 1.86625 22.3285
0.909091        0.0139969       0.191388

BUT, all those significant figures are messy and not necessary, AND the
different number of significant digits messes up the tabs.
Use printf to format 3 decimal places and separate with tabs:
% awk '{printf "%.3f\t%.3f\t%.3f\n", ($1/66),($2/6430),($3/627)}' IN > OUT
% cat OUT
151.515 1.866   22.329
0.909   0.014   0.191

By default, awk always lines up columns on the left. When numbers have
different numbers of digits to the left of the decimal point, they don't
line up well for our eyes. You can line them up by specifying the TOTAL
number of characters to use (including the decimal point) for each number;
eg. %7.3f will allow all numbers up to 999.999 to line up perfectly. A
larger number will still print, but will be shifted over compared to all
the others, so if you can predict the size of your numbers you can format
everything very nicely. Note that the decimal point itself is one of those
7 characters.

% awk '{printf "%7.3f\t%7.3f\t%7.3f\n", ($1/66),($2/6430),($3/627)}' IN > OUT
% cat OUT
151.515   1.866  22.329
  0.909   0.014   0.191

IMPORTANT DETAILS WHEN USING PRINTF:
1. printf goes inside the curly brackets.
2. After printf, formatting goes inside the double quotes.
3. After the final double quote, you must have a comma.
4. \n is required for newlines to be printed. Without it, printf will print
   one very long line.

OTHER VARIABLE TYPES:
%d   decimal
%f   float
%e   exponential
%g   (f or e, whichever is shorter)
%s   string (a string of characters)
%o   unsigned octal
%x   unsigned hexadecimal

GETTING MORE TRICKY:
You have a file with 1 or more columns and many rows, and you want to make
1 row of many columns:
% awk '{printf "%s\t", $0}' IN

*******************************************************************************

SUMS AND AVERAGES: and using VARIABLES

Running sums, total sums, and averages can all be done with a small bit of
imaginative finagling. Actually, one can write quite complex programs using
awk; this is just a simple example. This command computes sums and averages
by assigning values to variables (s and t) and increasing those values
after each line is read.
VARIABLES:
Variables are created and/or changed inside curly brackets. A letter or a
string (word) can name the variable, eg. a, b, count, prev.

{a == 1}      CONDITIONAL STATEMENT: IF a equals 1
{a = 1}       ASSIGNMENT: assigns the value 1 to the variable a
{a++}         increments a by 1 when each line is read.
{a += $1}     increments a by adding the value in column 1 to a when each
              line is read.
{a += 5}      increments a by 5 when each line is read.
{b = (a/2)}   b is whatever a is, divided by 2
{print a}     will print THE VALUE STORED IN a.
{print "a"}   will print the letter a
{print a/2}   will print the value stored in a divided by 2

Here is an example which converts a list of test scores to a formatted
output with grades summed and averaged:

% awk -f Command-file IN > OUT
% cat Command-file
BEGIN {printf "\n%s\t%s\t%s\t%s\n","subj.","grade1","grade2","avg. grade"}
{s += $1}                  # s is a variable for the running total of col 1.
{t += $2}                  # t is a variable for the running total of col 2.
{print NR, $0, ($1+$2)/2}  # NR is always the line number
END {printf "\n%s\t%d\t%d\n%s\t%.2f\t%.2f\t%.2f\n",\
"sum", s, t, "avg.", s/NR, t/NR, (s/NR+t/NR)/2}
# s/NR = the total of col 1, divided by the number of lines.
# t/NR = the total of col 2, divided by the number of lines.

% cat IN
99 100
72 73
88 64
96 83

% cat OUT

subj.   grade1  grade2  avg. grade
1 99 100 99.5
2 72 73 72.5
3 88 64 76
4 96 83 89.5

sum     355     320
avg.    88.75   80.00   84.38

NOTES:
1. BEGIN does something once at the beginning, BEFORE awk reads the first
   line of the input file. Good for creating headers.
2. END does something once at the end, after reading the entire file and
   executing all commands.
3. Check out the END line. Notice the backslash at the end of it? That
   shows awk that the next line is a continuation of this line.
4. CAUTION when using NR: it counts lines in the input file. If there were,
   for example, a blank and empty line anywhere (eg. at the end of this IN
   file), the sums would be divided by 5, not 4 (and give an inaccurate
   value for avg). If necessary, a more conservative approach is to define
   a counter variable which depends on a record not being blank, eg.
   (NF > 0) {count++}
   If the number of fields in a line is greater than 0, increase count by
   one.

*******************************************************************************

COMPARING PREVIOUS LINES:

Useful if you want to reduce the size of a file by printing lines only once
when they repeat in the original file.

This command says: "Print a line ONLY if the ENTIRE line is NOT identical
to the previous line."
% awk '$0 != prev {print; prev=$0}' IN > OUT

IN:
1 1
1 1
1 1
1 2
1 2
2 1
2 1
2 2
2 2
2 2
2 3
OUT:
1 1
1 2
2 1
2 2
2 3

Note that this is similar to the UNIQ command, described later, but more
flexible.

But, to print a line only if the value in col 1 is not identical to the
previous value in col 1 (print the line only the first time that the number
occurs in the column):
% awk '$1 != prev {print; prev=$1}' IN > OUT

IN:
1 1
1 2
1 3
1 99
2 100
2 99
3 88
3 3
OUT:
1 1
2 100
3 88

In this case, "prev" is a variable. IF col 1 is not equal to prev, THEN
print the line; THEN assign this new value to prev.

*******************************************************************************
*******************************************************************************

--SHELL SCRIPTS--   A very brief introduction.

Shell scripts are command files which can contain a series of instructions,
and which may contain variables. Like a program, a script must be made
"executable". This is done with the command:
chmod +x file-name

There are several good reasons to use shell scripts, among them:
1. They are often simpler to write than programs.
2. Automation. They will carry out instructions for you, when you need them
   to.
3. Sequence.
   Scripts are especially useful when one has several things to do which
   must be done in sequence, but you don't wish to sit around waiting for
   each step to be completed before executing the next.
4. Testing small changes. Scripts are useful when one wishes to do the same
   thing several times, but with small changes each time. This can be done
   through the use of variables in the command file. Variables in the file
   are expressed as $1 for the first variable, $2 for the second, etc. Note
   that in this case, $1 and $2 have a different meaning than the
   field/column notation inside an awk statement.
5. Historical: a script keeps a record of what you've been doing.

Misc Notes:
1. By convention, many of us in CRS use filenames beginning with "c.", so
   we end up with names like c.degrees.to.radians
2. The symbol # is used to "comment out" a line. In other words, that line
   will not be executed.
3. A script must be "executable" (you will see an x in the permissions when
   you list the file):
   % ls -Fla c.degrees.to.radians
   -rwxr-xr-x 1 scottm geog-grad 1574 May 15 1996 c.degrees.to.radians*
   If it is not executable, you can make it so with the chmod command:
   % chmod +x c.degrees.to.radians
4. The filename is the name used to execute the command. If a command file
   expects to receive variables, they are entered directly after the
   command, eg.
   % ./c.degrees.to.radians 75
   Note: the dot-slash specifies that you want to run the script that is in
   this directory. If you have "dot" in your path, you may not need it, in
   which case you only need the filename:
   % c.degrees.to.radians 75

To read the script, just use "cat" or "more" or "less":

###################################################################
% cat c.degrees.to.radians
#!/bin/sh
# USAGE: c.degrees.to.radians n (where n is degrees from 1 to 360)
# This script converts degrees to radians, which is necessary when
# you wish to obtain trigonometric values, such as sin and cos.
# Note that this script is for demonstration purposes, so it
# is verbose, and uses several methods to demonstrate
# similarities and differences.

# The first (and only) input is assigned to the variable "deg".
# The input is specified as $1
deg=$1
pi=3.1417

# First check that there is 1 and only 1 input,
# then check that that input is between 1 and 360.
#
# IF there is 1 input and it is >= 1 and <= 360,
# THEN do NOTHING (the ":");
# ELSE, if it does not meet these conditions,
# ECHO an error statement and EXIT.
if [ "$#" -eq 1 -a \( "$deg" -ge 1 -a "$deg" -le 360 \) ]
then
    :
else
    echo ""
    echo "Usage: c.degrees.to.radians n (where n is degrees from 1 to 360)"
    echo ""
    exit 1
fi

# Convert degrees to radians using awk.
# Note that awk needs input. There are several ways to provide it.
# Here is one: create a file with just a 1 in it, then use that as input.
echo "1" > 1

rad=`awk '{print ( ('$deg' * '$pi' ) / 180 ) }' 1`

echo ""
awk '{print "degrees = " '$deg' }' 1
awk '{print "radians = " '$rad' }' 1
echo " "
nawk '{print "sin of " '$deg' " = " sin('$rad') }' 1
echo ""
nawk '{printf "sin of %d = %.2f\n", '$deg', sin('$rad') }' 1
echo ""
nawk '{print "cos of " '$deg' " = " cos('$rad') }' 1
echo ""
nawk '{printf "cos of %d = %.2f\n", '$deg', cos('$rad') }' 1
echo ""
##################################################################

To execute (run) the file, type on the command line the filename and any
necessary arguments:

% c.degrees.to.radians 75

degrees = 75
radians = 1.30904
sin of 75 = 0.965937
sin of 75 = 0.97
cos of 75 = 0.258777
cos of 75 = 0.26
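As an aside, the temporary file "1" is only needed because the script feeds
awk line-oriented input. A sketch of an alternative, assuming an awk that
supports the -v option (nawk and later awks do): a BEGIN block runs before
any input is read, so no input file is required at all.

```shell
# Alternative sketch: pass the shell value in with -v and do all the work
# in a BEGIN block, so no temporary input file is needed.
deg=75
awk -v deg="$deg" 'BEGIN {
    pi = 3.1417
    rad = (deg * pi) / 180
    printf "degrees = %d\n", deg
    printf "radians = %.5f\n", rad
    printf "sin of %d = %.2f\n", deg, sin(rad)
    printf "cos of %d = %.2f\n", deg, cos(rad)
}'
# prints:
# degrees = 75
# radians = 1.30904
# sin of 75 = 0.97
# cos of 75 = 0.26
```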