UNIX Tutorial Seven

7.1 Text Processing

Scientists who use UNIX often work with data stored in text files. A common need is to extract particular portions of such a text database or to manipulate the values in it. In essence, one needs a program that permits easy manipulation of structured data and the generation of formatted reports. The awk program is one of the main tools for these tasks.

7.2 Introduction to awk

The awk utility is a pattern scanning and processing program. It searches one or more files for lines that match specified patterns and then performs the associated actions, such as writing the line to the standard output or incrementing a counter each time it finds a match. The capabilities of awk extend the idea of text editing into computation, making it possible to perform a variety of data processing tasks, including analysis, extraction, and reporting of data.

Command line syntax

The awk program has the following command line syntax

% awk options program variables filenames

When using awk, we have to tell it which data we wish to process (i.e., the filename) and how we wish to process it (the program instructions). The program instructions can be given on the command line if they are enclosed in single quotes (e.g., '{print me}'), or they can be read from a file. For this course, our programs will be simple enough that we can specify the instructions on the command line. The options and variables are not required by awk, but we may use them later in the course.
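
As a rough sketch of the second approach, the instructions can be saved in a file and handed to awk with its -f option. The file names sample.txt, prog.awk, and from-file.txt are made up for this illustration, and the commands assume a Bourne-style shell such as sh or bash.

```shell
# Build a two-line sample input (sample.txt is a hypothetical file name).
printf '%s\n' '0 17.8 -1.1' '20 14.8 -36.5' > sample.txt

# Store the program instructions in a file (prog.awk is also hypothetical).
printf '%s\n' '{ print $2 }' > prog.awk

# Read the program from prog.awk instead of the command line.
# This has the same effect as: awk '{ print $2 }' sample.txt
awk -f prog.awk sample.txt > from-file.txt
cat from-file.txt
```

Keeping the program in a file is mostly useful once it grows beyond a line or two.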

Program Instructions

An awk program consists of one or more program lines containing a pattern and/or action in the following format:

pattern { action }

The pattern selects lines from the input file. The awk utility performs the action on all lines that the pattern selects. You must enclose the action within braces so that awk can differentiate it from the pattern.

To start, awk compares the first line in the input file with each pattern in the program. If a pattern selects a line (if there is a match), awk takes the action associated with the pattern. If the line is not selected, awk takes no action. When awk has completed its comparisons for the first line of the input file, it repeats the process for the next line of input. It continues this process, comparing subsequent lines, until it has read all of the input files.
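
The following sketch illustrates this line-by-line process with a program that has two pattern and action pairs. The file names sample.txt and rules.txt and the message text are invented for the example; the numbers are the first three columns of a few paleomag.txt entries. A Bourne-style shell (sh or bash) is assumed.

```shell
# Three sample lines: age, inclination, declination.
printf '%s\n' '0 17.8 -1.1' '20 14.8 -36.5' '116 -59.7 128.5' > sample.txt

# Two program lines. awk tests every input line against both patterns,
# so a line that satisfies both triggers both actions, in program order.
awk '
$1 > 10 { print $1, "is older than 10" }
$2 < 0  { print $1, "has negative inclination" }
' sample.txt > rules.txt
cat rules.txt
```

The first input line matches neither pattern and produces nothing, the second matches only the first pattern, and the third matches both.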

Two rules apply when either the pattern or the action is omitted. If a program line does not contain a pattern, awk selects all lines in the input file. If the program line does not contain an action, awk copies the selected lines to its standard output (this is usually the display, if you haven't redirected the output to another program or to a file).
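
A minimal sketch of these two rules, assuming a Bourne-style shell and using made-up file names (sample.txt, pattern-only.txt, action-only.txt):

```shell
# Three sample lines: age, inclination, declination.
printf '%s\n' '0 17.8 -1.1' '20 14.8 -36.5' '116 -59.7 128.5' > sample.txt

# Pattern only: the selected lines are copied unchanged.
awk '$1>10' sample.txt > pattern-only.txt

# Action only: the action is applied to every line.
awk '{print $1}' sample.txt > action-only.txt

cat pattern-only.txt action-only.txt
```

The first command keeps only the two lines whose first column is greater than 10; the second prints the first column of all three lines.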

7.3 awk Examples

To begin using awk, we first need to find a file that we can use as our text database. For this example, we will use a short text database of paleomagnetic measurements. Create a new directory act3 in your groupwork directory for our awk work. Then copy the paleomag.txt file from /usr/local/text/paleomag.txt to your act3 directory (Hint: remember cp requires both a source file and target location). Using cat or more you should be able to see that the file looks like this

AGE INC DEC LAT LON PLAT PLONG UNIT PUB-YEAR
0 17.8 -1.1 12 75 59.1 274.5 Current Mag 2005
20 14.8 -36.5 34.3 73.9 57.5 335 Basal Murree Formation 1986
65.5 -24.9 -23.6 20 75 36.9 281.3 Deccan Traps Overall Result 1991 1991
116 -59.7 128.5 25 87.4 9.3 304.8 Rajmahal Traps 1996
177 -64.6 -69.7 23.8 84.1 3.4 317.2 Barakar Formation 1980

This is a database of five important magnetic field measurements from rocks sampled in India, ranging in age from recently formed rocks to rocks nearly 200 million years old. This file is a common form of text database in which the different pieces of information are listed in different columns. The first line is the "header", which explains what each column represents. Each line after that represents a different data point, in this case a paleomagnetic measurement that includes declination and inclination. We can use this file to illustrate several ways to use awk.

Pattern matching

Let's first illustrate how to match a text pattern by searching our database for a measurement with an age of 20 million years. To match the text that represents that age, we write the pattern as /20/ in awk

% awk '/20/' paleomag.txt

0 17.8 -1.1 12 75 59.1 274.5 Current Mag 2005
20 14.8 -36.5 34.3 73.9 57.5 335 Basal Murree Formation 1986
65.5 -24.9 -23.6 20 75 36.9 281.3 Deccan Traps Overall Result 1991 1991

Since we specified only a pattern and not an action, the output shows the complete lines that contain the text 20. Notice that awk finds three lines that contain 20, but 20 is in the age column for only one of them. In the first case, 20 is found in the publication year (2005), and in the last case, 20 is found in the latitude of the rock location. If we want to select only lines with 20 in the age column, we can use the other way to specify pattern matching in awk

% awk '$1==20' paleomag.txt

20 14.8 -36.5 34.3 73.9 57.5 335 Basal Murree Formation 1986

This takes advantage of awk's ability to naturally break up a line into columns that are separated by whitespace (spaces or tabs). Within awk we can refer to these columns by using the $ character and then the number of the column we are interested in. In our example, we use the == characters to check if the age information in the first column ($1) is equal to 20.
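
As a small illustration of this column splitting, the sketch below feeds awk two lines whose columns are separated by a single space, several spaces, or a tab. The file names spacing.txt and spacing-out.txt are invented, a Bourne-style shell is assumed, and the action print $1,$3 simply writes the first and third columns.

```shell
# Columns separated by one space, several spaces, or a tab (\t).
printf '20 14.8 -36.5\n116    -59.7\t128.5\n' > spacing.txt

# $1 and $3 refer to the first and third columns on each line,
# however the columns happen to be spaced.
awk '{print $1,$3}' spacing.txt > spacing-out.txt
cat spacing-out.txt
```

Any run of spaces or tabs counts as one separator, so the column numbers stay the same on both lines.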

If we are trying to match specific text in one of the columns, like the UNIT information, we need to put quotation marks around the text to tell awk to look for that exact text. For example

% awk '$8=="Deccan"' paleomag.txt
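
One consequence worth sketching: because columns are split at whitespace, the eighth column holds only the first word of a multi-word unit name. The example below uses two lines copied from paleomag.txt; the file names units.txt, deccan.txt, and none.txt are made up, and a Bourne-style shell is assumed.

```shell
# Two entries copied from paleomag.txt (units.txt is a hypothetical name).
cat > units.txt << 'EOF'
65.5 -24.9 -23.6 20 75 36.9 281.3 Deccan Traps Overall Result 1991 1991
116 -59.7 128.5 25 87.4 9.3 304.8 Rajmahal Traps 1996
EOF

# Column 8 is the single word "Deccan", so this selects the first line.
awk '$8=="Deccan"' units.txt > deccan.txt

# No column can contain a space, so this comparison selects nothing.
awk '$8=="Deccan Traps"' units.txt > none.txt

cat deccan.txt
```

Here deccan.txt receives the Deccan Traps line and none.txt stays empty.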

The main comparison operators for awk are

Operator    Meaning
<           less than
<=          less than or equal to
==          equal to
!=          not equal to
>=          greater than or equal to
>           greater than
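
To give a feel for these operators, here is a short sketch that applies two of them to the age column. It uses only the first two columns of the paleomag.txt entries, without the header line; the file names ages.txt, older.txt, and from-deccan.txt are invented, and a Bourne-style shell is assumed.

```shell
# Ages (column 1) and inclinations (column 2), with no header line.
printf '%s\n' '0 17.8' '20 14.8' '65.5 -24.9' '116 -59.7' '177 -64.6' > ages.txt

# Select measurements older than 100 million years.
awk '$1>100' ages.txt > older.txt

# Select measurements that are 65.5 million years old or older.
awk '$1>=65.5' ages.txt > from-deccan.txt

cat older.txt from-deccan.txt
```

On a file that still has its text header, awk compares a non-numeric column such as AGE as text rather than as a number, so the header line may be selected as well.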

Exercise 7.1

Use awk to select measurements where the inclination (INC) is less than or equal to 0 and send the output to paleomag-negative.txt.

Program Actions

The action portion of an awk command causes awk to do something when it matches a pattern. If you do not specify an action, awk performs the default action, which is the print command. Omitting the action is therefore the same as writing

% awk '{print}' paleomag.txt

AGE INC DEC LAT LON PLAT PLONG UNIT PUB-YEAR
0 17.8 -1.1 12 75 59.1 274.5 Current Mag 2005
20 14.8 -36.5 34.3 73.9 57.5 335 Basal Murree Formation 1986
65.5 -24.9 -23.6 20 75 36.9 281.3 Deccan Traps Overall Result 1991 1991
116 -59.7 128.5 25 87.4 9.3 304.8 Rajmahal Traps 1996
177 -64.6 -69.7 23.8 84.1 3.4 317.2 Barakar Formation 1980

This action copies the line from the input file to awk's standard output. You can follow a print command with arguments, causing awk to print just the arguments you specify. For example, you can print just the inclinations

% awk '{print $2}' paleomag.txt

INC
17.8
14.8
-24.9
-59.7
-64.6

You can print more than one argument with the print command. Unless you separate the items in a print command with commas, awk concatenates them. Commas cause awk to separate the items with a space by default. We can print the longitude and latitude of each rock unit with this command

% awk '{print $5,$4}' paleomag.txt

LON LAT
75 12
73.9 34.3
75 20
87.4 25
84.1 23.8

Additional text can also be added with the print command, which can often help to illustrate more information about the output. In this case, we can specify the directions for the longitude and latitude.

% awk '{print $5,"E",$4,"N"}' paleomag.txt

LON E LAT N
75 E 12 N
73.9 E 34.3 N
75 E 20 N
87.4 E 25 N
84.1 E 23.8 N

Exercise 7.2

What does the output look like if the commas are not included? Does using extra spaces help? Use awk to make a version of the longitude and latitude values where there is no space between the number and direction, but there is a space between the longitude and latitude values. Store the output in a file called paleomag-location.txt.

Arithmetic Functions and Variables

There are many other things that awk can do in the action part of the program, but arithmetic is probably the most important. We can perform mathematical adjustments to any of the input numbers. Say that you would like to output the age and inclination information, but you need the inclination to be in radians. We can do this by multiplying the second column by pi/180 (with pi approximated as 3.14) when we print out the inclination

% awk '{print $1,$2*3.14/180}' paleomag.txt

AGE 0
0 0.310511
20 0.258178
65.5 -0.434367
116 -1.04143
177 -1.12691

Notice that when the mathematical action was performed on the text in the "header" it produced a spurious result. In this case, the word INC was multiplied by 3.14/180. When it performs arithmetic, awk treats text that does not start with a number as zero, so 0 * 3.14 / 180 = 0. We can avoid this problem by using the pattern matching aspect of awk to skip the first line.

% awk 'NR>1{print $1,$2*3.14/180}' paleomag.txt

0 0.310511
20 0.258178
65.5 -0.434367
116 -1.04143
177 -1.12691

NR is a built-in awk variable that holds the number of the current record, which by default is the line number (NR stands for Number of Records). In our example, we tell awk to match lines where the line number is greater than 1, meaning that we want all lines after the first line.
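
To see the values NR takes, this small sketch prints it in front of the first column of each line. The file names nr.txt and numbered.txt are made up, and a Bourne-style shell is assumed.

```shell
# A header line followed by two data lines.
printf '%s\n' 'AGE INC' '0 17.8' '20 14.8' > nr.txt

# Print the record (line) number, then the first column of that line.
awk '{print NR,$1}' nr.txt > numbered.txt
cat numbered.txt
```

The header is record 1, which is why the pattern NR>1 skips it.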

We can also create our own variables in awk. We can achieve a similar result to our last example by setting a variable within the action part of our program. Unfortunately for you, awk is a little different from our shell commands: we do not use set when establishing a variable, we just specify the name and what it is equal to. In this case, we can create a variable called dtr to store the factor used to convert degrees to radians. The other difference from shell variables is that we do not use the $ character when we call variables we create in awk, we just use the name (i.e., dtr)

% awk 'NR>1{dtr=3.14/180; print $1,$2*dtr}' paleomag.txt

0 0.310511
20 0.258178
65.5 -0.434367
116 -1.04143
177 -1.12691

This example also introduces the ; character, which allows us to separate commands within the action part of an awk program. The first part of the action sets the dtr variable, the second part prints the information. We could add additional commands after the second part, but this should be enough of an introduction to awk.
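
As one possible refinement, not used elsewhere in this tutorial, a variable can also hold a more precise value of pi. The sketch below relies on awk's built-in atan2 function, since atan2(0,-1) evaluates to pi. The file names incl.txt and radians.txt and the variable name pi are chosen for the illustration, and a Bourne-style shell is assumed.

```shell
# Header plus the age and inclination of two entries.
printf '%s\n' 'AGE INC' '0 17.8' '20 14.8' > incl.txt

# pi is computed once per line with atan2(0,-1); dtr converts degrees to radians.
awk 'NR>1{pi=atan2(0,-1); dtr=pi/180; print $1,$2*dtr}' incl.txt > radians.txt
cat radians.txt
```

The results differ from the 3.14 version in the fourth significant digit, which may or may not matter depending on the precision your work needs.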

Exercise 7.3

Use awk to calculate the distance in kilometers of each paleopole from the geographic North Pole and store the values in a file called paleomag-distance.txt. Some helpful hints for this calculation: the paleopole latitude is stored in the sixth column of the database, and you can approximate distance on the Earth's surface using the equation for a sphere:

distance = angle (in radians) * radius (of sphere)

Notice that the angle in degrees is the angle difference between the north pole (90) and the paleolatitude listed in the database (90 - paleolat, but don't forget to convert to radians), and the radius of the Earth is 6371 km.

Summary

awk options program variables file    awk command syntax for processing text files
pattern { action }                    format for an awk program text
/text/                                awk program to find lines that match text
print text                            awk command to output text
NR                                    awk variable that stores the line number
$number                               awk character to identify column number of a text file
;                                     character to separate commands within an awk program

 

brudzimr@muohio.edu, 28th August 2006