AWK: a precursor to RegEx

By lingy | Lingy | 22 Aug 2022

AWK is a text-processing language created for early versions of Unix. You can think of it as a grandfather of today's RegEx-based text tools: you write short scripts that search the lines of a file for the text you want and then filter them.

The first version of the language appeared in 1977 as a scripting language for text processing. It made shell scripts more powerful and offered new functionality, and it went on to influence several other languages, such as Perl, Lua and newer versions of the shell.

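The language was also revised in the 1980s. AWK accepted regular expressions as patterns from the start; the 1985 revision added, among other things, user-defined functions and regular expressions computed at run time.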

As the original Unix gave way to GNU/Linux, the BSDs and other variants, the original AWK became a thing of the past. However, mainly to keep the scripts of Unix users backward compatible, several AWK implementations appeared. The most popular are:

BWK: the oldest of these interpreters, with direct involvement from the original creators of AWK, mainly Brian Kernighan. This version is used on FreeBSD, OpenBSD, NetBSD and macOS (formerly Mac OS X);
GAWK: GNU AWK is the most popular interpreter today. It ships by default with most Linux distros and is available in the repositories of almost all the others. It is still widely used in the community to find text patterns with complex functions, and it also includes features for working over TCP/IP networks;
TAWK: Thompson AWK is an AWK compiler for DOS, Solaris, OS/2 and Windows. Formerly sold by Thompson Automation Software, today you can get it for free from the official website. Despite claiming to offer a Windows version, it is only officially compatible up to Windows XP, so bugs may occur on later versions of Windows;
AWKA: an AWK compiler that translates the script to C and then compiles it, so long scripts run considerably faster than they would under an interpreter. It remains compatible with GAWK 3.1.0 and has native functions for TCP/IP network interfaces and the like. You can download it from the official website.

Hello World

There are two ways to run AWK. One is to put the whole program inline, on a single line, and run it directly from the terminal. The other is to put the program in a file and run that file.

Directly in the terminal

The first way is to run it directly in the terminal. Run the following command:

awk 'BEGIN{print "Hello World!"}'

This will print Hello World! in the terminal. Simple, don't you agree?

In a file

BEGIN {
     print "Hello World!";
}

Save the code above as hello.awk. You can run it in two ways. First, pass the file to awk with the following command:

awk -f hello.awk

And that's it, you'll have Hello World written on the screen.

In the second (and most common) way, you add a hashbang as the first line of the file:

#!/bin/gawk -f

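This tells the system which interpreter will run the script. The path must match where gawk is installed on your machine; on many distros it is /usr/bin/gawk rather than /bin/gawk. Save the file and run the following command: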

chmod +x hello.awk

This adds execute permission to the hello.awk file. Now just run:

./hello.awk

And that's it, you'll have Hello World written on the screen.
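
Because the hashbang only works when its path matches the machine, here is a minimal sketch that looks up the interpreter and writes the hashbang accordingly. It assumes a POSIX shell with mktemp available; the scratch directory and the fallback to plain awk when gawk is missing are choices made for this example only.

```shell
# Use a scratch directory so nothing in the current folder is touched.
workdir=$(mktemp -d)

# Look up the interpreter; fall back to plain awk when gawk is not installed.
interp=$(command -v gawk || command -v awk)

# Write hello.awk with a hashbang that matches this machine.
printf '#!%s -f\nBEGIN {\n    print "Hello World!";\n}\n' "$interp" > "$workdir/hello.awk"
chmod +x "$workdir/hello.awk"

# Run the script directly, exactly like ./hello.awk above.
greeting=$("$workdir/hello.awk")
printf '%s\n' "$greeting"
```

If you only want to know the path, running command -v gawk on its own is enough.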

Practical use - Listing only specific data

Let's work with the gender_submission.csv file from a Kaggle Titanic dataset (available here). Let's start by printing all the lines of the file:

cat gender_submission.csv | awk '{print $0;}'

For this file you will need to change the default field separator. By default AWK splits fields on whitespace (spaces and tabs), but our file uses the CSV convention, which is a comma. How do we change it? Simple: let's use the BEGIN block. The BEGIN block runs once, before any input is read, while the block that follows it runs once per line. So let's set the FS variable, which defines the field separator, before the rest of the code runs:

cat gender_submission.csv | awk 'BEGIN{FS=",";} {print $0;}'

Want to know if it worked? Since $0 is the whole line, the output looks the same as before. So how about printing the two fields with two arrows between them?

cat gender_submission.csv | gawk 'BEGIN{FS=",";}{print $1 " → → " $2}'
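
Another way to check the separator is to ask AWK how many fields it found on each line, using the built-in NF and NR variables. The sketch below uses a made-up three-line sample instead of the real file, and plain awk so it runs on any system. It also shows the order in which the BEGIN block, the per-line block and the END block run.

```shell
# Hypothetical three-line sample standing in for gender_submission.csv.
sample='PassengerId,Survived
892,0
893,1'

# BEGIN runs once before any input, the middle block once per line,
# and END once after the last line. NR is the line number, NF the field count.
result=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = ","; print "start" }
      { print NR ": " NF " fields" }
END   { print "end after " NR " lines" }')

printf '%s\n' "$result"
```

With the comma as separator every line reports 2 fields. If you remove the FS assignment, each line reports 1 field, because there is no whitespace to split on.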

Our aim here is to list only the IDs, and only of the people who survived. How can we do that? The answer is: by adding an if conditional. When field 2 is equal to 1, the passenger survived, and in that case we show the ID on the screen. Our code looks like this:

cat gender_submission.csv | gawk 'BEGIN{FS=",";}{if ($2 == 1) print($1);}'

And that's it, you'll have a list of desired IDs. Simple, no?
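
AWK programs are often written without the explicit if, by placing the condition in front of the block. Here is a minimal sketch of that form, run on a made-up sample (the IDs and values are illustrative, not taken from the real dataset) and using plain awk:

```shell
# Hypothetical sample data; the values are made up for this example.
sample='PassengerId,Survived
892,0
893,1
894,0
895,1'

# A condition placed in front of a block acts as the filter, so no "if" is needed.
ids=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = "," } $2 == 1 { print $1 }')

# END runs after the last line, which makes it a natural place for a total.
count=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = "," } $2 == 1 { n++ } END { print n + 0 }')

printf '%s\n' "$ids"
printf 'survivors: %s\n' "$count"
```

The header line is skipped automatically, since the text Survived is not equal to 1.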

You can also send the output to a file:

(cat gender_submission.csv | gawk 'BEGIN{FS=",";}{if ($2 == 1) print($1);}') \
>> titanic_survivors_id.txt

Note that I put the output redirection on a separate line in the shell to improve readability. The >> operator appends to the file, so running the command twice will list the IDs twice.
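
For comparison, here is a sketch of the same idea without cat and without the BEGIN block: the -F option sets the field separator from the command line, and awk reads the file given as an argument. The sample file and the scratch directory are made up for this example, so the real CSV is not touched.

```shell
# Scratch directory with a hypothetical sample file.
workdir=$(mktemp -d)
printf 'PassengerId,Survived\n892,0\n893,1\n' > "$workdir/sample.csv"

# -F, sets the separator; ">" empties the output file before writing.
awk -F, '$2 == 1 { print $1 }' "$workdir/sample.csv" > "$workdir/ids.txt"

# ">>" appends, so the same ID is now written a second time.
awk -F, '$2 == 1 { print $1 }' "$workdir/sample.csv" >> "$workdir/ids.txt"

lines=$(wc -l < "$workdir/ids.txt" | tr -d ' ')
printf 'ids.txt now has %s lines\n' "$lines"
```

Use > when you want a fresh file on every run and >> when you want to keep what is already there.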

Improving usage

You can also do the same thing by running the commands from a file. How do we do this? Come with me and I'll show you.

First write all your code inside a gs.awk file:

BEGIN {
    FS=",";
}
{
    if ($2 == 1)
        print($1);
}

Save it in the same folder as the CSV file. Now you can run the command like this:

cat gender_submission.csv | gawk -f gs.awk >> titanic_survivors_id.txt

We can simplify it even further using the hashbang. Just insert the following line at the top of the file:

#!/bin/gawk -f

Your code will look like this:

#!/bin/gawk -f
BEGIN {
    FS=",";
}
{
    if ($2 == 1)
        print($1);
}

Now save the file and give it execute permission:

chmod +x gs.awk

And then your command will look like this:

cat gender_submission.csv | ./gs.awk >> titanic_survivors_id.txt

Of course, there are countless other improvements we could make, like printing the lines directly to the correct file, but for an introduction this is already quite interesting, don't you agree?
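
As a taste of that improvement, here is a minimal sketch where the script itself writes to the file, with no shell redirection. The variable name out, the sample data and the scratch directory are assumptions made for this example.

```shell
# Scratch directory with a hypothetical sample file.
workdir=$(mktemp -d)
printf 'PassengerId,Survived\n892,0\n893,1\n894,0\n' > "$workdir/sample.csv"

# -v passes the output path into the script as the variable "out".
# Inside AWK, "print ... > file" writes to that file instead of the screen.
awk -v out="$workdir/survivors.txt" '
BEGIN   { FS = "," }
$2 == 1 { print $1 > out }' "$workdir/sample.csv"

saved=$(cat "$workdir/survivors.txt")
printf '%s\n' "$saved"
```

Inside AWK, > empties the file only when it is first opened, and every later print in the same run is added to it.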

Interesting projects

If you want to study further, there are several complex projects that push the limits of the language, as well as comprehensive tutorials that explore specific details of it. Here are some cool repositories:

AWK Raycaster: a DOOM-style game created to run in the terminal
JSON.awk: a JSON reader written in AWK
Opera Bookmarks: converts Chromium and derivative bookmarks data to SQLite and CSV
AWKLISP: LISP parser written in AWK
learn_gnuawk: AWK tutorial
AHO: a complete implementation of Git written in AWK