old-www/LDP/LG/issue67/nazario.html

<!--startcut  ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>An Introduction to awk LG #67</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->

<CENTER>
<A HREF="http://www.linuxgazette.com/">
<H1><IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.png"
	WIDTH="600" HEIGHT="124" border="0"></H1></A>
<BR>

<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="gilliam.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue67/nazario.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="nazario2.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
<P>
</CENTER>

<!--endcut ============================================================-->

<H4 ALIGN="center">
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>

<P> <HR> <P>
<!--===================================================================-->

<center>
<H1><font color="maroon">An Introduction to awk</font></H1>
<H4>By <a href="mailto:jose@cwru.edu">Jose Nazario</a></H4>
</center>
<P> <HR> <P>

<!-- END header -->


<h3>
Abstract</h3>
<p>
The awk programming language often gets overlooked in the face of Perl,
which is a more capable language. However, awk is found even more
ubiquitously than Perl, has a less steep learning curve than Perl, and can
be used just about everywhere in system monitoring scripts where
efficiency is key. This brief tutorial is designed to help you get started
in awk programming.
<p>
<h3>
The Basics</h3>
<p>

The awk language is a small, C style language which was designed for the
processing of regularly formatted text. This usually includes database
dumps and system log files. It's built around regular expressions and
pattern handling, much like Perl is. In fact, Perl is considered to be a
grandchild of the awk language.
<p>
The funny name of the awk language is due to the names of its original
authors, who were Alfred V. Aho, Brian W. Kernighan, and Peter J.
Weinberger. Most of you will recognize the name of Kernighan, one of the
fathers of the C programming language and a major force in the UNIX world.
<p><h3>
Using awk in a One Liner</h3><p>

This is how I began using awk, to print specific fields in output. This
would work surprisingly well, but the efficiency went through the floor
when I was writing large scripts that took minutes to complete.
<p>

But, here you go, this can be useful sometimes:
<p>
<pre>
ls -l /tmp/foobar | awk '{print $1"\t"$9}'
</pre>
<p>
What this will do is take some input like this:
<pre>

-rw-rw-rw-   1 root     root            1 Jul 14  1997 tmpmsg
</pre><p>

and will generate some output like this:
<pre>

-rw-rw-rw-      tmpmsg
</pre>

<p>
Quite intuitive what it just did, it printed only the first and ninth
fields. Now you can see why it's so popular for one line data extraction.
But, let's move on to a full fledged awk program.
<p>
<h3>
An awk Program Structure</h3><p>

One of my favorite things about awk is the amazing readability of it,
despite it's power compared to Perl or Python. Every awk program has three
parts: a BEGIN block, which is executed once before any input is read; a
main loop which is executed for every line of input; and an END block,
which is executed after all of the input is read. Quite intuitive! Yes,
I'll keep saying that about awk, because i find it to be very true.
<p>

This is a very simple awk program highlighting some of the features of the
language. See if you can pick out what is happening before we dissect it:
<pre>

#!/usr/bin/awk -f
#
# check the sulog for failures..
# copyright 2001 (c) jose nazario
#
# works for Solaris, IRIX and HPUX 10.20
BEGIN {
  print "--- checking sulog"
  failed=0
  }
{
  if ($4 == "-") {
    print "failed su:\t"$6"\tat\t"$2"\t"$3
    failed=failed+1
    }
}
END {
  print "---------------------------------------"
  printf("\ttotal number of records:\t%d\n", NR)
  printf("\ttotal number of failed su's:\t%d\n",failed)
}
</pre><p>

Have you figured it out yet? Would it help to know the format of a typical
line of the input file (sulog, from, say, IRIX)? Here's a typical pair of
lines:
<pre>

        SU 01/30 13:15 - ttyq1 jose-root
        SU 01/30 13:15 + ttyq1 jose-root

</pre>
OK, read up and see if you can figure out the script. The BEGIN block sets
everything up, printing out a header and initializing our one variable (in
this case failed) to zero. The main loop then reads each line of input
(which is the sulog file, a log of su attempts) and compares field four
against the minu sign. If they match, it is because the attempt failed, so
we increment out counter by one and note which attempt failed and when. At
the end final tallies are presented, showing the total number of lines of
input as the number of records (NR, an internal awk variable) and the
number of failed su attempts we noted. Output looks like this:
<pre>

failed su:      jose-root       at      01/30   13:15
        ---------------------------------------
        total number of records:        272
        total number of failed su's:    73

</pre>
You should also be able to see how printf works, almost exactly like the
printf does in C. In short, awk is a rather intuitive language.
<p>

By default the field separator is whitespace, but you can tweak that. In
password files I set it to be a colon. This small script looks for users
with an ID of 0 (root equivilent) and no passwords:
<pre>

#!/usr/bin/awk -f
BEGIN { FS=":" }
{
  if ($3 == 0) print $1
  if ($2 == "") print $1
}
</pre>
<p>

Other internals from awk to know and use are RS for record separator
(defaults to a newline, or \n), OFS for output field separator (defaults
to nothing, I think) and ORS (defaults to a newline), for output record
separator. These can all be set within the script, of course.

<p>

<h3> Regular Expressions</h3><p>

The awk language matches normal regular expressions that you have come to
know and love, and does so better than grep. For instance, I use the
following awk search pattern to look for the presence of a likely exploit
on Intel Linux systems:
<pre>

#!/usr/bin/awk -f
{ if ($0 ~ /\x90/) print "exploit at line " NR }
</pre><p>

You can't look for hex value 0x90 in grep, but 0x90 is popular in Intel
exploits -- its the NOP call, which is used as padding in shellcode
portions.
<p>
You can look for hex values using \xdd, where dd is the hex number to look
for; you can look for decimal (ie ASCII) values by looking for \ddd, using
the decimal value, and regular expressions based on text will, of course,
work.
<p>
<h3>Random awk bits</h3><p>

Random numbers in awk are readily generated, but there is an interesting
caveat. The rand() function does exactly what you would expect it to, it
returns a random number, in this case between 0 and 1. You can scale it,
of course, to get larger values. Here's some example code to show you
this, as well as an interesting bit of behavior:

<pre>
#!/usr/bin/awk -f
{
  for(i=1;i<=10;i++)
  print rand(); exit
}
</pre>

Run that a couple of times and you will see a problem: the random numbers
are hardly random, they repeat every time you run it!
<p>
So what's the problem? Well we didn't seed the random number generator.
Normally, we're used to our random number generator pulling entropy from a
good source, like (in Linux) /dev/random. However, awk doesn't do this. To
really get random numbers, we should seed our random number generator.
This improved code will do this:

<pre>
#!/usr/bin/awk -f
BEGIN {
  srand()
}
{
  for(i=1;i<=10;i++)
  print rand(); exit
}
</pre>

The seeding of the random number generator in the BEGIN block is what does
the trick. The function srand() can take an argument, and in the absence
of one the current date and time is used to seed the generator. <em>Note
that the same seed will always produce the same 'random' sequence.</em>

<p><h3>
Conclusion</h3><p>

This isn't the most detailed intro to awk you will find, but I hope that
it is more clear to you how to use awk in a program setting. Myself, I'm
quite happy programming in awk, and I've got a lot more to learn.
<p>

We haven't even touched upon arrays, self built functions or other complex
language features, but suffice it to say awk is hardly Perl's little
brother.
<p>

Go forth and awk!
<h3>

Resources</h3><p>

Kernighan's homepage contains a list of good awk books as well as the
source for the 'one true awk', aka "nawk". It also contains a host of
other interesting links and information from Kernighan.<p>
<A HREF="http://cm.bell-labs.com/who/bwk/">http://cm.bell-labs.com/who/bwk/</A>

<p>
The standard awk implementation, nawk (for "new awk", as opposed to the "old
awk, sometimes found as 'oawk' for compatability), is based on the POSIX
awk definitions, and contains a few functions that were introduced by two
other awk implementations, gawk and mawk. I usually keep this one around
as 'nawk' and use it to test the portability of my awk scripts. This one is
usually found on my commercial UNIX machines, where I often don't have
gawk installed.<p>
Source for nawk: <A HREF="http://cm.bell-labs.com/who/bwk/awk.tar.gz">http://cm.bell-labs.com/who/bwk/awk.tar.gz</A>

<p>
The GNU project's awk, gawk, is also based on the POSIX awk standard, but
adds a significant number of useful features, as well. These include command
line features like 'lint' checking and reversion to struct POSIX mode. My
favorite feature in gawk is the line breaks, using '\', and the extended
regular expressions. The gawk documentation has a complete discussion of
GNU extensions to the awk language. This is also the standard awk on Linux
and BSD systems.
<p>
Source for gawk: <A HREF="ftp://gnudist.gnu.org/gnu/gawk/gawk-3.0.6.tar.gz">ftp://gnudist.gnu.org/gnu/gawk/gawk-3.0.6.tar.gz</A> (the GNU Project's version of awk)

<p>
This is perhaps the most popular book on these two small programs, and is
highly regarded. It contains, among other things, a discussion of popular
awk implementations (ie gawk, nawk, mawk), a great selection of functions and
the usual O'Reilly readability. The awk homepage lists several other books
on the awk programming language, though this one remains my favorite.
<p>
The sed &amp; awk book: <A HREF="http://www.oreilly.com/catalog/sed2">http://www.oreilly.com/catalog/sed2</A>


<!-- *** BEGIN bio *** -->
<SPACER TYPE="vertical" SIZE="30">
<P>
<H4><IMG ALIGN=BOTTOM ALT="" SRC="../gx/note.gif">Jose Nazario</H4>
<CITE>Jos&eacute; is a Ph.D. student in the department of biochemistry at Case
Western Reserve University in Cleveland, OH. He has been using UNIX for
nearly ten years, and Linux since kernels 1.2.</CITE>

<!-- *** END bio *** -->

<!-- *** BEGIN copyright *** -->
<P> <hr> <!-- P -->
<H5 ALIGN=center>

Copyright &copy; 2001, Jose Nazario.<BR>
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
Published in Issue 67 of <i>Linux Gazette</i>, June 2001</H5>
<!-- *** END copyright *** -->

<!--startcut ==========================================================-->
<HR><P>
<CENTER>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="gilliam.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue67/nazario.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="nazario2.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
</CENTER>
</BODY></HTML>
<!--endcut ============================================================-->