old-www/HOWTO/Lex-YACC-HOWTO-3.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
 <META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
 <TITLE>Lex and YACC primer/HOWTO: Lex</TITLE>
 <LINK HREF="Lex-YACC-HOWTO-4.html" REL=next>
 <LINK HREF="Lex-YACC-HOWTO-2.html" REL=previous>
 <LINK HREF="Lex-YACC-HOWTO.html#toc3" REL=contents>
</HEAD>
<BODY>
<A HREF="Lex-YACC-HOWTO-4.html">Next</A>
<A HREF="Lex-YACC-HOWTO-2.html">Previous</A>
<A HREF="Lex-YACC-HOWTO.html#toc3">Contents</A>
<HR>
<H2><A NAME="s3">3. Lex</A></H2>

<P>The program Lex generates a so called `Lexer'. This is a function that takes
a stream of characters as its input, and whenever it sees a group of
characters that match a key, takes a certain action. A very simple example:
<P>
<BLOCKQUOTE><CODE>
<PRE>
%{
#include &lt;stdio.h>
%}

%%
stop    printf("Stop command received\n");
start   printf("Start command received\n");
%%
</PRE>
</CODE></BLOCKQUOTE>
<P>The first section, in between the %{ and %} pair is included directly in the
output program. We need this, because we use printf later on, which is
defined in stdio.h.
<P>Sections are separated using '%%', so the first line of the second section
starts with the 'stop' key. Whenever the 'stop' key is encountered in the
input, the rest of the line (a printf() call) is executed.
<P>Besides 'stop', we've also defined 'start', which otherwise does mostly the
same.
<P>We terminate the code section with '%%' again.
<P>To compile Example 1, do this:
<BLOCKQUOTE><CODE>
<PRE>
lex example1.l
cc lex.yy.c -o example1 -ll
</PRE>
</CODE></BLOCKQUOTE>
<P>
<BLOCKQUOTE><CODE>
NOTE: If you are using flex, instead of lex, you may have to change '-ll'
to '-lfl' in the compilation scripts. RedHat 6.x and SuSE need this, even when
you invoke 'flex' as 'lex'!
</CODE></BLOCKQUOTE>
<P>This will generate the file 'example1'. If you run it, it waits for you to
type some input. Whenever you type something that is not matched by any of
the defined keys (ie, 'stop' and 'start') it's output again. If you
enter 'stop' it will output 'Stop command received';
<P>Terminate with a EOF (^D).
<P>You may wonder how the program runs, as we didn't define a main() function.
This function is defined for you in libl (liblex) which we compiled in with
the -ll command.
<P>
<H2><A NAME="ss3.1">3.1 Regular expressions in matches</A>
</H2>

<P>This example wasn't very useful in itself, and our next one won't be either.
It will however show how to use regular expressions in Lex, which are
massively useful later on.
<P>Example 2:
<BLOCKQUOTE><CODE>
<PRE>
%{
#include &lt;stdio.h>
%}

%%
[0123456789]+           printf("NUMBER\n");
[a-zA-Z][a-zA-Z0-9]*    printf("WORD\n");
%%
</PRE>
</CODE></BLOCKQUOTE>
<P>This Lex file describes two kinds of matches (tokens): WORDs and NUMBERs.
Regular expressions can be pretty daunting but with only a little work it is
easy to understand them. Let's examine the NUMBER match:
<P>[0123456789]+
<P>This says: a sequence of one or more characters from the group 0123456789.
We could also have written it shorter as:
<P>[0-9]+
<P>Now, the WORD match is somewhat more involved:
<P>[a-zA-Z][a-zA-Z0-9]*
<P>The first part matches 1 and only 1 character that is between 'a' and 'z',
or between 'A' and 'Z'. In other words, a letter. This initial letter then
needs to be followed by zero or more characters which are either a letter or
a digit. Why use an asterisk here? The '+' signifies 1 or more matches, but
a WORD might very well consist of only one character, which we've already
matched. So the second part may have zero matches, so we write a '*'.
<P>This way, we've mimicked the behaviour of many programming languages which
demand that a variable name *must* start with a letter, but can contain
digits afterwards. In other words, 'temperature1' is a valid name,
but '1temperature' is not.
<P>Try compiling Example 2, lust like Example 1, and feed it some text. Here is
a sample session:
<P>
<BLOCKQUOTE><CODE>
<PRE>
$ ./example2
foo
WORD

bar
WORD

123
NUMBER

bar123
WORD

123bar
NUMBER
WORD
</PRE>
</CODE></BLOCKQUOTE>
<P>You may also be wondering where all this whitespace is coming from in the
output. The reason is simple: it was in the input, and we don't match on it
anywhere, so it gets output again.
<P>The Flex manpage documents its regular expressions in detail. Many people
feel that the perl regular expression manpage (perlre) is also very useful,
although Flex does not implement everything perl does.
<P>Make sure that you do not create zero length matches like '[0-9]*' - your
lexer might get confused and start matching empty strings repeatedly.
<H2><A NAME="ss3.2">3.2 A more complicated example for a C like syntax</A>
</H2>

<P>Let's say we want to parse a file that looks like this:
<BLOCKQUOTE><CODE>
<PRE>
logging {
        category lame-servers { null; };
        category cname { null; };
};

zone "." {
        type hint;
        file "/etc/bind/db.root";
};
</PRE>
</CODE></BLOCKQUOTE>
<P>We clearly see a number of categories (tokens) in this file:
<UL>
<LI> WORDs, like 'zone' and 'type'</LI>
<LI> FILENAMEs, like '/etc/bind/db.root'</LI>
<LI> QUOTEs, like those surrounding the filename</LI>
<LI> OBRACEs, {</LI>
<LI> EBRACEs, }</LI>
<LI> SEMICOLONs, ;</LI>
</UL>
<P>The corresponding Lex file is Example 3:
<BLOCKQUOTE><CODE>
<PRE>
%{
#include &lt;stdio.h>
%}

%%
[a-zA-Z][a-zA-Z0-9]*    printf("WORD ");
[a-zA-Z0-9\/.-]+        printf("FILENAME ");
\"                      printf("QUOTE ");
\{                      printf("OBRACE ");
\}                      printf("EBRACE ");
;                       printf("SEMICOLON ");
\n                      printf("\n");
[ \t]+                  /* ignore whitespace */;
%%
</PRE>
</CODE></BLOCKQUOTE>
<P>When we feed our file to the program this Lex file generates (using
example3.compile), we get:
<P>
<BLOCKQUOTE><CODE>
<PRE>
WORD OBRACE
WORD FILENAME OBRACE WORD SEMICOLON EBRACE SEMICOLON
WORD WORD OBRACE WORD SEMICOLON EBRACE SEMICOLON
EBRACE SEMICOLON

WORD QUOTE FILENAME QUOTE OBRACE
WORD WORD SEMICOLON
WORD QUOTE FILENAME QUOTE SEMICOLON
EBRACE SEMICOLON
</PRE>
</CODE></BLOCKQUOTE>
<P>When compared with the configuration file mentioned above, it is clear that
we have neatly 'Tokenized' it. Each part of the configuration file has been
matched, and converted into a token.
<P>And this is exactly what we need to put YACC to good use.
<P>
<H2><A NAME="ss3.3">3.3 What we've seen</A>
</H2>

<P>We've seen that Lex is able to read arbitrary input, and determine what each
part of the input is. This is called 'Tokenizing'.
<P>
<HR>
<A HREF="Lex-YACC-HOWTO-4.html">Next</A>
<A HREF="Lex-YACC-HOWTO-2.html">Previous</A>
<A HREF="Lex-YACC-HOWTO.html#toc3">Contents</A>
</BODY>
</HTML>