211 lines
6.6 KiB
HTML
211 lines
6.6 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
|
|
<TITLE>Lex and YACC primer/HOWTO: Lex</TITLE>
|
|
<LINK HREF="Lex-YACC-HOWTO-4.html" REL=next>
|
|
<LINK HREF="Lex-YACC-HOWTO-2.html" REL=previous>
|
|
<LINK HREF="Lex-YACC-HOWTO.html#toc3" REL=contents>
|
|
</HEAD>
|
|
<BODY>
|
|
<A HREF="Lex-YACC-HOWTO-4.html">Next</A>
|
|
<A HREF="Lex-YACC-HOWTO-2.html">Previous</A>
|
|
<A HREF="Lex-YACC-HOWTO.html#toc3">Contents</A>
|
|
<HR>
|
|
<H2><A NAME="s3">3. Lex</A></H2>
|
|
|
|
<P>The program Lex generates a so called `Lexer'. This is a function that takes
|
|
a stream of characters as its input, and whenever it sees a group of
|
|
characters that match a key, takes a certain action. A very simple example:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
%{
|
|
#include <stdio.h>
|
|
%}
|
|
|
|
%%
|
|
stop printf("Stop command received\n");
|
|
start printf("Start command received\n");
|
|
%%
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>The first section, in between the %{ and %} pair is included directly in the
|
|
output program. We need this, because we use printf later on, which is
|
|
defined in stdio.h.
|
|
<P>Sections are separated using '%%', so the first line of the second section
|
|
starts with the 'stop' key. Whenever the 'stop' key is encountered in the
|
|
input, the rest of the line (a printf() call) is executed.
|
|
<P>Besides 'stop', we've also defined 'start', which otherwise does mostly the
|
|
same.
|
|
<P>We terminate the code section with '%%' again.
|
|
<P>To compile Example 1, do this:
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
lex example1.l
|
|
cc lex.yy.c -o example1 -ll
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
NOTE: If you are using flex, instead of lex, you may have to change '-ll'
|
|
to '-lfl' in the compilation scripts. RedHat 6.x and SuSE need this, even when
|
|
you invoke 'flex' as 'lex'!
|
|
</CODE></BLOCKQUOTE>
|
|
<P>This will generate the file 'example1'. If you run it, it waits for you to
|
|
type some input. Whenever you type something that is not matched by any of
|
|
the defined keys (ie, 'stop' and 'start') it's output again. If you
|
|
enter 'stop' it will output 'Stop command received';
|
|
<P>Terminate with a EOF (^D).
|
|
<P>You may wonder how the program runs, as we didn't define a main() function.
|
|
This function is defined for you in libl (liblex) which we compiled in with
|
|
the -ll command.
|
|
<P>
|
|
<H2><A NAME="ss3.1">3.1 Regular expressions in matches</A>
|
|
</H2>
|
|
|
|
<P>This example wasn't very useful in itself, and our next one won't be either.
|
|
It will however show how to use regular expressions in Lex, which are
|
|
massively useful later on.
|
|
<P>Example 2:
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
%{
|
|
#include <stdio.h>
|
|
%}
|
|
|
|
%%
|
|
[0123456789]+ printf("NUMBER\n");
|
|
[a-zA-Z][a-zA-Z0-9]* printf("WORD\n");
|
|
%%
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>This Lex file describes two kinds of matches (tokens): WORDs and NUMBERs.
|
|
Regular expressions can be pretty daunting but with only a little work it is
|
|
easy to understand them. Let's examine the NUMBER match:
|
|
<P>[0123456789]+
|
|
<P>This says: a sequence of one or more characters from the group 0123456789.
|
|
We could also have written it shorter as:
|
|
<P>[0-9]+
|
|
<P>Now, the WORD match is somewhat more involved:
|
|
<P>[a-zA-Z][a-zA-Z0-9]*
|
|
<P>The first part matches 1 and only 1 character that is between 'a' and 'z',
|
|
or between 'A' and 'Z'. In other words, a letter. This initial letter then
|
|
needs to be followed by zero or more characters which are either a letter or
|
|
a digit. Why use an asterisk here? The '+' signifies 1 or more matches, but
|
|
a WORD might very well consist of only one character, which we've already
|
|
matched. So the second part may have zero matches, so we write a '*'.
|
|
<P>This way, we've mimicked the behaviour of many programming languages which
|
|
demand that a variable name *must* start with a letter, but can contain
|
|
digits afterwards. In other words, 'temperature1' is a valid name,
|
|
but '1temperature' is not.
|
|
<P>Try compiling Example 2, lust like Example 1, and feed it some text. Here is
|
|
a sample session:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
$ ./example2
|
|
foo
|
|
WORD
|
|
|
|
bar
|
|
WORD
|
|
|
|
123
|
|
NUMBER
|
|
|
|
bar123
|
|
WORD
|
|
|
|
123bar
|
|
NUMBER
|
|
WORD
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>You may also be wondering where all this whitespace is coming from in the
|
|
output. The reason is simple: it was in the input, and we don't match on it
|
|
anywhere, so it gets output again.
|
|
<P>The Flex manpage documents its regular expressions in detail. Many people
|
|
feel that the perl regular expression manpage (perlre) is also very useful,
|
|
although Flex does not implement everything perl does.
|
|
<P>Make sure that you do not create zero length matches like '[0-9]*' - your
|
|
lexer might get confused and start matching empty strings repeatedly.
|
|
<H2><A NAME="ss3.2">3.2 A more complicated example for a C like syntax</A>
|
|
</H2>
|
|
|
|
<P>Let's say we want to parse a file that looks like this:
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
logging {
|
|
category lame-servers { null; };
|
|
category cname { null; };
|
|
};
|
|
|
|
zone "." {
|
|
type hint;
|
|
file "/etc/bind/db.root";
|
|
};
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>We clearly see a number of categories (tokens) in this file:
|
|
<UL>
|
|
<LI> WORDs, like 'zone' and 'type'</LI>
|
|
<LI> FILENAMEs, like '/etc/bind/db.root'</LI>
|
|
<LI> QUOTEs, like those surrounding the filename</LI>
|
|
<LI> OBRACEs, {</LI>
|
|
<LI> EBRACEs, }</LI>
|
|
<LI> SEMICOLONs, ;</LI>
|
|
</UL>
|
|
<P>The corresponding Lex file is Example 3:
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
%{
|
|
#include <stdio.h>
|
|
%}
|
|
|
|
%%
|
|
[a-zA-Z][a-zA-Z0-9]* printf("WORD ");
|
|
[a-zA-Z0-9\/.-]+ printf("FILENAME ");
|
|
\" printf("QUOTE ");
|
|
\{ printf("OBRACE ");
|
|
\} printf("EBRACE ");
|
|
; printf("SEMICOLON ");
|
|
\n printf("\n");
|
|
[ \t]+ /* ignore whitespace */;
|
|
%%
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>When we feed our file to the program this Lex file generates (using
|
|
example3.compile), we get:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
WORD OBRACE
|
|
WORD FILENAME OBRACE WORD SEMICOLON EBRACE SEMICOLON
|
|
WORD WORD OBRACE WORD SEMICOLON EBRACE SEMICOLON
|
|
EBRACE SEMICOLON
|
|
|
|
WORD QUOTE FILENAME QUOTE OBRACE
|
|
WORD WORD SEMICOLON
|
|
WORD QUOTE FILENAME QUOTE SEMICOLON
|
|
EBRACE SEMICOLON
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>When compared with the configuration file mentioned above, it is clear that
|
|
we have neatly 'Tokenized' it. Each part of the configuration file has been
|
|
matched, and converted into a token.
|
|
<P>And this is exactly what we need to put YACC to good use.
|
|
<P>
|
|
<H2><A NAME="ss3.3">3.3 What we've seen</A>
|
|
</H2>
|
|
|
|
<P>We've seen that Lex is able to read arbitrary input, and determine what each
|
|
part of the input is. This is called 'Tokenizing'.
|
|
<P>
|
|
<HR>
|
|
<A HREF="Lex-YACC-HOWTO-4.html">Next</A>
|
|
<A HREF="Lex-YACC-HOWTO-2.html">Previous</A>
|
|
<A HREF="Lex-YACC-HOWTO.html#toc3">Contents</A>
|
|
</BODY>
|
|
</HTML>
|