old-www/HOWTO/Lex-YACC-HOWTO-4.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
 <META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
 <TITLE>Lex and YACC primer/HOWTO: YACC</TITLE>
 <LINK HREF="Lex-YACC-HOWTO-5.html" REL=next>
 <LINK HREF="Lex-YACC-HOWTO-3.html" REL=previous>
 <LINK HREF="Lex-YACC-HOWTO.html#toc4" REL=contents>
</HEAD>
<BODY>
<A HREF="Lex-YACC-HOWTO-5.html">Next</A>
<A HREF="Lex-YACC-HOWTO-3.html">Previous</A>
<A HREF="Lex-YACC-HOWTO.html#toc4">Contents</A>
<HR>
<H2><A NAME="s4">4. YACC</A></H2>

<P>YACC can parse input streams consisting of tokens with certain values. This
clearly describes the relation YACC has with Lex, YACC has no idea
what 'input streams' are, it needs preprocessed tokens. While you can write your
own Tokenizer, we will leave that entirely up to Lex.
<P>A note on grammars and parsers. When YACC saw the light of day, the tool was
used to parse input files for compilers: programs. Programs written in a
programming language for computers are typically *not* ambiguous - they have
just one meaning. As such, YACC does not cope with ambiguity and will
complain about shift/reduce or reduce/reduce conflicts. More about
ambiguity and YACC "problems" can be found in 'Conflicts' chapter.
<P>
<P>
<H2><A NAME="ss4.1">4.1 A simple thermostat controller</A>
</H2>

<P>Let's say we have a thermostat that we want to control using a simple
language. A session with the thermostat may look like this:
<P>
<BLOCKQUOTE><CODE>
<PRE>
heat on
        Heater on!
heat off
        Heater off!
target temperature 22
        New temperature set!
</PRE>
</CODE></BLOCKQUOTE>
<P>The tokens we need to recognize are: heat, on/off (STATE), target, temperature,
NUMBER.
<P>The Lex tokenizer (Example 4) is:
<BLOCKQUOTE><CODE>
<PRE>
%{
#include &lt;stdio.h>
#include "y.tab.h"
%}
%%
[0-9]+                  return NUMBER;
heat                    return TOKHEAT;
on|off                  return STATE;
target                  return TOKTARGET;
temperature             return TOKTEMPERATURE;
\n                      /* ignore end of line */;
[ \t]+                  /* ignore whitespace */;
%%
</PRE>
</CODE></BLOCKQUOTE>
<P>We note two important changes. First, we include the file 'y.tab.h', and
secondly, we no longer print stuff, we return names of tokens. This change
is because we are now feeding it all to YACC, which isn't interested in
what we output to the screen. Y.tab.h has definitions for these tokens.
<P>But where does y.tab.h come from? It is generated by YACC from the Grammar
File we are about to create. As our language is very basic, so is the grammar:
<P>
<BLOCKQUOTE><CODE>
<PRE>
commands: /* empty */
        | commands command
        ;

command:
        heat_switch
        |
        target_set
        ;

heat_switch:
        TOKHEAT STATE
        {
                printf("\tHeat turned on or off\n");
        }
        ;

target_set:
        TOKTARGET TOKTEMPERATURE NUMBER
        {
                printf("\tTemperature set\n");
        }
        ;
</PRE>
</CODE></BLOCKQUOTE>
<P>The first part is what I call the 'root'. It tells us that we
have 'commands', and that these commands consist of individual 'command'
parts. As you can see this rule is very recursive, because it again
contains the word 'commands'. What this means is that the program is now
capable of reducing a series of commands one by one. Read the chapter 'How
do Lex and YACC work internally' for important details on recursion.
<P>The second rule defines what a command is. We support only two kinds of
commands, the 'heat_switch' and the 'target_set'. This is what the |-symbol
signifies - 'a command consists of either a heat_switch or a target_set'.
<P>A heat_switch consists of the HEAT token, which is simply the word 'heat',
followed by a state (which we defined in the Lex file as 'on' or 'off').
<P>Somewhat more complicated is the target_set, which consists of the TARGET
token (the word 'target'), the TEMPERATURE token (the word 'temperature')
and a number.
<P>
<H3>A complete YACC file</H3>

<P>The previous section only showed the grammar part of the YACC file, but
there is more. This is the header that we omitted:
<P>
<BLOCKQUOTE><CODE>
<PRE>
%{
#include &lt;stdio.h>
#include &lt;string.h>

void yyerror(const char *str)
{
        fprintf(stderr,"error: %s\n",str);
}

int yywrap()
{
        return 1;
}

main()
{
        yyparse();
}

%}

%token NUMBER TOKHEAT STATE TOKTARGET TOKTEMPERATURE
</PRE>
</CODE></BLOCKQUOTE>

The yyerror() function is called by YACC if it finds an error. We simply
output the message passed, but there are smarter things to do. See
the 'Further reading' section at the end.
<P>The function yywrap() can be used to continue reading from another file. It
is called at EOF and you can than open another file, and return 0. Or you
can return 1, indicating that this is truly the end. For more about this,
see the 'How do Lex and YACC work internally' chapter.
<P>Then there is the main() function, that does nothing but set everything in
motion.
<P>The last line simply defines the tokens we will be using. These are output
using y.tab.h if YACC is invoked with the '-d' option.
<P>
<H3>Compiling &amp; running the thermostat controller</H3>

<P>
<BLOCKQUOTE><CODE>
<PRE>
lex example4.l
yacc -d example4.y
cc lex.yy.c y.tab.c -o example4
</PRE>
</CODE></BLOCKQUOTE>

A few things have changed. We now also invoke YACC to compile our grammar,
which creates y.tab.c and y.tab.h. We then call Lex as usual. When
compiling, we remove the -ll flag: we now have our own main() function and
don't need the one provided by libl.
<P>
<BLOCKQUOTE><CODE>
NOTE: if you get an error about your compiler not being able to
find 'yylval', add this to example4.l, just beneath #include
&lt;y.tab.h&gt;:
<PRE>
extern YYSTYPE yylval;
</PRE>

This is explained in the 'How Lex and YACC work internally' section.
</CODE></BLOCKQUOTE>
<P>A sample session:
<BLOCKQUOTE><CODE>
<PRE>
$ ./example4
heat on
        Heat turned on or off
heat off
        Heat turned on or off
target temperature 10
        Temperature set
target humidity 20
error: parse error
$
</PRE>
</CODE></BLOCKQUOTE>
<P>This is not quite what we set out to achieve, but in the interest of keeping
the learning curve manageable, not all cool stuff can be presented at once.
<P>
<H2><A NAME="ss4.2">4.2 Expanding the thermostat to handle parameters</A>
</H2>

<P>As we've seen, we now parse the thermostat commands correctly, and even flag
mistakes properly. But as you might have guessed by the weasely wording, the
program has no idea of what it should do, it does not get passed any of the
values you enter.
<P>Let's start by adding the ability to read the new target temperature. In
order to do so, we need to learn the NUMBER match in the Lexer to convert
itself into an integer value, which can then be read in YACC.
<P>Whenever Lex matches a target, it puts the text of the match in the
character string 'yytext'. YACC in turn expects to find a value in the
variable 'yylval'. In Example 5, we see the obvious solution:
<P>
<BLOCKQUOTE><CODE>
<PRE>
%{
#include &lt;stdio.h>
#include "y.tab.h"
%}
%%
[0-9]+                  yylval=atoi(yytext); return NUMBER;
heat                    return TOKHEAT;
on|off                  yylval=!strcmp(yytext,"on"); return STATE;
target                  return TOKTARGET;
temperature             return TOKTEMPERATURE;
\n                      /* ignore end of line */;
[ \t]+                  /* ignore whitespace */;
%%
</PRE>
</CODE></BLOCKQUOTE>
<P>As you can see, we run atoi() on yytext, and put the result in yylval, where
YACC can see it. We do much the same for the STATE match, where we compare
it to 'on', and set yylval to 1 if it is equal. Please note that having a
separate 'on' and 'off' match in Lex would produce faster code, but I wanted
to show a more complicated rule and action for a change.
<P>Now we need to learn YACC how to deal with this. What is called 'yylval' in
Lex has a different name in YACC. Let's examine the rule setting the new
temperature target:
<P>
<BLOCKQUOTE><CODE>
<PRE>
target_set:
        TOKTARGET TOKTEMPERATURE NUMBER
        {
                printf("\tTemperature set to %d\n",$3);
        }
        ;
</PRE>
</CODE></BLOCKQUOTE>
<P>To access the value of the third part of the rule (ie, NUMBER), we need to
use $3. Whenever yylex() returns, the contents of yylval are attached to the
terminal, the value of which can be accessed with the $-construct.
<P>To expound on this further, let's observe the new 'heat_switch' rule:
<P>
<BLOCKQUOTE><CODE>
<PRE>
heat_switch:
        TOKHEAT STATE
        {
                if($2)
                        printf("\tHeat turned on\n");
                else
                        printf("\tHeat turned off\n");
        }
        ;
</PRE>
</CODE></BLOCKQUOTE>
<P>If you now run example5, it properly outputs what you entered.
<P>
<H2><A NAME="ss4.3">4.3 Parsing a configuration file</A>
</H2>

<P>Let's repeat part of the configuration file we mentioned earlier:
<BLOCKQUOTE><CODE>
<PRE>
zone "." {
        type hint;
        file "/etc/bind/db.root";
};
</PRE>
</CODE></BLOCKQUOTE>
<P>Remember that we already wrote a Lexer for this file. Now all we need to do
is write the YACC grammar, and modify the Lexer so it returns values in
a format YACC can understand.
<P>In the lexer from Example 6 we see:
<BLOCKQUOTE><CODE>
<PRE>
%{
#include &lt;stdio.h>
#include "y.tab.h"
%}

%%

zone                    return ZONETOK;
file                    return FILETOK;
[a-zA-Z][a-zA-Z0-9]*    yylval=strdup(yytext); return WORD;
[a-zA-Z0-9\/.-]+        yylval=strdup(yytext); return FILENAME;
\"                      return QUOTE;
\{                      return OBRACE;
\}                      return EBRACE;
;                       return SEMICOLON;
\n                      /* ignore EOL */;
[ \t]+                  /* ignore whitespace */;
%%
</PRE>
</CODE></BLOCKQUOTE>
<P>If you look carefully, you can see that yylval has changed! We no longer
expect it to be an integer, but in fact assume that it is a char *. In the
interest of keeping things simple, we invoke strdup and waste a lot of
memory. Please note that this may not be a problem in many areas where you
only need to parse a file once, and then exit.
<P>We want to store character strings because we are now mostly dealing with
names: file names and zone names. In a later chapter we will explain how to
deal with multiple types of data.
<P>In order to tell YACC about the new type of yylval, we add this line to the
header of our YACC grammar:
<P>
<PRE>
#define YYSTYPE char *
</PRE>
<P>The grammar itself is again more complicated. We chop it in parts to make it
easier to digest.
<P>
<BLOCKQUOTE><CODE>
<PRE>
commands:
        |
        commands command SEMICOLON
        ;


command:
        zone_set
        ;

zone_set:
        ZONETOK quotedname zonecontent
        {
                printf("Complete zone for '%s' found\n",$2);
        }
        ;
</PRE>
</CODE></BLOCKQUOTE>
<P>This is the intro, including the aforementioned recursive 'root'. Please
note that we specify that commands are terminated (and separated) by ;'s. We
define one kind of command, the 'zone_set'. It consists of the ZONE token
(the word 'zone'), followed by a quoted name and the 'zonecontent'. This
zonecontent starts out simple enough:
<P>
<BLOCKQUOTE><CODE>
<PRE>
zonecontent:
        OBRACE zonestatements EBRACE
</PRE>
</CODE></BLOCKQUOTE>
<P>It needs to start with an OBRACE, a {. Then follow the zonestatements,
followed by an EBRACE, }.
<P>
<BLOCKQUOTE><CODE>
<PRE>
quotedname:
        QUOTE FILENAME QUOTE
        {
                $$=$2;
        }
</PRE>
</CODE></BLOCKQUOTE>
<P>This section defines what a 'quotedname' is: a FILENAME between QUOTEs.
Then it says something special: the value of a quotedname token is the value
of the FILENAME. This means that the quotedname has as its value the
filename without quotes.
<P>This is what the magic '$$=$2;' command does. It says: my value is the value
of my second part. When the quotedname is now referenced in other rules, and
you access its value with the $-construct, you see the value that we set
here with $$=$2.
<P>
<BLOCKQUOTE><CODE>
NOTE: this grammar chokes on filenames without either a '.' or a '/' in
them.
</CODE></BLOCKQUOTE>
<P>
<P>
<BLOCKQUOTE><CODE>
<PRE>
zonestatements:
        |
        zonestatements zonestatement SEMICOLON
        ;

zonestatement:
        statements
        |
        FILETOK quotedname
        {
                printf("A zonefile name '%s' was encountered\n", $2);
        }
        ;
</PRE>
</CODE></BLOCKQUOTE>
<P>This is a generic statement that catches all kinds of statements within
the 'zone' block. We again see the recursiveness.
<P>
<BLOCKQUOTE><CODE>
<PRE>
block:
        OBRACE zonestatements EBRACE SEMICOLON
        ;

statements:
        | statements statement
        ;

statement: WORD | block | quotedname
</PRE>
</CODE></BLOCKQUOTE>
<P>This defines a block, and 'statements' which may be found within.
<P>When executed, the output is like this:
<P>
<BLOCKQUOTE><CODE>
<PRE>
$ ./example6
zone "." {
        type hint;
        file "/etc/bind/db.root";
        type hint;
};
A zonefile name '/etc/bind/db.root' was encountered
Complete zone for '.' found
</PRE>
</CODE></BLOCKQUOTE>
<P>
<HR>
<A HREF="Lex-YACC-HOWTO-5.html">Next</A>
<A HREF="Lex-YACC-HOWTO-3.html">Previous</A>
<A HREF="Lex-YACC-HOWTO.html#toc4">Contents</A>
</BODY>
</HTML>