441 lines
13 KiB
HTML
441 lines
13 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
|
|
<TITLE>Lex and YACC primer/HOWTO: YACC</TITLE>
|
|
<LINK HREF="Lex-YACC-HOWTO-5.html" REL=next>
|
|
<LINK HREF="Lex-YACC-HOWTO-3.html" REL=previous>
|
|
<LINK HREF="Lex-YACC-HOWTO.html#toc4" REL=contents>
|
|
</HEAD>
|
|
<BODY>
|
|
<A HREF="Lex-YACC-HOWTO-5.html">Next</A>
|
|
<A HREF="Lex-YACC-HOWTO-3.html">Previous</A>
|
|
<A HREF="Lex-YACC-HOWTO.html#toc4">Contents</A>
|
|
<HR>
|
|
<H2><A NAME="s4">4. YACC</A></H2>
|
|
|
|
<P>YACC can parse input streams consisting of tokens with certain values. This
|
|
clearly describes the relation YACC has with Lex, YACC has no idea
|
|
what 'input streams' are, it needs preprocessed tokens. While you can write your
|
|
own Tokenizer, we will leave that entirely up to Lex.
|
|
<P>A note on grammars and parsers. When YACC saw the light of day, the tool was
|
|
used to parse input files for compilers: programs. Programs written in a
|
|
programming language for computers are typically *not* ambiguous - they have
|
|
just one meaning. As such, YACC does not cope with ambiguity and will
|
|
complain about shift/reduce or reduce/reduce conflicts. More about
|
|
ambiguity and YACC "problems" can be found in 'Conflicts' chapter.
|
|
<P>
|
|
<P>
|
|
<H2><A NAME="ss4.1">4.1 A simple thermostat controller</A>
|
|
</H2>
|
|
|
|
<P>Let's say we have a thermostat that we want to control using a simple
|
|
language. A session with the thermostat may look like this:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
heat on
|
|
Heater on!
|
|
heat off
|
|
Heater off!
|
|
target temperature 22
|
|
New temperature set!
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>The tokens we need to recognize are: heat, on/off (STATE), target, temperature,
|
|
NUMBER.
|
|
<P>The Lex tokenizer (Example 4) is:
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
%{
|
|
#include <stdio.h>
|
|
#include "y.tab.h"
|
|
%}
|
|
%%
|
|
[0-9]+ return NUMBER;
|
|
heat return TOKHEAT;
|
|
on|off return STATE;
|
|
target return TOKTARGET;
|
|
temperature return TOKTEMPERATURE;
|
|
\n /* ignore end of line */;
|
|
[ \t]+ /* ignore whitespace */;
|
|
%%
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>We note two important changes. First, we include the file 'y.tab.h', and
|
|
secondly, we no longer print stuff, we return names of tokens. This change
|
|
is because we are now feeding it all to YACC, which isn't interested in
|
|
what we output to the screen. Y.tab.h has definitions for these tokens.
|
|
<P>But where does y.tab.h come from? It is generated by YACC from the Grammar
|
|
File we are about to create. As our language is very basic, so is the grammar:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
commands: /* empty */
|
|
| commands command
|
|
;
|
|
|
|
command:
|
|
heat_switch
|
|
|
|
|
target_set
|
|
;
|
|
|
|
heat_switch:
|
|
TOKHEAT STATE
|
|
{
|
|
printf("\tHeat turned on or off\n");
|
|
}
|
|
;
|
|
|
|
target_set:
|
|
TOKTARGET TOKTEMPERATURE NUMBER
|
|
{
|
|
printf("\tTemperature set\n");
|
|
}
|
|
;
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>The first part is what I call the 'root'. It tells us that we
|
|
have 'commands', and that these commands consist of individual 'command'
|
|
parts. As you can see this rule is very recursive, because it again
|
|
contains the word 'commands'. What this means is that the program is now
|
|
capable of reducing a series of commands one by one. Read the chapter 'How
|
|
do Lex and YACC work internally' for important details on recursion.
|
|
<P>The second rule defines what a command is. We support only two kinds of
|
|
commands, the 'heat_switch' and the 'target_set'. This is what the |-symbol
|
|
signifies - 'a command consists of either a heat_switch or a target_set'.
|
|
<P>A heat_switch consists of the HEAT token, which is simply the word 'heat',
|
|
followed by a state (which we defined in the Lex file as 'on' or 'off').
|
|
<P>Somewhat more complicated is the target_set, which consists of the TARGET
|
|
token (the word 'target'), the TEMPERATURE token (the word 'temperature')
|
|
and a number.
|
|
<P>
|
|
<H3>A complete YACC file</H3>
|
|
|
|
<P>The previous section only showed the grammar part of the YACC file, but
|
|
there is more. This is the header that we omitted:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
%{
|
|
#include <stdio.h>
|
|
#include <string.h>
|
|
|
|
void yyerror(const char *str)
|
|
{
|
|
fprintf(stderr,"error: %s\n",str);
|
|
}
|
|
|
|
int yywrap()
|
|
{
|
|
return 1;
|
|
}
|
|
|
|
main()
|
|
{
|
|
yyparse();
|
|
}
|
|
|
|
%}
|
|
|
|
%token NUMBER TOKHEAT STATE TOKTARGET TOKTEMPERATURE
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
|
|
The yyerror() function is called by YACC if it finds an error. We simply
|
|
output the message passed, but there are smarter things to do. See
|
|
the 'Further reading' section at the end.
|
|
<P>The function yywrap() can be used to continue reading from another file. It
|
|
is called at EOF and you can than open another file, and return 0. Or you
|
|
can return 1, indicating that this is truly the end. For more about this,
|
|
see the 'How do Lex and YACC work internally' chapter.
|
|
<P>Then there is the main() function, that does nothing but set everything in
|
|
motion.
|
|
<P>The last line simply defines the tokens we will be using. These are output
|
|
using y.tab.h if YACC is invoked with the '-d' option.
|
|
<P>
|
|
<H3>Compiling & running the thermostat controller</H3>
|
|
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
lex example4.l
|
|
yacc -d example4.y
|
|
cc lex.yy.c y.tab.c -o example4
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
|
|
A few things have changed. We now also invoke YACC to compile our grammar,
|
|
which creates y.tab.c and y.tab.h. We then call Lex as usual. When
|
|
compiling, we remove the -ll flag: we now have our own main() function and
|
|
don't need the one provided by libl.
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
NOTE: if you get an error about your compiler not being able to
|
|
find 'yylval', add this to example4.l, just beneath #include
|
|
<y.tab.h>:
|
|
<PRE>
|
|
extern YYSTYPE yylval;
|
|
</PRE>
|
|
|
|
This is explained in the 'How Lex and YACC work internally' section.
|
|
</CODE></BLOCKQUOTE>
|
|
<P>A sample session:
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
$ ./example4
|
|
heat on
|
|
Heat turned on or off
|
|
heat off
|
|
Heat turned on or off
|
|
target temperature 10
|
|
Temperature set
|
|
target humidity 20
|
|
error: parse error
|
|
$
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>This is not quite what we set out to achieve, but in the interest of keeping
|
|
the learning curve manageable, not all cool stuff can be presented at once.
|
|
<P>
|
|
<H2><A NAME="ss4.2">4.2 Expanding the thermostat to handle parameters</A>
|
|
</H2>
|
|
|
|
<P>As we've seen, we now parse the thermostat commands correctly, and even flag
|
|
mistakes properly. But as you might have guessed by the weasely wording, the
|
|
program has no idea of what it should do, it does not get passed any of the
|
|
values you enter.
|
|
<P>Let's start by adding the ability to read the new target temperature. In
|
|
order to do so, we need to learn the NUMBER match in the Lexer to convert
|
|
itself into an integer value, which can then be read in YACC.
|
|
<P>Whenever Lex matches a target, it puts the text of the match in the
|
|
character string 'yytext'. YACC in turn expects to find a value in the
|
|
variable 'yylval'. In Example 5, we see the obvious solution:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
%{
|
|
#include <stdio.h>
|
|
#include "y.tab.h"
|
|
%}
|
|
%%
|
|
[0-9]+ yylval=atoi(yytext); return NUMBER;
|
|
heat return TOKHEAT;
|
|
on|off yylval=!strcmp(yytext,"on"); return STATE;
|
|
target return TOKTARGET;
|
|
temperature return TOKTEMPERATURE;
|
|
\n /* ignore end of line */;
|
|
[ \t]+ /* ignore whitespace */;
|
|
%%
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>As you can see, we run atoi() on yytext, and put the result in yylval, where
|
|
YACC can see it. We do much the same for the STATE match, where we compare
|
|
it to 'on', and set yylval to 1 if it is equal. Please note that having a
|
|
separate 'on' and 'off' match in Lex would produce faster code, but I wanted
|
|
to show a more complicated rule and action for a change.
|
|
<P>Now we need to learn YACC how to deal with this. What is called 'yylval' in
|
|
Lex has a different name in YACC. Let's examine the rule setting the new
|
|
temperature target:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
target_set:
|
|
TOKTARGET TOKTEMPERATURE NUMBER
|
|
{
|
|
printf("\tTemperature set to %d\n",$3);
|
|
}
|
|
;
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>To access the value of the third part of the rule (ie, NUMBER), we need to
|
|
use $3. Whenever yylex() returns, the contents of yylval are attached to the
|
|
terminal, the value of which can be accessed with the $-construct.
|
|
<P>To expound on this further, let's observe the new 'heat_switch' rule:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
heat_switch:
|
|
TOKHEAT STATE
|
|
{
|
|
if($2)
|
|
printf("\tHeat turned on\n");
|
|
else
|
|
printf("\tHeat turned off\n");
|
|
}
|
|
;
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>If you now run example5, it properly outputs what you entered.
|
|
<P>
|
|
<H2><A NAME="ss4.3">4.3 Parsing a configuration file</A>
|
|
</H2>
|
|
|
|
<P>Let's repeat part of the configuration file we mentioned earlier:
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
zone "." {
|
|
type hint;
|
|
file "/etc/bind/db.root";
|
|
};
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>Remember that we already wrote a Lexer for this file. Now all we need to do
|
|
is write the YACC grammar, and modify the Lexer so it returns values in
|
|
a format YACC can understand.
|
|
<P>In the lexer from Example 6 we see:
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
%{
|
|
#include <stdio.h>
|
|
#include "y.tab.h"
|
|
%}
|
|
|
|
%%
|
|
|
|
zone return ZONETOK;
|
|
file return FILETOK;
|
|
[a-zA-Z][a-zA-Z0-9]* yylval=strdup(yytext); return WORD;
|
|
[a-zA-Z0-9\/.-]+ yylval=strdup(yytext); return FILENAME;
|
|
\" return QUOTE;
|
|
\{ return OBRACE;
|
|
\} return EBRACE;
|
|
; return SEMICOLON;
|
|
\n /* ignore EOL */;
|
|
[ \t]+ /* ignore whitespace */;
|
|
%%
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>If you look carefully, you can see that yylval has changed! We no longer
|
|
expect it to be an integer, but in fact assume that it is a char *. In the
|
|
interest of keeping things simple, we invoke strdup and waste a lot of
|
|
memory. Please note that this may not be a problem in many areas where you
|
|
only need to parse a file once, and then exit.
|
|
<P>We want to store character strings because we are now mostly dealing with
|
|
names: file names and zone names. In a later chapter we will explain how to
|
|
deal with multiple types of data.
|
|
<P>In order to tell YACC about the new type of yylval, we add this line to the
|
|
header of our YACC grammar:
|
|
<P>
|
|
<PRE>
|
|
#define YYSTYPE char *
|
|
</PRE>
|
|
<P>The grammar itself is again more complicated. We chop it in parts to make it
|
|
easier to digest.
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
commands:
|
|
|
|
|
commands command SEMICOLON
|
|
;
|
|
|
|
|
|
command:
|
|
zone_set
|
|
;
|
|
|
|
zone_set:
|
|
ZONETOK quotedname zonecontent
|
|
{
|
|
printf("Complete zone for '%s' found\n",$2);
|
|
}
|
|
;
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>This is the intro, including the aforementioned recursive 'root'. Please
|
|
note that we specify that commands are terminated (and separated) by ;'s. We
|
|
define one kind of command, the 'zone_set'. It consists of the ZONE token
|
|
(the word 'zone'), followed by a quoted name and the 'zonecontent'. This
|
|
zonecontent starts out simple enough:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
zonecontent:
|
|
OBRACE zonestatements EBRACE
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>It needs to start with an OBRACE, a {. Then follow the zonestatements,
|
|
followed by an EBRACE, }.
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
quotedname:
|
|
QUOTE FILENAME QUOTE
|
|
{
|
|
$$=$2;
|
|
}
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>This section defines what a 'quotedname' is: a FILENAME between QUOTEs.
|
|
Then it says something special: the value of a quotedname token is the value
|
|
of the FILENAME. This means that the quotedname has as its value the
|
|
filename without quotes.
|
|
<P>This is what the magic '$$=$2;' command does. It says: my value is the value
|
|
of my second part. When the quotedname is now referenced in other rules, and
|
|
you access its value with the $-construct, you see the value that we set
|
|
here with $$=$2.
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
NOTE: this grammar chokes on filenames without either a '.' or a '/' in
|
|
them.
|
|
</CODE></BLOCKQUOTE>
|
|
<P>
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
zonestatements:
|
|
|
|
|
zonestatements zonestatement SEMICOLON
|
|
;
|
|
|
|
zonestatement:
|
|
statements
|
|
|
|
|
FILETOK quotedname
|
|
{
|
|
printf("A zonefile name '%s' was encountered\n", $2);
|
|
}
|
|
;
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>This is a generic statement that catches all kinds of statements within
|
|
the 'zone' block. We again see the recursiveness.
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
block:
|
|
OBRACE zonestatements EBRACE SEMICOLON
|
|
;
|
|
|
|
statements:
|
|
| statements statement
|
|
;
|
|
|
|
statement: WORD | block | quotedname
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>This defines a block, and 'statements' which may be found within.
|
|
<P>When executed, the output is like this:
|
|
<P>
|
|
<BLOCKQUOTE><CODE>
|
|
<PRE>
|
|
$ ./example6
|
|
zone "." {
|
|
type hint;
|
|
file "/etc/bind/db.root";
|
|
type hint;
|
|
};
|
|
A zonefile name '/etc/bind/db.root' was encountered
|
|
Complete zone for '.' found
|
|
</PRE>
|
|
</CODE></BLOCKQUOTE>
|
|
<P>
|
|
<HR>
|
|
<A HREF="Lex-YACC-HOWTO-5.html">Next</A>
|
|
<A HREF="Lex-YACC-HOWTO-3.html">Previous</A>
|
|
<A HREF="Lex-YACC-HOWTO.html#toc4">Contents</A>
|
|
</BODY>
|
|
</HTML>
|