old-www/LDP/LG/issue87/ramankutty.html

571 lines
22 KiB
HTML

<!--startcut ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Yacc/Bison - Parser Generators - Part 1 LG #87</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="qubism.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue87/ramankutty.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../lg_faq.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="sunil.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
<!--endcut ============================================================-->
<TABLE BORDER><TR><TD WIDTH="200">
<A HREF="http://www.linuxgazette.com/">
<IMG ALT="LINUX GAZETTE" SRC="../gx/2002/lglogo_200x41.png"
WIDTH="200" HEIGHT="41" border="0"></A>
<BR CLEAR="all">
<SMALL>...<I>making Linux just a little more fun!</I></SMALL>
</TD><TD WIDTH="380">
<CENTER>
<BIG><BIG><STRONG><FONT COLOR="maroon">Yacc/Bison - Parser Generators - Part 1</FONT></STRONG></BIG></BIG>
<BR>
<STRONG>By <A HREF="../authors/ramankutty.html">Hiran Ramankutty</A></STRONG>
</CENTER>
</TD></TR>
</TABLE>
<P>
<!-- END header -->
<h2><b>Introduction</b></h2>
<p>
<b>Yacc</b> ("Yet Another Compiler Compiler") is used to parse a language
described by a <b>context-free grammar</b>. Not all context-free languages can
be handled by Yacc or Bison and only those that are <b>LALR(1)</b> can be
parsed. To be specific, this means that it must be possible to tell how to
parse any portion of an input string with just a single token of look-ahead. I
will explain that clearly later in this article.
</p>
<p>
<b>Bison</b> is also a parser generator in the style of <b>Yacc</b>.
It was written primarily by Robert Corbett and Richard Stallman
made it Yacc compatible. There are differences between Bison and Yacc,
but that is not the purpose of this article.
</p>
<h2><b>Languages and Context-Free Grammars</b></h2>
<p>
Grammar can be associated with English language as a set of rules to
construct meaningful sentences. We can say the same for context-free
grammars. Almost all programming languages are based on context-free
grammars. The set of rules in any grammar will deal with syntactic
groupings that will help in the construction of semantic structures.
To be specific, it means that we specify one or more syntactic
groupings and give rules for constructing them from their parts. For
example in C: `expression' is one kind of grouping. One rule for making
an expression is, "An expression can be made of a minus sign and
another expression". Another would be, "An expression is an integer".
You must have noticed that the rules are recursive. In fact, every
such grammar must then have a rule which leads out of the recursion.
</p>
<p>
The most common formal system for representing such rules is the
<b>Backus-Naur Form</b> or "BNF". All BNF grammars are context-free
grammars.
</p>
<p>
In the grammatical rules for a language, we name a grouping as a symbol.
Those symbols which can be sub-divided into smaller constructs are
called non-terminals and those which cannot be subdivided are called
terminals. If a piece of input is a single terminal then it is called a
token and if it is a single nonterminal it is called a grouping. For
example: `identifier', `number', `string' are distinguished as tokens,
Whereas `expression', `statement', `declaration' and `function
definition' are groupings in C language. Now, the full grammar may use
additional language constructs with another set of nonterminal symbols.
</p>
<h2><b>Basic Parsing Techniques</b></h2>
<p>
A parser for grammar G determines whether an input string say `w' is a
sentence of G or not. If `w' is a sentence of G then the parser
produces the parse tree for `w' otherwise, an error message is
produced. By parse tree we mean a diagram that represents the syntactic
structure of a string `w'. There are two basic types of parsers for
context-free grammars - <b>bottom-up</b> and <b>top-down</b>, the former one
being of our interest.
</p>
<h3><b>Bottom-Up Parsing</b></h3>
<p>
It is also known as <b>Shift-Reduce Parsing</b>. Here, attempts to
construct a parse tree for an input begin at the leaves (bottom) and
work up towards the root (top). In other words this will lead to a
process of `reduction' of the input string to the start symbol of the
grammar based on its production rules. For example, consider the
grammar:
</p><p></p>
<pre>
<b><i>S -> aAcBe
A -> Ab/b
B -> d</i></b>
</pre>
<p></p>
<p>
Let w = "abbcde". Our aim is to reduce this string `w' to
<b><i>S</i></b>, where <b><i>S</i></b> is the start symbol. We scan
"abbcde" looking for substrings that match the right side of some
production. The substrings `b' and `d' qualify. Again there are 2 b's
to be considered. Let us proceed with leftmost `b'. We replace it by
`A' the left side of the production <b><i>A -> b</i></b>. The string
has now become "aAbcde". We now see that `Ab', `b' and `d' each match
the right side of some production. This time we will choose to replace
`Ab' by `A', the left side of the production <b><i>A -> Ab</i></b>.
The string now becomes "aAcde". Then replacing `d' by `B', the left
side of the production <b><i>B -> d</i></b>, we obtain "aAcBe". The
entire string can now be replaced by <b><i>S</i></b>.
</p>
<p>
Basically, we are replacing the right side of a production by the left
side the process being called a <i>reduction</i>. Quite easy! Not always. It
sometimes so happen that, the substring that we choose to reduce may produce
a string which is not decomposable to the start symbol <b><i>S</i></b>.
</p>
The substrings that are the right side of a production and when replaced
with the left side of that production in the input string that leads
eventually to the start symbol is called a <b>`handle'</b>. Now, the process
of bottom-up parsing may be viewed as one of finding and reducing `handles',
the reduction sequence being known as <b>`handle pruning'</b>.
</p>
<h4><b>Stack Implementation of Shift-Reduce Parsing</b></h4>
<p>
A convenient way to implement a shift-reduce parser is to use a stack and an
input buffer. Let `$' symbol mark the bottom of the stack and the right end
of the input.
</p>
<p>
The main concept is to shift the input symbols onto the stack until a
handle <font face=`Symbol'>b</font> is on top of the stack. Now we
reduce <font face=`Symbol'>b</font> to the left side of the appropriate
production. The parser repeats this cycle until it has detected an
error or until the stack contains the start symbol and the input is
empty:
</p>
<p>
In fact, there are four possible actions that a shift-reduce parser can
make and they are;
</p>
<p></p>
<ol>
<li>In a <i>shift</i> action, the next input symbol is shifted to the top of
the stack.
<li>In a <i>reduce</i> action, the parser knows that the right end of the
handle is at the top of the stack. It must then locate the left end of the
handle within the stack and decide with what nonterminal to replace the
handle.
<li>In an <i>accept</i> action, the parser announces successful completion
of parsing.
<li>In an <i>error</i> action, the parser discovers that a syntax error has
occurred and calls an error recovery routine.
</ol>
<p>
Let us see how these concepts are put into action in the example below.
</p>
<p>Consider the grammar below:</p>
<p></p>
<pre>
<b>E -> E + E
E -> E * E
E -> (E)
E -> id</b>
</pre>
<p>
Let the input string be
<b>id<sub>1</sub> + id<sub>2</sub> * id<sub>3</sub></b>
</p>
<P> <A HREF="misc/ramankutty/figure1.png">Figure 1</A> </P>
<h4><b>Constructing a Parse Tree</b></h4>
<p>
The bottom-up tree construction process has two aspects.
</p>
<ol>
<li>When we shift an input symbol <i>a</i> onto the stack we create a
one-node tree labeled <i>a</i>. Both the root and the yield of this tree
are <i>a</i>, and the yield truly represents the string of terminals
"reduced" (by zero reductions) to symbol <i>a</i>.
<li>When we reduce <i>X</i><sub>1</sub><i>X</i><sub>2</sub>...
<i>X</i><sub>n</sub> to <i>A</i>, we create a new node labeled <i>A</i>. Its
children, from left to right, are the roots of the trees for
<i>X</i><sub>1</sub>,<i>X</i><sub>2</sub>,...,<i>X</i><sub>n</sub>. If for
all i<sup>i</sup> the tree for <i>X</i><sub>i</sub> has yield
<i>x</i><sub>i</sub>, then the yield for the new tree is
<i>x</i><sub>1</sub><i>x</i><sub>2</sub>...<i>x</i><sub>n</sub>. This string
has in fact been reduced to <i>A</i> by a series of reductions culminating
in the present one. As a special case, if we reduce <i>E</i> to <i>A</i> we
create a node labeled <i>A</i> with one child labeled <i>E</i>.
</ol>
<h2><b>LR Parsing Algorithm</b></h2>
<p>
Construction of LALR parser requires the basic understanding of
constructing an LR parser. LR parser gets its name because it scans the
input from left-to-right and constructs a rightmost derivation in
reverse.
</p>
<p>
A parser generates a parsing table for a grammar. The parsing table
consists of two parts, a parsing action function <b>ACTION</b> and a
goto function <b>GOTO</b>.
</p>
<p>
An LR parser has an input, a stack, and a parsing table.
The input is read from left to right, one symbol at a time. The stack
contains a string of the form
s<sub>0</sub>X<sub>1</sub>s<sub>1</sub>...X<sub>m</sub>s<sub>m</sub>
where s<sub>m</sub> is on top. Each X<sub>i</sub> is a grammar symbol
and each s<sub>i</sub> is a symbol called a state. Each state symbol
summarizes the information contained in the stack below it and is used
to guide the shift-reduce decision.
</p>
<p>
The function <b>ACTION</b> stores values for s<sub>m</sub> that is
topmost stack element and a<sub>i</sub> that is the current input
symbol. The entry ACTION[s<sub>m</sub>, a<sub>i</sub>] can have one of
four values:
</p>
<ol>
<li>shift s
<li>reduce A -&gt; <font face="symbol">B</font>
<li>accept
<li>error
</ol>
<p>
The function <b>GOTO</b> takes a state and grammar symbol as arguments
and produces a state. Somewhat analogous to the transition table of a
deterministic finite automaton whose input symbols are the terminals
and nonterminals of the grammar.
</p>
<p>
A <i>configuration</i> of an LR parser is a pair whose first component
is the stack contents and whose second component is the unexpended
input:
</p>
<p align=center>
(s<sub>0</sub> X<sub>1</sub> s<sub>1</sub> . . . X<sub>m</sub> s<sub>m</sub>, a<sub>i</sub> a<sub>i+1</sub> . . . a<sub>n</sub>$) </p>
<p>
The next move of the parser is determined by reading a<sub>i</sub>, the
current input symbol, and s<sub>m</sub> the state on top of the stack,
and then consulting the action table entry
ACTION[s<sub>m</sub>, a<sub>i</sub>]. The four value mentioned above
for action table entry will produce four different configurations as
follows:
</p>
<p></p>
<ol>
<li> If ACTION[s<sub>m</sub>, a<sub>i</sub>] = shift s, the parser
executes a shift move, entering the configuration
<p align=center>
(s<sub>0</sub> X<sub>1</sub> s<sub>1</sub> . . . X<sub>m</sub> s<sub>m</sub> a<sub>i</sub> s, a<sub>i+1</sub> . . . a<sub>n</sub>$) </p>
Here the configuration has shifted the current input symbol
a<sub>i</sub> and the next state
s = GOTO[s<sub>m</sub>, a<sub>i</sub>] onto the stack;
a<sub>i+1</sub> becomes the new current input symbol.
<li>If ACTION[s<sub>m</sub>, a<sub>i</sub>] =
reduce A - &gt; <font face="symbol">B</font>,then the parser
executes a reduce a move, entering the configuration
<p align=center>
(s<sub>0</sub> X<sub>1</sub> s<sub>1</sub> . . . X<sub>m-r</sub> s<sub>m-r</sub> A s, a<sub>i</sub> a<sub>i+1</sub> . . . a<sub>n</sub>$) </p>
where s = GOTO[s<sub>m-r</sub>, A] and r is the length of
<font face="symbol">B</font>, the right side of the production.
Here the first popped 2r symbols off the stack (r state symbols and r
grammar symbols), exposing state s<sub>m-r</sub>. The parser then
pushed both A, the left side of the production, and s, the entry for
ACTION[s<sub>m-r</sub>, A], onto the stack. The current input symbol
is not changed in a reduce move. Specifically,
X<sub>m-r+1</sub> . . . X<sub>m</sub>, the sequence of grammar symbols
are popped off the stack and will always match
<font face="symbol">B</font>, the right side of the reducing
production.
<li> If ACTION[s<sub>m</sub>, a<sub>i</sub>] = accept, parsing is
completed.
<li> If ACTION[s<sub>m</sub>, a<sub>i</sub>] = error, the parser has
discovered an error and calls an error recovery routine.
</ol>
<p>
The LR parsing algorithm is simple. Initially the LR parser is in the
configuration
(s<sub>0</sub>, a<sub>1</sub>a<sub>2</sub>...a<sub>n</sub>$) where
s<sub>0</sub> is a designated intial state and
a<sub>1</sub>a<sub>2</sub>...a<sub>n</sub> is the string to be parsed.
Then the parser executes moves until an accept or error action is
encountered.
</p>
<p>
I mentioned earlier that the GOTO function is essentially the
transition table of a deterministic finite automaton whose input
symbols (terminals and nonterminals) and a state when taken as
arguments produce another state. Hence the GOTO function can be
represented by a graph (directed) like scheme, where each node or state
will be a set of items with elements that are productions in the
grammar. The elements comprise the core of the items. The edges
representing the transition will be labeled with the symbol for which
the transition from one state to another is predetermined.
</p>
<p>
In the LALR (<i>lookahead</i>-LR) technique, LR items with common core
are coalesced, and the parsing actions are determined on the basis of
the new GOTO function generated. The tables obtained are considerably
smaller than the LR tables, yet most common syntactic constructs of
programming languages can be expressed conveniently by LALR grammar.
</p>
<p>
I am not going deep into construction of tables. Instead, I would like
to explain the use of a tool called <b>Yacc</b> for parser generation.
</p>
<h2><b>Calculator Using Yacc</b></h2>
<p>
Input to Yacc can be divided into three sections:
</p>
<p></p>
<ol>
<li>definitions sections that consists of token declarations, and C
code bracketed by "%{" and "}%"
<li>the BNF grammar in the rules section
<li>and user subroutines in the subroutines section.
</ol>
<p></p>
<p>
We shall illustrate that by designing a small calculator that can
add and subtract numbers. Let us start with the definitions section
for the Yacc input file:
</p>
<p></p>
<pre>
/* File name - calc1.l*/
%{
#include "y.tab.h"
#include &lt; stdlib.h &gt;
void yyerror(char *);
}%
%%
[0-9]+ {
yylval = atoi(yytext);
return INTEGER;
}
[-+\n] {
return *yytext;
}
[ \t] ; /*skip whitespace*/
. yyerror("Unknown character");
%%
int yywrap(void) {
return 1;
}
</pre>
<p></p>
<p>
Yacc when run generates a parser in the file <b>y.tab.c</b>, along side
which another file <b>y.tab.h</b> is also generated. Lex includes this
file and utilizes the definitions for token values. Lex returns the
values associated with the tokens in variable <b>yylval</b>. But to get
tokens, yacc calls <b>yylex</b> the return value of which is integer.
</p>
<p>
The yacc input specification is given below:
</p>
<pre>
/* file name calc1.y */
%{
int yylex(void);
void yyerror(char *);
%}
%token INTEGER
%%
program:
program expr '\n' { printf("%d\n", $2); }
|
;
expr:
INTEGER
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
;
%%
void yyerror(char *s) {
fprintf(stderr, "%s\n", s);
}
int main(void) {
yyparse();
return 0;
}
</pre>
<p></p>
<p>
Here, the grammar is specified using productions. Left hand side of a
production being a non terminal followed by a colon and then the right
hand side of a production. The contents of the braces show the action
associated with the productions. So what does the rules say ?
</p>
<p>
It says that a program consists of zero or more expressions. Each
expression terminates with a newline. When a newline is detected, we
print the value of the expression.
</p>
<p>
Now execute yacc as shown:
</p>
<p></p>
<pre>
yacc -d calc1.l
</pre>
<p></p>
<p>
You get a message "shift/reduce conflict". Shift/reduce conflict arises
when the grammar is ambiguous and there is a possibility of more than
one derivation tree. To understand this, consider the example given in
the stack implementation of shift-reduce parsing. In step 6, instead of
shifting we could have reduced appropriately as per the grammar . Then
addition will have higher precedence over multiplication.
</p>
<p>
Before proceeding you must know about another kind of conflict that is
reduce-reduce conflict. This arises when there are more than one option
for reducing a stack symbol. For example: In the grammar below
<b>id</b> can be reduced to <b>T</b> or <b>E</b>.
</p>
<p></p>
<pre>
E -&gt; T
E -&gt; id
T -&gt; id
</pre>
<p></p>
<p>
Yacc takes a default action when conflicts arise. When there is
shift-reduce conflict, yacc will shift and when there is reduce-reduce
conflict, it will use the first rule in the listing. Yacc also issues a
warning message when conflicts arise. Warnings can be eliminated by
making the grammar unambiguous.
</p>
<p>
Coming back, yacc produces two files; <b>y.tab.c</b> and <b>y.tab.h</b>.
Some lines one has to notice are:
</p>
<p></p>
<pre>
#ifndef YYSTYPE
typedef int YYSTYPE
#endif
#define INTEGER 257
YYSTYPE yylval
</pre>
<p></p>
<p>
Internally, yacc maintains two stacks in memory; a parse stack and a
value stack. The current parsing state is determined by the terminals
and/or non terminals that are present in the parse stack. The value
stack is always synchronized and holds an array of <b>YYSTYPE</b>
elements, which associates a value with each element in the parse stack.
So for example, when lex returns an INTEGER token, yacc shifts this
token to the parse stack. At the same time, the corresponding yylval is
shifted to the value stack. This makes it easier in finding the value
of a token at any given time.
</p>
<p>
So when we apply the rule
</p>
<p></p>
<pre>
expr: expr '+' expr { $$ = $1 + $3; }
</pre>
<p>
we pop <b>"expr '+' expr"</b> and replace it by <b>"expr"</b>. In other
words we replace the right hand side of a production by left hand side
of the same production. Here we pop three terms off the stack and push
back one term. The value stack will contain "$1" for the first term on
the right-hand side of the production, "$2" for the second and so on.
"$$" designates the top of the stack after reduction has taken place.
The above action adds the values associated with two expressions, pops
three terms off the value stack, and pushes back a single sum. Thus the
two stacks remain synchronized and when a newline is encountered, the
value associated with <b>expr</b> is printed.
</p>
<p>
The last function that we need is a 'main'. But the grammar is
ambiguous and yacc will issue shift-reduce warnings and will process
the grammar using shift as the default operation.
</p>
<p>
I am not giving the function here because there is more to learn.
I shall come up with that in the next part. I shall also explain how
to remove ambiguity from the grammar and then design the calculator
for it. In fact, some more funcionalities shall be incorporated into
the grammar to have a better understanding.
<!-- *** BEGIN author bio *** -->
<P>&nbsp;
<P>
<!-- *** BEGIN bio *** -->
<P>
<img ALIGN="LEFT" ALT="[BIO]" SRC="../gx/2002/note.png">
<em>
I am a final year student of Computer Science at Government Engineering
College, Trichur, Kerala, India. Apart from Linux I enjoy reading books
on theoretical physics.
</em>
<br CLEAR="all">
<!-- *** END bio *** -->
<!-- *** END author bio *** -->
<!-- *** BEGIN copyright *** -->
<hr>
<CENTER><SMALL><STRONG>
Copyright &copy; 2003, Hiran Ramankutty.
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
Published in Issue 87 of <i>Linux Gazette</i>, February 2003
</STRONG></SMALL></CENTER>
<!-- *** END copyright *** -->
<HR>
<!--startcut ==========================================================-->
<CENTER>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="qubism.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue87/ramankutty.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../lg_faq.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="sunil.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
</CENTER>
</BODY></HTML>
<!--endcut ============================================================-->