old-www/HOWTO/GCC-Frontend-HOWTO-2.html

147 lines
6.0 KiB
HTML

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
<TITLE> GCC Frontend HOWTO: Some general ideas about Compilers </TITLE>
<LINK HREF="GCC-Frontend-HOWTO-3.html" REL=next>
<LINK HREF="GCC-Frontend-HOWTO-1.html" REL=previous>
<LINK HREF="GCC-Frontend-HOWTO.html#toc2" REL=contents>
</HEAD>
<BODY>
<A HREF="GCC-Frontend-HOWTO-3.html">Next</A>
<A HREF="GCC-Frontend-HOWTO-1.html">Previous</A>
<A HREF="GCC-Frontend-HOWTO.html#toc2">Contents</A>
<HR>
<H2><A NAME="s2">2. Some general ideas about Compilers </A></H2>
<P>A compiler is a translator which accepts programs
in a source language and converts them into
programs in another language which is most
often the assembly code of a real (or virtual)
microprocessor. Designing real compilers is a
complex undertaking and requires formal training
in Computer Science and Mathematics.
<P>
<P>The compilation process can be divided into a number of
subtasks called phases. The different phases involved are
<OL>
<LI>Lexical analysis</LI>
<LI>Syntax analysis</LI>
<LI>Intermediate code generation</LI>
<LI>Code Optimization</LI>
<LI>Code generation.</LI>
</OL>
<P>Symbol tables and error handlers are involved in all the
above phases. Let us look at these stages.
<OL>
<LI>Lexical analysis
<P>The lexical analyzer reads the source program and emits
tokens. Tokens are atomic units, which represent a sequence of
characters. They can be treated as single logical entitities. Identifiers,
keywords, constants, operators, punctuation symbols etc. are
examples of tokens. Consider the C statement,
<PRE>
return 5;
</PRE>
<P>It has 3 tokens in it. The keyword 'return', the constant 5 and the
punctuation semicolon.
</LI>
<LI>Syntax analysis
<P>Tokens from the lexical analyzer are the input to this
phase. A set of rules (also known as productions) will be
provided for any programming language. This defines the grammar
of the language. The syntax analyzer checks whether the given input is
a valid one. ie. Whether it is permitted by the given grammar.
We can write a small set of productions.
<PRE>
Sentence -&gt; Noun Verb
Noun -&gt; boy | girl | bird
Verb -&gt; eats | runs | flies
</PRE>
<P>This is a grammar of no practical importance (and also no
meaning). But it gives a rough idea about a grammar.
<P>A parser for a grammar takes a string as input. It can
produce a parse tree as output. Two types of parsing are possible.
Top down parsing and bottom up parsing. The meaning is clear from
the names. A bottom up parser starts from the leaves and traverse
to the root of the tree.
<P>In the above grammar if we are given "bird flies" as input,
it is possible to trace back to the root for a valid 'Sentence'.
One type of bottom-up parsing is a "shift-reduce" parsing.
The general method used here is to take the input symbols and
push it in a stack until the right side of a production appears
on top of the stack. In that case it is reduced with the left
side. Thus, it consists of shifting the input and reducing it
when possible. An LR (left to right) bottom up parser can be
used.
</LI>
<LI>Intermediate Code Generation.
<P>Once the syntactic constructs are determined, the compiler
can generate object code for each construct. But the compiler
creates an intermediate form. It helps in code optimization and
also to make a clear-cut separation between machine independent
phases (lexical, syntax) and machine dependent phases (optimization,
code generation).
<P>One form of intermediate code is a parse tree. A parse tree
may contain variables as the terminal nodes. A binary operator
will be having a left and right branch for operand1 and operand2.
<P>Another form of intermediate code is a three-address code.
It has got a general structure of A = B op C, where A, B and C
can be names, constants, temporary names etc. op can be any operator.
Postfix notation is yet another form of intermediate code.
</LI>
<LI>Optimization
<P>Optimization involves the technique of improving the object
code created from the source program. A large number of object
codes can be produced from the source program. Some of the object
codes may
be comparatively better. Optimization is a search for a better
one (may not be the best, but better).
<P>A number of techniques are used for the optimization.
Arithmetic simplification, Constant folding are a few among them.
Loops are a major target of this phase. It is mainly because of
the large amount of time spent by the program in inner loops.
Invariants are removed from the loop.
</LI>
<LI>Code generation
<P>The code generation phase converts the intermediate code
generated into a sequence of machine instructions. If we are
using simple routines for code generation, it may lead to
a number of redundant loads and stores. Such inefficient
resource utilization should be avoided. A good code generator
uses its registers efficiently.
</LI>
<LI>Symbol table
<P>A large number of names (such as variable names) will be
appearing in the source program. A compiler needs to collect
information about these names and use them properly. A data
structure used for this purpose is known as a symbol table.
All the phases of the compiler use the symbol table in one way
or other.
<P>Symbol table can be implemented in many ways. It ranges
from the simple arrays to the complex hashing methods. We have
to insert new names and information into the symbol table and
also recover them as and when required.
</LI>
<LI>Error handling.
<P>A good compiler should be capable of detecting and
reporting errors to the user in a most efficient manner. The error
messages should be highly understandable and flexible. Errors
can be caused because of a number of reasons ranging from
simple typing mistakes to complex errors included by the
compiler (which should be avoided at any cost).
</LI>
</OL>
<P>
<P>
<P>
<P>
<P>
<HR>
<A HREF="GCC-Frontend-HOWTO-3.html">Next</A>
<A HREF="GCC-Frontend-HOWTO-1.html">Previous</A>
<A HREF="GCC-Frontend-HOWTO.html#toc2">Contents</A>
</BODY>
</HTML>