147 lines
6.0 KiB
HTML
147 lines
6.0 KiB
HTML
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<META NAME="GENERATOR" CONTENT="SGML-Tools 1.0.9">
|
|
<TITLE> GCC Frontend HOWTO: Some general ideas about Compilers </TITLE>
|
|
<LINK HREF="GCC-Frontend-HOWTO-3.html" REL=next>
|
|
<LINK HREF="GCC-Frontend-HOWTO-1.html" REL=previous>
|
|
<LINK HREF="GCC-Frontend-HOWTO.html#toc2" REL=contents>
|
|
</HEAD>
|
|
<BODY>
|
|
<A HREF="GCC-Frontend-HOWTO-3.html">Next</A>
|
|
<A HREF="GCC-Frontend-HOWTO-1.html">Previous</A>
|
|
<A HREF="GCC-Frontend-HOWTO.html#toc2">Contents</A>
|
|
<HR>
|
|
<H2><A NAME="s2">2. Some general ideas about Compilers </A></H2>
|
|
|
|
<P>A compiler is a translator which accepts programs
|
|
in a source language and converts them into
|
|
programs in another language which is most
|
|
often the assembly code of a real (or virtual)
|
|
microprocessor. Designing real compilers is a
|
|
complex undertaking and requires formal training
|
|
in Computer Science and Mathematics.
|
|
<P>
|
|
<P>The compilation process can be divided into a number of
|
|
subtasks called phases. The different phases involved are
|
|
<OL>
|
|
<LI>Lexical analysis</LI>
|
|
<LI>Syntax analysis</LI>
|
|
<LI>Intermediate code generation</LI>
|
|
<LI>Code Optimization</LI>
|
|
<LI>Code generation.</LI>
|
|
</OL>
|
|
<P>Symbol tables and error handlers are involved in all the
|
|
above phases. Let us look at these stages.
|
|
<OL>
|
|
<LI>Lexical analysis
|
|
<P>The lexical analyzer reads the source program and emits
|
|
tokens. Tokens are atomic units, which represent a sequence of
|
|
characters. They can be treated as single logical entitities. Identifiers,
|
|
keywords, constants, operators, punctuation symbols etc. are
|
|
examples of tokens. Consider the C statement,
|
|
<PRE>
|
|
return 5;
|
|
</PRE>
|
|
<P>It has 3 tokens in it. The keyword 'return', the constant 5 and the
|
|
punctuation semicolon.
|
|
</LI>
|
|
<LI>Syntax analysis
|
|
<P>Tokens from the lexical analyzer are the input to this
|
|
phase. A set of rules (also known as productions) will be
|
|
provided for any programming language. This defines the grammar
|
|
of the language. The syntax analyzer checks whether the given input is
|
|
a valid one. ie. Whether it is permitted by the given grammar.
|
|
We can write a small set of productions.
|
|
<PRE>
|
|
|
|
Sentence -> Noun Verb
|
|
Noun -> boy | girl | bird
|
|
Verb -> eats | runs | flies
|
|
</PRE>
|
|
<P>This is a grammar of no practical importance (and also no
|
|
meaning). But it gives a rough idea about a grammar.
|
|
<P>A parser for a grammar takes a string as input. It can
|
|
produce a parse tree as output. Two types of parsing are possible.
|
|
Top down parsing and bottom up parsing. The meaning is clear from
|
|
the names. A bottom up parser starts from the leaves and traverse
|
|
to the root of the tree.
|
|
<P>In the above grammar if we are given "bird flies" as input,
|
|
it is possible to trace back to the root for a valid 'Sentence'.
|
|
One type of bottom-up parsing is a "shift-reduce" parsing.
|
|
The general method used here is to take the input symbols and
|
|
push it in a stack until the right side of a production appears
|
|
on top of the stack. In that case it is reduced with the left
|
|
side. Thus, it consists of shifting the input and reducing it
|
|
when possible. An LR (left to right) bottom up parser can be
|
|
used.
|
|
</LI>
|
|
<LI>Intermediate Code Generation.
|
|
<P>Once the syntactic constructs are determined, the compiler
|
|
can generate object code for each construct. But the compiler
|
|
creates an intermediate form. It helps in code optimization and
|
|
also to make a clear-cut separation between machine independent
|
|
phases (lexical, syntax) and machine dependent phases (optimization,
|
|
code generation).
|
|
<P>One form of intermediate code is a parse tree. A parse tree
|
|
may contain variables as the terminal nodes. A binary operator
|
|
will be having a left and right branch for operand1 and operand2.
|
|
<P>Another form of intermediate code is a three-address code.
|
|
It has got a general structure of A = B op C, where A, B and C
|
|
can be names, constants, temporary names etc. op can be any operator.
|
|
Postfix notation is yet another form of intermediate code.
|
|
</LI>
|
|
<LI>Optimization
|
|
<P>Optimization involves the technique of improving the object
|
|
code created from the source program. A large number of object
|
|
codes can be produced from the source program. Some of the object
|
|
codes may
|
|
be comparatively better. Optimization is a search for a better
|
|
one (may not be the best, but better).
|
|
<P>A number of techniques are used for the optimization.
|
|
Arithmetic simplification, Constant folding are a few among them.
|
|
Loops are a major target of this phase. It is mainly because of
|
|
the large amount of time spent by the program in inner loops.
|
|
Invariants are removed from the loop.
|
|
</LI>
|
|
<LI>Code generation
|
|
<P>The code generation phase converts the intermediate code
|
|
generated into a sequence of machine instructions. If we are
|
|
using simple routines for code generation, it may lead to
|
|
a number of redundant loads and stores. Such inefficient
|
|
resource utilization should be avoided. A good code generator
|
|
uses its registers efficiently.
|
|
</LI>
|
|
<LI>Symbol table
|
|
<P>A large number of names (such as variable names) will be
|
|
appearing in the source program. A compiler needs to collect
|
|
information about these names and use them properly. A data
|
|
structure used for this purpose is known as a symbol table.
|
|
All the phases of the compiler use the symbol table in one way
|
|
or other.
|
|
<P>Symbol table can be implemented in many ways. It ranges
|
|
from the simple arrays to the complex hashing methods. We have
|
|
to insert new names and information into the symbol table and
|
|
also recover them as and when required.
|
|
</LI>
|
|
<LI>Error handling.
|
|
<P>A good compiler should be capable of detecting and
|
|
reporting errors to the user in a most efficient manner. The error
|
|
messages should be highly understandable and flexible. Errors
|
|
can be caused because of a number of reasons ranging from
|
|
simple typing mistakes to complex errors included by the
|
|
compiler (which should be avoided at any cost).
|
|
</LI>
|
|
</OL>
|
|
<P>
|
|
<P>
|
|
<P>
|
|
<P>
|
|
<P>
|
|
<HR>
|
|
<A HREF="GCC-Frontend-HOWTO-3.html">Next</A>
|
|
<A HREF="GCC-Frontend-HOWTO-1.html">Previous</A>
|
|
<A HREF="GCC-Frontend-HOWTO.html#toc2">Contents</A>
|
|
</BODY>
|
|
</HTML>
|