
<!--startcut ==========================================================-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<title>Using Python to Generate HTML Pages Issue 19</title>
</HEAD>
<BODY BGCOLOR="#EEE1CC" TEXT="#000000" LINK="#0000FF" VLINK="#0020F0"
ALINK="#FF0000">
<!--endcut ============================================================-->

<H4>
&quot;Linux Gazette...<I>making Linux just a little more fun!</I>&quot;
</H4>

<P> <HR> <P>
<!--===================================================================-->

<center>
<H2>Using Python to Generate HTML Pages</H2>
<H4>By Richie Bielak,
<a href="mailto:richieb@netlabs.net">richieb@netlabs.net</a></H4>
</center>
<P><HR>


<h2>Introduction</h2>

<p>I have waited for a long time to set up my own Web site, mostly
because I didn't know what to put there that others may want to
see. Then I got an idea. Since I'm an avid reader and an aviation
enthusiast, I decided to create pages with a list of aviation books I
have read. My initial intention was to write reviews for each book.
<p>

Setting up the pages was easy to start with, but as I added more books
the maintenance became tedious. I had to update a couple of indices with
the same data and I had to sort them by hand, and alphabetizing was
never my strong suit. I needed to find a better way.
<p>

Around the same time I became interested in the programming language
Python and it seemed that Python would be a good tool to automatically
generate the various HTML pages from a simple text file. This would
greatly simplify the updates of my book pages, as I would only add one
entry to one file and then create complete pages by running a Python
script.
<p>

I was attracted to Python for two main reasons: it's very good at
processing strings and it's object oriented. Of course the fact that
the Python interpreter is free and that it runs on many different systems
helped. At first I installed Python on my Win95 machine, but I just
couldn't force myself to do any programming in the Windows
environment, even in Python. Instead I installed Linux and moved all
my Web projects there.
<p>

<h2>The Problem</h2>

The main goal of the program is to generate three different book
indices, by author, by title and by subject, from a single input
file.  I started by defining the format of this file. Here is what a
typical entry describing one book looks like:
<pre>
	title: Zero Three Bravo
	author: Gosnell, Mariana
	subject: General Aviation
	url: 3zb.htm
	# this is a comment
</pre>
Each line starts with a keyword (e.g. "title:" or "author:") and is
followed by a value that will be shown in the final HTML
page. The description of each book must start with the "title:" line, there
must be at least one "author:" tag, and the "url:" entry points to a
review of the book, if there is one.
<p>

Since Python is object-oriented we begin program design by
looking for "objects". In a nutshell, object oriented (OO) programming
is a way to structure your code around the things, that is "objects",
that the program is working with. This rather simple idea of
organizing software around what it works with (objects), rather than
what it does (functions), turns out to be surprisingly powerful.
<p>

Within an OO program similar objects are grouped into "classes" and the
code we write describes each class. Objects that belong to a given
class are called "instances of the class".
<p>

I hope it is pretty obvious to you that since the program will
manipulate "book" objects, we need a Python class that will represent
a single book. Just knowing this is enough to let us suspend design
and write some code.
<p>

<h2>The Book Class</h2>

Before we start looking at the code we need to consider briefly how
Python programs are organized. Each program consists of a number of
modules, each module is contained in a file (usually named with the
extension ".py") and the name of the file (without the ".py") serves
as the module name.  A module can contain any number of routines or
classes. Typically things that are related are kept in one module. For
example, there is a <tt>string</tt> module that contains functions that
operate on strings. To access functions or classes from another module
we use the <tt>import</tt> statement. For example the first line of
the <tt>Book</tt> module is:
<pre>
    from string import split, strip
</pre>
which says that the routines <tt>split</tt> and <tt>strip</tt> are
obtained from the <tt>string</tt> module.<p>
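For readers on a current interpreter: Python 3 no longer provides <tt>split</tt> and <tt>strip</tt> as functions of the <tt>string</tt> module. The same operations are methods of string objects. The following is a minimal sketch of the equivalent calls, assuming Python 3; the sample value is only an illustration shaped like an "author:" entry.

```python
# Python 3 sketch: split and strip are methods of str objects,
# so nothing has to be imported from the string module.
raw_author = "Gosnell, Mariana"      # sample value shaped like an "author:" entry

names = raw_author.split(",")        # split at the comma into two pieces
last_name = names[0].strip()         # remove surrounding blanks from each piece
first_name = names[1].strip()

print(last_name, first_name)
```

The rest of the article keeps the Python 1.4 spelling, <tt>split (s, ",")</tt> and <tt>strip (s)</tt>, which does the same job.<p>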

Next, I have to point out a few syntactic features of Python that are
not immediately obvious from the code. The most important is the fact that
in Python indentation is part of the syntax. To see which statements
will be executed following an "if", all you need to look at is
indentation - there is no need for curly braces, <tt>BEGIN/END</tt>
pairs or "fi" statements.<p>

Here is a typical "if" statement extracted from the <tt>set_author</tt>
routine in the <tt>Book</tt> class:
<pre>
	if new_author:
	    names = split (new_author, ",")
	    self.last_name.append (strip (names[0]))
	    self.first_name.append (strip (names[1]))
	else:
	    self.last_name = []
	    self.first_name = []
</pre>
The three statements following the "if" are executed if the "new_author"
variable contains a non-empty value. The amount of indentation is not
important, but it must be consistent. Also note the colon (":") which
is used to terminate the header of each compound statement.<p>


The <tt>Book</tt> class turns out to be very simple. It consists
of routines that set the values for author, title, subject and the URL
for each book. For example, here is the <tt>set_title</tt> routine:
<pre>
    def set_title (self, new_title):
	self.title = new_title
</pre>
The first argument to the "set_title" method (that is a routine which
belongs to a class) is "self". This argument always refers to the
instance to which the method is applied. Furthermore, the attributes
(i.e. the data contained in each object) must be qualified with "self"
when referenced within the body of a method.  In the example above the
attribute "title" of a "Book" object is set to the value of "new_title".
<p>
If in another part of a program we have a variable "b" that references an
instance of the "Book" class, this call would set the book's title:
<pre>
    b.set_title ("Fate is the Hunter")
</pre>
Note that the "self" argument is <i>not</i> present in the call,
instead the object to which the method is applied (i.e. the object
before the ".", "b" above) becomes the "self" argument.
<p>
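One way to see how the object before the "." becomes "self" is to make the same call through the class, passing the instance explicitly. The sketch below assumes Python 3 and uses a stripped-down stand-in for the <tt>Book</tt> class with only the one method.

```python
class Book:
    """Stripped-down stand-in for the Book class: one method only."""

    def set_title(self, new_title):
        self.title = new_title


b = Book()
b.set_title("Fate is the Hunter")       # usual form: b is passed as self
Book.set_title(b, "Zero Three Bravo")   # same method, self passed explicitly
print(b.title)
```

Both calls run the same routine; the first form is simply the conventional way to write it.<p>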


At this point a reasonable question to ask is "Where do the objects
come from?"  Each object is created by a special call that uses the
class name as the name of a function. In addition a class can define a
method with the name <tt>__init__</tt> which will automatically be
called to initialize the new object's attributes (in C++ such a
routine is called a constructor).
<p>
Here is the <tt>__init__</tt> routine for the <tt>Book</tt> class:
<pre>
    def __init__ (self, t="", a="", s="", u=""):
	#
	# Create an instance of Book
	#
	self.title = t
	self.last_name = []
	self.first_name = []
	self.set_author (a)
	self.subject = s
	self.url = u
</pre>
The main purpose of the above routine is to create all the attributes
of the new "Book" object. Note that the arguments to "__init__" are
specified with default values, so that the caller needs only to pass the
arguments that differ from the default.
<p>

Here are some examples of calls to create "Book" objects:
<pre>
    a = Book()
    b = Book ("Fate is the Hunter")
    c = Book ("Some book", "First, Author")
</pre>
<p>
There is one small complication in the "Book" class. It is possible
for a book to have more than one author. That's why the attributes
"first_name" and "last_name" are actually lists. We'll look more at
lists in the next section. <p>
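To make the multiple-author case concrete, here is a small self-contained sketch that combines the <tt>__init__</tt> and <tt>set_author</tt> fragments shown above. It assumes Python 3 (string methods instead of the <tt>string</tt> module), leaves out the other setters, and the second author name is made up for illustration.

```python
class Book:
    """Cut-down version of the Book class, written for Python 3."""

    def __init__(self, t="", a="", s="", u=""):
        self.title = t
        self.last_name = []
        self.first_name = []
        self.set_author(a)
        self.subject = s
        self.url = u

    def set_author(self, new_author):
        if new_author:
            # "Last, First" adds one entry to each of the two lists
            names = new_author.split(",")
            self.last_name.append(names[0].strip())
            self.first_name.append(names[1].strip())
        else:
            # an empty value clears both author lists
            self.last_name = []
            self.first_name = []


c = Book("Some book", "First, Author")
c.set_author("Second, Writer")   # made-up second author for the same book
d = Book(u="3zb.htm")            # keyword argument: only the URL is given
print(c.last_name, c.first_name)
```

Each call to <tt>set_author</tt> with a non-empty value appends to both lists, so the names at the same position belong to the same author.<p>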

The complete <tt>Book</tt> class is shown in <a href=book.html>
Listing #1</a>. To test the class we add a little piece of code at the end
of the file that checks whether the module is running as <tt>__main__</tt>,
that is, whether execution started in this file. If so, the code that tests the
<tt>Book</tt> class will run.
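<p>
The test for <tt>__main__</tt> usually takes the shape sketched below. This is only an illustration in Python 3 syntax: the <tt>self_test</tt> routine and the tiny <tt>Book</tt> stand-in are hypothetical, and the actual test code is the one in Listing #1.

```python
class Book:
    """Minimal stand-in so the test code has something to exercise."""

    def __init__(self, t=""):
        self.title = t

    def set_title(self, new_title):
        self.title = new_title


def self_test():
    # Hypothetical test routine, not taken from Listing #1
    b = Book("Fate is the Hunter")
    b.set_title("Zero Three Bravo")
    return b.title


if __name__ == "__main__":
    # True only when this file is run directly, not when it is imported
    print("self test:", self_test())
```

When another module imports this file, <tt>__name__</tt> holds the module name instead, so the test code is skipped.<p>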

<h2>The Book_List Class</h2>

Once the <tt>Book</tt> is tested we can go back to designing. The next
obvious object is a list which will contain all the "book"
objects. For the purposes of our program we have to be able to create
the book list from the input file and we have to sort the books in the
list by author, title or subject. The sorted list will then be used as
input into the code that actually generates HTML pages. <p>

As it turns out one of Python's built-in data structures is a list. Here is
a snippet of code showing creation of a list and addition of some items
(this example was produced by running Python interactively):
<pre>
Python 1.4 (Dec 18 1996)  [GCC 2.7.2.1]
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
&gt;&gt;&gt; s = []
&gt;&gt;&gt; s.append ("a")
&gt;&gt;&gt; s.append ("hello")
&gt;&gt;&gt; s.append (1)
&gt;&gt;&gt; print s
['a', 'hello', 1]
</pre>
Above we create a list called "s" and add three items to it. Lists
allow "slicing" operations, which let you pull out pieces of a list by
specifying element numbers. These examples illustrate the idea:
<pre>
&gt;&gt;&gt; print s[1]
hello
&gt;&gt;&gt; print s[1:]
['hello', 1]
&gt;&gt;&gt; print s[:2]
['a', 'hello']
&gt;&gt;&gt; print s[0]
a
</pre>
<tt>s[1]</tt> denotes the second element of the list (indexing starts
at zero), <tt>s[1:]</tt> is the slice from the second element to the
end of the list, <tt>s[:2]</tt> goes from the start up to, but not
including, the third element, and <tt>s[0]</tt> is the first item.
<p>

Finally, lists have a "sort" method which sorts the elements according to
a user-supplied comparison function.
<p>
Armed with the knowledge of Python lists, writing the <tt>Book_List</tt> class
is easy. The class will have a single attribute, "contents", which will be a
list of books.
<p>
The constructor for the <tt>Book_List</tt> class simply creates a
"contents" attribute and initializes it to be an empty list. The
routine that parses the input file and creates list elements is called
"make_from_file" and it begins with the code:
<pre>
   def make_from_file (self, file):
	#
	# Read the file and create a book list
	#
	lines = file.readlines ()
	self.contents = []
</pre>
The "file" argument is a handle to an open text file that contains the
descriptions of the books. The first step this routine performs is to
read the entire file into a list of strings, each string representing
one line of text. Next, using Python's "for" loop we step through this
list and examine each line of text:
<pre>
	#
	# Parse each line and create a list of Book objects
	#
	for one_line in lines:
	    # It's  not a comment or empty line
	    if (len(one_line) &gt; 0) and (one_line[0] != "#"):
    	            # Split into tokens
		    tokens = string.split (one_line)
</pre>
If the line is not empty and is not a comment (that is, the first
character is not a "#") then we split the line into words, a word
being a sequence of characters without spaces. The call "tokens =
string.split (one_line)" uses the "split" routine from the "string"
module. "split" returns the words it found in a list.
<pre>
		    if len (tokens) &gt; 0:
			if (tokens[0] == "title:"):
			    current_book = book.Book (string.join (tokens[1:]))
			    self.contents.append (current_book)
			elif (tokens[0] == "author:"):
			    current_book.set_author (string.join (tokens[1:]))
			elif (tokens[0] == "subject:"):
			    current_book.set_subject (string.join (tokens[1:]))
			elif (tokens[0] == "url:"):
			    current_book.set_url (string.join (tokens[1:]))

</pre>
The first token (i.e. word) on the line is the keyword that tells us
what to do. If it is "title:" then we create a new <tt>Book</tt>
object and append it to the list of books, otherwise we just set the
proper attributes. Note that the remaining tokens found on each line
are joined together into a string (using the "string.join" routine). There
is probably a more efficient way to code this, but for my purposes
this code works fast enough.
<p>
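As one possible alternative to splitting a line into all its words and joining them again, the line can be split once into a keyword and the rest. The sketch below assumes Python 3 and is not the code from the listings: it is a plain function rather than a method, it uses a simplified <tt>Book</tt> stand-in that stores whole author strings, it strips leading blanks before checking for "#", and it silently ignores entries that appear before the first "title:" line.

```python
import io


class Book:
    """Simplified stand-in: authors are kept as whole strings."""

    def __init__(self, title):
        self.title = title
        self.authors = []
        self.subject = ""
        self.url = ""


def make_from_file(file):
    """Return a list of Book objects read from an open text file."""
    contents = []
    current_book = None
    for one_line in file:
        stripped = one_line.strip()
        if not stripped or stripped.startswith("#"):
            continue                      # skip blank lines and comments
        # Split once: the keyword, then the rest of the line as the value
        parts = stripped.split(None, 1)
        keyword = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if keyword == "title:":
            current_book = Book(value)
            contents.append(current_book)
        elif current_book is None:
            continue                      # no "title:" seen yet; ignored here
        elif keyword == "author:":
            current_book.authors.append(value)
        elif keyword == "subject:":
            current_book.subject = value
        elif keyword == "url:":
            current_book.url = value
    return contents


# An in-memory file holding the sample entry from earlier in the article
sample = io.StringIO(
    "title: Zero Three Bravo\n"
    "author: Gosnell, Mariana\n"
    "subject: General Aviation\n"
    "url: 3zb.htm\n"
    "# this is a comment\n"
)
books = make_from_file(sample)
print(books[0].title, books[0].authors)
```

Unlike the split-and-join version, splitting once keeps any repeated blanks inside the value exactly as they were typed.<p>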
The other interesting parts of the <tt>Book_List</tt> class are the sort
routines. Here is how the list is sorted by title:
<pre>
    def sort_by_title (self):
	#
	# Sort book list by title
	#
	self.contents.sort (lambda x, y: cmp (x.title, y.title))

</pre>
We simply call the "sort" routine on the list. To get proper ordering we
need to supply a function that compares two <tt>Book</tt> objects. For
sorting by title we have to supply an anonymous function, which is
introduced with the keyword "lambda" (those of you familiar with Lisp,
or other functional languages should recognize this construct). The definition:
<pre>
      lambda x, y: cmp (x.title, y.title)
</pre>
simply says that this is a function of two arguments and the function result comes
from calling the Python built-in function "cmp" (i.e. compare) on the "title"
attribute of the two objects.<p>
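Readers using Python 3 should note that the built-in "cmp" is gone there and that "sort" no longer accepts a comparison function directly. The sketch below, assuming Python 3, shows two ways to get the same ordering: adapting the two-argument lambda with <tt>functools.cmp_to_key</tt>, or passing a one-argument key function. The local <tt>cmp</tt> helper and the cut-down <tt>Book</tt> class are stand-ins written for this example.

```python
from functools import cmp_to_key


class Book:
    def __init__(self, title):
        self.title = title


def cmp(a, b):
    # Stand-in for the old built-in: negative, zero or positive result
    return (a > b) - (a < b)


contents = [Book("Zero Three Bravo"), Book("Some book"), Book("Fate is the Hunter")]

# Closest to the call above: adapt the two-argument comparison function
contents.sort(key=cmp_to_key(lambda x, y: cmp(x.title, y.title)))
via_cmp = [b.title for b in contents]

# More common today: a key function that extracts the value to sort on
contents.sort(key=lambda b: b.title)
via_key = [b.title for b in contents]
print(via_key)
```

Both calls leave the list in the same order; the key form is usually shorter and calls the function only once per element.<p>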

The other sort routines are similar, except that in "sort_by_author" I
used a local function instead of a "lambda", because the comparison
was a little more complicated - I wanted to have all the books with the
same author appear alphabetically by title.
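<p>
One way such a two-level ordering could be written is sketched below. It is not the local function from the listing: it assumes Python 3, compares only the first author of each book, and uses made-up sample data. The idea is to sort on a tuple, so the title matters only when the author names are equal.

```python
class Book:
    def __init__(self, title, last_name, first_name):
        self.title = title
        self.last_name = [last_name]     # lists, as in the Book class
        self.first_name = [first_name]


def author_then_title(book):
    # Tuples compare element by element, so the title breaks author ties
    return (book.last_name[0], book.first_name[0], book.title)


# Made-up sample data
contents = [
    Book("Title B", "Smith", "Ann"),
    Book("Title C", "Jones", "Bob"),
    Book("Title A", "Smith", "Ann"),
]
contents.sort(key=author_then_title)
ordered = [(b.last_name[0], b.title) for b in contents]
print(ordered)
```

The two books by the same author end up next to each other, ordered by title.<p>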


<h2>Generating Pages</h2>

Now that we have constructed a list of books, the next step is to create
the HTML pages. We begin by creating a class, called <tt>Html_Page</tt>, that
generates the basic outline of a page and then we extend that class to create
the titles, authors and subjects pages.<p>

The idea that existing code can be extended yet not changed is the
second most important idea of OO programming. The mechanism for doing
this is called "inheritance" and it allows the programmer to create a
new class by adding new properties to an old class and the old class
does not have to change. A way to think about inheritance is as
"programming by differences". In our program we will create three
classes that inherit from <tt>Html_Page</tt>.<p>

<tt>Html_Page</tt> is quite simple. It consists of routines that
generate the header and the trailer tags for an HTML page. It also
contains an empty routine for generating the body of the page. This
routine will be defined in descendant classes. The <tt>__init__</tt>
routine lets the user of this class specify a title and a top level
heading for the page.<p>
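The shape of such a base class can be sketched as follows. This is an illustration only, assuming Python 3: the routine names <tt>generate_header</tt> and <tt>generate</tt>, the exact tags written, and the <tt>Hello_Page</tt> descendant are assumptions made for the example and may differ from the listing. An in-memory file stands in for the real output file.

```python
import io


class Html_Page:
    def __init__(self, title, heading):
        self.title = title
        self.heading = heading
        self.f = None          # output file; a descendant is expected to set it

    def generate_header(self):
        self.f.write("<html>\n<head>\n<title>" + self.title + "</title>\n</head>\n")
        self.f.write("<body>\n<h1>" + self.heading + "</h1>\n")

    def generate_body(self):
        pass                   # intentionally empty; redefined in descendants

    def generate_trailer(self):
        self.f.write("</body>\n</html>\n")

    def generate(self):
        # Fixed outline: header, then whatever body the descendant supplies
        self.generate_header()
        self.generate_body()
        self.generate_trailer()


class Hello_Page(Html_Page):
    """Hypothetical descendant used only to show the redefinition."""

    def __init__(self):
        Html_Page.__init__(self, "Hello", "Hello page")
        self.f = io.StringIO()     # in-memory stand-in for an output file

    def generate_body(self):
        self.f.write("<p>body text</p>\n")


page = Hello_Page()
page.generate()
output = page.f.getvalue()
print(output)
```

The base class fixes the order of the pieces, and each descendant only fills in the part that differs.<p>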

When I first tested the output of the HTML generators I simply printed
it to the screen and manually saved it into a file, so I could see the
page in a browser. But once I was happy with the appearance, I had to
change the code to save the data into a file. That's why in <tt>Html_Page</tt>
you will see code like this:
<pre>
	self.f.write ("&lt;html&gt;\n")
	self.f.write ("&lt;head&gt;\n")
</pre>
for writing the output to a file referenced by the attribute "f". <p>
However, since the actual output file will be different for each page,
opening of the file is deferred to a descendant class. <p>

You can see the complete code for <tt>Html_Page</tt> in
<a href="html_page.html">Listing #3</a>.<p>

The three classes <tt>Authors_Page</tt>, <tt>Titles_Page</tt> and
<tt>Subjects_Page</tt> are used to create the final HTML pages. Since these
classes belong together I put them in one module, called <tt>books_pages</tt>.
Because the code for these classes is very similar we will only look at
the first one.<p>

Here is how <tt>Authors_Page</tt> begins:
<pre>
class Authors_Page (Html_Page):

    def __init__ (self):
	Html_Page.__init__ (self, "Aviation Books: by Author",
			    "&lt;i&gt;Aviation Books: indexed by Author&lt;/i&gt;")
	self.f = open ("books_by_author.html", "w")
	print "Authors page in--&gt; " + self.f.name
</pre>
To start with, note that the class heading lists the name of the class from
which <tt>Authors_Page</tt> inherits, namely <tt>Html_Page</tt>. Next
notice that the constructor invokes the constructor from the parent
class, by calling the <tt>__init__</tt> routine qualified by the class
name. Finally, the constructor names and opens the output file. To keep
things simple, I decided not to make the file name a parameter. <p>

Since the book list is needed to generate the body of each page I added
a <tt>book_list</tt> attribute to each page class. This attribute is set
before HTML generation starts. <p>

The <tt>generate_body</tt> routine redefines the empty routine from
the parent class. Although fairly long, the code is pretty easy to
understand once you know that the book list is represented as an HTML
table and the "+" is the concatenation operator for strings. <p>
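To give a feel for building table rows with "+", here is a small sketch. It assumes Python 3, and the <tt>table_row</tt> helper and the two-column layout are hypothetical, chosen for illustration; the actual table layout is the one in the listing.

```python
def table_row(last_name, first_name, title, url=""):
    # Link the title to its review only when a URL was given
    if url:
        title_cell = '<a href="' + url + '">' + title + "</a>"
    else:
        title_cell = title
    return ("<tr><td>" + last_name + ", " + first_name + "</td>"
            + "<td>" + title_cell + "</td></tr>\n")


rows = table_row("Gosnell", "Mariana", "Zero Three Bravo", "3zb.htm")
rows = rows + table_row("First", "Author", "Some book")
table = "<table>\n" + rows + "</table>\n"
print(table)
```

A book without a review simply gets its title as plain text instead of a link.<p>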

In addition to replacing  the <tt>generate_body</tt> routine we also redefine
<tt>generate_trailer</tt> routine in order to put a back link to the book index
at the bottom of each page:
<pre>
    def generate_trailer (self):
	self.f.write ("&lt;hr&gt;\n")
	self.f.write ("&lt;center&gt;&lt;a href=books.html&gt;Back to Aviation Books Top Page&lt;/a&gt;&lt;/center&gt;\n")
	self.f.write ("&lt;hr&gt;\n")
	Html_Page.generate_trailer (self)
</pre>
Notice how right after we generate the back link, we include a call to
parent's <tt>generate_trailer</tt> routine to finish off the page with
correct terminating tags.<p>
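The same pattern of extending a parent routine can be tried in isolation with the sketch below. It assumes Python 3, uses an in-memory file in place of the real output file, and the tags written by the parent's trailer are an assumption made for the example. In Python 3 the parent call can also be spelled with <tt>super()</tt>.

```python
import io


class Html_Page:
    def __init__(self):
        self.f = io.StringIO()      # in-memory stand-in for the output file

    def generate_trailer(self):
        # Assumed closing tags, for illustration
        self.f.write("</body>\n</html>\n")


class Authors_Page(Html_Page):
    def generate_trailer(self):
        # Extra content first, then let the parent close the page
        self.f.write("<hr>\n")
        self.f.write("<a href=books.html>Back to Aviation Books Top Page</a>\n")
        super().generate_trailer()  # same effect as Html_Page.generate_trailer(self)


page = Authors_Page()
page.generate_trailer()
trailer = page.f.getvalue()
print(trailer)
```

Because the descendant calls the parent routine last, the closing tags always come after the back link.<p>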

Complete listings for the three page generating classes are found in
<a href="books_pages.html">Listing #4</a>.<p>


The main line of the entire program is shown in
<a href="book_page_gen.html">Listing #5</a>. By now the code there
should be self-explanatory.

<h2>Summary</h2>

As you can see this particular program was not hard to write. Python is
well suited for these types of tasks: you can quickly put together
a useful program with minimal fuss. <p>

After I got the program to work I realized that its design
is not the best. For example, the HTML generating code could be more
general, and perhaps the <tt>Book</tt> class should generate its own
HTML table entries. For now the program fits my purposes, but
I will modify it if I need to create other HTML generating applications.<p>

If you would like to see the results of this script visit my
<a href="http://www.netlabs.net/hp/richieb/books.html">book page.</a><p>

To learn more about Python you should start with the <a
href="http://www.python.org">Python Home Page</a> which will point you
to many Python resources on the net. I also found the O'Reilly book
<i>Programming Python</i> by Mark Lutz extremely helpful.
<p>
Finally, any mistakes in the description of Python features are
my own fault, as I'm still a Python novice.
<p>


<!--===================================================================-->
<P> <hr> <P>
<center><H5>Copyright &copy; 1997, Richie Bielak<BR>
Published in Issue 19 of the Linux Gazette, July 1997</H5></center>

<!--===================================================================-->
<P> <hr> <P>
<A HREF="./index.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif"
ALT="[ TABLE OF CONTENTS ]"></A>
<A HREF="../index.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
ALT="[ FRONT PAGE ]"></A>
<A HREF="./trade.html"><IMG SRC="../gx/back2.gif"
ALT=" Back "></A>
<A HREF="./micro.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
<P> <hr> <P>
<!--startcut ==========================================================-->
</BODY>
</HTML>
<!--endcut ============================================================-->