<!--startcut ==========================================================-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<title>Using Python to Generate HTML Pages Issue 19</title>
</HEAD>
<BODY BGCOLOR="#EEE1CC" TEXT="#000000" LINK="#0000FF" VLINK="#0020F0"
ALINK="#FF0000">
<!--endcut ============================================================-->

<H4>
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>

<P> <HR> <P>
<!--===================================================================-->

<center>
<H2>Using Python to Generate HTML Pages</H2>
<H4>By Richie Bielak,
<a href="mailto:richieb@netlabs.net">richieb@netlabs.net</a></H4>
</center>
<P><HR>

<h2>Introduction</h2>

<p>I have waited for a long time to set up my own Web site, mostly
because I didn't know what to put there that others might want to
see. Then I got an idea. Since I'm an avid reader and an aviation
enthusiast, I decided to create pages with a list of aviation books I
have read. My initial intention was to write reviews for each book.
<p>

Setting up the pages was easy to start with, but as I added more books
the maintenance became tedious. I had to update a couple of indices with
the same data and I had to sort them by hand, and alphabetizing was
never my strong suit. I needed to find a better way.
<p>

Around the same time I became interested in the programming language
Python and it seemed that Python would be a good tool to automatically
generate the various HTML pages from a simple text file. This would
greatly simplify the updates of my book pages, as I would only add one
entry to one file and then create complete pages by running a Python
script.
<p>

I was attracted to Python for two main reasons: it's very good at
processing strings and it's object oriented. Of course the fact that the
Python interpreter is free and that it runs on many different systems
helped. At first I installed Python on my Win95 machine, but I just
couldn't force myself to do any programming in the Windows
environment, even in Python. Instead I installed Linux and moved all
my Web projects there.
<p>

<h2>The Problem</h2>

The main goal of the program is to generate three different book
indices, by author, by title and by subject, from a single input
file. I started by defining the format of this file. Here is what a
typical entry describing one book looks like:
<pre>
title: Zero Three Bravo
author: Gosnell, Mariana
subject: General Aviation
url: 3zb.htm
# this is a comment
</pre>
Each line starts with a keyword (e.g. "title:" or "author:") and is
followed by a value that will be shown in the final HTML
page. The description of each book must start with the "title:" line, there
must be at least one "author:" tag, and the "url:" entry points to a
review of the book, if there is one.
<p>

Since Python is object-oriented we begin program design by
looking for "objects". In a nutshell, object oriented (OO) programming
is a way to structure your code around the things, that is "objects",
that the program is working with. This rather simple idea of
organizing software around what it works with (objects), rather than
what it does (functions), turns out to be surprisingly powerful.
<p>

Within an OO program similar objects are grouped into "classes" and the
code we write describes each class. Objects that belong to a given
class are called "instances of the class".
<p>

I hope it is pretty obvious to you that since the program will
manipulate "book" objects, we need a Python class that will represent
a single book. Just knowing this is enough to let us suspend design
and write some code.
<p>

<h2>The Book Class</h2>

Before we start looking at the code we need to consider briefly how
Python programs are organized. Each program consists of a number of
modules. Each module is contained in a file (usually named with the
extension ".py") and the name of the file (without the ".py") serves
as the module name. A module can contain any number of routines or
classes. Typically things that are related are kept in one module. For
example, there is a <tt>string</tt> module that contains functions that
operate on strings. To access functions or classes from another module
we use the <tt>import</tt> statement. For example the first line of
the <tt>Book</tt> module is:
<pre>
from string import split, strip
</pre>
which says that the routines <tt>split</tt> and <tt>strip</tt> are
obtained from the <tt>string</tt> module.<p>

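<p>A note for readers on a current interpreter: the import above matches
the Python 1.4 used in this article. Assuming Python 3, where the
<tt>string</tt> module no longer provides <tt>split</tt> and <tt>strip</tt>
as functions, the same operations are written as methods of the string
itself. A minimal sketch (the variable names are only for illustration):

```python
# Python 3 sketch: split and strip are methods of str objects,
# so nothing needs to be imported from the string module.
entry = "Gosnell, Mariana"

names = entry.split(",")          # ['Gosnell', ' Mariana']
last_name = names[0].strip()      # 'Gosnell'
first_name = names[1].strip()     # 'Mariana'

print(last_name, first_name)
```
<p>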
Next, I have to point out a few syntactic features of Python that are
not immediately obvious from the code. The most important is the fact that
in Python indentation is part of the syntax. To see which statements
will be executed following an "if", all you need to look at is the
indentation - there is no need for curly braces, <tt>BEGIN/END</tt>
pairs or "fi" statements.<p>

Here is a typical "if" statement extracted from the <tt>set_author</tt>
routine in the <tt>Book</tt> class:
<pre>
if new_author:
    names = split (new_author, ",")
    self.last_name.append (strip (names[0]))
    self.first_name.append (strip (names[1]))
else:
    self.last_name = []
    self.first_name = []
</pre>
The three statements following the "if" are executed if the "new_author"
variable contains a non-empty value. The amount of indentation is not
important, but it must be consistent within a block. Also note the colon (":") which
is used to terminate the header of each compound statement.<p>

The <tt>Book</tt> class turns out to be very simple. It consists
of routines that set the values for author, title, subject and the URL
for each book. For example, here is the <tt>set_title</tt> routine:
<pre>
def set_title (self, new_title):
    self.title = new_title
</pre>
The first argument to the "set_title" method (that is, a routine which
belongs to a class) is "self". This argument always refers to the
instance to which the method is applied. Furthermore, the attributes
(i.e. the data contained in each object) must be qualified with "self"
when referenced within the body of a method. In the example above the
attribute "title" of a "Book" object is set to the value of "new_title".
<p>
If in another part of a program we have a variable "b" that references an
instance of the "Book" class, this call would set the book's title:
<pre>
b.set_title ("Fate is the Hunter")
</pre>
Note that the "self" argument is <i>not</i> present in the call;
instead the object to which the method is applied (i.e. the object
before the ".", "b" above) becomes the "self" argument.
<p>

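<p>The binding of "self" can be made visible by writing the same call
two ways. The following is a small sketch in Python 3 syntax; the
<tt>Book</tt> class here is a cut-down stand-in containing only the one
method discussed above, not the class from the listing:

```python
# Stand-in class with only the set_title method from the text.
class Book:
    def set_title(self, new_title):
        self.title = new_title

b = Book()
b.set_title("Fate is the Hunter")      # "b" becomes "self"
first = b.title

# The same kind of call spelled out: the instance is passed explicitly.
Book.set_title(b, "Zero Three Bravo")
second = b.title

print(first)
print(second)
```

Both forms run the same method; the dotted form is simply the usual
way to write it.
<p>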
At this point a reasonable question to ask is "Where do the objects
come from?" Each object is created by a special call that uses the
class name as the name of a function. In addition, a class can define a
method with the name <tt>__init__</tt> which will automatically be
called to initialize the new object's attributes (in C++ such a
routine is called a constructor).
<p>
Here is the <tt>__init__</tt> routine for the <tt>Book</tt> class:
<pre>
def __init__ (self, t="", a="", s="", u=""):
    #
    # Create an instance of Book
    #
    self.title = t
    self.last_name = []
    self.first_name = []
    self.set_author (a)
    self.subject = s
    self.url = u
</pre>
The main purpose of the above routine is to create all the attributes
of the new "Book" object. Note that the arguments to "__init__" are
specified with default values, so that the caller needs only to pass the
arguments that differ from the default.
<p>

Here are some examples of calls to create "Book" objects:
<pre>
a = Book()
b = Book ("Fate is the Hunter")
c = Book ("Some book", "First, Author")
</pre>
<p>
There is one small complication in the "Book" class. It is possible
for a book to have more than one author. That's why the attributes
"first_name" and "last_name" are actually lists. We'll look more at
lists in the next section. <p>

The complete <tt>Book</tt> class is shown in <a href=book.html>
Listing #1</a>. To test the class we add a little piece of code at the end
of the file that checks whether the module is running as the <tt>__main__</tt> routine,
that is, whether execution started in this file. If so, the code to test the <tt>Book</tt>
class will run.

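<p>Putting the fragments from this section together, a Python 3 sketch
of the class might look like the following. It is reconstructed from the
excerpts above rather than copied from Listing #1, it leaves out the
other set routines, and the second author in the test code is invented
for the example:

```python
class Book:
    def __init__(self, t="", a="", s="", u=""):
        self.title = t
        self.last_name = []
        self.first_name = []
        self.set_author(a)
        self.subject = s
        self.url = u

    def set_author(self, new_author):
        # An empty string counts as false, so Book() starts with no authors.
        if new_author:
            names = new_author.split(",")
            self.last_name.append(names[0].strip())
            self.first_name.append(names[1].strip())
        else:
            self.last_name = []
            self.first_name = []


if __name__ == "__main__":
    # Runs only when this file is executed directly, not when imported.
    b = Book("Zero Three Bravo", "Gosnell, Mariana")
    b.set_author("Second, Author")      # a made-up second author
    print(b.title, b.last_name, b.first_name)
```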
<h2>The Book_List Class</h2>

Once the <tt>Book</tt> class is tested we can go back to designing. The next
obvious object is a list which will contain all the "book"
objects. For the purposes of our program we have to be able to create
the book list from the input file and we have to sort the books in the
list by author, title or subject. The sorted list will then be used as
input into the code that actually generates HTML pages. <p>

As it turns out one of Python's built-in data structures is a list. Here is
a snippet of code showing creation of a list and addition of some items
(this example was produced by running Python interactively):
<pre>
Python 1.4 (Dec 18 1996) [GCC 2.7.2.1]
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> s = []
>>> s.append ("a")
>>> s.append ("hello")
>>> s.append (1)
>>> print s
['a', 'hello', 1]
</pre>
Above we create a list called "s" and add three items to it. Lists
allow "slicing" operations, which let you pull out pieces of a list by
specifying element numbers. These examples illustrate the idea:
<pre>
>>> print s[1]
hello
>>> print s[1:]
['hello', 1]
>>> print s[:2]
['a', 'hello']
>>> print s[0]
a
</pre>
<tt>s[1]</tt> denotes the second element of the list (indexing starts
at zero), <tt>s[1:]</tt> is the slice from the second element to the
end of the list, <tt>s[:2]</tt> goes from the start up to, but not
including, the third element, and <tt>s[0]</tt> is the first item.
<p>

Finally, lists have a "sort" method which sorts the elements according to
a user-supplied comparison function.
<p>
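<p>In the Python 1.4 used here, "sort" takes the comparison function
directly. Assuming a current Python 3 interpreter instead, the built-in
<tt>cmp</tt> is gone and sorting accepts a <tt>key</tt> function; an
old-style comparison can still be used by wrapping it with
<tt>functools.cmp_to_key</tt>. A small sketch, in which the comparison
function <tt>by_length</tt> and the word list are invented for the
example:

```python
from functools import cmp_to_key

words = ["hello", "a", "zero", "bravo"]

# A user-supplied comparison: returns a negative, zero or positive
# number, like the old built-in cmp. Shorter words sort first.
def by_length(x, y):
    return len(x) - len(y)

by_cmp = sorted(words, key=cmp_to_key(by_length))

# The more common Python 3 spelling uses a key function instead.
by_key = sorted(words, key=len)

print(by_cmp)
print(by_key)
```
<p>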
Armed with the knowledge of Python lists, writing the <tt>Book_List</tt> class
is easy. The class will have a single attribute, "contents", which will be a
list of books.
<p>
The constructor for the <tt>Book_List</tt> class simply creates a
"contents" attribute and initializes it to be an empty list. The
routine that parses the input file and creates list elements is called
"make_from_file" and it begins with the code:
<pre>
def make_from_file (self, file):
    #
    # Read the file and create a book list
    #
    lines = file.readlines ()
    self.contents = []
</pre>
The "file" argument is a handle to an open text file that contains the
descriptions of the books. The first step this routine performs is to
read the entire file into a list of strings, each string representing
one line of text. Next, using Python's "for" loop we step through this
list and examine each line of text:
<pre>
    #
    # Parse each line and create a list of Book objects
    #
    for one_line in lines:
        # It's not a comment or empty line
        if (len(one_line) > 0) and (one_line[0] != "#"):
            # Split into tokens
            tokens = string.split (one_line)
</pre>
If the line is not empty and is not a comment (that is, the first
character is not a "#") then we split the line into words, a word
being a sequence of characters without spaces. The call "tokens =
string.split (one_line)" uses the "split" routine from the "string"
module. "split" returns the words it found in a list.
<pre>
            if len (tokens) > 0:
                if (tokens[0] == "title:"):
                    current_book = book.Book (string.join (tokens[1:]))
                    self.contents.append (current_book)
                elif (tokens[0] == "author:"):
                    current_book.set_author (string.join (tokens[1:]))
                elif (tokens[0] == "subject:"):
                    current_book.set_subject (string.join (tokens[1:]))
                elif (tokens[0] == "url:"):
                    current_book.set_url (string.join (tokens[1:]))
</pre>
The first token (i.e. word) on the line is the keyword that tells us
what to do. If it is "title:" then we create a new <tt>Book</tt>
object and append it to the list of books, otherwise we just set the
proper attributes. Note that the remaining tokens found on each line
are joined together into a string (using the "string.join" routine). There
is probably a more efficient way to code this, but for my purposes
this code works fast enough.
<p>
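<p>For readers on a current interpreter, the parsing loop above can be
sketched in Python 3 as follows. This is a simplified stand-in, not the
code from the listing: it stores each book as a plain dictionary rather
than a <tt>Book</tt> object, the function name <tt>parse_books</tt> is
made up for the example, and it reads from an in-memory
<tt>io.StringIO</tt> instead of a real file.

```python
import io

def parse_books(file):
    """Return a list of dicts, one per "title:" entry in the open file."""
    books = []
    current = None
    for one_line in file:
        tokens = one_line.split()
        # Skip blank lines and comments (here: first word starts with "#")
        if not tokens or tokens[0].startswith("#"):
            continue
        keyword = tokens[0]
        value = " ".join(tokens[1:])
        if keyword == "title:":
            # A "title:" line starts a new book
            current = {"title": value, "authors": [], "subject": "", "url": ""}
            books.append(current)
        elif current is None:
            continue                    # ignore fields before any title
        elif keyword == "author:":
            current["authors"].append(value)
        elif keyword == "subject:":
            current["subject"] = value
        elif keyword == "url:":
            current["url"] = value
    return books

sample = io.StringIO(
    "title: Zero Three Bravo\n"
    "author: Gosnell, Mariana\n"
    "subject: General Aviation\n"
    "url: 3zb.htm\n"
    "# this is a comment\n"
)
books = parse_books(sample)
print(books)
```
<p>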
The other interesting parts of the <tt>Book_List</tt> class are the sort
routines. Here is how the list is sorted by title:
<pre>
def sort_by_title (self):
    #
    # Sort book list by title
    #
    self.contents.sort (lambda x, y: cmp (x.title, y.title))
</pre>
We simply call the "sort" routine on the list. To get proper ordering we
need to supply a function that compares two <tt>Book</tt> objects. For
sorting by title we supply an anonymous function, which is
introduced with the keyword "lambda" (those of you familiar with Lisp,
or other functional languages should recognize this construct). The definition:
<pre>
lambda x, y: cmp (x.title, y.title)
</pre>
simply says that this is a function of two arguments and that the function result comes
from calling the Python built-in function "cmp" (i.e. compare) on the "title"
attributes of the two objects.<p>

The other sort routines are similar, except that in "sort_by_author" I
used a local function instead of a "lambda", because the comparison
was a little more complicated - I wanted to have all the books with the
same author appear alphabetically by title.

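<p>The actual "sort_by_author" routine is in the listing. As a rough
illustration of the same ordering in Python 3, a key function that
returns a tuple gives "author first, then title" without a hand-written
comparison. The sketch uses plain (last name, title) tuples in place of
<tt>Book</tt> objects, and the entries are placeholders:

```python
# Each tuple is (last_name, title); the entries are placeholders.
books = [
    ("Smith", "Wings"),
    ("Jones", "Clouds"),
    ("Smith", "Airfields"),
]

# Tuples compare element by element, so this key orders by author
# first and, for equal authors, by title.
books.sort(key=lambda b: (b[0], b[1]))

for last_name, title in books:
    print(last_name + ": " + title)
```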
<h2>Generating Pages</h2>

Now that we have constructed a list of books, the next step is to create
the HTML pages. We begin by creating a class, called <tt>Html_Page</tt>, that
generates the basic outline of a page and then we extend that class to create
the titles, authors and subjects pages.<p>

The idea that existing code can be extended yet not changed is the
second most important idea of OO programming. The mechanism for doing
this is called "inheritance" and it allows the programmer to create a
new class by adding new properties to an old class, while the old class
does not have to change. A way to think about inheritance is as
"programming by differences". In our program we will create three
classes that inherit from <tt>Html_Page</tt>.<p>

<tt>Html_Page</tt> is quite simple. It consists of routines that
generate the header and the trailer tags for an HTML page. It also
contains an empty routine for generating the body of the page. This
routine will be defined in descendant classes. The <tt>__init__</tt>
routine lets the user of this class specify a title and a top level
heading for the page.<p>

When I first tested the output of the HTML generators I simply printed
it to the screen and manually saved it into a file, so I could see the
page in a browser. But once I was happy with the appearance, I had to
change the code to save the data into a file. That's why in <tt>Html_Page</tt>
you will see code like this:
<pre>
self.f.write ("&lt;html&gt;\n")
self.f.write ("&lt;head&gt;\n")
</pre>
for writing the output to a file referenced by the attribute "f". <p>
However, since the actual output file will be different for each page,
opening of the file is deferred to a descendant class. <p>

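<p>To make the idea concrete, here is a rough Python 3 sketch of such a
base class and one descendant. It is not the code from the listing: the
method names <tt>generate_header</tt> and <tt>generate</tt>, the exact
tags written, and the <tt>Hello_Page</tt> class are assumptions made for
illustration, and the output goes to an in-memory <tt>io.StringIO</tt>
rather than a file.

```python
import io

class Html_Page:
    def __init__(self, title, heading):
        self.title = title
        self.heading = heading
        self.f = None               # descendants decide where output goes

    def generate_header(self):
        self.f.write("<html>\n<head>\n")
        self.f.write("<title>" + self.title + "</title>\n")
        self.f.write("</head>\n<body>\n")
        self.f.write("<h1>" + self.heading + "</h1>\n")

    def generate_body(self):
        pass                        # empty here; redefined by descendants

    def generate_trailer(self):
        self.f.write("</body>\n</html>\n")

    def generate(self):
        self.generate_header()
        self.generate_body()
        self.generate_trailer()


class Hello_Page(Html_Page):
    # "Programming by differences": only the output target and the
    # body are new, the rest is inherited unchanged.
    def __init__(self):
        Html_Page.__init__(self, "Hello", "Hello page")
        self.f = io.StringIO()

    def generate_body(self):
        self.f.write("<p>Hello from a descendant class</p>\n")


page = Hello_Page()
page.generate()
html = page.f.getvalue()
print(html)
```
<p>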
You can see the complete code for <tt>Html_Page</tt> in
<a href="html_page.html">Listing #3</a>.
<p>

The three classes <tt>Authors_Page</tt>, <tt>Titles_Page</tt> and
<tt>Subjects_Page</tt> are used to create the final HTML pages. Since these
classes belong together I put them in one module, called <tt>books_pages</tt>.
Because the code for these classes is very similar we will only look at
the first one.<p>

Here is how <tt>Authors_Page</tt> begins:
<pre>
class Authors_Page (Html_Page):

    def __init__ (self):
        Html_Page.__init__ (self, "Aviation Books: by Author",
                            "&lt;i&gt;Aviation Books: indexed by Author&lt;/i&gt;")
        self.f = open ("books_by_author.html", "w")
        print "Authors page in--> " + self.f.name
</pre>
To start with, note that the class heading lists the name of the class from
which <tt>Authors_Page</tt> inherits, namely <tt>Html_Page</tt>. Next
notice that the constructor invokes the constructor from the parent
class, by calling the <tt>__init__</tt> routine qualified by the class
name. Finally, the constructor names and opens the output file. For my
own convenience, and to keep things simple, I decided not to make the
file name a parameter. <p>

Since the book list is needed to generate the body of each page I added
a <tt>book_list</tt> attribute to each page class. This attribute is set
before HTML generation starts. <p>

The <tt>generate_body</tt> routine redefines the empty routine from
the parent class. Although fairly long, the code is pretty easy to
understand once you know that the book list is represented as an HTML
table and the "+" is the concatenation operator for strings. <p>

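<p>The real <tt>generate_body</tt> is in the listing. As a hedged
illustration of the technique only, the following Python 3 sketch writes
a list of books as an HTML table using "+" to build each row. The
function name <tt>write_book_table</tt>, the column layout and the use
of (author, title) pairs are assumptions for the example:

```python
import io

def write_book_table(f, book_list):
    # book_list is assumed to be a list of (author, title) pairs
    f.write("<table border>\n")
    f.write("<tr><th>Author</th><th>Title</th></tr>\n")
    for author, title in book_list:
        # "+" glues the fixed tags and the book data into one string
        f.write("<tr><td>" + author + "</td><td>" + title + "</td></tr>\n")
    f.write("</table>\n")

out = io.StringIO()
write_book_table(out, [("Gosnell, Mariana", "Zero Three Bravo")])
table = out.getvalue()
print(table)
```
<p>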
In addition to replacing the <tt>generate_body</tt> routine we also redefine
the <tt>generate_trailer</tt> routine in order to put a back link to the book index
at the bottom of each page:
<pre>
def generate_trailer (self):
    self.f.write ("&lt;hr&gt;\n")
    self.f.write ("&lt;center&gt;&lt;a href=books.html&gt;Back to Aviation Books Top Page&lt;/a&gt;&lt;/center&gt;\n")
    self.f.write ("&lt;hr&gt;\n")
    Html_Page.generate_trailer (self)
</pre>
Notice how right after we generate the back link, we include a call to
the parent's <tt>generate_trailer</tt> routine to finish off the page with
the correct terminating tags.<p>

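<p>On a current Python 3 interpreter the call to the parent's routine is
often written with <tt>super()</tt> instead of naming the parent class.
A minimal sketch of the same "extend, then delegate" pattern, with both
classes cut down to the one routine and the output kept in memory (these
simplifications are assumptions for the example, not the code from the
listings):

```python
import io

class Html_Page:
    def __init__(self):
        self.f = io.StringIO()

    def generate_trailer(self):
        self.f.write("</body>\n</html>\n")


class Authors_Page(Html_Page):
    def generate_trailer(self):
        # Add the back link first, then let the parent close the page.
        self.f.write("<hr>\n")
        self.f.write("<a href=books.html>Back to Aviation Books Top Page</a>\n")
        super().generate_trailer()


page = Authors_Page()
page.generate_trailer()
trailer = page.f.getvalue()
print(trailer)
```
<p>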
The complete listings for the three page generating classes are found in
<a href="books_pages.html">Listing #4</a>.<p>

The main line of the entire program is shown in
<a href="book_page_gen.html">Listing #5</a>. By now the code there
should be self-explanatory.

<h2>Summary</h2>

As you can see this particular program was not hard to write. Python is
well suited for these types of tasks: you can quickly put together
a useful program with minimal fuss. <p>

After I got the program to work I realized that its design
is not the best. For example, the HTML generating code could be more
general, and perhaps the <tt>Book</tt> class should generate its own
HTML table entries. For now the program fits my purposes, but
I will modify it if I need to create other HTML generating applications.<p>

If you would like to see the results of this script visit my
<a href="http://www.netlabs.net/hp/richieb/books.html">book page</a>.<p>

To learn more about Python you should start with the <a
href="http://www.python.org">Python Home Page</a> which will point you
to many Python resources on the net. I also found the O'Reilly book
<i>Programming Python</i> by Mark Lutz extremely helpful.
<p>
Finally, any mistakes in the description of Python features are
my own fault, as I'm still a Python novice.
<p>

<!--===================================================================-->
<P> <hr> <P>
<center><H5>Copyright © 1997, Richie Bielak<BR>
Published in Issue 19 of the Linux Gazette, July 1997</H5></center>

<!--===================================================================-->
<P> <hr> <P>
<A HREF="./index.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif"
ALT="[ TABLE OF CONTENTS ]"></A>
<A HREF="../index.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
ALT="[ FRONT PAGE ]"></A>
<A HREF="./trade.html"><IMG SRC="../gx/back2.gif"
ALT=" Back "></A>
<A HREF="./micro.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
<P> <hr> <P>
<!--startcut ==========================================================-->
</BODY>
</HTML>
<!--endcut ============================================================-->