704 lines
28 KiB
HTML
704 lines
28 KiB
HTML
<!--startcut ==============================================-->
|
||
<!-- *** BEGIN HTML header *** -->
|
||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
|
||
<HTML><HEAD>
|
||
<title>XML parsing in AOLserver LG #63</title>
|
||
</HEAD>
|
||
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
|
||
ALINK="#FF0000">
|
||
<!-- *** END HTML header *** -->
|
||
|
||
<CENTER>
|
||
<A HREF="http://www.linuxgazette.com/">
|
||
<H1><IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.png"
|
||
WIDTH="600" HEIGHT="124" border="0"></H1></A>
|
||
|
||
<!-- *** BEGIN navbar *** -->
|
||
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="sharma.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue63/washington.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="lg_backpage63.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
|
||
<!-- *** END navbar *** -->
|
||
<P>
|
||
</CENTER>
|
||
|
||
<!--endcut ============================================================-->
|
||
|
||
<H4 ALIGN="center">
|
||
"Linux Gazette...<I>making Linux just a little more fun!</I>"
|
||
</H4>
|
||
|
||
<P> <HR> <P>
|
||
<!--===================================================================-->
|
||
|
||
<center>
|
||
<H1><font color="maroon">XML parsing in AOLserver</font></H1>
|
||
<H4>By <a href="mailto:irvingw@pobox.com">Irving Washington</a></H4>
|
||
</center>
|
||
<P> <HR> <P>
|
||
|
||
<!-- END header -->
|
||
|
||
|
||
|
||
|
||
<h3>AOLserver</h3>
|
||
|
||
<a href="http://www.aolserver.com">AOLserver</a> is an open-source,
|
||
multi-threaded, high-performance web server. AOLserver is less known
|
||
than Apache but it has a few features that put it ahead of
|
||
Apache: rich and well-thought extension API, superior database
|
||
connectivity API, embedded and tightly integrated Tcl interpreter.
|
||
Read my <a href="../issue58/washington.html">previous LG article </a> to
|
||
learn more about AOLserver.
|
||
|
||
<h3>XML</h3>
|
||
|
||
If you're going to do serious work with XML you'll have to learn about
|
||
it and you'll have to do it somewhere else. The best summary of XML
|
||
I've seen is: XML is an (inefficient) way to to represent data in
|
||
tree form as text (ASCII) files. Text is good because it's simple.
|
||
Tree is good because a lot can be represented as trees
|
||
(e.g., a non-circular list is just a degenerated tree and a circular
|
||
list can be described with multiple trees). Inefficient is bad but it
|
||
usually makes an engineering sense to trade inefficiency for
|
||
extensibility and wide adoption that XML enjoys (lots of tools,
|
||
lots of information).
|
||
|
||
<h3>XML support in AOLserver</h3>
|
||
|
||
XML processing (parsing and modification of XML documents) in
|
||
AOLserver is possible thanks to an <b>ns_xml</b> module written
|
||
by <a href="http://www.arsdigita.com">ArsDigita</a>. This module is a
|
||
wrapper around version 2.x (>2.2.5) of <a
|
||
href="http://www.xmlsoft.org/">libxml</a> library and adds
|
||
<code>ns_xml</code> command to the embedded Tcl interpreter.
|
||
You can <a
|
||
href="http://www.aolserver.com/download/index.adp?dir=%2fmodules%2fnsxml">
|
||
download the source</a> or get it directly from the CVS repository doing:
|
||
<pre>
|
||
cvs -d:pserver:anonymous@cvs.aolserver.sourceforge.net:/cvsroot/aolserver login
|
||
cvs -z3 -d:pserver:anonymous@cvs.aolserver.sourceforge.net:/cvsroot/aolserver co nsxml
|
||
</pre>
|
||
You need to press <i>Enter</i> after first command since CVS is
|
||
waiting for a password (which is empty).
|
||
<p>
|
||
As of Dec. 2000 Linux distributions usually come with
|
||
version 1.x of libxml library so chances are that you'll need to
|
||
install 2.x by yourself (this will change in the future since
|
||
everyone is migrating to 2.x). To install <code>nsxml</code> module go
|
||
into <tt>nsxml</tt> directory, optionally edit a path in
|
||
<code>Makefile</code> to point into AOLserver source directory. Then
|
||
run <code>make</code>. You should get <code>nsxml.so</code> module
|
||
that should be placed in AOLserver bin directory (the same that has
|
||
main <code>nsd</code> executable). Add the following to your
|
||
<code>nsd.tcl</code> config file:
|
||
<pre>
|
||
ns_section "ns/server/${servername}/modules"
|
||
ns_param nsxml ${bindir}/ns_xml.so
|
||
</pre>
|
||
and restart AOLserver. You can verify that the module gets loaded by
|
||
watching server.log, I usually use a shell window with:
|
||
<pre>
|
||
tail -f $AOLSERVERDIR/log/server.log
|
||
</pre>
|
||
This is also a great way to debug Tcl scripts since AOLserver will
|
||
dump detailed debug information every time there is an error in the
|
||
script.
|
||
|
||
<h3>XML Quick reference</h3>
|
||
|
||
Here's a quick reference of all commands available through ns_xml.
|
||
|
||
<p>
|
||
|
||
<table bgcolor=#ffffff cellspacing=1>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set doc_id [<font color=gray>ns_xml parse</font> <font color=red>?-persist? $string</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Parse the XML document in a <font color=red>$string</font> and return document id
|
||
(handle to in-memory parsed tree). If you don't provide
|
||
<font color=red>?-persist?</font> flag the memory will be automatically freed when the
|
||
script exits. Otherwise you'll have to free the memory by calling
|
||
<font color=gray>ns_xml doc free</font>. You need to use <font color=red>-persist</font> flag if you want
|
||
to share parsed XML docs between scripts.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set doc_stats [<font color=gray>ns_xml doc stats</font> <font color=red>$doc_id</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Return document's statistics.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
<font color=gray>ns_xml doc free</font> <font color=red>$doc_id</font>
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Free a document. Should only be called on a document if
|
||
<font color=red>?-persistent?</font> flag has been passed to either
|
||
<font color=gray>ns_xml parse</font> or <font color=gray>ns_xml doc create</font>
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set node_id [<font color=gray>ns_xml doc root</font> <font color=red>$doc_id</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Return the node id of the document root (you start traversal of the
|
||
document tree from here.)
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set children_list [<font color=gray>ns_xml node children</font> <font color=red>$node_id</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Return a list of children nodes of a given node.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set node_name [<font color=gray>ns_xml node name</font> <font color=red>$node_id</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Return the name of a node.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set node_type [<font color=gray>ns_xml node type</font> <font color=red>$node_id</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Return the type of a node. Possible types: <i>element, attribute,
|
||
text, cdata_section, entity_ref, entity, pi, comment, document,
|
||
document_type, document_frag, notation, html_document</i>
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set content [<font color=gray>ns_xml node getcontent</font> <font color=red>$node_id</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Get a content (text) of a given node.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set attr [<font color=gray>ns_xml node getattr</font> <font color=red>$node_id $attr_name</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Return the value of an attribute of a given node.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set doc_id [<font color=gray>ns_xml doc create</font> <font color=red>?-persist? $doc-version</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Create a new document in memory. If <font color=red>-persist</font> flag is given you'll
|
||
have to explicitely free the memory taken by the document with
|
||
<font color=gray>ns_xml doc free</font>, otherwise it'll be freed automatically after
|
||
execution of the script. <font color=red>$doc_version</font> is a version of an XML
|
||
doc, if not specified it'll be "1.0".
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set xml_string [<font color=gray>ns_xml doc render</font> <font color=red>$doc_id</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Generate XML from the in-memory representation of the document.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set node_id [<font color=gray>ns_xml doc new_root</font> <font color=red>$doc_id $node_name $node_content</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Create a root node for a document.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set node_id [<font color=gray>ns_xml node new_sibling</font> <font color=red>$node_id $name $content</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Create a new sibling of a given node.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
set node_id [<font color=gray>ns_xml node new_child</font> <font color=red>$node_id $name $content</font>]
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Create a child of a given node.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
<font color=gray>ns_xml node setcontent</font> <font color=red>$node_id $content</font>
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Set a content for a given node.
|
||
|
||
</td></tr>
|
||
<tr> <td bgcolor=wheat> <b><code>
|
||
<font color=gray>ns_xml node setattr</font> <font color=red>$node_id $attr_name $value</font>
|
||
</code> </b> </td> </tr>
|
||
<tr> <td>
|
||
Set the value of an attribute in a given node.
|
||
</td></tr>
|
||
</table>
|
||
|
||
<h3>A simple example</h3>
|
||
|
||
An educational and simple thing to do is to parse a document and print
|
||
out its tree structure. Stripped to bare bones the process is:
|
||
<ul>
|
||
<li> use <font color=gray> <code>ns_xml parse $xml_doc</code></font>
|
||
to parse XML document in string <font color=gray>$xml_doc</font> and get
|
||
its document id
|
||
<li> use <font color=gray> <code>ns_xml doc root $doc_id</code>
|
||
</font> to get the id of a root node
|
||
<li> use <font color=gray> <code>ns_xml node children
|
||
$node_id</code> </font> to traverse document tree and <font
|
||
color=gray> <code>ns_xml node ...</code> </font>commands to get
|
||
node content and attributes
|
||
</ul>
|
||
|
||
If you provide <font color=gray> <code>-persist</code> </font> flag to
|
||
<font color=gray><code>ns_xml parse</code> </font>
|
||
you'll have to explicitly call <font color=gray> <code>ns_xml doc
|
||
free $doc_id </code> </font> to free memory associated with this
|
||
document, otherwise it will get automatically freed after execution of
|
||
a script.
|
||
<p>
|
||
In code it could look like this:
|
||
|
||
<pre>
|
||
proc dump_node {node_id level} {
|
||
set name [ns_xml node name $node_id]
|
||
set type [ns_xml node type $node_id]
|
||
set content [ns_xml node getcontent $node_id]
|
||
ns_write "<li>"
|
||
ns_write "node id=$node_id name=$name type=$type"
|
||
if { [string compare $type "attribute"] != 0 } {
|
||
ns_write " content=$content\n"
|
||
}
|
||
}
|
||
|
||
proc dump_tree_rec {children} {
|
||
ns_write "<ul>\n"
|
||
foreach child_id $children {
|
||
dump_node $child_id
|
||
set new_children [ns_xml node children $child_id]
|
||
if { [llength $new_children] > 0 } {
|
||
dump_tree_rec $new_children
|
||
}
|
||
}
|
||
}
|
||
|
||
proc dump_tree {node_id} {
|
||
dump_tree_rec [list $node_id] 0
|
||
}
|
||
|
||
proc dump_doc {doc_id} {
|
||
ns_write "doc id=$doc_id<br>\n"
|
||
set root_id [ns_xml doc root $doc_id]
|
||
dump_tree $root_id
|
||
}
|
||
|
||
set xml_doc "<test version="1.0">this is a
|
||
<blind>test</blind> of xml</test>"
|
||
set doc_id [ns_xml parse $xml_doc]
|
||
dump_doc $doc_id
|
||
</pre>
|
||
|
||
<font color=gray> <code>ns_xml parse</code> </font> command will throw
|
||
an error if XML document is not valid (e.g., not well formed) so in
|
||
production code we should catch it and display a meaningful error
|
||
message, e.g.:
|
||
|
||
<pre>
|
||
if { [catch {set doc_id [ns_xml parse $xml_doc]} err] } {
|
||
ns_write "There was an error parsing the following XML document: "
|
||
ns_write [ns_quotehtml $xml_doc]
|
||
ns_write "Error message is:"
|
||
ns_write [ns_quotehtml $err]
|
||
ns_write "</body></html>\n"
|
||
return
|
||
}
|
||
</pre>
|
||
|
||
Code like this takes more time to write but some day it may save a lot of
|
||
debugging time (and a day like this always comes).
|
||
|
||
<p>
|
||
<a href="http://www.fifthgate.org/articles/aolserver/xml/test_xml.tcl">See how the code works</a> in practice
|
||
[external site running AOLserver]
|
||
and <a href="misc/washington/test_xml.tcl.txt">get the full
|
||
source</a> [included in <I>Linux Gazette</I>]. It's a bit more complex than the
|
||
above snippet. You can see the structure of an arbitrary XML document by typing
|
||
it in the provided text area. The script also shows how to parse form data and
|
||
has more robust error handling.
|
||
|
||
<h3> Real life example</h3>
|
||
|
||
XML is better than other similar formats because it is a standard, it
|
||
has gained wide acceptance and its usage is growing rapidly.
|
||
One of the possible usages of XML is as a way of communication between
|
||
web sites (web services). The simplest scenario is that of one web server
|
||
grabbing information in XML format from another web server. A popular
|
||
example of such communication is a congregation of headlines, e.g., if
|
||
you go to <a
|
||
href="http://www.freshmeat.net">freshmeat.net</a> you'll see that they
|
||
provide current headlines from
|
||
<a href="http://www.linuxtoday.com">linuxtoday.com</a>. We'll do the
|
||
same thing (vive l'originalite!). <p>
|
||
In the past it could've been done in a rather distasteful way by
|
||
grabbing the whole HTML page and trying to extract relevant
|
||
information. It would be hard to program and fragile (a change in the
|
||
way HTML page is generated would most likely break such parsing).
|
||
<p>
|
||
Today the site that wants to provide headlines for others can
|
||
publish this data in an easily to parse XML format under some URL.
|
||
In our case the data are provided at
|
||
<a href="http://www.linuxtoday.com/backend/linuxtoday.xml">
|
||
http://www.linuxtoday.com/backend/linuxtoday.xml</a>.
|
||
<a href="misc/washington/test_xml.tcl.txt">See the format of this
|
||
file</a> (using previously developed script). <!-- ?show_linuxtoday_p=1 -->
|
||
<p>
|
||
As you can see XML document represent headlines on LinuxToday site. It
|
||
is a set of stories, each story having
|
||
title, url, author etc. We know that after parsing the XML document we
|
||
would like to have a way to easily extract the information.
|
||
Let's use a "wishful-thinking" (in other words top-down) method of
|
||
writing the code advocated in a <a
|
||
href="http://sicp.arsdigita.org">Structure and interpretation of
|
||
computer programs</a> (a truly great CS book). Let's assume that we've
|
||
converted XML representation into an object. To build an
|
||
HTML table showing the data we need the following procedures:
|
||
<ul>
|
||
<li> get total number of stories: <font color=gray><code>headlines_get_stories_count $headlines</code> </font>
|
||
<li> get n-th story: <font color=gray><code>headlines_get_story $headline $story_no</code></font>
|
||
<li> get URL of a given story: <font color=gray><code>story_get_url $story</code></font>
|
||
<li> get title of a given story: <font color=gray><code>story_get_title $story</code></font>
|
||
</ul>
|
||
For simplicity I only use URL and title but extending this to more
|
||
attributes should be trivial.
|
||
<p>
|
||
Having those procedures we can generate the simplest (but rather ugly)
|
||
table:
|
||
<pre>
|
||
proc story_to_html_table_row { story } {
|
||
set url [story_get_url $story]
|
||
set title [story_get_title $story]
|
||
return "- <a href=\"$url\"><font color=#000000>$title</font></a><br>\n"
|
||
}
|
||
|
||
# given headlines generate HTML code of the table with this data
|
||
proc headlines_to_html_table { headlines } {
|
||
set to_return "<table border=0 cellspacing=1 cellpadding=3>"
|
||
append to_return "<tr><td><small>"
|
||
|
||
set stories_count [headlines_get_stories_count $headlines]
|
||
for {set i 0} {$i < $stories_count} {incr i} {
|
||
set story [headlines_get_story $headlines $i]
|
||
append to_return [story_to_html_table_row $story]
|
||
}
|
||
|
||
append to_return "</td></tr></table>\n"
|
||
return $to_return
|
||
}
|
||
</pre>
|
||
|
||
Tcl doesn't give us much choice for representing this object; we'll
|
||
use lists.
|
||
<pre>
|
||
proc headlines_get_stories_count { headlines } {
|
||
return [llength $headlines]
|
||
}
|
||
|
||
proc headlines_get_story { headlines story_no } {
|
||
return [lindex $headlines $story_no]
|
||
}
|
||
|
||
proc story_get_url { story } {
|
||
return [lindex $story 0]
|
||
}
|
||
|
||
proc story_get_title { story } {
|
||
return [lindex $story 1]
|
||
}
|
||
</pre>
|
||
|
||
Note that if we forget about purity (just for a while) we can rewrite
|
||
the following part of <code>headlines_to_html_table</code>:
|
||
<pre>
|
||
set stories_count [headlines_get_stories_count $headlines]
|
||
for {set i 0} {$i < $stories_count} {incr i} {
|
||
set story [headlines_get_story $headlines $i]
|
||
append to_return [story_to_html_table_row $story]
|
||
}
|
||
</pre>
|
||
in a bit more terse way:
|
||
<pre>
|
||
foreach story $headlines {
|
||
append to_return [story_to_html_table_row $story]
|
||
}
|
||
</pre>
|
||
|
||
Now the most important part: converting XML doc into the
|
||
representation we've chosen.
|
||
<pre>
|
||
# does a name of the node identified by $node_id equals $name
|
||
proc is_node_name_p { node_id name } {
|
||
set node_name [ns_xml node name $node_id]
|
||
if { [string_equal_p $name $node_name] } {
|
||
return 1
|
||
} else {
|
||
return 0
|
||
}
|
||
}
|
||
|
||
# does a type of the node identified by $node_id equals $type
|
||
proc is_node_type_p { node_id type } {
|
||
set node_type [ns_xml node type $node_id]
|
||
if { [string_equal_p $type $node_type] } {
|
||
return 1
|
||
} else {
|
||
return 0
|
||
}
|
||
}
|
||
|
||
# is this an node of type "attribute"?
|
||
proc is_attribute_node_p { node_id } {
|
||
return [is_node_type_p $node_id "attribute"]
|
||
}
|
||
|
||
# raise an error if node name is different than $name
|
||
proc error_if_node_name_not {node_id name} {
|
||
if { ![is_node_name_p $node_id $name] } {
|
||
set node_name [ns_xml node name $node_id]
|
||
error "node name should be $name and not $node_name"
|
||
}
|
||
}
|
||
|
||
# raise an error if node type is different than $type
|
||
proc error_if_node_type_not {node_id type} {
|
||
if { ![is_node_type_p $node_id $type] } {
|
||
set node_type [ns_xml node type $node_id]
|
||
error "node type should be $type and not $node_type"
|
||
}
|
||
}
|
||
|
||
# given url and title construct a story object with
|
||
# those attributes
|
||
proc define_story { url title } {
|
||
return [list $url $title]
|
||
}
|
||
|
||
# convert a node of name "story" into an object
|
||
# that represents story
|
||
proc story_node_to_story {node_id} {
|
||
set url ""
|
||
set title ""
|
||
# go through all children and extract content of url and title nodes
|
||
set children [ns_xml node children $node_id]
|
||
foreach node_id $children {
|
||
# we're only interested in nodes whose name is "url" or "title"
|
||
if { [is_attribute_node_p $node_id]} {
|
||
if { [is_node_name_p $node_id "url"] || [is_node_name_p $node_id "title"]} {
|
||
set node_children [ns_xml node children $node_id]
|
||
# those should only have one children node with
|
||
# the name "text" and type "cdata_section"
|
||
if { [llength $node_children] != 1 } {
|
||
set name [ns_xml node name $node_id]
|
||
error "$name node should only have 1 child"
|
||
}
|
||
set one_node_id [lindex $node_children 0]
|
||
error_if_node_type_not $one_node_id "cdata_section"
|
||
error_if_node_name_not $one_node_id "text"
|
||
set txt [ns_xml node getcontent $one_node_id]
|
||
if { [is_node_name_p $node_id "url"] } {
|
||
set url $txt
|
||
}
|
||
if { [is_node_name_p $node_id "title"]} {
|
||
set title $txt
|
||
}
|
||
}
|
||
}
|
||
}
|
||
return [define_story $url $title]
|
||
}
|
||
|
||
# convert XML doc to headlines object
|
||
proc xml_to_headlines { doc_id } {
|
||
set headlines [list]
|
||
set root_id [ns_xml doc root $doc_id]
|
||
# root node should be named "linuxtoday" and of type "attribute"
|
||
error_if_node_name_not $root_id "linuxtoday"
|
||
error_if_node_type_not $root_id "attribute"
|
||
set children [ns_xml node children $root_id]
|
||
foreach node_id $children {
|
||
# only interested in attribute type nodes whose name is "story"
|
||
if { [is_node_name_p $node_id "story"] && [is_attribute_node_p $node_id]} {
|
||
set story [story_node_to_story $node_id]
|
||
lappend headlines $story
|
||
}
|
||
}
|
||
return $headlines
|
||
}
|
||
</pre>
|
||
|
||
The code is rather straightforward. We use the knowledge about the
|
||
structure of XML file. In this case we know that root node is named
|
||
<tt>linuxtoday</tt> and should have a child named
|
||
<tt>story</tt>. Each <tt>story</tt> node should have children named
|
||
<tt>url</tt> and <tt>title</tt> etc. The previous script that dumps
|
||
general structure of the tree helped me a lot in writing this
|
||
function. Note the usage of <font color=gray> <tt>error</tt> </font>
|
||
command to abort the script if XML doesn't look good to us.
|
||
<p>
|
||
Having an intermediate representation of the data might look like an
|
||
excess given that it costs us more code and some performance but there
|
||
are very good reasons to have it. We could have written a proc
|
||
<code>xml_to_html_table</code> that would create HTML table directly
|
||
from XML document but such code would be more complex, more buggy and
|
||
harder to modify. Separation that we've made provides an abstraction
|
||
that reduces complexity, which is always good. It also gives us more
|
||
flexibility: we can easily imagine writing another
|
||
<code>headlines_to_html_table</code> procedure that gives us slightly
|
||
different table.
|
||
|
||
<p>
|
||
<a
|
||
href="http://www.fifthgate.org/articles/aolserver/xml/test_linuxtoday_xml.tcl">
|
||
See how it works in practice</a>
|
||
[external site running AOLserver]
|
||
and
|
||
<a href="misc/washington/test_linuxtoday_xml.tcl.txt">get the source</a>
|
||
[included in <I>Linux Gazette</I>]. It should
|
||
produce something like this:
|
||
<p>
|
||
<center>
|
||
<TABLE WIDTH="40%" BORDER="0" CELLSPACING="1" CELLPADDING="3">
|
||
<TR>
|
||
<TD ALIGN="center" BGCOLOR="#cccccc">
|
||
<B>
|
||
<FONT FACE="Lucida,Verdana,Helvetica,Arial">
|
||
<A href="http://linuxtoday.com">
|
||
<FONT color="#000000">linuxtoday</FONT>
|
||
</A>
|
||
</FONT>
|
||
</B>
|
||
</TD>
|
||
</TR>
|
||
|
||
<TR>
|
||
<TD BGCOLOR="#eeeeee">
|
||
<SMALL><FONT FACE="Lucida,Verdana,Helvetica,Arial">
|
||
|
||
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-28-001-04-OS-DB"><FONT COLOR="#000000">Kernel Cousin Debian Hurd #73 By Paul Emsley And Zack Brown</FONT></A><BR>
|
||
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-27-006-04-OS-SW"><FONT COLOR="#000000">Zope 2.2.5 b1 released</FONT></A><BR>
|
||
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-27-014-06-SC"><FONT COLOR="#000000">O#39;Reilly Network: Insecurities in a Nutshell: SAMBA, pine, ircd, and More</FONT></A><BR>
|
||
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-27-005-04-OP-HW"><FONT COLOR="#000000">ZDNet: Linux Laptop SuperGuide</FONT></A><BR>
|
||
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-27-004-04-OP-MS"><FONT COLOR="#000000">ComputerWorld: Think tank warns that Microsoft hack could pose national security risk</FONT></A><BR>
|
||
<EFBFBD></FONT></SMALL>
|
||
</TD>
|
||
</TR>
|
||
</TABLE>
|
||
</center>
|
||
|
||
<p>
|
||
One thing missing in this code is caching. As it is, it
|
||
will grab the XML file from other people's server everytime it is
|
||
invoked. This is not nice. It would be fairly easy to add a logic to
|
||
cache XML file (or its in-memory representation) and only
|
||
fetch a new version if, say, 1 hour passed since it was last retrieved.
|
||
|
||
|
||
<h3>Conclusion about XML as a data exchange language</h3>
|
||
|
||
Is this data exchange thing between web servers a novel idea? No. You
|
||
could do everything described here with the first generation of web
|
||
servers. You would probably use different technologies (C code running
|
||
inside a web server or a CGI script instead of an embedded scripting
|
||
language; some ad-hoc text or binary format instead of XML) but the
|
||
idea would be the same: one web server acts as a client, grabs the
|
||
data from the other server using HTTP protocol and does something
|
||
useful with the data. The other web server acts as a server providing
|
||
data for others. It's just another implementation of
|
||
a client-server paradigm. It's nothing new. It is just a sign that web
|
||
programming is maturing. After 5+ years we've finally solved most of the
|
||
problems with presenting static html pages or generating dynamic web
|
||
pages from the data kept on the server (e.g., in a database). Now we
|
||
enter the times of providing services and data for other web
|
||
sites. Today state-of-the-art is pretty much limited to exchanging
|
||
headlines and similar trivia but possibilities are bigger, ranging
|
||
from simple things like providing stock quotes or dictionary
|
||
definitions to executing complex (e.g., financial) transactions
|
||
following an agreed upon protocol.
|
||
<p>
|
||
|
||
<h3>Conclusion about XML parsing in AOLserver</h3>
|
||
|
||
Beside parsing you can also create and manipulate XML documents in
|
||
memory and convert them to XML ASCII representation. It is not
|
||
covered in this article but it's so straightforward that you should
|
||
be able to do it just by looking at the API.
|
||
<p>
|
||
ns_xml module provides basics of XML processing. Although you can do
|
||
quite a bit with it one could wish to do more. Things that are
|
||
obviously missing:
|
||
<ul>
|
||
<li> SAX API (it's already present in libxml so this would only
|
||
require extending ns_xml)
|
||
<li> support for XSLT (support for XSLT, although planned, is not yet
|
||
present in libxml)
|
||
</ul>
|
||
An alternative approach to ns_xml module would be to:
|
||
<ul>
|
||
<li> use <a href="http://pywx.idyll.org">PyWx</a>, a Python
|
||
interpreter embedded inside AOLserver and standard
|
||
<a href="http://www.python.org/sigs/xml-sig/">PyXML</a> Python module
|
||
<li> write another module wrapping some other XML parsing library
|
||
<li> use pure Tcl parser
|
||
</ul>
|
||
|
||
<h3>Links</h3>
|
||
<ul>
|
||
<li> to find out more about AOLserver read
|
||
<a href="../issue58/washington.html">
|
||
intro in December 2000 issue of LG</a> or
|
||
<a
|
||
href="http://www.arsdigita.com/asj/aolserver/introduction-1">part
|
||
one</a> and <a
|
||
href="http://www.arsdigita.com/asj/aolserver/introduction-2">part
|
||
two</a> of another intro
|
||
<li> <a href="http://www.aolserver.com">AOLserver</a> home page
|
||
<li> <a href="http://www.arsdigita.com/books/panda">Philip and Alex's
|
||
Guide to Web Publishing</a>, a book
|
||
that will make you a better web programmer
|
||
<li> <a href="http://sicp.arsdigita.org/">Structure and Interpretation
|
||
of Computer Programs</a>, a book that will make you a better
|
||
programmer
|
||
<li> <a href="http://www.arsdigita.com/books/tcl">Tcl for Web
|
||
Nerds</a>, a handy book on Tcl
|
||
<li> everybody has a web page and <a
|
||
href="http://www.fifthgate.org"> this one is mine </a>
|
||
</ul>
|
||
|
||
<address> If you have comments or suggestions,
|
||
<a href="mailto:irvingw@pobox.com">send them in</a>. </address>
|
||
|
||
|
||
|
||
|
||
|
||
<!-- *** BEGIN copyright *** -->
|
||
<P> <hr> <!-- P -->
|
||
<H5 ALIGN=center>
|
||
|
||
Copyright © 2001, Irving Washington.<BR>
|
||
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
|
||
Published in Issue 63 of <i>Linux Gazette</i>, Mid-February (EXTRA) 2001</H5>
|
||
<!-- *** END copyright *** -->
|
||
|
||
<!--startcut ==========================================================-->
|
||
<HR><P>
|
||
<CENTER>
|
||
<!-- *** BEGIN navbar *** -->
|
||
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="sharma.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue63/washington.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="lg_backpage63.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
|
||
<!-- *** END navbar *** -->
|
||
</CENTER>
|
||
</BODY></HTML>
|
||
<!--endcut ============================================================-->
|