<!--startcut ==============================================-->
|
|||
|
<!-- *** BEGIN HTML header *** -->
|
|||
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
|
|||
|
<HTML><HEAD>
|
|||
|
<title>XML parsing in AOLserver LG #63</title>
|
|||
|
</HEAD>
|
|||
|
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
|
|||
|
ALINK="#FF0000">
|
|||
|
<!-- *** END HTML header *** -->
|
|||
|
|
|||
|
<CENTER>
|
|||
|
<A HREF="http://www.linuxgazette.com/">
|
|||
|
<H1><IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.png"
|
|||
|
WIDTH="600" HEIGHT="124" border="0"></H1></A>
|
|||
|
|
|||
|
<!-- *** BEGIN navbar *** -->
|
|||
|
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="sharma.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue63/washington.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="lg_backpage63.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
|
|||
|
<!-- *** END navbar *** -->
|
|||
|
<P>
|
|||
|
</CENTER>
|
|||
|
|
|||
|
<!--endcut ============================================================-->
|
|||
|
|
|||
|
<H4 ALIGN="center">
|
|||
|
"Linux Gazette...<I>making Linux just a little more fun!</I>"
|
|||
|
</H4>
|
|||
|
|
|||
|
<P> <HR> <P>
|
|||
|
<!--===================================================================-->
|
|||
|
|
|||
|
<center>
|
|||
|
<H1><font color="maroon">XML parsing in AOLserver</font></H1>
|
|||
|
<H4>By <a href="mailto:irvingw@pobox.com">Irving Washington</a></H4>
|
|||
|
</center>
|
|||
|
<P> <HR> <P>
|
|||
|
|
|||
|
<!-- END header -->
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
<h3>AOLserver</h3>
|
|||
|
|
|||
|
<a href="http://www.aolserver.com">AOLserver</a> is an open-source,
multi-threaded, high-performance web server. It is less well known
than Apache, but it has a few features that put it ahead of
Apache: a rich and well-thought-out extension API, a superior database
connectivity API, and an embedded, tightly integrated Tcl interpreter.
Read my <a href="../issue58/washington.html">previous LG article</a> to
learn more about AOLserver.
|
|||
|
|
|||
|
<h3>XML</h3>
|
|||
|
|
|||
|
If you're going to do serious work with XML you'll have to learn about
it, and you'll have to do it somewhere else. The best summary of XML
I've seen is: XML is an (inefficient) way to represent data in
tree form as text (ASCII) files. Text is good because it's simple.
Trees are good because a lot can be represented as trees
(e.g., a non-circular list is just a degenerate tree and a circular
list can be described with multiple trees). Inefficiency is bad, but it
usually makes engineering sense to trade efficiency for the
extensibility and wide adoption that XML enjoys (lots of tools,
lots of information).
|
|||
|
|
|||
|
<h3>XML support in AOLserver</h3>
|
|||
|
|
|||
|
XML processing (parsing and modification of XML documents) in
AOLserver is possible thanks to the <b>ns_xml</b> module written
by <a href="http://www.arsdigita.com">ArsDigita</a>. This module is a
wrapper around version 2.x (newer than 2.2.5) of the <a
href="http://www.xmlsoft.org/">libxml</a> library and adds the
<code>ns_xml</code> command to the embedded Tcl interpreter.
You can <a
href="http://www.aolserver.com/download/index.adp?dir=%2fmodules%2fnsxml">
download the source</a> or get it directly from the CVS repository with:
|
|||
|
<pre>
|
|||
|
cvs -d:pserver:anonymous@cvs.aolserver.sourceforge.net:/cvsroot/aolserver login
|
|||
|
cvs -z3 -d:pserver:anonymous@cvs.aolserver.sourceforge.net:/cvsroot/aolserver co nsxml
|
|||
|
</pre>
|
|||
|
You need to press <i>Enter</i> after the first command, since CVS is
waiting for a password (which is empty).
|
|||
|
<p>
|
|||
|
As of Dec. 2000, Linux distributions usually come with
version 1.x of the libxml library, so chances are that you'll need to
install 2.x yourself (this will change in the future since
everyone is migrating to 2.x). To install the <code>nsxml</code> module, go
into the <tt>nsxml</tt> directory and, if necessary, edit the path in
<code>Makefile</code> so that it points to the AOLserver source directory. Then
run <code>make</code>. You should get an <code>nsxml.so</code> module,
which should be placed in the AOLserver bin directory (the one that holds the
main <code>nsd</code> executable). Add the following to your
<code>nsd.tcl</code> config file:
|
|||
|
<pre>
|
|||
|
ns_section "ns/server/${servername}/modules"
|
|||
|
ns_param nsxml ${bindir}/nsxml.so
|
|||
|
</pre>
|
|||
|
and restart AOLserver. You can verify that the module gets loaded by
watching server.log; I usually keep a shell window open with:
|
|||
|
<pre>
|
|||
|
tail -f $AOLSERVERDIR/log/server.log
|
|||
|
</pre>
|
|||
|
This is also a great way to debug Tcl scripts since AOLserver will
|
|||
|
dump detailed debug information every time there is an error in the
|
|||
|
script.
|
|||
|
|
|||
|
<h3>XML Quick reference</h3>
|
|||
|
|
|||
|
Here's a quick reference of all commands available through ns_xml.
|
|||
|
|
|||
|
<p>
|
|||
|
|
|||
|
<table bgcolor=#ffffff cellspacing=1>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set doc_id [<font color=gray>ns_xml parse</font> <font color=red>?-persist? $string</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Parse the XML document in <font color=red>$string</font> and return a document id
(a handle to the in-memory parsed tree). If you don't provide the
<font color=red>-persist</font> flag, the memory will be freed automatically when the
script exits. Otherwise you'll have to free the memory by calling
<font color=gray>ns_xml doc free</font>. You need to use the <font color=red>-persist</font> flag if you want
to share parsed XML docs between scripts.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set doc_stats [<font color=gray>ns_xml doc stats</font> <font color=red>$doc_id</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Return document's statistics.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
<font color=gray>ns_xml doc free</font> <font color=red>$doc_id</font>
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Free a document. Should only be called on a document if the
<font color=red>-persist</font> flag was passed to either
<font color=gray>ns_xml parse</font> or <font color=gray>ns_xml doc create</font>.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set node_id [<font color=gray>ns_xml doc root</font> <font color=red>$doc_id</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Return the node id of the document root (you start traversal of the
|
|||
|
document tree from here.)
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set children_list [<font color=gray>ns_xml node children</font> <font color=red>$node_id</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Return a list of children nodes of a given node.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set node_name [<font color=gray>ns_xml node name</font> <font color=red>$node_id</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Return the name of a node.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set node_type [<font color=gray>ns_xml node type</font> <font color=red>$node_id</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Return the type of a node. Possible types: <i>element, attribute,
|
|||
|
text, cdata_section, entity_ref, entity, pi, comment, document,
|
|||
|
document_type, document_frag, notation, html_document</i>
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set content [<font color=gray>ns_xml node getcontent</font> <font color=red>$node_id</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Return the content (text) of a given node.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set attr [<font color=gray>ns_xml node getattr</font> <font color=red>$node_id $attr_name</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Return the value of an attribute of a given node.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set doc_id [<font color=gray>ns_xml doc create</font> <font color=red>?-persist? ?$doc_version?</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Create a new document in memory. If the <font color=red>-persist</font> flag is given you'll
have to explicitly free the memory taken by the document with
<font color=gray>ns_xml doc free</font>; otherwise it'll be freed automatically after
execution of the script. <font color=red>$doc_version</font> is the version of the XML
doc; if not specified it'll be "1.0".
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set xml_string [<font color=gray>ns_xml doc render</font> <font color=red>$doc_id</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Generate XML from the in-memory representation of the document.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set node_id [<font color=gray>ns_xml doc new_root</font> <font color=red>$doc_id $node_name $node_content</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Create a root node for a document.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set node_id [<font color=gray>ns_xml node new_sibling</font> <font color=red>$node_id $name $content</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Create a new sibling of a given node.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
set node_id [<font color=gray>ns_xml node new_child</font> <font color=red>$node_id $name $content</font>]
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Create a child of a given node.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
<font color=gray>ns_xml node setcontent</font> <font color=red>$node_id $content</font>
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Set the content of a given node.
|
|||
|
|
|||
|
</td></tr>
|
|||
|
<tr> <td bgcolor=wheat> <b><code>
|
|||
|
<font color=gray>ns_xml node setattr</font> <font color=red>$node_id $attr_name $value</font>
|
|||
|
</code> </b> </td> </tr>
|
|||
|
<tr> <td>
|
|||
|
Set the value of an attribute in a given node.
|
|||
|
</td></tr>
|
|||
|
</table>
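<p>
The document-building commands in the table can be combined roughly as in
the following sketch. It is untested and relies only on the command
signatures listed above; the element names (<code>headlines</code>,
<code>story</code>, <code>title</code>), the attribute name <code>id</code>
and the sample text are made up for illustration, and the script has to run
inside AOLserver with the nsxml module loaded.

```tcl
# Build a small document in memory and render it back to XML text.
# Hypothetical names: headlines, story, title, id.

# no -persist flag, so the document is freed when the script finishes
set doc_id [ns_xml doc create "1.0"]

# root element with empty content
set root_id [ns_xml doc new_root $doc_id "headlines" ""]

# one "story" child carrying an attribute and a nested "title" node
set story_id [ns_xml node new_child $root_id "story" ""]
ns_xml node setattr $story_id "id" "1"
ns_xml node new_child $story_id "title" "Zope 2.2.5 b1 released"

# serialize the in-memory tree and show it as escaped text
set xml_string [ns_xml doc render $doc_id]
ns_write [ns_quotehtml $xml_string]
```

If the document were created with <code>-persist</code>, the script would
also have to end with <code>ns_xml doc free $doc_id</code>.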
|
|||
|
|
|||
|
<h3>A simple example</h3>
|
|||
|
|
|||
|
An educational and simple thing to do is to parse a document and print
out its tree structure. Stripped to the bare bones, the process is:
<ul>
<li> use <font color=gray> <code>ns_xml parse $xml_doc</code></font>
to parse the XML document in the string <font color=gray>$xml_doc</font> and get
its document id
<li> use <font color=gray> <code>ns_xml doc root $doc_id</code>
</font> to get the id of the root node
<li> use <font color=gray> <code>ns_xml node children
$node_id</code> </font> to traverse the document tree and the <font
color=gray> <code>ns_xml node ...</code> </font>commands to get
node content and attributes
</ul>

If you provide the <font color=gray> <code>-persist</code> </font> flag to
<font color=gray><code>ns_xml parse</code> </font>
you'll have to explicitly call <font color=gray> <code>ns_xml doc
free $doc_id </code> </font> to free the memory associated with the
document; otherwise it will be freed automatically after the script
finishes executing.
|
|||
|
<p>
|
|||
|
In code it could look like this:
|
|||
|
|
|||
|
<pre>
|
|||
|
# print one node as a list item
proc dump_node {node_id} {
    set name [ns_xml node name $node_id]
    set type [ns_xml node type $node_id]
    set content [ns_xml node getcontent $node_id]
    ns_write "&lt;li&gt;"
    ns_write "node id=$node_id name=$name type=$type"
    if { [string compare $type "attribute"] != 0 } {
        ns_write " content=$content\n"
    }
}

# print the given nodes and, recursively, their children as nested lists
proc dump_tree_rec {children} {
    ns_write "&lt;ul&gt;\n"
    foreach child_id $children {
        dump_node $child_id
        set new_children [ns_xml node children $child_id]
        if { [llength $new_children] &gt; 0 } {
            dump_tree_rec $new_children
        }
    }
    ns_write "&lt;/ul&gt;\n"
}

proc dump_tree {node_id} {
    dump_tree_rec [list $node_id]
}

proc dump_doc {doc_id} {
    ns_write "doc id=$doc_id&lt;br&gt;\n"
    set root_id [ns_xml doc root $doc_id]
    dump_tree $root_id
}

# braces keep the double quotes inside the XML literal intact
set xml_doc {&lt;test version="1.0"&gt;this is a
&lt;blind&gt;test&lt;/blind&gt; of xml&lt;/test&gt;}
set doc_id [ns_xml parse $xml_doc]
dump_doc $doc_id
|
|||
|
</pre>
|
|||
|
|
|||
|
The <font color=gray> <code>ns_xml parse</code> </font> command will throw
an error if the XML document cannot be parsed (e.g., it is not well formed), so in
production code we should catch it and display a meaningful error
message, e.g.:
|
|||
|
|
|||
|
<pre>
|
|||
|
if { [catch {set doc_id [ns_xml parse $xml_doc]} err] } {
    ns_write "There was an error parsing the following XML document: "
    ns_write [ns_quotehtml $xml_doc]
    ns_write "Error message is:"
    ns_write [ns_quotehtml $err]
    ns_write "&lt;/body&gt;&lt;/html&gt;\n"
    return
}
|
|||
|
</pre>
|
|||
|
|
|||
|
Code like this takes more time to write but some day it may save a lot of
|
|||
|
debugging time (and a day like this always comes).
|
|||
|
|
|||
|
<p>
|
|||
|
<a href="http://www.fifthgate.org/articles/aolserver/xml/test_xml.tcl">See how the code works</a> in practice
|
|||
|
[external site running AOLserver]
|
|||
|
and <a href="misc/washington/test_xml.tcl.txt">get the full
|
|||
|
source</a> [included in <I>Linux Gazette</I>]. It's a bit more complex than the
|
|||
|
above snippet. You can see the structure of an arbitrary XML document by typing
|
|||
|
it in the provided text area. The script also shows how to parse form data and
|
|||
|
has more robust error handling.
|
|||
|
|
|||
|
<h3> Real life example</h3>
|
|||
|
|
|||
|
XML is better than other similar formats because it is a standard, it
has gained wide acceptance and its usage is growing rapidly.
One possible use of XML is as a means of communication between
web sites (web services). The simplest scenario is one web server
grabbing information in XML format from another web server. A popular
example of such communication is the aggregation of headlines: e.g., if
you go to <a
href="http://www.freshmeat.net">freshmeat.net</a> you'll see that they
provide current headlines from
<a href="http://www.linuxtoday.com">linuxtoday.com</a>. We'll do the
same thing (vive l'originalite!). <p>
|
|||
|
In the past this could have been done in a rather distasteful way, by
grabbing the whole HTML page and trying to extract the relevant
information. That would be hard to program and fragile (a change in the
way the HTML page is generated would most likely break such parsing).
<p>
Today a site that wants to provide headlines for others can
publish the data in an easy-to-parse XML format at some URL.
In our case the data are provided at
<a href="http://www.linuxtoday.com/backend/linuxtoday.xml">
http://www.linuxtoday.com/backend/linuxtoday.xml</a>.
<a href="misc/washington/test_xml.tcl.txt">See the format of this
file</a> (using the previously developed script). <!-- ?show_linuxtoday_p=1 -->
|
|||
|
<p>
|
|||
|
As you can see, the XML document represents the headlines on the LinuxToday site. It
is a set of stories, each story having a
title, url, author etc. We know that after parsing the XML document we
would like to have a way to easily extract the information.
Let's use the "wishful thinking" (in other words, top-down) method of
writing the code, advocated in <a
href="http://sicp.arsdigita.org">Structure and Interpretation of
Computer Programs</a> (a truly great CS book). Let's assume that we've
already converted the XML representation into an object. To build an
HTML table showing the data we need the following procedures:
|
|||
|
<ul>
|
|||
|
<li> get total number of stories: <font color=gray><code>headlines_get_stories_count $headlines</code> </font>
|
|||
|
<li> get n-th story: <font color=gray><code>headlines_get_story $headlines $story_no</code></font>
|
|||
|
<li> get URL of a given story: <font color=gray><code>story_get_url $story</code></font>
|
|||
|
<li> get title of a given story: <font color=gray><code>story_get_title $story</code></font>
|
|||
|
</ul>
|
|||
|
For simplicity I only use URL and title but extending this to more
|
|||
|
attributes should be trivial.
|
|||
|
<p>
|
|||
|
Having those procedures we can generate the simplest (but rather ugly)
|
|||
|
table:
|
|||
|
<pre>
|
|||
|
proc story_to_html_table_row { story } {
    set url [story_get_url $story]
    set title [story_get_title $story]
    return "- &lt;a href=\"$url\"&gt;&lt;font color=#000000&gt;$title&lt;/font&gt;&lt;/a&gt;&lt;br&gt;\n"
}

# given headlines generate HTML code of the table with this data
proc headlines_to_html_table { headlines } {
    set to_return "&lt;table border=0 cellspacing=1 cellpadding=3&gt;"
    append to_return "&lt;tr&gt;&lt;td&gt;&lt;small&gt;"

    set stories_count [headlines_get_stories_count $headlines]
    for {set i 0} {$i &lt; $stories_count} {incr i} {
        set story [headlines_get_story $headlines $i]
        append to_return [story_to_html_table_row $story]
    }

    # close every tag opened above, including small
    append to_return "&lt;/small&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;\n"
    return $to_return
}
|
|||
|
</pre>
|
|||
|
|
|||
|
Tcl doesn't give us much choice for representing this object; we'll
|
|||
|
use lists.
|
|||
|
<pre>
|
|||
|
proc headlines_get_stories_count { headlines } {
    return [llength $headlines]
}

proc headlines_get_story { headlines story_no } {
    return [lindex $headlines $story_no]
}

proc story_get_url { story } {
    return [lindex $story 0]
}

proc story_get_title { story } {
    return [lindex $story 1]
}
|
|||
|
</pre>
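<p>
To try the list representation outside AOLserver, here is a small
self-contained sketch for a plain <code>tclsh</code>. It repeats the four
accessors so that it runs on its own; the URLs and titles are placeholders
invented for the example.

```tcl
# Accessors repeated from the article so the snippet is self-contained.
proc headlines_get_stories_count { headlines } { return [llength $headlines] }
proc headlines_get_story { headlines story_no } { return [lindex $headlines $story_no] }
proc story_get_url { story } { return [lindex $story 0] }
proc story_get_title { story } { return [lindex $story 1] }

# A story is a two-element list {url title}; headlines is a list of stories.
# Building with [list] keeps titles containing spaces or braces intact.
set headlines [list \
    [list "http://example.com/1" "First story: a title with spaces"] \
    [list "http://example.com/2" "Second {story}"]]

set count [headlines_get_stories_count $headlines]
for {set i 0} {$i < $count} {incr i} {
    set story [headlines_get_story $headlines $i]
    puts "[story_get_title $story] -> [story_get_url $story]"
}
```

Constructing stories with <code>list</code> rather than by gluing strings
together is what lets <code>lindex</code> recover the title unchanged.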
|
|||
|
|
|||
|
Note that if we forget about purity (just for a while) we can rewrite
the following part of <code>headlines_to_html_table</code>:
<pre>
set stories_count [headlines_get_stories_count $headlines]
for {set i 0} {$i &lt; $stories_count} {incr i} {
    set story [headlines_get_story $headlines $i]
    append to_return [story_to_html_table_row $story]
}
</pre>
in a slightly terser way:
<pre>
foreach story $headlines {
    append to_return [story_to_html_table_row $story]
}
</pre>
|
|||
|
|
|||
|
Now the most important part: converting XML doc into the
|
|||
|
representation we've chosen.
|
|||
|
<pre>
|
|||
|
# does the name of the node identified by $node_id equal $name?
proc is_node_name_p { node_id name } {
    set node_name [ns_xml node name $node_id]
    return [expr {[string compare $name $node_name] == 0}]
}

# does the type of the node identified by $node_id equal $type?
proc is_node_type_p { node_id type } {
    set node_type [ns_xml node type $node_id]
    return [expr {[string compare $type $node_type] == 0}]
}

# is this a node of type "attribute"?
proc is_attribute_node_p { node_id } {
    return [is_node_type_p $node_id "attribute"]
}

# raise an error if node name is different than $name
proc error_if_node_name_not {node_id name} {
    if { ![is_node_name_p $node_id $name] } {
        set node_name [ns_xml node name $node_id]
        error "node name should be $name and not $node_name"
    }
}

# raise an error if node type is different than $type
proc error_if_node_type_not {node_id type} {
    if { ![is_node_type_p $node_id $type] } {
        set node_type [ns_xml node type $node_id]
        error "node type should be $type and not $node_type"
    }
}

# given url and title construct a story object with
# those attributes
proc define_story { url title } {
    return [list $url $title]
}

# convert a node of name "story" into an object
# that represents story
proc story_node_to_story {node_id} {
    set url ""
    set title ""
    # go through all children and extract content of url and title nodes
    set children [ns_xml node children $node_id]
    foreach child_id $children {
        # we're only interested in nodes whose name is "url" or "title"
        if { [is_attribute_node_p $child_id]} {
            if { [is_node_name_p $child_id "url"] || [is_node_name_p $child_id "title"]} {
                set node_children [ns_xml node children $child_id]
                # those should only have one child node with
                # the name "text" and type "cdata_section"
                if { [llength $node_children] != 1 } {
                    set name [ns_xml node name $child_id]
                    error "$name node should only have 1 child"
                }
                set one_node_id [lindex $node_children 0]
                error_if_node_type_not $one_node_id "cdata_section"
                error_if_node_name_not $one_node_id "text"
                set txt [ns_xml node getcontent $one_node_id]
                if { [is_node_name_p $child_id "url"] } {
                    set url $txt
                }
                if { [is_node_name_p $child_id "title"]} {
                    set title $txt
                }
            }
        }
    }
    return [define_story $url $title]
}

# convert XML doc to headlines object
proc xml_to_headlines { doc_id } {
    set headlines [list]
    set root_id [ns_xml doc root $doc_id]
    # root node should be named "linuxtoday" and of type "attribute"
    error_if_node_name_not $root_id "linuxtoday"
    error_if_node_type_not $root_id "attribute"
    set children [ns_xml node children $root_id]
    foreach node_id $children {
        # only interested in attribute type nodes whose name is "story"
        if { [is_node_name_p $node_id "story"] && [is_attribute_node_p $node_id]} {
            set story [story_node_to_story $node_id]
            lappend headlines $story
        }
    }
    return $headlines
}
|
|||
|
</pre>
|
|||
|
|
|||
|
The code is rather straightforward. We rely on what we know about the
structure of the XML file. In this case we know that the root node is named
<tt>linuxtoday</tt> and should have children named
<tt>story</tt>. Each <tt>story</tt> node should have children named
<tt>url</tt> and <tt>title</tt> etc. The previous script, which dumps
the general structure of the tree, helped me a lot in writing this
function. Note the use of the <font color=gray> <tt>error</tt> </font>
command to abort the script if the XML doesn't look the way we expect.
|
|||
|
<p>
|
|||
|
Having an intermediate representation of the data might look
excessive, given that it costs us more code and some performance, but there
are very good reasons to have it. We could have written a proc
<code>xml_to_html_table</code> that would create the HTML table directly
from the XML document, but such code would be more complex, buggier and
harder to modify. The separation we've made provides an abstraction
that reduces complexity, which is always good. It also gives us more
flexibility: we can easily imagine writing another
<code>headlines_to_html_table</code> procedure that gives us a slightly
different table.
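<p>
As an illustration of that flexibility, here is a sketch of a second
renderer that turns the same headlines object into a bulleted list instead
of a table. <code>headlines_to_html_list</code> and <code>html_quote</code>
are hypothetical names invented for this example (inside AOLserver,
<code>ns_quotehtml</code> could do the escaping instead); the accessors are
repeated so that the snippet runs in a plain <code>tclsh</code>.

```tcl
# Accessors for the list-based representation (repeated so this runs alone).
proc story_get_url { story } { return [lindex $story 0] }
proc story_get_title { story } { return [lindex $story 1] }

# Hypothetical helper: escape characters that are special in HTML.
# string map works in a single pass, so "&" is never escaped twice.
proc html_quote { text } {
    return [string map {& &amp; < &lt; > &gt; \" &quot;} $text]
}

# Hypothetical alternative renderer: a bulleted list instead of a table.
proc headlines_to_html_list { headlines } {
    set to_return "<ul>\n"
    foreach story $headlines {
        set url   [html_quote [story_get_url $story]]
        set title [html_quote [story_get_title $story]]
        append to_return "<li><a href=\"$url\">$title</a></li>\n"
    }
    append to_return "</ul>\n"
    return $to_return
}

# placeholder data chosen to exercise the escaping
set headlines [list [list "http://example.com/?a=1&b=2" "Q&A <draft>"]]
puts [headlines_to_html_list $headlines]
```

Nothing in the XML-parsing code has to change to support the new output,
which is the point of keeping the intermediate representation.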
|
|||
|
|
|||
|
<p>
|
|||
|
<a
|
|||
|
href="http://www.fifthgate.org/articles/aolserver/xml/test_linuxtoday_xml.tcl">
|
|||
|
See how it works in practice</a>
|
|||
|
[external site running AOLserver]
|
|||
|
and
|
|||
|
<a href="misc/washington/test_linuxtoday_xml.tcl.txt">get the source</a>
|
|||
|
[included in <I>Linux Gazette</I>]. It should
|
|||
|
produce something like this:
|
|||
|
<p>
|
|||
|
<center>
|
|||
|
<TABLE WIDTH="40%" BORDER="0" CELLSPACING="1" CELLPADDING="3">
|
|||
|
<TR>
|
|||
|
<TD ALIGN="center" BGCOLOR="#cccccc">
|
|||
|
<B>
|
|||
|
<FONT FACE="Lucida,Verdana,Helvetica,Arial">
|
|||
|
<A href="http://linuxtoday.com">
|
|||
|
<FONT color="#000000">linuxtoday</FONT>
|
|||
|
</A>
|
|||
|
</FONT>
|
|||
|
</B>
|
|||
|
</TD>
|
|||
|
</TR>
|
|||
|
|
|||
|
<TR>
|
|||
|
<TD BGCOLOR="#eeeeee">
|
|||
|
<SMALL><FONT FACE="Lucida,Verdana,Helvetica,Arial">
|
|||
|
|
|||
|
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-28-001-04-OS-DB"><FONT COLOR="#000000">Kernel Cousin Debian Hurd #73 By Paul Emsley And Zack Brown</FONT></A><BR>
|
|||
|
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-27-006-04-OS-SW"><FONT COLOR="#000000">Zope 2.2.5 b1 released</FONT></A><BR>
|
|||
|
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-27-014-06-SC"><FONT COLOR="#000000">O&#39;Reilly Network: Insecurities in a Nutshell: SAMBA, pine, ircd, and More</FONT></A><BR>
|
|||
|
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-27-005-04-OP-HW"><FONT COLOR="#000000">ZDNet: Linux Laptop SuperGuide</FONT></A><BR>
|
|||
|
- <A HREF="http://linuxtoday.com/news_story.php3?ltsn=2000-12-27-004-04-OP-MS"><FONT COLOR="#000000">ComputerWorld: Think tank warns that Microsoft hack could pose national security risk</FONT></A><BR>
|
|||
|
</FONT></SMALL>
|
|||
|
</TD>
|
|||
|
</TR>
|
|||
|
</TABLE>
|
|||
|
</center>
|
|||
|
|
|||
|
<p>
|
|||
|
One thing missing in this code is caching. As it is, it
will grab the XML file from other people's server every time it is
invoked. This is not nice. It would be fairly easy to add logic to
cache the XML file (or its in-memory representation) and only
fetch a new version if, say, 1 hour has passed since it was last retrieved.
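<p>
The caching logic could be sketched as follows. This is only an outline
under stated assumptions: <code>cached_fetch</code>, the <code>cache</code>
array, <code>fake_fetch</code> and the <code>fetch_cmd</code> parameter are
hypothetical names, and the fetch command is passed in so that the snippet
can be tried in a plain <code>tclsh</code> without touching the network. A
Tcl global is private to one interpreter, so inside multi-threaded AOLserver
the array would have to be replaced with storage shared between threads.

```tcl
# Hypothetical cache: url -> {time_fetched data}, kept in a global array.
# fetch_cmd is a command prefix that takes a URL and returns its content.
proc cached_fetch { url max_age_seconds fetch_cmd } {
    global cache
    set now [clock seconds]
    if { [info exists cache($url)] } {
        set fetched_at [lindex $cache($url) 0]
        if { $now - $fetched_at < $max_age_seconds } {
            # still fresh: do not bother the remote server
            return [lindex $cache($url) 1]
        }
    }
    # missing or stale: fetch again and remember when we did it
    set data [uplevel #0 [concat $fetch_cmd [list $url]]]
    set cache($url) [list $now $data]
    return $data
}

# Stand-in for a real HTTP fetch; counts how often it is called.
set fetch_count 0
proc fake_fetch { url } {
    global fetch_count
    incr fetch_count
    return "<linuxtoday>fetch number $fetch_count</linuxtoday>"
}

set url "http://www.linuxtoday.com/backend/linuxtoday.xml"
puts [cached_fetch $url 3600 fake_fetch]
puts [cached_fetch $url 3600 fake_fetch]   ;# served from the cache
puts "remote fetches: $fetch_count"
```

The second call returns the stored copy, so the stand-in fetch runs only
once. Caching the raw XML text rather than a parsed document also avoids
having to manage <code>-persist</code> documents by hand.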
|
|||
|
|
|||
|
|
|||
|
<h3>Conclusion about XML as a data exchange language</h3>
|
|||
|
|
|||
|
Is this data exchange thing between web servers a novel idea? No. You
could do everything described here with the first generation of web
servers. You would probably use different technologies (C code running
inside a web server or a CGI script instead of an embedded scripting
language; some ad-hoc text or binary format instead of XML) but the
idea would be the same: one web server acts as a client, grabs the
data from the other server using the HTTP protocol and does something
useful with the data. The other web server acts as a server providing
data for others. It's just another implementation of
the client-server paradigm. It's nothing new. It is just a sign that web
programming is maturing. After 5+ years we've finally solved most of the
problems with presenting static HTML pages or generating dynamic web
pages from the data kept on the server (e.g., in a database). Now we
are entering the era of providing services and data for other web
sites. Today the state of the art is pretty much limited to exchanging
headlines and similar trivia, but the possibilities are bigger, ranging
from simple things like providing stock quotes or dictionary
definitions to executing complex (e.g., financial) transactions
following an agreed-upon protocol.
|
|||
|
<p>
|
|||
|
|
|||
|
<h3>Conclusion about XML parsing in AOLserver</h3>
|
|||
|
|
|||
|
Besides parsing, you can also create and manipulate XML documents in
memory and convert them to their XML text representation. This article
concentrates on parsing, but the rest is so straightforward that you should
be able to do it just by looking at the API.
<p>
The ns_xml module provides the basics of XML processing. Although you can do
quite a bit with it, one could wish to do more. Things that are
obviously missing:
|
|||
|
<ul>
|
|||
|
<li> SAX API (it's already present in libxml so this would only
|
|||
|
require extending ns_xml)
|
|||
|
<li> support for XSLT (support for XSLT, although planned, is not yet
|
|||
|
present in libxml)
|
|||
|
</ul>
|
|||
|
Alternatives to the ns_xml module would be to:
<ul>
<li> use <a href="http://pywx.idyll.org">PyWx</a>, a Python
interpreter embedded inside AOLserver, and the standard
<a href="http://www.python.org/sigs/xml-sig/">PyXML</a> Python module
<li> write another module wrapping some other XML parsing library
<li> use a pure Tcl parser
</ul>
|
|||
|
|
|||
|
<h3>Links</h3>
|
|||
|
<ul>
|
|||
|
<li> to find out more about AOLserver read
|
|||
|
<a href="../issue58/washington.html">
|
|||
|
intro in December 2000 issue of LG</a> or
|
|||
|
<a
|
|||
|
href="http://www.arsdigita.com/asj/aolserver/introduction-1">part
|
|||
|
one</a> and <a
|
|||
|
href="http://www.arsdigita.com/asj/aolserver/introduction-2">part
|
|||
|
two</a> of another intro
|
|||
|
<li> <a href="http://www.aolserver.com">AOLserver</a> home page
|
|||
|
<li> <a href="http://www.arsdigita.com/books/panda">Philip and Alex's
|
|||
|
Guide to Web Publishing</a>, a book
|
|||
|
that will make you a better web programmer
|
|||
|
<li> <a href="http://sicp.arsdigita.org/">Structure and Interpretation
|
|||
|
of Computer Programs</a>, a book that will make you a better
|
|||
|
programmer
|
|||
|
<li> <a href="http://www.arsdigita.com/books/tcl">Tcl for Web
|
|||
|
Nerds</a>, a handy book on Tcl
|
|||
|
<li> everybody has a web page and <a
|
|||
|
href="http://www.fifthgate.org"> this one is mine </a>
|
|||
|
</ul>
|
|||
|
|
|||
|
<address> If you have comments or suggestions,
|
|||
|
<a href="mailto:irvingw@pobox.com">send them in</a>. </address>
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
<!-- *** BEGIN copyright *** -->
|
|||
|
<P> <hr> <!-- P -->
|
|||
|
<H5 ALIGN=center>
|
|||
|
|
|||
|
Copyright © 2001, Irving Washington.<BR>
|
|||
|
Copying license <A HREF="../copying.html">http://www.linuxgazette.com/copying.html</A><BR>
|
|||
|
Published in Issue 63 of <i>Linux Gazette</i>, Mid-February (EXTRA) 2001</H5>
|
|||
|
<!-- *** END copyright *** -->
|
|||
|
|
|||
|
<!--startcut ==========================================================-->
|
|||
|
<HR><P>
|
|||
|
<CENTER>
|
|||
|
|
|||
|
</CENTER>
|
|||
|
</BODY></HTML>
|
|||
|
<!--endcut ============================================================-->
|