From ddecb9d973b36d7686342a637f7bc3836e8329c0 Mon Sep 17 00:00:00 2001
From: gferg <>
Date: Tue, 24 Sep 2002 13:27:52 +0000
Subject: [PATCH] updated
---
.../DocBook-Demystification-HOWTO.xml | 1484 +++++++++--------
1 file changed, 746 insertions(+), 738 deletions(-)
diff --git a/LDP/howto/docbook/DocBook-Demystification-HOWTO/DocBook-Demystification-HOWTO.xml b/LDP/howto/docbook/DocBook-Demystification-HOWTO/DocBook-Demystification-HOWTO.xml
index b14e371a..081ed308 100644
--- a/LDP/howto/docbook/DocBook-Demystification-HOWTO/DocBook-Demystification-HOWTO.xml
+++ b/LDP/howto/docbook/DocBook-Demystification-HOWTO/DocBook-Demystification-HOWTO.xml
@@ -1,738 +1,746 @@
-
-
-
-]>
-
-
-
- DocBook Demystification HOWTO
-
-
- Eric
- Raymond
-
-
- esr@thyrsus.com
-
-
-
-
-
-
- v1.0
- 2002-09-20
- esr
-
- Initial version.
-
-
-
-
-
- This HOWTO attempts to clear the fog and mystery surrounding the
- DocBook markup system and the tools that go with it. It is aimed at
- authors of technical documentation for open-source projects hosted
- on Linux, but should be useful for people composing other kinds on
- other Unixes as well.
-
-
-
-
-Introduction
-
-A great many major open-source projects are converging on
-DocBook as a standard format for their documentation — projects
-including the Linux kernel, GNOME, KDE, Samba, and the Linux
-Documentation Project. The advocates of XML-based "structural markup"
-(as opposed to the older style of "presentation markup" exemplified by
-troff, Tex, and Texinfo) seem to have won the theoretical
-battle.
-
-Nevertheless, a lot of confusion surrounds DocBook and the
-programs that support it. Its devotees speak an argot that is dense
-and forbidding even by computer-science standards, slinging around
-acronyms that have no obvious relationship to the things you need to
-do to write markup and make HTML or Postscript from it. XML standards
-and technical papers are notoriously obscure. Most DocBook-related
-tools are very poorly documented, and their documentation is
-especially prone to assume way too much prior knowledge on the
-reader's part.
-
-This HOWTO will attempt to clear up the major mysteries
-surrounding DocBook and its application to open-source documentation
-— both the technical and political ones. Our objective is to equip
-you to understand not just what you need to do to make documents, but
-why the process is as complex as it is — and how it can be
-expected to change as newer DocBook-related tools become
-available.
-
-
-Why care about DocBook at all?
-
-There are two possibilities that make DocBook really
-interesting. One is multi-mode rendering and the
-other is searchable documentation
-databases.
-
-Multi-mode rendering is the easier, nearer-term possibility; it's
-the ability to write a document in a single master format that can be
-rendered in many different display modes (in particular, as both HTML
-for on-line viewing and as Postscript for high-quality printed
-output). This capability is pretty well implemented now.
-
-Searchable documentation databases is
-shorthand for the possibility that DocBook might help get us to a
-world in which all the documentation on your open-source operating
-system is one rich, searchable, cross-indexed and hyperlinked
-database (rather than being scattered across several different formats
-in multiple locations as it is now).
-
-Ideally, whenever you install a software package on your machine
-it would register its DocBook documentation into your system's
-catalog. HTML, properly indexed and cross-linked to the HTML in the
-rest of your catalog, would be generated. The new package's
-documentation would then be available through your browser. All
-your documentation would would be searchable through an interface
-resembling a good Web search engine.
-
-HTML itself is not quite rich enough a format to get us to that
-world. To name just one lack, you can't explicitly declare index
-entries in HTML. DocBook does have the semantic
-richness to support structured documentation databases. Fundamentally
-that's why so many projects are adopting it.
-
-DocBook has the vices that go with its virtues. Some people
-find it unpleasantly heavyweight, and too verbose to be really
-comfortable as a composition format. That's OK; as long as the markup
-tools they like (things like Perl POD or GNU Texinfo) can generate
-DocBook out their back ends, we can all still get we want. It doesn't
-matter whether or not everybody writes in DocBook — as long as
-it becomes the common document interchange format that everyone uses,
-we'll still get unified searchable documentation databases.
-
-
-Structural markup: a primer
-
-Older formatting languages like Tex, Texinfo, and Troff
-supported presentation
-markup. In these systems, the instructions you
-gave were about the appearance and physical layout of the text (font
-changes, indentation changes, that sort of thing).
-
-Presentation markup was adequate as long as your objective was
-to print to a single medium or type of display device. You run into
-its limits, however, when you want to mark up a document so that (a)
-it can be formatted for very different display media (such as printing
-vs. Web display), or (b) you want to support searching and indexing the
-document by its logical structure (as you are likely to want to do,
-for example, if you are incorporating it into a hypertext system).
-
-To support these capabilities properly, you need a system of
-structural markup. In structural
-markup, you describe not the physical appearance of the document but
-the logical properties of its parts.
-
-As an example: In a presentation-markup language, if you want to
-emphasize a word, you might instruct the formatter to set it in
-boldface. In
-troff1
-this would look like so:
-
-
-All your base
-.B are
-belong to us!
-
-
-In a structural-markup language, you would tell the formatter to
-emphasize the word:
-
-
-All your base <emphasis>are</emphasis> belong to us!
-
-
- The "<emphasis>" and </emphasis>in the line above
-are called markup tags, or
-just tags for short. They are the instructions
-to your formatter.
-
-In a structural-markup language, the physical appearance of the
-final document would be controlled by a
-stylesheet. It is the
-stylesheet that would tell the formatter "render emphasis as a font
-change to boldface". One advantage of presentation-markup languages
-is that by changing a stylesheet you can globally change the
-presentation of the document (to use different fonts, for example)
-without having to hack all the the individual instances of (say)
-.B in the document itself.
-
-
-Document Type Definitions
-
-(Note: to keep the explanation simple, most of this
-section is going to tell some lies, mainly by omitting a lot of
-history. Truthfulness will be fully restored in a following
-section.)
-
-DocBook is a structural-level markup language. Specifically, it
-is a dialect of XML. A DocBook document is a hunk of XML that uses
-XML tags for structural markup.
-
- In order for a document formatter to apply a stylesheet to your
-document and make it look good, it needs to know things about the
-overall structure of your document. For example, it needs to know
-that a book manuscript normally consists of front matter, a sequence
-of chapters, and back matter in order to physically format chapter
-headers properly. In order for it to know this sort of thing, you
-need to give it a Document Type
-Definition or DTD. The DTD tells your
-formatter what sorts of elements can be in the document structure, and
-in what orders they can appear.
-
-What we mean by calling DocBook an `application' of XML is
-actually that DocBook is a DTD — a rather large DTD, with
-somewhere around 400 tags in it.
-
-Lurking behind DocBook is a kind of program called a
-validating parser.When you
-format a DocBook document, the first step is to pass it through a
-validating parser (the front end of the DocBook formatter). This
-program checks your document against the DocBook DTD to make sure you
-aren't breaking any of the DTD's structural rules (otherwise the back
-end of the formatter, the part that applies your style sheet, might
-become quite confused)
-
-The validating parser will either bomb out, giving you error
-messages about places where the document structure is broken, or translate
-the document into a stream of formatting events
-which the parser back end combines with the information in your stylesheet
-to produce formatted output
-
-Here is a diagram of the whole process:
-
-
-
-
-
-The part of the diagram inside the dotted box is your formatting
-software, or toolchain. Besides the obvious and
-visible input to the formatter (the document source) you'll need to
-keep the two `hidden' inputs of the formatter (DTD and stylesheet) in
-mind to understand what follows.
-
-Other DTDs
-
-A brief digression into other DTDs may help make clear what parts of
-the previous section were specific to DocBook and what parts are general to
-all structural-markup languages.
-
-TEI (Text Encoding
-Initiative) is a large, elaborate DTD used primarily in academia for
-computer transcription of literary texts. TEI's Unix-based toolchains
-use many of the same tools that are involved with DocBook, but with
-different stylesheets and (of course) a different DTD.
-
-XHTML, the latest version of HTML, is also an XML application
-described by a DTD, which explains the family resemblance between
-XHTML and DocBook tags. The XHTML toolchain consists of web browsers
-and a number of ad-hoc HTML-to-print utilities.
-
-Many other XML DTDs are maintained to help people exchange
-structured information in fields as diverse as bioinformatics and
-banking. You can look at a list of
-repositories to get some idea of the variety out
-there.
-
-
-The DocBook toolchain
-
-Normally, what you'll do to make XHTML from your
-DocBook sources will look like this:
-
-
-bash$ xmlto xhtml foo.xml
-Convert to XHTML
-bash$ ls *.html
-ar01s02.html ar01s03.html ar01s04.html index.html
-
-
-In this example, you converted an XML-Docbook document named
-foo.xml with three top-level sections into an
-index page and two parts. Making one big page is just as easy:
-
-
-bash$ xmlto xhtml-nochunks foo.xml
-Convert to XHTML
-bash$ ls *.html
-foo.html
-
-
-Finally, here is how you make Postscript for printing:
-
-
-bash$ xmlto ps foo.xml # To make Postscript
-Convert to XSL-FO
-Making portrait pages on A4 paper (210mmx297mm)
-Post-process XSL-FO to DVI
-Post-process DVI to PS
-bash$ ls *.ps
-foo.ps
-
-
-To turn your documents into HTML or Postscript, you need an
-engine that can apply the combination of DocBook DTD and
-a suitable stylesheet to your document. Here is how the
-open-source tools for doing this fit together:
-
-
-
-
-
-Parsing your document and applying the stylesheet transformation
-will be handled by one of three programs. The most likely one is
-xsltproc, the parser
-that ships with Red Hat 7.3. The other possibilities are two Java
-programs, Saxon and
-Xalan,
-
-It is relatively easy to generate high-quality XHTML from either
-DocBook; the fact that XHTML is simply another XML DTD helps a lot.
-Translation to HTML is done by applying a rather simple stylesheet,
-and that's the end of the story. RTF is also simple to generate in
-this way, and from XHTML or RTF it's easy to generate a flat ASCII
-text approximation in a pinch.
-
-The awkward case is print. Generating high-quality printed
-output (which means, in practice, Adobe's
-PDF (Portable Document
-Format) is difficult. Doing it right requires algorithmically
-duplicating the delicate judgments of a human typesetter moving from
-content to presentation level.
-
-So, first, a stylesheet translates Docbook's structural markup
-into another dialect of XML —
-FO (Formatting Objects). FO
-markup is very much presentation-level; you can think of it as a sort
-of XML functional equivalent of troff. It has to be translated to
-Postscript for packaging in a PDF.
-
-In the toolchain shipped with Red Hat, this job is handled by a
-TeX macro package called
-PassiveTeX. It translates the
-formatting objects generated by xsltproc into
-Donald Knuth's TeX language. TeX was one of the earliest open-source
-projects, an old but powerful presentation-level formatting language
-much beloved of mathematicians (to whom it provides particulaly
-elaborate facilities for describing mathematical notation). TeX is
-also famously good at basic typesetting tasks like kerning, line
-filling, and hyphenating. TeX's output, in what's called
-DVI (DeVice Independent)
-format, is then massaged into PDF.
-
-If you think this bucket chain of XML to Tex macros to DVI to
-PDF sounds like an awkward kludge, you're right. It clanks, it
-wheezes, and it has ugly warts. Fonts are a significant problem,
-since XML and TeX and PDF have very different models of how fonts
-work; also, handling internationalization and localization is a
-nightmare. About the only thing this code path has going for it is
-that it works.
-
-The elegant way will be
-FOP, a direct
-FO-to-Postscript translator being developed by the Apache project.
-With FOP, the internationalization problem is, if not solved, at least
-well confined; XML tools handle Unicode all the way through to FOP.
-Glyph to font mapping is also strictly FOP's problem. The only
-trouble with this approach is that it doesn't work — yet. As of
-August 2002 FOP is in an unfinished alpha state — usable, but
-with rough edges and missing features.
-
-Here is what the FOP toolchain looks like:
-
-
-
-
-
-FOP has competition. There is another project called
-xsl-fo-proc which aims to do
-the same things as FOP, but in C++ (and therefore both faster than
-Java and not relying on the Java environment). As of August 2002 FOP
-is in an unfinished alpha state, not as far along as FOP.
-
-
-Who are the projects and the players?
-
-The DocBook DTD itself is maintained by the DocBook Technical
-Committee, headed by Norman Walsh. Norm is the inventor of DocBook, a
-man who has focused remarkable energy and talent over many years on
-the extremely complex problems it addresses. He is as universally
-respected in the DocBook/SGML/XML community as Linus Torvalds is in
-the Linux world.
-
-The
-docbook-tools project provides open-source tools for
-converting SGML DocBook to HTML, Postscript, and other formats. This
-package is shipped with Red Hat and other Linux distributions. It is
-maintained by Mark Galassi.
-
-Jade is an
-engine used to apply DSSSL stylesheets to SGML documents. It is
-maintained by James Clark.
-
-OpenJade
-is a community roject undertaken because the founders thought James
-Clark's maintainance of Jade was spotty. The docbook-tools programs
-use OpenJade.
-
-libxslt is a C
-library that interprers XSLT, applying stylesheets to XML documents.
-It includes a wrapper program, xsltproc, that can be
-used as an XML formatter. The code was written by Daniel Veillard
-under the auspices of the GNOME project, but does not require any
-GNOME code to run. I hear it's blazingly fast compared to the
-Java alternatives, not a surprising claim.
-
-xmlto is the
-user interface of the XML toolchain that Red Hat ships. It's written
-and maintained by Tim Waugh.
-
-Saxon
-and Xalan are Java
-programs that interpret XSLT. Saxon seems to be designed to work
-under Windows. Xalan is part of the XML Apache project and native to
-Linux and BSD; it's designed to work with FOP.
-
-JadeTex is
-the package of LaTeX macros that OpenJade uses for producing DVI.
-PassiveTeX
-performs a similar function on the XML side.
-
-FOP translates
-XML Formatting Objects to PDF. It is part of the Apache XML project
-and is designed to work with Xalan.
-
-
-Migration tools
-
-The second biggest problem with DocBook is the effort needed to
-convert old-style presentation markup to DocBook markup. Human beings
-can usually parse the presentatition of a document into logical
-structure automatically, because (for example) they can tell from
-context when an italic font means `emphasis' and when it meabs
-something else such as `this is a foreign phrase'.
-
-Somehow, in converting documents to DocBook, those
-sorts of distinctions need to be made explicit. Sometimes
-they're present in the old markup; often they are not, and the
-missing structural information has to be either deduced by
-clever heuristics or added by a human.
-
-Here is a summary of the state of conversion tools from
-various other formats:
-
-
-
-GNU Texinfo
-
-The Free Software Foundation has made a policy decision to move
-towards DocBook and away from Texinfo, its traditional format.
-Texinfo has enough structure to make reasonably good automatic
-conversion possible, and the 4.x versions of makeinfo
-feature a switch that generates DocBook.
-More at the makeinfo
-project page.
-
-
-
-
-POD
-
-There is a POD::DocBook
-module that translates Plain Old Documentation markup to DocBook. It
-claims to support every DocBook tag except the L<> italic tag.
-The man page also says "Nested =over/=back lists are not supported
-within DocBook." but notes that the module has been heavily
-tested.
-
-
-
-
-LaTeX
-
-LaTeX is a (mostly) structural markup macro language built on
-top of the TeX formatter. There is a project called
-TeX4htthat (according to the author of PassiveTeX) can
-generate DocBook from LaTeX.
-
-
-
-
-man pages and other troff-based markups
-
-This is generally considered the biggest and nastiest conversion
-problem. And indeed, the basic
-troff
-1 markup is at too low a presentation
-level for automatic conversion tools to do much of any good. However,
-the gloom in the picture lightens significantly if we consider
-translation from sources of documents written in macro packages like
-man
-7. These have enough structural
-features for automatic translation to get some traction.
-
-I wrote a tool to do this myself, because I couldn't find
-anything else that did a half-decent job of it (and the problem is
-interesting). It's called doclifter. It will
-translate to either SGML or XML DocBook from
-man
-7,
-mdoc
-7,
-ms
-7, or
-me
-7 macros. See the documentation
-for details.
-
-
-
-
-
-Editing tools
-
-One thing we presently do not have is a good open-source
-structure editor for SGML/XML documents.
-
-LyX is a GUI word processor
-that uses LaTeX for printing and supports structural editing of LaTeX
-markup. There is a LaTeX package that generates DocBook, and a
-how-to document
-escribing how to write SGML and XML in the LyX GUI.
-
-GeTox, the
-GNOME XML Editor, aims at nontechnical users. But the software is
-still (as of August 2001) alpha, more a proof of concept than anything
-useful, and the project group seems not to be very active; there have
-been no updates of the website between May 2001 and August 2002 (time of
-writing).
-
- GNU
-TeXMacs is a project aimed at producing an editor that is good
-for technical and mathematical material, including displayed formulas.
-1.0 was released in April 2002. The developers plan XML support in
-the future, but it's not there yet.
-
-ThotBook
-is a project to put together a GUI editor for DocBook based on
-the Thot toolkit. It way be moribund; the web page was not updated
-from November 2001 to August 2002 (time of writing).
-
-Most people still hack the tags by hand using either vi or Emacs, using
-psgml to validate the results.
-
-
-Related standards and practices
-
-The tools are coming together, if slowly, to edit and format
-DocBook markup. But DocBook itself is a means, not an end. We'll need
-other standards besides DocBook itself to accomplish the
-searchable-documentation-database objective I laid out at the
-beginning of this document. There are two big issues: document
-cataloguing and metadata.
-
-The Scrollkeeper
-project aims directly to meet this need. It provides a simple set of
-script hooks that can be used by package install and uninstall
-productions to register and unregister their documentation.
-
-Scrollkeeper uses the Open Metadata Format.
-This is a standard for indexing open-source documentation analogous to
-a library card-catalog system. The idea is to support rich search
-facilities that use the card-catalog metadata as well as the source
-text of the documentation itself.
-
-
-
-SGML and SGML-Tools
-
-In previous sections, I have thrown away a lot of DocBook's
-history. XML has an older brother,
-SGML or Standard Generalized
-Markup Language.
-
-Until mid-2002, no discussion of DocBook would have been
-complete without a long excursion into SGML, the differences between
-SGML and XML, and detailed descriptions of the SGML DocBook toolchain.
-Life can be simpler now; a XML DocBook toolchain is available in open
-source, works as well as the SGML toolchain ever did, and is easier to
-use, If you don't think you'll ever have to deal with old SGML-Docbook
-documents, you can skip the remainder of this section.
-
-DocBook SGML
-
-DocBook was originally an SGML application, and there was an
-SGML-based DocBook toolchain that is now moribund. There are minor
-differences between the DocBook SGML DTD and the DocBook XML DTD, but
-for an introductory discussion we can ignore them. The only one that's
-normally user-visible is that in SGML contentless tags did not need to
-have a trailing slash added to them before the closing >.
-(Requiring the trailing / means XML parsers can be a lot simpler,
-because they don't have to know about the DTD to know which opening
-tags need closers.)
-
-Versions of HTML up to 4.01 (before XHTML) were SGML
-applications. TEI was originally an SGML application, too. The
-groups managing all three DTDs jumped to XML for the same reason
-DocBook's developers did — it's drastically simpler. SGML was
-extremely complex; unmanageably so, as it turns out. The
-specification was a dense 150 pages and it is not reliably reported
-that any software ever fully implemented it.
-
-The toolchain diagram I gave earlier was simplified; it
-only showed the XML toolchain. Here is the historically
-correct version:
-
-
-
-
-
-The DSSSL toolchain is what processed DocBook SGML.
-Under it, a document goes from DocBook format through one of two
-closely-related stylesheet engines called Jade and OpenJade. These
-turn it into a TeX-macro markup. which is processed by a package called
-JadeTeX, into DVIs, which then get turned into Postscript.
-
-
-Why SGML DocBook is dead
-
-The DSSSL toolchain is, as far as new development goes,
-effectively dead. The XSLT toolchain has just reached production
-status as I write in August 2002; a working version shipped in Red Hat
-7.3. It's where DocBook developers are putting almost all of their
-effort.
-
-The reason for the change to XML was threefold. First,
-SGML turned out to be too complicated to use; then, DSSSL turned out
-to be too complicated to live with; then, significant parts of the
-DSSSL toolchain turned out to be weak and irredeemably messy.
-
-Relative to SGML, XML has a reduced feature set that is
-sufficient for almost all purposes but much easier to understand and
-build parsers for. SGML-processing tools (such as validating parsers) have
-to carry around support for a lot of features that DocBook and other
-text markup systems never actually used. Removing these features
-made XML simpler and XML-processing tools faster.
-
-The language used to describe SGML DTDs is sufficiently spiky
-and forbidding that composing SGML DTDs was something of a black art.
-XML DTDs, on the other hand, can be described in a dialect of XML
-itself; there does not need to be a separate DTD language. An XML
-description of an XML DTD is called a
-schema; the term DTD itself
-will probably pass out of use as the standards for schemas firm
-up.
-
-But mostly the DSSSL toolchain is dead because DSSSL itself, the
-SGML stylesheet description language in that toolchain, proved just too
-arcane for most human beings, and made stylesheets too difficult to
-write and modify. (It was a dialect of Scheme. Your humble editor, a
-LISP-head from way back, shakes his head in sad bemusement that
-this should drive people away.)
-
-XML fans like to sum up all these changes with "XML: tastes great, less
-filling."
-
-
-SGML-Tools
-
-SGML-Tools was the name of a DTD used by the Linux Documentation Project,
-developed a few years ago when today's DocBook toolchains didn't exist.
-SGML-Tools markup was simpler, but also much less flexible than
-DocBook. The original SGML-Tools formatter/DTD/stylesheet(s)
-toolchain has been dead for some time now, but a successor called SGML-tools
-Lite is still maintained.
-
-The LDP has been phasing out SGML-Tools in favor of DocBook, but
-it is still possible you might take over an old HOWTO. These can be
-regognized by the identifying header "<!doctype linuxdoc
-system>. If this happens to you, convert the thing to XML DocBook
-and give the old version a quick burial.
-
-
-
-References
-
-One of the things that makes learning DocBook difficult is that
-the sites related to it tend to overwhelm the newbie with long lists
-of W3C standards, massive exercises in SGML theology, and dense
-thickets of abstract terminology. We're going to try to avoid that
-here by giving you just a few selected references to look at.
-
-Michael Smith's
-Take My Advice: Don't Learn XML surveys the XML world from
-an angle similar to this document.
-
-Norman Walsh's DocBook: The Definitive
-Guide is available in print and
-on the
-web. This is indeed the definitive reference, but as an
-introduction or tutorial it's a disaster. Instead, read this:
-
-Writing
-Documentation Using DocBook: A Crash Course. This is an excellent
-tutorial.
-
-If you're writing for the Linux Documentation Project, read the
-
-LDP Author Guide.
-
-The best general introduction to SGML and XML that I've
-personally read all the way through is David Megginson's Structuring
-XML Documents (Prentice-Hall, ISBN: 0-13-642299-3).
-
-For XML only, XML In A Nutshell
-by W. Scott Means and Elliotte "Rusty" Harold is very good.
-
-The XML
-Bible looks like a pretty comprehensive reference on XML and
-related standards (including Formatting Objects).
-
-Finally, the The XML
-Cover Pages will take you into the jungle of XML standards
-if you really want to go there.
-
-
-
-
-
+
+
+
+]>
+
+
+
+ DocBook Demystification HOWTO
+
+
+ Eric
+ Raymond
+
+
+ esr@thyrsus.com
+
+
+
+
+
+
+ v1.0
+ 2002-09-20
+ esr
+
+ Initial version.
+
+
+
+
+
+ This HOWTO attempts to clear the fog and mystery surrounding the
+ DocBook markup system and the tools that go with it. It is aimed at
+ authors of technical documentation for open-source projects hosted
+ on Linux, but should be useful for people composing other kinds on
+ other Unixes as well.
+
+
+
+
+Introduction
+
+A great many major open-source projects are converging on
+DocBook as a standard format for their documentation — projects
+including the Linux kernel, GNOME, KDE, Samba, and the Linux
+Documentation Project. The advocates of XML-based "structural markup"
+(as opposed to the older style of "presentation markup" exemplified by
+troff, Tex, and Texinfo) seem to have won the theoretical
+battle.
+
+Nevertheless, a lot of confusion surrounds DocBook and the
+programs that support it. Its devotees speak an argot that is dense
+and forbidding even by computer-science standards, slinging around
+acronyms that have no obvious relationship to the things you need to
+do to write markup and make HTML or Postscript from it. XML standards
+and technical papers are notoriously obscure. Most DocBook-related
+tools are very poorly documented, and their documentation is
+especially prone to assume way too much prior knowledge on the
+reader's part.
+
+This HOWTO will attempt to clear up the major mysteries
+surrounding DocBook and its application to open-source documentation
+— both the technical and political ones. Our objective is to equip
+you to understand not just what you need to do to make documents, but
+why the process is as complex as it is — and how it can be
+expected to change as newer DocBook-related tools become
+available.
+
+
+Why care about DocBook at all?
+
+There are two possibilities that make DocBook really
+interesting. One is multi-mode rendering and the
+other is searchable documentation
+databases.
+
+Multi-mode rendering is the easier, nearer-term possibility; it's
+the ability to write a document in a single master format that can be
+rendered in many different display modes (in particular, as both HTML
+for on-line viewing and as Postscript for high-quality printed
+output). This capability is pretty well implemented now.
+
+Searchable documentation databases is
+shorthand for the possibility that DocBook might help get us to a
+world in which all the documentation on your open-source operating
+system is one rich, searchable, cross-indexed and hyperlinked
+database (rather than being scattered across several different formats
+in multiple locations as it is now).
+
+Ideally, whenever you install a software package on your machine
+it would register its DocBook documentation into your system's
+catalog. HTML, properly indexed and cross-linked to the HTML in the
+rest of your catalog, would be generated. The new package's
+documentation would then be available through your browser. All
+your documentation would would be searchable through an interface
+resembling a good Web search engine.
+
+HTML itself is not quite rich enough a format to get us to that
+world. To name just one lack, you can't explicitly declare index
+entries in HTML. DocBook does have the semantic
+richness to support structured documentation databases. Fundamentally
+that's why so many projects are adopting it.
+
+DocBook has the vices that go with its virtues. Some people
+find it unpleasantly heavyweight, and too verbose to be really
+comfortable as a composition format. That's OK; as long as the markup
+tools they like (things like Perl POD or GNU Texinfo) can generate
+DocBook out their back ends, we can all still get we want. It doesn't
+matter whether or not everybody writes in DocBook — as long as
+it becomes the common document interchange format that everyone uses,
+we'll still get unified searchable documentation databases.
+
+
+Structural markup: a primer
+
+Older formatting languages like Tex, Texinfo, and Troff
+supported presentation
+markuppresentation
+markup. In these systems, the instructions you
+gave were about the appearance and physical layout of the text (font
+changes, indentation changes, that sort of thing).
+
+Presentation markup was adequate as long as your objective was
+to print to a single medium or type of display device. You run into
+its limits, however, when you want to mark up a document so that (a)
+it can be formatted for very different display media (such as printing
+vs. Web display), or (b) you want to support searching and indexing the
+document by its logical structure (as you are likely to want to do,
+for example, if you are incorporating it into a hypertext system).
+
+To support these capabilities properly, you need a system of
+structural markupstructural
+markup. In structural markup, you describe not
+the physical appearance of the document but the logical properties of
+its parts.
+
+As an example: In a presentation-markup language, if you want to
+emphasize a word, you might instruct the formatter to set it in
+boldface. In
+troff1
+this would look like so:
+
+
+All your base
+.B are
+belong to us!
+
+
+In a structural-markup language, you would tell the formatter to
+emphasize the word:
+
+
+All your base <emphasis>are</emphasis> belong to us!
+
+
+ The "<emphasis>" and </emphasis>in the line above
+are called markup
+tagsmarkup tags,
+or just tags for short. They are the
+instructions to your formatter.
+
+In a structural-markup language, the physical appearance of the
+final document would be controlled by a stylesheet
+stylesheet. It is the
+stylesheet that would tell the formatter "render emphasis as a font
+change to boldface". One advantage of presentation-markup languages
+is that by changing a stylesheet you can globally change the
+presentation of the document (to use different fonts, for example)
+without having to hack all the the individual instances of (say)
+.B in the document itself.
+
+
+Document Type Definitions
+
+(Note: to keep the explanation simple, most of this
+section is going to tell some lies, mainly by omitting a lot of
+history. Truthfulness will be fully restored in a following
+section.)
+
+DocBook is a structural-level markup language. Specifically, it
+is a dialect of XML. A DocBook document is a hunk of XML that uses
+XML tags for structural markup.
+
+ In order for a document formatter to apply a stylesheet to your
+document and make it look good, it needs to know things about the
+overall structure of your document. For example, it needs to know
+that a book manuscript normally consists of front matter, a sequence
+of chapters, and back matter in order to physically format chapter
+headers properly. In order for it to know this sort of thing, you
+need to give it a Document Type
+DefinitionDocument Type
+DefinitionDTD or DTD. The
+DTD tells your formatter what sorts of elements can be in the document
+structure, and in what orders they can appear.
+
+What we mean by calling DocBook an `application' of XML is
+actually that DocBook is a DTD — a rather large DTD, with
+somewhere around 400 tags in it.
+
+Lurking behind DocBook is a kind of program called a
+validating parservalidating
+parser.When you format a DocBook document, the
+first step is to pass it through a validating parser (the front end of
+the DocBook formatter). This program checks your document against the
+DocBook DTD to make sure you aren't breaking any of the DTD's
+structural rules (otherwise the back end of the formatter, the part
+that applies your style sheet, might become quite confused)
+
+The validating parser will either bomb out, giving you error
+messages about places where the document structure is broken, or translate
+the document into a stream of formatting events
+which the parser back end combines with the information in your stylesheet
+to produce formatted output
+
+Here is a diagram of the whole process:
+
+
+
+
+
+The part of the diagram inside the dotted box is your formatting
+software, or toolchain. Besides the obvious and
+visible input to the formatter (the document source) you'll need to
+keep the two `hidden' inputs of the formatter (DTD and stylesheet) in
+mind to understand what follows.
+
+Other DTDs
+
+A brief digression into other DTDs may help make clear what parts of
+the previous section were specific to DocBook and what parts are general to
+all structural-markup languages.
+
+TEI (Text Encoding
+Initiative) is a large, elaborate DTD used primarily in academia for
+computer transcription of literary texts. TEI's Unix-based toolchains
+use many of the same tools that are involved with DocBook, but with
+different stylesheets and (of course) a different DTD.
+
+XHTML, the latest version of HTML, is also an XML application
+described by a DTD, which explains the family resemblance between
+XHTML and DocBook tags. The XHTML toolchain consists of web browsers
+and a number of ad-hoc HTML-to-print utilities.
+
+Many other XML DTDs are maintained to help people exchange
+structured information in fields as diverse as bioinformatics and
+banking. You can look at a list of
+repositories to get some idea of the variety out
+there.
+
+
+The DocBook toolchain
+
+Normally, what you'll do to make XHTML from your
+DocBook sources will look like this:
+
+
+bash$ xmlto xhtml foo.xml
+Convert to XHTML
+bash$ ls *.html
+ar01s02.html ar01s03.html ar01s04.html index.html
+
+
+In this example, you converted an XML-Docbook document named
+foo.xml with three top-level sections into an
+index page and two parts. Making one big page is just as easy:
+
+
+bash$ xmlto xhtml-nochunks foo.xml
+Convert to XHTML
+bash$ ls *.html
+foo.html
+
+
+Finally, here is how you make Postscript for printing:
+
+
+bash$ xmlto ps foo.xml # To make Postscript
+Convert to XSL-FO
+Making portrait pages on A4 paper (210mmx297mm)
+Post-process XSL-FO to DVI
+Post-process DVI to PS
+bash$ ls *.ps
+foo.ps
+
+
+To turn your documents into HTML or Postscript, you need an
+engine that can apply the combination of DocBook DTD and
+a suitable stylesheet to your document. Here is how the
+open-source tools for doing this fit together:
+
+
+
+
+
+Parsing your document and applying the stylesheet transformation
+will be handled by one of three programs. The most likely one is
+xsltprocxsltproc,
+the parser that ships with Red Hat 7.3. The other possibilities are
+two Java programs,
+SaxonSaxon
+and
+XalanXalan,
+
+It is relatively easy to generate high-quality XHTML from either
+DocBook; the fact that XHTML is simply another XML DTD helps a lot.
+Translation to HTML is done by applying a rather simple stylesheet,
+and that's the end of the story. RTF is also simple to generate in
+this way, and from XHTML or RTF it's easy to generate a flat ASCII
+text approximation in a pinch.
+
+The awkward case is print. Generating high-quality printed
+output (which means, in practice, Adobe's
+PDFPDF
+(Portable Document Format) is difficult. Doing it right requires
+algorithmically duplicating the delicate judgments of a human
+typesetter moving from content to presentation level.
+
+So, first, a stylesheet translates Docbook's structural markup
+into another dialect of XML —
+FOFO
+(Formatting Objects). FO markup is very much presentation-level; you
+can think of it as a sort of XML functional equivalent of troff. It
+has to be translated to Postscript for packaging in a PDF.
+
+In the toolchain shipped with Red Hat, this job is handled by a
+TeX macro package called
+PassiveTeXPassiveTeX. It
+translates the formatting objects generated by
+xsltproc into Donald Knuth's TeX language. TeX was
+one of the earliest open-source projects, an old but powerful
+presentation-level formatting language much beloved of mathematicians
+(to whom it provides particulaly elaborate facilities for describing
+mathematical notation). TeX is also famously good at basic
+typesetting tasks like kerning, line filling, and hyphenating. TeX's
+output, in what's called DVIDVI
+(DeVice Independent) format, is then massaged into PDF.
+
+If you think this bucket chain of XML to Tex macros to DVI to
+PDF sounds like an awkward kludge, you're right. It clanks, it
+wheezes, and it has ugly warts. Fonts are a significant problem,
+since XML and TeX and PDF have very different models of how fonts
+work; also, handling internationalization and localization is a
+nightmare. About the only thing this code path has going for it is
+that it works.
+
+The elegant way will be
+FOPFOP, a direct
+FO-to-Postscript translator being developed by the Apache project.
+With FOP, the internationalization problem is, if not solved, at least
+well confined; XML tools handle Unicode all the way through to FOP.
+Glyph to font mapping is also strictly FOP's problem. The only
+trouble with this approach is that it doesn't work — yet. As of
+August 2002 FOP is in an unfinished alpha state — usable, but
+with rough edges and missing features.
+
+Here is what the FOP toolchain looks like:
+
+
+
+
+
+FOP has competition. There is another project called
+xsl-fo-procxsl-fo-proc
+which aims to do the same things as FOP, but in C++ (and therefore
+both faster than Java and not relying on the Java environment). As of
+August 2002 FOP is in an unfinished alpha state, not as far along as
+FOP.
+
+
+Who are the projects and the players?
+
+The DocBook DTD itself is maintained by the DocBook Technical
+Committee, headed by Norman Walsh. Norm is the inventor of DocBook, a
+man who has focused remarkable energy and talent over many years on
+the extremely complex problems it addresses. He is as universally
+respected in the DocBook/SGML/XML community as Linus Torvalds is in
+the Linux world.
+
+The
+docbook-tools project provides open-source tools for
+converting SGML DocBook to HTML, Postscript, and other formats. This
+package is shipped with Red Hat and other Linux distributions. It is
+maintained by Mark Galassi.
+
+Jade is an
+engine used to apply DSSSL stylesheets to SGML documents. It is
+maintained by James Clark.
+
+OpenJade
+is a community roject undertaken because the founders thought James
+Clark's maintainance of Jade was spotty. The docbook-tools programs
+use OpenJade.
+
+libxslt is a C
+library that interprers XSLT, applying stylesheets to XML documents.
+It includes a wrapper program, xsltproc, that can be
+used as an XML formatter. The code was written by Daniel Veillard
+under the auspices of the GNOME project, but does not require any
+GNOME code to run. I hear it's blazingly fast compared to the
+Java alternatives, not a surprising claim.
+
+xmlto is the
+user interface of the XML toolchain that Red Hat ships. It's written
+and maintained by Tim Waugh.
+
+Saxon
+and Xalan are Java
+programs that interpret XSLT. Saxon seems to be designed to work
+under Windows. Xalan is part of the XML Apache project and native to
+Linux and BSD; it's designed to work with FOP.
+
+JadeTex is
+the package of LaTeX macros that OpenJade uses for producing DVI.
+PassiveTeX
+performs a similar function on the XML side.
+
+FOP translates
+XML Formatting Objects to PDF. It is part of the Apache XML project
+and is designed to work with Xalan.
+
+
+Migration tools
+
+The second biggest problem with DocBook is the effort needed to
+convert old-style presentation markup to DocBook markup. Human beings
+can usually parse the presentatition of a document into logical
+structure automatically, because (for example) they can tell from
+context when an italic font means `emphasis' and when it meabs
+something else such as `this is a foreign phrase'.
+
+Somehow, in converting documents to DocBook, those
+sorts of distinctions need to be made explicit. Sometimes
+they're present in the old markup; often they are not, and the
+missing structural information has to be either deduced by
+clever heuristics or added by a human.
+
+Here is a summary of the state of conversion tools from
+various other formats:
+
+
+
+GNU Texinfo
+
+The Free Software Foundation has made a policy decision to move
+towards DocBook and away from Texinfo, its traditional format.
+Texinfo has enough structure to make reasonably good automatic
+conversion possible, and the 4.x versions of makeinfo
+feature a switch that generates DocBook.
+More at the makeinfo
+project page.
+
+
+
+
+POD
+
+There is a POD::DocBook
+module that translates Plain Old Documentation markup to DocBook. It
+claims to support every DocBook tag except the L<> italic tag.
+The man page also says "Nested =over/=back lists are not supported
+within DocBook." but notes that the module has been heavily
+tested.
+
+
+
+
+LaTeX
+
+LaTeX is a (mostly) structural markup macro language built on
+top of the TeX formatter. There is a project called
+TeX4htthat (according to the author of PassiveTeX) can
+generate DocBook from LaTeX.
+
+
+
+
+man pages and other troff-based markups
+
+This is generally considered the biggest and nastiest conversion
+problem. And indeed, the basic
+troff
+1 markup is at too low a presentation
+level for automatic conversion tools to do much of any good. However,
+the gloom in the picture lightens significantly if we consider
+translation from sources of documents written in macro packages like
+man
+7. These have enough structural
+features for automatic translation to get some traction.
+
+I wrote a tool to do this myself, because I couldn't find
+anything else that did a half-decent job of it (and the problem is
+interesting). It's called doclifter. It will
+translate to either SGML or XML DocBook from
+man
+7,
+mdoc
+7,
+ms
+7, or
+me
+7 macros. See the documentation
+for details.
+
+
+
+
+
+Editing tools
+
+One thing we presently do not have is a good open-source
+structure editor for SGML/XML documents.
+
+LyX is a GUI word processor
+that uses LaTeX for printing and supports structural editing of LaTeX
+markup. There is a LaTeX package that generates DocBook, and a
+how-to document
+escribing how to write SGML and XML in the LyX GUI.
+
+GeTox, the
+GNOME XML Editor, aims at nontechnical users. But the software is
+still (as of August 2001) alpha, more a proof of concept than anything
+useful, and the project group seems not to be very active; there have
+been no updates of the website between May 2001 and August 2002 (time of
+writing).
+
+ GNU
+TeXMacs is a project aimed at producing an editor that is good
+for technical and mathematical material, including displayed formulas.
+1.0 was released in April 2002. The developers plan XML support in
+the future, but it's not there yet.
+
+ThotBook
+is a project to put together a GUI editor for DocBook based on
+the Thot toolkit. It way be moribund; the web page was not updated
+from November 2001 to August 2002 (time of writing).
+
+Most people still hack the tags by hand using either vi or Emacs, using
+psgml to validate the results.
+
+
+Related standards and practices
+
+The tools are coming together, if slowly, to edit and format
+DocBook markup. But DocBook itself is a means, not an end. We'll need
+other standards besides DocBook itself to accomplish the
+searchable-documentation-database objective I laid out at the
+beginning of this document. There are two big issues: document
+cataloguing and metadata.
+
+The Scrollkeeper
+project aims directly to meet this need. It provides a simple set of
+script hooks that can be used by package install and uninstall
+productions to register and unregister their documentation.
+
+Scrollkeeper uses the Open Metadata Format.
+This is a standard for indexing open-source documentation analogous to
+a library card-catalog system. The idea is to support rich search
+facilities that use the card-catalog metadata as well as the source
+text of the documentation itself.
+
+
+
+SGML and SGML-Tools
+
+In previous sections, I have thrown away a lot of DocBook's
+history. XML has an older brother,
+SGMLSGML or Standard Generalized
+Markup Language.
+
+Until mid-2002, no discussion of DocBook would have been
+complete without a long excursion into SGML, the differences between
+SGML and XML, and detailed descriptions of the SGML DocBook toolchain.
+Life can be simpler now; a XML DocBook toolchain is available in open
+source, works as well as the SGML toolchain ever did, and is easier to
+use, If you don't think you'll ever have to deal with old SGML-Docbook
+documents, you can skip the remainder of this section.
+
+DocBook SGML
+
+DocBook was originally an SGML application, and there was an
+SGML-based DocBook toolchain that is now moribund. There are minor
+differences between the DocBook SGML DTD and the DocBook XML DTD, but
+for an introductory discussion we can ignore them. The only one that's
+normally user-visible is that in SGML contentless tags did not need to
+have a trailing slash added to them before the closing >.
+(Requiring the trailing / means XML parsers can be a lot simpler,
+because they don't have to know about the DTD to know which opening
+tags need closers.)
+
+Versions of HTML up to 4.01 (before XHTML) were SGML
+applications. TEI was originally an SGML application, too. The
+groups managing all three DTDs jumped to XML for the same reason
+DocBook's developers did — it's drastically simpler. SGML was
+extremely complex; unmanageably so, as it turns out. The
+specification was a dense 150 pages and it is not reliably reported
+that any software ever fully implemented it.
+
+The toolchain diagram I gave earlier was simplified; it
+only showed the XML toolchain. Here is the historically
+correct version:
+
+
+
+
+
+The DSSSL toolchain is what processed DocBook SGML.
+Under it, a document goes from DocBook format through one of two
+closely-related stylesheet engines called Jade and OpenJade. These
+turn it into a TeX-macro markup. which is processed by a package called
+JadeTeX, into DVIs, which then get turned into Postscript.
+
+
+Why SGML DocBook is dead
+
+The DSSSL toolchain is, as far as new development goes,
+effectively dead. The XSLT toolchain has just reached production
+status as I write in August 2002; a working version shipped in Red Hat
+7.3. It's where DocBook developers are putting almost all of their
+effort.
+
+The reason for the change to XML was threefold. First,
+SGML turned out to be too complicated to use; then, DSSSL turned out
+to be too complicated to live with; then, significant parts of the
+DSSSL toolchain turned out to be weak and irredeemably messy.
+
+Relative to SGML, XML has a reduced feature set that is
+sufficient for almost all purposes but much easier to understand and
+build parsers for. SGML-processing tools (such as validating parsers) have
+to carry around support for a lot of features that DocBook and other
+text markup systems never actually used. Removing these features
+made XML simpler and XML-processing tools faster.
+
+The language used to describe SGML DTDs is sufficiently spiky
+and forbidding that composing SGML DTDs was something of a black art.
+XML DTDs, on the other hand, can be described in a dialect of XML
+itself; there does not need to be a separate DTD language. An XML
+description of an XML DTD is called a
+schemaschema;
+the term DTD itself will probably pass out of use as the standards for
+schemas firm up.
+
+But mostly the DSSSL toolchain is dead because DSSSL itself, the
+SGML stylesheet description language in that toolchain, proved just too
+arcane for most human beings, and made stylesheets too difficult to
+write and modify. (It was a dialect of Scheme. Your humble editor, a
+LISP-head from way back, shakes his head in sad bemusement that
+this should drive people away.)
+
+XML fans like to sum up all these changes with "XML: tastes great, less
+filling."
+
+
+SGML-Tools
+
+SGML-Tools was the name of a DTD used by the Linux Documentation Project,
+developed a few years ago when today's DocBook toolchains didn't exist.
+SGML-Tools markup was simpler, but also much less flexible than
+DocBook. The original SGML-Tools formatter/DTD/stylesheet(s)
+toolchain has been dead for some time now, but a successor called SGML-tools
+Lite is still maintained.
+
+The LDP has been phasing out SGML-Tools in favor of DocBook, but
+it is still possible you might take over an old HOWTO. These can be
+regognized by the identifying header "<!doctype linuxdoc
+system>. If this happens to you, convert the thing to XML DocBook
+and give the old version a quick burial.
+
+
+
+References
+
+One of the things that makes learning DocBook difficult is that
+the sites related to it tend to overwhelm the newbie with long lists
+of W3C standards, massive exercises in SGML theology, and dense
+thickets of abstract terminology. We're going to try to avoid that
+here by giving you just a few selected references to look at.
+
+Michael Smith's
+Take My Advice: Don't Learn XML surveys the XML world from
+an angle similar to this document.
+
+Norman Walsh's DocBook: The Definitive
+Guide is available in print and
+on the
+web. This is indeed the definitive reference, but as an
+introduction or tutorial it's a disaster. Instead, read this:
+
+Writing
+Documentation Using DocBook: A Crash Course. This is an excellent
+tutorial.
+
+If you're writing for the Linux Documentation Project, read the
+
+LDP Author Guide.
+
+The best general introduction to SGML and XML that I've
+personally read all the way through is David Megginson's Structuring
+XML Documents (Prentice-Hall, ISBN: 0-13-642299-3).
+
+For XML only, XML In A Nutshell
+by W. Scott Means and Elliotte "Rusty" Harold is very good.
+
+The XML
+Bible looks like a pretty comprehensive reference on XML and
+related standards (including Formatting Objects).
+
+Finally, the The XML
+Cover Pages will take you into the jungle of XML standards
+if you really want to go there.
+
+
+
+
+
+--