LDP/LDP/guide/docbook/LDP-Author-Guide/x2docbook.xml

472 lines
18 KiB
XML

<!--
<!DOCTYPE book PUBLIC '-//OASIS//DTD DocBook XML V4.2//EN'>
-->
<appendix id="x2docbook">
<title>Converting Documents to DocBook XML</title>
<para>A variety of free and commercial tools exist for doing <quote>up conversion</quote> of non-XML formats to DocBook. A few are listed here for your convenience. A more comprehensive list is available from <ulink url="http://wiki.docbook.org/topic/ConvertOtherFormatsToDocBook">Convert Other Formats to DocBook</ulink>.</para>
<section id="txt2docbook">
<title>Text to DocBook</title>
<para>The <application>txt2docbook</application> tool allows one to convert a ascii (README-like) file to a valid docbook xml document. It simplifies the publishing of rather small papers significantly. The input format was inspired by the APT-Convert tool from <ulink url="http://www.xmlmind.com/aptconvert.html" />.</para>
<para>The script can be downloaded from <ulink url="http://txt2docbook.sourceforge.net/" />.</para>
</section>
<section id="oo2docbook">
<title>OpenOffice.org to DocBook</title>
<!-- duplicated from tools-word-processors.xml -->
<para>As of <ulink url="http://www.openoffice.org">OpenOffice.org (OOo)</ulink> 1.1RC there has been support for
exporting files to DocBook format.</para>
<para>Although OOo uses the full DocBook document type declaration,
it does not actually export the full list of DocBook elements. It
uses a <quote>simplified</quote> DocBook tag set which is geared
to on-the-fly rendering. (Although it is not the official
Simplified DocBook which is described in <xref linkend="dtd" />.)
The OpenOffice simplified (or <quote>special</quote> docbook) is available from
<ulink
url="http://www.chez.com/ebellot/ooo2sdbk/">http://www.chez.com/ebellot/ooo2sdbk/</ulink>.</para>
<section id="ooo-1-0">
<title>Open Office 1.0.x</title>
<para>
OOo has been tested by LDP volunteers with mostly positive
results. Thanks to Charles Curley
(<ulink
url="http://www.charlescurley.com">charlescurley.com</ulink>)
for the following notes on using OOo version 1.0.x:
</para>
<note><title>Check the version of your OpenOffice</title>
<para>
These notes may not apply to the version of OOo you
are using.
</para>
</note>
<itemizedlist> <listitem><para>
To be able to export to DocBook, you must have a Java runtime
environment (JRE) installed and registered with OOo--a minimum of
version 4.2.x is recommended. The configuration instructions will
depend on how you installed your JRE. Visit the OOo web site for
help with your setup.
</para>
<para>
Contrary to the OOo documentation, the Linux OOo did not come with
a JRE. I got one from Sun.
</para> </listitem> <!-- openoffice -->
<listitem><para>The exported file has lots of empty lines. My 54 line exported
file
had 5 lines of actual XML code.</para></listitem>
<listitem><para>There was no effort at pretty printing.</para></listitem>
<listitem><para> The header is:
<computeroutput>
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;!DOCTYPE article PUBLIC &quot;-//OASIS//DTD DocBook XML V4.1.2//EN&quot;
&quot;http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd&quot;&gt;
</computeroutput>
</para></listitem>
<listitem><para> The pull-down menu in the <menuchoice><guimenu>File</guimenu><guimenuitem>Save
As</guimenuitem></menuchoice> dialog box for
file format
indicates that the export format is <quote>DocBook (simplified).</quote> There is
no explanation of what that <quote>simplified</quote> indicates. Does OOo export
a subset of DocBook? If so, which elements are ignored? Is there any
way to enter any of them manually?
</para></listitem>
<listitem><para> There is NO documentation on the DocBook export filter
or whether
OOo will import it again.
</para></listitem> </itemizedlist>
<para>
Conclusions: OOo 1.1RC is worth looking at if you want a word
processor for preparing DocBook documents.
</para>
<para>However, I hope they cure the lack of documentation. For one
thing, it would be nice to know which native OOo styles map to
which DocBook elements. It would also be nice to know how to
map one's own OOo styles to DocBook elements.</para>
</section>
<section id="ooo-1-1">
<title>Open Office 1.1</title>
<para>
<ulink url="http://www.merlinmonroe.com">Tabatha Marshall</ulink>
offers the following additional information for OOo 1.1.
</para>
<blockquote>
<para> The first problem was when I tried to do everything on version
1.0.1. That was
obviously a problem. I have RH8, and it was installed via rpm packages,
so I ripped it out and did a full, new install of OpenOffice 1.1.
It took a while to find out 1.1 was a requirement for XML to work.
</para>
<para>
During the install process I believe I was offered the choice to install
the XML features. I have a tendency to do full installs of my office
programs, so I selected everything.
</para>
<para>
I can't offer any advice to those trying to update their current
OO 1.1. Their <quote>3 ways</quote> aren't documented very well at the site
(<ulink url="http://xml.openoffice.org">xml.openoffice.org</ulink>) and as of this writing, I can't even find THAT
on their site anymore. I think more current documentation is needed
there to walk people through the process. Most of this was unclear
and I had to pretty much experiment to get things working.
</para>
<para>
Well, after I installed everything I had some configuration to do.
I opened the application, and got started by opening a new file,
choosing templates, then selecting the DocBook template. A nice menu
of <guisubmenu>Paragraph Styles</guisubmenu> popped up for me, which are the names for all those
tags, I noticed (you can see I don't use WYSIWYG often).
</para>
<para>
With a blank doc before me (couldn't get to the <guisubmenu>XML Filter
Settings</guisubmenu> menu unless some type of doc was opened), I went into
<menuchoice><guimenu>Tools</guimenu><guimenuitem>XML
Filter Settings</guimenuitem></menuchoice>, and edited the entry for DocBook file.
I configured mine as follows:
</para>
<itemizedlist>
<listitem>
<para>
<guilabel>Doctype</guilabel>
<userinput>-//OASIS//DTD DocBook XML V4.2//EN</userinput>
</para>
</listitem>
<listitem>
<para>
<guilabel>DTD</guilabel>
<userinput>http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd</userinput>
</para>
</listitem>
<listitem>
<para>
<guilabel>XSLT for export</guilabel>
<userinput>/usr/local/OpenOffice.org1.1.0/share/xslt/docbook/ldp-html.xsl</userinput>
</para>
</listitem>
<listitem>
<para>
<guilabel>XSLT for import</guilabel>
<userinput>/usr/local/OpenOffice.org1.1.0/share/xslt/docbook/docbooktosoffheadings.xsl</userinput>
(this is the default)
</para>
</listitem>
<listitem>
<para>
<guilabel>Template for import</guilabel>
<userinput>/home/tabatha/OpenOffice/user/template/DocBook
File/DocBookTemplate.stw</userinput>
</para>
</listitem>
</itemizedlist>
<para>
At first, if I opened an XML file that had even one parsing error, it
would just open the file anyway and display the markup in OO. I have
many XML files that use &amp;copy; and other types of entities which show
up as parse errors (depending on the encoding) even though they can be
processed through. But today I was unable to open any of those files.
I got input/output errors instead. Still investigating that one.
</para>
<para>
However when you do successfully open a document (one parsing with no
errors), it puts it automatically into WYSIWYG based on the markup,
and you can then work from the paragraph styles menu like any other
such editor.
</para>
<para>
To validate the document, I used <menuchoice><guimenu>Tools</guimenu><guimenuitem>XML
Filter Settings</guimenuitem></menuchoice>, then
clicked the <guibutton>Test XSLTs</guibutton> button. On my screen, I set up the XSLT
for export to be <filename>ldp-html.xsl</filename>. If you test and there are errors,
a new window pops up with error messages at the bottom, and the lines
that need to be changed up at the top. You can change them there and
progress through the errors until they're all gone, and keep testing
until they're gone.
</para>
<para>
If you want to open a file to see the source instead of the processed
results, go to <menuchoice><guimenu>Tools</guimenu><guimenuitem>XML Filter
Settings</guimenuitem><guisubmenu>Test XSLTs</guisubmenu></menuchoice>, and then
under the <menuchoice><guimenu>Import</guimenu></menuchoice> section, check the
<guilabel>Display Source</guilabel> box. My import XSLT
is currently <filename>docbooktosoffheadings.xsl</filename> (the default) and the template
for import is <filename>DocBookTemplate.stw</filename> (also default).
</para>
<para>
I think this might work for some people, but unfortunately not for me.
I've never used WYSIWYG to edit markup. <application>Emacs with
PSGML</application> can tell me
what my next tag is no matter where I am, validate by moving through
the trouble spots, and I can parse and process from command line.
</para>
<para>
With OpenOffice, you have to visit <ulink
url="http://xml.openoffice.org/filters.html">http://xml.openoffice.org/filters.html</ulink>
to find conversion tools.
</para> </blockquote>
</section> </section>
<section id="word2docbook">
<title>Microsoft Word to DocBook</title>
<para>Even if you want to use MS Word to write your documents, you may
find <ulink url="http://www.docsoft.com/w2xmlv2.htm">w2XML</ulink>
useful. Note that this is not free software--the cost is around $130USD.
There is, however, a trial version of the software.</para>
</section>
<section id="tex4ht">
<title>LaTeX to DocBook</title>
<para>Siep Kroonenberg reports that there is a package <application>tex4ht</application> <ulink url="http://www.cse.ohio-state.edu/~gurari/TeX4ht/" /> that converts TeX and LaTeX to various flavors of XML. Currently, its support for DocBook output is limited. If you want to use <application>tex4ht</application> in its current state for LDP you will probably have to hack your LaTeX source beforehand and the generated XML afterwords.</para>
</section>
<section id="lyx2docbook">
<title>LyX to DocBook</title>
<para>This section is contributed by Chris Karakas.</para>
<para>You can use the <application>LyX-to-X</application> package to write your document conveniently in <application>LyX</application> (a visual editor originally developped as a graphical frontend to <application>LaTeX</application>), then export it to DocBook SGML and submit it to The LDP. This method is presented by <ulink url="http://www.karakas-online.de">Chris Karakas</ulink> <ulink url="http://www.karakas-online.de/mySGML"><citetitle>Document processing with LyX and SGML</citetitle></ulink>.</para>
<para>In the LyX-to-X project, LyX is used as a comfortable graphical SGML editor. Once the document is exported to SGML from LyX, it undergoes a series of transformations through <application>sed</application> and <application>awk</application> scripts that correct the SGML code, computes the Index, inserts the Bibliography and the Appendix and takes care of the correct invocation of <application>openjade</application>, <application>pdftex</application>, <application>pdfjadetex</application> and all the other necessary programs for the generation of HTML (chunked or not), PDF (with images, bookmarks, thumbnails and hyperlinks), PS, RTF and TXT versions.</para>
<para>All aspects of document processing are handled,
including automatic Index generation, display of Mathematics in TeX quality both online and in print formats, as well as the use of bibliographic databases with <ulink url="http://refdb.sourceforge.net/">RefDB</ulink>. Special care is taken so that the document processing is as transparent to the user as possible - the aim being that the user writes in LyX, then presses a button, and the LyX-to-X script does the rest. Download the documentation and the LyX-to-X package from the <ulink url="http://www.karakas-online.de/mySGML/formats.html">Formats section</ulink>.
</para>
</section>
<section id="docbook2docbook">
<title>DocBook to DocBook Transformations</title>
<section id="xmldifferences">
<title>XML and SGML DocBook</title>
<para>
There are a few changes between DocBook XML and SGML. Handling
these differences should be relatively easy for most small documents,
and many authors will not need to make any changes
to convert their documents other than the XML and DocBook
declarations at the start of their document.
</para>
<para>
For others, here is a list of what you should keep in mind
when converting your documents from SGML to XML.
</para>
<note><title>Differences between XML and SGML elements</title>
<para>
An XML element typically has three parts: the start tag,
the content (your words) and the end tag. Qualifiers
are added in the start tag and are known as
attributes. They will always have a name and a
quoted value.</para>
<screen>
&lt;filename class="directory"&gt;/usr/local&lt;filename&gt;
</screen>
<para>
The start tag contains one attribute (class)
with a value of <quote>directory</quote>. The end tag (also filename)
must not contain any attributes.
</para>
</note>
<itemizedlist>
<listitem>
<para>
Element names (tags) and their attributes are
case-dependent--typically lowercase.
The following will not validate because the end tag
&lt;PARA&gt; is uppercase:
</para>
<screen>
&lt;para&gt;This part will fail XML validation&lt;/PARA&gt;
</screen>
</listitem>
<listitem>
<para>
All attributes in the start tag must be
"quoted". This can
be either single (') or double (") quotes, but
not
reverse (`) or <quote>smart quotes</quote>.
The quote used to start a name="value"
pair must be the same quote used at the end of
the value. In other words: "this" would
validate, but 'that" would not.
</para>
</listitem>
<listitem>
<para>
Tags that have a start tag, but no end tag are
referred to as <quote>empty</quote> because they do
not contain (wrap around) anything. These tags must still be
closed with a trailing slash (/). For example:
<sgmltag>xref</sgmltag> must be written as
&lt;xref linkend="software"/&gt;. You may not
have any spaces between the / and &gt;.
(Although you may have a space after the final
attribute: &lt;xref linkend="foo" /&gt;.)
</para>
</listitem>
<listitem>
<para>
Processing instructions that get sent to the transformation
engine (DSSSL or XSLT) and must have a question mark at the
end of the tag. All processing instructions are removed from
the output stream. The XML version of this tag
would look like this:
</para>
<screen>
&lt;?dbhtml filename="foo"?&gt;
</screen>
</listitem>
<listitem>
<para>
If you're converting from SGML to XML, be sure
file names refer to .xml files instead of .sgml.
Some tools may get confused if a .sgml file contains XML.
</para>
</listitem>
<listitem>
<para>
Tag minimizations were used in SGML instead of
writing out the element name in the end tag.
Example: <sgmltag class="starttag">para</sgmltag>This is foo.<sgmltag
class="endtag"></sgmltag> Tag minimizations are
not supported in XML and their use is
discouraged in DocBook.
</para>
</listitem>
</itemizedlist>
</section>
<section id="differences">
<title>Changing DTDs</title>
<para>
The significant changes between version changes in the DTD
involve changes to the elements (tags). Elements may be:
deprecated (which means they will be removed in
future versions); removed; modified; or added.
Almost all authors will run into a changed or
deprecated tag when going from a lower version
of DocBook to a higher version.
</para>
<para>
<citetitle>DocBook: The Definitive Guide</citetitle> does
an excellent job of showing you how elements fit
together. For each element it tells you what an
element must contain (its content model) and what is
may be contained in (who its parents are). For
example: a <sgmltag>note</sgmltag> must contain a
<sgmltag>para</sgmltag>. If you try to write
&lt;note&gt;Content in a
note&lt;/note&gt; your document will not validate.
Learning how elements are assembled will make it a
lot easier to understand any validation errors that
are thrown at you. If you get truly stuck you can
also email the LDP's docbook mailing list for extra
hints. Information on subscribing is available from
<xref linkend="mailinglists" />
</para>
<para>
All tags that have been deprecated or changed for 4.x are listed
in <citetitle>DocBook: The definitive guide</citetitle>,
published by O'Reilly and Associates. This book is
also available on-line from <ulink
url="http://www.docbook.org">http://www.docbook.org</ulink>.
</para>
<section id="differences3040">
<title>Differences between version 3.x and 4.x</title>
<para>
Here are a few elements that are of particular
relevance to LDP authors:
</para>
<itemizedlist>
<listitem>
<formalpara>
<title><sgmltag>artheader</sgmltag></title>
<para>has been changed to
<sgmltag>articleinfo</sgmltag>.
Most other header elements have been renamed to info.
</para>
</formalpara>
</listitem>
<listitem>
<formalpara>
<title><sgmltag>graphic</sgmltag></title>
<para>has been deprecated and will be removed as of DocBook 5.x.
To prepare for this, start using
<sgmltag>mediaobject</sgmltag>. There is more
information about <sgmltag>mediaobject</sgmltag>
in <xref linkend="inserting-pictures"/>.
</para>
</formalpara>
</listitem>
<listitem>
<formalpara>
<title><sgmltag>imagedata</sgmltag></title>
<para>file formats
must now be written in UPPERCASE letters. If you
use lowercase or mixed-case spellings
for your file formats, it will fail.
</para>
</formalpara>
<para>
Valid:
</para>
<screen>
&lt;imagedata format="EPS" fileref="foo.eps"&gt;
</screen>
<para>
Invalid:
</para>
<screen>
&lt;imagedata format="eps" fileref="foo.eps"&gt;
</screen>
</listitem>
</itemizedlist>
</section>
</section>
</section>
</appendix>