mirror of https://github.com/tLDP/LDP
472 lines
18 KiB
XML
472 lines
18 KiB
XML
<!--
|
|
<!DOCTYPE book PUBLIC '-//OASIS//DTD DocBook XML V4.2//EN'>
|
|
-->
|
|
<appendix id="x2docbook">
|
|
<title>Converting Documents to DocBook XML</title>
|
|
|
|
<para>A variety of free and commercial tools exist for doing <quote>up conversion</quote> of non-XML formats to DocBook. A few are listed here for your convenience. A more comprehensive list is available from <ulink url="http://wiki.docbook.org/topic/ConvertOtherFormatsToDocBook">Convert Other Formats to DocBook</ulink>.</para>
|
|
|
|
<section id="txt2docbook">
|
|
<title>Text to DocBook</title>
|
|
|
|
<para>The <application>txt2docbook</application> tool allows one to convert a ascii (README-like) file to a valid docbook xml document. It simplifies the publishing of rather small papers significantly. The input format was inspired by the APT-Convert tool from <ulink url="http://www.xmlmind.com/aptconvert.html" />.</para>
|
|
|
|
<para>The script can be downloaded from <ulink url="http://txt2docbook.sourceforge.net/" />.</para>
|
|
</section>
|
|
|
|
<section id="oo2docbook">
|
|
<title>OpenOffice.org to DocBook</title>
|
|
<!-- duplicated from tools-word-processors.xml -->
|
|
|
|
<para>As of <ulink url="http://www.openoffice.org">OpenOffice.org (OOo)</ulink> 1.1RC there has been support for
|
|
exporting files to DocBook format.</para>
|
|
|
|
<para>Although OOo uses the full DocBook document type declaration,
|
|
it does not actually export the full list of DocBook elements. It
|
|
uses a <quote>simplified</quote> DocBook tag set which is geared
|
|
to on-the-fly rendering. (Although it is not the official
|
|
Simplified DocBook which is described in <xref linkend="dtd" />.)
|
|
The OpenOffice simplified (or <quote>special</quote> docbook) is available from
|
|
<ulink
|
|
url="http://www.chez.com/ebellot/ooo2sdbk/">http://www.chez.com/ebellot/ooo2sdbk/</ulink>.</para>
|
|
|
|
<section id="ooo-1-0">
|
|
<title>Open Office 1.0.x</title>
|
|
<para>
|
|
OOo has been tested by LDP volunteers with mostly positive
|
|
results. Thanks to Charles Curley
|
|
(<ulink
|
|
url="http://www.charlescurley.com">charlescurley.com</ulink>)
|
|
for the following notes on using OOo version 1.0.x:
|
|
</para>
|
|
|
|
<note><title>Check the version of your OpenOffice</title>
|
|
<para>
|
|
These notes may not apply to the version of OOo you
|
|
are using.
|
|
</para>
|
|
</note>
|
|
|
|
<itemizedlist> <listitem><para>
|
|
To be able to export to DocBook, you must have a Java runtime
|
|
environment (JRE) installed and registered with OOo--a minimum of
|
|
version 4.2.x is recommended. The configuration instructions will
|
|
depend on how you installed your JRE. Visit the OOo web site for
|
|
help with your setup.
|
|
</para>
|
|
|
|
<para>
|
|
Contrary to the OOo documentation, the Linux OOo did not come with
|
|
a JRE. I got one from Sun.
|
|
</para> </listitem> <!-- openoffice -->
|
|
|
|
<listitem><para>The exported file has lots of empty lines. My 54 line exported
|
|
file
|
|
had 5 lines of actual XML code.</para></listitem>
|
|
|
|
<listitem><para>There was no effort at pretty printing.</para></listitem>
|
|
|
|
<listitem><para> The header is:
|
|
<computeroutput>
|
|
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
|
|
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
|
|
</computeroutput>
|
|
</para></listitem>
|
|
|
|
<listitem><para> The pull-down menu in the <menuchoice><guimenu>File</guimenu><guimenuitem>Save
|
|
As</guimenuitem></menuchoice> dialog box for
|
|
file format
|
|
indicates that the export format is <quote>DocBook (simplified).</quote> There is
|
|
no explanation of what that <quote>simplified</quote> indicates. Does OOo export
|
|
a subset of DocBook? If so, which elements are ignored? Is there any
|
|
way to enter any of them manually?
|
|
</para></listitem>
|
|
|
|
<listitem><para> There is NO documentation on the DocBook export filter
|
|
or whether
|
|
OOo will import it again.
|
|
</para></listitem> </itemizedlist>
|
|
|
|
<para>
|
|
Conclusions: OOo 1.1RC is worth looking at if you want a word
|
|
processor for preparing DocBook documents.
|
|
</para>
|
|
|
|
<para>However, I hope they cure the lack of documentation. For one
|
|
thing, it would be nice to know which native OOo styles map to
|
|
which DocBook elements. It would also be nice to know how to
|
|
map one's own OOo styles to DocBook elements.</para>
|
|
</section>
|
|
|
|
<section id="ooo-1-1">
|
|
<title>Open Office 1.1</title>
|
|
<para>
|
|
<ulink url="http://www.merlinmonroe.com">Tabatha Marshall</ulink>
|
|
offers the following additional information for OOo 1.1.
|
|
</para>
|
|
|
|
<blockquote>
|
|
<para> The first problem was when I tried to do everything on version
|
|
1.0.1. That was
|
|
obviously a problem. I have RH8, and it was installed via rpm packages,
|
|
so I ripped it out and did a full, new install of OpenOffice 1.1.
|
|
It took a while to find out 1.1 was a requirement for XML to work.
|
|
</para>
|
|
|
|
<para>
|
|
During the install process I believe I was offered the choice to install
|
|
the XML features. I have a tendency to do full installs of my office
|
|
programs, so I selected everything.
|
|
</para>
|
|
|
|
<para>
|
|
I can't offer any advice to those trying to update their current
|
|
OO 1.1. Their <quote>3 ways</quote> aren't documented very well at the site
|
|
(<ulink url="http://xml.openoffice.org">xml.openoffice.org</ulink>) and as of this writing, I can't even find THAT
|
|
on their site anymore. I think more current documentation is needed
|
|
there to walk people through the process. Most of this was unclear
|
|
and I had to pretty much experiment to get things working.
|
|
</para>
|
|
|
|
<para>
|
|
Well, after I installed everything I had some configuration to do.
|
|
I opened the application, and got started by opening a new file,
|
|
choosing templates, then selecting the DocBook template. A nice menu
|
|
of <guisubmenu>Paragraph Styles</guisubmenu> popped up for me, which are the names for all those
|
|
tags, I noticed (you can see I don't use WYSIWYG often).
|
|
</para>
|
|
|
|
<para>
|
|
With a blank doc before me (couldn't get to the <guisubmenu>XML Filter
|
|
Settings</guisubmenu> menu unless some type of doc was opened), I went into
|
|
<menuchoice><guimenu>Tools</guimenu><guimenuitem>XML
|
|
Filter Settings</guimenuitem></menuchoice>, and edited the entry for DocBook file.
|
|
I configured mine as follows:
|
|
</para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<guilabel>Doctype</guilabel>
|
|
<userinput>-//OASIS//DTD DocBook XML V4.2//EN</userinput>
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<guilabel>DTD</guilabel>
|
|
<userinput>http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd</userinput>
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<guilabel>XSLT for export</guilabel>
|
|
<userinput>/usr/local/OpenOffice.org1.1.0/share/xslt/docbook/ldp-html.xsl</userinput>
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<guilabel>XSLT for import</guilabel>
|
|
<userinput>/usr/local/OpenOffice.org1.1.0/share/xslt/docbook/docbooktosoffheadings.xsl</userinput>
|
|
(this is the default)
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<guilabel>Template for import</guilabel>
|
|
<userinput>/home/tabatha/OpenOffice/user/template/DocBook
|
|
File/DocBookTemplate.stw</userinput>
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>
|
|
At first, if I opened an XML file that had even one parsing error, it
|
|
would just open the file anyway and display the markup in OO. I have
|
|
many XML files that use &copy; and other types of entities which show
|
|
up as parse errors (depending on the encoding) even though they can be
|
|
processed through. But today I was unable to open any of those files.
|
|
I got input/output errors instead. Still investigating that one.
|
|
</para>
|
|
|
|
<para>
|
|
However when you do successfully open a document (one parsing with no
|
|
errors), it puts it automatically into WYSIWYG based on the markup,
|
|
and you can then work from the paragraph styles menu like any other
|
|
such editor.
|
|
</para>
|
|
|
|
<para>
|
|
To validate the document, I used <menuchoice><guimenu>Tools</guimenu><guimenuitem>XML
|
|
Filter Settings</guimenuitem></menuchoice>, then
|
|
clicked the <guibutton>Test XSLTs</guibutton> button. On my screen, I set up the XSLT
|
|
for export to be <filename>ldp-html.xsl</filename>. If you test and there are errors,
|
|
a new window pops up with error messages at the bottom, and the lines
|
|
that need to be changed up at the top. You can change them there and
|
|
progress through the errors until they're all gone, and keep testing
|
|
until they're gone.
|
|
</para>
|
|
|
|
<para>
|
|
If you want to open a file to see the source instead of the processed
|
|
results, go to <menuchoice><guimenu>Tools</guimenu><guimenuitem>XML Filter
|
|
Settings</guimenuitem><guisubmenu>Test XSLTs</guisubmenu></menuchoice>, and then
|
|
under the <menuchoice><guimenu>Import</guimenu></menuchoice> section, check the
|
|
<guilabel>Display Source</guilabel> box. My import XSLT
|
|
is currently <filename>docbooktosoffheadings.xsl</filename> (the default) and the template
|
|
for import is <filename>DocBookTemplate.stw</filename> (also default).
|
|
</para>
|
|
|
|
<para>
|
|
I think this might work for some people, but unfortunately not for me.
|
|
I've never used WYSIWYG to edit markup. <application>Emacs with
|
|
PSGML</application> can tell me
|
|
what my next tag is no matter where I am, validate by moving through
|
|
the trouble spots, and I can parse and process from command line.
|
|
</para>
|
|
|
|
<para>
|
|
With OpenOffice, you have to visit <ulink
|
|
url="http://xml.openoffice.org/filters.html">http://xml.openoffice.org/filters.html</ulink>
|
|
to find conversion tools.
|
|
</para> </blockquote>
|
|
|
|
</section> </section>
|
|
|
|
<section id="word2docbook">
|
|
<title>Microsoft Word to DocBook</title>
|
|
<para>Even if you want to use MS Word to write your documents, you may
|
|
find <ulink url="http://www.docsoft.com/w2xmlv2.htm">w2XML</ulink>
|
|
useful. Note that this is not free software--the cost is around $130USD.
|
|
There is, however, a trial version of the software.</para>
|
|
</section>
|
|
|
|
<section id="tex4ht">
|
|
<title>LaTeX to DocBook</title>
|
|
<para>Siep Kroonenberg reports that there is a package <application>tex4ht</application> <ulink url="http://www.cse.ohio-state.edu/~gurari/TeX4ht/" /> that converts TeX and LaTeX to various flavors of XML. Currently, its support for DocBook output is limited. If you want to use <application>tex4ht</application> in its current state for LDP you will probably have to hack your LaTeX source beforehand and the generated XML afterwords.</para>
|
|
</section>
|
|
|
|
<section id="lyx2docbook">
|
|
<title>LyX to DocBook</title>
|
|
<para>This section is contributed by Chris Karakas.</para>
|
|
|
|
<para>You can use the <application>LyX-to-X</application> package to write your document conveniently in <application>LyX</application> (a visual editor originally developped as a graphical frontend to <application>LaTeX</application>), then export it to DocBook SGML and submit it to The LDP. This method is presented by <ulink url="http://www.karakas-online.de">Chris Karakas</ulink> <ulink url="http://www.karakas-online.de/mySGML"><citetitle>Document processing with LyX and SGML</citetitle></ulink>.</para>
|
|
|
|
<para>In the LyX-to-X project, LyX is used as a comfortable graphical SGML editor. Once the document is exported to SGML from LyX, it undergoes a series of transformations through <application>sed</application> and <application>awk</application> scripts that correct the SGML code, computes the Index, inserts the Bibliography and the Appendix and takes care of the correct invocation of <application>openjade</application>, <application>pdftex</application>, <application>pdfjadetex</application> and all the other necessary programs for the generation of HTML (chunked or not), PDF (with images, bookmarks, thumbnails and hyperlinks), PS, RTF and TXT versions.</para>
|
|
|
|
<para>All aspects of document processing are handled,
|
|
including automatic Index generation, display of Mathematics in TeX quality both online and in print formats, as well as the use of bibliographic databases with <ulink url="http://refdb.sourceforge.net/">RefDB</ulink>. Special care is taken so that the document processing is as transparent to the user as possible - the aim being that the user writes in LyX, then presses a button, and the LyX-to-X script does the rest. Download the documentation and the LyX-to-X package from the <ulink url="http://www.karakas-online.de/mySGML/formats.html">Formats section</ulink>.
|
|
</para>
|
|
|
|
</section>
|
|
|
|
<section id="docbook2docbook">
|
|
<title>DocBook to DocBook Transformations</title>
|
|
<section id="xmldifferences">
|
|
<title>XML and SGML DocBook</title>
|
|
<para>
|
|
There are a few changes between DocBook XML and SGML. Handling
|
|
these differences should be relatively easy for most small documents,
|
|
and many authors will not need to make any changes
|
|
to convert their documents other than the XML and DocBook
|
|
declarations at the start of their document.
|
|
</para>
|
|
|
|
<para>
|
|
For others, here is a list of what you should keep in mind
|
|
when converting your documents from SGML to XML.
|
|
</para>
|
|
|
|
<note><title>Differences between XML and SGML elements</title>
|
|
<para>
|
|
An XML element typically has three parts: the start tag,
|
|
the content (your words) and the end tag. Qualifiers
|
|
are added in the start tag and are known as
|
|
attributes. They will always have a name and a
|
|
quoted value.</para>
|
|
<screen>
|
|
<filename class="directory">/usr/local<filename>
|
|
</screen>
|
|
<para>
|
|
The start tag contains one attribute (class)
|
|
with a value of <quote>directory</quote>. The end tag (also filename)
|
|
must not contain any attributes.
|
|
</para>
|
|
</note>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Element names (tags) and their attributes are
|
|
case-dependent--typically lowercase.
|
|
The following will not validate because the end tag
|
|
<PARA> is uppercase:
|
|
</para>
|
|
<screen>
|
|
<para>This part will fail XML validation</PARA>
|
|
</screen>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
All attributes in the start tag must be
|
|
"quoted". This can
|
|
be either single (') or double (") quotes, but
|
|
not
|
|
reverse (`) or <quote>smart quotes</quote>.
|
|
The quote used to start a name="value"
|
|
pair must be the same quote used at the end of
|
|
the value. In other words: "this" would
|
|
validate, but 'that" would not.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Tags that have a start tag, but no end tag are
|
|
referred to as <quote>empty</quote> because they do
|
|
not contain (wrap around) anything. These tags must still be
|
|
closed with a trailing slash (/). For example:
|
|
<sgmltag>xref</sgmltag> must be written as
|
|
<xref linkend="software"/>. You may not
|
|
have any spaces between the / and >.
|
|
(Although you may have a space after the final
|
|
attribute: <xref linkend="foo" />.)
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Processing instructions that get sent to the transformation
|
|
engine (DSSSL or XSLT) and must have a question mark at the
|
|
end of the tag. All processing instructions are removed from
|
|
the output stream. The XML version of this tag
|
|
would look like this:
|
|
</para>
|
|
<screen>
|
|
<?dbhtml filename="foo"?>
|
|
</screen>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
If you're converting from SGML to XML, be sure
|
|
file names refer to .xml files instead of .sgml.
|
|
Some tools may get confused if a .sgml file contains XML.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
Tag minimizations were used in SGML instead of
|
|
writing out the element name in the end tag.
|
|
Example: <sgmltag class="starttag">para</sgmltag>This is foo.<sgmltag
|
|
class="endtag"></sgmltag> Tag minimizations are
|
|
not supported in XML and their use is
|
|
discouraged in DocBook.
|
|
</para>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
</section>
|
|
|
|
<section id="differences">
|
|
<title>Changing DTDs</title>
|
|
<para>
|
|
The significant changes between version changes in the DTD
|
|
involve changes to the elements (tags). Elements may be:
|
|
deprecated (which means they will be removed in
|
|
future versions); removed; modified; or added.
|
|
Almost all authors will run into a changed or
|
|
deprecated tag when going from a lower version
|
|
of DocBook to a higher version.
|
|
</para>
|
|
<para>
|
|
<citetitle>DocBook: The Definitive Guide</citetitle> does
|
|
an excellent job of showing you how elements fit
|
|
together. For each element it tells you what an
|
|
element must contain (its content model) and what is
|
|
may be contained in (who its parents are). For
|
|
example: a <sgmltag>note</sgmltag> must contain a
|
|
<sgmltag>para</sgmltag>. If you try to write
|
|
<note>Content in a
|
|
note</note> your document will not validate.
|
|
Learning how elements are assembled will make it a
|
|
lot easier to understand any validation errors that
|
|
are thrown at you. If you get truly stuck you can
|
|
also email the LDP's docbook mailing list for extra
|
|
hints. Information on subscribing is available from
|
|
<xref linkend="mailinglists" />
|
|
</para>
|
|
<para>
|
|
All tags that have been deprecated or changed for 4.x are listed
|
|
in <citetitle>DocBook: The definitive guide</citetitle>,
|
|
published by O'Reilly and Associates. This book is
|
|
also available on-line from <ulink
|
|
url="http://www.docbook.org">http://www.docbook.org</ulink>.
|
|
</para>
|
|
|
|
<section id="differences3040">
|
|
<title>Differences between version 3.x and 4.x</title>
|
|
<para>
|
|
Here are a few elements that are of particular
|
|
relevance to LDP authors:
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<formalpara>
|
|
<title><sgmltag>artheader</sgmltag></title>
|
|
<para>has been changed to
|
|
<sgmltag>articleinfo</sgmltag>.
|
|
Most other header elements have been renamed to info.
|
|
</para>
|
|
</formalpara>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<formalpara>
|
|
<title><sgmltag>graphic</sgmltag></title>
|
|
<para>has been deprecated and will be removed as of DocBook 5.x.
|
|
To prepare for this, start using
|
|
<sgmltag>mediaobject</sgmltag>. There is more
|
|
information about <sgmltag>mediaobject</sgmltag>
|
|
in <xref linkend="inserting-pictures"/>.
|
|
</para>
|
|
</formalpara>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<formalpara>
|
|
<title><sgmltag>imagedata</sgmltag></title>
|
|
<para>file formats
|
|
must now be written in UPPERCASE letters. If you
|
|
use lowercase or mixed-case spellings
|
|
for your file formats, it will fail.
|
|
</para>
|
|
</formalpara>
|
|
<para>
|
|
Valid:
|
|
</para>
|
|
<screen>
|
|
<imagedata format="EPS" fileref="foo.eps">
|
|
</screen>
|
|
<para>
|
|
Invalid:
|
|
</para>
|
|
<screen>
|
|
<imagedata format="eps" fileref="foo.eps">
|
|
</screen>
|
|
</listitem>
|
|
|
|
</itemizedlist>
|
|
|
|
</section>
|
|
</section>
|
|
</section>
|
|
|
|
</appendix>
|