Pandoc is a utility created by John MacFarlane for converting between various document formats. Some of the "from" formats supported are:
docbook
textile
rst
html
mediawiki
haddock
markdown (various types)
latex
Some of the "to" formats supported are:
docbook
docx
plain (text)
markdown (various types)
html (various types)
json
rtf
odt
The DocBook style sheets do a very good job of transforming DocBook to HTML, PDF, EPUB and other output formats so DocBook users will principally be interested in exporting to other formats and in converting other formats to DocBook. That's what this article is concerned with.
Note: Version 1.12.4.2 of pandoc was used when writing this article.
Converting between different formats uses a number of switches.
Convert a markdown file to DocBook as follows: pandoc -s -f markdown -t docbook in.md -o out.xml.
The switches used with the preceding command are:
-s - Create a stand-alone document
-f markdown - The input format is markdown
-t docbook - The output format is DocBook
-o out.xml - Output to a file named out.xml
This command creates a standalone DocBook 4.5 file using the default template. If you do not specify the -s switch, pandoc outputs a fragment. You can view the template that is used by issuing the command pandoc -D docbook. As the following output shows, DocBook is output as an <article>:
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
                  "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
$endif$
<article>
  <articleinfo>
    <title>$title$</title>
$for(author)$
    <author>
      $author$
    </author>
$endfor$
$if(date)$
    <date>$date$</date>
$endif$
  </articleinfo>
$for(include-before)$
$include-before$
$endfor$
$body$
$for(include-after)$
$include-after$
$endfor$
</article>
If you wish, you can use an alternate template or change the default template. Template syntax is documented in the Pandoc User's Guide.
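As a rough illustration of working with a custom template, the two steps (save the default template, then convert with an edited copy) might be wrapped as shell functions. This is a sketch, not part of pandoc: the function names and the file name my.docbook are hypothetical, and it relies on pandoc's --template option to point at the edited file.

```shell
# Hypothetical helpers; the function and file names are illustrative only.

# save_default_template FILE: write pandoc's default DocBook template to FILE
# so that it can be edited.
save_default_template() {
    pandoc -D docbook > "$1"
}

# convert_with_template TEMPLATE IN OUT: convert the Markdown file IN to a
# standalone DocBook file OUT using TEMPLATE instead of the default template.
convert_with_template() {
    pandoc -s --template="$1" -f markdown -t docbook "$2" -o "$3"
}
```

A typical session would be save_default_template my.docbook, an edit of my.docbook, and then convert_with_template my.docbook in.md out.xml.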
The -t switch defines the destination format and the -f switch defines the source format. The input file, in this case in.md, does not require a switch. Specify an output file using the -o switch; if you do not use this switch, output is sent to stdout.
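Because the input file needs no switch and -o names the output, the same command is easy to repeat over many files. The following sketch converts every Markdown file in a directory; the function name convert_all is hypothetical, and it assumes the files end in .md.

```shell
# convert_all DIR: convert every .md file in DIR to a standalone DocBook file
# with the same base name (notes.md becomes notes.xml).
# The function name is hypothetical.
convert_all() {
    for src in "$1"/*.md; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$src" ] || continue
        pandoc -s -f markdown -t docbook "$src" -o "${src%.md}.xml"
    done
}
```

For example, convert_all drafts would write drafts/notes.xml next to drafts/notes.md.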
Pandoc requires DocBook 4.5 for input and it also outputs DocBook 4.5. If you use DocBook 5.x, you will need to downgrade your DocBook files before passing them to pandoc, and you will likely want to upgrade pandoc's output. Downgrading to DocBook 4.5 is easily done thanks to the style sheets provided by Thomas Schraitle at Converting DocBook from Version 5 to Version 4. Upgrading is just as easy using the upgrade style sheet that ships with the latest style sheets from the DocBook Project.
Install the downgrade style sheets in the directory where you store your customization XSL and you can downgrade DocBook 5 to DocBook 4.5 using xsltproc in the following way:
shell> xsltproc --xinclude --nonet --output out.xml path/to/db5to4-withinfo.xsl in.xml
The upgrade style sheet comes with all versions of the DocBook 5 XSL files and it is found in the tools directory. Use it in the following way:
shell> xsltproc --xinclude --nonet --output out.xml path/to/tools/db4-upgrade.xsl in.xml
If you prefer, you can use the Saxon XSLT processor instead of xsltproc.
If you work with DocBook regularly, you will have set up the style sheets for conversion to the various output formats so there is little likelihood that you will want to use pandoc for HTML conversions of DocBook. However, if you don't have the style sheets readily accessible, pandoc can be useful. Also, if you wish to use code highlighting, converting DocBook to HTML using pandoc can be a quick solution. For example, to convert a DocBook file to a standalone HTML file with code highlighting in the tango style use: pandoc -s --highlight-style=tango -f docbook -t html in.xml -o out.html. The highlighting options are: pygments (the default), kate, monochrome, espresso, zenburn, haddock, and tango. For a discussion of the highlighting styles and the languages supported see http://johnmacfarlane.net/highlighting-kate/.
Using DocBook style sheets to create HTML converts all id attributes to anchors. You should be aware that this is not the case when you use pandoc. Converting to HTML using pandoc suffers from the following shortcomings:
<figure> tags lose their <title>s.
<example> tags lose their <title>s.
The id attribute is not converted to an anchor.
The linkend attribute doesn't get converted to the appropriate hyperlink. Even if pandoc created a legitimate cross reference, there would be no anchor to go to since ids are not converted.
As you'll soon see, these failings also apply to transformations to other formats.
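One way to see the extent of the missing anchors is to compare the id values in the DocBook source with those in the HTML that pandoc produced. The sketch below does this with grep and sed; the function names are hypothetical and the matching is purely textual, so treat the result as a hint rather than a definitive report.

```shell
# list_ids FILE: print the value of every id="..." attribute in FILE,
# one per line, without duplicates. This is a text match, not an XML parse.
list_ids() {
    grep -o '[[:space:]]id="[^"]*"' "$1" |
        sed 's/^[[:space:]]*id="//; s/"$//' |
        sort -u
}

# missing_anchors DOCBOOK HTML: print each id found in DOCBOOK that does not
# appear as an id="..." attribute in HTML.
missing_anchors() {
    list_ids "$1" | while IFS= read -r id; do
        grep -qF "id=\"$id\"" "$2" || printf '%s\n' "$id"
    done
}
```

Running missing_anchors in.xml out.html prints one line for each id in in.xml that has no counterpart in out.html.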
Pandoc can't convert Word documents to DocBook but conversion of DocBook to docx is supported. The following commands convert a DocBook 5 file to 4.5 and then output a docx file.
shell> xsltproc --xinclude --nonet --output tmp.xml path/to/db5to4-withinfo.xsl in.xml
shell> pandoc -t docx -f docbook -o out.docx tmp.xml
Opening out.docx in Word reveals the same failings noted in Section 4, "DocBook to HTML"; titles of figures and examples are lost as are linkends.
Unfortunately pandoc conversion to docx or HTML format has the major flaws identified in Section 5, "DocBook to Microsoft Word" and Section 4, "DocBook to HTML" and none of the following workarounds solve these problems:
Converting to HTML using the DocBook style sheets and then converting to docx using pandoc
Converting to RTF rather than to docx
Converting to ODT and then converting this format to docx
The failure to convert an id to an anchor looks insurmountable because this is a function of how pandoc is programmed. However, there is an XSLT workaround for the issue with figures and examples.
To recap, when converting to HTML from DocBook, pandoc successfully converts a figure tag to an <img> tag but ignores the title. The same thing happens when converting to docx. Consider the following DocBook figure:
<figure id="figure">
  <blockinfo>
    <title>The Image Title</title>
  </blockinfo>
  <mediaobject>
    <imageobject>
      <imagedata fileref="images/an_image.png" format="PNG" lang="en"/>
    </imageobject>
    <textobject>
      <phrase lang="en">The Image Phrase</phrase>
    </textobject>
  </mediaobject>
</figure>
The figure example has an id attribute and a title nested within a <blockinfo> tag. You can use an XSLT transformation to adjust the XML to something that pandoc understands. The following transformation style sheet removes figure and blockinfo tags and converts the figure title to a <bridgehead> tag bearing the id previously associated with the figure. It also performs the same transformation on example tags.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- convert figure title to bridgehead, capture id -->
  <xsl:template match="title">
    <xsl:choose>
      <xsl:when test="(name(parent::*) = 'blockinfo') and (name(../..) = 'figure'
                      or name(../..) = 'example')">
        <bridgehead>
          <!-- check that there is an id first! -->
          <xsl:choose>
            <xsl:when test="../../@id">
              <xsl:attribute name="id">
                <!-- id will be grandparent -->
                <xsl:value-of select="../../@id"/>
              </xsl:attribute>
            </xsl:when>
            <xsl:otherwise><!-- do nothing --></xsl:otherwise>
          </xsl:choose>
          <xsl:apply-templates select="@*|node()"/>
        </bridgehead>
      </xsl:when>
      <xsl:when test="name(parent::*) = 'figure'
                      or name(parent::*) = 'example'">
        <bridgehead>
          <xsl:choose>
            <xsl:when test="../@id">
              <xsl:attribute name="id">
                <xsl:value-of select="../@id"/>
              </xsl:attribute>
            </xsl:when>
            <xsl:otherwise><!-- do nothing --></xsl:otherwise>
          </xsl:choose>
          <xsl:apply-templates select="@*|node()"/>
        </bridgehead>
      </xsl:when>
      <xsl:otherwise>
        <title>
          <xsl:apply-templates select="@*|node()"/>
        </title>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  <xsl:template match="sectioninfo">
    <xsl:apply-templates/>
  </xsl:template>
  <!-- remove keywordset, authorblurb and all contents, also blockinfo-->
  <xsl:template match="blockinfo">
    <xsl:apply-templates/>
  </xsl:template>
  <!-- get rid of entirely -->
  <xsl:template match="keywordset"/>
  <xsl:template match="authorblurb"/>
  <!-- remove figure and example -->
  <xsl:template match="figure
                       | example">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:output method="xml" indent="yes"
              doctype-public="-//OASIS//DTD DocBook XML V4.5//EN"
              doctype-system="http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"/>
</xsl:stylesheet>
If the transformation in the section called "fix_figure.xsl" is applied prior to using pandoc, figure and example titles will be preserved in the output regardless of the type of output. Converting DocBook 5 to docx (or HTML) then requires the following steps:
shell> xsltproc --xinclude --nonet --output tmp.xml path/to/db5to4-withinfo.xsl in.xml
shell> xsltproc --xinclude --nonet --output tmp.xml path/to/fix_figure.xsl tmp.xml
shell> pandoc -t docx -f docbook -o out.docx tmp.xml
The additional step is the second command. Note that the input file for this step is the output file of the first step.
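The three steps can be collected into one shell function. The version sketched here writes each intermediate result to its own temporary file instead of reusing tmp.xml, and stops at the first step that fails. The function name and the order of its arguments are assumptions made for illustration.

```shell
# docbook5_to_docx IN OUT DOWNGRADE_XSL FIX_XSL
# Downgrade IN to DocBook 4.5, apply the figure fix, then let pandoc write OUT.
# Each intermediate result gets its own temporary file so that no step reads
# and writes the same file. The function name and argument order are hypothetical.
docbook5_to_docx() {
    step1=$(mktemp) || return 1
    step2=$(mktemp) || { rm -f "$step1"; return 1; }

    # The && chain stops at the first command that fails.
    xsltproc --xinclude --nonet --output "$step1" "$3" "$1" &&
    xsltproc --xinclude --nonet --output "$step2" "$4" "$step1" &&
    pandoc -t docx -f docbook -o "$2" "$step2"
    status=$?

    # Remove the intermediate files whether or not the conversion succeeded.
    rm -f "$step1" "$step2"
    return "$status"
}
```

It would be called as docbook5_to_docx in.xml out.docx path/to/db5to4-withinfo.xsl path/to/fix_figure.xsl.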
Since writing this article I have discovered the following issues with conversion of DocBook to a Word file:
The titles of <sidebar> tags are not output.
Solution: Adapt the fix_figure.xsl transformation style sheet to process sidebar tags in the same way as figure and example tags.
The <userinput> tag is treated as a verbatim, block tag.
Solution: Adapt the fix_figure.xsl transformation style sheet to convert userinput tags to emphasis tags.
The content of <remark> tags is output.
Solution: Adapt the fix_figure.xsl transformation style sheet to ignore remark tags.
The diff file of the changes is as follows:
       <xsl:apply-templates select="@* | node()"/>
     </xsl:copy>
   </xsl:template>
-  <!-- convert figure title to bridgehead, capture id -->
+  <!-- convert figure title to bridgehead, capture id.
+       Do this for figures, examples and sidebars. -->
   <xsl:template match="title">
     <xsl:choose>
       <xsl:when test="(name(parent::*) = 'blockinfo') and (name(../..) = 'figure'
-                      or name(../..) = 'example')">
+                      or name(../..) = 'example'
+                      or name(../..) = 'sidebar')">
         <bridgehead>
           <!-- check that there is an id first! -->
           <xsl:choose>
@@ -27,7 +29,8 @@
       </xsl:when>
       <xsl:when test="name(parent::*) = 'figure'
-                      or name(parent::*) = 'example'">
+                      or name(parent::*) = 'example'
+                      or name(parent::*) = 'sidebar'">
         <bridgehead>
           <xsl:choose>
             <xsl:when test="../@id">
@@ -50,16 +53,25 @@
   <xsl:template match="sectioninfo">
     <xsl:apply-templates/>
   </xsl:template>
-  <!-- remove keywordset, authorblurb and all contents, also blockinfo-->
+  <!-- remove keywordset, authorblurb, remark and all contents, also blockinfo-->
   <xsl:template match="blockinfo">
     <xsl:apply-templates/>
   </xsl:template>
   <!-- get rid of entirely -->
   <xsl:template match="keywordset"/>
   <xsl:template match="authorblurb"/>
+  <xsl:template match="remark"/>
+  <!-- convert userinput to emphasis -->
+  <xsl:template match="userinput">
+    <emphasis>
+      <xsl:apply-templates/>
+    </emphasis>
+  </xsl:template>
   <!-- remove figure and example -->
-  <xsl:template match="figure
-                       | example">
+  <xsl:template
+      match="figure
+             | example
+             | sidebar">
     <xsl:apply-templates/>
   </xsl:template>
   <xsl:output method="xml" indent="yes"
You can determine which DocBook tags pandoc does and does not implement by looking at Text-Pandoc-Readers-DocBook.
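To know which tags to look up, it can help to list the element names a document actually uses. The following is a rough sketch using grep; the function name list_elements is hypothetical, and because it matches text rather than parsing XML it may also report tag-like text found in comments or CDATA sections.

```shell
# list_elements FILE: print the distinct element names used in FILE, one per
# line, by matching the start of each opening tag. Closing tags, comments
# (<!--) and processing instructions (<?) do not match the pattern.
list_elements() {
    grep -o '<[A-Za-z][A-Za-z0-9_.:-]*' "$1" | sed 's/^<//' | LC_ALL=C sort -u
}
```

Running list_elements in.xml gives a short list of names to check against the reader documentation before converting.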
Peter Lavin is a technical writer who has been published in a number of print and online magazines. He is the author of Object Oriented PHP, published by No Starch Press and a contributor to PHP Hacks by O'Reilly Media.
Please do not reproduce this article in whole or part, in any form, without obtaining written permission.