Sunday, March 8, 2015

Transforming Scribus Source File for Text Output

I've been working, very slowly, on a publication for the town I live in. It's a book of trail maps and information about open land in the area. I wanted to use an open tool for the book layout, partly because I don't have a copy of the commercial tools that are normally used for that, and partly because using free, open tools will make it easier for other people to edit the book document in the future.

I found the Scribus project and I am now completing the book with it. It took a little time to learn the functions, but I'm really happy with it. The PDF output is great, and I found all the layout adjustments that I needed. Thanks, Scribus!

One tricky problem I encountered was in sharing the book content for review. I was not able to find an easy way to export the written content as text so I could paste it into a word processing file. Scribus is free, but the people who are reviewing this book are not likely to enjoy it as much as I do. I tried copying the text from the text frames and pasting it. This was tedious, and the copied text did not include any line breaks, so I had to hunt through the pasted text and insert a line break after every paragraph.

Since I am likely to submit more versions of the book for review, I thought it would be useful to have an automated way to extract the text. Scribus uses XML source files, so I wrote XSLT to transform them to plain text.

Here's the XSLT stylesheet that will write the text of a Scribus document to a plain text file. It handles my document, which is not very complex. One challenging aspect of Scribus source XML is that character strings are held in elements that are siblings of the paragraph markers. So the export XSLT has to include logic for recognizing the paragraph structures. I prefer explicit structures in XML documents, like wrapping contents inside a <para> element, but I guess the Scribus team had their reasons for keeping things flat.

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl">
  
  <!-- 
  #!/bin/bash
  # Example bash script for transforming a Scribus file.
  # This requires an XSLT 2.0 compatible XSLT processor, 
  # in this case Saxon v9.

  SAXON_JAR_PATH="/path/to/saxon/download/saxon9he.jar"
  INPUT_FILE="my-scribus-file.sla"
  OUTPUT_FILE="my-output-text-file.txt"
  XSLT_FILE="this-xslt-file.xslt"

  java -classpath "${SAXON_JAR_PATH}" \
  net.sf.saxon.Transform \
  -s:"${INPUT_FILE}" \
  -xsl:"${XSLT_FILE}" \
  -o:"${OUTPUT_FILE}"
  -->
  
  <xsl:output method="text" />
  
  <!-- Start by explicitly selecting the root of the DOM, and then 
  applying templates to PAGEOBJECT elements. 
  
  Sort the selected elements by the OwnPage attributes, which
  indicate the document order. 
  
  Then sort the selected elements by the YPOS attributes, which 
  indicate the order on the page, roughly, and assuming you're
  reading from top to bottom. Pick a different attribute for 
  the secondary sort if you prefer. -->
  
  <xsl:template match="/">
    <xsl:apply-templates select="//PAGEOBJECT">
      <xsl:sort select="@OwnPage" data-type="number" />
      <xsl:sort select="@YPOS" data-type="number" />
    </xsl:apply-templates>
  </xsl:template>
  
  <!-- Working with Scribus XML is tricky because the para 
  elements are siblings of the ITEXT elements that hold text 
  strings. I would have expected nested elements. But I guess 
  there's a Scribus-related reason. -->
   
  <xsl:template match="ITEXT">
    <xsl:value-of select="@CH" />
  </xsl:template>
  
  <!-- Write a tab character if the ITEXT is followed by a 
  tab element.  -->
  
  <xsl:template match="ITEXT[name(following-sibling::*[1])='tab']">
    <xsl:value-of select="@CH" />
    <xsl:text>&#x9;</xsl:text>
  </xsl:template>
  
  <!-- The newline character in the xsl:text element creates a line
  break between the paragraphs of your Scribus document. I believe
  the trail element is equivalent to the end of a paragraph.  -->
  
  <xsl:template match="ITEXT[name(following-sibling::*[1])=('para', 'trail')]">
    <xsl:value-of select="@CH" />
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
  
  <!-- I don't want lots of newlines and spaces in the output, so
     text nodes must be suppressed. -->
  <xsl:template match="text()" />
  
</xsl:stylesheet>
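
To make the flat structure concrete, here is a simplified, hypothetical fragment of a text frame. The element names (PAGEOBJECT, ITEXT, para, tab, trail) and the attributes (OwnPage, YPOS, CH) are the ones the stylesheet matches, but the values are made up, and a real Scribus file carries many more attributes and may nest these elements more deeply.

```xml
<!-- Illustrative only: values are invented and most attributes are omitted. -->
<PAGEOBJECT OwnPage="0" YPOS="100">
  <!-- Text run followed by a paragraph marker: newline is written after it. -->
  <ITEXT CH="Trail Map"/>
  <para/>
  <!-- Text run followed by a tab marker: tab character is written after it. -->
  <ITEXT CH="Length:"/>
  <tab/>
  <!-- Last text run in the frame, closed by trail: newline is written after it. -->
  <ITEXT CH="2 miles"/>
  <trail/>
</PAGEOBJECT>
```

For a fragment like this one, the stylesheet should write two lines of text: "Trail Map" on the first, and "Length:" and "2 miles" separated by a tab character on the second. Nothing in the XML groups the two paragraphs. The only clue is which element comes right after each ITEXT, which is why the templates test following-sibling::*[1].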