trax, and transforms from a DOMSource


Simon Brooke

Consider this Java fragment, part of an application which takes crufty
documents in MS Word's and Oo_O's excuses for HTML and produces a
standardised, clean presentation in both HTML and PDF:

/**
 * a map of my substitutions, loaded from the file in my resources which
 * contains my substitution specifications
 */
protected Map substitutions = null;

/** the DOM printer I'm going to use */
protected Printer caxton = new Printer( );

/**
 * a SedBuffer to knock out the worst cruft from MS Word and Oo_O
 * generated HTML
 */
protected SedBuffer sed = new SedBuffer( );

/** a tidy parser to load messy HTML as a document */
protected Tidy sweeper = new Tidy( );

/**
 * A transformer to be preloaded with the XSL file in my resources to use
 * for converting the heathen into PDF
 */
protected Transformer converter = null;

/**
 * A transformer to be preloaded with the XSL file in my resources to use
 * for splitting the heathen into web-servable units
 */
protected Transformer splitter = null;


/**
 * convert the heathen
 *
 * @param heathen the foreign file to convert
 *
 * @return the base name of the conversion
 */
public String convert( File heathen )
    throws IOException, TransformerException, SubstitutionException
{
    String result = toBaseName( heathen.getName( ) );
    File htmlFile = new File( repository, result + ".html" );
    File pdfFile = new File( repository, result + ".pdf" );
    File sweptFile = File.createTempFile( result, ".swept" );

    File subrep = new File( repository, result );

    if ( !subrep.mkdir( ) )
    {
        throw new IOException(
            "could not create sub-directory within repository" );
    }

    File tmp = File.createTempFile( result, ".conv" );

    /* sed is just an instance of my implementation of SED in Java. What
     * it's doing here is getting rid of the really awful cruft in HTML
     * generated by MS Word or Oo_O, the sort of cruft that's so bad even
     * Tidy wouldn't cope with it. */
    sed.substitute( new FileInputStream( heathen ),
                    new FileOutputStream( tmp ), substitutions );

    /* sweeper is an instance of Andy Quick and Dave Raggett's JTidy -
     * it knocks the remaining cruft out of foreign HTML, and produces
     * a DOM object */
    Document swept = sweeper.parseDOM( new FileInputStream( tmp ), null );

    try
    {
        /* caxton is my own recursive descent DOM pretty-printer - it dates
         * back to 1999, before the days of TRAX. It's reliable, if not
         * perfect.
         * http://www.weft.co.uk/library/jacquard/documentation/uk/co/weft/domutil/Printer.html
         */
        caxton.print( swept, new FileOutputStream( sweptFile ) );

        converter.transform( new StreamSource( sweptFile ),
                             new StreamResult( htmlFile ) );

        splitter.transform( new StreamSource( sweptFile ),
                            new StreamResult( new File( subrep, "index.html" ) ) );

        StringBuffer commandString = new StringBuffer( "prince -s " );

        commandString.append( resourceDir ).append( File.separatorChar );
        commandString.append( "paperback.css" ).append( ' ' );
        commandString.append( htmlFile.getCanonicalPath( ) ).append( ' ' );
        commandString.append( pdfFile.getCanonicalPath( ) );

        /* pass the result off to Prince for final formatting to PDF */
        Runtime.getRuntime( ).exec( commandString.toString( ) );
        System.err.println( "Finished" );
    }
    catch ( Exception e )
    {
        // TODO Auto-generated catch block
        e.printStackTrace( );
    }

    return result;
}

The above works - which is great - but it isn't wonderfully efficient:
writing the Document object created by JTidy out to disk and then parsing
it back in again is wasteful. It would be much more efficient just to pass
the Document object straight on to the transformers, like this:

Document swept = sweeper.parseDOM( new FileInputStream( tmp ), null );

try
{
    converter.transform( new DOMSource( swept ),
                         new StreamResult( htmlFile ) );

    splitter.transform( new DOMSource( swept ),
                        new StreamResult( new File( subrep, "index.html" ) ) );

    ...

However, this doesn't work: 'converter' generates output which isn't what
is expected, and 'splitter' generates output as if no transform had been
applied at all.

So, what am I doing wrong here? I thought that TRAX (I'm using Xalan-J
2.7.0) might be marking the Document object as processed during the pass
made by the 'converter' Transformer, so that by the time it reaches the
'splitter' Transformer it is already polluted; but that isn't the case,
because if I reverse the order of the transformations I get exactly the
same output. Anyone?
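
One way to narrow this down, sketched below, is to take JTidy's home-grown
DOM implementation out of the picture: round-trip the tidied tree through an
in-memory buffer and the standard JAXP parser before handing it to the
transformers, instead of going via the temporary file. The rebuild( ) helper
is hypothetical, and it assumes the existing Printer.print( ) will accept any
OutputStream rather than only a FileOutputStream.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/**
 * round-trip a Tidy-produced Document through an in-memory buffer and
 * the standard JAXP parser, so the transformers see a stock DOM
 * implementation rather than JTidy's own (hypothetical helper)
 */
protected Document rebuild( Document tidied ) throws Exception
{
    ByteArrayOutputStream buffer = new ByteArrayOutputStream( );

    /* assumes caxton.print( ) takes any OutputStream; namespace handling
     * is left at the JAXP defaults */
    caxton.print( tidied, buffer );

    return DocumentBuilderFactory.newInstance( ).newDocumentBuilder( )
        .parse( new ByteArrayInputStream( buffer.toByteArray( ) ) );
}

If converter.transform( new DOMSource( rebuild( swept ) ), ... ) then behaves
where new DOMSource( swept ) does not, the problem lies in the Tidy DOM
rather than in anything TRAX does to a Document it has already processed.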
 

Bjoern Hoehrmann

* Simon Brooke wrote in comp.text.xml:
> /* sed is just an instance of my implementation of SED in Java. What
>  * it's doing here is getting rid of the really awful cruft in HTML
>  * generated by MS Word or Oo_O, the sort of cruft that's so bad even
>  * Tidy wouldn't cope with it. */

You might want to try http://home.ccil.org/~cowan/XML/tagsoup/ instead.
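
As a rough sketch of how TagSoup might slot into the pipeline above
(assuming its org.ccil.cowan.tagsoup.Parser is on the classpath): the Parser
implements XMLReader, so it can feed the transformers directly through a
SAXSource, with no Tidy step and no temporary files. Two caveats: a
stream-backed SAXSource can only be consumed once, so each transform needs a
fresh one, and TagSoup puts elements in the XHTML namespace by default, which
the stylesheets would need to allow for.

import java.io.FileInputStream;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;

/* TagSoup's Parser is an XMLReader, so each transform reads the raw
 * heathen file through its own freshly-wrapped SAXSource */
converter.transform(
    new SAXSource( new Parser( ),
                   new InputSource( new FileInputStream( heathen ) ) ),
    new StreamResult( htmlFile ) );

splitter.transform(
    new SAXSource( new Parser( ),
                   new InputSource( new FileInputStream( heathen ) ) ),
    new StreamResult( new File( subrep, "index.html" ) ) );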
 
