trax, and transforms from a DOMSource


Simon Brooke

Consider this Java fragment, part of an application which takes crufty
documents in MS Word's and Oo_O's excuses for HTML and produces a
standardised, clean presentation in both HTML and PDF:

/**
 * a map of my substitutions, loaded from the file in my resources which
 * contains my substitution specifications
 */
protected Map substitutions = null;

/** the DOM printer I'm going to use */
protected Printer caxton = new Printer( );

/**
 * a SedBuffer to knock out the worst cruft from MS Word and Oo_O
 * generated HTML
 */
protected SedBuffer sed = new SedBuffer( );

/** a tidy parser to load messy HTML as a document */
protected Tidy sweeper = new Tidy( );

/**
 * A transformer to be preloaded with the XSL file in my resources to use
 * for converting the heathen into PDF
 */
protected Transformer converter = null;

/**
 * A transformer to be preloaded with the XSL file in my resources to use
 * for splitting the heathen into web-servable units
 */
protected Transformer splitter = null;


/**
 * convert the heathen
 *
 * @param heathen the foreign file to convert
 *
 * @return the base name of the conversion
 */
public String convert( File heathen )
    throws IOException, TransformerException, SubstitutionException
{
    String result = toBaseName( heathen.getName( ) );
    File htmlFile = new File( repository, result + ".html" );
    File pdfFile = new File( repository, result + ".pdf" );
    File sweptFile = File.createTempFile( result, ".swept" );

    File subrep = new File( repository, result );

    if ( !subrep.mkdir( ) )
    {
        throw new IOException(
            "could not create sub-directory within repository" );
    }

    File tmp = File.createTempFile( result, ".conv" );

    /* sed is just an instance of my implementation of SED in Java. What
     * it's doing here is getting rid of the really awful cruft in HTML
     * generated by MS Word or Oo_O, the sort of cruft that's so bad even
     * Tidy wouldn't cope with it. */
    sed.substitute( new FileInputStream( heathen ),
                    new FileOutputStream( tmp ), substitutions );

    /* sweeper is an instance of Andy Quick and Dave Raggett's JTidy -
     * it knocks the remaining cruft out of foreign HTML, and produces
     * a DOM object */
    Document swept = sweeper.parseDOM( new FileInputStream( tmp ), null );

    try
    {
        /* caxton is my own recursive descent DOM pretty-printer - it dates
         * back to 1999, before the days of TRAX. It's reliable, if not
         * perfect.
         * http://www.weft.co.uk/library/jacquard/documentation/uk/co/weft/domutil/Printer.html
         */
        caxton.print( swept, new FileOutputStream( sweptFile ) );

        converter.transform( new StreamSource( sweptFile ),
                             new StreamResult( htmlFile ) );

        splitter.transform( new StreamSource( sweptFile ),
                            new StreamResult( new File( subrep, "index.html" ) ) );

        StringBuffer commandString = new StringBuffer( "prince -s " );

        commandString.append( resourceDir ).append( File.separatorChar );
        commandString.append( "paperback.css" ).append( ' ' );
        commandString.append( htmlFile.getCanonicalPath( ) ).append( ' ' );
        commandString.append( pdfFile.getCanonicalPath( ) );

        /* pass the result off to Prince for final formatting to PDF */
        Runtime.getRuntime( ).exec( commandString.toString( ) );
        System.err.println( "Finished" );
    }
    catch ( Exception e )
    {
        // TODO Auto-generated catch block
        e.printStackTrace( );
    }

    return result;
}

The above works - which is great - but it isn't wonderfully efficient:
writing the Document object created by JTidy out to disk and then parsing
it back in again is wasteful. It would be much more efficient just to pass
the Document object straight on to the transformers, like this:

Document swept = sweeper.parseDOM( new FileInputStream( tmp ), null );

try
{
    converter.transform( new DOMSource( swept ),
                         new StreamResult( htmlFile ) );

    splitter.transform( new DOMSource( swept ),
                        new StreamResult( new File( subrep, "index.html" ) ) );

    ...

However, this doesn't work: 'converter' generates output which isn't what
is expected, and 'splitter' generates output as if no transform had been
applied at all.

So, what am I doing wrong here? I thought that TRAX (I'm using Xalan-J
2.7.0) might be marking the Document object as processed during the pass
made by the 'converter' Transformer, so that by the time it reaches the
'splitter' Transformer it is already polluted; but that isn't the case,
because if I reverse the order of the transformations I get exactly the
same output. Anyone?
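
One way to narrow this down, sketched below, is to take JTidy's home-grown
DOM implementation out of the picture: round-trip the tidied tree through an
in-memory buffer and the standard JAXP parser before handing it to the
transformers, instead of going via the temporary file. The rebuild( ) helper
is hypothetical, and it assumes the existing Printer.print( ) will accept any
OutputStream rather than only a FileOutputStream.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/**
 * round-trip a Tidy-produced Document through an in-memory buffer and
 * the standard JAXP parser, so the transformers see a stock DOM
 * implementation rather than JTidy's own (hypothetical helper)
 */
protected Document rebuild( Document tidied ) throws Exception
{
    ByteArrayOutputStream buffer = new ByteArrayOutputStream( );

    /* assumes caxton.print( ) takes any OutputStream; namespace handling
     * is left at the JAXP defaults */
    caxton.print( tidied, buffer );

    return DocumentBuilderFactory.newInstance( ).newDocumentBuilder( )
        .parse( new ByteArrayInputStream( buffer.toByteArray( ) ) );
}

If converter.transform( new DOMSource( rebuild( swept ) ), ... ) then behaves
where new DOMSource( swept ) does not, the problem lies in the Tidy DOM
rather than in anything TRAX does to a Document it has already processed.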
 

Bjoern Hoehrmann

* Simon Brooke wrote in comp.text.xml:
> /* sed is just an instance of my implementation of SED in Java. What
>  * it's doing here is getting rid of the really awful cruft in HTML
>  * generated by MS Word or Oo_O, the sort of cruft that's so bad even
>  * Tidy wouldn't cope with it. */

You might want to try http://home.ccil.org/~cowan/XML/tagsoup/ instead.
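
As a rough sketch of how TagSoup might slot into the pipeline above
(assuming its org.ccil.cowan.tagsoup.Parser is on the classpath): the Parser
implements XMLReader, so it can feed the transformers directly through a
SAXSource, with no Tidy step and no temporary files. Two caveats: a
stream-backed SAXSource can only be consumed once, so each transform needs a
fresh one, and TagSoup puts elements in the XHTML namespace by default, which
the stylesheets would need to allow for.

import java.io.FileInputStream;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;

/* TagSoup's Parser is an XMLReader, so each transform reads the raw
 * heathen file through its own freshly-wrapped SAXSource */
converter.transform(
    new SAXSource( new Parser( ),
                   new InputSource( new FileInputStream( heathen ) ) ),
    new StreamResult( htmlFile ) );

splitter.transform(
    new SAXSource( new Parser( ),
                   new InputSource( new FileInputStream( heathen ) ) ),
    new StreamResult( new File( subrep, "index.html" ) ) );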
 
