XSLT transformation of a large XML file using Java results in OutOfMemory

Discussion in 'XML' started by Lenny Wintfeld, May 17, 2006.

  1. Hi

    I'm attempting additions/changes to a Java program that (among other
    things) uses XSLT to transform a large (96 Mb) XML file. It runs fine on
    small XML files but generates OutOfMemory exceptions with large XML
    files. I tried a simple punt of -Xmx512m but that didn't work. In the
    future, the input XML file may become considerably bigger than 96 MB, so
    even if it did work, it probably would be putting off the inevitable to
    some later date.

    I'm using JavaSE 1.4.2_11 and the XSL/XML libraries that come with it.
    The translation is from and to an xml file. The code I inherited looks a
    lot like most of the example code you can find on the net for doing an
    XSLT transformation. The relevant part is:

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer(xsltSource);
    transformer.transform(new StreamSource(new StringReader(x)), xsltDest);

    where xsltSource wraps the XSLT (generated as a string by the code
    immediately above the snippet), and x is the input XML to be
    transformed.

    Things I tried:

    1. I modified the above code to use a file instead of a String as the
    XML to be transformed and a file for the XSLT that specifies the
    transformation. It works fine with small XML input files but not with
    large ones. I assume this code is using the DOM parser, and there is
    simply not enough room in memory to house the input XML file.
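    For reference, the file-based variant looked roughly like this. (This is
    a self-contained sketch: the tiny stand-in stylesheet and input are just
    so it runs on its own; in the real program these are OCCxsl.xsl and the
    96 MB input file.)

    ```java
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;

    public class FileTransform {
        public static void main(String[] args) throws Exception {
            // Stand-in stylesheet and input so the sketch is self-contained;
            // in the real program these are files on disk already.
            File xsl = File.createTempFile("sheet", ".xsl");
            FileWriter w1 = new FileWriter(xsl);
            w1.write("<xsl:stylesheet version=\"1.0\""
                   + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                   + "<xsl:output omit-xml-declaration=\"yes\"/>"
                   + "<xsl:template match=\"/doc\">"
                   + "<out><xsl:value-of select=\".\"/></out>"
                   + "</xsl:template></xsl:stylesheet>");
            w1.close();
            File in = File.createTempFile("input", ".xml");
            FileWriter w2 = new FileWriter(in);
            w2.write("<doc>hello</doc>");
            w2.close();
            File out = File.createTempFile("output", ".xml");

            // Both the stylesheet and the document arrive as streams over
            // files, so the input is at least not also duplicated as a
            // giant in-memory String before parsing.
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer(new StreamSource(xsl));
            transformer.transform(new StreamSource(in), new StreamResult(out));

            BufferedReader r = new BufferedReader(new FileReader(out));
            System.out.println(r.readLine());   // the transformed document
            r.close();
        }
    }
    ```
    
    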

    2. Based on some old (years-old) newsgroup posts I found, I tried using
    a SAX equivalent of the above code, assuming that SAX takes in, parses
    and transforms the input XML file piecemeal (maybe element by
    element?), or that SAX uses the complete virtual memory of the computer.
    But this code also results in successful runs on small input XML files
    and OutOfMemory errors on large ones. Here is a snip of the SAX code
    (adapted from a chapter of Burke's "XSLT and Java" at the O'Reilly
    website):

    FileInputStream brXSLT = new FileInputStream(
        "C:/Documents and Settings/Lenny/Desktop/OCCxsl.xsl");

    // Set up the transformer
    TransformerFactory transFact = TransformerFactory.newInstance();
    SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;
    Source xsltSource = new StreamSource(brXSLT);
    TransformerHandler transHand = saxTransFact.newTransformerHandler(xsltSource);

    // Set up the input source
    InputSource inxml = new InputSource(inXML);
    SAXSource saxSource = new SAXSource(inxml);

    // Set the destination for the XSLT transformation
    transHand.setResult(new StreamResult(outXML));

    // Attach the XSLT processor to the XMLReader
    String parserClass = "org.apache.crimson.parser.XMLReaderImpl";
    XMLReader reader = XMLReaderFactory.createXMLReader(parserClass);

    // Parse the input file to an output file
    reader.setContentHandler(transHand);
    reader.parse(inxml);


    I'm considering making a custom parser of the input XML file which
    basically identifies elements of the input XML file and treats each
    element as if it were a complete document, e.g. send the content handler
    ch.startDocument()
    ch.startElement(..) // pass through the original element
    ch.characters(..) // "
    ch.endElement(..) // "
    ch.endDocument()
    for each element in the input XML file.

    But being a newbie to XSLT, I don't know if this is worth pursuing, or
    even if it would work; I'm hoping there are simpler, more straightforward
    ways of accomplishing the same thing at a higher level. It does seem
    pretty clumsy, even if it would work.
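    If it helps make the idea concrete, here is an untested sketch of that
    event-rewriting as a SAX filter (class and names are mine, purely for
    illustration). Note that a downstream XSLT TransformerHandler would
    presumably still have to be re-created for each mini-document:

    ```java
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.helpers.XMLFilterImpl;
    import javax.xml.parsers.SAXParserFactory;
    import java.io.StringReader;

    // A SAX filter that treats each child of the root element as if it were
    // a complete document: it fires startDocument()/endDocument() around
    // every depth-1 element it passes downstream.
    public class PerElementFilter extends XMLFilterImpl {
        private int depth = 0;

        public PerElementFilter(XMLReader parent) { super(parent); }

        public void startDocument() {}   // swallow the real document events
        public void endDocument() {}

        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            depth++;
            if (depth == 2) super.startDocument();  // child opens a "document"
            if (depth >= 2) super.startElement(uri, local, qName, atts);
        }

        public void characters(char[] ch, int start, int len)
                throws SAXException {
            if (depth >= 2) super.characters(ch, start, len);
        }

        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if (depth >= 2) super.endElement(uri, local, qName);
            if (depth == 2) super.endDocument();    // child closes it again
            depth--;
        }

        public static void main(String[] args) throws Exception {
            XMLReader parent =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            PerElementFilter filter = new PerElementFilter(parent);
            final int[] docs = { 0 };
            // Downstream handler just counts the synthetic documents here;
            // in the real thing it would be a TransformerHandler.
            filter.setContentHandler(new DefaultHandler() {
                public void startDocument() { docs[0]++; }
            });
            filter.parse(new InputSource(
                new StringReader("<root><a>1</a><b>2</b><c>3</c></root>")));
            System.out.println("documents seen: " + docs[0]);
        }
    }
    ```
    
    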

    I found a reply on the web to someone who had a similar problem, to the
    effect that a "SAX pipeline" should be used. But there was no further
    elaboration, and so far I haven't figured out what a SAX pipeline is or
    how it would help.

    Any advice, references to examples, or actual examples would be
    greatly appreciated.

    Non-procedural programming is taking quite a bit of effort to
    understand!

    Thanks in advance for your help.

    Lenny Wintfeld

    ps - I've had this up on comp.lang.java.programmer for most of the day
    with no replies. It bridges both specialties, that's why I'm trying
    here.
    Lenny Wintfeld, May 17, 2006
    #1

  2. In general, XSLT can't operate as a streaming processor, since its use
    of XPaths assumes the entire document is available in memory (or at
    least can be re-read) at once. Some processors use more compact models
    than others and thus may be able to handle larger documents in the same
    memory; this is part of why Xalan created its own model, known as DTM,
    rather than using an off-the-shelf DOM implementation.

    If you're willing to limit the kinds of stylesheets you write to ones
    which _only_ process the document in forward order, you can of course
    set up a minimal data model which just contains one (or a few) nodes;
    Xalan's SQL extension works that way, actually.

    Yes, automatically recognizing which stylesheets (or portions thereof)
    are streamable would be a Good Thing, but it's still something of a Holy
    Grail for XSLT implementers. If you look in the archives of the Xalan
    mailing list, you'll see much past discussion of this, and of possible
    approaches to dealing with it. Look in particular for the keywords
    "streaming", "pruning", and "filtering". Folks are continuing to
    research this, but it is not an easy problem.

    But until someone does get a handle on this problem... Sometimes, if you
    have to process large documents, the only good answer is to drop down
    from XSLT to a lower level and code the processing yourself as a direct
    SAX application. That lets you take advantage of whatever
    streaming/pruning/filtering opportunities exist, as well as letting you
    code a special-purpose (and thus more compact) model for any data you do
    have to retain. High-level languages are a good thing, but some problems
    are still best addressed by low-level bit-twiddling.
    Joe Kesselman, May 17, 2006
    #2

  3. Peter Flynn

    Joe Kesselman wrote:
    > In general, XSLT can't operate as a streaming processor, since its use
    > of XPaths assumes the entire document is available in memory (or at
    > least can be re-read) at once. Some processors use more compact models
    > than others and thus may be able to handle larger documents in the same
    > memory; this is part of why Xalan created its own model, known as DTM,
    > rather than using an off-the-shelf DOM implementation.


    Perhaps it's appropriate to mention Omnimark, which uses a technique
    sometimes known as "write-behind" (borrowed from the hardware field).
    Instead of an addressing scheme (XPath) for accessing objects out of
    document sequence, it lets you place references to named anchors at the
    points where you know (or have computed) you will need such objects, and
    then create the anchors themselves when you encounter them in document
    order. When the last event in document order has triggered, the
    "write-behind" reconciliation takes place, and the values of the anchors
    are slotted into the places the references reserved for them.

    (At least, this is how it used to work: I haven't used it for years.)

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, May 17, 2006
    #3
  4. Lenny Wintfeld

    Thanks very much for your reply and advice. It's a shame that the XSL
    transform engines can't (at least as an option) use virtual memory as
    their target environment for XML data file transformations. It looks
    like I may have a long row to hoe in doing the equivalent of the
    transform using procedural code! The sad part is, the transformations
    that are done to these XML files using XSLT seem custom-made for XSLT!

    Just a couple of quick follow-ups:

    1. Note that the transformation being done is XML to XML. Except for a
    sort, which could be broken out of the XSLT stylesheet and done
    procedurally after the transformation is complete, all other
    transformations in the stylesheet are local to small elements in the XML
    being transformed, and there are no dependencies between them. With
    those restrictions, is there a way to mechanize a sequential
    (element-by-element) transformation? If so, could you point me to some
    examples?

    2. I'm tantalized by the reference in my original post to a suggestion
    that a "SAX pipeline" be used to process very large XML files. To me
    that sounds like a sequential processor of XML with XSLT. Do you know
    where I could get additional info on a "SAX pipeline", or might this
    have been some wishful thinking on the part of its author?

    Once again, thanks for your feedback.

    Lenny Wintfeld, May 19, 2006
    #4
  5. Jürgen Kahrs wrote:

    > Just a couple of quick follow-ups: 1. Note that the transformation
    > being done is XML to XML. Except for a sort, which could be broken
    > out of the XSLT stylesheet and done procedurally after the
    > transformation is complete, all other transformations in the stylesheet
    > are local to small elements in the XML being transformed, and there are
    > no dependencies between them. With those restrictions, is there a way
    > to mechanize a sequential (element-by-element) transformation? If so,
    > could you point me to some examples? 2. I'm tantalized by the reference


    It sounds like your focus is on large files (> 100 MB) and you may be
    willing to give up XSL and Java in order to solve the problem. The
    following tool is not so specialized in producing XML files, but it can
    handle 1 GB of data within 1 or 2 minutes:

    http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Printing-an-outline-of-an-XML-file

    > that I noted in my original post to a suggestion that a "SAX Pipeline"
    > be used to process very large XML files. To me that sounds like a
    > sequential processor of XML with XSLT. Do you know where I could get
    > additional info on a "SAX pipeline", or might this have been some
    > wishful thinking on the part of its author?


    Maybe this one helps:

    Pipestreaming microformats
    http://www-128.ibm.com/developerworks/xml/library/x-matters44.html
    Jürgen Kahrs, May 19, 2006
    #5
  6. Joe Kesselman wrote:
    > Thanks very much for your reply and advice. It's a shame that the XSL
    > transform engines can't (at least as an option) use virtual memory as
    > their target environment for xml data file transformations.


    Generally, XSLT transformers *will* use virtual memory if the language
    they're running in and the operating system they're running on support
    it -- they just don't try to do the memory management themselves; they
    trust the system to do it for them. And in fact Java does use virtual
    memory... but the JVM you're using won't let you set that limit high
    enough for this particular document.

    > It looks
    > like I may have a long row to hoe in doing the equivalent of the
    > transform using procedural code! The sad part is, the transformations
    > that are done to these XML files using XSLT seem to be custom made for
    > XSLT!


    I know how you feel. All I can say is that I know folks who are working
    on finding ways to address this, so In The Future Things Should Be
    Better. The concepts are relatively straightforward; the hard part is
    translating them into rules the machine can apply.

    > transformation is complete, all other transformations in the stylesheet
    > are local to small elements in the xml being transformed and there are
    > no dependencies between these. With those restrictions, is there a way
    > to mechanize a sequential (element-by-element) transformation?


    I agree that this is exactly the kind of problem that ought to be
    streamable... There's no portable way to leverage that, but a specific
    XSLT processor may have a way to handle it. To take the example I know
    best: Xalan's internal data representation does happen to have the
    ability to "prune off" the most recently added nodes, so an explicit
    call to an extension function could, theoretically, discard the element
    once you're done processing it. In fact, one of Xalan's more obscure and
    underdocumented extensions does discard trees, though only in specific
    situations; we added that to handle the
    foreach-over-a-list-of-document()s situation... but I don't think
    there's a generalized version which would address your case. (We'd
    started investigating one, actually, then Other Priorities Intervened.)

    > could you point me to some examples? 2. I'm tantalized by the reference
    > that I noted in my original post to a suggestion that a "SAX Pipeline"
    > be used to process very large XML files. To me that sounds like a
    > sequential processor of XML with XSLT.


    I think that was probably intended to be a reference to hand-coded SAX
    processing.

    But actually, you *could* do a compromise: hand-code a SAX processor
    which essentially breaks the large document up into a series of smaller
    ones and runs XSLT transforms on each one via its API (e.g. TrAX, if
    you're working in Java), then reassembles the output of those
    transformations into a single document again.
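    A rough, untested sketch of that compromise (stylesheet, element names,
    and the hard-coded chunks are invented for illustration; the real thing
    would slice the records out of the stream with a SAX handler):

    ```java
    import javax.xml.transform.Templates;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.StringReader;
    import java.io.StringWriter;

    // Each top-level record is treated as a small standalone document,
    // transformed on its own, and the results are stitched back together
    // under a single root element.
    public class ChunkedTransform {
        public static void main(String[] args) throws Exception {
            // Stand-in stylesheet: just rewraps a <rec> element.
            String xslt =
                "<xsl:stylesheet version='1.0'"
              + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
              + "<xsl:output omit-xml-declaration='yes'/>"
              + "<xsl:template match='/rec'>"
              + "<out><xsl:value-of select='.'/></out>"
              + "</xsl:template></xsl:stylesheet>";

            // In real life a SAX handler would produce these one at a time
            // from the large file; hard-coded here for the sketch.
            String[] chunks = { "<rec>a</rec>", "<rec>b</rec>", "<rec>c</rec>" };

            // Compile the stylesheet once, run it once per record, so only
            // one small document is ever materialized at a time.
            TransformerFactory tf = TransformerFactory.newInstance();
            Templates templates =
                tf.newTemplates(new StreamSource(new StringReader(xslt)));
            StringBuffer result = new StringBuffer("<result>");
            for (int i = 0; i < chunks.length; i++) {
                Transformer t = templates.newTransformer();
                StringWriter w = new StringWriter();
                t.transform(new StreamSource(new StringReader(chunks[i])),
                            new StreamResult(w));
                result.append(w.toString());
            }
            result.append("</result>");
            System.out.println(result.toString());
        }
    }
    ```
    
    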

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, May 20, 2006
    #6
  7. Lenny Wintfeld

    Jürgen, I looked at your reference to xmlgawk in some detail, and it
    seems pretty encouraging; not only for the problem I stated, but for
    web tie-ins on XML data. I will look at your document in more detail
    and at the references (especially XMLBooster, xmllib and Expat). But in
    the meantime, could you let me know directly, or provide me with some
    info on, the following: how would I tie xmlgawk in to my primary
    application(s) in Java? Would I do the equivalent of an exec(..) of the
    awk processor and then look for an exit code, or is there a library that
    ties it in more directly (similar to the XSLT library for Java)?
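    What I have in mind for the exec route looks roughly like this (the
    echo command is a stand-in so the sketch runs anywhere; the real call
    would be gawk with the script and input file):

    ```java
    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class ExecSketch {
        public static void main(String[] args) throws Exception {
            // Stand-in command; the real call would be something like
            //   new String[] { "gawk", "-f", "myscript.awk", "input.xml" }
            // (script and file names invented for illustration).
            Process p = Runtime.getRuntime().exec(
                new String[] { "echo", "stand-in for gawk output" });

            // Stream the external processor's stdout back into Java.
            BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }

            // Then check the exit code, as with a real exec of gawk.
            int exit = p.waitFor();
            System.out.println("exit code: " + exit);
        }
    }
    ```
    
    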

    I'm looking forward to seeing if xmlgawk would be a reasonable half
    step between purely procedural code and XSLT; either permanently, or
    until XSLT can handle the kinds of XML files I'm called on to process.

    Thanks for the reference!

    Lenny W.
    Lenny Wintfeld, May 22, 2006
    #7
