large XML files

Discussion in 'Java' started by Roedy Green, Feb 7, 2010.

  1. Roedy Green

    Roedy Green Guest

    It seems to me the usual XML tools in Java load the entire XML file
    into RAM. Are there any tools that process sequentially, bringing in
    only a chunk at a time, so you could handle really fat files?
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Every compilable program in a sense works. The problem is with your unrealistic expectations of what it will do.
     
    Roedy Green, Feb 7, 2010
    #1

  2. On 7.2.2010 19:59, Roedy Green wrote:
    > It seems to me the usual XML tools in Java load the entire XML file
    > into RAM. Are there any tools that process sequentially, bringing in
    > only a chunk at a time so you could handle really fat files.


    Java has tools for such XML files. SAX processes XML so that it does not
    need to load it all into memory.
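
    A minimal sketch of what that looks like in practice (the element names here are invented for illustration; the APIs are the standard javax.xml.parsers and org.xml.sax ones):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCount {
    /** Count <order> elements without ever holding the whole document in memory. */
    static int countOrders(String xml) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("order".equals(qName)) count[0]++;   // react to the event, retain nothing
            }
        };
        // The parser pushes events at the handler as it reads the stream.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countOrders("<orders><order id='1'/><order id='2'/></orders>")); // prints 2
    }
}
```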

    --
    Good day for a change of scene. Repaper the bedroom wall.
     
    Donkey Hottie, Feb 7, 2010
    #2

  3. In article <>,
    Roedy Green <> wrote:

    > It seems to me the usual XML tools in Java load the entire XML file
    > into RAM. Are there any tools that process sequentially, bringing in
    > only a chunk at a time so you could handle really fat files.



    I thought that was a principal advantage of the Simple API For XML (SAX)
    model, at least in principle. :)

    <http://www.totheriver.com/learn/xml/xmltutorial.html>

    --
    John B. Matthews
    trashgod at gmail dot com
    <http://sites.google.com/site/drjohnbmatthews>
     
    John B. Matthews, Feb 7, 2010
    #3
  4. On 7.2.2010 20:14, Peter Duniho wrote:
    > Roedy Green wrote:
    >> It seems to me the usual XML tools in Java load the entire XML file
    >> into RAM. Are there any tools that process sequentially, bringing in
    >> only a chunk at a time so you could handle really fat files.

    >
    > Sounds like you want the XMLStreamReader interface:
    > http://java.sun.com/javase/6/docs/api/javax/xml/stream/XMLStreamReader.html
    >
    > I haven't used the Java version myself (there's a similar type in .NET),
    > and haven't looked closely enough to determine the specifics. But I presume
    > there's a way to get an implementation of the interface (looks like
    > XMLInputFactory is the way to go).
    >
    > Of course, if per a previous discussion you're stuck on Java 1.5, this
    > is unavailable to you. But otherwise, you should find it to be exactly what
    > you're asking for.
    >
    > Pete


    The SAX interface works fine even with Java 1.4, and it does what Roedy wants.
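
    For reference, Pete's XMLStreamReader suggestion might be sketched like this (element names invented; XMLInputFactory is how you obtain a reader in the javax.xml.stream API):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxPull {
    /** Pull events one at a time; only the current event is in memory. */
    static int countElements(String xml, String name) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (r.hasNext()) {                     // caller drives the parse (pull model)
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && name.equals(r.getLocalName())) {
                count++;
            }
        }
        r.close();
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><b/><c/></a>", "b")); // prints 2
    }
}
```

    Unlike SAX's callback (push) model, StAX lets the application ask for the next event when it is ready for it, which often makes the control flow easier to follow.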


    --
    Good day for a change of scene. Repaper the bedroom wall.
     
    Donkey Hottie, Feb 7, 2010
    #4
  5. Roedy Green

    Arne Vajhøj Guest

    On 07-02-2010 12:59, Roedy Green wrote:
    > It seems to me the usual XML tools in Java load the entire XML file
    > into RAM.


    ????

    W3CDOM and JAXB do load all data in memory.

    SAX and StAX do not load all data in memory.

    Arne
     
    Arne Vajhøj, Feb 7, 2010
    #5
  6. Roedy Green

    Lew Guest

    On 2/7/2010 1:20 PM, Donkey Hottie wrote:
    > On 7.2.2010 20:14, Peter Duniho wrote:
    >> Roedy Green wrote:
    >>> It seems to me the usual XML tools in Java load the entire XML file
    >>> into RAM. Are there any tools that process sequentially, bringing in
    >>> only a chunk at a time so you could handle really fat files.

    >>
    >> Sounds like you want the XMLStreamReader interface:
    >> http://java.sun.com/javase/6/docs/api/javax/xml/stream/XMLStreamReader.html
    >>
    >> I haven't used the Java version myself (there's a similar type in .NET),
    >> and haven't looked closely enough to determine the specifics. But I presume
    >> there's a way to get an implementation of the interface (looks like
    >> XMLInputFactory is the way to go).
    >>
    >> Of course, if per a previous discussion you're stuck on Java 1.5, this
    >> is unavailable to you. But otherwise, you should find it to be exactly what
    >> you're asking for.
    >>
    >> Pete

    >
    > SAX interface works fine even with Java 1.4, and it does what Roedy wants.


    It's been around since Java 1.2; it had better work with 1.4.

    --
    Lew
     
    Lew, Feb 7, 2010
    #6
  7. Roedy Green

    Lew Guest

    Roedy Green wrote:
    >> It seems to me the usual XML tools in Java load the entire XML file
    >> into RAM. Are there any tools that process sequentially, bringing in
    >> only a chunk at a time so you could handle really fat files.


    Donkey Hottie wrote:
    > Java has tools for such XML files. SAX processes XML so that it does not
    > need to load it all to memory.


    I first used SAX for XML parsing in early 1999. There's nothing new
    about it.

    SAX, and its equally handy StAX sibling, are perfect for single-pass,
    very-high-speed, memory-parsimonious handling of XML documents.

    Roedy has an interesting definition of "usual XML tools", since he's
    ignoring two out of three interfaces, including one that's been around
    nearly forever.

    --
    Lew
     
    Lew, Feb 7, 2010
    #7
  8. Roedy Green

    Arne Vajhøj Guest

    On 07-02-2010 15:31, Lew wrote:
    > On 2/7/2010 1:20 PM, Donkey Hottie wrote:
    >> On 7.2.2010 20:14, Peter Duniho wrote:
    >>> Roedy Green wrote:
    >>>> It seems to me the usual XML tools in Java load the entire XML file
    >>>> into RAM. Are there any tools that process sequentially, bringing in
    >>>> only a chunk at a time so you could handle really fat files.
    >>>
    >>> Sounds like you want the XMLStreamReader interface:
    >>> http://java.sun.com/javase/6/docs/api/javax/xml/stream/XMLStreamReader.html
    >>>
    >>>
    >>> I haven't used the Java version myself (there's a similar type in .NET),
    >>> and haven't looked closely enough to determine the specifics. But I presume
    >>> there's a way to get an implementation of the interface (looks like
    >>> XMLInputFactory is the way to go).
    >>>
    >>> Of course, if per a previous discussion you're stuck on Java 1.5, this
    >>> is unavailable to you. But otherwise, you should find it to be exactly what
    >>> you're asking for.
    >>>
    >>> Pete

    >>
    >> SAX interface works fine even with Java 1.4, and it does what Roedy
    >> wants.

    >
    > It's been around since Java 1.2; it better work with 1.4.


    Yes and no.

    SAX was added to the Java API in 1.4.

    The JAXP API, including SAX, existed before Java 1.4, and
    libraries implementing it could be downloaded separately.

    I did the latter for Java 1.3, and it may already have
    existed for 1.2.

    Arne
     
    Arne Vajhøj, Feb 7, 2010
    #8
  9. Arne Vajhøj wrote:
    > On 07-02-2010 12:59, Roedy Green wrote:
    >> It seems to me the usual XML tools in Java load the entire XML file
    >> into RAM.

    >
    > ????
    >
    > W3CDOM and JAXB do load all data in memory.
    >
    > SAX and StAX do not load all data in memory.


    If you use XSLT to process an XML file, it has to keep a complete
    representation of the XML document in memory, since an XSLT
    transformation can include XPath expressions, and XPath can in principle
    access anything in the document. This is true even if the input to XSLT is
    a SAXSource.
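
    For reference, this is the kind of pipeline Mike means: a SAXSource handed to a javax.xml.transform Transformer (an identity transform is shown as a sketch; as he notes, the processor is still free to materialize the full document internally):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;

public class SaxSourceTransform {
    /** Run an identity transform whose input arrives as SAX events. */
    static String identity(String xml) throws Exception {
        // No stylesheet given, so newTransformer() yields the identity transform.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        // Even with a SAXSource, a stylesheet using XPath can force the
        // processor to buffer the whole document before producing output.
        t.transform(new SAXSource(new InputSource(new StringReader(xml))),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(identity("<a><b/></a>"));
    }
}
```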
     
    Mike Schilling, Feb 7, 2010
    #9
  10. Roedy Green

    Arne Vajhøj Guest

    On 07-02-2010 16:37, Mike Schilling wrote:
    > Arne Vajhøj wrote:
    >> On 07-02-2010 12:59, Roedy Green wrote:
    >>> It seems to me the usual XML tools in Java load the entire XML file
    >>> into RAM.

    >>
    >> ????
    >>
    >> W3CDOM and JAXB do load all data in memory.
    >>
    >> SAX and StAX do not load all data in memory.

    >
    > If you use XSLT to process an XML file, it has to keep a complete
    > representation of the resulting XML document into memory, since an XSLT
    > transformation can include XPath expressions, and XPath can in principle
    > access anything in the document. This is true even if the input to XSLT is
    > a SAXSource.


    True.

    But that problem is very hard to solve.

    Arne
     
    Arne Vajhøj, Feb 7, 2010
    #10
  11. Roedy Green

    Tom Anderson Guest

    On Sun, 7 Feb 2010, Mike Schilling wrote:

    > Arne Vajhøj wrote:
    >> On 07-02-2010 12:59, Roedy Green wrote:
    >>> It seems to me the usual XML tools in Java load the entire XML file
    >>> into RAM.

    >>
    >> ????
    >>
    >> W3CDOM and JAXB do load all data in memory.
    >>
    >> SAX and StAX do not load all data in memory.

    >
    > If you use XSLT to process an XML file, it has to keep a complete
    > representation of the resulting XML document into memory, since an XSLT
    > transformation can include XPath expressions, and XPath can in principle
    > access anything in the document. This is true even if the input to
    > XSLT is a SAXSource.


    Weeeellll, kinda. Some XSLTs will require the whole document to be held in
    memory. But it is possible to process some XSLTs in a streaming or
    streaming-ish manner (where elements are held in memory, but only a subset
    at a time). There's nothing stopping an XSLT processor compiling such
    XSLTs into a form which does just that. Whether any actually do, i don't
    know.

    A while ago, i read about a streaming XPath processor. It couldn't handle
    all XPaths in a streaming manner, so it had to fall back to searching an
    in-memory tree where that was the case, but many common XPaths can be
    handled streamingly. For instance, something like:

    //order[@id='99']/order-item

    Could be. You run the parse, and maintain the current stack of elements in
    memory - all the elements enclosing the current parse point, IYSWIM. Then
    you just look at the top of the stack at every point to see if it's an
    order-item, then if it is, look back to see if the enclosing order has an
    id of 99. You could probably do it more efficiently than that, but that's
    one way you could do it. Something like this:

    //order[customer[@id='99']]/order-item

    Is more challenging, and requires a more sophisticated evaluation strategy
    - you might need to read in a whole order, search it for matching
    order-items, then throw it away and move on to the next one. Or, if you
    knew from the DTD that the customer element had to come before any
    order-items in an order, you could build a state machine that could decide
    that it was inside a matching order, and then report all order-items.
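
    Tom's first expression could be sketched over SAX roughly as follows (a sketch that assumes order elements do not nest; the element names are from his example):

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingMatch {
    /** Streaming equivalent of //order[@id='99']/order-item: count matches
     *  while keeping only the stack of enclosing elements in memory. */
    static int matchCount(String xml) throws Exception {
        final Deque<String> stack = new ArrayDeque<>();  // ancestors of the parse point
        final int[] hits = {0};
        DefaultHandler h = new DefaultHandler() {
            boolean inOrder99;
            @Override public void startElement(String u, String l, String q, Attributes a) {
                if ("order".equals(q) && "99".equals(a.getValue("id"))) inOrder99 = true;
                if ("order-item".equals(q) && inOrder99
                        && "order".equals(stack.peek())) hits[0]++;
                stack.push(q);
            }
            @Override public void endElement(String u, String l, String q) {
                stack.pop();
                if ("order".equals(q)) inOrder99 = false;  // assumes orders don't nest
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
        return hits[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orders>"
                + "<order id='98'><order-item/></order>"
                + "<order id='99'><order-item/><order-item/></order>"
                + "</orders>";
        System.out.println(matchCount(xml)); // prints 2
    }
}
```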

    Anyway, all speculation, but it's interesting stuff!

    tom

    --
    Dreams are not covered by any laws. They can be about anything. --
    Cmdr Zorg
     
    Tom Anderson, Feb 7, 2010
    #11
  12. Roedy Green

    Tom Anderson Guest

    On Sun, 7 Feb 2010, Roedy Green wrote:

    > It seems to me the usual XML tools in Java load the entire XML file into
    > RAM. Are there any tools that process sequentially, bringing in only a
    > chunk at a time so you could handle really fat files.


    What do you mean by 'tools'?

    tom

    --
    Dreams are not covered by any laws. They can be about anything. --
    Cmdr Zorg
     
    Tom Anderson, Feb 7, 2010
    #12
  13. Tom Anderson wrote:
    > On Sun, 7 Feb 2010, Mike Schilling wrote:
    >
    >> Arne Vajhøj wrote:
    >>> On 07-02-2010 12:59, Roedy Green wrote:
    >>>> It seems to me the usual XML tools in Java load the entire XML file
    >>>> into RAM.
    >>>
    >>> ????
    >>>
    >>> W3CDOM and JAXB do load all data in memory.
    >>>
    >>> SAX and StAX do not load all data in memory.

    >>
    >> If you use XSLT to process an XML file, it has to keep a complete
    >> representation of the resulting XML document into memory, since an
    >> XSLT transformation can include XPath expressions, and XPath can in
    >> principle access anything in the document. This is true even if
    >> the input to XSLT is a SAXSource.

    >
    > Weeeellll, kinda. Some XSLTs will require the whole document to be
    > held in memory. But it is possible to process some XSLTs in a
    > streaming or streaming-ish manner (where elements are held in memory,
    > but only a subset at a time). There's nothing stopping an XSLT
    > processor compiling such XSLTs into a form which does just that.
    > Whether any actually do, i don't know.


    Xalan (the XSLT processor in the JDK) doesn't.
     
    Mike Schilling, Feb 8, 2010
    #13
  14. Roedy Green

    Arne Vajhøj Guest

    On 07-02-2010 17:25, Tom Anderson wrote:
    > On Sun, 7 Feb 2010, Mike Schilling wrote:
    >> Arne Vajhøj wrote:
    >>> On 07-02-2010 12:59, Roedy Green wrote:
    >>>> It seems to me the usual XML tools in Java load the entire XML file
    >>>> into RAM.
    >>>
    >>> ????
    >>>
    >>> W3CDOM and JAXB do load all data in memory.
    >>>
    >>> SAX and StAX do not load all data in memory.

    >>
    >> If you use XSLT to process an XML file, it has to keep a complete
    >> representation of the resulting XML document into memory, since an
    >> XSLT transformation can include XPath expressions, and XPath can in
    >> principle access anything in the document. This is true even if the
    >> input to XSLT is a SAXSource.

    >
    > Weeeellll, kinda. Some XSLTs will require the whole document to be held
    > in memory. But it is possible to process some XSLTs in a streaming or
    > streaming-ish manner (where elements are held in memory, but only a
    > subset at a time). There's nothing stopping an XSLT processor compiling
    > such XSLTs into a form which does just that. Whether any actually do, i
    > don't know.
    >
    > A while ago, i read about a streaming XPath processor. It couldn't
    > handle all XPaths in a streaming manner, so it had to fall back to
    > searching an in-memory tree where that was the case, but many common
    > XPaths can be handled streamingly. For instance, something like:
    >
    > //order[@id='99']/order-item
    >
    > Could be. You run the parse, and maintain the current stack of elements
    > in memory - all the elements enclosing the current parse point, IYSWIM.
    > Then you just look at the top of the stack at every point to see if it's
    > an order-item, then if it is, look back to see if the enclosing order
    > has an id of 99. You could probably do it more efficiently than that,
    > but that's one way you could do it. Something like this:
    >
    > //order[customer[@id='99']]/order-item
    >
    > Is more challenging, and requires a more sophisticated evaluation
    > strategy - you might need to read in a whole order, search it for
    > matching order-items, then throw it away and move on to the next one.
    > Or, if you knew from the DTD that the customer element had to come
    > before any order-items in an order, you could build a state machine that
    > could decide that it was inside a matching order, and then report all
    > order-items.
    >
    > Anyway, all speculation, but it's interesting stuff!


    Interesting.

    But for code written today that uses the standard XML libraries,
    it is safe to assume that XSLT will read it all into memory.

    Arne
     
    Arne Vajhøj, Feb 8, 2010
    #14
  15. Roedy Green

    Lew Guest

    Tom Anderson wrote:
    >> Weeeellll, kinda. Some XSLTs will require the whole document to be held
    >> in memory. But it is possible to process some XSLTs in a streaming or
    >> streaming-ish manner (where elements are held in memory, but only a
    >> subset at a time). There's nothing stopping an XSLT processor compiling
    >> such XSLTs into a form which does just that. Whether any actually do, i
    >> don't know.


    None in common use. The usual XSLT and XPath processors assume a DOM.

    I know from a recent project that it's next to useless to match XPath
    expressions with a SAX parser.

    >> A while ago, i [sic] read about a streaming XPath processor. It couldn't
    >> handle all XPaths in a streaming manner, so it had to fall back to
    >> searching an in-memory tree where that was the case, but many common
    >> XPaths can be handled streamingly. For instance, something like:
    >>
    >> //order[@id='99']/order-item


    Links?

    Arne Vajhøj wrote:
    > But for writing code today that use the standard XML libraries,
    > then assuming that XSLT would read it all into memory would be
    > a safe assumption.


    --
    Lew
     
    Lew, Feb 8, 2010
    #15
  16. Roedy Green

    Roedy Green Guest

    On Sun, 07 Feb 2010 13:14:26 -0500, "John B. Matthews"
    <> wrote, quoted or indirectly quoted someone who
    said :

    >
    >I thought that was a principal advantage of the Simple API For XML (SAX)
    >model, at least in principle. :)


    I read a sentence about SAX that led me to believe it, too, read the
    whole file into RAM; it just did not create a DOM tree. I am glad that
    is not true.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Every compilable program in a sense works. The problem is with your unrealistic expectations of what it will do.
     
    Roedy Green, Feb 8, 2010
    #16
  17. Roedy Green

    Lew Guest

    John B. Matthews wrote, quoted or indirectly quoted someone who said :
    >> I thought that was a principal advantage of the Simple API For XML (SAX)
    >> model, at least in principle. :)


    Roedy Green wrote:
    > I read a sentence about SAX that led me to believe it too read the
    > whole file into RAM, it just did not create a DOM tree. I am glad that
    > is not true.


    It does read the whole file into RAM, just not all at once.

    SAX and StAX let you deal with the information as it streams in (hence the
    "St" in "StAX", for "streaming"), letting you process and perhaps discard stuff as it flows
    by. A typical use is to create an object model, perhaps including everything
    from the document, that is not a DOM. A DOM parser does the same thing, but
    allows only the DOM, not a custom model, and doesn't let you discard anything.
    It presents the whole DOM at the conclusion of parsing. If you then need a
    different object model you need room for both that model and the DOM.

    --
    Lew
     
    Lew, Feb 8, 2010
    #17
  18. Roedy Green

    Tom Anderson Guest

    On Sun, 7 Feb 2010, Lew wrote:

    > Tom Anderson wrote:
    >> On Sun, 7 Feb 2010, Mike Schilling wrote:
    >>
    >>> If you use XSLT to process an XML file, it has to keep a complete
    >>> representation of the resulting XML document into memory, since an
    >>> XSLT transformation can include XPath expressions, and XPath can in
    >>> principle access anything in the document. This is true even if the
    >>> input to XSLT is a SAXSource.

    >>
    >> Weeeellll, kinda. Some XSLTs will require the whole document to be held
    >> in memory. But it is possible to process some XSLTs in a streaming or
    >> streaming-ish manner (where elements are held in memory, but only a
    >> subset at a time). There's nothing stopping an XSLT processor compiling
    >> such XSLTs into a form which does just that. Whether any actually do, i
    >> don't know.

    >
    > None in common use. The usual XSLT and XPath processors assume a DOM.


    Curses. I had an idea that xmlstarlet did streaming XSLT, but on reading
    its documentation, i see no mention of it.

    I would point out that my point was in response to "XSLT [...] *has* to"
    (my emphasis), pointing out that this is not always so, though of course a
    theoretical possibility which is not implemented anywhere is of no use to
    anyone.

    > I know from a recent project that it's next to useless to match XPath
    > expressions with a SAX parser.


    In what sense? That it just builds a DOM tree behind the scenes?

    >>> A while ago, i [sic] read about a streaming XPath processor. It couldn't
    >>> handle all XPaths in a streaming manner, so it had to fall back to
    >>> searching an in-memory tree where that was the case, but many common
    >>> XPaths can be handled streamingly. For instance, something like:
    >>>
    >>> //order[@id='99']/order-item

    >
    > Links?


    Yes, some of those would be really good, actually.

    tom

    --
    secular utopianism is based on a belief in an unstoppable human ability
    to make a better world -- Rt Rev Tom Wright
     
    Tom Anderson, Feb 8, 2010
    #18
  19. Roedy Green

    Lew Guest

    Lew wrote:
    >> I know from a recent project that it's next to useless to match XPath
    >> expressions with a SAX parser.



    Tom Anderson wrote:
    > In what sense? That it just builds a DOM tree behind the scenes?


    In the sense that for XPath to work, there has to already be a DOM for it to
    search, or else you have to forego built-in XPath processing. In that recent
    project they attempted to cache results from XPath expressions that were built
    by manually matching the expression with data from the streamed input. When
    that missed, they had to either re-read the whole input or go ahead and build
    a DOM regardless. The complexity and time cost of manual XPath handling and
    the frequency of misses presented a rather intractable barrier to the approach.

    That's only a single data point, of course. I don't rule out the possibility
    that another approach to blending SAX and XPath could work. Had it been up to
    me, I would have abandoned XPath for that application and just used SAX or
    StAX to build a domain-specific object model, not a DOM, and directly
    referenced items from that model.

    --
    Lew
     
    Lew, Feb 8, 2010
    #19
  20. Roedy Green

    Tom Anderson Guest

    On Mon, 8 Feb 2010, Lew wrote:

    > Lew wrote:
    >
    >>> I know from a recent project that it's next to useless to match XPath
    >>> expressions with a SAX parser.

    >
    > Tom Anderson wrote:
    >> In what sense? That it justs builds a DOM tree behind the scenes?

    >
    > In the sense that for XPath to work, there has to already be a DOM for
    > it to search, or else you have to forego built-in XPath processing.


    Right, yes.

    > In that recent project they attempted to cache results from XPath
    > expressions that were built by manually matching the expression with
    > data from the streamed input. When that missed, they had to either
    > re-read the whole input or go ahead and build a DOM regardless. The
    > complexity and time cost of manual XPath handling and the frequency of
    > misses presented a rather intractable barrier to the approach.


    Yes, unless you know what a large fraction of your XPaths are upfront, i
    can't see that being a winning strategy.

    > That's only a single data point, of course. I don't rule out the
    > possibility that another approach to blending SAX and XPath could work.
    > Had it been up to me, I would have abandoned XPath for that application
    > and just used SAX or StAX to build a domain-specific object model, not a
    > DOM, and directly referenced items from that model.


    Sounds sensible. Every time i've had to deal with XML and had the freedom
    to do it how i liked, i've ended up doing just that - write a
    ContentHandler that turns the elements into calls to some domain-space
    interface, then write an implementation of that that either builds objects
    or does something else useful.
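
    That pattern might be sketched as follows (the OrderListener interface is a hypothetical illustration; one implementation collects objects, another could stream rows to a database):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DomainHandler {
    /** Hypothetical domain-space interface: the handler translates XML
     *  events into calls on this, so consumers never see SAX at all. */
    interface OrderListener {
        void order(String id);
    }

    static void parseOrders(String xml, final OrderListener listener) throws Exception {
        DefaultHandler h = new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                if ("order".equals(q)) listener.order(a.getValue("id"));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
    }

    public static void main(String[] args) throws Exception {
        final List<String> ids = new ArrayList<>();
        parseOrders("<orders><order id='1'/><order id='2'/></orders>", ids::add);
        System.out.println(ids); // prints [1, 2]
    }
}
```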

    tom

    --
    24-Hour Monkey-Vision!
     
    Tom Anderson, Feb 8, 2010
    #20
