large XML files


Roedy Green

It seems to me the usual XML tools in Java load the entire XML file
into RAM. Are there any tools that process sequentially, bringing in
only a chunk at a time so you could handle really fat files?
 

Donkey Hottie

Roedy said:
It seems to me the usual XML tools in Java load the entire XML file
into RAM. Are there any tools that process sequentially, bringing in
only a chunk at a time so you could handle really fat files?

Java has tools for such XML files. SAX processes XML so that it does not
need to load it all into memory.
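
A minimal sketch (untested, and "big.xml" is just a placeholder name) that
walks a document with SAX, seeing one event at a time and keeping nothing
else around:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCount {
    public static void main(String[] args) throws Exception {
        final int[] count = {0};
        // The handler is called back once per event; nothing else is retained.
        SAXParserFactory.newInstance().newSAXParser().parse(
            new java.io.File("big.xml"),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                          String qName, Attributes attrs) {
                    count[0]++;
                }
            });
        System.out.println("Elements seen: " + count[0]);
    }
}

Memory use stays flat no matter how big the file is, because the parser
never builds a tree.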
 

Donkey Hottie

Pete said:
Sounds like you want the XMLStreamReader interface:
http://java.sun.com/javase/6/docs/api/javax/xml/stream/XMLStreamReader.html

I haven't used the Java version myself (there's a similar type in .NET),
and haven't looked closely at the specifics. But I presume there's a way
to get an implementation of the interface (it looks like XMLInputFactory
is the way to go).

Of course, if per a previous discussion you're stuck on Java 1.5, this
is unavailable to you. But otherwise, you should find it exactly what
you're asking for.

Pete

The SAX interface works fine even with Java 1.4, and it does what Roedy wants.
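
To make the XMLInputFactory route concrete, here is a rough StAX sketch
(Java 6 or later; "big.xml" is a placeholder):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxDemo {
    public static void main(String[] args) throws Exception {
        // Pull-parses the document one event at a time; nothing is kept
        // beyond the current cursor position.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new FileInputStream("big.xml"));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
    }
}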
 

Arne Vajhøj

Roedy said:
It seems to me the usual XML tools in Java load the entire XML file
into RAM.

????

W3C DOM and JAXB do load all data into memory.

SAX and StAX do not load all data into memory.
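
For contrast, the W3C DOM route builds the whole tree before parse()
returns (a sketch; "big.xml" is a placeholder):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomDemo {
    public static void main(String[] args) throws Exception {
        // The entire document tree sits in memory once parse() returns,
        // which is exactly what makes DOM awkward for very large files.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("big.xml"));
        System.out.println("Root element: "
                + doc.getDocumentElement().getNodeName());
    }
}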

Arne
 

Lew

Donkey said:
Java has tools for such XML files. SAX processes XML so that it does not
need to load it all into memory.

I first used SAX for XML parsing in early 1999. There's nothing new
about it.

SAX, and its equally handy StAX sibling, are perfect for single-pass,
very-high-speed, memory-parsimonious handling of XML documents.

Roedy has an interesting definition of "usual XML tools", since he's
ignoring two out of three interfaces, including one that's been around
nearly forever.
 

Arne Vajhøj

It's been around since Java 1.2; it better work with 1.4.

Yes and no.

SAX was added to the Java API in 1.4.

The JAXP API, including SAX, existed before Java 1.4, and libraries
implementing it could be downloaded separately.

I have done the latter for Java 1.3, and it may already have existed
for 1.2.

Arne
 

Mike Schilling

Arne said:
????

W3C DOM and JAXB do load all data into memory.

SAX and StAX do not load all data into memory.

If you use XSLT to process an XML file, it has to keep a complete
representation of the XML document in memory, since an XSLT
transformation can include XPath expressions, and XPath can in principle
access anything in the document. This is true even if the input to XSLT is
a SAXSource.
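
For concreteness, feeding XSLT a SAXSource looks roughly like this (a
sketch; the file names are placeholders). The API accepts a streaming
source, but the processor is still free to buffer the entire input
internally:

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;

public class XsltFromSax {
    public static void main(String[] args) throws Exception {
        // The input arrives as SAX events, but the processor may build an
        // internal tree of the source document before transforming it.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("style.xsl")));
        t.transform(new SAXSource(new InputSource("data.xml")),
                    new StreamResult(new File("out.xml")));
    }
}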
 

Arne Vajhøj

Mike said:
If you use XSLT to process an XML file, it has to keep a complete
representation of the XML document in memory, since an XSLT
transformation can include XPath expressions, and XPath can in principle
access anything in the document. This is true even if the input to XSLT is
a SAXSource.

True.

But that problem is very hard to solve.

Arne
 

Tom Anderson

Mike said:
If you use XSLT to process an XML file, it has to keep a complete
representation of the XML document in memory, since an XSLT
transformation can include XPath expressions, and XPath can in principle
access anything in the document. This is true even if the input to
XSLT is a SAXSource.

Weeeellll, kinda. Some XSLTs will require the whole document to be held in
memory. But it is possible to process some XSLTs in a streaming or
streaming-ish manner (where elements are held in memory, but only a subset
at a time). There's nothing stopping an XSLT processor compiling such
XSLTs into a form which does just that. Whether any actually do, i don't
know.

A while ago, i read about a streaming XPath processor. It couldn't handle
all XPaths in a streaming manner, so it had to fall back to searching an
in-memory tree where that was the case, but many common XPaths can be
handled streamingly. For instance, something like:

//order[@id='99']/order-item

Could be. You run the parse, and maintain the current stack of elements in
memory - all the elements enclosing the current parse point, IYSWIM. Then
you just look at the top of the stack at every point to see if it's an
order-item, then if it is, look back to see if the enclosing order has an
id of 99. You could probably do it more efficiently than that, but that's
one way you could do it. Something like this:

//order[customer[@id='99']]/order-item

Is more challenging, and requires a more sophisticated evaluation strategy
- you might need to read in a whole order, search it for matching
order-items, then throw it away and move on to the next one. Or, if you
knew from the DTD that the customer element had to come before any
order-items in an order, you could build a state machine that could decide
that it was inside a matching order, and then report all order-items.
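
A rough, untested sketch of the simple case, using plain SAX and the
element names from the example above (and ignoring the possibility of
nested orders, which the full element stack would handle):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingMatch extends DefaultHandler {
    // Rough equivalent of //order[@id='99']/order-item: remember whether
    // the enclosing <order> matched, and report each <order-item> inside it.
    private boolean inMatchingOrder;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        if ("order".equals(qName)) {
            inMatchingOrder = "99".equals(attrs.getValue("id"));
        } else if ("order-item".equals(qName) && inMatchingOrder) {
            System.out.println("matched an order-item");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("order".equals(qName)) {
            inMatchingOrder = false;
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new java.io.File("orders.xml"), new StreamingMatch());
    }
}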

Anyway, all speculation, but it's interesting stuff!

tom
 

Tom Anderson

Roedy said:
It seems to me the usual XML tools in Java load the entire XML file into
RAM. Are there any tools that process sequentially, bringing in only a
chunk at a time so you could handle really fat files?

What do you mean by 'tools'?

tom
 

Mike Schilling

Tom said:
Weeeellll, kinda. Some XSLTs will require the whole document to be
held in memory. But it is possible to process some XSLTs in a
streaming or streaming-ish manner (where elements are held in memory,
but only a subset at a time). There's nothing stopping an XSLT
processor compiling such XSLTs into a form which does just that.
Whether any actually do, i don't know.

Xalan (the XSLT processor in the JDK) doesn't.
 

Arne Vajhøj

Tom said:
Weeeellll, kinda. Some XSLTs will require the whole document to be held
in memory. But it is possible to process some XSLTs in a streaming or
streaming-ish manner (where elements are held in memory, but only a
subset at a time). There's nothing stopping an XSLT processor compiling
such XSLTs into a form which does just that. Whether any actually do, i
don't know.

[snip]

Anyway, all speculation, but it's interesting stuff!

Interesting.

But for writing code today that uses the standard XML libraries,
assuming that XSLT reads it all into memory is a safe assumption.

Arne
 

Lew

None in common use. The usual XSLT and XPath processors assume a DOM.

I know from a recent project that it's next to useless to match XPath
expressions with a SAX parser.
A while ago, i [sic] read about a streaming XPath processor. It couldn't
handle all XPaths in a streaming manner, so it had to fall back to
searching an in-memory tree where that was the case, but many common
XPaths can be handled streamingly. For instance, something like:

//order[@id='99']/order-item

Links?
 

Roedy Green

I thought that was a principal advantage of the Simple API For XML (SAX)
model, at least in principle. :)

I read a sentence about SAX that led me to believe it, too, read the
whole file into RAM; it just did not create a DOM tree. I am glad that
is not true.
 

Lew

John B. Matthews wrote, quoted or indirectly quoted someone who said:
Roedy said:
I read a sentence about SAX that led me to believe it, too, read the
whole file into RAM; it just did not create a DOM tree. I am glad that
is not true.

It does read the whole file into RAM, just not all at once.

SAX and StAX let you deal with the information as it streams in (hence the
"St" in StAX, for "streaming"), letting you process and perhaps discard stuff
as it flows by. A typical use is to create an object model, perhaps including
everything from the document, that is not a DOM. A DOM parser does the same
thing, but allows only the DOM, not a custom model, and doesn't let you
discard anything. It presents the whole DOM at the conclusion of parsing. If
you then need a different object model, you need room for both that model and
the DOM.
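
As a sketch of that kind of custom model (hypothetical element names;
nothing authoritative), one might pull just the pieces of interest with
StAX and let everything else stream past:

import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class OrderIds {
    // Collects only the order ids into a small domain model (a List),
    // discarding everything else as it streams by.
    public static List<String> read(String fileName) throws Exception {
        List<String> ids = new ArrayList<String>();
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(fileName));
        try {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "order".equals(r.getLocalName())) {
                    ids.add(r.getAttributeValue(null, "id"));
                }
            }
        } finally {
            r.close();
        }
        return ids;
    }
}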
 

Tom Anderson

Lew said:
None in common use. The usual XSLT and XPath processors assume a DOM.

Curses. I had an idea that xmlstarlet did streaming XSLT, but on reading
its documentation, i see no mention of it.

I would point out that my point was in response to "XSLT [...] *has* to"
(my emphasis): that is not always so, though of course a theoretical
possibility which is not implemented anywhere is of no use to anyone.
I know from a recent project that it's next to useless to match XPath
expressions with a SAX parser.

In what sense? That it just builds a DOM tree behind the scenes?
A while ago, i [sic] read about a streaming XPath processor. It couldn't
handle all XPaths in a streaming manner, so it had to fall back to
searching an in-memory tree where that was the case, but many common
XPaths can be handled streamingly. For instance, something like:

//order[@id='99']/order-item

Links?

Yes, some of those would be really good, actually.

tom
 

Lew

Tom said:
In what sense? That it just builds a DOM tree behind the scenes?

In the sense that for XPath to work, there has to already be a DOM for it to
search, or else you have to forego built-in XPath processing. In that recent
project they attempted to cache results from XPath expressions that were built
by manually matching the expression with data from the streamed input. When
that missed, they had to either re-read the whole input or go ahead and build
a DOM regardless. The complexity and time cost of manual XPath handling and
the frequency of misses presented a rather intractable barrier to the approach.

That's only a single data point, of course. I don't rule out the possibility
that another approach to blending SAX and XPath could work. Had it been up to
me, I would have abandoned XPath for that application and just used SAX or
StAX to build a domain-specific object model, not a DOM, and directly
referenced items from that model.
 

Tom Anderson

In the sense that for XPath to work, there has to already be a DOM for
it to search, or else you have to forego built-in XPath processing.

Right, yes.
In that recent project they attempted to cache results from XPath
expressions that were built by manually matching the expression with
data from the streamed input. When that missed, they had to either
re-read the whole input or go ahead and build a DOM regardless. The
complexity and time cost of manual XPath handling and the frequency of
misses presented a rather intractable barrier to the approach.

Yes, unless you know a large fraction of your XPaths upfront, i can't see
that being a winning strategy.
That's only a single data point, of course. I don't rule out the
possibility that another approach to blending SAX and XPath could work.
Had it been up to me, I would have abandoned XPath for that application
and just used SAX or StAX to build a domain-specific object model, not a
DOM, and directly referenced items from that model.

Sounds sensible. Every time i've had to deal with XML and had the freedom
to do it how i liked, i've ended up doing just that - write a
ContentHandler that turns the elements into calls to some domain-space
interface, then write an implementation of that that either builds objects
or does something else useful.
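
In rough, untested code, with made-up names, that shape looks something
like:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// The domain-space interface: the ContentHandler translates SAX events
// into these calls, and implementations decide what to do with them.
interface OrderListener {
    void order(String id);
    void orderItem(String sku);
}

class OrderContentHandler extends DefaultHandler {
    private final OrderListener listener;

    OrderContentHandler(OrderListener listener) {
        this.listener = listener;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        if ("order".equals(qName)) {
            listener.order(attrs.getValue("id"));
        } else if ("order-item".equals(qName)) {
            listener.orderItem(attrs.getValue("sku"));
        }
    }
}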

tom
 
