What tool to use for processing large documents

Discussion in 'XML' started by Luc Mercier, Oct 21, 2006.

  1. Luc Mercier

    Luc Mercier Guest

    Hi Folks,

    I'm new here, and I need some advice on what tool to use.

    I'm using XML for benchmarking purposes. I'm writing some scientific
    programs which I want to analyze. My program generates large XML logs
    giving semi-structured information on the flow of the program. The XML
    tree looks like the method calls tree, but at a much higher level, and I
    add many values of some variables.

    There is no predefined schema, and often as I modify my program I will
    add some new tags and new information to put into the log.

    Once a log is written, I never modify the document.

    To analyze the data, I had an /almost/ perfect solution: from Matlab, I
    would call the methods of the Java library dom4j. Typically, I would
    load a document, then dump the values of attributes matching an XPath
    expression into a Matlab array, then do some stats or plotting. I'm very
    happy with the comfort and the ease of this solution: no DB to set up,
    just load a document, and Matlab gives you an environment in which
    you can call Java methods without creating a Java program, so it's very
    easy to debug the XPath expressions you pass to dom4j's "selectNodes"
    method.
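
    For the record, the pattern looks roughly like this. I actually use
    dom4j's selectNodes, but the JDK's built-in javax.xml.xpath shows the
    same idea; the tiny document and expression here are made up for
    illustration:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class XPathDump {
    public static void main(String[] args) throws Exception {
        // Tiny in-memory document standing in for a real log file.
        String xml = "<run><call t=\"1.5\"/><call t=\"2.5\"/></run>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // Select all 't' attributes, much like dom4j's selectNodes.
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate("//call/@t", doc,
                XPathConstants.NODESET);
        // Dump the attribute values, as one would into a Matlab array.
        double sum = 0;
        for (int i = 0; i < nodes.getLength(); i++)
            sum += Double.parseDouble(nodes.item(i).getNodeValue());
        System.out.println(sum); // 4.0
    }
}
```

    The catch, of course, is that this loads the whole document into
    memory first.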

    Now, the problem is, it's perfect for documents of a few tens of
    megabytes, but now I would like to process documents of several hundred
    MB up to, let's say, maybe 10 GB (that's a fairly large upper bound).

    It seems I have to give up on dom4j for that. I have tried to use
    eXist to create a DB with my documents, and all I got was a lot of
    (rather violent) crashes when I tried to run the first example they give
    in the doc for retrieving a document via the XML:DB api. Then I tried
    BerkeleyDB XML, which I have not been able to install. I then tried
    xmlDB, but as I tried to import a first document into a collection I got
    a "java.lang.OutOfMemoryError: Java heap space" and found no mention in
    the doc of how to specify the heap space.

    After these 3 unsuccessful trials, I'd like to ask for some advice!

    To summarize, my needs are:
    * Processing (very) large XML documents
    * Need for XPath
    * Java API, to be able to call from Matlab
    * Read-only processing
    * Single user, no security issues, no remote access need
    * Platform: Java if possible, otherwise Linux/Debian on x86.

    I welcome any suggestion.

    - Luc Mercier.
    Luc Mercier, Oct 21, 2006
    #1


  3. Luc Mercier wrote:
    > * Processing (very) large XML documents
    > * Need for XPath


    That combination sounds like you want a serious XML database. If done
    right, that should give you a system which already knows how to
    handle documents larger than memory and one which implements XPath data
    retrieval against them, leaving you to implement just the program logic.
    (I haven't worked with any of these, but I'll toss out my standard
    reminder that IBM's DB2 now has XML-specific capabilities. I'm not sure
    whether those have been picked up in Cloudscape, IBM's Java-based database.)

    Another solution is not to work on the whole document at once. Instead,
    go with streaming-style processing, SAX-based with a relatively
    small amount of persisting data. You can hand-code the extraction, or
    there have been papers describing systems which can be used to filter a
    SAX stream and extract just the subtrees which match a specified XPath.
    Of course you may have to reprocess the entire stream in order to
    evaluate a different XPath, but it is a way around memory constraints.
    It works very well for some specific systems, either alone or by feeding
    this "filtered" SAX stream into a model builder to construct a model
    that reflects only the data your application actually cares about. On
    the other hand, if you need true random access to the complete document,
    this won't do it for you.
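
    In rough Java, the hand-coded version of that filtering idea might look
    like this (the element names are invented for illustration):

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class SaxFilter {
    public static void main(String[] args) throws Exception {
        String xml = "<log><step n=\"1\"/><step n=\"2\"/><other/></log>";
        List<String> matches = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Keep only the events we care about; everything else
            // streams past without being retained in memory.
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                if (qName.equals("step"))
                    matches.add(atts.getValue("n"));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")),
                       handler);
        System.out.println(matches); // [1, 2]
    }
}
```

    Only the extracted values persist; the document itself is never held
    in memory as a tree.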

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Oct 21, 2006
    #3
  4. Luc Mercier wrote:

    > Now, the problem is, it's perfect for documents of a few tens of
    > megabytes, but now I would like to process documents of several hundred
    > MB up to, let's say, maybe 10 GB (that's a fairly large upper bound).


    Whatever XML parser you use, notice that the parser
    cannot parse faster than the disk can read the XML data.
    Reading 10 GB off a disk will take around 3 to 5 minutes
    for disk access alone. Reading and parsing together should
    take around 20 minutes, even with the best parsers.

    > It seems I have to give up with dom4j for that. I have tried to use
    > eXist to create a DB with my documents, and all I got was a lot of
    > (rather violent) crashes when I tried to run the first example they give
    > in the doc for retrieving a document via the XML:DB api. Then I tried
    > BerkeleyDB XML, which I have not been able to install. I then tried
    > xmlDB, but as I tried to import a first document into a collection I got
    > a "java.lang.OutOfMemoryError: Java heap space" and found no mention in
    > the doc of how to specify the heap space.


    Remember that a DOM is a complete copy of the XML data
    in the address space of the CPU. If your XML data is
    10 GB, then your address space has to be at least 10 GB.
    This is unrealistic on today's machines.

    > * Processing (very) large XML documents
    > * Need for XPath
    > * Java API, to be able to call from Matlab
    > * Read-only processing
    > * Single user, no security issues, no remote access need
    > * Platform: Java if possible, otherwise Linux/Debian on x86.


    Java's SAX API can help you parse the data, but SAX will
    _not_ allow you to use XPath.

    > I welcome any suggestion.


    OK, I am assuming that the result of XPath processing
    is much shorter than the original XML data. If so, I
    bet the problem can be solved in xgawk:

    http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Printing-an-outline-of-an-XML-file

    I have used xgawk for parsing files with several GB.
    It works, but it will take several minutes, of course.
    Jürgen Kahrs, Oct 21, 2006
    #4
  5. Jürgen Kahrs wrote:
    > for disk access alone. Reading and parsing together should
    > take around 20 minutes, even with the best parsers.


    Which is one reason to consider the database approach, with index
    information precalculated at the time you store the info.

    Or to use XML only as your interchange representation, and use something
    more specialized for capture and/or computation. Staying 100% XML is a
    reasonable choice for prototyping, but in production systems XML should
    be used mostly in places where generality is actually of value.

    > Remember that a DOM is a complete copy of the XML data
    > in the address space of the CPU


    Standard correction: The DOM is just an API. The metaphor it uses is one
    of an object graph, but DOMs can be written which do not keep the whole
    document in memory at once, intelligently loading the sections which are
    actually referenced. But that requires that your application, in turn,
    be careful about how it accesses the data, to avoid undue churn.


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Oct 21, 2006
    #5
  6. Joe Kesselman wrote:

    >> Remember that a DOM is a complete copy of the XML data
    >> in the address space of the CPU

    >
    > Standard correction: The DOM is just an API. The metaphor it uses is one
    > of an object graph, but DOMs can be written which do not keep the whole
    > document in memory at once, intelligently loading the sections which are
    > actually referenced. But that requires that your application, in turn,
    > be careful about how it accesses the data, to avoid undue churn.


    Let's assume that his application is careful about
    how it accesses data. Should we still call such an
    application DOM-based? Which implementations of the
    DOM API really allow the user to proceed this way?
    I am asking out of curiosity. I don't know the answer.
    Jürgen Kahrs, Oct 21, 2006
    #6
  7. Luc Mercier

    Luc Mercier Guest

    First, thanks a lot to both of you. Some additional information:

    1...... The I/O issue:

    > Whatever XML parser you use, notice that the parser
    > cannot parse faster than the disk can read the XML data.
    > Reading 10 GB off a disk will take around 3 to 5 minutes
    > for disk access alone. Reading and parsing together should
    > take around 20 minutes, even with the best parsers.


    Yes, I know that. As I said, 10 GB is a large upper bound. I do not
    expect to have that big a file, but I do not want to have memory
    problems in case that occurs. In short: I want to be able to process
    files bigger than the RAM.

    For the I/O limit, I forgot to mention that my logs are in zipped XML. I
    compress them on the fly while producing them. Since my files are very
    repetitive (a small number of different tags, and virtually every other
    character, excluding spaces, is a digit), I get excellent compression
    ratios of 20:1 to 30:1. I believe this can speed up the
    reading/parsing, although I admit I haven't checked.

    However, if I need to uncompress them in the file system to be able to
    use a particular tool, that's fine for me.
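
    Feeding the compressed log straight to the parser, without ever
    uncompressing it to disk, is straightforward in Java (assuming gzip;
    this sketch compresses a small document in memory to stand in for a
    real .xml.gz log):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class GzipSax {
    public static void main(String[] args) throws Exception {
        // Build a small gzipped document in memory (stands in for a log file).
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<log><v x=\"7\"/><v x=\"35\"/></log>".getBytes("UTF-8"));
        }
        final int[] count = {0};
        DefaultHandler h = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q,
                                     Attributes a) {
                if (q.equals("v")) count[0]++;
            }
        };
        // GZIPInputStream decompresses on the fly;
        // the unzipped data never touches the file system.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new GZIPInputStream(
                        new ByteArrayInputStream(buf.toByteArray())), h);
        System.out.println(count[0]); // 2
    }
}
```

    With 20:1 ratios, this also cuts the disk I/O by the same factor.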



    2...... The DOM vs SAX question:

    >>> Remember that a DOM is a complete copy of the XML data
    >>> in the address space of the CPU

    >> Standard correction: The DOM is just an API. The metaphor it uses is one
    >> of an object graph, but DOMs can be written which do not keep the whole
    >> document in memory at once, intelligently loading the sections which are
    >> actually referenced. But that requires that your application, in turn,
    >> be careful about how it accesses the data, to avoid undue churn.

    >
    > Let's assume that his application is careful about
    > how it accesses data. Should we still call such an
    > application DOM-based ? Which implementations of the
    > DOM API really allow the user to proceed this way ?
    > I am asking out of curiosity. I don't know the answer.


    I simplified my description of how I analyze my data a little. Usually,
    what I do is:
    (a). Get the set of nodes matching an XPath expression. Usually, the
    size of this set is a very small fraction of the number of nodes in the
    tree, say at most 10^-3.
    (b). Iterate over the results. For each node N in the set, retrieve
    some data (again with XPath) contained in the subtree rooted at N. Here
    typically virtually 100% of the subtree contains useful data.
    (c). Do some computation with the data obtained in (b), then store the
    result and forget about the data retrieved in (b).

    So, what I'm doing is very sequential in nature. Dom4j returns a set
    when I call selectNodes, but all I need is an iterator. Also, I often
    use almost all the data contained in the document. So I think a
    SAX-based XPath processor, if such a thing exists, would definitely be a
    suitable solution. Actually, to implement (b) the best would be to have
    a processor that accepts a list of XPath expressions and goes through
    the document, stopping each time one of them matches and returning
    the index of the matched expression. Of course you can do that by
    agglomerating the expressions with an "or", but then you have to find
    out which one matched.
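
    A hand-rolled sketch of that idea: track the current element path during
    a SAX parse and test it against a list of path expressions, reporting
    the index of whichever one matched. (Only plain absolute paths like
    /a/b here, nothing close to full XPath; the document and expressions
    are invented:)

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class PathMatcher {
    public static void main(String[] args) throws Exception {
        // The expressions to watch for; hits records which index matched.
        List<String> exprs = Arrays.asList("/run/setup", "/run/iter/stats");
        String xml = "<run><setup/><iter><stats/></iter>"
                   + "<iter><stats/></iter></run>";
        List<Integer> hits = new ArrayList<>();
        DefaultHandler h = new DefaultHandler() {
            // The stack of open elements, i.e. the current path.
            Deque<String> path = new ArrayDeque<>();
            @Override
            public void startElement(String u, String l, String q,
                                     Attributes a) {
                path.addLast(q);
                String cur = "/" + String.join("/", path);
                int i = exprs.indexOf(cur); // which expression matched, if any
                if (i >= 0) hits.add(i);
            }
            @Override
            public void endElement(String u, String l, String q) {
                path.removeLast();
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), h);
        System.out.println(hits); // [0, 1, 1]
    }
}
```

    In the real thing, the handler would buffer the subtree under each
    match instead of just recording the index.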

    Because of the nature of my queries, I do not believe that a DB system
    with some smart indexing feature would speed up anything. But, once
    again, I'm more concerned with ease of use than with speed. This is
    for benchmarking only, and I will run queries only a small number of
    times on big files. I don't really want to write ad-hoc SAX-based code,
    which sounds like a big pain. If an XML database, free if possible, can
    let me run my queries without too much setup and with decent
    performance, that's perfect for me.

    Again, thanks for your help.

    Luc
    Luc Mercier, Oct 22, 2006
    #7
  8. Luc Mercier wrote:

    > problems in case that occurs. In short: I want to be able to process
    > files bigger than the RAM.


    Indeed, "files bigger than the RAM", that's the crucial point.
    The "classical DOM API" is not compatible with this constraint.
    Maybe the approach Joe described fits better.

    > For the I/O limit, I forgot to mention that my logs are in zipped XML. I
    > compress them on the fly while producing them. Since my files are very
    > repetitive (a small number of different tags, and virtually every other
    > character, excluding spaces, is a digit), I get excellent compression
    > ratios of 20:1 to 30:1. I believe this can speed up the
    > reading/parsing, although I admit I haven't checked.


    Yes, this can avoid the time needed for reading the XML
    data off the hard disk.

    gunzip -c data.xml.gz | ...
    unzip -c data.xml.zip | ...

    This way, the unzipped data never touches the hard disk.

    > use almost all the data contained in the document. So I think a
    > SAX-based XPath processor, if such a thing exists, would definitely be a


    I doubt that such a thing exists.

    Good luck.
    Jürgen Kahrs, Oct 22, 2006
    #8
  9. jay m

    jay m Guest

    Well, there seem to be some open source XML databases.
    I don't know about your other qualifiers, but...

    http://exist.sourceforge.net/
    http://xml.apache.org/xindice/

    are two.

    jay m, Oct 23, 2006
    #9
  10. jay m wrote:

    > Well, there seem to be some open source XML databases
    > I don't know about your other qualifiers, but...
    >
    > http://exist.sourceforge.net/
    > http://xml.apache.org/xindice/
    >
    > are two.


    Interesting. Sounds like "exist" should be able
    to handle large files. But I am not convinced
    that the performance will be acceptable. This
    has to be tried to get an answer.
    Jürgen Kahrs, Oct 23, 2006
    #10
  11. Luc Mercier

    Luc Mercier Guest

    Jürgen Kahrs wrote:
    > jay m wrote:
    >
    >> Well, there seem to be some open source XML databases
    >> I don't know about your other qualifiers, but...
    >>
    >> http://exist.sourceforge.net/
    >> http://xml.apache.org/xindice/
    >>
    >> are two.

    >
    > Interesting. Sounds like "exist" should be able
    > to handle large files. But I am not convinced
    > that the performance will be acceptable. This
    > has to be tried to get an answer.


    Well, as I mentioned, I got serious problems using eXist: when I tried
    to run the very first example they give in the documentation of the
    XML:DB API, my screen went all red, and then KDE logged me out and I
    got to the login screen... I've never had anything like that happen
    before, especially running Java code!

    I tried the two current releases (standard and 'new core'), and both
    produced the same result. So this product does not seem mature enough...

    I'm trying xindice right now.

    Thanks everyone for your suggestions.
    Luc Mercier, Oct 23, 2006
    #11
  12. Luc Mercier

    Luc Mercier Guest

    Luc Mercier wrote:
    > Jürgen Kahrs wrote:
    >> jay m wrote:
    >>
    >>> Well, there seem to be some open source XML databases
    >>> I don't know about your other qualifiers, but...
    >>>
    >>> http://exist.sourceforge.net/
    >>> http://xml.apache.org/xindice/
    >>>
    >>> are two.

    >> Interesting. Sounds like "exist" should be able
    >> to handle large files. But I am not convinced
    >> that the performance will be acceptable. This
    >> has to be tried to get an answer.

    >
    > Well as I mentioned, I got serious problems using eXist: when I tried to
    > run the very first example they give in the documentation of the xml:db
    > api, my screen would get all red, and then Kde logged me out and I got
    > to the login screen... Never had anything like that before, especially
    > running Java code !
    >
    > I tried the two current releases (standard and 'new core'), and both
    > produced the same result. So this product does not seem mature enough...
    >
    > I'm trying xindice right now.
    >
    > Thanks everyone for your suggestions.


    All right, after wasting some time with xindice, I read in the xindice FAQ:


    --
    10. My 5 megabyte file is crashing the command line, help?

    See FAQ #2. Xindice wasn't designed for monster documents, rather, it
    was designed for collections of small to medium sized documents. The
    best thing to do in this case would be to look at your 5 megabyte file,
    and determine whether or not it's a good candidate for being sliced into
    a set of small documents. If so, you'll want to extract the separate
    documents and add them to a Xindice collection individually. A good
    example of this, would be a massive document of this form:
    --

    So it's not suitable for me. I certainly could slice up my documents,
    but not easily into pieces as small as 5 MB.

    Luc
    Luc Mercier, Oct 23, 2006
    #12
  13. Luc Mercier

    Luc Mercier Guest

    OK, so I found a well-documented list of native XML databases:

    http://www.rpbourret.com/xml/ProdsNative.htm

    Three of them are explicitly said to be designed to handle large documents:
    * 4Suite, 4Suite Server (free)
    * Infonyte DB (commercial)
    * Sonic XML Server(commercial)



    The first one is in Python. I don't know how easy it is to call Python
    code from Matlab. I'm going to check that.

    Does anyone have any experience with either of the other two?

    - Luc.
    Luc Mercier, Oct 23, 2006
    #13
  14. Luc Mercier wrote:

    > Well as I mentioned, I got serious problems using eXist: when I tried to
    > run the very first example they give in the documentation of the xml:db
    > api, my screen would get all red, and then Kde logged me out and I got
    > to the login screen... Never had anything like that before, especially
    > running Java code !


    That's funny. I just remembered this one:

    http://vtd-xml.sourceforge.net/
    VTD-XML is the next generation XML parser
    that goes beyond DOM and SAX in terms of
    performance, memory and ease of use.
    Jürgen Kahrs, Oct 24, 2006
    #14
  15. Jürgen Kahrs wrote:
    >>api, my screen would get all red, and then Kde logged me out and I got
    >>to the login screen... Never had anything like that before, especially
    >>running Java code !


    Congratulations; you found a JVM bug. See if there was a logfile of some
    sort, and if so report it to the folks maintaining that version of Java...

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
    Joseph Kesselman, Oct 24, 2006
    #15
  16. Joseph Kesselman wrote:
    > Jürgen Kahrs wrote:
    >>> api, my screen would get all red, and then Kde logged me out and I got
    >>> to the login screen... Never had anything like that before, especially
    >>> running Java code !

    >
    > Congratulations; you found a JVM bug. See if there was a logfile of some
    > sort, and if so report it to the folks maintaining that version of Java...
    >


    It was Luc Mercier who found it.
    Jürgen Kahrs, Oct 24, 2006
    #16
  17. jay m

    jay m Guest

    Jürgen Kahrs wrote:
    > That's funny. I just remembered this one:
    >
    > http://vtd-xml.sourceforge.net/
    > VTD-XML is the next generation XML parser
    > that goes beyond DOM and SAX in terms of
    > performance, memory and ease of use.


    From the website:

    "Its memory usage is typically between 1.3x~1.5x the size of the XML
    document, "
    and
    " VTD requires that XML document be maintained intact in memory."

    For multi-GB documents, you will need a very well-equipped machine!

    As an associate once told me: "yes, that's a very nice problem".
    Regards
    Jay
    jay m, Oct 26, 2006
    #17
  18. Luc Mercier

    Luc Mercier Guest

    So, finally, after many experiments, I chose Infonyte DB, which was
    clearly the best of everything I tried. It's commercial software, but
    not very expensive; it handles documents up to 1 TB, I think;
    performance is OK; and setting everything up and getting started takes
    5 minutes.

    Thanks again to the people who gave me advice.

    - Luc.
    Luc Mercier, Nov 4, 2006
    #18
