What tool to use for processing large documents

Discussion in 'XML' started by Luc Mercier, Oct 21, 2006.

  1. Luc Mercier

    Luc Mercier Guest

    Hi Folks,

    I'm new here, and I need some advice on what tool to use.

    I'm using XML for benchmarking purposes. I'm writing some scientific
    programs which I want to analyze. My program generates large XML logs
    giving semi-structured information on the flow of the program. The XML
    tree looks like the method calls tree, but at a much higher level, and I
    add many values of some variables.

    There is no predefined schema, and often as I modify my program I will
    add some new tags and new information to put into the log.

    Once a log is written, I never modify the document.

    To analyze the data, I had an /almost/ perfect solution: from Matlab, I
    would call the methods of the Java library dom4j. Typically, I would
    load a document, then dump the values of attributes matching an XPath
    expression into a Matlab array, then do some stats or plotting. I'm very
    happy with the comfort and the ease of this solution: no DB to set up,
    just load a document, and Matlab gives you an environment in which
    you can call Java methods without creating a Java program, so it's very
    easy to debug the XPath expressions you pass to dom4j's "selectNodes"
    method.
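
    For the record, the pattern looks roughly like this. I actually use
    dom4j's selectNodes, but the JDK's built-in javax.xml.xpath shows the
    same idea; the tiny document and expression here are made up for
    illustration:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class XPathDump {
    public static void main(String[] args) throws Exception {
        // Tiny in-memory document standing in for a real log file.
        String xml = "<run><call t=\"1.5\"/><call t=\"2.5\"/></run>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // Select all 't' attributes, much like dom4j's selectNodes.
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate("//call/@t", doc,
                XPathConstants.NODESET);
        // Dump the attribute values, as one would into a Matlab array.
        double sum = 0;
        for (int i = 0; i < nodes.getLength(); i++)
            sum += Double.parseDouble(nodes.item(i).getNodeValue());
        System.out.println(sum); // 4.0
    }
}
```

    The catch, of course, is that this loads the whole document into
    memory first.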

    Now, the problem is, it's perfect for documents of a few tens of
    megabytes, but now I would like to process documents of several hundred
    MB up to, let's say, maybe 10 GB (that's a fairly large upper bound).

    It seems I have to give up on dom4j for that. I have tried to use
    eXist to create a DB with my documents, and all I got was a lot of
    (rather violent) crashes when I tried to run the first example they give
    in the doc for retrieving a document via the XML:DB api. Then I tried
    BerkeleyDB XML, which I have not been able to install. I then tried
    xmlDB, but as I tried to import a first document into a collection I got
    a "java.lang.OutOfMemoryError: Java heap space" and found no mention in
    the doc of how to specify the heap space.

    After these 3 unsuccessful trials, I'd like to ask for some advice!

    To summarize, my needs are:
    * Processing (very) large XML documents
    * Need for XPath
    * Java API, to be able to call from Matlab
    * Read-only processing
    * Single user, no security issues, no remote access need
    * Platform: Java if possible, otherwise Linux/Debian on x86.

    I welcome any suggestion.

    - Luc Mercier.
    Luc Mercier, Oct 21, 2006
    #1


  3. Luc Mercier wrote:
    > * Processing (very) large XML documents
    > * Need for XPath


    That combination sounds like you want a serious XML database. If done
    right, that should give you a system which already knows how to
    handle documents larger than memory and one which implements XPath data
    retrieval against them, leaving you to implement just the program logic.
    (I haven't worked with any of these, but I'll toss out my standard
    reminder that IBM's DB2 now has XML-specific capabilities. I'm not sure
    whether those have been picked up in Cloudscape, IBM's Java-based database.)

    Another solution is not to work on the whole document at once. Instead,
    go with streaming-style processing, SAX-based with a relatively
    small amount of persisting data. You can hand-code the extraction, or
    there have been papers describing systems which can be used to filter a
    SAX stream and extract just the subtrees which match a specified XPath.
    Of course you may have to reprocess the entire stream in order to
    evaluate a different XPath, but it is a way around memory constraints.
    It works very well for some specific systems, either alone or by feeding
    this "filtered" SAX stream into a model builder to construct a model
    that reflects only the data your application actually cares about. On
    the other hand, if you need true random access to the complete document,
    this won't do it for you.
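
    In rough Java, the hand-coded version of that filtering idea might look
    like this (the element names are invented for illustration):

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class SaxFilter {
    public static void main(String[] args) throws Exception {
        String xml = "<log><step n=\"1\"/><step n=\"2\"/><other/></log>";
        List<String> matches = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Keep only the events we care about; everything else
            // streams past without being retained in memory.
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                if (qName.equals("step"))
                    matches.add(atts.getValue("n"));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")),
                       handler);
        System.out.println(matches); // [1, 2]
    }
}
```

    Only the extracted values persist; the document itself is never held
    in memory as a tree.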

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Oct 21, 2006
    #3
  4. Luc Mercier wrote:

    > Now, the problem is, it's perfect for documents of a few tens of
    > megabytes, but now I would like to process documents of several hundred
    > MB up to, let's say, maybe 10 GB (that's a fairly large upper bound).


    Whatever XML parser you use, notice that the parser
    cannot parse faster than the disk can read the XML data.
    Reading 10 GB off a disk will take around 3 to 5 minutes
    for disk access alone. Reading and parsing together should
    take around 20 minutes, even with the best parsers.

    > It seems I have to give up with dom4j for that. I have tried to use
    > eXist to create a DB with my documents, and all I got was a lot of
    > (rather violent) crashes when I tried to run the first example they give
    > in the doc for retrieving a document via the XML:DB api. Then I tried
    > BerkeleyDB XML, which I have not been able to install. I then tried
    > xmlDB, but as I tried to import a first document into a collection I got
    > a "java.lang.OutOfMemoryError: Java heap space" and found no mention in
    > the doc of how to specify the heap space.


    Remember that a DOM is a complete copy of the XML data
    in the address space of the CPU. If your XML data is
    10 GB, then your address space has to be at least 10 GB.
    This is unrealistic on today's machines.

    > * Processing (very) large XML documents
    > * Need for XPath
    > * Java API, to be able to call from Matlab
    > * Read-only processing
    > * Single user, no security issues, no remote access need
    > * Platform: Java if possible, otherwise Linux/Debian on x86.


    Java's SAX API can help you parse the data, but SAX will
    _not_ allow you to use XPath.

    > I welcome any suggestion.


    OK, I am assuming that the result of XPath processing
    is much shorter than the original XML data. If so, I
    bet the problem can be solved in xgawk:

    http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Printing-an-outline-of-an-XML-file

    I have used xgawk for parsing files with several GB.
    It works, but it will take several minutes, of course.
    Jürgen Kahrs, Oct 21, 2006
    #4
  5. Jürgen Kahrs wrote:
    > for disk access alone. Reading and parsing together should
    > take around 20 minutes, even with the best parsers.


    Which is one reason to consider the database approach, with index
    information precalculated at the time you store the info.

    Or to use XML only as your interchange representation, and use something
    more specialized for capture and/or computation. Staying 100% XML is a
    reasonable choice for prototyping, but in production systems XML should
    be used mostly in places where generality is actually of value.

    > Remember that a DOM is a complete copy of the XML data
    > in the address space of the CPU


    Standard correction: The DOM is just an API. The metaphor it uses is one
    of an object graph, but DOMs can be written which do not keep the whole
    document in memory at once, intelligently loading the sections which are
    actually referenced. But that requires that your application, in turn,
    be careful about how it accesses the data, to avoid undue churn.


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Oct 21, 2006
    #5
  6. Joe Kesselman wrote:

    >> Remember that a DOM is a complete copy of the XML data
    >> in the address space of the CPU

    >
    > Standard correction: The DOM is just an API. The metaphor it uses is one
    > of an object graph, but DOMs can be written which do not keep the whole
    > document in memory at once, intelligently loading the sections which are
    > actually referenced. But that requires that your application, in turn,
    > be careful about how it accesses the data, to avoid undue churn.


    Let's assume that his application is careful about
    how it accesses data. Should we still call such an
    application DOM-based? Which implementations of the
    DOM API really allow the user to proceed this way?
    I am asking out of curiosity. I don't know the answer.
    Jürgen Kahrs, Oct 21, 2006
    #6
  7. Luc Mercier

    Luc Mercier Guest

    First, thanks a lot to both of you. Some additional information:

    1...... The I/O issue:

    > Whatever XML parser you use, notice that the parser
    > cannot parse faster than the disk can read the XML data.
    > Reading 10 GB off a disk will take around 3 to 5 minutes
    > for disk access alone. Reading and parsing together should
    > take around 20 minutes, even with the best parsers.


    Yes, I know that. As I said, 10 GB is a large upper bound. I do not
    expect to have that big a file, but I do not want to have memory
    problems in case that occurs. In short: I want to be able to process
    files bigger than the RAM.

    For the I/O limit, I forgot to mention that my logs are in zipped XML. I
    compress them on the fly while producing them. Since my files are very
    repetitive (a small number of different tags, and virtually every other
    character, excluding spaces, is a digit), I get excellent compression
    ratios of 20:1 to 30:1. I believe this can speed up the
    reading/parsing, although I admit I haven't checked.

    However, if I need to uncompress them in the file system to be able to
    use a particular tool, that's fine for me.
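
    Feeding the compressed log straight to the parser, without ever
    uncompressing it to disk, is straightforward in Java (assuming gzip;
    this sketch compresses a small document in memory to stand in for a
    real .xml.gz log):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class GzipSax {
    public static void main(String[] args) throws Exception {
        // Build a small gzipped document in memory (stands in for a log file).
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<log><v x=\"7\"/><v x=\"35\"/></log>".getBytes("UTF-8"));
        }
        final int[] count = {0};
        DefaultHandler h = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q,
                                     Attributes a) {
                if (q.equals("v")) count[0]++;
            }
        };
        // GZIPInputStream decompresses on the fly;
        // the unzipped data never touches the file system.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new GZIPInputStream(
                        new ByteArrayInputStream(buf.toByteArray())), h);
        System.out.println(count[0]); // 2
    }
}
```

    With 20:1 ratios, this also cuts the disk I/O by the same factor.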



    2...... The DOM vs SAX question:

    >>> Remember that a DOM is a complete copy of the XML data
    >>> in the address space of the CPU

    >> Standard correction: The DOM is just an API. The metaphor it uses is one
    >> of an object graph, but DOMs can be written which do not keep the whole
    >> document in memory at once, intelligently loading the sections which are
    >> actually referenced. But that requires that your application, in turn,
    >> be careful about how it accesses the data, to avoid undue churn.

    >
    > Let's assume that his application is careful about
    > how it accesses data. Should we still call such an
    > application DOM-based ? Which implementations of the
    > DOM API really allow the user to proceed this way ?
    > I am asking out of curiosity. I don't know the answer.


    I simplified my description of how I analyze my data a little. Usually,
    what I do is:
    (a). Get the set of nodes matching an XPath expression. Usually, the
    size of this set is a very small fraction of the number of nodes in the
    tree, say at most 10^-3.
    (b). Iterate over the results. For each node N in the set, retrieve
    some data (again with XPath) contained in the subtree rooted at N. Here
    typically virtually 100% of the subtree contains useful data.
    (c). Do some computation with the data obtained in (b), then store the
    result and forget about the data retrieved in (b).

    So, what I'm doing is very sequential in nature. Dom4j returns a set
    when I call selectNodes, but all I need is an iterator. Also, I often
    use almost all the data contained in the document. So I think a
    SAX-based XPath processor, if such a thing exists, would definitely be a
    suitable solution. Actually, to implement (b) the best would be to have
    a processor that accepts a list of XPath expressions and goes through
    the document, stopping each time one of them matches and returning
    the index of the matched expression. Of course you can do that by
    agglomerating the expressions with an "or", but then you have to find
    out which one matched.
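
    A hand-rolled sketch of that idea: track the current element path during
    a SAX parse and test it against a list of path expressions, reporting
    the index of whichever one matched. (Only plain absolute paths like
    /a/b here, nothing close to full XPath; the document and expressions
    are invented:)

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class PathMatcher {
    public static void main(String[] args) throws Exception {
        // The expressions to watch for; hits records which index matched.
        List<String> exprs = Arrays.asList("/run/setup", "/run/iter/stats");
        String xml = "<run><setup/><iter><stats/></iter>"
                   + "<iter><stats/></iter></run>";
        List<Integer> hits = new ArrayList<>();
        DefaultHandler h = new DefaultHandler() {
            // The stack of open elements, i.e. the current path.
            Deque<String> path = new ArrayDeque<>();
            @Override
            public void startElement(String u, String l, String q,
                                     Attributes a) {
                path.addLast(q);
                String cur = "/" + String.join("/", path);
                int i = exprs.indexOf(cur); // which expression matched, if any
                if (i >= 0) hits.add(i);
            }
            @Override
            public void endElement(String u, String l, String q) {
                path.removeLast();
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), h);
        System.out.println(hits); // [0, 1, 1]
    }
}
```

    In the real thing, the handler would buffer the subtree under each
    match instead of just recording the index.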

    Because of the nature of my queries, I do not believe that a DB system
    with some smart indexing feature would speed up anything. But, once
    again, I'm more concerned with ease of use than with speed. This is
    for benchmarking only, and I will run queries only a small number of
    times on big files. I don't really want to write ad-hoc SAX-based code,
    which sounds like a big pain. If an XML database, free if possible, can
    let me run my queries without too much setup and with decent
    performance, that's perfect for me.

    Again, thanks for your help.

    Luc
    Luc Mercier, Oct 22, 2006
    #7
  8. Luc Mercier wrote:

    > problems in case that occurs. In short: I want to be able to process
    > files bigger than the RAM.


    Indeed, "files bigger than the RAM", that's the crucial point.
    The "classical DOM API" is not compatible with this constraint.
    Maybe the approach Joe described fits better.

    > For the I/O limit, I forgot to mention that my logs are in zipped XML. I
    > compress them on the fly while producing them. Since my files are very
    > repetitive (a small number of different tags, and virtually every other
    > character, excluding spaces, is a digit), I get excellent compression
    > ratios of 20:1 to 30:1. I believe this can speed up the
    > reading/parsing, although I admit I haven't checked.


    Yes, this can avoid the time needed for reading the XML
    data off the hard disk.

    gunzip -c data.xml.gz | ...
    unzip -c data.xml.zip | ...

    This way, the unzipped data never touches the hard disk.

    > use almost all the data contained in the document. So I think a
    > SAX-based XPath processor, if such a thing exists, would definitely be a


    I doubt that such a thing exists.

    Good luck.
    Jürgen Kahrs, Oct 22, 2006
    #8
  9. jay m

    jay m Guest

    Well, there seem to be some open source XML databases.
    I don't know about your other qualifiers, but...

    http://exist.sourceforge.net/
    http://xml.apache.org/xindice/

    are two.

    jay m, Oct 23, 2006
    #9
  10. jay m wrote:

    > Well, there seem to be some open source XML databases
    > I don't know about your other qualifiers, but...
    >
    > http://exist.sourceforge.net/
    > http://xml.apache.org/xindice/
    >
    > are two.


    Interesting. Sounds like "exist" should be able
    to handle large files. But I am not convinced
    that the performance will be acceptable. This
    has to be tried to get an answer.
    Jürgen Kahrs, Oct 23, 2006
    #10
  11. Luc Mercier

    Luc Mercier Guest

    Jürgen Kahrs wrote:
    > jay m wrote:
    >
    >> Well, there seem to be some open source XML databases
    >> I don't know about your other qualifiers, but...
    >>
    >> http://exist.sourceforge.net/
    >> http://xml.apache.org/xindice/
    >>
    >> are two.

    >
    > Interesting. Sounds like "exist" should be able
    > to handle large files. But I am not convinced
    > that the performance will be acceptable. This
    > has to be tried to get an answer.


    Well, as I mentioned, I got serious problems using eXist: when I tried
    to run the very first example they give in the documentation of the
    XML:DB API, my screen went all red, and then KDE logged me out and I
    got to the login screen... I've never had anything like that happen
    before, especially running Java code!

    I tried the two current releases (standard and 'new core'), and both
    produced the same result. So this product does not seem mature enough...

    I'm trying xindice right now.

    Thanks everyone for your suggestions.
    Luc Mercier, Oct 23, 2006
    #11
  12. Luc Mercier

    Luc Mercier Guest

    Luc Mercier wrote:
    > Jürgen Kahrs wrote:
    >> jay m wrote:
    >>
    >>> Well, there seem to be some open source XML databases
    >>> I don't know about your other qualifiers, but...
    >>>
    >>> http://exist.sourceforge.net/
    >>> http://xml.apache.org/xindice/
    >>>
    >>> are two.

    >> Interesting. Sounds like "exist" should be able
    >> to handle large files. But I am not convinced
    >> that the performance will be acceptable. This
    >> has to be tried to get an answer.

    >
    > Well as I mentioned, I got serious problems using eXist: when I tried to
    > run the very first example they give in the documentation of the xml:db
    > api, my screen would get all red, and then Kde logged me out and I got
    > to the login screen... Never had anything like that before, especially
    > running Java code !
    >
    > I tried the two current releases (standard and 'new core'), and both
    > produced the same result. So this product does not seem mature enough...
    >
    > I'm trying xindice right now.
    >
    > Thanks everyone for your suggestions.


    All right, after wasting some time with xindice, I read in the xindice FAQ:


    --
    10. My 5 megabyte file is crashing the command line, help?

    See FAQ #2. Xindice wasn't designed for monster documents, rather, it
    was designed for collections of small to medium sized documents. The
    best thing to do in this case would be to look at your 5 megabyte file,
    and determine whether or not it's a good candidate for being sliced into
    a set of small documents. If so, you'll want to extract the separate
    documents and add them to a Xindice collection individually. A good
    example of this, would be a massive document of this form:
    --

    So it's not suitable for me. I certainly could slice up my documents,
    but not easily into pieces as small as 5 MB.

    Luc
    Luc Mercier, Oct 23, 2006
    #12
  13. Luc Mercier

    Luc Mercier Guest

    OK, so I found a well-documented list of native XML databases:

    http://www.rpbourret.com/xml/ProdsNative.htm

    Three of them are explicitly said to be designed to handle large documents:
    * 4Suite, 4Suite Server (free)
    * Infonyte DB (commercial)
    * Sonic XML Server(commercial)



    The first one is in Python. I don't know how easy it is to call Python
    code from Matlab. I'm going to check that.

    Does anyone have any experience with either of the other two?

    - Luc.
    Luc Mercier, Oct 23, 2006
    #13
  14. Luc Mercier wrote:

    > Well as I mentioned, I got serious problems using eXist: when I tried to
    > run the very first example they give in the documentation of the xml:db
    > api, my screen would get all red, and then Kde logged me out and I got
    > to the login screen... Never had anything like that before, especially
    > running Java code !


    That's funny. I just remembered this one:

    http://vtd-xml.sourceforge.net/
    VTD-XML is the next generation XML parser
    that goes beyond DOM and SAX in terms of
    performance, memory and ease of use.
    Jürgen Kahrs, Oct 24, 2006
    #14
  15. Jürgen Kahrs wrote:
    >>api, my screen would get all red, and then Kde logged me out and I got
    >>to the login screen... Never had anything like that before, especially
    >>running Java code !


    Congratulations; you found a JVM bug. See if there was a logfile of some
    sort, and if so report it to the folks maintaining that version of Java...

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
    Joseph Kesselman, Oct 24, 2006
    #15
  16. Joseph Kesselman wrote:
    > Jürgen Kahrs wrote:
    >>> api, my screen would get all red, and then Kde logged me out and I got
    >>> to the login screen... Never had anything like that before, especially
    >>> running Java code !

    >
    > Congratulations; you found a JVM bug. See if there was a logfile of some
    > sort, and if so report it to the folks maintaining that version of Java...
    >


    It was Luc Mercier who found it.
    Jürgen Kahrs, Oct 24, 2006
    #16
  17. jay m

    jay m Guest

    Jürgen Kahrs wrote:
    > That's funny. I just remembered this one:
    >
    > http://vtd-xml.sourceforge.net/
    > VTD-XML is the next generation XML parser
    > that goes beyond DOM and SAX in terms of
    > performance, memory and ease of use.


    From the website:

    "Its memory usage is typically between 1.3x~1.5x the size of the XML
    document, "
    and
    " VTD requires that XML document be maintained intact in memory."

    For multi-GB documents, you will need a very well-equipped machine!

    As an associate once told me: "yes, that's a very nice problem".
    Regards
    Jay
    jay m, Oct 26, 2006
    #17
  18. Luc Mercier

    Luc Mercier Guest

    So, finally, after many experiments, I chose Infonyte DB, which was
    clearly the best of everything I tried. It's commercial software, but
    not very expensive; it handles documents up to 1 TB, I think;
    performance is OK; and setting everything up and getting started takes
    5 minutes.

    Thanks again to the people who gave me advice.

    - Luc.
    Luc Mercier, Nov 4, 2006
    #18
