Seeking in huge XML files

Discussion in 'XML' started by Bogomir Engel, Aug 8, 2008.

  1. Hi all,

    For a student project I have to look up information in XML files that
    are several GB in size. Data has to be displayed depending on what the
    user enters in the GUI, and parsing the whole file for every input is
    not feasible. We can't use DOM, since it would load the whole file into
    memory. Our current approaches are based on SAX. We thought of
    generating some sort of index that gives the byte offset in the file
    for every data set. The project has to be implemented in Java, so we
    wanted to do something like

    Reader.skip(offsetBytes)

    That way we could jump to the location of our data set without having
    to parse the whole file. The problem is that we have no idea how to
    obtain the index information. How can you find out where in a file the
    SAX parser currently is (meaning the byte offset)?

    Another point is that our tests with the SAX parser, when skipping bytes
    in its input source, produced this exception:

    Content is not allowed in prolog

    So we are wondering whether it's possible to jump to a given position
    and then parse from there.

    I'd be thankful for any advice, since I'm quite stuck right now. Many thanks!
    Bogomir Engel
    Bogomir Engel, Aug 8, 2008
    #1

  2. We successfully completed the application.

    javax.xml.stream.Location offers a method getCharacterOffset() which
    does exactly what we needed.

    This article was quite helpful:
    Parsing XML documents partially with StAX
    http://www.ibm.com/developerworks/xml/library/x-tipstx2/index.html

    StAX is a very useful tool when you don't have the memory for DOM and
    SAX offers too little control. Because the application pulls the events
    itself, it can decide at any point how to proceed with the parsing, or
    whether to continue at all.
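
    To illustrate that kind of control, here is a minimal sketch (not the
    project's actual code) of parsing a single element from a known byte
    offset with StAX. Since the application pulls the events, it can stop
    right after the matching end tag, so the parser never reaches the
    sibling elements that follow. The names FragmentReader and
    readElementText are made up for this illustration. The sketch assumes
    that the offset points exactly at the '<' of a start tag and that the
    element does not rely on namespace or entity declarations from further
    up in the file. Note that Reader.skip counts characters, so a byte
    offset calls for InputStream.skip (or RandomAccessFile.seek) instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
final class FragmentReader {

    /** Skips to byteOffset, then returns the text inside the element starting there. */
    static String readElementText(InputStream in, long byteOffset)
            throws IOException, XMLStreamException {
        // InputStream.skip may skip fewer bytes than requested, so loop.
        long remaining = byteOffset;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                throw new IOException("Could not skip to offset " + byteOffset);
            }
            remaining -= skipped;
        }

        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        StringBuilder text = new StringBuilder();
        int depth = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT && --depth == 0) {
                    // The element that started at the offset is complete.
                    // Stop pulling so the following siblings are never parsed.
                    break;
                }
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item>first</item><item>second</item></root>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        // ASCII-only sample, so the character index equals the byte index.
        long offset = xml.indexOf("<item>second");
        System.out.println(readElementText(new ByteArrayInputStream(bytes), offset));
    }
}
```

    If the offset lands in the middle of text or inside a tag instead, the
    parser will typically fail with an error along the lines of "Content is
    not allowed in prolog".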

    By the way, given a mapping definition, JiBX
    (http://jibx.sourceforge.net/, highly recommended) created ordinary
    Java objects from the XML data sets for us. We stored the byte offsets
    in those objects during the initial parsing pass.
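
    A minimal sketch of such an initial indexing pass, using
    Location.getCharacterOffset() as mentioned above. This is not the
    project's code: the class OffsetIndexer, the method buildIndex and the
    "id" attribute are assumptions made for the example. Be aware that all
    Location information is optional and that implementations differ in
    whether the offset counts bytes or characters and whether it refers to
    the start or the end of the current tag, so check what your parser
    reports before seeking with these values.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
final class OffsetIndexer {

    /**
     * Streams through the document once and maps the "id" attribute of every
     * element named elementName to the offset the parser reports at that point.
     */
    static Map<String, Integer> buildIndex(InputStream xml, String elementName)
            throws XMLStreamException {
        Map<String, Integer> index = new LinkedHashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // "id" is an assumed key attribute; a null namespace means "any".
                    String id = reader.getAttributeValue(null, "id");
                    // Offset semantics are implementation-dependent (see note above).
                    index.put(id, reader.getLocation().getCharacterOffset());
                }
            }
        } finally {
            reader.close();
        }
        return index;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<data><record id=\"a\"><v>1</v></record>"
                + "<record id=\"b\"><v>2</v></record></data>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(buildIndex(in, "record"));
    }
}
```

    Only the small id-to-offset map has to stay in memory, while the file
    itself is streamed once from start to end.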
    Bogomir Engel, Sep 3, 2008
    #2
