Xerces advanced usage - progress, random access etc

Discussion in 'XML' started by Kza, Sep 4, 2006.

  1. Kza

    Kza Guest

    Hi, I am currently using the Xerces SAX parser for C++ (I use DOM too, but
    I think SAX is more relevant here) to process and display fairly large
    XML files. Usually I give Xerces a filename, it parses the file, and
    that's all good. But the customer needs more features.

    Feature 1: A progress display. I have tried a few times now to find a
    way of asking Xerces how far through a file it is in bytes, but no
    luck. (I did try a per-element check, but that involves a whole extra
    parse at the start just to count the elements.) I have tried using the
    LocalFileInputSource, getting its BinInputStream and calling its
    curPos, but it's always 0.

    Any ideas?

    Feature 2: Loading only a "screenful" of the file at a time. I also
    would like some sort of random access functionality, so if the user
    scrolls down to 75% of the file, the parser skips forward to that
    position and starts reading there, and when they scroll back up it goes
    up and reads just that little bit of the file.

    I am pretty sure feature 1 is possible with normal Xerces SAX, but I
    have no idea how. The documentation is very sparse: it names the
    functions but doesn't actually say what they do or how they should
    be used.

    For feature 2 it might be more complicated. A colleague mentioned some
    other "object models" like xparse and xalaron (not sure how that's
    pronounced or spelt), some Apache project that parses XML in a random
    access fashion.

    Anyone got any ideas?

    Thanks a lot.
     
    Kza, Sep 4, 2006
    #1

  2. Kza wrote:
    > Feature 1: A progress display.


    The SAX APIs can be persuaded to give line/column information, though
    unless you know how many lines there were in the file before you started
    parsing it that doesn't do you any good. Look at the Locator API.

    The DOM assumes reading the file is a single operation, so the concept
    of getting incremental details doesn't make much sense. You *could* plug
    in a stream filter between wherever the file is being read from and the
    parser, and set up that filter so it counts characters going by --
    that's going to give you only a very rough progress indication, and
    again it requires that you know the length before you start if you want
    to report it as a percentage-complete number.

    > Feature 2: Loading only a "screenful" of the file at a time.


    "Screenful" is not defined in XML. Nor is starting parse from the middle
    of a file. You could try to do something with incremental processing,
    via throttling of a SAX stream -- I've done that in the past -- but
    keeping track of when enough has been read to fill a screen and when
    more would have to be read to fill the next screen is very much an
    application problem rather than a parser problem.

    Random-access to an XML model isn't a problem -- the DOM can do that,
    though again it isn't designed to operate on screenfuls -- but
    random-order parsing really doesn't make sense. Namespaces are
    context-dependent, to take one major point where that idea breaks down.


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
     
    Joe Kesselman, Sep 5, 2006
    #2

  3. "Kza" <> writes:

    > Feature 1: A progress display. I have tried a few times now to find a
    > way of asking xerces how far through a file it is in bytes, but no
    > luck. (I did try a per element check, but that involves a whole extra
    > parse at the start just to count the elements). I have tried using the
    > LocalFileInputSource, and getting its BinInputStream and calling its
    > curPos, but it's always 0.
    >
    > Any ideas?


    You can implement your own InputStream which will keep track of how
    much data Xerces-C++ has consumed so far. Combine this with the total
    length of the file and you can calculate the progress.
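
    A minimal sketch of that byte-counting idea, assuming nothing about
    Xerces itself: CountingInput below is a hypothetical stand-alone wrapper
    around a std::istream, not a Xerces class. In Xerces-C++ the same
    bookkeeping would live in your own BinInputStream subclass (the class
    that has readBytes and curPos), handed to the parser through a custom
    InputSource; the exact signatures vary by version, so check the headers
    you build against.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical wrapper (not part of Xerces-C++): forwards reads to an
// underlying std::istream and remembers how many bytes it has handed out.
class CountingInput {
public:
    CountingInput(std::istream& source, unsigned long long totalBytes)
        : source_(source), total_(totalBytes), consumed_(0) {}

    // Reads up to maxToRead bytes into buffer and returns how many were read.
    std::size_t readBytes(char* buffer, std::size_t maxToRead) {
        assert(buffer != nullptr);
        source_.read(buffer, static_cast<std::streamsize>(maxToRead));
        const std::size_t got = static_cast<std::size_t>(source_.gcount());
        consumed_ += got;
        return got;
    }

    // Number of bytes handed to the consumer so far.
    unsigned long long curPos() const { return consumed_; }

    // Progress as a whole percentage in the range 0..100.
    int percentDone() const {
        if (total_ == 0 || consumed_ >= total_) return 100;
        return static_cast<int>(consumed_ * 100ULL / total_);
    }

private:
    std::istream& source_;
    unsigned long long total_;
    unsigned long long consumed_;
};
```

    The parser pulls data in blocks, so a count like this runs ahead of what
    the SAX handlers have actually seen. That is usually fine for a progress
    bar, but it is only a rough indication.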


    > Feature 2: Loading only a "screenful" of the file at a time. I also
    > would like some sort of random access functionality, so if the user
    > scrolls down to 75% of the file, the parser skips forward to that
    > position and starts reading there, and when they scroll back up it goes
    > up and reads just that little bit of the file.


    This one would definitely be easier with an in-memory model (e.g., DOM).


    hth,
    -boris


    --
    Boris Kolpackov
    Code Synthesis Tools CC
    http://www.codesynthesis.com
    Open-Source, Cross-Platform C++ XML Data Binding
     
    Boris Kolpackov, Sep 8, 2006
    #3
  4. Kza

    Kza Guest

    Just as an update here, and I hope top posting is de rigueur for this
    newsgroup,

    I solved feature one with Xerces' getSrcOffset() method, although I
    had to wrap the call in an exception handler: the particular version we
    are using at work at the moment throws an exception once parsing has
    finished (but before the parse method returns), and there's no other
    way to find out when it's finished.
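
    Roughly what that wrapper looks like, as a sketch rather than the code
    we actually ship. To keep it self-contained, the parser is replaced by a
    callable: offsetSource is a hypothetical stand-in for whatever calls
    getSrcOffset() on the parser, and Progress and queryProgress are names
    made up for this example. The catch-all handler is an assumption too; it
    treats any exception from the offset query as "parsing has finished".

```cpp
#include <cassert>
#include <functional>

// Result of one progress poll (hypothetical type for this sketch).
struct Progress {
    bool finished;  // true once the offset query has thrown
    int percent;    // 0..100
};

// offsetSource stands in for a call such as parser.getSrcOffset().
// totalBytes is the file size, obtained separately before parsing starts.
Progress queryProgress(const std::function<unsigned long long()>& offsetSource,
                       unsigned long long totalBytes)
{
    try {
        const unsigned long long offset = offsetSource();
        int percent = 100;
        if (totalBytes > 0 && offset < totalBytes) {
            percent = static_cast<int>(offset * 100ULL / totalBytes);
        }
        assert(percent >= 0 && percent <= 100);
        return Progress{false, percent};
    } catch (...) {
        // Assumption: an exception here means the parse is over.
        return Progress{true, 100};
    }
}
```

    With a real parser the callable would be something like a lambda that
    captures the parser and returns its current source offset.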

    Feature 2 I don't have a solution for at the moment. DOM is not an
    option: the whole point is that a whole file uses up too much memory,
    and DOM loads the whole thing at once. That's why we wanted to load in
    a section at a time.

    If it turns out to be really important to analyse large files, I will
    just have to write a separate program that uses SAX, and maybe only
    filters for certain things, or perhaps reparses when people want to
    "scroll up", which trades time for memory. It's up to the customers
    really. I suspect the real solution is a non-XML indexed binary format.
    But the memory issue isn't actually as big as the customers think it
    is. I will work something out.
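
    The reparse-and-filter idea could be sketched like this. ScreenWindow is
    a hypothetical helper, not anything from Xerces: the application's SAX
    handler would call onRecord() once per displayable record (for example
    from startElement at a fixed depth), and only the records belonging to
    the current screen are kept in memory. What counts as a "record" and how
    it is summarised are assumptions left to the application.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: keeps only the records that belong to one screen.
class ScreenWindow {
public:
    ScreenWindow(std::size_t firstRecord, std::size_t recordsPerScreen)
        : first_(firstRecord), count_(recordsPerScreen), seen_(0) {}

    // Called once per record in document order. Returns true while the
    // parse should continue and false once the window has been filled.
    bool onRecord(const std::string& summary) {
        if (seen_ >= first_ && rows_.size() < count_) {
            rows_.push_back(summary);
        }
        ++seen_;
        assert(rows_.size() <= count_);
        return rows_.size() < count_;
    }

    const std::vector<std::string>& rows() const { return rows_; }
    std::size_t recordsSeen() const { return seen_; }

private:
    std::size_t first_;   // index of the first record on this screen
    std::size_t count_;   // how many records fit on a screen
    std::size_t seen_;    // records encountered so far in this parse
    std::vector<std::string> rows_;
};
```

    Scrolling up then means starting a fresh parse with a smaller
    firstRecord, so the cost is a reparse from the top each time. When
    onRecord() returns false the handler can abandon the parse early, for
    instance by throwing from the handler or, if your Xerces-C++ version
    provides them, by driving the parser with the progressive parseFirst /
    parseNext calls.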

     
    Kza, Sep 8, 2006
    #4
