How to scale and/or "object orient" SAX parsing for big files?

Discussion in 'XML' started by scott.david.brown@gmail.com, Jan 14, 2008.

  1. Guest

    So I have used DOM for sometime to parse my XML documents. But I have
    arrived at a point where my document is just too big to want to use
    DOM. So I am experimenting with SAX (technically, SAX2). But I have
    run into a conceptual issue that I just can't get around regarding the
    writing of a content handler. In my document, I have a lot of
    different elements (many element names) and the element tree can get
    very deep (many levels of nested elements). Furthermore, I really
    have a few conceptual blocks and sub-blocks of XML and I might want to
    have the ability to parse a file that has just one of these blocks.

    All the SAX examples are very simplistic and show how to make a
    handler that deals with (1) a very small number of element types
    (small number of element names) and (2) very shallow element trees.
    If I extend this approach for my application, I end up trying to
    create a single handler for the entire document which becomes
    hideous. In short, it would be nice to make a set of objects that
    handle various parts of the file. Then I can re-use those objects to
    parse these blocks as part of a single file or as a file that only
    contains the block. What confuses me is this: I can make a set of
    objects for different content blocks, but how to I use them?

    One of the ways I could see this happening is to simply change the
    handler as I parse. When I see element "A" in startElement, I could
    change the handler object to the one specialized for this "A" content,
    and when I see it again in endElement I can switch it back to the
    previous handler. However, these is where I find the SAX
    documentation confusing (technically, the Xerces-C++ implementation,
    but I looked at the JavaDocs too). How do I get the "handle" to the
    current XML stream being processed? There is an InputSource object
    that abstracts the source of the XML content, but I can't figure out
    how to get it from within startElement. Furthermore, I have no idea
    if that object has all the information about that current parsing
    state. For example, does it know where the parser is currently
    processing? If I feed it to my the parser to handle this next block,
    would it know where to pick up the work? And how to I get my hands on
    the parser object to change the handler object from withing a
    handler's startElement function? When I make my handler object and
    set to be the handler for the parser, do I just need to store a
    reference to the InputSource and parser in my handler (as member
    variables) so I have access to them later? If I change the handler
    while parsing, does it do what I expect?

    I have spent quite some time looking for discussions for how to scale
    SAX to these types of problems and I haven't had much luck. So I am
    hoping to create some discussion here.
     
    , Jan 14, 2008
    #1
    1. Advertising

  2. wrote:
    > All the SAX examples are very simplistic and show how to make a
    > handler that deals with (1) a very small number of element types
    > (small number of element names) and (2) very shallow element trees.
    > If I extend this approach for my application, I end up trying to
    > create a single handler for the entire document which becomes
    > hideous.


    Depth of the tree shouldn't matter to the handler itself, since the
    handler is only processing one event at a time -- no recursion issues.

    Complexity of the XML language certainly affects complexity of the
    handler, and you may indeed want to break it up into sub-handlers,
    switching off which one is being dispatched to as you parse through the
    document. In my experience this has rarely been done unless the
    sub-handler represents another namespace -- another XML-based language
    -- which has been plugged into this point of the document and which
    wants to be written as a reusable module, but it might indeed yield
    faster processing if you only have to dispatch the events actually
    expected at this point in the document.

    (Of course that requires that you know the exact structure of the
    expected document. I believe IBM has a patent or two regarding
    automatically high-performance parsers for known schemas, though those
    go beyond simply tuning the handlers.)

    > In short, it would be nice to make a set of objects that
    > handle various parts of the file. Then I can re-use those objects to
    > parse these blocks as part of a single file or as a file that only
    > contains the block. What confuses me is this: I can make a set of
    > objects for different content blocks, but how to I use them?


    Each one has to know how to hand off handling of its contents to the
    appropriate sub-handler. Maintain a stack so the sub-handler can return
    control to its caller. This is much like handling event streams in other
    event-driven models -- or much like handling token streams in a
    traditional (lexx/yacc) parser.

    > How do I get the "handle" to the
    > current XML stream being processed?


    Actually, you don't need the stream; you need the parser so you can tell
    it which handler to call. I don't think you can retrieve that from the
    SAX event; you have to track it in your application code.

    > For example, does it know where the parser is currently
    > processing?


    That's the parser's problem, not yours.

    If you're feeling paranoid about how the parser will response to having
    the handler changed in mid-parse -- though this should work in a
    properly implemented SAX parser -- the workaround would be to implement
    your own "stub" handler which dispatches to other handlers, and redirect
    it as neeeded. Clearly, changing that delegation pointer is a
    well-defined/well-contained operation.

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
     
    Joseph Kesselman, Jan 14, 2008
    #2
    1. Advertising

  3. Guest

    On 14 Jan, 03:57, wrote:
    > So I have used DOM for sometime to parse my XML documents.  But I have
    > arrived at a point where my document is just too big to want to use
    > DOM.  So I am experimenting with SAX (technically, SAX2). But I have
    > run into a conceptual issue that I just can't get around regarding the
    > writing of a content handler.  In my document, I have a lot of
    > different elements (many element names) and the element tree can get
    > very deep (many levels of nested elements).  Furthermore, I really
    > have a few conceptual blocks and sub-blocks of XML and I might want to
    > have the ability to parse a file that has just one of these blocks.
    > ...


    If you're running out of steam using DOM, but SAX is proving
    difficult, one option might be to consider data binding. This
    converts XML documents into programming language specific objects,
    such as (in our case) C++, Java or some other.

    How much smaller the in memory usage is depends on your XML, but could
    easily be a quarter or less than the DOM size.

    You do need an XML schema for your data, and it may not be appropriate
    for huge, Huge, HUGE documents, but I'll let you judge that.

    You can find out more from our web site at:

    http://www.codalogic.com/lmx/

    or contact me directly.

    HTH,

    Pete Cordell
    Codalogic
    Visit http://www.codalogic.com/lmx/ for XML C++ data binding
     
    , Jan 14, 2008
    #3
  4. Guest

    On Jan 14, 11:10 am, Joseph Kesselman <>
    wrote:
    > > All the SAX examples are very simplistic and show how to make a
    > > handler that deals with (1) a very small number of element types
    > > (small number of element names) and (2) very shallow element trees.
    > > If I extend this approach for my application, I end up trying to
    > > create a single handler for the entire document which becomes
    > > hideous.

    >
    > Depth of the tree shouldn't matter to the handler itself, since the
    > handler is only processing one event at a time -- no recursion issues.
    >

    Yeah, I didn't make that very clear. Yes, my handler only addresses a
    single event at a time. I'm not trying to peak forward or anything
    bad like
    that.

    What I was trying to say was that when I have an element tree that is
    maybe levels 10 deep and has 10 unique element types at each level
    than I have 100 possible element types to worry about and my
    startElement gets messy. If I have a single handler for my file, then
    startElement has 100 possible elements to look through for a match
    because there are 100 different element names that could be
    encountered.
    Does that make sense, or am I really missing something? Because
    I really feel like I am. Using a single handler for a whole file
    means
    (to me) that it needs to handle every possible element type with
    little means to take advantage of the fact that I know a given element
    will only have a select set of sub-elements. I guess startElement
    could call a series of functions based on what element I am within,
    but
    it just doesn't scale up very well at all.

    What I think I want to do is break that up into a set of handlers that
    work
    on specific parts of the tree. As a result, each handler will be
    smaller and
    easier to write, test and maintain.

    But, please, please, feel free to set me straight. I feel like I must
    be
    missing something when people start to parse real documents that
    have something other that 1-2 levels of elements and only a handful
    of possible elements. I've looked at the specs for the XML office
    documents. There is no way they are using a single handler for the
    1000+ different element types in those files.

    I'm working on a prototype tonight that uses multiple handlers, tracks
    the
    handler stack (actually, each handler just stores a pointer to the
    prior
    handler) and changes the handler in the parser as it goes along. I
    will
    keep you posted.

    Thanks for your comments.
     
    , Jan 15, 2008
    #4
  5. Guest

    I think my design experiment is a success.

    I made a new interface class that will be used for all my content
    handlers. It takes the parser and the previous handler as a
    constructor argument. When I want to transition from my current
    handler to a new handler (because I found an element that has a
    specialized handler), I create the new handler and pass the parser
    (which is already stored in the current handler) and the current
    handler as the previous handler to this new handler. I then set the
    current handler in the parser (again, which is already stored in the
    current handler) to my new handler. When that handler sees the
    endElement for itself, it simply sets the handler in the parser back
    to the previous handler. Simply having the previous handler stored in
    the current handler avoids having to explicitly make a complicated
    external stack of handlers. The handlers push themselves and pop
    themselves simply by changing the current handler systematically.

    The nice thing is that I can now build up a set of handlers that can
    be easily reused for various sections of a large XML file, and each
    handler is specialized for a much smaller number of elements (the set
    of elements specific to that object being described). In this case,
    my document was a description of a bunch of geometry. So I have
    handlers for basic boxes, spheres, etc. These are potentially
    embedded within other geometric objects in my XML, etc. But rather
    than one handler that addresses every possible element, I have a nice
    hierarchy of handlers that work with each other.

    I don't think this design is that "dirty" or "shameful", but I would
    still take comments from people if they have a better approach.
     
    , Jan 15, 2008
    #5
  6. Perfectly reasonable.

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
     
    Joseph Kesselman, Jan 15, 2008
    #6
  7. writes:

    > I think my design experiment is a success.
    >
    > I made a new interface class that will be used for all my content
    > handlers. It takes the parser and the previous handler as a
    > constructor argument. When I want to transition from my current
    > handler to a new handler (because I found an element that has a
    > specialized handler), I create the new handler and pass the parser
    > (which is already stored in the current handler) and the current
    > handler as the previous handler to this new handler. I then set the
    > current handler in the parser (again, which is already stored in the
    > current handler) to my new handler. When that handler sees the
    > endElement for itself, it simply sets the handler in the parser back
    > to the previous handler. Simply having the previous handler stored in
    > the current handler avoids having to explicitly make a complicated
    > external stack of handlers. The handlers push themselves and pop
    > themselves simply by changing the current handler systematically.


    This sound pretty much like a state machine (or a finite automata) or
    something like this, but very similar.

    Why not make a "global" state variable, and in the handler something
    like a "switch" or a "cond" block acting appropriately, choosing the
    proper handler function?

    With best regards,
    Victor
     
    Victor Anyakin, Jan 18, 2008
    #7
  8. Guest

    On Jan 18, 9:19 am, Victor Anyakin <>
    wrote:
    > This sound pretty much like a state machine (or a finite automata) or something like this, but very similar.
    >
    > Why not make a "global" state variable, and in the handler something like a "switch" or a "cond" block acting appropriately, choosing the proper handler function?
    >

    You are right, it really is a state machine. If I understand your
    suggestion correctly, I think there are two possible problems for my
    application.

    First has to do with the nature of SAX -- it's event driven. So if I
    see the start of an element called "string" I need to know what
    element I am already in, so I know what to do with that string. I
    think this is the #1 item that the SAX examples ignore. In really big
    XML documents, chances are you have elements with the same names that
    appear in different parts of the tree. So when you see an element
    with that name you can't just do something directly with it. You need
    to know if that element named "string" is a sub-element of something
    called "firstname" or something called "lastname" so you know what to
    do with it. Sure, you could make your XML element names more unique,
    but in very large and complicated XML documents (where SAX should work
    better with than DOM) element name collisions will happen. And to try
    and avoid them places an needless burden on your freedom to design the
    element structure.

    The second is, again, related to scale. I'm working on a large
    document with 1,000s of different element types. You can't make a
    switch with 1,000 cases. However, some of my lower level handlers do
    take your approach (of a "global" status). I have a private
    enumerated type in the handler that tracks what I am currently parsing
    so we know what to do when we see certain elements. This is the one
    thing that boggles my mind. I really don't see how someone could
    scale SAX except for the way I am doing it now.

    In hindsight, the amount of code I must write to use SAX is very
    frustrating. I tried SAX because I was looking forward to the
    potential performance over DOM for my large document. However, I
    think I might go back to DOM because this seems to be so poor at
    scaling.

    Isn't there anyone out there that can tell us why SAX is supposed to
    be so awesome? I'm not seeing it!
     
    , Jan 24, 2008
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Casey Hawthorne
    Replies:
    0
    Views:
    351
    Casey Hawthorne
    Oct 17, 2005
  2. Shaguf
    Replies:
    0
    Views:
    562
    Shaguf
    Dec 24, 2008
  3. Shaguf
    Replies:
    0
    Views:
    511
    Shaguf
    Dec 26, 2008
  4. Shaguf
    Replies:
    0
    Views:
    279
    Shaguf
    Dec 26, 2008
  5. Shaguf
    Replies:
    0
    Views:
    260
    Shaguf
    Dec 24, 2008
Loading...

Share This Page