XML parsing: SAX/expat & yield

Discussion in 'Python' started by kj, Aug 4, 2010.

  1. kj

    kj Guest

    I want to write code that parses a file that is far bigger than
    the amount of memory I can count on. Therefore, I want to stay as
    far away as possible from anything that produces a memory-resident
    DOM tree.

    The top-level structure of this xml is very simple: it's just a
    very long list of "records". All the complexity of the data is at
    the level of the individual records, but these records are tiny in
    size (relative to the size of the entire file).

    So the ideal would be a "parser-iterator", which parses just enough
    of the file to "yield" (in the generator sense) the next record,
    thereby returning control to the caller; the caller can process
    the record, delete it from memory, and return control to the
    parser-iterator; once the parser-iterator regains control, it repeats
    this sequence starting where it left off.
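
    In usage terms, what I'm after is an interface along these lines
    (iter_records and process are hypothetical names, just to fix
    ideas):

        for record in iter_records("huge.xml"):
            process(record)  # each record can be freed before the next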

    The problem, as I see it, is that SAX-type parsers like expat want
    to do everything with callbacks, which is not readily compatible
    with the generator paradigm I just described.

    Is there a way to get an xml.parsers.expat parser (or any other
    SAX-type parser) to stop at a particular point to yield a value?

    The only approach I can think of is to have the appropriate parser
    callbacks raise an exception wherever a yield would have been.
    The exception-handling code would contain the actual yield statement,
    followed by code that restarts the parser where it left off.
    Additional logic would be needed to read the input file into
    memory piecemeal.
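
    One thing that might sidestep the exceptions entirely: expat's
    Parse() can be fed data a chunk at a time, so a generator could
    feed one chunk per step and yield whatever records the callbacks
    completed during that chunk. A rough sketch (the "record" tag name
    and the dict representation are just assumptions about my data):

        import xml.parsers.expat

        def iter_records(fileobj, tag="record", chunk_size=64 * 1024):
            records = []    # records completed during the current chunk
            current = None  # the record being built, if any

            def start(name, attrs):
                nonlocal current
                if name == tag:
                    current = {"attrs": attrs, "text": []}

            def char(data):
                if current is not None:
                    current["text"].append(data)

            def end(name):
                nonlocal current
                if name == tag and current is not None:
                    records.append(current)
                    current = None

            parser = xml.parsers.expat.ParserCreate()
            parser.StartElementHandler = start
            parser.CharacterDataHandler = char
            parser.EndElementHandler = end

            while True:
                chunk = fileobj.read(chunk_size)
                parser.Parse(chunk, not chunk)  # isfinal=True at EOF
                yield from records              # hand over finished records
                records.clear()
                if not chunk:
                    break

    (The file would be opened in binary mode, e.g. open("big.xml", "rb").)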

    But I'm not very conversant with SAX parsers, and even less with
    generators, so all this may be unnecessary, or way off.

    If you have any other tricks or suggestions for turning a SAX
    parser into a generator, please let me know.

    ~K
     
    kj, Aug 4, 2010
    #1

  2. Peter Otten

    Peter Otten Guest

    kj wrote:

    > I want to write code that parses a file that is far bigger than
    > the amount of memory I can count on. Therefore, I want to stay as
    > far away as possible from anything that produces a memory-resident
    > DOM tree.
    >
    > The top-level structure of this xml is very simple: it's just a
    > very long list of "records". All the complexity of the data is at
    > the level of the individual records, but these records are tiny in
    > size (relative to the size of the entire file).
    >
    > So the ideal would be a "parser-iterator", which parses just enough
    > of the file to "yield" (in the generator sense) the next record,
    > thereby returning control to the caller; the caller can process
    > the record, delete it from memory, and return control to the
    > parser-iterator; once the parser-iterator regains control, it repeats
    > this sequence starting where it left off.


    How about

    http://effbot.org/zone/element-iterparse.htm#incremental-parsing
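
    Untested sketch of the pattern from that page, assuming your
    records are <record> elements:

        import xml.etree.ElementTree as ET

        def iter_records(path, tag="record"):
            # iterparse() parses incrementally, firing an "end" event
            # each time an element is completed.
            context = ET.iterparse(path, events=("start", "end"))
            _, root = next(context)  # first event: the root element
            for event, elem in context:
                if event == "end" and elem.tag == tag:
                    yield elem
                    root.clear()  # free finished records

    The root.clear() after each yield is what keeps the partially
    built tree from growing along with the file.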

    Peter
     
    Peter Otten, Aug 4, 2010
    #2

  3. kj

    kj Guest

    kj, Aug 4, 2010
    #3
