XML SAX parser bug?

Discussion in 'Python' started by mitsura@skynet.be, Jan 19, 2006.

  1. Guest

    Hi,

    I think I ran into a bug in the XML SAX parser.

    Part of my program consists of reading a rather large XML file (about
    10 MB) containing a few thousand elements.
    I have the following problem: sometimes the SAX parser misreads a
    line.
    Let me explain. The XML file contains a few thousand lines like this:
    "
    <TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>
    "
    where 'n91c90a.cmc.com' is the name of a system and thus changes per
    system.
    In a few cases, the SAX parser misreads the line. The parser sometimes
    splits the characters of the line into:
    "WINOSSPI:Storage@@n" and "91c90a.cmc.com".
    I put a 'print characters' line in the 'characters' method of the
    parser; that is how I found out.
    It only happens for a few of the thousand lines, but you can imagine
    that it is very annoying.

    I checked for errors in the XML file but the file seems ok.

    Is this a bug or am I doing something wrong?
    I am new to Python.

    I am using Python 2.4.1, pyWin32 extension 2.4 and PyXML 0.8.4

    Any help very much appreciated.

    Kris
     
    , Jan 19, 2006
    #1

  2. wrote:

    > I think I ran into a bug in the XML SAX parser.
    >
    > Part of my program consists of reading a rather large XML file (about
    > 10 MB) containing a few thousand elements.
    > I have the following problem: sometimes the SAX parser misreads a
    > line.
    > Let me explain. The XML file contains a few thousand lines like this:
    > "
    > <TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>
    > "
    > where 'n91c90a.cmc.com' is the name of a system and thus changes per
    > system.
    > In a few cases, the SAX parser misreads the line. The parser sometimes
    > splits the characters of the line into:
    > "WINOSSPI:Storage@@n" and "91c90a.cmc.com".
    > I put a 'print characters' line in the 'characters' method of the
    > parser; that is how I found out.
    > It only happens for a few of the thousand lines, but you can imagine
    > that it is very annoying.
    >
    > I checked for errors in the XML file but the file seems ok.
    >
    > Is this a bug or am I doing something wrong?


    it's not a bug; the parser is free to split up character runs (due to buffering,
    entities or character references, etc). it's up to you to merge character runs
    into strings.
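To see how the text actually arrives, one option is to record every characters() call. This is a minimal sketch, assuming Python 3's standard xml.sax module; ChunkRecorder and record_chunks are hypothetical names made up for illustration. The document is fed to the parser in two pieces, cut at the same place as in the original report. How many chunks come back depends on the parser and its buffering, so the sketch relies only on the fact that the chunks join back to the full text.

```python
import xml.sax


class ChunkRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records every characters() call."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Each call delivers some part of the text, not necessarily all of it.
        self.chunks.append(content)


def record_chunks(pieces):
    """Feed the document piece by piece and return the recorded chunks."""
    recorder = ChunkRecorder()
    parser = xml.sax.make_parser()
    parser.setContentHandler(recorder)
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return recorder.chunks


chunks = record_chunks([
    b"<TargetRef>WINOSSPI:Storage@@n",
    b"91c90a.cmc.com</TargetRef>",
])
print(chunks)           # one string or several, depending on the parser
print("".join(chunks))  # the chunks joined back together
```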

    </F>
     
    Fredrik Lundh, Jan 19, 2006
    #2

  3. Guest

    Fredrik Lundh wrote:

    > it's not a bug; the parser is free to split up character runs (due to buffering,
    > entities or character references, etc). it's up to you to merge character runs
    > into strings.
    >
    > </F>

    Thanks for the feedback,

    But how do I detect that the parser has split up the characters? I guess
    I need to detect it in order to reconstruct the complete string.
     
    , Jan 19, 2006
    #3
  4. wrote:
    > But how do I detect that the parser has split up the characters? I guess
    > I need to detect it in order to reconstruct the complete string.


    Don't try to detect it. Instead, assume it always happens, and collect
    the strings in characters(), rather than processing them. Do something
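    like this:

    def startElement(self, name, attrs):
        # start a fresh buffer for this element
        self.chardata = ""

    def characters(self, data):
        # may be called several times per text run
        self.chardata += data

    def endElement(self, name):
        # the text is complete only here
        process(self.chardata)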

    This is simplified - you might have to deal with nested elements,
    somehow.
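One way to handle the nested-element case is to keep a stack with one buffer per open element. This is a minimal sketch, assuming Python 3's standard xml.sax module; TextCollector, collect_text and the sample document are hypothetical and made up for illustration. Each element's direct text is joined only in endElement(), so it does not matter how many characters() calls delivered it.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: merges the direct text of every element."""

    def __init__(self):
        super().__init__()
        self._buffers = []  # one list of chunks per currently open element
        self.results = []   # (element name, merged text) pairs

    def startElement(self, name, attrs):
        self._buffers.append([])

    def characters(self, content):
        # Append to the innermost open element; never process here.
        if self._buffers:
            self._buffers[-1].append(content)

    def endElement(self, name):
        # The text of this element is complete only at this point.
        text = "".join(self._buffers.pop())
        self.results.append((name, text))


def collect_text(xml_bytes):
    """Parse xml_bytes and return (name, text) pairs in closing order."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.results


doc = (b"<Targets>"
       b"<TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>"
       b"<TargetRef>a &amp; b</TargetRef>"
       b"</Targets>")
for name, text in collect_text(doc):
    print(name, repr(text))
```

Elements are reported in the order they close, so the outer Targets element comes last, with an empty string because it has no text of its own.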

    Regards,
    Martin
     
    "Martin v. Löwis", Jan 19, 2006
    #4
  5. Guest

    wrote:
    > Fredrik Lundh wrote:
    > > wrote:
    > > > I think I ran into a bug in the XML SAX parser.
    > > >
    > > > Part of my program consists of reading a rather large XML file (about
    > > > 10 MB) containing a few thousand elements.
    > > > I have the following problem: sometimes the SAX parser misreads a
    > > > line.

    > >
    > > it's not a bug; the parser is free to split up character runs (due to buffering,
    > > entities or character references, etc). it's up to you to merge character runs
    > > into strings.

    >
    > But how do I detect that the parser has split up the characters? I guess
    > I need to detect it in order to reconstruct the complete string.


    Here's a recipe:

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/265881

    Using this filter you can then write SAX code that assumes normalized
    text events. Also, 4Suite's SAX implementation, Saxlette,
    automatically does this text event merging for you at C speed:

    http://4suite.org/docs/CoreManual.xml#saxlette
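The filter idea can be sketched with the standard library's XMLFilterBase. This is not the code from the recipe or from Saxlette; it is a simplified illustration assuming Python 3, and TextMergeFilter and Recorder are hypothetical names. The filter holds back text chunks and forwards them as a single characters() event just before the next element event, so the downstream handler can assume merged text. It ignores namespaces, processing instructions and ignorable whitespace.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase


class TextMergeFilter(XMLFilterBase):
    """Hypothetical filter: merges adjacent characters() events."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._pending = []

    def _flush(self):
        # Forward all buffered chunks downstream as one event.
        if self._pending:
            super().characters("".join(self._pending))
            self._pending = []

    def characters(self, content):
        self._pending.append(content)

    def startElement(self, name, attrs):
        self._flush()
        super().startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        super().endElement(name)

    def endDocument(self):
        self._flush()
        super().endDocument()


class Recorder(xml.sax.ContentHandler):
    """Hypothetical downstream handler that assumes merged text events."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def characters(self, content):
        self.texts.append(content)


recorder = Recorder()
reader = TextMergeFilter(xml.sax.make_parser())
reader.setContentHandler(recorder)
reader.parse(io.BytesIO(
    b"<TargetRef>WINOSSPI:Storage@@n91c90a.cmc.com</TargetRef>"))
print(recorder.texts)
```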

    --
    Uche Ogbuji Fourthought, Inc.
    http://uche.ogbuji.net http://fourthought.com
    http://copia.ogbuji.net http://4Suite.org
    Articles: http://uche.ogbuji.net/tech/publications/
     
    , Feb 7, 2006
    #5
