XML parsing problem

Discussion in 'Perl Misc' started by Kurt Klinner, Aug 21, 2003.

  1. Kurt Klinner


    Hello,

    While trying to parse a "large" XML document I found some
    strange behaviour in the parser modules (XML::Parser::PerlSAX,
    XML::Parser, XML::Parser::Expat).

    If my XML file is larger than 65536 bytes,
    the current character string is interrupted and a whitespace
    character is added.

    For example:

    <DATASET>
    <DATA><![CDATA["NOVDEC_B"]]></DATA>
    <DATA><![CDATA["November\December"]]></DATA>
    <DATA><![CDATA["Nov\Dec"]]></DATA>
    <DATA><![CDATA["01.11."]]></DATA>
    <DATA><![CDATA[11]]></DATA>
    <DATA><![CDATA["begin_2month"]]></DATA>
    <DATA><![CDATA[11]]></DATA>
    </DATASET>

    If "November\December" now falls on the 65536-byte boundary, the string
    is split into "Nov WHITESPACE ember\December".

    Any ideas how to avoid or fix that problem?


    Thanks in advance

    Regards

    Kurt
     
    Kurt Klinner, Aug 21, 2003
    #1

  2. Kurt Klinner wrote:

    > While trying to parse a "large" XML document I found some
    > strange behaviour in the parser modules (XML::Parser::PerlSAX,
    > XML::Parser, XML::Parser::Expat).
    >
    > If my XML file is larger than 65536 bytes,
    > the current character string is interrupted and a whitespace
    > character is added.



    This is documented behaviour:

    in XML::Parser::Expat (I know, you have to know where to look ;--(

    · Char (Parser, String)
    This event is generated when non-markup is recognized. The
    non-markup sequence of characters is in String. A single non-markup
    sequence of characters may generate multiple calls to this
    handler. Whatever the encoding of the string in the original
    document, this is given to the handler in UTF-8.

    All books and tutorials about XML::Parser show how to handle this: buffer
    the text in the character handler and output it when you get any other
    event. If you use SAX you can use XML::Filter::BufferText (set up a
    pipeline using XML::SAX::Machines and put an XML::Filter::BufferText
    object in as the first handler in the pipeline).

    Incidentally, I believe most SAX parsers behave that way: character
    handlers can be called several times for the content of a single element.
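
    The buffering approach described above can be sketched as follows. This
    is a minimal illustration, not code from the XML::Parser documentation:
    the helper name parse_data_values and the variables $buffer and $in_data
    are made up for the example, and it assumes only the text inside <DATA>
    elements is of interest. For simplicity it flushes the buffer on the
    closing </DATA> tag rather than on every non-character event.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: collect the text of every <DATA> element, joining
# the pieces that Expat may deliver across several Char handler calls.
sub parse_data_values {
    my ($xml) = @_;
    my @values;           # completed <DATA> contents
    my $buffer  = '';     # text accumulated for the current element
    my $in_data = 0;      # true while inside a <DATA> element

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element) = @_;
                if ($element eq 'DATA') {
                    $in_data = 1;
                    $buffer  = '';
                }
            },
            # May fire more than once for one run of text, so only append.
            Char => sub {
                my ($expat, $string) = @_;
                $buffer .= $string if $in_data;
            },
            # The element is complete here, so the buffer holds all its text.
            End => sub {
                my ($expat, $element) = @_;
                if ($element eq 'DATA') {
                    push @values, $buffer;
                    $in_data = 0;
                }
            },
        },
    );

    $parser->parse($xml);
    return @values;
}

my $xml = <<'XML';
<DATASET>
<DATA><![CDATA["November\December"]]></DATA>
<DATA><![CDATA[11]]></DATA>
</DATASET>
XML

print "$_\n" for parse_data_values($xml);
```

    Because the Char handler only appends and the value is read in the End
    handler, it no longer matters where the parser happens to split the text.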

    __
    Michel Rodriguez
    Perl & XML
    http://xmltwig.com
     
    Michel Rodriguez, Aug 22, 2003
    #2
