Problem with SaxParser. Works Occasionally.

Discussion in 'Java' started by stacey, Feb 27, 2007.

  1. stacey

    stacey Guest

    Hello everyone,

    I am using SaxParser, to parse an xml document, and i noticed that
    sometimes it ignores some data, while reading the characters inside an
    element.
    What i mean is:

    My xml file includes many instances of the following structure:

    <pk>
    <absi>769.9541477864069</absi>
    <area>170.0227589457148</area>
    <background>0.0000000000000000</background>
    <chisq>1202.267500954470</chisq>
    <goodn>1607.355201650164</goodn>
    <ind>1106.121302500338</ind>
    <lind>1082.000000000000</lind>
    <mass>922.4428373952809</mass>
    <meth>4</meth>
    <reso>5700.586009913091</reso>
    <rind>1256.000000000000</rind>
    <s2n>6.073893519996163</s2n>
    <type>0</type>
    </pk>

    My java code is quite big, but i debugged it, and i saw that the
    problem is in the function characters.
    From each of the above structures I want to get the absi and mass
    values, and I write them to a file.

    The function characters:
    public void characters(char buf[], int offset, int len) throws
    SAXException {

    String s = new String(buf, offset, len);

    The s string is sometimes not as long as it should be. Can we define the
    offset and the len?

    The problem appears when it reaches the <mass> element. Sometimes it works
    ok, and it reads the whole number 922.4428373952809.
    But other times, when, let's say, the mass value is 1455.668578151738,
    the result is just 14 .

    Do you have any idea what the error is? I can post my code, but I
    didn't do it now because this post is already too big. Maybe someone
    has already encountered the error.


    I would appreciate any help.

    Thank you very very much,

    Stacey


    PS:
    I have many files like this, and I have noticed that my output files
    show a similar pattern: a fault every four lines or so for a while,
    and then it corrects itself:

    922.4428373952809 769.9541477864069
    927.4953784899038 37191.92095290756
    933.5252259507145 8110.517189035567
    940.4653099147753 9868.125486196035
    941.518898 4381.813320162202 <------------ we lose some numbers
    947.5404155021193 2787.439831966368
    954.4881638784998 392.1130341071628
    965.4569335866441 1401.545504962355
    978.4210869646715 438.9369494573742
    984.4917 886.1359194417274 <------------ we lose more numbers
    1003.550150977367 497.0759433683916
    1017.529625718612 3169.151170705610
    1055.582684542875 4943.314163415449
    1066.107033179408 443.6729762946884
    1074.5 3354.853245126279 <------------ we lose more numbers
    1076.531768475310 646.8024403962839
    1083.527120777254 498.7760684249872
    1311.697689088619 16369.00024571709
    1325.729755074315 287.2714228497898
    1349 637.2393867567375 <------------ the number is now an integer (error)
    1373.598186480195 223.2986354292584
    1385.548026377931 431.7554665051648
    1387.553176811347 268.0273520356520
    1443.594333307889 1317.685936747487
    14 661.7093703067692 <------------ it should have read: 1455.668578151738
    1457.610697327313 768.3194420301912
    1467.786043204199 3468.484434990418
    1546.734272041830 565.9503240406932
    1552.544423206343 610.4527352860962
    1566.639317869258 308.7611649665076
    1575.708923737670 1524.695259940473

    (the rest of the file is ok)
     
    stacey, Feb 27, 2007
    #1

  2. stacey

    Chris Uppal Guest

    stacey wrote:

    > The function characters:
    > public void characters(char buf[], int offset, int len) throws
    > SAXException {
    >
    > String s = new String(buf, offset, len);
    >
    > The s string sometimes is not as big as it should. Can we define the
    > offset and the len?
    >
    > The problem is when it reaches the <mass> element. Sometimes it works
    > ok, and it reads all the number 922.4428373952809.
    > But other times, when let say the mass value is 1455.668578151738,
    > the result is just 14 .


    Are you assuming that characters() will always be called with all the text in
    one call? If so, then don't, because it won't. SAX may supply
    "1455.668578151738" in as many separate pieces as it wants to -- even in 17
    different calls with one character each.

    -- chris
     
    Chris Uppal, Feb 27, 2007
    #2

  3. stacey wrote:
    > I am using SaxParser, to parse an xml document, and i noticed that
    > sometimes it ignores some data, while reading the characters inside an
    > element.
    > What i mean is:
    >
    > My xml file includes many instances of the following structure:
    >
    > <pk>
    > <absi>769.9541477864069</absi>
    > <area>170.0227589457148</area>
    > <background>0.0000000000000000</background>
    > <chisq>1202.267500954470</chisq>
    > <goodn>1607.355201650164</goodn>
    > <ind>1106.121302500338</ind>
    > <lind>1082.000000000000</lind>
    > <mass>922.4428373952809</mass>
    > <meth>4</meth>
    > <reso>5700.586009913091</reso>
    > <rind>1256.000000000000</rind>
    > <s2n>6.073893519996163</s2n>
    > <type>0</type>
    > </pk>
    >
    > My java code is quite big, but i debugged it, and i saw that the
    > problem is in the function characters.
    > From every of the above structures i want to get the absi and mass
    > value, and i write them in a file.
    >
    > The function characters:
    > public void characters(char buf[], int offset, int len) throws
    > SAXException {
    >
    > String s = new String(buf, offset, len);
    >
    > The s string sometimes is not as big as it should. Can we define the
    > offset and the len?
    >
    > The problem is when it reaches the <mass> element. Sometimes it works
    > ok, and it reads all the number 922.4428373952809.
    > But other times, when let say the mass value is 1455.668578151738,
    > the result is just 14 .

    A wild guess:
    Could it be, that sometimes the parser passes the characters in two chunks
    instead of in one chunk?
    For example: In most cases the parser might call your handler like this:
    startElement(...) // <mass>
    characters(...) // 1455.668578151738
    endElement(...) // </mass>
    But in some rare cases the parser might call your handler like this:
    startElement(...) // <mass>
    characters(...) // 14
    characters(...) // 55.668578151738
    endElement(...) // </mass>

    Note that both ways are perfectly valid according to the SAX specification.
    Hence your content handler has to cope with the possibility of multiple
    chunks (probably by concatenating the chunks into one string).
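
    To make the two call sequences above concrete, here is a minimal sketch
    that plays the part of the parser by hand. The names ChunkDemo,
    ValueHandler, FirstChunkHandler and JoiningHandler are made up for
    illustration; only DefaultHandler and its callbacks belong to the SAX
    API. The original poster's code was never shown, so FirstChunkHandler is
    just one plausible way to end up with "14".

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: feeds the same text to two handlers in two chunks,
// which is something a SAX parser is allowed to do.
class ChunkDemo {

    // Common base so the demo can read the result from either handler.
    abstract static class ValueHandler extends DefaultHandler {
        String value;
    }

    // Assumed bug: treats the first characters() call as the whole text.
    static class FirstChunkHandler extends ValueHandler {
        @Override
        public void characters(char[] ch, int start, int length) {
            if (value == null) {
                value = new String(ch, start, length);
            }
        }
    }

    // Appends every chunk and only reads the text at endElement().
    static class JoiningHandler extends ValueHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            value = text.toString();
        }
    }

    // Stands in for the parser: <mass>, then each chunk, then </mass>.
    static String feed(ValueHandler handler, String... chunks) {
        try {
            handler.startElement("", "mass", "mass", new AttributesImpl());
            for (String chunk : chunks) {
                char[] buf = chunk.toCharArray();
                handler.characters(buf, 0, buf.length);
            }
            handler.endElement("", "mass", "mass");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.value;
    }

    public static void main(String[] args) {
        System.out.println(feed(new FirstChunkHandler(), "14", "55.668578151738")); // prints 14
        System.out.println(feed(new JoiningHandler(), "14", "55.668578151738"));    // prints 1455.668578151738
    }
}
```

    Driving the handlers directly keeps the example independent of where any
    particular parser happens to split the text.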

    --
    Thomas
     
    Thomas Fritsch, Feb 27, 2007
    #3
  4. stacey

    Guest

    On Feb 27, 9:21 am, "stacey" <> wrote:

    > I am using SaxParser, to parse an xml document, and i noticed that
    > sometimes it ignores some data, while reading the characters inside an
    > element.
    >
    > My xml file includes many instances of the following structure:
    >
    > <pk>
    > <absi>769.9541477864069</absi>
    > <area>170.0227589457148</area>
    > <background>0.0000000000000000</background>
    > <chisq>1202.267500954470</chisq>
    > <goodn>1607.355201650164</goodn>
    > <ind>1106.121302500338</ind>
    > <lind>1082.000000000000</lind>
    > <mass>922.4428373952809</mass>
    > <meth>4</meth>
    > <reso>5700.586009913091</reso>
    > <rind>1256.000000000000</rind>
    > <s2n>6.073893519996163</s2n>
    > <type>0</type>
    > </pk>
    >
    > My java code is quite big, but i debugged it, and i saw that the
    > problem is in the function characters. From every of the above structures i want to get the absi and mass


    > The problem is when it reaches the <mass> element. Sometimes it works
    > ok, and it reads all the number 922.4428373952809.
    > But other times, when let say the mass value is 1455.668578151738,
    > the result is just 14 .


    This is not a bug, actually. SAX doesn't guarantee that text nodes
    will be delivered in a single call to the "characters" method -- in
    the example you gave you should get characters("14") and then
    characters("55.66....") immediately afterwards; it's your
    responsibility to stitch these back into a single string.

    Put off parsing the string into a number until you see the
    corresponding endElement call.
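
    A minimal sketch of that advice, applied to the <pk> records from the
    first post: text is only collected in characters(), and the conversion
    to a number happens in endElement(). The class name PeakHandler, its
    parse() helper and the {mass, absi} pair layout are assumptions made for
    the example, not anything from the original code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for the <pk> records shown in the first post.
class PeakHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private double absi;
    private double mass;
    final List<double[]> peaks = new ArrayList<>(); // one {mass, absi} pair per <pk>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // drop whitespace collected between elements
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // collect only, never parse here
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element has ended, so its text is complete.
        switch (qName) {
            case "absi": absi = Double.parseDouble(text.toString().trim()); break;
            case "mass": mass = Double.parseDouble(text.toString().trim()); break;
            case "pk":   peaks.add(new double[] {mass, absi}); break;
            default:     break;
        }
        text.setLength(0);
    }

    // Parses an XML string and returns the collected pairs.
    static List<double[]> parse(String xml) {
        PeakHandler handler = new PeakHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.peaks;
    }

    public static void main(String[] args) {
        String xml = "<peaks><pk>\n<absi>661.7093703067692</absi>\n"
                + "<mass>1455.668578151738</mass>\n<type>0</type>\n</pk></peaks>";
        for (double[] p : parse(xml)) {
            System.out.println(p[0] + " " + p[1]);
        }
    }
}
```

    The sketch relies on qName, which is what the default (non-namespace-aware)
    SAXParserFactory fills in; a namespace-aware setup would look at localName
    instead.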

    Owen
     
    , Feb 27, 2007
    #4
  5. stacey

    Sem Guest

    On Feb 27, 12:56 pm, "Chris Uppal" wrote:
    > stacey wrote:
    > > The function characters:
    > > public void characters(char buf[], int offset, int len) throws
    > > SAXException {

    >
    > > String s = new String(buf, offset, len);

    >
    > > The s string sometimes is not as big as it should. Can we define the
    > > offset and the len?

    >
    > > The problem is when it reaches the <mass> element. Sometimes it works
    > > ok, and it reads all the number 922.4428373952809.
    > > But other times, when let say the mass value is 1455.668578151738,
    > > the result is just 14 .

    >
    > Are you assuming that characters() will always be called with all the text in
    > one call? If so, then don't, because it won't. SAX may supply
    > "1455.668578151738" in as many separate pieces as it wants to -- even in 17
    > different calls with one character each.
    >
    > -- chris


    Where Can I study more on SaxParser?
    Please help

    --sem
     
    Sem, Feb 27, 2007
    #5
  6. stacey

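    Chris Uppal Guest

    Sem wrote:
    > Where Can I study more on SaxParser?

    Read a good book? I liked:
    Processing XML with Java
    Elliotte Rusty Harold
    Online at:
    http://www.cafeconleche.org/books/xmljava/

    Or you could go to the SAX home page:
    http://www.saxproject.org/

    Or you could read Sun's JavaDocs:
    http://java.sun.com/javase/6/docs/api/org/xml/sax/package-summary.html

    -- chris
     
    Chris Uppal, Feb 27, 2007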
    #6
  7. Chris Uppal wrote:
    > stacey wrote:
    >
    >> The function characters:
    >> public void characters(char buf[], int offset, int len) throws
    >> SAXException {
    >>
    >> String s = new String(buf, offset, len);
    >>
    >> The s string sometimes is not as big as it should. Can we define the
    >> offset and the len?
    >>
    >> The problem is when it reaches the <mass> element. Sometimes it works
    >> ok, and it reads all the number 922.4428373952809.
    >> But other times, when let say the mass value is 1455.668578151738,
    >> the result is just 14 .

    >
    > Are you assuming that characters() will always be called with all the
    > text in one call? If so, then don't, because it won't. SAX may
    > supply "1455.668578151738" in as many separate pieces as it wants to
    > -- even in 17 different calls with one character each.


    Why are you assuming that each call will supply a non-zero number of
    characters? :)
     
    Mike Schilling, Feb 27, 2007
    #7
  8. stacey

    Chris Uppal Guest

    Mike Schilling wrote:

    [me:]
    > > SAX may
    > > supply "1455.668578151738" in as many separate pieces as it wants to
    > > -- even in 17 different calls with one character each.

    >
    > Why are you assuming that each call will supply a non-zero number of
    > characters? :)


    ;-)

    I did consider that issue. I decided it might be a little confusing to mention
    it, though...

    Actually, I can't find anything to suggest that a 0-length sequence is forbidden
    by the SAX spec (such as it is). OTOH, I have no reason to suppose that any
    SAX implementation would actually do it.

    Makes no real difference in practice, since code which is written to work right
    with variable length (sub-)sequences at all will automatically cope with
    0-length sequences too.
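
    A quick check of that last point, as a sketch: appending a zero-length
    slice to a StringBuilder changes nothing, so a handler that concatenates
    chunks needs no special case for empty ones. EmptyChunkDemo and join()
    are illustrative names only.

```java
// Hypothetical demo: an empty chunk in the middle does not affect the result.
class EmptyChunkDemo {

    // Concatenates chunks the same way a characters() implementation would.
    static String join(String... chunks) {
        StringBuilder text = new StringBuilder();
        for (String chunk : chunks) {
            char[] buf = chunk.toCharArray();
            text.append(buf, 0, buf.length); // buf.length may be 0
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(join("14", "", "55.668578151738")); // prints 1455.668578151738
    }
}
```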

    -- chris
     
    Chris Uppal, Feb 27, 2007
    #8
  9. stacey

    Sem Guest

    On Feb 27, 2:01 pm, "Chris Uppal" wrote:
    > Sem wrote:
    > > Where Can I study more on SaxParser?

    >
    > Read a good book? I liked:
    > Processing XML with Java
    > Elliotte Rusty Harold
    > Online at:
    > http://www.cafeconleche.org/books/xmljava/
    >
    > Or you could go to the SAX home page:
    > http://www.saxproject.org/
    >
    > Or you could read Sun's JavaDocs:
    > http://java.sun.com/javase/6/docs/api/org/xml/sax/package-summary.html
    >
    > -- chris


    Thank you very much
    It helps me a lot and I will chew, swallow and speak it soon
    --sem
     
    Sem, Feb 27, 2007
    #9
  10. stacey

    stacey Guest

    Thank you all for answering..and for your help.

    I didn't know that it could take more than one call to get all the
    text.
    I thought that one call was the logical thing.

    Anyway, I will try it now. I hope it works!

    My question about the frequency of the "errors" still stands (I mean
    the fault every four lines).

    Thank you very very much again,

    Really Best Regards,

    Stacey


    On Feb 27, 7:56 pm, "Chris Uppal" wrote:
    > stacey wrote:
    > > The function characters:
    > > public void characters(char buf[], int offset, int len) throws
    > > SAXException {

    >
    > > String s = new String(buf, offset, len);

    >
    > > The s string sometimes is not as big as it should. Can we define the
    > > offset and the len?

    >
    > > The problem is when it reaches the <mass> element. Sometimes it works
    > > ok, and it reads all the number 922.4428373952809.
    > > But other times, when let say the mass value is 1455.668578151738,
    > > the result is just 14 .

    >
    > Are you assuming that characters() will always be called with all the text in
    > one call? If so, then don't, because it won't. SAX may supply
    > "1455.668578151738" in as many separate pieces as it wants to -- even in 17
    > different calls with one character each.
    >
    > -- chris
     
    stacey, Feb 27, 2007
    #10
  11. stacey

    Guest

    Please don't top-post, and trim your replies a bit so the rest of us
    don't have to download the entire thread again every time we read your
    messages. :)

    On Feb 27, 3:45 pm, "stacey" <> wrote:
    > Thank you all for answering..and for your help.
    >
    > I didn't know that there could be more than one calls to get all the
    > text.
    > I thought that one call is logical.


    There are a couple of reasons for SAX to deliver text nodes in multiple
    pieces, the most obvious being the case of ignorable whitespace.
    Consider the following document:

    <doc>
    <text>This has some ignorable whitespace</text>
    </doc>

    The easiest way for SAX to deliver this to the application is:

    startElement ("doc")
    startElement ("text")
    characters ("This has some ")
    characters ("ignorable whitespace")
    endElement ("text")
    endElement ("doc")

    Nothing says the SAX driver can't remove the whitespace internally and
    present it to the application in one call; however, allowing it to
    deliver the text in multiple pieces means the driver can be much
    simpler and memory allocation can be much more predictable.

    It also makes it easier for the driver to deliver character entities:
    it can deliver the text before the entity, the entity's corresponding
    character, and the text following the entity in three separate calls,
    if it's easier to implement.
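
    One way to see this on a given parser is to record every characters()
    call. The sketch below does that for a short text containing &amp;;
    EntityChunks and chunks() are illustrative names. How many chunks come
    back is up to the parser, so the only thing to rely on is that the
    chunks concatenate to the full text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: lists the pieces in which the parser delivers text.
class EntityChunks {

    // Returns one list entry per characters() call made while parsing xml.
    static List<String> chunks(String xml) {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.add(new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> parts = chunks("<text>fish &amp; chips</text>");
        // The number of parts depends on the parser implementation.
        System.out.println(parts.size() + " call(s): " + parts);
        System.out.println(String.join("", parts));
    }
}
```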

    > My question still stands about the frequency of the "errors". ( i mean
    > the every four lines).


    For that, you'll have to talk to the authors behind your SAX driver.

    Owen
     
    , Feb 28, 2007
    #11
  12. stacey

    Adam Maass Guest

    "stacey" <> wrote:
    > <mass>922.4428373952809</mass>
    >
    > The function characters:
    > public void characters(char buf[], int offset, int len) throws
    > SAXException {
    >
    > String s = new String(buf, offset, len);
    >
    > The s string sometimes is not as big as it should. Can we define the
    > offset and the len?
    >
    > The problem is when it reaches the <mass> element. Sometimes it works
    > ok, and it reads all the number 922.4428373952809.
    > But other times, when let say the mass value is 1455.668578151738,
    > the result is just 14 .
    >
    > Do you have any idea what is the error?? I can post my code, but i
    > didn't do it know cause this post is already too big. Maybe someone
    > has already encountered the error..
    >
    >


    I have seen this error too many times to count, and corrected it several
    times.

    The problem is that the "characters" function is not guaranteed to be called
    with the complete set of characters in the element. (A SAX implementation
    typically reads the document through a fixed-length char[], and the boundary
    of that array sometimes lands in the middle of an element's text. You, as the
    implementor of the "characters" function in your SAX handler, have to be
    prepared for this.)


    The solution is a little complex, but is something like this:

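    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class MyHandler extends DefaultHandler {
        // Collects the text of the current element across characters() calls.
        private StringBuffer text = new StringBuffer();

        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text = new StringBuffer(); // start afresh for each element
        }

        public void characters(char[] buf, int offset, int length) {
            text.append(buf, offset, length); // may be called more than once
        }

        public void endElement(String uri, String localName, String qName) {
            String s = text.toString(); // the complete text; parse or store it here
        }
    }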
     
    Adam Maass, Mar 6, 2007
    #12
