Problem with SaxParser. Works Occasionally.


stacey

Hello everyone,

I am using SaxParser to parse an XML document, and I noticed that
sometimes it ignores some data while reading the characters inside an
element.
What I mean is:

My xml file includes many instances of the following structure:

<pk>
<absi>769.9541477864069</absi>
<area>170.0227589457148</area>
<background>0.0000000000000000</background>
<chisq>1202.267500954470</chisq>
<goodn>1607.355201650164</goodn>
<ind>1106.121302500338</ind>
<lind>1082.000000000000</lind>
<mass>922.4428373952809</mass>
<meth>4</meth>
<reso>5700.586009913091</reso>
<rind>1256.000000000000</rind>
<s2n>6.073893519996163</s2n>
<type>0</type>
</pk>

My Java code is quite big, but I debugged it and saw that the
problem is in the characters function.
From each of the above structures I want to get the absi and mass
values and write them to a file.

The characters function:
public void characters(char buf[], int offset, int len)
        throws SAXException {
    String s = new String(buf, offset, len);

The string s is sometimes shorter than it should be. Can we control the
offset and the len?

The problem appears when it reaches the <mass> element. Sometimes it works
fine and reads the whole number, 922.4428373952809.
But other times, when, say, the mass value is 1455.668578151738,
the result is just 14.

Do you have any idea what the error is? I can post my code, but I
didn't do it now because this post is already too long. Maybe someone
has already encountered this error.


I would appreciate any help.

Thank you very very much,

Stacey


PS:
I have many files like this, and I have noticed that my output files
show a similar pattern: a fault every four lines for a while, and
then the output is correct again:

922.4428373952809 769.9541477864069
927.4953784899038 37191.92095290756
933.5252259507145 8110.517189035567
940.4653099147753 9868.125486196035
941.518898 4381.813320162202 <------------ we lose some digits
947.5404155021193 2787.439831966368
954.4881638784998 392.1130341071628
965.4569335866441 1401.545504962355
978.4210869646715 438.9369494573742
984.4917 886.1359194417274 <------------ we lose more digits
1003.550150977367 497.0759433683916
1017.529625718612 3169.151170705610
1055.582684542875 4943.314163415449
1066.107033179408 443.6729762946884
1074.5 3354.853245126279 <------------ we lose more digits
1076.531768475310 646.8024403962839
1083.527120777254 498.7760684249872
1311.697689088619 16369.00024571709
1325.729755074315 287.2714228497898
1349 637.2393867567375 <------------ the number is now an integer (error)
1373.598186480195 223.2986354292584
1385.548026377931 431.7554665051648
1387.553176811347 268.0273520356520
1443.594333307889 1317.685936747487
14 661.7093703067692 <------- should have read 1455.668578151738
1457.610697327313 768.3194420301912
1467.786043204199 3468.484434990418
1546.734272041830 565.9503240406932
1552.544423206343 610.4527352860962
1566.639317869258 308.7611649665076
1575.708923737670 1524.695259940473

(the rest of the file is ok)
 

Chris Uppal

stacey said:
The characters function:
public void characters(char buf[], int offset, int len)
        throws SAXException {
    String s = new String(buf, offset, len);

The string s is sometimes shorter than it should be. Can we control the
offset and the len?

The problem appears when it reaches the <mass> element. Sometimes it works
fine and reads the whole number, 922.4428373952809.
But other times, when, say, the mass value is 1455.668578151738,
the result is just 14.

Are you assuming that characters() will always be called with all the text in
one call? If so, then don't, because it won't. SAX may supply
"1455.668578151738" in as many separate pieces as it wants to -- even in 17
different calls with one character each.

-- chris
 

Thomas Fritsch

stacey said:
The problem appears when it reaches the <mass> element. Sometimes it works
fine and reads the whole number, 922.4428373952809.
But other times, when, say, the mass value is 1455.668578151738,
the result is just 14.

A wild guess:
Could it be that the parser sometimes passes the characters in two chunks
instead of one?
For example, in most cases the parser might call your handler like this:
startElement(...) // <mass>
characters(...) // 1455.668578151738
endElement(...) // </mass>
But in some rare cases the parser might call your handler like this:
startElement(...) // <mass>
characters(...) // 14
characters(...) // 55.668578151738
endElement(...) // </mass>

Note that both ways are perfectly valid according to the SAX specification.
Hence your content handler has to cope with the possibility of multiple
chunks (probably by concatenating the chunks into one string).
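
The concatenation suggested above can be sketched roughly as follows. This is only an illustration: the class name ChunkJoinDemo and the helper join are made up, and the callbacks are invoked by hand to imitate a parser that splits the text, so nothing here says when a real parser actually splits.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; characters() is called manually to imitate a parser.
class ChunkJoinDemo extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] buf, int offset, int len) {
        // Append every chunk; never assume one call carries the whole text.
        text.append(buf, offset, len);
    }

    String collected() {
        return text.toString();
    }

    // Feeds the given chunks through characters() and returns the joined text.
    static String join(String... chunks) {
        ChunkJoinDemo handler = new ChunkJoinDemo();
        for (String chunk : chunks) {
            char[] buf = chunk.toCharArray();
            handler.characters(buf, 0, buf.length);
        }
        return handler.collected();
    }

    public static void main(String[] args) {
        // One chunk and two chunks should yield the same text.
        System.out.println(join("1455.668578151738"));
        System.out.println(join("14", "55.668578151738"));
    }
}
```

A handler that instead overwrites its string on every characters() call would keep only one of the two chunks, which matches the truncated values in the original post.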
 

angrybaldguy

stacey said:
The problem appears when it reaches the <mass> element. Sometimes it works
fine and reads the whole number, 922.4428373952809.
But other times, when, say, the mass value is 1455.668578151738,
the result is just 14.

This is not a bug, actually. SAX doesn't guarantee that text nodes
will be delivered in a single call to the "characters" method -- in
the example you gave you should get characters("14") and then
characters("55.66....") immediately afterwards; it's your
responsibility to stitch these back into a single string.

Put off parsing the string into a number until you see the
corresponding endElement call.

Owen
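
As a rough sketch of deferring the number parsing until endElement, something like the following might do. The class name PeakHandler, the helper parse, and the root element name "peaks" are assumptions made for this example; only the pk, absi and mass element names come from the original post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects one {mass, absi} pair per <pk> element.
class PeakHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<double[]> peaks = new ArrayList<>();
    private double mass = Double.NaN;
    private double absi = Double.NaN;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting afresh for each element
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        text.append(buf, offset, len); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only now is the element's text known to be complete.
        switch (qName) {
            case "mass": mass = Double.parseDouble(text.toString().trim()); break;
            case "absi": absi = Double.parseDouble(text.toString().trim()); break;
            case "pk":
                peaks.add(new double[] {mass, absi});
                mass = absi = Double.NaN; // do not carry values into the next <pk>
                break;
            default: break;
        }
        text.setLength(0);
    }

    // Parses the XML string; the root element name is up to the caller.
    static List<double[]> parse(String xml) {
        PeakHandler handler = new PeakHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.peaks;
    }

    public static void main(String[] args) {
        String xml = "<peaks><pk><absi>769.9541477864069</absi>"
                + "<mass>922.4428373952809</mass></pk></peaks>";
        for (double[] p : parse(xml)) {
            System.out.println(p[0] + " " + p[1]); // mass, then absi
        }
    }
}
```

Because the text is only converted in endElement, it makes no difference how many characters() calls the parser chooses to make in between.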
 

Sem

Chris Uppal said:
Are you assuming that characters() will always be called with all the text in
one call? If so, then don't, because it won't. SAX may supply
"1455.668578151738" in as many separate pieces as it wants to -- even in 17
different calls with one character each.

-- chris

Where can I learn more about SaxParser?
Please help.

--sem
 

Mike Schilling

Chris said:
Are you assuming that characters() will always be called with all the
text in one call? If so, then don't, because it won't. SAX may
supply "1455.668578151738" in as many separate pieces as it wants to
-- even in 17 different calls with one character each.

Why are you assuming that each call will supply a non-zero number of
characters? :)
 

Chris Uppal

Mike Schilling wrote:

[me:]
Why are you assuming that each call will supply a non-zero number of
characters? :)

;-)

I did consider that issue. I decided it might be a little confusing to mention
it, though...

Actually, I can't find anything to suggest that a 0-length sequence is forbidden
by the SAX spec (such as it is). OTOH, I have no reason to suppose that any
SAX implementation would actually do it.

Makes no real difference in practice, since code which is written to work right
with variable length (sub-)sequences at all will automatically cope with
0-length sequences too.

-- chris
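
A small sketch of that point: code that appends whatever slice it is given copes with empty slices for free. The class SliceDemo and its rejoin helper are invented for this illustration, and the cut positions are assumed to be non-decreasing and within the text.

```java
// Hypothetical demo: appends arbitrary slices of a text, including empty ones.
class SliceDemo {
    // Splits text at the given cut positions and re-joins it the way an
    // accumulating characters() implementation would.
    // Assumes cuts are non-decreasing and no larger than text.length().
    static String rejoin(String text, int... cuts) {
        char[] buf = text.toCharArray();
        StringBuilder sb = new StringBuilder();
        int start = 0;
        for (int cut : cuts) {
            sb.append(buf, start, cut - start); // length is 0 when cut == start
            start = cut;
        }
        sb.append(buf, start, buf.length - start); // remainder after the last cut
        return sb.toString();
    }

    public static void main(String[] args) {
        String mass = "1455.668578151738";
        // "14", two empty slices, then the rest.
        System.out.println(rejoin(mass, 2, 2, 2));
        // Seventeen slices of one character each.
        System.out.println(rejoin(mass, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16));
    }
}
```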
 

stacey

Thank you all for answering, and for your help.

I didn't know that it could take more than one call to get all the
text.
I thought a single call would be the logical thing.

Anyway, I will try it now. I hope it works!

My question about the frequency of the "errors" still stands (I mean
the fault every four lines).

Thank you very very much again,

Really Best Regards,

Stacey


 

angrybaldguy

Please don't top-post, and trim your replies a bit so the rest of us
don't have to download the entire thread again every time we read your
messages. :)

stacey said:
Thank you all for answering, and for your help.

I didn't know that it could take more than one call to get all the
text.
I thought a single call would be the logical thing.

There are a couple of reasons for SAX to deliver text nodes in multiple
pieces, the most obvious being the case of ignorable whitespace.
Consider the following document:

<doc>
<text>This has some ignorable whitespace</text>
</doc>

The easiest way for SAX to deliver this to the application is:

startElement ("doc")
startElement ("text")
characters ("This has some ")
characters ("ignorable whitespace")
endElement ("text")
endElement ("doc")

Nothing says the SAX driver can't remove the whitespace internally and
present it to the application in one call; however, allowing it to
deliver the text in multiple pieces means the driver can be much
simpler and memory allocation can be much more predictable.

It also makes it easier for the driver to deliver character entities:
it can deliver the text before the entity, the entity's corresponding
character, and the text following the entity in three separate calls,
if it's easier to implement.
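
To see how a given parser actually delivers text around an entity, a small experiment along these lines may help. The class EntityChunkDemo and its chunksOf helper are names invented for this sketch, and the number of chunks printed depends on the parser in use, so no particular split is promised.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: records each characters() call as a separate chunk.
class EntityChunkDemo extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] buf, int offset, int len) {
        chunks.add(new String(buf, offset, len));
    }

    // Parses the XML string and returns the chunks in delivery order.
    static List<String> chunksOf(String xml) {
        EntityChunkDemo handler = new EntityChunkDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.chunks;
    }

    public static void main(String[] args) {
        List<String> chunks = chunksOf("<text>salt &amp; pepper</text>");
        // How many chunks appear here is up to the parser implementation.
        System.out.println(chunks);
        // Joined together, the chunks always give the complete text.
        System.out.println(String.join("", chunks));
    }
}
```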

stacey said:
My question about the frequency of the "errors" still stands (I mean
the fault every four lines).

For that, you'll have to talk to the authors behind your SAX driver.

Owen
 

Adam Maass

stacey said:
<mass>922.4428373952809</mass>

The characters function:
public void characters(char buf[], int offset, int len)
        throws SAXException {
    String s = new String(buf, offset, len);

The string s is sometimes shorter than it should be. Can we control the
offset and the len?

The problem appears when it reaches the <mass> element. Sometimes it works
fine and reads the whole number, 922.4428373952809.
But other times, when, say, the mass value is 1455.668578151738,
the result is just 14.

Do you have any idea what the error is? I can post my code, but I
didn't do it now because this post is already too long. Maybe someone
has already encountered this error.

I have seen this error too many times to count, and corrected it several
times.

The problem is that the "characters" function is not guaranteed to be called
with the complete text of an element. (A SAX implementation typically reads
through the document with a fixed-length char[], and the boundary of that
array sometimes lands in the middle of a text node. You, as the
implementor of the 'characters' function in your handler, have to be
prepared for this.)
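
The buffer-boundary effect can be imitated with a toy model. This is emphatically not how any particular SAX parser is implemented: the class BufferBoundaryDemo, its textChunks method and the block sizes are all made up to show why a fixed-size read buffer would split text at regular positions in the input.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT the implementation of any real SAX parser.
class BufferBoundaryDemo {
    // Reads doc in blocks of blockSize chars and reports each run of text found
    // outside tags, flushing whatever it has when a block is exhausted.
    static List<String> textChunks(String doc, int blockSize) {
        List<String> chunks = new ArrayList<>();
        boolean inTag = false;
        for (int blockStart = 0; blockStart < doc.length(); blockStart += blockSize) {
            int blockEnd = Math.min(blockStart + blockSize, doc.length());
            StringBuilder run = new StringBuilder();
            for (int i = blockStart; i < blockEnd; i++) {
                char c = doc.charAt(i);
                if (c == '<') {
                    inTag = true;
                    if (run.length() > 0) { // text ended by a tag
                        chunks.add(run.toString());
                        run.setLength(0);
                    }
                } else if (c == '>') {
                    inTag = false;
                } else if (!inTag) {
                    run.append(c);
                }
            }
            // Block exhausted: emit the partial run instead of keeping it.
            if (run.length() > 0) {
                chunks.add(run.toString());
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String doc = "<mass>1455.668578151738</mass>";
        System.out.println(textChunks(doc, 64)); // block larger than the document
        System.out.println(textChunks(doc, 8));  // boundaries fall inside the number
    }
}
```

If a parser did work this way, records of similar length would hit a buffer boundary at fairly regular intervals, which could look like a fault recurring every few lines. Whether that explains the pattern in the original post depends on the parser actually in use.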


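The solution is a little complex, but is something like this:

// Collects the text of the current element across characters() calls.
StringBuffer text;

void startElement(...) {
    text = new StringBuffer();
}

void characters(char[] buf, int offset, int length) {
    // The field must not share the parameter's name, or append() would be
    // called on the char[] parameter instead of the buffer.
    text.append(buf, offset, length);
}

void endElement(...) {
    String s = text.toString(); // the complete text of the element
}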
 
