Escape Sequences Not Working?

Rick Brandt

If you examine the complete XML below you will see an element "Notes"
consisting of...

<Notes>test replace test[LINE]&amp;[LINE]replace</Notes>

As you can see I have properly (I think) escaped the ampersand (&) with
"&amp;". If I place this XML in a file and open it with Internet Explorer
the ampersand is properly dealt with. In my Java servlet I am using a SAX
parser to parse the XML and write it to a database. When that parser gets
to the "Notes" element all that is returned is the characters up to (not
including) the ampersand in the escape sequence. Everything after that is
truncated. I have found that this will happen with any escape sequence
(since they all start with the ampersand).

I get no errors and the record is written to the database, just with a
truncated Notes field.

Any ideas what I can look for?



<?xml version="1.0"?>
<MBO>
<Record>
<ID>-49781293</ID>
<OrderDate>2004-08-24 15:19:31</OrderDate>
<MemoBillType>5</MemoBillType>
<AccountNum>1</AccountNum>
<BillToAddress>TEST</BillToAddress>
<ShipToAddress>Same as Bill To Address</ShipToAddress>
<RegMgr>John Doe</RegMgr>
<SecCode>308040-860602</SecCode>
<Notes>test replace test[LINE]&amp;[LINE]replace</Notes>
<RequireDate>TEST</RequireDate>
<RackInfo>TEST</RackInfo>
<CallPhoneNumber>TEST TEST</CallPhoneNumber>
<SubRecord_A>
<LineNum>1</LineNum>
<Quantity>1</Quantity>
<PartNum>TEST</PartNum>
<ShipDesignation>TEST</ShipDesignation>
<Price>NULL_VALUE</Price>
<Discount>NULL_VALUE</Discount>
<Notes>TEST TEST TEST</Notes>
</SubRecord_A>
</Record>
</MBO>
 
Martin Honnen

Rick said:
If you examine the complete XML below you will see an element "Notes"
consisting of...

<Notes>test replace test[LINE]&amp;[LINE]replace</Notes>

As you can see I have properly (I think) escaped the ampersand (&) with
"&amp;". If I place this XML in a file and open it with Internet Explorer
the ampersand is properly dealt with. In my Java servlet I am using a SAX
parser to parse the XML and write it to a database. When that parser gets
to the "Notes" element all that is returned is the characters up to (not
including) the ampersand in the escape sequence. Everything after that is
truncated. I have found that this will happen with any escape sequence
(since they all start with the ampersand).

How does your SAX code look? You might get several chunks of character
data as the content of the <Notes> element.
 
Rick Brandt

Martin Honnen said:
How does your SAX code look? You might get several chunks of character
data as the content of the <Notes> element.

public void characters(char[] ch, int start, int length)
        throws SAXException, DataSetException {
    try {
        if (elementStart) {
            elementStart = false;
            String s = new String(ch, start, length);

I'm using JBuilder 7 and it has a built in SAX parser object template that
extends DefaultHandler. The problem seems to be with the length argument
on the last line above. If I examine the ch[] array in debug mode it still
has all of the text from the "Notes" element, but the length argument being
passed from the parser is (for some reason) being set to the first
occurrence of an ampersand instead of extending to the element close tag.
So the String s that I use for insertion to the database is truncated.
 
Richard Tobin

I'm using JBuilder 7 and it has a built in SAX parser object template that
extends DefaultHandler. The problem seems to be with the length argument
on the last line above. If I examine the ch[] array in debug mode it still
has all of the text from the "Notes" element, but the length argument being
passed from the parser is (for some reason) being set to the first
occurrence of an ampersand instead of extending to the element close tag.
So the String s that I use for insertion to the database is truncated.

And you don't get more calls to characters() with the rest of the string?
There's no guarantee you will get it all at once.

-- Richard
 
William Park

Rick Brandt said:
If you examine the complete XML below you will see an element "Notes"
consisting of...

<Notes>test replace test[LINE]&amp;[LINE]replace</Notes>

As you can see I have properly (I think) escaped the ampersand (&)
with "&amp;". If I place this XML in a file and open it with Internet
Explorer the ampersand is properly dealt with. In my Java servlet I am
using a SAX parser to parse the XML and write it to a database. When
that parser gets to the "Notes" element all that is returned is the
characters up to (not including) the ampersand in the escape sequence.
Everything after that is truncated. I have found that this will
happen with any escape sequence (since they all start with the
ampersand).

I get no errors and the record is written to the database, just with a
truncated Notes field.

Any ideas what I can look for?

At least with Expat XML parser, I get 3 calls, ie.
test replace test[LINE]
&
[LINE]replace
So, collect all data until end of <Notes> element.
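The chunking described above can be observed from Java as well. The following is a minimal sketch, assuming a modern JDK and its bundled JAXP parser; the class name ChunkLogger and the helper chunksOf are made up for illustration. It records every slice passed to characters() so the number of calls can be inspected. How many chunks arrive depends on the parser, so only the concatenated result should be relied on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records every chunk the parser hands to characters().
class ChunkLogger extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Each call may carry only a slice of the element's text.
        chunks.add(new String(ch, start, length));
    }

    // Parses the XML string and returns the chunks in the order they were reported.
    static List<String> chunksOf(String xml) {
        try {
            ChunkLogger handler = new ChunkLogger();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.chunks;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        List<String> chunks = chunksOf(
                "<Notes>test replace test[LINE]&amp;[LINE]replace</Notes>");
        // The chunk count is parser-dependent; the joined text is not.
        System.out.println(chunks.size() + " call(s): " + chunks);
        System.out.println(String.join("", chunks));
    }
}
```

Joining the chunks always yields the complete text of the element, whatever the parser decides about where to split.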
 
Rick Brandt

Richard Tobin said:
I'm using JBuilder 7 and it has a built in SAX parser object template that
extends DefaultHandler. The problem seems to be with the length argument
on the last line above. If I examine the ch[] array in debug mode it still
has all of the text from the "Notes" element, but the length argument being
passed from the parser is (for some reason) being set to the first
occurrence of an ampersand instead of extending to the element close tag.
So the String s that I use for insertion to the database is truncated.

And you don't get more calls to characters() with the rest of the string?
There's no guarantee you will get it all at once.

Should I get those "more calls" automatically or do I have to put in some
kind of loop? Why wouldn't Characters() return ALL characters between the
<> and </>? Isn't that what the parser's job is?

I was originally wrapping all of my text elements in CDATA sections, but I
ran into a problem where any CDATA section with the string "replace" in it
raised a Parse Error (previous newsgroup thread where I received no
answers).

I decided I would just escape all of the illegal XML characters instead of
using CDATA and now I have this truncation issue.

I appreciate the help.
 
Rick Brandt

William Park said:
At least with Expat XML parser, I get 3 calls, ie.
test replace test[LINE]
&
[LINE]replace
So, collect all data until end of <Notes> element.

OK I found this at a SAX FAQ site...

*****************************************
The ContentHandler.characters() callback is missing data!

Please read the JavaDoc for this method. A parser may split text into any
number of separate chunks, and some characters may be reported using
ignorableWhitespace() instead of this callback. If you want all the text
inside an element, you need to collect the text from the various characters
callbacks into a buffer. Only when you see the endElement event can you be
sure that you have seen all the text, and some of it may really "belong" to
child elements.
******************************************

This appears to say that I am using the wrong event. It would be a major
re-write to move my code to the EndElement() event, but if I have to I
guess I have to, but then I might have child element characters included
that I don't want? How do I avoid the child element characters? The FAQ
doesn't go into that at all.
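One common way to keep child element text out of the parent's text is to give every open element its own buffer on a stack. The sketch below is only an illustration under that assumption; the class PerElementText, its parse helper, and the path-keyed map are invented for this example and are not from the FAQ. It assumes a modern JDK and a parser that is not namespace-aware, so qName holds the plain element name.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one buffer per open element, so text reported while a
// child is open lands in the child's buffer and never in the parent's.
class PerElementText extends DefaultHandler {
    private final Deque<String> names = new ArrayDeque<>();
    private final Deque<StringBuilder> buffers = new ArrayDeque<>();
    // Path such as /MBO/Record/Notes mapped to that element's own text.
    // A repeated path overwrites the earlier entry; fine for a sketch.
    final Map<String, String> textByPath = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.addLast(qName);
        buffers.addLast(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append to the innermost open element only.
        if (!buffers.isEmpty()) {
            buffers.getLast().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String path = "/" + String.join("/", names);
        textByPath.put(path, buffers.removeLast().toString());
        names.removeLast();
    }

    static Map<String, String> parse(String xml) {
        try {
            PerElementText handler = new PerElementText();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.textByPath;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

With this layout the record-level Notes and the Notes inside SubRecord_A end up under different paths, and each entry is complete by the time endElement() fires for it.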
 
Richard Tobin

Rick Brandt said:
Should I get those "more calls" automatically

Yes. Quite likely you will get three calls in this case.

I was originally wrapping all of my text elements in CDATA sections, but I
ran into a problem where any CDATA section with the string "replace" in it
raised a Parse Error (previous newsgroup thread where I received no
answers).

Maybe you should try a different parser!

-- Richard
 
Rick Brandt

Richard Tobin said:
Yes. Quite likely you will get three calls in this case.


Maybe you should try a different parser!

AFAIK I am using the one that comes with Java 1.4.2_04-b05. The import
statements in my SAX class are...

import org.xml.sax.*;
import org.xml.sax.helpers.*;
 
Rick Brandt

Rick Brandt said:
William Park said:
At least with Expat XML parser, I get 3 calls, ie.
test replace test[LINE]
&
[LINE]replace
So, collect all data until end of <Notes> element.

OK I found this at a SAX FAQ site...

*****************************************
The ContentHandler.characters() callback is missing data!

Please read the JavaDoc for this method. A parser may split text into any
number of separate chunks, and some characters may be reported using
ignorableWhitespace() instead of this callback. If you want all the text
inside an element, you need to collect the text from the various characters
callbacks into a buffer. Only when you see the endElement event can you be
sure that you have seen all the text, and some of it may really "belong" to
child elements.
******************************************

This appears to say that I am using the wrong event. It would be a major
re-write to move my code to the EndElement() event, but if I have to I
guess I have to, but then I might have child element characters included
that I don't want? How do I avoid the child element characters? The FAQ
doesn't go into that at all.

Ok, I found yet another reference...

*********************************************
Note that a SAX driver is free to chunk the character data any way it
wants, so you cannot count on all of the character data content of an
element arriving in a single characters event.
*********************************************

So it appears that this is working "as designed" yet none of the examples I
see on these same pages describe methods for properly dealing with the
characters() event.

Immediately prior to the statement above the site uses an example for
pulling the data from the characters event that clearly will NOT work if
the parser decides to "chunk" the data into multiple pieces.

I guess I will look at collecting the pieces in characters and not writing
them until endElement(). I just wish I could fix the CDATA bug as this was
working fine for 3 or 4 years before that started happening. Either CDATA
forces all of the text in the characters event to be pulled in a single
block or we just got really lucky for all that time because I never saw any
truncation until the CDATA section was removed.
 
Keith M. Corbett

Rick Brandt said:

AFAIK I am using the one that comes with Java 1.4.2_04-b05. The import
statements in my SAX class are...

import org.xml.sax.*;
import org.xml.sax.helpers.*;

Weird! I'm using the JAXP/DOM APIs built into Java SDK version 1.4.2_04.
(Linux) I can't reproduce an error with a CDATA section containing
"replace".

I think this CDATA problem is worth digging into. Can you post (or send me)
sample code and text?

/kmc
 
Rick Brandt

Keith M. Corbett said:
Weird! I'm using the JAXP/DOM APIs built into Java SDK version 1.4.2_04.
(Linux) I can't reproduce an error with a CDATA section containing
"replace".

I think this CDATA problem is worth digging into. Can you post (or send me)
sample code and text?

Well, here's the full story on that. I think what I'm seeing is a bug in
IPlanet's web application server which is what our production web servers
run.

About 2 months ago I had a user reporting errors when submitting data to my
Java servlet over an HTTP request. At the time we isolated it to when a
line-item note field was too long (or so we thought). The problem does NOT
happen when I point the client at the servlet running in my JBuilder
environment (which uses Tomcat) so I was stumped troubleshooting it. The
notes are somewhat of a non-critical field so I asked him to just keep them
short until I could investigate further.

Last week he reported the same problem, only this time with a parent note
field. I was able to determine that it wasn't the length at all, but rather
any occurrence of the string "replace". I then tested my other client apps,
which send data over HTTP in a similar fashion. Every single one of them
bombs if I include "replace" in a CDATA section.

The error reported from the servlet is "root node missing" which I believe
is being raised because the parser is in fact not being passed any data at
all. I then discovered that the word replace was harmless if it was not in
a CDATA section so since I seemed to have few troubleshooting options I
decided to just escape all illegal XML characters and drop the CDATA
section. At initial design the CDATA looked like the easiest way to handle
the data entered by the user instead of doing a bunch of Replace()
functions. Now I'll have to rewrite all of my SAX parsing code because of
this issue with characters() breaking the text into chunks. Apparently it
uses the ampersand as the "chunk delimiter".

This CDATA problem definitely has some variability to it because while I
can reproduce the problem myself, I have never had any other user complain
of this (around 30) and I can find records in the database that contain the
word "replace" which apparently made it through ok.
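For reference, escaping the markup characters before building the XML can be sketched as below. The thread does not say which language the client apps are written in, so this is Java purely to match the servlet side, and the class XmlEscaper and its escape method are hypothetical names. It covers only the five predefined XML entities and does not filter control characters that XML 1.0 forbids outright.

```java
// Hypothetical helper: replaces the five characters that have predefined
// XML entities. Strictly, only & and < must be escaped in element content;
// the quote characters matter inside attribute values.
class XmlEscaper {
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Builds a Notes element from raw user text containing an ampersand.
        String note = "test replace test[LINE]&[LINE]replace";
        System.out.println("<Notes>" + escape(note) + "</Notes>");
    }
}
```

Walking the string once avoids the ordering trap of chained replace calls, where an ampersand introduced by an earlier replacement could be escaped a second time.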
 
William Park

Rick Brandt said:
I guess I will look at collecting the pieces in characters and not
writing them until endElement(). I just wish I could fix the CDATA
bug as this was working fine for 3 or 4 years before that started
happening. Either CDATA forces all of the text in the characters
event to be pulled in a single block or we just got really lucky for
all that time because I never saw any truncation until the CDATA
section was removed.

You were just lucky. :)

If you're using (or can use) the Bash shell, then collecting all the text
inside <Notes> or any other element is simple. Assuming elements
containing data are not nested,

start () {    # Usage: start tag att=value ...
    case $1 in
        Notes) unset data ;;
    esac
}
middle () {   # Usage: middle text
    case ${XML_ELEMENT_STACK[1]} in
        Notes) data+="$1" ;;
    esac
}
end () {      # Usage: end tag
    case $1 in
        Notes) echo "$data" ;;
    esac
}

Then,
xml -s start -d middle -e end "<Notes>aa&amp;bb</Notes>"
produces
aa&bb

Ref:
http://freshmeat.net/projects/bashdiff/
http://home.eol.ca/~parkw/index.html#xml
help xml
 
Donald Roby

Richard Tobin said:
Rick Brandt said:
I'm using JBuilder 7 and it has a built in SAX parser object template that
extends DefaultHandler. The problem seems to be with the length argument
on the last line above. If I examine the ch[] array in debug mode it still
has all of the text from the "Notes" element, but the length argument being
passed from the parser is (for some reason) being set to the first
occurrence of an ampersand instead of extending to the element close tag.
So the String s that I use for insertion to the database is truncated.

And you don't get more calls to characters() with the rest of the
string? There's no guarantee you will get it all at once.

Should I get those "more calls" automatically or do I have to put in
some kind of loop? Why wouldn't Characters() return ALL characters
between the <> and </>? Isn't that what the parser's job is?

You should get these "more calls" more or less automatically, but your
characters method has to allow for multiple calls with partial data.

The basic strategy is to set up a StringBuffer in the startElement method,
collect text into it in the characters method, and pull the whole result
out in the endElement method.

I don't know what triggers the division into multiple events, but it
sounds like the implementation you're using may be stopping on ampersands
to handle entities. I'd hope that once you get your code dealing with the
multiple calls this will be transparent. Possibly use of a CDATA section
simplified the parser's job so it didn't need to do this.

But the characters method is definitely not guaranteed to return the
entire enclosed text, so you should do something like what I described
above.
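The strategy described above might look roughly like this in Java. It is a minimal sketch, not the original poster's handler: the class NotesCollector and its parse helper are invented for illustration, StringBuilder stands in for the StringBuffer mentioned above, and a modern JDK with a non-namespace-aware parser is assumed. It also assumes that Notes elements never nest inside one another, which holds for the sample document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers the text of each <Notes> element across
// however many characters() calls the parser chooses to make.
class NotesCollector extends DefaultHandler {
    private StringBuilder text;              // non-null only while inside <Notes>
    final List<String> notes = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Notes".equals(qName)) {
            text = new StringBuilder();      // start a fresh buffer
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);  // accumulate, never overwrite
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("Notes".equals(qName) && text != null) {
            notes.add(text.toString());      // only now is the text complete
            text = null;
        }
    }

    static List<String> parse(String xml) {
        try {
            NotesCollector handler = new NotesCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.notes;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

The database write would then move from characters() to endElement(), where the buffered string is known to be whole regardless of whether the text arrived escaped or inside a CDATA section.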
 