Escape Sequences Not Working?

Discussion in 'XML' started by Rick Brandt, Aug 25, 2004.

  1. Rick Brandt

    Rick Brandt Guest

    If you examine the complete XML below you will see an element "Notes"
    consisting of...

    <Notes>test replace test[LINE]&amp;[LINE]replace</Notes>

    As you can see I have properly (I think) escaped the ampersand (&) with
    "&amp;". If I place this XML in a file and open it with Internet Explorer
    the ampersand is properly dealt with. In my Java servlet I am using a SAX
    parser to parse the XML and write it to a database. When that parser gets
    to the "Notes" element all that is returned is the characters up to (not
    including) the ampersand in the escape sequence. Everything after that is
    truncated. I have found that this will happen with any escape sequence
    (since they all start with the ampersand).

    I get no errors and the record is written to the database, just with a
    truncated Notes field.

    Any ideas what I can look for?



    <?xml version="1.0"?>
    <MBO>
    <Record>
    <ID>-49781293</ID>
    <OrderDate>2004-08-24 15:19:31</OrderDate>
    <MemoBillType>5</MemoBillType>
    <AccountNum>1</AccountNum>
    <BillToAddress>TEST</BillToAddress>
    <ShipToAddress>Same as Bill To Address</ShipToAddress>
    <RegMgr>John Doe</RegMgr>
    <SecCode>308040-860602</SecCode>
    <Notes>test replace test[LINE]&amp;[LINE]replace</Notes>
    <RequireDate>TEST</RequireDate>
    <RackInfo>TEST</RackInfo>
    <CallPhoneNumber>TEST TEST</CallPhoneNumber>
    <SubRecord_A>
    <LineNum>1</LineNum>
    <Quantity>1</Quantity>
    <PartNum>TEST</PartNum>
    <ShipDesignation>TEST</ShipDesignation>
    <Price>NULL_VALUE</Price>
    <Discount>NULL_VALUE</Discount>
    <Notes>TEST TEST TEST</Notes>
    </SubRecord_A>
    </Record>
    </MBO>
     
    Rick Brandt, Aug 25, 2004
    #1

  2. Rick Brandt wrote:

    > If you examine the complete XML below you will see an element "Notes"
    > consisting of...
    >
    > <Notes>test replace test[LINE]&amp;[LINE]replace</Notes>
    >
    > As you can see I have properly (I think) escaped the ampersand (&) with
    > "&amp;". If I place this XML in a file and open it with Internet Explorer
    > the ampersand is properly dealt with. In my Java servlet I am using a SAX
    > parser to parse the XML and write it to a database. When that parser gets
    > to the "Notes" element all that is returned is the characters up to (not
    > including) the ampersand in the escape sequence. Everything after that is
    > truncated. I have found that this will happen with any escape sequence
    > (since they all start with the ampersand).


    How does your SAX code look? You might get several chunks of character
    data as the content of the <Notes> element.

    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
     
    Martin Honnen, Aug 25, 2004
    #2

  3. Rick Brandt

    Rick Brandt Guest

    "Martin Honnen" <> wrote in message
    news:412cab43$0$19550$-online.net...
    > How does your SAX code look? You might get several chunks of character
    > data as the content of the <Notes> element.


    public void characters(char[] ch, int start, int length)
            throws SAXException, DataSetException {
        try {
            if (elementStart) {
                elementStart = false;
                String s = new String(ch, start, length);

    I'm using JBuilder 7 and it has a built in SAX parser object template that
    extends DefaultHandler. The problem seems to be with the length argument
    on the last line above. If I examine the ch[] array in debug mode it still
    has all of the text from the "Notes" element, but the length argument being
    passed from the parser is (for some reason) being set to the first
    occurrence of an ampersand instead of extending to the element close tag.
    So the String s that I use for insertion to the database is truncated.


    --
    I don't check the Email account attached
    to this message. Send instead to...
    RBrandt at Hunter dot com
     
    Rick Brandt, Aug 25, 2004
    #3
  4. In article <>,
    Rick Brandt <> wrote:
    >I'm using JBuilder 7 and it has a built in SAX parser object template that
    >extends DefaultHandler. The problem seems to be with the length argument
    >on the last line above. If I examine the ch[] array in debug mode it still
    >has all of the text from the "Notes" element, but the length argument being
    >passed from the parser is (for some reason) being set to the first
    >occurrence of an ampersand instead of extending to the element close tag.
    >So the String s that I use for insertion to the database is truncated.


    And you don't get more calls to characters() with the rest of the string?
    There's no guarantee you will get it all at once.

    -- Richard
     
    Richard Tobin, Aug 25, 2004
    #4
  5. William Park

    William Park Guest

    In <comp.text.xml> Rick Brandt <> wrote:
    > If you examine the complete XML below you will see an element "Notes"
    > consisting of...
    >
    > <Notes>test replace test[LINE]&amp;[LINE]replace</Notes>
    >
    > As you can see I have properly (I think) escaped the ampersand (&)
    > with "&amp;". If I place this XML in a file and open it with Internet
    > Explorer the ampersand is properly dealt with. In my Java servlet I am
    > using a SAX parser to parse the XML and write it to a database. When
    > that parser gets to the "Notes" element all that is returned is the
    > characters up to (not including) the ampersand in the escape sequence.
    > Everything after that is truncated. I have found that this will
    > happen with any escape sequence (since they all start with the
    > ampersand).
    >
    > I get no errors and the record is written to the database, just with a
    > truncated Notes field.
    >
    > Any ideas what I can look for?


    At least with Expat XML parser, I get 3 calls, ie.
    test replace test[LINE]
    &
    [LINE]replace
    So, collect all data until end of <Notes> element.
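The chunking described above can be observed from Java as well. The following is a minimal sketch, assuming a modern JDK with the bundled JAXP parser; the class name ChunkLogger and the helper chunks() are made up for illustration. How many chunks arrive, and where they split, depends on the parser, so only the joined text should be relied on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records every characters() callback separately.
class ChunkLogger {

    /** Parses the XML string and returns each characters() chunk in arrival order. */
    static List<String> chunks(String xml) throws Exception {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Only the slice [start, start + length) belongs to this event.
                seen.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        List<String> parts =
                chunks("<Notes>test replace test[LINE]&amp;[LINE]replace</Notes>");
        for (String part : parts) {
            System.out.println("chunk: " + part);
        }
        // The concatenation is the complete element text, however it was split.
        System.out.println("joined: " + String.join("", parts));
    }
}
```

Whatever the chunk boundaries turn out to be, the joined line should read test replace test[LINE]&[LINE]replace.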

    --
    William Park <>
    Open Geometry Consulting, Toronto, Canada
     
    William Park, Aug 25, 2004
    #5
  6. Rick Brandt

    Rick Brandt Guest

    "Richard Tobin" <> wrote in message
    news:cgigsq$26st$...
    > In article <>,
    > Rick Brandt <> wrote:
    > >I'm using JBuilder 7 and it has a built in SAX parser object template
    > >that extends DefaultHandler. The problem seems to be with the length
    > >argument on the last line above. If I examine the ch[] array in debug
    > >mode it still has all of the text from the "Notes" element, but the
    > >length argument being passed from the parser is (for some reason) being
    > >set to the first occurrence of an ampersand instead of extending to the
    > >element close tag. So the String s that I use for insertion to the
    > >database is truncated.

    >
    > And you don't get more calls to characters() with the rest of the string?
    > There's no guarantee you will get it all at once.


    Should I get those "more calls" automatically or do I have to put in some
    kind of loop? Why wouldn't Characters() return ALL characters between the
    <> and </>? Isn't that what the parser's job is?

    I was originally wrapping all of my text elements in CDATA sections, but I
    ran into a problem where any CDATA section with the string "replace" in it
    raised a Parse Error (previous newsgroup thread where I received no
    answers).

    I decided I would just escape all of the illegal XML characters instead of
    using CDATA and now I have this truncation issue.

    I appreciate the help.


    --
    I don't check the Email account attached
    to this message. Send instead to...
    RBrandt at Hunter dot com
     
    Rick Brandt, Aug 25, 2004
    #6
  7. Rick Brandt

    Rick Brandt Guest

    "William Park" <> wrote in message
    news:...
    > At least with Expat XML parser, I get 3 calls, ie.
    > test replace test[LINE]
    > &
    > [LINE]replace
    > So, collect all data until end of <Notes> element.


    OK I found this at a SAX FAQ site...

    *****************************************
    The ContentHandler.characters() callback is missing data!

    Please read the JavaDoc for this method. A parser may split text into any
    number of separate chunks, and some characters may be reported using
    ignorableWhitespace() instead of this callback. If you want all the text
    inside an element, you need to collect the text from the various characters
    callbacks into a buffer. Only when you see the endElement event can you be
    sure that you have seen all the text, and some of it may really "belong" to
    child elements.
    ******************************************

    This appears to say that I am using the wrong event. It would be a major
    re-write to move my code to the EndElement() event, but if I have to I
    guess I have to, but then I might have child element characters included
    that I don't want? How do I avoid the child element characters? The FAQ
    doesn't go into that at all.
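One possible way to keep child text out of a parent's buffer is to hold one buffer per open element on a stack, so each characters() chunk lands only in the innermost element. This is a sketch under assumptions, not something the FAQ prescribes: the class name OwnTextHandler and the "name=text" result format are invented for illustration, and a modern JDK is assumed.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one text buffer per open element.
class OwnTextHandler extends DefaultHandler {
    private final Deque<StringBuilder> open = new ArrayDeque<>();
    final List<String> results = new ArrayList<>();   // "name=text", in endElement order

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(new StringBuilder());               // fresh buffer for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.peek().append(ch, start, length);    // innermost open element only
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The popped buffer holds this element's own text, without child text.
        results.add(qName + "=" + open.pop());
    }

    static List<String> ownText(String xml) throws Exception {
        OwnTextHandler handler = new OwnTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Record><Notes>top &amp; note</Notes>"
                + "<SubRecord_A><Notes>line note</Notes></SubRecord_A></Record>";
        for (String entry : ownText(xml)) {
            System.out.println(entry);
        }
    }
}
```

With this layout the two Notes elements are reported separately, and Record and SubRecord_A end up with empty text of their own because all of their character data sits inside children.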


    --
    I don't check the Email account attached
    to this message. Send instead to...
    RBrandt at Hunter dot com
     
    Rick Brandt, Aug 25, 2004
    #7
  8. In article <>,
    Rick Brandt <> wrote:

    >Should I get those "more calls" automatically


    Yes. Quite likely you will get three calls in this case.

    >I was originally wrapping all of my text elements in CDATA sections, but I
    >ran into a problem where any CDATA section with the string "replace" in it
    >raised a Parse Error (previous newsgroup thread where I received no
    >answers).


    Maybe you should try a different parser!

    -- Richard
     
    Richard Tobin, Aug 25, 2004
    #8
  9. Rick Brandt

    Rick Brandt Guest

    "Richard Tobin" <> wrote in message
    news:cgipvu$29ml$...
    > In article <>,
    > Rick Brandt <> wrote:
    >
    > >Should I get those "more calls" automatically

    >
    > Yes. Quite likely you will get three calls in this case.
    >
    > >I was originally wrapping all of my text elements in CDATA sections, but
    > >I ran into a problem where any CDATA section with the string "replace"
    > >in it raised a Parse Error (previous newsgroup thread where I received
    > >no answers).
    >
    > Maybe you should try a different parser!


    AFAIK I am using the one that comes with java 1.4.2_04-b05. The import
    statements in my SAX class are...

    org.xml.sax.*;
    org.xml.sax.helpers.*;
     
    Rick Brandt, Aug 25, 2004
    #9
  10. Rick Brandt

    Rick Brandt Guest

    "Rick Brandt" <> wrote in message
    news:...
    > "William Park" <> wrote in message
    > news:...
    > > At least with Expat XML parser, I get 3 calls, ie.
    > > test replace test[LINE]
    > > &
    > > [LINE]replace
    > > So, collect all data until end of <Notes> element.

    >
    > OK I found this at a SAX FAQ site...
    >
    > *****************************************
    > The ContentHandler.characters() callback is missing data!
    >
    > Please read the JavaDoc for this method. A parser may split text into any
    > number of separate chunks, and some characters may be reported using
    > ignorableWhitespace() instead of this callback. If you want all the text
    > inside an element, you need to collect the text from the various
    > characters callbacks into a buffer. Only when you see the endElement event
    > can you be sure that you have seen all the text, and some of it may really
    > "belong" to child elements.
    > ******************************************
    >
    > This appears to say that I am using the wrong event. It would be a major
    > re-write to move my code to the EndElement() event, but if I have to I
    > guess I have to, but then I might have child element characters included
    > that I don't want? How do I avoid the child element characters? The FAQ
    > doesn't go into that at all.


    Ok, I found yet another reference...

    *********************************************
    Note that a SAX driver is free to chunk the character data any way it
    wants, so you cannot count on all of the character data content of an
    element arriving in a single characters event.
    *********************************************

    So it appears that this is working "as designed" yet none of the examples I
    see on these same pages describe methods for properly dealing with the
    characters() event.

    Immediately prior to the statement above the site uses an example for
    pulling the data from the characters event that clearly will NOT work if
    the parser decides to "chunk" the data into multiple pieces.

    I guess I will look at collecting the pieces in characters and not writing
    them until endElement(). I just wish I could fix the CDATA bug as this was
    working fine for 3 or 4 years before that started happening. Either CDATA
    forces all of the text in the characters event to be pulled in a single
    block or we just got really lucky for all that time because I never saw any
    truncation until the CDATA section was removed.


    --
    I don't check the Email account attached
    to this message. Send instead to...
    RBrandt at Hunter dot com
     
    Rick Brandt, Aug 25, 2004
    #10
  11. "Rick Brandt" <> wrote in message
    news:...

    > > >I was originally wrapping all of my text elements in CDATA sections,
    > > >but I ran into a problem where any CDATA section with the string
    > > >"replace" in it raised a Parse Error (previous newsgroup thread where
    > > >I received no answers).

    > >
    > > Maybe you should try a different parser!

    >
    > AFAIK I am using the one that comes with java 1.4.2_04-b05. The import
    > statements in my SAX class are...
    >
    > org.xml.sax.*;
    > org.xml.sax.helpers.*;


    Weird! I'm using the JAXP/DOM APIs built into Java SDK version 1.4.2_04.
    (Linux) I can't reproduce an error with a CDATA section containing
    "replace".

    I think this CDATA problem is worth digging into. Can you post (or send me)
    sample code and text?

    /kmc
     
    Keith M. Corbett, Aug 26, 2004
    #11
  12. Rick Brandt

    Rick Brandt Guest

    "Keith M. Corbett" <> wrote in message
    news:...
    > Weird! I'm using the JAXP/DOM APIs built into Java SDK version 1.4.2_04.
    > (Linux) I can't reproduce an error with a CDATA section containing
    > "replace".
    >
    > I think this CDATA problem is worth digging into. Can you post (or send
    > me) sample code and text?


    Well, here's the full story on that. I think what I'm seeing is a bug in
    IPlanet's web application server which is what our production web servers
    run.

    About 2 months ago I had a user reporting errors when submitting data to my
    Java servlet over an HTTP request. At the time we isolated it to when a
    line-item note field was too long (or so we thought). The problem does NOT
    happen when I point the client at the servlet running in my JBuilder
    environment (which uses Tomcat) so I was stumped troubleshooting it. The
    notes are somewhat of a non-critical field so I asked him to just keep them
    short until I could investigate further.

    Last week he reported the same problem only it was with a parent note
    field. This time I was able to determine that it wasn't the length at all,
    but rather that any time the string "replace" occurred. I then tested my
    other client apps which send data over HTTP in a similar fashion. Every
    single one of them bombs if I include "replace" in a CDATA section.

    The error reported from the servlet is "root node missing" which I believe
    is being raised because the parser is in fact not being passed any data at
    all. I then discovered that the word replace was harmless if it was not in
    a CDATA section so since I seemed to have few troubleshooting options I
    decided to just escape all illegal XML characters and drop the CDATA
    section. At initial design the CDATA looked like the easiest way to handle
    the data entered by the user instead of doing a bunch of Replace()
    functions. Now I'll have to rewrite all of my SAX parsing code because of
    this issue with characters() breaking the text into chunks. Apparently it
    uses the ampersand as the "chunk delimiter".
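For the "escape all illegal XML characters" step on the sending side, a single pass over the string avoids chaining several Replace() calls. The following is a hedged sketch: the class XmlEscape and its escape() method are hypothetical names, not part of any library mentioned in this thread, and the code does not filter control characters that XML 1.0 forbids outright.

```java
// Hypothetical helper: escapes the five characters that have predefined XML entities.
class XmlEscape {

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <Notes>test replace test[LINE]&amp;[LINE]replace</Notes>
        System.out.println("<Notes>" + escape("test replace test[LINE]&[LINE]replace") + "</Notes>");
    }
}
```

Because each input character is examined exactly once, an ampersand produced by one replacement is never escaped a second time, which is a common pitfall when replacements are chained in the wrong order.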

    This CDATA problem definitely has some variability to it because while I
    can reproduce the problem myself, I have never had any other user complain
    of this (around 30) and I can find records in the database that contain the
    word "replace" which apparently made it through ok.


    --
    I don't check the Email account attached
    to this message. Send instead to...
    RBrandt at Hunter dot com
     
    Rick Brandt, Aug 26, 2004
    #12
  13. William Park

    William Park Guest

    In <comp.text.xml> Rick Brandt <> wrote:
    > I guess I will look at collecting the pieces in characters and not
    > writing them until endElement(). I just wish I could fix the CDATA
    > bug as this was working fine for 3 or 4 years before that started
    > happening. Either CDATA forces all of the text in the characters
    > event to be pulled in a single block or we just got really lucky for
    > all that time because I never saw any truncation until the CDATA
    > section was removed.


    You were just lucky. :)

    If you're using (or can use) Bash shell, then collecting all texts
    inside <Notes> or any other element is simple. Assuming elements
    containing data are not nested,

    start () {    # Usage: start tag att=value ...
        case $1 in
            Notes) unset data ;;
        esac
    }
    middle () {   # Usage: middle text
        case ${XML_ELEMENT_STACK[1]} in
            Notes) data+="$1" ;;
        esac
    }
    end () {      # Usage: end tag
        case $1 in
            Notes) echo "$data" ;;
        esac
    }

    Then,
    xml -s start -d middle -e end "<Notes>aa&amp;bb</Notes>"
    produces
    aa&bb

    Ref:
    http://freshmeat.net/projects/bashdiff/
    http://home.eol.ca/~parkw/index.html#xml
    help xml
    --
    William Park <>
    Open Geometry Consulting, Toronto, Canada
     
    William Park, Aug 26, 2004
    #13
  14. Donald Roby

    Donald Roby Guest

    On Wed, 25 Aug 2004 13:18:24 -0500, Rick Brandt wrote:

    > "Richard Tobin" <> wrote in message
    > news:cgigsq$26st$...
    >> In article <>, Rick Brandt
    >> <> wrote:
    >> >I'm using JBuilder 7 and it has a built in SAX parser object template
    >> >that extends DefaultHandler. The problem seems to be with the length
    >> >argument on the last line above. If I examine the ch[] array in debug
    >> >mode it still has all of the text from the "Notes" element, but the
    >> >length argument being passed from the parser is (for some reason) being
    >> >set to the first occurrence of an ampersand instead of extending to the
    >> >element close tag. So the String s that I use for insertion to the
    >> >database is truncated.

    >>
    >> And you don't get more calls to characters() with the rest of the
    >> string? There's no guarantee you will get it all at once.

    >
    > Should I get those "more calls" automatically or do I have to put in
    > some kind of loop? Why wouldn't Characters() return ALL characters
    > between the <> and </>? Isn't that what the parser's job is?
    >

    You should get these "more calls" more or less automatically, but your
    characters method has to allow for multiple calls with partial data.

    The basic strategy is to set up a StringBuffer in the startElement method,
    collect text into it in the characters method, and pull the whole result
    out in the endElement method.
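That strategy might look roughly like the sketch below. It assumes a modern JDK, so it uses StringBuilder; on Java 1.4 a StringBuffer would take its place. The class name NotesCollector and the helper parseNotes() are made up for illustration, and the sketch assumes the document contains a single Notes element with no child elements.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers <Notes> text across multiple characters() calls.
class NotesCollector extends DefaultHandler {
    private StringBuilder buffer;   // non-null only while inside <Notes>
    private String notes;           // complete text, available after endElement

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Notes".equals(qName)) {
            buffer = new StringBuilder();           // start collecting
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (buffer != null) {
            buffer.append(ch, start, length);       // may run several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("Notes".equals(qName) && buffer != null) {
            notes = buffer.toString();              // only now is the text complete
            buffer = null;
        }
    }

    static String parseNotes(String xml) throws Exception {
        NotesCollector handler = new NotesCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.notes;
    }

    public static void main(String[] args) throws Exception {
        // prints: test replace test[LINE]&[LINE]replace
        System.out.println(parseNotes("<Notes>test replace test[LINE]&amp;[LINE]replace</Notes>"));
    }
}
```

The database write would then move from characters() to the point in endElement() where the complete string is available.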

    I don't know what triggers the division into multiple events, but it
    sounds like the implementation you're using may be stopping on ampersands
    to handle entities. I'd hope once you get your code dealing with the
    multiple calls this will be transparent. Possibly use of a CDATA section
    simplified the parser's job so it didn't need to do this.
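A quick way to check that buffering makes the difference invisible is to parse the same note both ways and compare the accumulated text. This is only a sketch, assuming a modern JDK; the class CdataVsEscaped and its textOf() helper are hypothetical names. It says nothing about how many characters() calls each form triggers, since that is up to the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: CDATA and entity-escaped input yield the same buffered text.
class CdataVsEscaped {

    /** Returns all character data in the document, concatenated across callbacks. */
    static String textOf(String xml) throws Exception {
        final StringBuilder all = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                all.append(ch, start, length);   // CDATA content also arrives here
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return all.toString();
    }

    public static void main(String[] args) throws Exception {
        String viaCdata   = textOf("<Notes><![CDATA[a & b replace]]></Notes>");
        String viaEscapes = textOf("<Notes>a &amp; b replace</Notes>");
        System.out.println(viaCdata);
        System.out.println(viaEscapes);
        System.out.println("same text: " + viaCdata.equals(viaEscapes));
    }
}
```

Both forms should come out as the same string once every chunk has been appended, so code that buffers until endElement does not need to care which form the client sent.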

    But the characters method is definitely not guaranteed to return the
    entire enclosed text, so you should do something like what I described
    above.
     
    Donald Roby, Aug 28, 2004
    #14
