Losing carriage returns in CDATA section - how do I prevent this?

Discussion in 'Java' started by CarlosRivera, Jan 8, 2005.

  1. CarlosRivera

    I am using Apache Xerces-J 2.5.0. I have \r\n sequences in the
    CDATA sections that get converted to \n (or rather, the \r gets lost). I
    am using SAX parsing. I can see in the buffer that is passed that when I
    have \n, one character back it has the \r, but the start offset is on
    the \n. The source is an XML string, so it did not get lost while
    reading the file. In any case, it seems that it should not be removing
    the \r in the CDATA section during my SAX events. I am running this on
    Windows, so the behavior of converting \r\n to \n might be
    related. If it is, the code would not be
    portable between Unix and Windows. It should give it to me as is.
    Isn't this one of the purposes of CDATA? I know that one can put
    character references in the XML and it works, but this is really ugly. We
    just want to get some text from a source location and put it into the XML
    without having to replace \r with a character reference such as &#13;.
    CarlosRivera, Jan 8, 2005
    #1

  2. In article <xaZDd.499$>,
    CarlosRivera <> wrote:

    >I am using apache xerces J 2.5.0. I have \r\n feed combinations in the
    >CDATA sections that get converted to \n (or rather \r gets lost.


    XML parsers convert CR-LF and CR to LF, so that you don't have to worry
    about what platform you're using.

    If you really want to preserve CRs, you have to use a character
    reference, but think carefully before doing this: XML is a text
    format, and dependence on platform-specific line-end sequences
    is not usually a good idea.
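
    A minimal sketch of the behavior described above, assuming the SAX
    parser bundled with the JDK rather than Xerces-J 2.5.0 itself. The
    names LineEndDemo and textOf are made up for illustration. It parses a
    literal CR-LF inside a CDATA section, and then a CR written as the
    character reference &#13; outside CDATA, since references are not
    expanded inside a CDATA section.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class LineEndDemo {
    // Parses an XML string with the JDK's default SAX parser and returns
    // the concatenated character data of the document.
    static String textOf(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Only ch[start .. start + length - 1] belongs to this event;
                // anything else in the buffer is parser-internal.
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Literal CR-LF inside CDATA: the parser must normalize it to LF.
        System.out.println(textOf("<m><![CDATA[a\r\nb]]></m>").equals("a\nb"));
        // CR given as a character reference outside CDATA: left alone.
        System.out.println(textOf("<m>a&#13;\nb</m>").equals("a\r\nb"));
    }
}
```

    The handler copies only the slice it is handed, which may be why a
    stray \r can still be seen one position before the start offset: the
    rest of the buffer is not part of the event.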

    -- Richard
    Richard Tobin, Jan 8, 2005
    #2

  3. CarlosRivera

    Wow, that was fast!

    I am just interested in having the characters in the CDATA section left
    alone. The main reason is that the text will end up going into an
    email. When JavaMail takes the lines with only \n and base64-encodes
    them, they don't come out right because they only have \n. This
    causes the lines in the email to run together as one big line.

    If I set the JavaMail content transfer encoding to quoted-printable, the
    email is rendered as expected. Perhaps there is a bug in the way JavaMail
    does its base64 encoding. I would imagine that it should stick the \r
    back in as it base64-encodes.

    CarlosRivera, Jan 8, 2005
    #3
  4. Richard Tobin wrote:

    > In article <xaZDd.499$>,
    > CarlosRivera <> wrote:
    >
    >
    >>I am using apache xerces J 2.5.0. I have \r\n feed combinations in the
    >>CDATA sections that get converted to \n (or rather \r gets lost.

    >
    >
    > XML parsers convert CR-LF and CR to LF, so that you don't have to worry
    > about what platform you're using.


    To be more specific, here is an excerpt from the XML 1.0 spec:

    ====

    2.11 End-of-Line Handling

    XML parsed entities are often stored in computer files which, for
    editing convenience, are organized into lines. These lines are typically
    separated by some combination of the characters CARRIAGE RETURN (#xD)
    and LINE FEED (#xA).

    To simplify the tasks of applications, the XML processor MUST behave as
    if it normalized all line breaks in external parsed entities (including
    the document entity) on input, before parsing, by translating both the
    two-character sequence #xD #xA and any #xD that is not followed by #xA
    to a single #xA character.

    ====

    XML 1.1 generalizes that requirement a bit.
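
    The XML 1.0 rule quoted above can be sketched as a small function. This
    is only an illustration of the rule, not how Xerces implements it, and
    the name EolNormalizer is hypothetical. It does not cover the extra
    line-end characters (#x85 and #x2028) that XML 1.1 adds.

```java
class EolNormalizer {
    // Mirrors XML 1.0 section 2.11: the pair CR LF and any CR that is not
    // followed by LF are each translated to a single LF.
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\r') {
                out.append('\n');
                // Skip the LF of a CR LF pair so it is not emitted twice.
                if (i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

    For example, normalize("a\r\nb\rc\nd") yields "a\nb\nc\nd": all three
    line-end styles collapse to the same character.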


    John Bollinger
    John C. Bollinger, Jan 10, 2005
    #4
  5. CarlosRivera wrote:

    > I am just interested in having the characters in the cdata section left
    > alone. The main purpose is because the text will end up going into a an
    > email. Well, when javamail takes the lines with only \n and it base64
    > encodes them, they don't come out right as they only have \n. This
    > causes the lines in the email in one big line.


    I imagine they come out exactly right -- a true base-64 encoding of the
    input provided. If you viewed such a message on a system that used \n
    as the line terminator (e.g. any UNIX variant) then it would probably
    look fine. If you include the \r characters in such a message then
    there is a reasonably good chance that it appears double spaced when
    read on those systems; if it doesn't then something on the receiving
    side is probably cleaning up after you.

    > If I set the javamail content transfer encoding to quoted printable, the
    > email is rendered as expected. Perhaps a bug in the way javamail does
    > its base64 encoding. I would imagine that it should stick the \r back
    > in as it base64 encodes.


    Why would you think that? The point of applying a base-64 encoding is
    to be able to pass an entity unchanged over SMTP that otherwise would
    (or could) be mangled by conformant MTAs. Whether clients receiving
    such a message can do anything sensible with it is not (directly) a
    consideration.
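
    To illustrate the point, here is a small round trip using
    java.util.Base64 from current JDKs. That is an assumption made for the
    sake of a self-contained example; it is not the encoder JavaMail uses
    internally, and the class name Base64RoundTrip is made up. Whatever
    bytes go in come back out, bare \n included.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64RoundTrip {
    // Encodes the text as MIME-style base64 and decodes it again.
    static String roundTrip(String text) {
        byte[] original = text.getBytes(StandardCharsets.US_ASCII);
        // The MIME encoder wraps its *output* in CR-LF separated lines,
        // but it never touches line ends inside the payload.
        String encoded = Base64.getMimeEncoder().encodeToString(original);
        byte[] decoded = Base64.getMimeDecoder().decode(encoded);
        return new String(decoded, StandardCharsets.US_ASCII);
    }
}
```

    Calling roundTrip("line one\nline two\n") returns the same string, with
    no \r added anywhere. If CR-LF is wanted in the decoded text, it has to
    be in the input before encoding.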


    John Bollinger
    John C. Bollinger, Jan 10, 2005
    #5
  6. CarlosRivera

    Thanks for the help, especially for pointing out the XML spec.

    I was trying to say that when I set the text for the body of the message
    (and it is text as opposed to some other type) and it has only \n,
    JavaMail should normalize the text, i.e. insert the \r's, before base64
    encoding. I thought that SMTP defined \r\n as the line terminator.
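
    One way to do that normalization in application code, before the text
    is handed to the mail API, is sketched below. The helper name
    CrlfText.toCrlf is hypothetical, and this is not something the thread
    confirms JavaMail does on its own.

```java
class CrlfText {
    // Rewrites every line end (CR LF, lone CR, or lone LF) as CR LF.
    static String toCrlf(String text) {
        // CR LF is listed first so an existing pair is matched as a whole
        // and not expanded a second time.
        return text.replaceAll("\r\n|\r|\n", "\r\n");
    }
}
```

    The function is idempotent, so applying it to text that already uses
    \r\n leaves that text unchanged.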

    CarlosRivera, Jan 16, 2005
    #6