Incorrect parsing of special characters

Discussion in 'XML' started by Dario Di Bella, Jun 17, 2004.

  1. Hi all,
    I hope someone can help me with this. I need to parse the following XML:

    ....
    <area name="promotore">
    <item id="004" code="003" description="attivita promotore">
    <![CDATA[»&nbsp;Attività&nbsp;Promotore]]>
    </item>
    </area>
    ....

    As you can see I used the CDATA section to include special characters.
    Unfortunately, when I parse the file, the "item" element content turns
    out to be:

    Â»&nbsp;AttivitÃ &nbsp;Promotore

    i.e. the "Â" character is inserted at the beginning of the string and
    the "à" character is translated into "Ã ".

    I'm using the javax.xml.parsers.DocumentBuilder parser.

    Has anyone got any clue? Thanks.

    Dario
     
    Dario Di Bella, Jun 17, 2004
    #1

  2. * Dario Di Bella wrote in comp.text.xml:
    >As you can see I used the CDATA section to include special characters.
    >Unfortunately as I parse the file, the "item" element content turns to
    >be:
    >
    >Â»&nbsp;AttivitÃ &nbsp;Promotore
    >
    >i.e. the "Â" character is inserted at the beginning of the string and
    >the "à" character is translated into "Ã ".
    >
    >I'm using the javax.xml.parsers.DocumentBuilder parser.
    >
    >Has anyone got any clue? Thanks.


    The output appears to be UTF-8 that you are viewing in an application
    which assumes it is encoded as ISO-8859-1 or similar. Could you
    elaborate on the problem you are trying to solve? Everything
    should be fine as long as the second application supports UTF-8 and
    knows that the data is UTF-8 encoded.
     
    Bjoern Hoehrmann, Jun 17, 2004
    #2
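
A minimal Java sketch of the effect described in the reply above: text that is encoded as UTF-8 and then decoded as ISO-8859-1 picks up the stray characters from the question. The class and method names (`MojibakeDemo`, `misdecode`, `repair`) are hypothetical and do not come from the thread. Characters are written as Unicode escapes so that the encoding of the source file itself does not matter.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with the wrong
    // charset (ISO-8859-1). Each UTF-8 byte becomes one character.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: turn each character back into one byte and decode
    // the bytes as UTF-8. This assumes no bytes were lost along the way.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00bb Attivit\u00e0"; // "» Attività"
        String garbled = misdecode(original);
        // Compare against the expected garbled form instead of printing it,
        // because the console encoding may distort what is shown.
        System.out.println(garbled.equals("\u00c2\u00bb Attivit\u00c3\u00a0"));
        System.out.println(repair(garbled).equals(original));
    }
}
```

Here `misdecode` turns "»" (U+00BB) into "Â»" and "à" (U+00E0) into "Ã" followed by a non-breaking space, which matches the symptom reported in the first post.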

  3. Bjoern,
    Thanks for your reply.
    I am building a JSP tag library to build a dynamic JavaScript menu on
    a web page. The JavaScript code is mm_menu.js, shipped with
    Dreamweaver. The menu items should be dynamically loaded after the
    user logs in, based on the user's permissions (i.e. some menu items
    will be enabled, others won't). The common menu configuration is stored
    in an XML file. Each <item> tag represents a menu item. The CDATA
    section is the text that will be displayed on the page, hence it contains
    HTML-specific codes ("&nbsp;") and some special characters currently
    used in the Italian language ("à").

    Basically what I want to do is to print that CDATA section in an HTML
    page, using a jsp custom tag.

    I don't understand your observation regarding a second application:
    even if I parse the XML and echo the node's content to standard
    output (i.e. System.out.println(element.getData());) I get the same
    wrong text.

    Any suggestion?
    Thanks and regards.

    Dario.



     
    Dario Di Bella, Jun 18, 2004
    #3
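
On the point about System.out.println: the console is itself a "second application", because System.out converts characters to bytes using a platform-dependent charset and the terminal then interprets those bytes with its own. The sketch below, which uses a hypothetical helper `printedBytes` that is not from the thread, shows how the bytes produced by a `PrintStream` depend on the charset it was created with.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitOutputEncoding {
    // Print the text through a PrintStream that uses the named charset and
    // return the bytes that actually reached the underlying stream.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // "à" (U+00E0) is two bytes in UTF-8 and one byte in ISO-8859-1.
        System.out.println(printedBytes("\u00e0", "UTF-8").length);
        System.out.println(printedBytes("\u00e0", "ISO-8859-1").length);

        // Wrapping System.out fixes the output charset explicitly; whether
        // the result looks right still depends on what the terminal expects.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("\u00bb Attivit\u00e0");
    }
}
```

Whether the last line displays correctly depends on the terminal, which is an assumption the code cannot check.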
  4. Dario Di Bella wrote:

    > <![CDATA[»&nbsp;Attività&nbsp;Promotore]]>
    > Â»&nbsp;AttivitÃ &nbsp;Promotore
    >
    > i.e. the "Â" character is inserted at the beginning of the string and
    > the "à" character is translated into "Ã ".


    Check your charset encoding. This looks very much as if the encoding in
    which the XML comes and the encoding used to read it don't match.

    /Thomas
     
    Thomas Weidenfeller, Jun 18, 2004
    #4
  5. Dario Di Bella wrote:
    > As you can see I used the CDATA section to include special characters.
    > Unfortunately as I parse the file, the "item" element content turns to
    > be:
    >
    > Â»&nbsp;AttivitÃ &nbsp;Promotore
    >
    > i.e. the "Â" character is inserted at the beginning of the string and
    > the "à" character is translated into "Ã ".


    Does your document correctly declare its encoding? If you specify
    none, the default is UTF-8 whereas Windows text editors usually
    default to CP1252. Trying to parse CP1252-encoded text as UTF-8
    could easily lead to the weirdness you describe.
     
    Michael Borgwardt, Jun 18, 2004
    #5
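
To illustrate the question about the declared encoding, here is a hedged sketch using the same `javax.xml.parsers.DocumentBuilder` API mentioned in the first post. The helper `parseItem` and the one-element document are invented for the example. It serializes the document in one charset, declares a possibly different one in the XML declaration, and returns what the parser reads back.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DeclaredEncodingDemo {
    // Build a small document whose XML declaration names `declared`, turn it
    // into bytes using `actual`, parse the bytes and return the item text.
    static String parseItem(String declared, Charset actual) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declared + "\"?>"
                + "<item><![CDATA[\u00bb&nbsp;Attivit\u00e0]]></item>";
        byte[] bytes = xml.getBytes(actual);
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Parsing from a byte stream lets the parser apply the declaration.
            Document doc = builder.parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String expected = "\u00bb&nbsp;Attivit\u00e0";
        // Declaration and bytes agree: the text comes back intact.
        System.out.println(parseItem("UTF-8", StandardCharsets.UTF_8).equals(expected));
        System.out.println(parseItem("ISO-8859-1", StandardCharsets.ISO_8859_1).equals(expected));
        // UTF-8 bytes declared as ISO-8859-1: the text comes back garbled.
        System.out.println(parseItem("ISO-8859-1", StandardCharsets.UTF_8).equals(expected));
    }
}
```

The first two checks match and the third does not: the mismatched case yields the "Â»" and "Ã" characters from the question, while the parser reports no error because every byte is a valid ISO-8859-1 character.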
  6. * Dario Di Bella wrote in comp.text.xml:
    >Basically what I want to do is to print that CDATA section in an HTML
    >page, using a jsp custom tag.


    To make that work you need to ensure that the HTML document and the
    output of your code use the same encoding. It's all about the bytes.
    You could try to copy and paste the following fragment into a HTML
    document

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
    <meta http-equiv=Content-Type content="text/html;charset=utf-8">
    <title></title>
    <p>»&nbsp;Attività&nbsp;Promotore</p>

    and load that into your browser. All your characters should be just
    fine. If you change the fragment to

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
    <meta http-equiv=Content-Type content="text/html;charset=iso-8859-1">
    <title></title>
    <p>»&nbsp;Attività&nbsp;Promotore</p>

    it breaks. So you need to change the encoding and/or the declared
    encoding of the surrounding HTML document, or transcode the data,
    or try to use character references.

    The java.lang.String class, for example, provides a getBytes(...)
    method; you can do e.g. the following:

    class Foo {
        public static void main(String[] argv) {
            try {
                // "ö" (U+00F6) as UTF-8: two bytes, 0xC3 0xB6
                System.out.write("\u00f6".getBytes("UTF-8"));
                System.out.println();
                // "ö" as ISO-8859-1: one byte, 0xF6
                System.out.write("\u00f6".getBytes("ISO-8859-1"));
                System.out.println();
                // "ö" in CP850 is byte 0x94, written here as a raw byte
                System.out.write(0x94); /* CP850, but it is not supported... */
                System.out.println();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    Depending on your operating system, locale, etc., one of the writes
    will most likely show an "ö". On Windows, on the command line, it is
    most likely the last write(). If you redirect the output to a file
    (`java Foo > file.txt`) and open file.txt in Notepad, you will notice
    that it is now the second write() that shows the "ö". You can also
    create a new text file containing "ö", go to the command line prompt
    and type "type C:\...\file.txt", which would then likely show "÷",
    not "ö".

    HTH...
     
    Bjoern Hoehrmann, Jun 18, 2004
    #6
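
The character-reference option mentioned above can be sketched as follows. The helper `toCharRefs` is hypothetical and not from the thread. It replaces every non-ASCII character with a numeric character reference, so the output is plain ASCII and displays the same whether the page is declared as UTF-8 or ISO-8859-1. It assumes the input is already an HTML fragment, as in the CDATA section from the question, and therefore leaves "&", "<" and ">" untouched.

```java
class CharRefs {
    // Replace each non-ASCII code point with a decimal numeric character
    // reference (for example U+00E0 becomes "&#224;"). ASCII is copied as-is,
    // so existing markup and entities such as "&nbsp;" are preserved.
    static String toCharRefs(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: &#187;&nbsp;Attivit&#224;&nbsp;Promotore
        System.out.println(toCharRefs("\u00bb&nbsp;Attivit\u00e0&nbsp;Promotore"));
    }
}
```

Because the result contains only ASCII bytes, it sidesteps the mismatch between the encoding of the data and the encoding declared for the page, at the cost of less readable markup.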
  7. Bjoern/Michael/Thomas,

    I solved this issue by declaring a different encoding ("iso-8859-1"
    instead of "utf-8"). Thank you very much for your help, and excuse me
    for bothering you with a trivial problem ;-)

    Best regards.

    Dario.
     
    Dario Di Bella, Jun 18, 2004
    #7
