Umlaut characters in Unicode

Discussion in 'XML' started by Jürgen Kahrs, Nov 12, 2004.

  1. Hello,

    do you think that this file is a proper Unicode file?

    http://belnet.dl.sourceforge.net/sourceforge/ganttproject/ganttproject-example3.xml

    <?xml version="1.0" encoding="UTF-8"?>
    ...
    <resource id="1" name="Andreas Plüschke" function="10" contacts=""/>

    I am asking because of the ü Umlaut character.
    I am guessing that the author used an ISO-8859-1
    environment but forgot to change the encoding
    declaration from UTF-8 to ISO-8859-1.
    Jürgen Kahrs, Nov 12, 2004
    #1

  2. Martin Honnen, Nov 12, 2004
    #2

  3. Martin Honnen wrote:

    > Why is an umlaut a problem? Unicode certainly contains/allows umlaut
    > characters.


    Umlaut is not a problem for Unicode.
    Umlaut is a problem if you write a text
    with an editor in ISO-8859-1 mode and
    view the text with an editor in UTF-8
    mode.

    For example, while writing this posting,
    I use ISO-8859-1 mode and this is a u-umlaut: ü
    Now, switch your news reader to UTF-8 and you
    will find that the character does not look like
    a u-umlaut anymore.
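
    The effect described above can be reproduced at the byte level. The
    following is a minimal sketch in modern Java, not code from this
    thread; the class name UmlautBytes and its helper methods are made up
    for illustration. It shows that ü is the single byte FC in ISO-8859-1
    and the two bytes C3 BC in UTF-8, and what each looks like when read
    with the other encoding.

```java
import java.nio.charset.StandardCharsets;

final class UmlautBytes {
    // Format bytes as space-separated uppercase hex, e.g. "C3 BC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode text as ISO-8859-1, then (wrongly) decode those bytes as UTF-8.
    static String latin1ReadAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    // Encode text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // ü written as an escape, so the source file's own encoding is irrelevant.
        String u = "\u00FC";
        System.out.println(hex(u.getBytes(StandardCharsets.ISO_8859_1))); // FC
        System.out.println(hex(u.getBytes(StandardCharsets.UTF_8)));      // C3 BC
        // A lone FC byte is not valid UTF-8; new String(...) substitutes U+FFFD.
        System.out.println(latin1ReadAsUtf8(u).equals("\uFFFD"));         // true
        // C3 BC read as ISO-8859-1 becomes the two characters U+00C3 U+00BC.
        System.out.println(utf8ReadAsLatin1(u).equals("\u00C3\u00BC"));   // true
    }
}
```

    Note that new String(bytes, charset) silently replaces invalid input
    instead of reporting it, which is why a mismatch like this can go
    unnoticed in a viewer.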
    Jürgen Kahrs, Nov 12, 2004
    #3
  4. Jürgen Kahrs wrote:

    >:Martin Honnen wrote:
    >:
    >:> Why is an umlaut a problem? Unicode certainly contains/allows umlaut
    >:> characters.
    >:
    >:Umlaut is not a problem for Unicode.
    >:Umlaut is a problem if you write a text
    >:with an editor in ISO-8859-1 mode and
    >:view the text with an editor in UTF-8
    >:mode.
    >:
    >:For example, while writing this posting,
    >:I use ISO-8859-1 mode and this is a u-umlaut: ü
    >:Now, switch your news reader to UTF-8 and you
    >:will find that the character does not look like
    >:a u-umlaut anymore.


    That's precisely the problem we've encountered with our application,
    which stores its data in UTF-8 encoded XML documents.

    We maintain everything internally in our Java application as part of a
    DOM, and it's saved to an external file on request. But we failed to
    force the byte stream written to the file to be encoded to UTF-8, so it
    used the default ISO-8859-1 on our American systems. When the next
    attempt was made to read the file (only if such characters appeared),
    errors occurred because there were non-UTF-8 characters present.

    The solution we found was to serialize the DOM with UTF-8 encoding
    specified (which we were already doing) and then also specify UTF-8
    encoding on the output file stream when writing. When this was done,
    opening such an XML file in an editor clearly showed something that did
    not resemble the letter with umlaut, or accent, or other special feature.

    = Steve =
    --
    Steve W. Jackson
    Montgomery, Alabama
    Steve W. Jackson, Nov 12, 2004
    #4
  5. Steve W. Jackson wrote:

    > We maintain everything internally in our Java application as part of a
    > DOM, and it's saved to an external file on request. But we failed to
    > force the byte stream written to the file to be encoded to UTF-8, so it
    > used the default ISO-8859-1 on our American systems. When the next
    > attempt was made to read the file (only if such characters appeared),
    > errors occurred because there were non-UTF-8 characters present.


    Yes, this is the situation I was thinking of.
    Now, with your unpleasant experience in mind,
    would you say that the following document was
    also encoded in an inadequate way ?

    http://belnet.dl.sourceforge.net/sourceforge/ganttproject/ganttproject-example3.xml

    As I said in my original posting, I am guessing
    that the author used an ISO-8859-1 environment
    (just like you) but forgot to change the encoding
    declaration from UTF-8 to ISO-8859-1.

    Thanks for answering !
    Jürgen Kahrs, Nov 12, 2004
    #5
  6. Jürgen Kahrs wrote:

    >:Steve W. Jackson wrote:
    >:
    >:> We maintain everything internally in our Java application as part of a
    >:> DOM, and it's saved to an external file on request. But we failed to
    >:> force the byte stream written to the file to be encoded to UTF-8, so it
    >:> used the default ISO-8859-1 on our American systems. When the next
    >:> attempt was made to read the file (only if such characters appeared),
    >:> errors occurred because there were non-UTF-8 characters present.
    >:
    >:Yes, this is the situation I was thinking of.
    >:Now, with your unpleasant experience in mind,
    >:would you say that the following document was
    >:also encoded in an inadequate way ?
    >:
    >: http://belnet.dl.sourceforge.net/sourceforge/ganttproject/ganttproject-example3.xml
    >:
    >:As I said in my original posting, I am guessing
    >:that the author used an ISO-8859-1 environment
    >:(just like you) but forgot to change the encoding
    >:declaration from UTF-8 to ISO-8859-1.
    >:
    >:Thanks for answering !


    It looks to me as if it's not encoded properly, based on the visual
    appearance of the <resource> element near the end.

    Just to make clear what I said earlier, the problem we encountered did
    not stem from using an ISO-8859-1 encoding in the XML itself. All of
    our files already included <?xml version="1.0" encoding="UTF-8"?> at the
    top when serialized, since we told the XML serializer to use UTF-8.

    Instead, we also write the file using Java's OutputStreamWriter, in
    which we specify the stream being written (in this case, Java's
    FileOutputStream class designating the file) and the encoding to use
    when writing the stream. Only if *both* of these things were done would
    non-ASCII characters get correctly written and then parse without error
    next time around. We got a separate report of this same problem from a
    German user who used a directory name containing an umlaut-o (as in ö)
    and from a French user with an accented e (as in é).
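
    The approach described above might look roughly like the following
    sketch. It is not the application's actual code: the class name
    Utf8FileWriterSketch and the helper bytesOnDisk are hypothetical, and
    it uses StandardCharsets, which did not exist in the Java of 2004
    (back then one would pass the charset name "UTF-8" as a string). The
    point is only that the charset is given explicitly to
    OutputStreamWriter rather than left to the platform default.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8FileWriterSketch {
    // Write text through an OutputStreamWriter with an explicit UTF-8 charset,
    // so the platform default encoding plays no part.
    static void writeUtf8(Path file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Hypothetical helper for demonstration: write text to a temp file and
    // return the raw bytes that ended up on disk.
    static byte[] bytesOnDisk(String text) {
        try {
            Path file = Files.createTempFile("umlaut-demo", ".xml");
            try {
                writeUtf8(file, text);
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = bytesOnDisk("Pl\u00FCschke");
        // 9 bytes: the ü occupies two (C3 BC), the other seven letters one each.
        System.out.println(bytes.length);
    }
}
```

    The encoding named in the XML declaration and the encoding used by
    the writer are set in two different places, so they have to be kept
    in agreement by hand.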

    = Steve =
    --
    Steve W. Jackson
    Montgomery, Alabama
    Steve W. Jackson, Nov 12, 2004
    #6
  7. Jürgen Kahrs wrote:

    >do you think that this file is a proper Unicode file?
    >
    >http://belnet.dl.sourceforge.net/sourceforge/ganttproject/ganttproject-example3.xml


    The file at that URL appears to be well-formed, and contains a
    correctly encoded UTF-8 u-with-umlaut. I don't see any problem with it.

    Putting a UTF-8 declaration on a file that is really Latin-1 (and which
    contains non-ascii characters) will almost always result in a detectable
    error because the result will almost always be an illegal UTF-8 byte
    sequence. An XML parser should detect the error.
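
    The kind of check a parser performs can be sketched with a strict
    decoder. This is an illustration only, not how any particular XML
    parser is implemented; the class name StrictUtf8Check is made up. It
    also shows one of the rare cases behind "almost always": Latin-1
    text whose bytes happen to form a legal UTF-8 sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {
    // Return true only if the bytes form a legal UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "Pl\u00FCschke";
        // Latin-1 bytes: FC followed by 's' is not a legal UTF-8 sequence.
        System.out.println(isValidUtf8(name.getBytes(StandardCharsets.ISO_8859_1))); // false
        // Proper UTF-8 bytes pass.
        System.out.println(isValidUtf8(name.getBytes(StandardCharsets.UTF_8)));      // true
        // Rare exception: the Latin-1 text U+00C3 U+00BC is the bytes C3 BC,
        // which is legal UTF-8 (it decodes to ü), so no error is detectable.
        System.out.println(isValidUtf8("\u00C3\u00BC".getBytes(StandardCharsets.ISO_8859_1))); // true
    }
}
```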

    -- Richard
    Richard Tobin, Nov 12, 2004
    #7
  8. On Fri, 12 Nov 2004, Richard Tobin wrote:

    > Putting a UTF-8 declaration on a file that is really Latin-1 (and which
    > contains non-ascii characters) will almost always result in a detectable
    > error


    Indeed...

    > because the result will almost always be an illegal UTF-8 byte
    > sequence. An XML parser should detect the error.


    In fact, anything which is supposed to handle utf-8 should give up at
    that point, if only for security reasons. XML is a higher layer in
    the protocol layer-cake: I'm not sure that it really should be allowed
    to have any say in these lower-level problems. That way lie dragons,
    from a security analysis point of view.
    Alan J. Flavell, Nov 13, 2004
    #8
  9. Jürgen Kahrs wrote:


    > Now, with your unpleasant experience in mind,
    > would you say that the following document was
    > also encoded in an inadequate way ?
    >
    > http://belnet.dl.sourceforge.net/sourceforge/ganttproject/ganttproject-example3.xml
    >
    >
    > As I said in my original posting, I am guessing
    > that the author used an ISO-8859-1 environment
    > (just like you) but forgot to change the encoding
    > declaration from UTF-8 to ISO-8859-1.


    I have no problems viewing that file with Netscape 7 or IE 6, I don't
    see anything displayed incorrectly that suggests the encoding has not
    been declared correctly.


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Nov 13, 2004
    #9
  10. Richard Tobin wrote:

    > Putting a UTF-8 declaration on a file that is really Latin-1 (and which
    > contains non-ascii characters) will almost always result in a detectable
    > error because the result will almost always be an illegal UTF-8 byte


    I should have looked into the hexdump immediately:

    00002250 20 6e 61 6d 65 3d 22 41 6e 64 72 65 61 73 20 50 | name="Andreas P|
    00002260 6c c3 bc 73 63 68 6b 65 22 20 66 75 6e 63 74 69 |l..schke" functi|

    The UTF-8 byte sequence C3 BC decodes to code point U+00FC, as described here:

    http://www.pemberley.com/janeinfo/latin1.html#utf8

    And U+00FC is indeed the position of the ü, as shown
    on page 2 of this chart:

    http://www.unicode.org/charts/PDF/U0080.pdf

    This mixture of bitwise encoding and character sets
    is a pain if you work with it rarely.
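
    The bit arithmetic behind that conversion can be written out as a
    short sketch. The class and method names are invented for
    illustration, and only the two-byte form of UTF-8 is handled: the
    lead byte 110xxxxx contributes five bits and the trail byte 10xxxxxx
    contributes six.

```java
final class Utf8TwoByteDecode {
    // Decode a two-byte UTF-8 sequence (lead 110xxxxx, trail 10xxxxxx)
    // into its Unicode code point.
    static int decodeTwoByte(int lead, int trail) {
        if ((lead & 0xE0) != 0xC0 || (trail & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        int codePoint = ((lead & 0x1F) << 6) | (trail & 0x3F);
        if (codePoint < 0x80) {
            // Lead bytes C0 and C1 would encode ASCII in two bytes, which is illegal.
            throw new IllegalArgumentException("overlong encoding");
        }
        return codePoint;
    }

    public static void main(String[] args) {
        // C3 = 110 00011, BC = 10 111100  ->  00011 111100 = 0xFC
        int cp = decodeTwoByte(0xC3, 0xBC);
        System.out.println(String.format("U+%04X", cp)); // U+00FC
    }
}
```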

    > sequence. An XML parser should detect the error.


    The problem was that I did not trust my parser.
    I think I should put the Unicode 4.0 book onto my book shelf.

    Thanks to all who answered.
    Jürgen Kahrs, Nov 13, 2004
    #10
  11. Richard Tobin, Nov 13, 2004
    #11
  12. Jürgen Kahrs, Nov 13, 2004
    #12