UTF-8 and diacritics combining characters

Discussion in 'ASP .Net' started by jm, Dec 19, 2008.

  1. jm

    jm Guest

    Hi All,

    I'm trying to read a UTF-8 file where diacritics are coded using
    combining characters, e.g. the accented character is represented as the
    unaccented character followed by the combining accent character.

    Example: é is 65 CC 81

    It seems that the UTF8 encoding does not handle this:
    Dim sr As IO.StreamReader = New System.IO.StreamReader(fname, Encoding.UTF8)

    I do not get my accented characters.

    Is there another encoding to cope with such a UTF-8 file?

    Thanks,
    Jean-Michel
     
    jm, Dec 19, 2008
    #1

  2. "jm" <> wrote in message
    news:494b5c50$0$9379$...
    > Hi All,
    >
    > I'm trying to read a UTF-8 file where diacritics are coded using
    > combining characters, e.g. the accented character is represented as the
    > unaccented character followed by the combining accent character.
    >
    > Example: é is 65 CC 81
    >
    > It seems that the UTF8 encoding does not handle this:
    > Dim sr As IO.StreamReader = New System.IO.StreamReader(fname,
    > Encoding.UTF8)
    >
    > I do not get my accented characters.
    >
    > Is there another encoding to cope with such a UTF-8 file?
    >


    When you view a UTF-8 file as if it were an ANSI file, you will often see
    the characters you expect, but preceded by other characters.

    The reason for this is the manner in which UTF-8 encodes Unicode characters
    and the values in the Unicode domain chosen for the more common Latin
    accented characters. You should not conclude that the appearance of the
    expected character when viewed as ANSI is intended; it's merely a
    coincidence and is not significant to the coding.

    The UTF8 encoder definitely does handle UTF-8 correctly.

    The problem is that your initial encoding for é is wrong. é is c3 a9 in
    UTF-8 encoding. Most, if not all, of the characters in the upper portion of
    the ISO-8859-1 set will encode as 2 bytes in UTF-8. You would need to move
    further up the Unicode set of characters to get to the point where 3 bytes
    are needed to encode a character in UTF-8.
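
    For illustration, a minimal C# sketch that prints the UTF-8 bytes of the
    two byte sequences discussed in this thread: c3 a9 (é as the single code
    point U+00E9) and 65 CC 81 (the letter e followed by U+0301 COMBINING
    ACUTE ACCENT). The class name EAcuteBytes and the helper Utf8Hex are
    hypothetical names chosen for this example only. It also decodes the
    three-byte sequence to show what Encoding.UTF8 returns for it.

```csharp
using System;
using System.Text;

public static class EAcuteBytes
{
    // Hypothetical helper: hex dump of the UTF-8 encoding of a string, e.g. "C3 A9".
    public static string Utf8Hex(string s) =>
        BitConverter.ToString(Encoding.UTF8.GetBytes(s)).Replace("-", " ");

    public static void Main()
    {
        string precomposed = "\u00E9";  // é as one code point (U+00E9)
        string decomposed = "e\u0301";  // e + U+0301 COMBINING ACUTE ACCENT

        Console.WriteLine(Utf8Hex(precomposed)); // C3 A9
        Console.WriteLine(Utf8Hex(decomposed));  // 65 CC 81

        // Decode the three bytes reported in the original question.
        string decoded = Encoding.UTF8.GetString(new byte[] { 0x65, 0xCC, 0x81 });
        Console.WriteLine(decoded.Length);         // 2 (two chars: 'e' and U+0301)
        Console.WriteLine(decoded == precomposed); // False (ordinal comparison)
    }
}
```

    The decoder returns the two code points exactly as stored, so an ordinal
    comparison with the one-character é fails even though both render the
    same way.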

    --
    Anthony Jones - MVP ASP/ASP.NET
     
    Anthony Jones, Dec 19, 2008
    #2

  3. jm

    jm Guest

    Anthony Jones wrote:

    >
    > The problem is that your initial encoding for é is wrong. é is c3 a9 in
    > UTF-8 encoding. Most, if not all, of the characters in the upper portion
    > of the ISO-8859-1 set will encode as 2 bytes in UTF-8. You would need to
    > move further up the Unicode set of characters to get to the point where
    > 3 bytes are needed to encode a character in UTF-8.
    >


    Hi,

    You say it is wrong, but when I open the file in Word or Firefox,
    I get the right character (é) displayed.

    Should I understand that Encoding.UTF8 in .NET does not handle combining
    characters and that there is no workaround?

    Actually, that is all I need to know, so I can tell the people who
    created this file that we cannot handle it because of a .NET limitation.

    Regards,
    Jean-Michel
     
    jm, Dec 19, 2008
    #3
  4. Hans Kesting

    Hans Kesting Guest

    jm wrote on 19-12-2008 :
    > Hi All,
    >
    > I'm trying to read a UTF-8 file where diacritics are coded using combining
    > characters, e.g. the accented character is represented as the unaccented
    > character followed by the combining accent character.
    >
    > Example: é is 65 CC 81
    >
    > It seems that the UTF8 encoding does not handle this:
    > Dim sr As IO.StreamReader = New System.IO.StreamReader(fname, Encoding.UTF8)
    >
    > I do not get my accented characters.
    >
    > Is there another encoding to cope with such a UTF-8 file?
    >
    > Thanks,
    > Jean-Michel


    If you normalize the string to "FormC", then you will get a printable
    string:
    string s = ...
    s = s.Normalize(System.Text.NormalizationForm.FormC);
    This will combine the characters and the non-spacing accents into
    accented characters.

    Apparently your text was in "FormD".
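
    Combined with the file read from the original question, this could look
    roughly like the sketch below. It is only an illustration: the class
    name NormalizeOnRead, the method ReadComposed, and the temporary test
    file are assumptions made for this example, and File.ReadAllText is
    used in place of the StreamReader from the question.

```csharp
using System;
using System.IO;
using System.Text;

public static class NormalizeOnRead
{
    // Hypothetical helper: read a UTF-8 file, then compose base letters
    // with their combining marks (Unicode normalization form C).
    public static string ReadComposed(string path)
    {
        string text = File.ReadAllText(path, Encoding.UTF8);
        return text.Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // Write the three bytes from the question (65 CC 81) to a temp file.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, new byte[] { 0x65, 0xCC, 0x81 });

            string composed = ReadComposed(path);
            Console.WriteLine(composed.Length);      // number of chars after composition
            Console.WriteLine(composed == "\u00E9"); // compares with the single-code-point é
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

    With the default globalization settings this is expected to print 1 and
    True, since e followed by U+0301 composes to U+00E9 under FormC.
    s.IsNormalized(NormalizationForm.FormC) can be used first if you only
    want to normalize when needed.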

    Hans Kesting
     
    Hans Kesting, Dec 19, 2008
    #4
  5. jm

    jm Guest

    Hans Kesting wrote:

    >
    > If you normalize the string to "FormC", then you will get a printable
    > string:
    > string s = ...
    > s = s.Normalize(System.Text.NormalizationForm.FormC);
    > This will combine the characters and the non-spacing accents into accented
    > characters.
    >
    > Apparently your text was in "FormD".
    >


    That seems to work OK. I had never heard of this normalize stuff!

    Many thanks,
    Jean-Michel
     
    jm, Dec 19, 2008
    #5
  6. jm

    jm Guest

    Mark Rae [MVP] wrote:
    > "jm" wrote in message
    > news:494bbdd3$0$18379$...
    >
    >> That seems to work OK. I had never heard of this normalize stuff!

    >
    > http://www.google.co.uk/search?sour...1T4GPTB_en-GBGB298GB298&q="C#" Text Normalize
    >
    >


    Thanks for the link to Google. I had never heard of Google. This is
    great. So, if you know you have to normalize a string, you can enter
    text and normalize and you get web pages about this.
    Very interesting indeed.

    Jean-Michel
     
    jm, Dec 19, 2008
    #6