UTF-8 and diacritics combining characters

Discussion in 'ASP .Net' started by jm, Dec 19, 2008.

  1. jm

    jm Guest

    Hi All,

    I'm trying to read a UTF-8 file where diacritics are encoded using
    combining characters, i.e. an accented character is represented as the
    unaccented character followed by the combining accent character.

    Example: é is 65 CC 81

    It seems that the UTF8 encoding does not handle this:
    Dim sr As IO.StreamReader = New System.IO.StreamReader(fname, Encoding.UTF8)

    I do not get my accented characters.

    Is there another encoding that can cope with such a UTF-8 file?

    jm, Dec 19, 2008

  2. When you view a UTF-8 file as if it were an ANSI file you will often see
    the characters you expect, but preceded by other characters.

    The reason for this is the manner in which UTF-8 encodes Unicode
    characters and the values in the Unicode domain chosen for the more common
    Latin accented characters. You should not conclude that the appearance of
    the expected character when viewed as ANSI is intended; it's merely a
    coincidence and is not significant to the coding.

    The UTF8 encoder definitely does handle UTF-8 correctly.

    The problem is that your initial encoding for é is wrong. é is c3 a9 in
    UTF-8 encoding. Most, if not all, of the characters in the upper portion
    of the ISO-8859-1 set will encode as 2 bytes in UTF-8. You would need to
    move further up the Unicode set of characters to get to the point where
    3 bytes are needed to encode a character in UTF-8.
    Anthony Jones, Dec 19, 2008
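
The two byte sequences mentioned so far (65 CC 81 from the question and c3 a9 from the reply) can be compared directly. The following is only a minimal sketch: the class name `Utf8FormsDemo` and the helper `DescribeCodeUnits` are made up for illustration and are not part of .NET. It decodes each sequence with `Encoding.UTF8` and lists the resulting UTF-16 code units.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the thread or of .NET.
public static class Utf8FormsDemo
{
    // Decode raw UTF-8 bytes and list the resulting UTF-16 code units as U+XXXX.
    public static string DescribeCodeUnits(byte[] utf8Bytes)
    {
        string decoded = Encoding.UTF8.GetString(utf8Bytes);
        var sb = new StringBuilder();
        foreach (char c in decoded)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        byte[] fromQuestion = { 0x65, 0xCC, 0x81 }; // bytes quoted in the first post
        byte[] fromReply = { 0xC3, 0xA9 };          // bytes quoted in the reply

        Console.WriteLine(DescribeCodeUnits(fromQuestion)); // U+0065 U+0301
        Console.WriteLine(DescribeCodeUnits(fromReply));    // U+00E9
    }
}
```

Both sequences decode without error. The first yields the letter e followed by the combining acute accent U+0301, the second yields the single precomposed character U+00E9, so the decoder preserves whichever form the file contains.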

  3. jm

    jm Guest

    Anthony Jones wrote:

    You say it is wrong, but when I open the file in Word or Firefox,
    I get the right character (é) displayed.

    Should I understand that Encoding.UTF8 in .NET does not handle combining
    characters and that there is no workaround?

    Actually, that is all I need to know, so I can tell the people who
    created this file that we cannot handle it because of a .NET limitation.

    jm, Dec 19, 2008
  4. Hans Kesting

    Hans Kesting Guest

    jm wrote on 19-12-2008:
    If you normalize the string to "FormC", then you will get a printable
    string:
    string s = ...
    s = s.Normalize(System.Text.NormalizationForm.FormC);
    This will combine the base characters and the non-spacing accents into
    accented characters.

    Apparently your text was in "FormD".

    Hans Kesting
    Hans Kesting, Dec 19, 2008
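
The normalization step suggested above could be combined with the `StreamReader` from the original question roughly as follows. This is a sketch under assumptions: `NormalizeDemo` and `ReadAllTextAsFormC` are hypothetical names, and a `MemoryStream` holding the three bytes from the question stands in for the real file.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, not part of the thread or of .NET.
public static class NormalizeDemo
{
    // Read a UTF-8 stream to the end and compose base letters with their
    // combining accents (Unicode normalization form C).
    public static string ReadAllTextAsFormC(Stream input)
    {
        using (var reader = new StreamReader(input, Encoding.UTF8))
        {
            return reader.ReadToEnd().Normalize(NormalizationForm.FormC);
        }
    }

    public static void Main()
    {
        // Stand-in for the file: 'e' followed by the combining acute accent.
        byte[] bytes = { 0x65, 0xCC, 0x81 };
        string text = ReadAllTextAsFormC(new MemoryStream(bytes));

        Console.WriteLine(text.Length);        // 1
        Console.WriteLine(text == "\u00E9");   // True
    }
}
```

For a real file, the `MemoryStream` would be replaced by something like `File.OpenRead(fname)`. Normalizing once, right after reading, means later comparisons and lookups see the composed characters.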
  5. jm

    jm Guest

    Hans Kesting wrote:
    That seems to work OK. I had never heard of this normalize stuff!

    Many thanks,
    jm, Dec 19, 2008