attempting to print unicode characters.

Discussion in 'C Programming' started by Ray, Aug 29, 2010.

  1. Ray

    Ray Guest

    Hi. I'm trying to print Unicode characters to standard output, and failing.
    Before you ask, yes, my term programs (I've tried six) are in UTF-8
    encoding and yes I'm using a font that does have a glyph for the "a with
    dieresis" character that the examples/attempts use.

    I have checked terminal mode and font by typing ä directly at the
    prompt, where it shows up just fine.

    But I haven't been able to get wide-character output from a C program.

    Here are a number of minimal example programs that didn't work.

    First attempt: This prints a question mark.
    #include <stdio.h>
    #include <wchar.h>
    #include <assert.h>

    int main(){
    wchar_t vowel;
    char utf[8];
    /* set output stream to wide-character mode, or halt. */
    assert(fwide(stdout, 1) > 0);
    /* unicode value of a with dieresis, U+00E4 */
    vowel = 0x00e4;
    /* attempt to print. */
    wprintf(L"%Lc \n", vowel);
    }


    Drat, I said, maybe it doesn't represent wchar_t characters using
    unicode values. So I read man pages and found that there are
    standard functions to convert utf-8 to wchar_t, and went, "oh,
    so utf-8 is what it deals with, and this function has to know
    how to translate that into whatever wchar_t format it's using,
    right?"

    Second attempt:

    #include <stdio.h>
    #include <wchar.h>
    #include <stdlib.h>
    #include <assert.h>

    int main(){
    wchar_t vowel;

    /* set output stream to wide-character mode, or halt. */
    assert(fwide(stdout, 1) > 0);

    /* convert utf8 to wide character. This assert succeeds
    so it's getting something.... */
    assert(mbrtowc(&vowel, "ä", 3, NULL) > 0);

    /* but this assert fails so what it got was apparently
    a long-character NUL. WTF? */
    assert(vowel != 0);

    /* attempt to print */
    wprintf(L"%Lc \n", vowel);
    }


    After seeing what happened with the second attempt, I realized that
    C compilers don't *have* to do anything meaningful with UTF8 text,
    either, although the point of providing a utf-8 conversion function
    if they don't is sorta murky to me... so I tried encoding the utf-8
    representation directly by hand and then using mbrtowc.

    third attempt:


    #include <stdio.h>
    #include <wchar.h>
    #include <stdlib.h>
    #include <assert.h>

    int main(){
    wchar_t vowel;
    char utf[8];

    /* set output stream to wide-character mode, or halt. */
    assert(fwide(stdout, 1) > 0);

    /* utf8 encoding of a with dieresis U+00E4 */
    utf[0] = 0xc3;
    utf[1] = 0xa4;
    utf[2] = 0;

    /* convert utf8 to wide character. This assert succeeds
    so it's getting something.... */
    assert(mbrtowc(&vowel, utf, 3, NULL) > 0);

    /* but this assert fails so what it got was apparently
    a long-character NUL. */
    assert(vowel != 0);

    /* attempt to print */
    wprintf(L"%Lc \n", vowel);
    }

    Now I'm completely at a loss. My source is pure ASCII, I don't
    rely on Unicode encodings, I provide the UTF-8 encoding that
    mbrtowc's man page says it understands by hand, and it's still
    failing. How can I get a C program to write out wide characters?

    Bear
     
    Ray, Aug 29, 2010
    #1

  2. Ray

    Alan Curry Guest

    You're checking the wrong thing here. mbrtowc error return values are
    not <=0. They are (size_t)-1 and (size_t)-2 which are large positive
    values.

    size_t res = mbrtowc(&vowel, utf, 3, NULL);
    assert(res!=(size_t)-1);
    assert(res!=(size_t)-2);
     
    Alan Curry, Aug 29, 2010
    #2

  3. Ray

    Geoff Guest

    I'm not sure about UTF-8 to wchar_t but I think you have to setlocale to get it
    to print properly. Here's what I did with your last sample:

    /* attempt to print */
    setlocale(LC_ALL,"german");
    wprintf(L"%Lc \n", vowel);
    wprintf(L"ä\n");
    wprintf(L"Ä\n");
    return 0;

    The last two produce the desired output. Your size_t res = mbrtowc(&vowel, utf,
    3, NULL); is returning 1 on my system, I expected it to return 2 since my
    documentation says "If the next count or fewer bytes complete a valid multibyte
    character, the value returned is the number of bytes that complete the multibyte
    character." - but -

    size_t res = mbrtowc(&vowel, "ä", 3, NULL);

    returns 1 also.

    I tried this on a Windows 7 system but I have not tried it on a *nix OS.
     
    Geoff, Aug 29, 2010
    #3
  4. Yes. Usually you want to call it with setlocale(LC_ALL, ""); to
    use the environment's locale.
     
    Mikko Rauhala, Aug 29, 2010
    #4
  5. Ray

    Nobody Guest

    Setting LC_ALL is often a fast route to having your software break in
    non-English locales. E.g. generating floating-point values which use a
    comma as the decimal separator is fine for displaying information to a
    human, but not so good if you're trying to generate data in a specific
    format (which invariably uses the US convention).

    It's often better to only set the specific categories which need to be
    set, e.g. LC_CTYPE for anything relating to encodings. Setting LC_NUMERIC
    should set alarm bells ringing; I've yet to encounter a non-trivial
    program where setting LC_NUMERIC didn't require switching it back to the
    "C" locale occasionally. In one case, we ended up adding #define's
    so that *printf generated an error (you had to explicitly use e.g.
    printf_localized() or printf_data() instead).

    The last time I worked in continental Europe, most of the programmers
    had their systems set to the US locale. The reason was to allow
    copy-and-paste between localised programs and standardised file formats
    (e.g. source code).
     
    Nobody, Aug 29, 2010
    #5
  6. Which is not what you want. You may use the wide output functions but
    the output you want is multi-byte, not wide. As has already been
    mentioned, setlocale is the key here. Once the C run-time knows the
    final output encoding required you can print from wide strings or
    multi-byte strings and both will work.

    This may sound like nit-picking but it occurs to me that you may not
    know that you can use the plain I/O functions to print from wide
    character arrays (and indeed from non-wide character arrays, but mixing
    and matching can be a bad design decision).

    If you don't set the stream to wide, its orientation is decided by the
    first output operation. That's useful for testing -- you can switch
    between wprintf and printf without having to change the fwide call.

    Also, in a large program it is *very* useful to test the return from
    your print functions. Printing to a stream of the wrong orientation is
    one of the simplest ways to get an output failure and lots of
    traditional code does not look at the return from printf.

    Aside: I use this to test code. If you want to test the output error
    path, set the stream to the "other" orientation and you are set. You
    often need to use freopen if you want to do this mid-output.

    I hope that helps to round out the picture.

    BTW, I think the FAQ needs some stuff on wide/multi-byte I/O. It will
    get ever more common for people to come here confused by this important
    area of the language. I don't think there is much one can do about that
    other than starting a new FAQ.

    <snip>
     
    Ben Bacarisse, Aug 29, 2010
    #6
  7. Ray

    Ray Guest

    okay, fourth attempt:

    #include <stdio.h>
    #include <wchar.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <locale.h>

    int main(){
    wchar_t vowel;
    char utf[8];

    /* set output stream to wide-character mode, or halt. */
    assert(fwide(stdout, 1) > 0);
    assert (setlocale(LC_ALL, "en_US.utf8") != NULL);
    wprintf(L"ä \n");
    }

    This prints a lower-case 'a'. That's better, but still wrong.
    Does "en_US.utf8" suppress accents?

    fifth attempt:

    #include <stdio.h>
    #include <wchar.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <locale.h>

    int main(){
    wchar_t vowel;
    char utf[8];

    /* set output stream to wide-character mode, or halt. */
    assert(fwide(stdout, 1) > 0);
    assert (setlocale(LC_ALL, "POSIX") != NULL);
    wprintf(L"ä \n");
    }

    does not change anything, this still prints a lower-case 'a'.

    Hmm, whatever 'locale' the darn terminal is using allows ä to show up,
    so against the advice of another poster I'll try the empty string with
    setlocale().

    sixth attempt:

    #include <stdio.h>
    #include <wchar.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <locale.h>

    int main(){
    wchar_t vowel;
    char utf[8];

    /* set output stream to wide-character mode, or halt. */
    assert(fwide(stdout, 1) > 0);
    assert (setlocale(LC_ALL, "") != NULL);
    wprintf(L"ä \n");
    }

    Changes nothing. it *still* prints a lower-case 'a'.

    locale -a on my system returns
    C
    en_US.utf8
    POSIX

    the first is the default locale for C programs and restricted to 7-bit
    characters according to the setlocale manpage.

    the second is what my term programs are set to, and they show most unicode
    characters fine.

    But none of them work. /usr/share/i18n/SUPPORTED lists 417 more. I decided
    I would try a German locale, since it was explicitly recommended
    upthread.

    seventh attempt:

    #include <stdio.h>
    #include <wchar.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <locale.h>

    int main(){
    wchar_t vowel;
    char utf[8];

    /* set output stream to wide-character mode, or halt. */
    assert(fwide(stdout, 1) > 0);
    assert (setlocale(LC_ALL, "de_DE.UTF-8") != NULL);
    wprintf(L"ä \n");
    }

    This time the call to setlocale returned NULL so the assert failed.
    I suppose that means I need to download the corresponding locale data
    before I can do that?

    Bear,
    still having no luck....
     
    Ray, Aug 29, 2010
    #7
  8. Ray

    Ray Guest


    when I tried this, it printed a question mark. On further debugging,
    it appears that a call to setlocale(LC_ALL,"german") on my system is
    returning NULL first, so the failure to setlocale is probably the
    reason why the printing fails.

    Bear
     
    Ray, Aug 29, 2010
    #8
  9. Ray

    Ray Guest

    So, okay, I went and downloaded and installed LOCALES_ALL, a package
    that installs all available locales. This changes things a little
    but still doesn't do what I want.

    The version with
    setlocale(LC_ALL,"german");

    still prints a question mark. On inspection, setlocale is still
    returning NULL first.

    replacing it with
    setlocale(LC_ALL, "de_DE.UTF-8");

    is better though; setlocale succeeds and the program goes on to print
    'ae'. Very nice, this is regarded in german as an alternate spelling
    of ä. And this illuminates the reason why, with
    setlocale(LC_ALL, "en_US.UTF-8") earlier, it printed a lower-case 'a'.
    Because, likewise, 'a' is regarded, in the US, as an alternate spelling
    of any accented version of 'a.'

    But, Dammit, it's NOT ä!

    I wanted to do something I thought was simple. What confluence of forces
    must I align in order to allow it? Is there a "standard" locale that just
    gets the hell out of my way and lets me use UTF-8 without trying to
    second-guess its linguistic meaning?

    Bear
     
    Ray, Aug 29, 2010
    #9
  10. Ray

    BartC Guest

    What happens with trying to print wide/unicode character 0x20AC (ie. the €
    euro symbol)? This 16-bit value is less likely to be mistaken for an 8-bit
    character, nor for a multi-byte (UTF8) sequence.
     
    BartC, Aug 29, 2010
    #10
  11. Ray

    Ray Guest

    BartC wrote:

    What happens with trying to print wide/unicode character 0x20AC (ie. the €
    euro symbol)? This 16-bit value is less likely to be mistaken for an 8-bit
    character, nor for a multi-byte (UTF8) sequence.

    Attempting
    wprintf(L"€ \n");

    prints the three letters 'EUR', in both de_DE.UTF-8 and in en_US.UTF-8
    locales.

    Bear
     
    Ray, Aug 29, 2010
    #11
  12. Ray

    BartC Guest

    That's weird. But try it this way:

    wchar_t wstr[10];

    wstr[0]=0x20AC;
    wstr[1]=0;
    wprintf(L"Wstr: <%s> %X\n",wstr,wstr[0]);

    Even here, I get mixed results myself (only one compiler out of three shows
    €, in a Windows console set to codepage 1252).

    Trying to have a literal L"€" string generated an error in gcc, and
    converted it to code 0x80 in another compiler.
     
    BartC, Aug 29, 2010
    #12
  13. Ray

    Geoff Guest

    Since your system printed EUR when given €, I suspect your terminal is cooking
    the output somehow. I suspect you are dealing with two problems, utf-8 to
    Unicode(?) conversion and your terminal locale cooking. On my Windows box I
    tried this modification of one of your previous attempts plus Alan Curry's
    advice:

    #include <stdio.h>
    #include <wchar.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <locale.h>

    int main(void)
    {
    wchar_t vowel;
    char utf[8];

    /* set output stream to wide-character mode, or halt. */
    assert(fwide(stdout, 1) > 0);

    /* utf8 encoding of a with dieresis U+00E4 */
    utf[0] = 0xc3;
    utf[1] = 0xa4;
    utf[2] = 0;
    utf[3] = 0;

    /* convert utf8 to wide character. This assert succeeds
    so it's getting something.... */
    size_t res = mbrtowc(&vowel, utf, 3, NULL);
    assert(res!=(size_t)-1);
    assert(res!=(size_t)-2);

    /* but this assert fails so what it got was apparently
    a long-character NUL. */
    assert(vowel != 0);

    /* attempt to print */
    printf("Initial locale: %s\n", setlocale(LC_ALL, NULL));
    printf("%Lc \n", vowel);
    wprintf(L"ä\n");
    wprintf(L"Ä\n");
    printf("Default locale: %s\n", setlocale(LC_ALL, ""));
    printf("%Lc \n", vowel);
    wprintf(L"ä\n");
    wprintf(L"Ä\n");
    return 0;
    }

    With this output:

    Initial locale: C
    +
    S
    -
    Default locale: English_United States.1252
    A
    ä
    Ä

    NOTE: In the C locale the characters were a graphic (U+251C, I think), greek
    capital sigma, long dash. They are printing differently as +, S, - in my news
    client. The vowel consistently gets translated wrong on my system. In my VC2010
    debugger the vowel is displayed as a capital A with a tilde.

    I hope this sheds some light on the issue.
     
    Geoff, Aug 29, 2010
    #13
  14. Ray

    Ray Guest

    okay, my results on that one were very weird. it outputs everything
    up to the first directive - the point where it ought to print € --
    and that's all. Not only does the € not show up, but the rest of
    the string, the code, and the newline don't show up either.

    Here's the code:

    #include <stdio.h>
    #include <wchar.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <locale.h>

    int main(){
    wchar_t str[8];
    str[0] = 0x20AC;
    str[1] = 0;

    assert(fwide(stdout, 1) > 0);
    assert(setlocale(LC_ALL, "en_US.UTF-8") != 0);
    wprintf(L"str:%s code:%X \n",str, str[0]);
    }


    and here's what my console looks like after running it:

    [email protected]:~/src/dsh$ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=en_US.UTF-8
    [email protected]:~/src/dsh$ gcc -o example example.c
    [email protected]:~/src/dsh$ ./example
    str:[email protected]:~/src/dsh$

    4 characters of output and nothing more. Also, no error
    and no corefile.

    Bear
     
    Ray, Aug 29, 2010
    #14
  15. Ray

    Alan Curry Guest

    That sounds like it's generating something which your terminal is
    interpreting as an escape code, which eats some of the characters that
    follow. This could easily happen if your terminal is not really in UTF-8
    mode. How sure are you about that?

    Try running your program piped to od -t x1 to dump the hex values of all the
    output bytes. The euro sign in UTF-8 should be 3 bytes, e2 82 ac. If those 3
    bytes show up, then your program is printing in UTF-8 correctly. If they
    don't, well at least it'll be interesting to find out what does show up.
     
    Alan Curry, Aug 29, 2010
    #15
  16. No. The problem here is, I think, setting the stream orientation before
    setting the locale. I think the IO system needs to know the locale when
    it sets up the stream's orientation.

    A locale setting should almost always be one of the first things a
    program does. Put the locale setting first and I think it will work.

    All the other programs have the same problem.

    <snip>
     
    Ben Bacarisse, Aug 29, 2010
    #16
  17. The above should not work although I suppose accidents can happen with
    undefined behaviour. %s is used to print a multi-byte encoded string,
    specifically it needs a char * argument not a wchar_t * one. %ls is
    what you need to print a wide string.
    To check the compiler, rather than print the strings (because this can
    go wrong for environmental reasons, stream orientation issues etc) see
    what L"€" generates. If you are using UTF-8 as the multi-byte encoding
    (C does not specify it) sizeof it should be twice the size of wchar_t
    and L"€"[0] should be 8364 with L"€"[1] zero:

    #include <stdio.h>

    int main(void)
    {
    printf("%zu %zu %ld %ld\n",
    sizeof *L"€", sizeof L"€", (long)L"€"[0], (long)L"€"[1]);
    return 0;
    }

    by not using wchar.h, setlocale and the rest we can just check the
    compiler is doing the right thing to start with.
     
    Ben Bacarisse, Aug 30, 2010
    #17
  18. It is, again, the problem of setting the orientation before the locale.
    The IO system needs to know the locale. I am not sure if this is
    covered by the C standard but it seems to be how things work in practice.
    Drop this line. Let the stream orientation get set from the first IO
    operation -- at least until you get the hang of it.
    But, also, the format is wrong. %s expects a char *, specifically a
    multi-byte encoded string. %ls is for wide strings.
    <snip output>
     
    Ben Bacarisse, Aug 30, 2010
    #18
  19. Ray

    BartC Guest

    I understood that %s in wprintf() or %S in printf() printed a Unicode
    string, and %S in wprintf() or %s in printf printed an ordinary one
    (according to MSDN docs for wprintf()).

    I'm not too interested in multi-byte strings.
    I got 2 4 128 0 on two compilers (while gcc 3.4.5 didn't like the literal).

    I still don't get why € gets converted to 0x80/128 instead of 0x20AC/8364,
    especially as wchar_t width is 2 bytes. Neither code is the multi-byte 0xE2
    0x82 0xAC version.
     
    BartC, Aug 30, 2010
    #19
  20. That's fine, but there is some evidence that the OP wants to use
    standard C (for example, fwide is not at all standard in MS C). A
    version of C that prints wide strings with %s is not standard C and is
    just going to further confuse matters.

    To further complicate things, my quick review of the docs suggests that
    wprintf uses %ls as per standard for wide strings (VS 2010).
    http://msdn.microsoft.com/en-us/library/tcxf1dw6.aspx
    All the more odd to use %s then!
    That's its prerogative (what characters can appear in C source is
    implementation defined) but it is highly suggestive that your source is
    not UTF-8 encoded (or whatever gcc has been told to expect from the
    environment or compiler flags). The 128 suggests the same. What's the
    error it reports?
    One explanation is that your source is not UTF-8 encoded. The
    Windows-1252 encoding where the euro is 0x80 seems likely. The L"..."
    construct must then try to do something with that lone 0x80 byte and
    making U+0080 from it is one rational choice. I'd certainly look at
    the source to see what is there between the "s.

    Alternatively, the system may not be using Unicode at all. C does not
    (yet) require Unicode/UTF-8 to be used as the wide and multi-byte
    encodings.
    That I am not surprised by. The effect of L"..." is to make a wide
    string from the multi-byte encoding within the ... part. I'd not expect
    to see the UTF-8 encoding anywhere in the executable. It should be
    there in the source and I'd definitely look at the source to see what
    you really have in that string. The system may be trying to use Unicode
    but your source code might be in some other encoding.

    Plug: utf-8-dump http://bsb.me.uk/software/utf-8-dump/ though it
    probably won't work under Windows. Unix/Linux people might find it
    helpful for this sort of investigation.
     
    Ben Bacarisse, Aug 30, 2010
    #20
