attempting to print unicode characters.


Ray

Hi. I'm trying to print Unicode characters to standard output, and failing.
Before you ask, yes, my term programs (I've tried six) are in UTF-8
encoding and yes I'm using a font that does have a glyph for the "a with
dieresis" character that the examples/attempts use.

I have checked terminal mode and font by typing ä directly at the
prompt, where it shows up just fine.

But I haven't been able to get wide-character output from a C program.

Here are a number of minimal example programs that didn't work.

First attempt: This prints a question mark.
#include <stdio.h>
#include <wchar.h>
#include <assert.h>

int main(){
wchar_t vowel;
char utf[8];
/* set output stream to wide-character mode, or halt. */
assert(fwide(stdout, 1) > 0);
/* unicode value of a with dieresis, U+00E4 */
vowel = 0x00e4;
/* attempt to print. */
wprintf(L"%Lc \n", vowel);
}


Drat, I said, maybe it doesn't represent wchar_t characters using
unicode values. So I read man pages and found that there are
standard functions to convert utf-8 to wchar_t, and went, "oh,
so utf-8 is what it deals with, and this function has to know
how to translate that into whatever wchar_t format it's using,
right?"

Second attempt:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>

int main(){
wchar_t vowel;

/* set output stream to wide-character mode, or halt. */
assert(fwide(stdout, 1) > 0);

/* convert utf8 to wide character. This assert succeeds
so it's getting something.... */
assert(mbrtowc(&vowel, "ä", 3, NULL) > 0);

/* but this assert fails so what it got was apparently
a long-character NUL. WTF? */
assert(vowel != 0);

/* attempt to print */
wprintf(L"%Lc \n", vowel);
}


After seeing what happened with the second attempt, I realized that
C compilers don't *have* to do anything meaningful with UTF8 text,
either, although the point of providing a utf-8 conversion function
if they don't is sorta murky to me... so I tried encoding the utf-8
representation directly by hand and then using mbrtowc.

third attempt:


#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>

int main(){
wchar_t vowel;
char utf[8];

/* set output stream to wide-character mode, or halt. */
assert(fwide(stdout, 1) > 0);

/* utf8 encoding of a with dieresis U+00E4 */
utf[0] = 0xc3;
utf[1] = 0xa4;
utf[2] = 0;

/* convert utf8 to wide character. This assert succeeds
so it's getting something.... */
assert(mbrtowc(&vowel, utf, 3, NULL) > 0);

/* but this assert fails so what it got was apparently
a long-character NUL. */
assert(vowel != 0);

/* attempt to print */
wprintf(L"%Lc \n", vowel);
}

Now I'm completely at a loss. My source is pure ASCII, I don't
rely on Unicode encodings, I provide the UTF-8 encoding that
mbrtowc's man page says it understands by hand, and it's still
failing. How can I get a C program to write out wide characters?

Bear
 
Alan Curry

/* convert utf8 to wide character. This assert succeeds
so it's getting something.... */
assert(mbrtowc(&vowel, "ä", 3, NULL) > 0);

You're checking the wrong thing here. mbrtowc error return values are
not <=0. They are (size_t)-1 and (size_t)-2 which are large positive
values.

size_t res = mbrtowc(&vowel, utf, 3, NULL);
assert(res!=(size_t)-1);
assert(res!=(size_t)-2);
 
Geoff


I'm not sure about UTF-8 to wchar_t but I think you have to setlocale to get it
to print properly. Here's what I did with your last sample:

/* attempt to print */
setlocale(LC_ALL,"german");
wprintf(L"%Lc \n", vowel);
wprintf(L"ä\n");
wprintf(L"Ä\n");
return 0;

The last two produce the desired output. Your size_t res = mbrtowc(&vowel, utf,
3, NULL); is returning 1 on my system, I expected it to return 2 since my
documentation says "If the next count or fewer bytes complete a valid multibyte
character, the value returned is the number of bytes that complete the multibyte
character." - but -

size_t res = mbrtowc(&vowel, "ä", 3, NULL);

returns 1 also.

I tried this on a Windows 7 system but I have not tried it on a *nix OS.
 
Mikko Rauhala

I'm not sure about UTF-8 to wchar_t but I think you have to setlocale
to get it to print properly. Here's what I did with your last sample:

/* attempt to print */
setlocale(LC_ALL,"german");

Yes. Usually you want to call it with setlocale(LC_ALL, ""); to
use the environment's locale.
 
Nobody

Yes. Usually you want to call it with setlocale(LC_ALL, ""); to
use the environment's locale.

Setting LC_ALL is often a fast route to having your software break in
non-English locales. E.g. generating floating-point values which use a
comma as the decimal separator is fine for displaying information to a
human, but not so good if you're trying to generate data in a specific
format (which invariably uses the US convention).

It's often better to only set the specific categories which need to be
set, e.g. LC_CTYPE for anything relating to encodings. Setting LC_NUMERIC
should set alarm bells ringing; I've yet to encounter a non-trivial
program where setting LC_NUMERIC didn't require switching it back to the
"C" locale occasionally. In one case, we ended up adding #define's
so that *printf generated an error (you had to explicitly use e.g.
printf_localized() or printf_data() instead).

The last time I worked in continental Europe, most of the programmers
had their systems set to the US locale. The reason was to allow
copy-and-paste between localised programs and standardised file formats
(e.g. source code).
 
Ben Bacarisse

Ray said:
Hi. I'm trying to print Unicode characters to standard output, and failing.
Before you ask, yes, my term programs (I've tried six) are in UTF-8
encoding and yes I'm using a font that does have a glyph for the "a with
dieresis" character that the examples/attempts use.

I have checked terminal mode and font by typing ä directly at the
prompt, where it shows up just fine.

But I haven't been able to get wide-character output from a C
program.

Which is not what you want. You may use the wide output functions but
the output you want is multi-byte not wide. As has already been
mentioned, setlocale is the key here. Once the C run-time knows the
final output encoding required you can print from wide strings or
multi-byte strings and both will work.

This may sound like nit-picking but it occurs to me that you may not
know that you can use the plain I/O functions to print from wide
character arrays (and indeed from non-wide character arrays, but mixing
and matching can be a bad design decision).

If you don't set the stream to wide, its orientation is decided by the
first output operation. That's useful for testing -- you can switch
between wprintf and printf without having to change the fwide call.

Also, in a large program it is *very* useful to test the return from
your print functions. Printing to a stream of the wrong orientation is
one of the simplest ways to get an output failure and lots of
traditional code does not look at the return from printf.

Aside: I use this to test code. If you want to test the output error
path, set the stream to the "other" orientation and you are set. You
often need to use freopen if you want to do this mid-output.

I hope that helps to round out the picture.

BTW, I think the FAQ needs some stuff on wide/multi-byte I/O. It will
get ever more common for people to come here confused by this important
area of the language. I don't think there is much one can do about that
other than starting a new FAQ.

<snip>
 
Ray

Ben said:
Which is not what you want. You may use the wide output functions but
the output you want is multi-byte not wide. As has already been
mentioned, setlocale is the key here. Once the C run-time knows the
final output encoding required you can print from wide strings or
multi-byte strings and both will work.

okay, fourth attempt:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>
#include <locale.h>

int main(){
wchar_t vowel;
char utf[8];

/* set output stream to wide-character mode, or halt. */
assert(fwide(stdout, 1) > 0);
assert (setlocale(LC_ALL, "en_US.utf8") != NULL);
wprintf(L"ä \n");
}

This prints a lower-case 'a'. That's better, but still wrong.
Does "en_US.utf8" suppress accents?

fifth attempt:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>
#include <locale.h>

int main(){
wchar_t vowel;
char utf[8];

/* set output stream to wide-character mode, or halt. */
assert(fwide(stdout, 1) > 0);
assert (setlocale(LC_ALL, "POSIX") != NULL);
wprintf(L"ä \n");
}

does not change anything, this still prints a lower-case 'a'.

Hmm, whatever 'locale' the darn terminal is using allows ä to show up,
so against the advice of another poster I'll try the empty string with
setlocale().

sixth attempt:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>
#include <locale.h>

int main(){
wchar_t vowel;
char utf[8];

/* set output stream to wide-character mode, or halt. */
assert(fwide(stdout, 1) > 0);
assert (setlocale(LC_ALL, "") != NULL);
wprintf(L"ä \n");
}

Changes nothing. it *still* prints a lower-case 'a'.

locale -a on my system returns
C
en_US.utf8
POSIX

the first is the default locale for C programs and restricted to 7-bit
characters according to the setlocale manpage.

the second is what my term programs are set to, and they show most unicode
characters fine.

But none of them work. /usr/share/i18n/SUPPORTED lists 417 more. I decided
I would try a German locale, since it was explicitly recommended
upthread.

seventh attempt:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>
#include <locale.h>

int main(){
wchar_t vowel;
char utf[8];

/* set output stream to wide-character mode, or halt. */
assert(fwide(stdout, 1) > 0);
assert (setlocale(LC_ALL, "de_DE.UTF-8") != NULL);
wprintf(L"ä \n");
}

This time the call to setlocale returned NULL so the assert failed.
I suppose that means I need to download the corresponding locale data
before I can do that?

Bear,
still having no luck....
 
Ray

Geoff said:
/* attempt to print */
setlocale(LC_ALL,"german");
wprintf(L"%Lc \n", vowel);
wprintf(L"ä\n");
wprintf(L"Ä\n");
return 0;

The last two produce the desired output.


when I tried this, it printed a question mark. On further debugging,
it appears that a call to setlocale(LC_ALL,"german") on my system is
returning NULL first, so the failure to setlocale is probably the
reason why the printing fails.

Bear
 
Ray

Ray said:
when I tried this, it printed a question mark. On further debugging,
it appears that a call to setlocale(LC_ALL,"german") on my system is
returning NULL first, so the failure to setlocale is probably the
reason why the printing fails.

Bear

So, okay, I went and downloaded and installed LOCALES_ALL, a package
that installs all available locales. This changes things a little
but still doesn't do what I want.

The version with
setlocale(LC_ALL,"german");

still prints a question mark. On inspection, setlocale is still
returning NULL first.

replacing it with
setlocale(LC_ALL, "de_DE.UTF-8");

is better though; setlocale succeeds and the program goes on to print
'ae'. Very nice, this is regarded in german as an alternate spelling
of ä. And this illuminates the reason why, with
setlocale(LC_ALL, "en_US.UTF-8") earlier, it printed a lower-case 'a'.
Because, likewise, 'a' is regarded, in the US, as an alternate spelling
of any accented version of 'a.'

But, Dammit, it's NOT ä!

I wanted to do something I thought was simple. What confluence of forces
must I align in order to allow it? Is there a "standard" locale that just
gets the hell out of my way and lets me use UTF-8 without trying to
second-guess its linguistic meaning?

Bear
 
BartC

The version with
setlocale(LC_ALL,"german");

still prints a question mark. On inspection, setlocale is still
returning NULL first.

replacing it with
setlocale(LC_ALL, "de_DE.UTF-8");

is better though; setlocale succeeds and the program goes on to print
'ae'. Very nice, this is regarded in german as an alternate spelling
of ä. And this illuminates the reason why, with
setlocale(LC_ALL, "en_US.UTF-8") earlier, it printed a lower-case 'a'.
Because, likewise, 'a' is regarded, in the US, as an alternate spelling
of any accented version of 'a.'

But, Dammit, it's NOT ä!

I wanted to do something I thought was simple. What confluence of forces
must I align in order to allow it? Is there a "standard" locale that just
gets the hell out of my way and lets me use UTF-8 without trying to
second-guess its linguistic meaning?

What happens with trying to print wide/unicode character 0x20AC (ie. the €
euro symbol)? This 16-bit value is less likely to be mistaken for an 8-bit
character, nor for a multi-byte (UTF8) sequence.
 
Ray

BartC wrote:

What happens with trying to print wide/unicode character 0x20AC (ie. the €
euro symbol)? This 16-bit value is less likely to be mistaken for an 8-bit
character, nor for a multi-byte (UTF8) sequence.

Attempting
wprintf(L"€ \n");

prints the three letters 'EUR', in both de_DE.UTF-8 and in en_US.UTF-8
locales.

Bear
 
BartC

Ray said:
BartC wrote:

What happens with trying to print wide/unicode character 0x20AC (ie. the €
euro symbol)? This 16-bit value is less likely to be mistaken for an 8-bit
character, nor for a multi-byte (UTF8) sequence.

Attempting
wprintf(L"€ \n");

prints the three letters 'EUR', in both de_DE.UTF-8 and in en_US.UTF-8
locales.

That's weird. But try it this way:

wchar_t wstr[10];

wstr[0]=0x20AC;
wstr[1]=0;
wprintf(L"Wstr: <%s> %X\n",wstr,wstr[0]);

Even here, I get mixed results myself (only one compiler out of three shows
€, in a Windows console set to codepage 1252).

Trying to have a literal L"€" string generated an error in gcc, and
converted it to code 0x80 in another compiler.
 
Geoff


Since your system printed EUR when given €, I suspect your terminal is cooking
the output somehow. I suspect you are dealing with two problems: UTF-8 to
Unicode(?) conversion and your terminal's locale cooking. On my Windows box I
tried this modification of one of your previous attempts plus Alan Curry's
advice:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>
#include <locale.h>

int main(void)
{
wchar_t vowel;
char utf[8];

/* set output stream to wide-character mode, or halt. */
assert(fwide(stdout, 1) > 0);

/* utf8 encoding of a with dieresis U+00E4 */
utf[0] = 0xc3;
utf[1] = 0xa4;
utf[2] = 0;
utf[3] = 0;

/* convert utf8 to wide character. This assert succeeds
so it's getting something.... */
size_t res = mbrtowc(&vowel, utf, 3, NULL);
assert(res!=(size_t)-1);
assert(res!=(size_t)-2);

/* but this assert fails so what it got was apparently
a long-character NUL. */
assert(vowel != 0);

/* attempt to print */
printf("Initial locale: %s\n", setlocale(LC_ALL, NULL));
printf("%Lc \n", vowel);
wprintf(L"ä\n");
wprintf(L"Ä\n");
printf("Default locale: %s\n", setlocale(LC_ALL, ""));
printf("%Lc \n", vowel);
wprintf(L"ä\n");
wprintf(L"Ä\n");
return 0;
}

With this output:

Initial locale: C
+
S
-
Default locale: English_United States.1252
A
ä
Ä

NOTE: In the C locale the characters were a graphic (U+251C, I think), Greek
capital sigma, long dash. They are printing differently as +, S, - in my news
client. The vowel consistently gets translated wrong on my system. In my VC2010
debugger the vowel is displayed as a capital A with a tilde.

I hope this sheds some light on the issue.
 
Ray

BartC said:
Ray said:
BartC wrote:

What happens with trying to print wide/unicode character 0x20AC (ie. the
€ euro symbol)? This 16-bit value is less likely to be mistaken for an
8-bit character, nor for a multi-byte (UTF8) sequence.

Attempting
wprintf(L"€ \n");

prints the three letters 'EUR', in both de_DE.UTF-8 and in en_US.UTF-8
locales.

That's weird. But try it this way:

wchar_t wstr[10];

wstr[0]=0x20AC;
wstr[1]=0;
wprintf(L"Wstr: <%s> %X\n",wstr,wstr[0]);

Even here, I get mixed results myself (only one compiler out of three
shows €, in a Windows console set to codepage 1252).

Trying to have a literal L"€" string generated an error in gcc, and
converted it to code 0x80 in another compiler.

okay, my results on that one were very weird. it outputs everything
up to the first directive - the point where it ought to print € --
and that's all. not only does the € not show up, but the rest of
the string, the code, and the newline don't show up too.

Here's the code:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>
#include <locale.h>

int main(){
wchar_t str[8];
str[0] = 0x20AC;
str[1] = 0;

assert(fwide(stdout, 1) > 0);
assert(setlocale(LC_ALL, "en_US.UTF-8") != 0);
wprintf(L"str:%s code:%X \n",str, str[0]);
}


and here's what my console looks like after running it:

[email protected]:~/src/dsh$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
[email protected]:~/src/dsh$ gcc -o example example.c
[email protected]:~/src/dsh$ ./example
str:[email protected]:~/src/dsh$

4 characters of output and nothing more. Also, no error
and no corefile.

Bear
 
Alan Curry

BartC said:
That's weird. But try it this way:

wchar_t wstr[10];

wstr[0]=0x20AC;
wstr[1]=0;
wprintf(L"Wstr: <%s> %X\n",wstr,wstr[0]);

okay, my results on that one were very weird. it outputs everything
up to the first directive - the point where it ought to print € --
and that's all. not only does the € not show up, but the rest of
the string, the code, and the newline don't show up too.

That sounds like it's generating something which your terminal is
interpreting as an escape code, which eats some of the characters that
follow. This could easily happen if your terminal is not really in UTF-8
mode. How sure are you about that?

Try running your program piped to od -t x1 to dump the hex values of all the
output bytes. The euro sign in UTF-8 should be 3 bytes, e2 82 ac. If those 3
bytes show up, then your program is printing in UTF-8 correctly. If they
don't, well at least it'll be interesting to find out what does show up.
 
Ben Bacarisse

Ray said:
Ben said:
Which is not what you want. You may use the wide output functions but
the output you want is multi-byte not wide. As has already been
mentioned, setlocale is the key here. Once the C run-time knows the
final output encoding required you can print from wide strings or
multi-byte strings and both will work.

okay, fourth attempt:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>
#include <locale.h>

int main(){
wchar_t vowel;
char utf[8];

/* set output stream to wide-character mode, or halt. */
assert(fwide(stdout, 1) > 0);
assert (setlocale(LC_ALL, "en_US.utf8") != NULL);
wprintf(L"ä \n");
}

This prints a lower-case 'a'. That's better, but still wrong.
Does "en_US.utf8" suppress accents?

No. The problem here is, I think, setting the stream orientation before
setting the locale. I think the IO system needs to know the locale when
it sets up the stream's orientation.

A locale setting should almost always be one of the first things a
program does. Put the locale setting first and I think it will work.

All the other programs have the same problem.

<snip>
 
Ben Bacarisse

BartC said:
Ray said:
BartC wrote:

What happens with trying to print wide/unicode character 0x20AC (ie. the €
euro symbol)? This 16-bit value is less likely to be mistaken for an 8-bit
character, nor for a multi-byte (UTF8) sequence.

Attempting
wprintf(L"€ \n");

prints the three letters 'EUR', in both de_DE.UTF-8 and in en_US.UTF-8
locales.

That's weird. But try it this way:

wchar_t wstr[10];

wstr[0]=0x20AC;
wstr[1]=0;
wprintf(L"Wstr: <%s> %X\n",wstr,wstr[0]);

Even here, I get mixed results myself (only one compiler out of three
shows €, in a Windows console set to codepage 1252).

The above should not work although I suppose accidents can happen with
undefined behaviour. %s is used to print a multi-byte encoded string,
specifically it needs a char * argument not a wchar_t * one. %ls is
what you need to print a wide string.

Trying to have a literal L"€" string generated an error in gcc, and
converted it to code 0x80 in another compiler.

To check the compiler, rather than print the strings (because this can
go wrong for environmental reasons, stream orientation issues etc) see
what L"€" generates. If you are using UTF-8 as the multi-byte encoding
(C does not specify it) sizeof L"€" should be twice the size of wchar_t
and L"€"[0] should be 8364 (0x20AC) with L"€"[1] zero:

#include <stdio.h>

int main(void)
{
printf("%zu %zu %ld %ld\n",
sizeof *L"€", sizeof L"€", (long)L"€"[0], (long)L"€"[1]);
return 0;
}

by not using wchar.h, setlocale and the rest we can just check the
compiler is doing the right thing to start with.
 
Ben Bacarisse

Ray said:
BartC said:
Ray said:
BartC wrote:

What happens with trying to print wide/unicode character 0x20AC (ie. the
€ euro symbol)? This 16-bit value is less likely to be mistaken for an
8-bit character, nor for a multi-byte (UTF8) sequence.

Attempting
wprintf(L"€ \n");

prints the three letters 'EUR', in both de_DE.UTF-8 and in en_US.UTF-8
locales.

That's weird. But try it this way:

wchar_t wstr[10];

wstr[0]=0x20AC;
wstr[1]=0;
wprintf(L"Wstr: <%s> %X\n",wstr,wstr[0]);

Even here, I get mixed results myself (only one compiler out of three
shows €, in a Windows console set to codepage 1252).

Trying to have a literal L"€" string generated an error in gcc, and
converted it to code 0x80 in another compiler.

okay, my results on that one were very weird. it outputs everything
up to the first directive - the point where it ought to print € --
and that's all. not only does the € not show up, but the rest of
the string, the code, and the newline don't show up too.

It is, again, the problem of setting the orientation before the locale.
The IO system needs to know the locale. I am not sure if this is
covered by the C standard but it seems to be how things work in practice.

Here's the code:

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <assert.h>
#include <locale.h>

int main(){
wchar_t str[8];
str[0] = 0x20AC;
str[1] = 0;

assert(fwide(stdout, 1) > 0);

Drop this line. Let the stream orientation get set from the first IO
operation -- at least until you get the hang of it.

assert(setlocale(LC_ALL, "en_US.UTF-8") != 0);
wprintf(L"str:%s code:%X \n",str, str[0]);

But, also, the format is wrong. %s expects a char *, specifically a
multi-byte encoded string. %ls is for wide strings.

<snip output>
 
BartC

Ben Bacarisse said:
wchar_t wstr[10];

wstr[0]=0x20AC;
wstr[1]=0;
wprintf(L"Wstr: <%s> %X\n",wstr,wstr[0]);

Even here, I get mixed results myself (only one compiler out of three
shows €, in a Windows console set to codepage 1252).

The above should not work although I suppose accidents can happen with
undefined behaviour. %s is used to print a multi-byte encoded string,
specifically it needs a char * argument not a wchar_t * one. %ls is
what you need to print a wide string.

I understood that %s in wprintf() or %S in printf() printed a Unicode
string, and %S in wprintf() or %s in printf printed an ordinary one
(according to MSDN docs for wprintf()).

I'm not too interested in multi-byte strings.
Trying to have a literal L"€" string generated an error in gcc, and
converted it to code 0x80 in another compiler.

To check the compiler, rather than print the strings (because this can
go wrong for environmental reasons, stream orientation issues etc) see
what L"€" generates. If you are using UTF-8 as the multi-byte encoding
(C does not specify it) sizeof it should be twice the size of wchar_t
and L"€"[0] should be 8364 with L"€"[1] zero:

#include <stdio.h>

int main(void)
{
printf("%zu %zu %ld %ld\n",
sizeof *L"€", sizeof L"€", (long)L"€"[0], (long)L"€"[1]);
return 0;
}

by not using wchar.h, setlocale and the rest we can just check the
compiler is doing the right thing to start with.

I got 2 4 128 0 on two compilers (while gcc 3.4.5 didn't like the literal).

I still don't get why € gets converted to 0x80/128 instead of 0x20AC/8364,
especially as wchar_t width is 2 bytes. Neither code is the multi-byte 0xE2
0x82 0xAC version.
 
Ben Bacarisse

BartC said:
Ben Bacarisse said:
wchar_t wstr[10];

wstr[0]=0x20AC;
wstr[1]=0;
wprintf(L"Wstr: <%s> %X\n",wstr,wstr[0]);

Even here, I get mixed results myself (only one compiler out of three
shows €, in a Windows console set to codepage 1252).

The above should not work although I suppose accidents can happen with
undefined behaviour. %s is used to print a multi-byte encoded string,
specifically it needs a char * argument not a wchar_t * one. %ls is
what you need to print a wide string.

I understood that %s in wprintf() or %S in printf() printed a Unicode
string, and %S in wprintf() or %s in printf printed an ordinary one
(according to MSDN docs for wprintf()).

That's fine, but there is some evidence that the OP wants to use
standard C (for example, fwide is not at all standard in MS C). A
version of C that prints wide strings with %s is not standard C and is
just going to further confuse matters.

To further complicate things, my quick review of the docs suggests that
wprintf uses %ls as per standard for wide strings (VS 2010).
http://msdn.microsoft.com/en-us/library/tcxf1dw6.aspx

I'm not too interested in multi-byte strings.

All the more odd to use %s then!
Trying to have a literal L"€" string generated an error in gcc, and
converted it to code 0x80 in another compiler.

To check the compiler, rather than print the strings (because this can
go wrong for environmental reasons, stream orientation issues etc) see
what L"€" generates. If you are using UTF-8 as the multi-byte encoding
(C does not specify it) sizeof it should be twice the size of wchar_t
and L"€"[0] should be 8364 with L"€"[1] zero:

#include <stdio.h>

int main(void)
{
printf("%zu %zu %ld %ld\n",
sizeof *L"€", sizeof L"€", (long)L"€"[0], (long)L"€"[1]);
return 0;
}

by not using wchar.h, setlocale and the rest we can just check the
compiler is doing the right thing to start with.

I got 2 4 128 0 on two compilers (while gcc 3.4.5 didn't like the
literal).

That's its prerogative (what characters can appear in C source is
implementation defined) but it is highly suggestive that your source is
not UTF-8 encoded (or whatever gcc has been told to expect from the
environment or compiler flags). The 128 suggests the same. What's the
error it reports?

I still don't get why € gets converted to 0x80/128 instead of
0x20AC/8364, especially as wchar_t width is 2 bytes.

One explanation is that your source is not UTF-8 encoded. The
Windows-1252 encoding where the euro is 0x80 seems likely. The L"..."
construct must then try to do something with that lone 0x80 byte and
making U+0080 from it is one rational choice. I'd certainly look at
the source to see what is there between the "s.

Alternatively, the system may not be using Unicode at all. C does not
(yet) require Unicode/UTF-8 to be used as the wide and multi-byte
encodings.

Neither code is
the multi-byte 0xE2 0x82 0xAC version.

That I am not surprised by. The effect of L"..." is to make a wide
string from the multi-byte encoding within the ... part. I'd not expect
to see the UTF-8 encoding anywhere in the executable. It should be
there in the source and I'd definitely look at the source to see what
you really have in that string. The system may be trying to use Unicode
but your source code might be in some other encoding.

Plug: utf-8-dump http://bsb.me.uk/software/utf-8-dump/ though it
probably won't work under Windows. Unix/Linux people might find it
helpful for this sort of investigation.
 
