How to use Unicode in C under Linux?

flywav

Hi all,
You know Unicode is very important. Under Linux I always use
UTF-8, but now I need to save a file in Unicode. My Linux is CentOS,
and I know this system supports Unicode. The wchar_t *p is a Unicode
string; I print its length and it is 7, which is right. I save the file,
and the file length is 7. I checked it: it is 'hello??' with ? = 0x3F. So
what's wrong with this code? Thank you.


#define __STDC_ISO_10646__ 200104L
#include <wchar.h>
#include <stdio.h>
#include <stdlib.h>
#define _TEXT(x) L ## x
int main() {

    FILE *fp = NULL;
    wchar_t *filename = _TEXT("oov.txt");
    wchar_t *p = _TEXT("hello你好");

    wprintf(_TEXT("%S\n"), p);

    wprintf(_TEXT("%d \n"), wcslen(p));

    fp = fopen( "oov.txt", "w");
    fwprintf(fp, _TEXT("%S"), p);
    fclose(fp);
    return 0;
}




my locale

LANG=zh_CN.UTF-8
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=
 
flywav

fp = fopen( "oov.txt", "w");
fwprintf(fp, _TEXT("%S"), p);

(fwprintf() prints in ASCII)

Make sure your wchar_t type is actually multi-byte. If it is, then fwprintf()
must be doing the wrong thing. Try opening the file in binary. If that
fails, you'll just have to accept that the function doesn't do what you
want, and call putc to write out the Unicode byte by byte.

Thanks, I checked my code.
I ran gcc -E 1.c

and found this in the output:
typedef long int wchar_t;
so I think wchar_t is Unicode.


#define __STDC_ISO_10646__ 200104L
#include <wchar.h>
#include <stdio.h>
#include <stdlib.h>
#define _TEXT(x) L ## x

int main() {

    FILE *fp = NULL;
    wchar_t *filename = _TEXT("oov.txt");
    wchar_t *p = _TEXT("hello");

    wprintf(_TEXT("%S\n"), p);

    wprintf(_TEXT("%d \n"), wcslen(p));

    fp = fopen( "oov.txt", "wb");
    fwprintf(fp, _TEXT("%S"), p);
    fclose(fp);
    return 0;

}

The file length is still 5. :(
 
Ben Bacarisse

There are a few things wrong. Let's have a look...
#define __STDC_ISO_10646__ 200104L

This is set by the implementation. You don't get to say!
#include <wchar.h>
#include <stdio.h>
#include <stdlib.h>
#define _TEXT(x) L ## x
int main() {

FILE *fp = NULL;
wchar_t *filename = _TEXT("oov.txt");
wchar_t *p = _TEXT("hello你好");

wprintf(_TEXT("%S\n"), p);

You can't print "wide" to stdout by default. Also, %S is
non-standard. Is it a typo?

You need to call setlocale first or none of the conversions will work.
After that, you need to decide if you want byte or wide output.
Byte output is easier, but if you must use wide output, then you must
set that first with a call to fwide.
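
To see why the conversions fail without setlocale, here is a minimal sketch. The helper name convertible_in_current_locale is made up for illustration. It asks the C library whether a wide string can be converted to bytes in whatever locale is currently active.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if every character of ws can be
   converted to a multibyte string in the *current* locale, else 0.
   With a NULL destination, wcsrtombs() only measures the result; it
   returns (size_t)-1 when it meets a character the locale cannot
   encode. */
int convertible_in_current_locale(const wchar_t *ws)
{
    const wchar_t *src = ws;
    mbstate_t state;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    return wcsrtombs(NULL, &src, 0, &state) != (size_t)-1;
}
```

Every program starts in the "C" locale, so until setlocale(LC_ALL, "") has been called this is expected to return 1 for L"hello" but 0 for a string containing the two Chinese characters. After the call, under a UTF-8 locale such as zh_CN.UTF-8, both should be convertible.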
wprintf(_TEXT("%d \n"), wcslen(p));

fp = fopen( "oov.txt", "w");
fwprintf(fp, _TEXT("%S"), p);
fclose(fp);
return 0;
}

Try this:

#include <wchar.h>
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    FILE *fp = fopen("oov.txt", "w");
    if (fp == NULL) {
        fprintf(stderr, "Open failed.\n");
        return EXIT_FAILURE;
    }
    if (fwide(stdout, 1) < 0 || fwide(fp, 1) < 0) {
        fprintf(stderr, "Failed to set wide output.\n");
        return EXIT_FAILURE;
    }

    const wchar_t *p = L"hello你好";
    wprintf(L"%ls\n", p);
    fwprintf(fp, L"%ls\n", p);
    fclose(fp);
    return 0;
}

You can avoid all the fwide stuff if fprintf is acceptable.
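
As a sketch of that byte-oriented route (the function name write_wide_line is invented for this example), the wide string can be passed to the ordinary fprintf with the %ls conversion. The stream then stays byte-oriented and no fwide call is needed.

```c
#include <stdio.h>
#include <wchar.h>

/* Byte-oriented output of a wide string: fprintf's %ls converts each
   wchar_t to the multibyte encoding of the current locale (UTF-8 under
   zh_CN.UTF-8), so the stream never has to be switched with fwide().
   Returns the number of bytes written, or a negative value on error
   (for example, a character the locale cannot encode). */
int write_wide_line(FILE *fp, const wchar_t *ws)
{
    return fprintf(fp, "%ls\n", ws);
}
```

The caller is still assumed to have run setlocale(LC_ALL, "") once at the start of main. Byte and wide output functions should not be mixed on the same stream.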
 
CBFalconer

Ben said:
There are a few things wrong. Let's have a look...


This is set by the implementation. You don't get to say!

Adequately covered by the reservation of such names to the
implementation.
 
Keith Thompson

CBFalconer said:
Ben Bacarisse wrote: [...]
This is set by the implementation. You don't get to say!

Adequately covered by the reservation of such names to the
implementation.

Not really. The standard *could* have defined a mechanism allowing
programs to define a value for __STDC_ISO_10646__; such a mechanism
would not have violated the reservation of names starting with "__" to
the implementation, any more than "#ifdef __STDC_ISO_10646__" would
violate that reservation.

The fact that the standard *didn't* define such a mechanism is specified
in C99 6.10.8p4:

None of these macro names, nor the identifier defined, shall be
the subject of a #define or a #undef preprocessing directive.

(Since this is a "shall" requirement outside a constraint, the
behavior is undefined.)
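
What a program may do is test the macro. A minimal sketch, with iso_10646_version as a name chosen only for this example:

```c
#include <wchar.h>

/* Returns the yyyymmL value of __STDC_ISO_10646__ if the implementation
   defines it (meaning wchar_t values are ISO 10646 code points), or 0
   if it does not. The program only tests the macro; it never defines
   or undefines it. */
long iso_10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

A nonzero result means the implementation itself promises that wchar_t holds Unicode code points. A zero result means it makes no such promise.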
 
flywav

Ben, I keyed in your code.

It runs well. I got the message hello world你好 (English hello and Chinese
hello).

But I ran hexdump on the file oov.txt:
hexdump oov.txt
0000000 6568 6c6c e46f a0bd a5e5 0abd
000000c

Is it right? I think that if it is a Unicode file, it should be 65 00 68
00, etc.? (Intel CPU)
 
Richard Tobin

flywav said:
0000000 6568 6c6c e46f a0bd a5e5 0abd
000000c
is it right? i think if it is unicode file,

Unicode itself is not a character encoding, it's a list of characters
with corresponding numbers (known as "code points"). There are
several different ways of encoding those numbers as a sequence of
bytes.
it should be 65 00 68 00 ,etc? (intel cpu)

You would get that if it were using the UTF-16 (little-endian) encoding
of Unicode. What you are actually getting is the UTF-8 encoding,
in which ASCII characters (i.e. those < 128) appear normally, and
other characters are encoded as a sequence of 2 or more bytes. You
have two sequences of 3 bytes corresponding to two Chinese characters.
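
The byte sequences in the dump can be reproduced by hand. The following is a sketch of the standard UTF-8 encoding rules; the function name utf8_encode is chosen for this example and is not a library call.

```c
#include <stddef.h>

/* Encodes one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point
   (surrogates D800..DFFF and values above U+10FFFF). */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

U+4F60 (你) encodes as E4 BD A0 and U+597D (好) as E5 A5 BD, which are the six bytes that follow "hello" in the dump. hexdump's default output groups bytes into 16-bit words in host byte order, so on a little-endian machine "6568" is the byte 68 followed by 65. A UTF-16 little-endian file would instead begin 68 00 65 00.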

-- Richard
 
flywav

Nice!
But I still have some questions. In the code

wchar_t *p = L"aaaa";

is p a Unicode string or an ANSI string?
I think it is Unicode, but in what encoding: UTF-16 or UTF-8? How
can I be sure?
(On Windows, with wchar_t *p = L"aaa", I think it is always a Unicode
string in UTF-16 encoding. Is that right?)

You would get that if it were using the UTF-16 (little-endian)
encoding of Unicode.

How do I do this? (Use the iconv library?) I want to use Unicode strings
(UTF-16 encoding) throughout my code.

How do I write the string to a file using UTF-16 encoding? I also want
to read this string back from the saved file.
Thanks all!
 
Richard Tobin

flywav said:
but i still had some question
in the code
wchar_t *p = L"aaaa";

p is an unicode string or ansi string?

That's an internal matter for the system. It's probably UTF-16 or
UTF-32. The question of little- or big-endian doesn't normally arise,
any more than it does for ints.
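
One way to find out what a given system does is to ask the compiler. This is only a sketch, and both function names are invented for the example:

```c
#include <limits.h>
#include <wchar.h>

/* Width of wchar_t in bits on this implementation. */
int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Returns 1 if a single wchar_t can hold any code point up to
   U+10FFFF (so no surrogate pairs are needed), else 0. */
int wchar_holds_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFFL;
}
```

With gcc on Linux these typically give 32 and 1, which corresponds to UTF-32 in memory. With Windows compilers, where wchar_t is 16 bits, they give 16 and 0, which corresponds to UTF-16.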
how to write the string to the file using UTF-16 encoding? I also want
read this string from saved file.

You may be able to control this by setting the locale appropriately.

-- Richard
 
flywav

How to write the string to the file using UTF-16 encoding? I also want
to read this string from the saved file.

You may be able to control this by setting the locale appropriately.

I ran this code on Windows, and there oov.txt is a Unicode file in UTF-16
encoding.
On Linux, how do I set the locale appropriately? Can you tell me how
to do it, or
where I can find information about this?
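
One option that does not depend on the locale at all is to do the UTF-16 encoding by hand. The sketch below is not from the thread and all function names in it are invented. It assumes each wchar_t holds one Unicode code point (32-bit wchar_t, as with gcc on Linux) and that the file is opened in binary mode.

```c
#include <stdio.h>
#include <wchar.h>

/* Writes one 16-bit unit, low byte first (little-endian). */
static int put_u16le(FILE *fp, unsigned long u)
{
    if (putc((int)(u & 0xFF), fp) == EOF) return -1;
    if (putc((int)((u >> 8) & 0xFF), fp) == EOF) return -1;
    return 0;
}

/* Reads one 16-bit little-endian unit.
   Returns -1 at end of file, -2 if only half a unit was left. */
static long get_u16le(FILE *fp)
{
    int lo = getc(fp);
    int hi;
    if (lo == EOF) return -1;
    hi = getc(fp);
    if (hi == EOF) return -2;
    return (long)lo | ((long)hi << 8);
}

/* Writes ws as UTF-16LE without a byte order mark.
   Returns 0 on success, -1 on I/O error or invalid code point. */
int write_utf16le(FILE *fp, const wchar_t *ws)
{
    for (; *ws != L'\0'; ws++) {
        unsigned long cp = (unsigned long)*ws;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;
        if (cp < 0x10000) {
            if (put_u16le(fp, cp) != 0) return -1;
        } else {                       /* split into a surrogate pair */
            cp -= 0x10000;
            if (put_u16le(fp, 0xD800 | (cp >> 10)) != 0) return -1;
            if (put_u16le(fp, 0xDC00 | (cp & 0x3FF)) != 0) return -1;
        }
    }
    return 0;
}

/* Reads UTF-16LE into buf, which has room for cap wchar_t values
   including the terminator. Stops at end of file or when buf is full.
   Returns the number of characters stored, or -1 on malformed input. */
long read_utf16le(FILE *fp, wchar_t *buf, size_t cap)
{
    size_t n = 0;
    long u;

    if (cap == 0) return -1;
    while (n + 1 < cap && (u = get_u16le(fp)) != -1) {
        unsigned long cp;
        if (u < 0) return -1;          /* odd number of bytes */
        cp = (unsigned long)u;
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            long low = get_u16le(fp);  /* must be a low surrogate */
            if (low < 0xDC00 || low > 0xDFFF) return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10)
                         + ((unsigned long)low - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1;                 /* low surrogate without a high one */
        }
        buf[n++] = (wchar_t)cp;
    }
    buf[n] = L'\0';
    return (long)n;
}
```

A file written this way would be opened with fopen("oov.txt", "wb") and read back with "rb". Some Windows tools expect the file to start with a byte order mark (the bytes FF FE), which this sketch leaves out. The iconv library mentioned earlier is another way to perform the same conversion.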
 
