How to initialize and print a Unicode character?

wizardyhnr

I want to try ANSI C99's Unicode functions, so I wrote a test program.
The program is simple, but I cannot compile it with Dev-C++ 4.9.9.2
under Windows XP SP2, because the compiler always reports that the
initialization of the wchar_t string is illegal. Here is my program:
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>
#include <string.h>

int main(int argc, char *argv[])
{
    wchar_t *cur_buff=L"X";
    wprintf(cur_buff);
    return 0;
}

In the program, wchar_t *cur_buff is initialized with L"X". If X is
an ASCII character, everything works. But if X is a non-ASCII
character such as a Chinese character, the compiler complains that
this is an illegal byte sequence. The source file is saved as ASCII
text, and the character set is GB2312. I wonder why this happens?
 
Andrew Poelstra

I want to try ANSI C99's Unicode functions, so I wrote a test program.
The program is simple, but I cannot compile it with Dev-C++ 4.9.9.2
under Windows XP SP2, because the compiler always reports that the
initialization of the wchar_t string is illegal. Here is my program:

Without looking at your actual problem, here are a few tips:
1) It's fairly unlikely that you actually have a C99 compiler.
2) It's very unlikely that something with the word "C++" in it
is even a C compiler, let alone a C99 compiler.

Other than that, we don't care what OS or platform you have. We discuss
standard C here, and that's platform-independent.
 
Alf P. Steinbach

* wizardyhnr:
I want to try ANSI C99's Unicode functions, so I wrote a test program.
The program is simple, but I cannot compile it with Dev-C++ 4.9.9.2
under Windows XP SP2, because the compiler always reports that the
initialization of the wchar_t string is illegal. Here is my program:
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>
#include <string.h>

int main(int argc, char *argv[])
{
    wchar_t *cur_buff=L"X";
    wprintf(cur_buff);
    return 0;
}

In the program, wchar_t *cur_buff is initialized with L"X". If X is
an ASCII character, everything works. But if X is a non-ASCII
character such as a Chinese character, the compiler complains that
this is an illegal byte sequence. The source file is saved as ASCII
text, and the character set is GB2312. I wonder why this happens?

Don't know about C, but in C++ you'd have to put a 'const' in there,

wchar_t const* curr_buff = L"X";
 
Ben Pfaff

Alf P. Steinbach said:
* wizardyhnr:

Don't know about C, but in C++ you'd have to put a 'const' in there,
wchar_t const* curr_buff = L"X";

Not in C.
 
Alf P. Steinbach

* Ben Pfaff:
Not in C.

On second thought, perhaps not in C++ either (sorry for being a bit
fast). Haven't checked, and since this is a C newsgroup, won't do. The
C++ non-const possibility for char* is just for C compatibility.
 
Richard Heathfield

wizardyhnr said:
i want to try ANSI C99's unicode [functions].

Unicode is not mentioned even once in my copy of the C99 Standard. On the
other hand, wide characters /are/ so mentioned, so let's assume you meant
that.

In the program, wchar_t *cur_buff is initialized with L"X". If X is
an ASCII character, everything works. But if X is a non-ASCII
character such as a Chinese character, the compiler complains that
this is an illegal byte sequence.

As long as the compiler (or, almost certainly, the preprocessor in this
case) supports the basic source character set, it remains within its rights
to reject any other characters it encounters within the source code.

You can, however, read information into a wchar_t from a file at run-time. I
suggest you explore that option.
 
Guest

Richard said:
wizardyhnr said:
i want to try ANSI C99's unicode [functions].

Unicode is not mentioned even once in my copy of the C99 Standard. On the
other hand, wide characters /are/ so mentioned, so let's assume you meant
that.

Unicode is explicitly mentioned in TC2 in the description for
__STDC_ISO_10646__, and while the wording before TC2 does not mention
"Unicode", the differences between Unicode and ISO 10646 are not
relevant here.

#ifndef __STDC_ISO_10646__
#error
#endif
/* Now, the assumption that C's wide character functions are Unicode
functions is valid */

Also, the \U and \u escape sequences work with Unicode / ISO 10646
character values.
 
Richard Bos

Andrew Poelstra said:
Without looking at your actual problem, here are a few tips:

0) It's ISO C99, and has been from the start.
1) It's fairly unlikely that you actually have a C99 compiler.

It's actually 100% sure he hasn't.

2) It's very unlikely that something with the word "C++" in it
is even a C compiler, let alone a C99 compiler.

It's actually 100% sure it is an IDE with a compiler suite which provides
C++, C89, and a Win32 library (MinGW, to be precise), but not C99.

Richard
 
santosh

wizardyhnr said:
I want to try ANSI C99's Unicode functions, so I wrote a test program.
The program is simple, but I cannot compile it with Dev-C++ 4.9.9.2
under Windows XP SP2, because the compiler always reports that the
initialization of the wchar_t string is illegal. Here is my program:
[snip code]

I'm only guessing but I think the source character set of your compiler
doesn't support characters other than the basic C character set.

Maybe the following URLs could shed some light in this obscure area?
<http://evanjones.ca/unicode-in-c.html>
<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
<http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF>

Also the authoritative resource:
<http://www.unicode.org/>
 
wizardyhnr

I think I made a mistake about whether the compiler fully supports
standard C99, but I think some people put too much emphasis on it.
I recompiled the program with MinGW 5.0.3, whose gcc version is 3.4.5,
and the problem persists. Though it may not support all features
of C99, I do not think the compiler would lack support for wide
character functions.
 
Keith Thompson

Alf P. Steinbach said:
Don't know about C, but in C++ you'd have to put a 'const' in there,

wchar_t const* curr_buff = L"X";

No, you don't have to (though at least one C compiler, namely gcc, can
be invoked with an option that would make it necessary), but it's a
good idea anyway.

String literals are not const, but attempting to modify a string
literal invokes undefined behavior. Using const could help you catch
an error that the compiler otherwise wouldn't warn you about.
 
Walter Roberson

Richard Heathfield said:
As long as the compiler (or, almost certainly, the preprocessor in this
case) supports the basic source character set, it remains within its rights
to reject any other characters it encounters within the source code.

That is arguably incorrect, Richard.

C89 2.2.1 Character Sets
[...]
In a character constant or string literal, members of the execution
character set shall be represented by corresponding members of
the source character set or by escape sequences consisting of
the backslash \ followed by one or more characters. A byte with all
bits set to 0, called the null character, shall exist in the basic
execution set; it is used to terminate a character string literal.
[...]
In the execution character set, there shall be control characters
representing alert, backspace, carriage return, and new line. If any
other characters are encountered in a source file (except in
a character constant, a string literal, a header name, a comment,
or a preprocessing token that is never converted to a token), the
behaviour is undefined.

3.1.3.4 Character Constants
[...]
An integer character constant is a sequence of one or more multibyte
characters enclosed in single-quotes, as in 'x' or 'ab'. A wide
character constant is the same, except prefixed by the letter L.
With a few exceptions detailed later, the elements of the sequence
are any members of the source character set; they are mapped in an
implementation-defined manner to members of the execution character set.


Thus, string constants (and string literals) are allowed to contain
multi-byte characters; the value of those is implementation-defined,
and it is true that the implementation might choose to define the
values as being illegal. You are technically correct about that aspect,
though -in a way- misleading, in that the standard explicitly allows
for multi-byte character support, so it is, at least psychologically,
not the same kind of "within its rights" as would be, say, whether
dollar-sign is permitted in identifier names (which would clearly
be an extension.)

I would, though, argue that your statement is not exactly correct, in that
the C89 standard defines the source character set, and defines the
execution character set, and defines the allowed characters in
literals to include representations of the execution character set,
*and the basic execution character set is defined to include some characters
that do not appear in the basic source character set*. It is thus not
permitted for the compiler to define the representation of those
additional characters (null, alert, backspace, carriage return, and
new line) as being illegal.

There is the semantic question of whether (e.g.) \a appearing in
a literal is a single character or a pair of characters for the purpose
of "If any other characters are encountered in the source file", but
notice that 3.1.3.4 specifically notes that there are exceptions to
"the elements of the sequence are any members of the source character set".

I'm not entirely clear, reading the whole of 3.1.3.4, as to which
portions are considered by the standard to be the "exceptions" and
which not, but for the purposes of this present nit, it is enough to
point out that the standard -says- there are exceptions, and
thus that within literals, there are permitted values defined as valid
and yet which are not members of the source character set.
 
SM Ryan

# In the program, wchar_t *cur_buff is initialized with L"X". If X is
# an ASCII character, everything works. But if X is a non-ASCII
# character such as a Chinese character, the compiler complains that
# this is an illegal byte sequence. The source file is saved as ASCII
# text, and the character set is GB2312. I wonder why this happens?

Beyond ASCII, there are many different ways to encode Unicode. Unless your
compiler and editor are using the same encoding, the compiler is going
to see garbage. Some encodings exclude certain byte values, and that could
well be the illegal byte sequence.
 
wizardyhnr

SM said:
# In the program, wchar_t *cur_buff is initialized with L"X". If X is
# an ASCII character, everything works. But if X is a non-ASCII
# character such as a Chinese character, the compiler complains that
# this is an illegal byte sequence. The source file is saved as ASCII
# text, and the character set is GB2312. I wonder why this happens?

Beyond ASCII, there are many different ways to encode Unicode. Unless your
compiler and editor are using the same encoding, the compiler is going
to see garbage. Some encodings exclude certain byte values, and that could
well be the illegal byte sequence.

I think maybe this is the reason.
 
