Non latin characters in string literals

Ioannis Vranos

I am asking so as to be sure:


AFAIK non-latin, other language characters, produce undefined behaviour,
when used with standard library facilities expecting char strings like
printf(), and when used in string literals.


Is this correct?


The C99 standard mentions:


"5.2.1 Character sets

1 Two sets of characters and their associated collating sequences shall be
defined: the set in which source files are written (the source character
set), and the set interpreted in the execution environment (the execution
character set). Each set is further divided into a basic character set,
whose contents are given by this subclause, and a set of zero or more
locale-specific members (which are not members of the basic character set)
called extended characters. The combined set is also called the extended
character set. The values of the members of the execution character set are
implementation-defined.

2 In a character constant or string literal, members of the execution
character set shall be represented by corresponding members of the source
character set or by escape sequences consisting of the backslash \ followed
by one or more characters. A byte with all bits set to 0, called the null
character, shall exist in the basic execution character set; it is used to
terminate a character string.

3 Both the basic source and basic execution character sets shall have the
following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab,
vertical tab, and form feed. The representation of each member of the
source and execution basic character sets shall fit in a byte. In both the
source and execution basic character sets, the value of each character
after 0 in the above list of decimal digits shall be one greater than the
value of the previous. In source files, there shall be some way of
indicating the end of each line of text; this International Standard treats
such an end-of-line indicator as if it were a single new-line character. In
the basic execution character set, there shall be control characters
representing alert, backspace, carriage return, and new line. If any other
characters are encountered in a source file (except in an identifier, a
character constant, a string literal, a header name, a comment, or a
preprocessing token that is never converted to a token), the behavior is
undefined.

4 A letter is an uppercase letter or a lowercase letter as defined above;
in this International Standard the term does not include other characters
that are letters in other alphabets".




Thanks a lot,

--
Ioannis Vranos

C95 / C++03 Software Developer

http://www.cpp-software.net
 
Ioannis Vranos

Ioannis said:
I am asking so as to be sure:


AFAIK non-latin, other language characters, produce undefined behaviour,
when used with standard library facilities expecting char strings like
printf(), and when used in string literals.


I mean, for other language characters, wchar_t character type, wchar_t
strings/string literals, and wchar_t pointers and wchar_t facilities like
wprintf(), wscanf(), etc, should be used instead, along with the required
locales.





Thanks a lot,

 
Nick

Ioannis Vranos said:
I mean, for other language characters, wchar_t character type, wchar_t
strings/string literals, and wchar_t pointers and wchar_t facilities like
wprintf(), wscanf(), etc, should be used instead, along with the required
locales.

I don't think so. I can't see any problems with using standard C
strings for UTF-8 characters (indeed, I've been doing it for a while with
no problems).

A few things to note:

- UTF-8 never produces a zero byte except when encoding U+0000 itself, so
C strings will not terminate prematurely.

- it's risky, to say the least, to stick accented characters etc in
string literals; you need to use the appropriate hex escapes to be safe
(which also avoids your editor doing horrible things if you aren't
careful).

- strlen will come back with the number of bytes in the string, not the
number of characters. However as often as not you are using strlen to
work out how much storage space you need anyway.

- you need to be careful to meticulously cast to unsigned char. In
particular before passing to a ctype macro, but also before any time you
assign to an integer (unless you want enormous negative numbers flying
around).
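As a rough illustration of the strlen and unsigned char points in the list above, here is a minimal sketch. The function names utf8_count and count_alpha are made up for this example, and the first function assumes its input is already valid UTF-8 (it does no validation).

```c
#include <ctype.h>
#include <stddef.h>

/* Count UTF-8 characters (code points) by skipping continuation
   bytes, which always have the bit pattern 10xxxxxx.
   Assumption: s holds valid UTF-8; nothing is validated here. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Count the bytes that isalpha() accepts. The cast to unsigned char
   keeps bytes >= 0x80 from reaching isalpha() as negative values on
   implementations where plain char is signed. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

strlen() on the same string still reports bytes, which is what you want when sizing a buffer; utf8_count() is only for the cases where the number of characters matters.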

But with all that in mind, it works fine. I only do it because I
already had a mountain of code that I wanted to make work with accented
characters - UTF-8 proved a remarkably pain-free way to do it, certainly
easier than learning all the w* features (which I've never used) and
editing all the code.

Nick, cheerfully expecting to find out he's wrong and what he thought
was a bad cold was something a lot worse.
 
Eric Sosman

I am asking so as to be sure:


AFAIK non-latin, other language characters, produce undefined behaviour,
when used with standard library facilities expecting char strings like
printf(), and when used in string literals.


Is this correct?

Not undefined behavior, but implementation-defined behavior.
If the implementation supports "extended" characters beyond those
specifically required by the Standard, you can use them. If it
doesn't, you can't. Also, they're implementation-defined rather
than unspecified, because of

"5.2.1 Character sets
[...] The values of the members of the execution character set are
implementation-defined."

Since the implementation must define (i.e., document) the values
of all the supported characters, it must document the characters
themselves, en passant as it were, and thus define them.

In a follow-up post you mention using wchar_t in connection
with exotic glyphs. Those aren't "characters" in the sense of
5.2.1, but "wide characters." If the implementation provides
wide characters outside the repertoire of ordinary characters,
they, too, are implementation-defined (6.4.4.4p11), and using
them has implementation-defined behavior.

Summary: Not "portable," but not "undefined."
 
Nick

Joe Wright said:
I love the idea of UTF-8 but I don't know how to use it. Code points
0..127 are single byte ASCII characters and offer no problem. But what
do we do with the multi-byte characters?

We simply stick them into our C strings, byte by byte. There are no
zeros in UTF-8, so they are still valid strings (as CHAR_BIT is always
at least 8, a char can always hold at least 1 byte).

For example, try this.

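#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* "\xc2\xa3" and "25.20" are separate literals so the hex escape
       does not swallow the digits that follow it. */
    char s[] = "\xc2\xa1ol\xc3\xa9! This string costs \xc2\xa3""25.20";
    /* %u matches the unsigned int produced by the cast. */
    printf("The string is \"%s\" - 30 symbols, but is %u bytes long\n",
           s, (unsigned int)strlen(s));
    return EXIT_SUCCESS;
}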

In a bash shell on my Ubuntu box that prints out perfectly.

This is my favourite reference page for the characters:
http://www.utf8-chartable.de/unicode-utf8-table.pl

BTW, if anyone can tell me how to avoid having the string concatenation in
the assignment line (in other words, how to end a hex string where the
next character is a digit or A-F) I'd be grateful.
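On the question of ending a hex escape before a following hex digit, a hedged sketch of two spellings that should give the same bytes. A hex escape has no length limit, so splitting the literal is the usual way to stop it; an octal escape stops after at most three digits, so it needs no split. The array names and the helper spellings_match() are invented for this example.

```c
#include <string.h>

/* UTF-8 bytes C2 A3 (pound sign) followed by "25.20", spelled twice. */

/* Adjacent literals are joined only after the escapes in each one
   have been processed, so \xa3 cannot absorb the '2'. */
static const char price_hex[] = "\xc2\xa3" "25.20";

/* An octal escape takes at most three octal digits, so \243 ends
   before the following '2' (0xC2 is octal 302, 0xA3 is octal 243). */
static const char price_oct[] = "\302\24325.20";

/* Returns nonzero when both spellings produce identical arrays. */
int spellings_match(void)
{
    return sizeof price_hex == sizeof price_oct
        && memcmp(price_hex, price_oct, sizeof price_hex) == 0;
}
```

The octal form is shorter but harder to check against a UTF-8 table that lists hex values, which may be a reason to stay with the split literal.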
 
Ioannis Vranos

Nick said:
Joe Wright said:
I love the idea of UTF-8 but I don't know how to use it. Code points
0..127 are single byte ASCII characters and offer no problem. But what
do we do with the multi-byte characters?

We simply stick them into our C strings, byte by byte. There are no
zeros in UTF-8, so they are still valid strings (as CHAR_BIT is always
at least 8, a char can always hold at least 1 byte).

[snip]



Isn't the use of multibyte characters messier and more restricted than the
use of wchar_t characters?




 
Eric Sosman

Isn't the use of multibyte characters messier and more restricted than the
use of wchar_t characters?

"Messier," yes, in the sense that it's more complicated
to navigate in a string of multibyte characters than in an
ordinary string where all characters are the same size. If
`p' points to a char in an ordinary string, `p+1' points to
the next char. But in a multibyte string, the char at `p+1'
might be a continuation of the "character" starting at `p'
rather than an independent "character" on its own. Going
backwards is, if possible at all, even worse.
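A small sketch of the navigation problem described above, under the assumption that the multibyte encoding is UTF-8 specifically. UTF-8 marks every continuation byte with the bit pattern 10xxxxxx, which is what makes stepping backwards feasible here; other multibyte encodings may offer no such marker. The helper names are invented for this example.

```c
#include <stddef.h>

/* True for a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}

/* Return the start of the character after the one at p.
   p must not point at the terminating null. */
const char *utf8_next(const char *p)
{
    do {
        p++;
    } while (is_continuation(*p));
    return p;
}

/* Return the start of the character before p, never moving
   in front of start. */
const char *utf8_prev(const char *start, const char *p)
{
    while (p > start) {
        p--;
        if (!is_continuation(*p))
            break;
    }
    return p;
}
```

Both helpers trust the input to be well-formed UTF-8; a stray continuation byte is simply skipped rather than reported.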

"More restricted" -- I honestly don't know. It's my
impression that there's supposed to be a wchar_t value for
every (valid) multibyte sequence, and at least one multibyte
sequence for every wchar_t value, but I can't find a guarantee
to that effect. (Since the mapping is locale-dependent, and
since locales are implementation-defined, such a guarantee
might be impossible -- consider converting one way, changing
locales, and trying to convert back again).

If you're handling the multibyte strings as "pass-through"
data, you can probably leave them in multibyte form and not
worry about their internal structure. If you're going to
analyze and/or manipulate the strings' characters, it may be
best to convert from multibyte to wchar_t, do the work, and
(if needed) convert back again. That is, treat the multibyte
string as the "external encoding" of a wchar_t string that
you work with internally.
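The "external encoding" approach might look roughly like the sketch below, which uses the standard restartable conversion functions mbsrtowcs() and wcsrtombs(). The wrapper names mb_to_wide and wide_to_mb are hypothetical, and the conversions follow whatever locale is current when they are called.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multibyte string (current locale's encoding) to a newly
   allocated wide string. Returns NULL on an invalid sequence or on
   allocation failure; the caller frees the result. */
wchar_t *mb_to_wide(const char *mb)
{
    mbstate_t st;
    const char *src = mb;
    size_t n;
    wchar_t *w;

    memset(&st, 0, sizeof st);           /* initial conversion state */
    n = mbsrtowcs(NULL, &src, 0, &st);   /* count only, nothing stored */
    if (n == (size_t)-1)
        return NULL;
    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;
    src = mb;
    memset(&st, 0, sizeof st);
    mbsrtowcs(w, &src, n + 1, &st);      /* stores the terminator too */
    return w;
}

/* Convert a wide string back to a newly allocated multibyte string,
   with the same conventions as mb_to_wide(). */
char *wide_to_mb(const wchar_t *w)
{
    mbstate_t st;
    const wchar_t *src = w;
    size_t n;
    char *mb;

    memset(&st, 0, sizeof st);
    n = wcsrtombs(NULL, &src, 0, &st);   /* byte count, nothing stored */
    if (n == (size_t)-1)
        return NULL;
    mb = malloc(n + 1);
    if (mb == NULL)
        return NULL;
    src = w;
    memset(&st, 0, sizeof st);
    wcsrtombs(mb, &src, n + 1, &st);
    return mb;
}
```

A program would normally call setlocale(LC_ALL, "") once at startup so that the environment's encoding applies; without that call the "C" locale is in effect and only the basic characters are certain to convert.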
 
Nobody

I am asking so as to be sure:


AFAIK non-latin, other language characters, produce undefined behaviour,
when used with standard library facilities expecting char strings like
printf(), and when used in string literals.


Is this correct?

Using non-ASCII characters in string literals is problematic.

Passing non-ASCII strings to library functions isn't a problem, although
some of them will expect the strings to be valid according to the encoding
of the current locale.

If the library functions only accepted ASCII strings, there wouldn't be
much point in having locales.
 
Nobody

Isn't the use of multibyte characters messier and more restricted than the
use of wchar_t characters?

It's more messy, but you typically have to convert from/to multibyte
representation for input and output (less so on Windows, where the OS APIs
use wchar_t, although you still need to convert for non-Microsoft file
formats and network protocols).

Generally, if you're just passing strings around without processing them,
it's easier to keep the "char" representation.

If you need to do non-trivial processing, wide characters are easier (e.g.
indexing a wchar_t array indexes characters rather than bytes).

OTOH, Windows uses a 16-bit wchar_t, so if you're using characters outside
of the basic multilingual plane (BMP), you end up with a multi-wchar_t
representation, which is the worst of both worlds.
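To make the "multi-wchar_t representation" concrete, here is a sketch of the UTF-16 surrogate arithmetic that a 16-bit wide character forces on characters outside the BMP. The function name utf16_encode is made up for this example, and fixed-width types from <stdint.h> stand in for a 16-bit wchar_t.

```c
#include <stdint.h>

/* Split a Unicode code point into UTF-16 code units.
   Writes 1 or 2 units to out and returns the count, or returns 0
   if cp is a surrogate value or lies above U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                        /* reserved for surrogates */
    if (cp > 0x10FFFF)
        return 0;                        /* outside Unicode */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;           /* BMP: one unit */
        return 1;
    }
    cp -= 0x10000;                       /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+1F000 this gives the pair D83C DC00, so indexing such an array no longer indexes characters, which is the drawback described above.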
 
Keith Thompson

Nobody said:
Using non-ASCII characters in string literals is problematic.

Really? It works fine on this IBM mainframe [*].

The standard mentions ASCII only in passing in a couple of footnotes.

A string literal may contain any member of the source character set
except the double-quote ", backslash \, or new-line character, plus
escape sequences (C99 6.4.5). That's defined in the syntax for
string-literal, so if anything else appears in a string literal, a
diagnostic is required.

The "source character set" consists of the "basic source character
set", defined in C99 5.2.1, plus zero or more additional
implementation-defined characters.

So if your compiler's documentation says that the Euro sign is part of
the source character set, using a Euro sign in a string literal is
perfectly valid *for that implementation*. If not, it requires a
diagnostic.

Passing non-ASCII strings to library functions isn't a problem, although
some of them will expect the strings to be valid according to the encoding
of the current locale.

It depends on the library function. The rules for a string-literal
and a string are very different. A string-literal appears in
source code; it must follow the syntax rules in 6.4.5. A string
is a run-time entity, consisting of "a contiguous sequence of
characters terminated by and including the first null character".
The values of these characters are just numbers; whether they're
part of any character set is typically irrelevant. The string
passed to fopen() is typically restricted in some way, but by the
OS or the file system (for example, the system might disallow '?',
a member of the basic character set, in file names).

If the library functions only accepted ASCII strings, there wouldn't be
much point in having locales.

[*] Ok, I don't really have an IBM mainframe.
 
Kaz Kylheku

OTOH, Windows uses a 16-bit wchar_t, so if you're using characters outside
of the basic multilingual plane (BMP), you end up with a multi-wchar_t

Characters outside of the basic multilingual plane contain only
ridiculous nonsense like music symbols, mahjong game tiles, and runes
for ancient languages that have no speakers. (And, ironically, no
serious mahjong or music software has any use for these characters).
Most users won't even have fonts supplying the glyphs for this rubbish.
 
Ioannis Vranos

Nobody said:
Using non-ASCII characters in string literals is problematic.

Passing non-ASCII strings to library functions isn't a problem, although
some of them will expect the strings to be valid according to the encoding
of the current locale.

If the library functions only accepted ASCII strings, there wouldn't be
much point in having locales.


ASCII is not required to be supported.

What is supported is the source character set and the execution character
set. The execution character set, in particular, isn't required to be ASCII.


A classic example is the EBCDIC character set.


As far as I know, for the "plain" (non wchar_t) library functions accepting
char strings, the portable inputs are only characters of the basic character
set.


For other characters, wchar_t strings, wchar_t pointers and wchar_t
literals, along with the wchar_t functions like wprintf() should be used,
along with the system supported locales.


In summary, only char strings using the basic execution character set (only
latin letters for letters) are portable.




 
Ioannis Vranos

Corrected:



Ioannis said:
ASCII is not required to be supported.

What is supported is the source character set and the execution character
set. The execution character set, in particular, isn't required to be
ASCII.


A classic example is the EBCDIC character set.


As far as I know, for the "plain" (non wchar_t) library functions
accepting char strings, the portable inputs are only characters of the
basic character set.


For other characters, wchar_t strings, wchar_t pointers and wchar_t
literals, along with the wchar_t functions like wprintf() should be used,
along with the system supported locales.


In summary, only char strings

==> and wchar_t strings,

using the basic execution character set (only Latin letters for letters)
are portable.





 
Flash Gordon

Kaz said:
Characters outside of the basic multilingual plane contain only
ridiculous nonsense like music symbols,

People write software for handling music... and books discussing it.
It's entirely possible that a text discussing music, which someone's
text processing software is processing, could include those music symbols.

mahjong game tiles,

Is it really the tiles which are there, rather than the Chinese
characters which are on the tiles?

Oh, and don't forget the UAE... I've been there and they still make a
lot of use of their own character set in daily life (street signs,
newspapers etc.). Is that in the basic multilingual plane? It could be, I
don't know.

and runes
for ancient languages that have no speakers.

Is the Chinese character set in the basic multilingual plane?

(And, ironically, no
serious mahjong or music software has any use for these characters).
Most users won't even have fonts supplying the glyphs for this rubbish.

There are plenty of fonts available as part of a Windows install (i.e.
on the Windows media or available from Windows Update).

OK, most people in the western world won't need them.
 
Keith Thompson

Flash Gordon said:
People write software for handling music... and books discussing
it. It's entirely possible that a text discussing music, which
someone's text processing software is processing, could include those
music symbols.


Is it really the tiles which are there, rather than the Chinese
characters which are on the tiles?

Yes, Unicode characters U+1F000 through U+1F02B are "MAHJONG TILE EAST
WIND" ... "MAHJONG TILE BACK".
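For anyone wanting to put such a code point into a char string as discussed earlier in the thread, a minimal sketch of a UTF-8 encoder follows. The function name utf8_encode is invented for this example; the bit layouts are the standard UTF-8 ones.

```c
/* Encode code point cp as UTF-8 into out, which must have room for
   4 bytes. Returns the number of bytes written, or 0 for surrogate
   values and anything above U+10FFFF. No terminator is written. */
int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                     /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                  /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                    /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

U+1F000 comes out as the four bytes F0 9F 80 80, and U+00A3 as C2 A3, the same pair used for the pound sign in the string literal posted earlier.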

[snip]
 
Nobody

People write software for handling music... and books discussing it.
It's entirely possible that a text discussing music, which someone's
text processing software is processing, could include those music symbols.

A more realistic use of "symbols" outside the BMP is the mathematical
alphanumerics, which provide Latin and Greek letters in various styles. I
can reasonably foresee mathematical texts using these as "characters"
(rather than as "graphics", which is how I would expect a musical text to
use musical notation).

Is it really the tiles which are there, rather than the Chinese
characters which are on the tiles?

No, it's the tiles. One of the three suits uses Han characters (two
characters per tile), as do the winds and two of the three dragons. The
other two suits use patterns of coins and sticks. The "mahjong tile"
Unicode characters have one character for each tile.

Oh, and don't forget the UAE... I've been there and they still make a
lot of use of their own character set in daily life (street signs,
newspapers etc.). Is that in the basic multilingual plane? It could be, I
don't know.

Arabic is in the BMP.

Is the Chinese character set in the basic multilingual plane?

The BMP includes most of the Chinese, Japanese and Korean characters
which are in modern use. The Japanese kanji are considered equivalent to
traditional Chinese hanzi, while simplified Chinese hanzi have distinct
codepoints.

However, there are around 40,000 "supplementary" Han characters which are
relegated to plane 2.

These aren't widely used; the characters in the BMP are sufficient for
normal use, and more advanced text handling is normally done using legacy
encodings rather than Unicode.
 
Ben Bacarisse

Nobody said:
A more realistic use of "symbols" outside the BMP is the mathematical
alphanumerics, which provide Latin and Greek letters in various
styles.

[Topicality drift gone mad, but I checked this out earlier and I
can't bear the time going to waste!]

Most of the common mathematical alphabetics in special fonts are also
provided in the BMP. Things like double strike N and Z and some
script letters. Not all, of course, just some.

<snip>
 
