Eric Sosman
Michael B Allen wrote on 07/23/07 14:10:
[...]
Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.
#include <stdio.h>
#include <ctype.h>

#define CH 0xdf

int
main(void)
{
    printf("%c %d %x\n", CH, CH, CH);
    printf("isalnum=%d\n", isalnum(CH));
    printf("isalpha=%d\n", isalpha(CH));
    printf("iscntrl=%d\n", iscntrl(CH));
    printf("isdigit=%d\n", isdigit(CH));
    printf("isgraph=%d\n", isgraph(CH));
    printf("islower=%d\n", islower(CH));
    printf("isupper=%d\n", isupper(CH));
    printf("isprint=%d\n", isprint(CH));
    printf("ispunct=%d\n", ispunct(CH));
    printf("isspace=%d\n", isspace(CH));
    return 0;
}
$ LANG=en_US.ISO-8859-1 ./t
ß 223 df
isalnum=0
isalpha=0
iscntrl=0
isdigit=0
isgraph=0
islower=0
isupper=0
isprint=0
ispunct=0
isspace=0
You've heard of the <locale.h> mechanisms (you mentioned
them), but it doesn't seem that you know how or when to use
them. It's quite simple, really:
- Add #include <locale.h> near the top of the file
- Insert a setlocale() call somewhere before those
isxxx() queries. The names of locales are system-
dependent; on the machine in front of me right now
one appropriate call looks like
setlocale (LC_CTYPE, "iso_8859_1");
The, well, "characteristics" of character codes are a
function of the current locale, and change as the locale
changes. In the "C" locale, there are (for instance) only
52 alphabetic characters; the other 204 are non-alphabetic
regardless of what glyphs they may produce on an output
device. Change locale -- which is to say, change to a
different set of customs about the meanings of characters --
and you get (potentially) a different answer.
Again, even if these functions did work, they *still* wouldn't handle
non-Latin-1 encodings (e.g. UTF-8).
C's one-char-is-one-character style does not fit well
with multi-byte encodings, and especially not with variable-
length encodings. Granted; no argument; you're in the right.
But making char unsigned will not cure this illness, nor
even cure the least troubling of its symptoms; it's simply
beside the point.
I think that you should consider the possibility that programming
requirements are changing and that discussing the history of C will have
no impact on that. Anyone who could move to Java or .NET already has. The
rest of us are doing systems programming that needs to be C (like me).
If it "needs to be C," then it "needs to be C," and
wishing that C were something radically different from what
it is isn't going to help you. Either learn to solve your
problems in C, or find another language -- which may mean
finding another "it."
If standards mandate UTF-8, your techniques will have to change or you're
going to be doing a lot of painful character-encoding conversions at
interface boundaries.
... except that's exactly the place I'd *want* to do
them! If I want to add support for Korean or even Klingon
I'd much rather concentrate on half a dozen translation
modules than go grubbing through the entire system adding
KlingonKapability to every function that touches a character.
And, once again: The signedness or unsignedness of char
has nothing to do with solving this much larger problem.
A character consisting of a variable number of unsigned
bytes is no easier to deal with than one whose bytes are
signed. You still can't get the forty-second character
from a byte array with `string[41]'.