I want unsigned char * string literals

Michael B Allen

Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *. The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset, character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *. With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)s)

Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
	while (n-- && *src) {
		*dst++ = *src++;
	...

And abolish the use of traditional string functions (at least for "text").

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do so if someone has a better
idea I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable (but professional) let's
hear it.

[1] I use the term "text" to mean stuff that may actually be displayed
to a user (possibly in a foreign country). I use the term "string"
to represent traditional 8 bit zero terminated char * arrays.
 
Eric Sosman

Michael said:
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *. The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset, character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *.

Well, no. String literals (in typical contexts) generate
anonymous arrays of char -- just plain char, not signed char
or unsigned char. Plain char is signed on some systems and
unsigned on others, but it is a type of its own nevertheless.

(People seem to have a hard time with the notion that char
behaves like one of signed char or unsigned char, but is a
type distinct from both. The same people seem to have no
trouble with the fact that int is a type distinct from both
short and long, even though on most systems it behaves exactly
like one or the other. Go figure.)
With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

"Don't Do That." The compiler is telling you that the
square peg is a poor fit for the round hole, no matter how
hard you push on it.
My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)s)

... invading the namespace reserved to the implementation,
thus making the code non-portable to any implementation that
decides to use _T as one of its own identifiers. If you really
want to pursue this folly, change the macro name. And put
parens around the use of the argument, too.
Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
while (n-- && *src) {
*dst++ = *src++;
...

And abolish the use of traditional string functions (at least for "text").

You'll also need to find substitutes for the *printf family,
for getenv, for the strto* family, for asctime and ctime, for
most of the locale mechanism, for ...
The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

I think you want some other programming language, possibly
Java. If you try to do this in C, you will waste an inordinate
amount of time and effort struggling against the language and
(especially) against the library.
 
Malcolm McLean

Michael B Allen said:
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *. The reasoning is that the
elements of these arrays are decidedly not signed. In fact, they may not
even represent complete characters. At this point I think of text as
simple binary blobs. What charset, character encoding and termination
they use should not be exposed in the interface used to operate on
them.
char * for a list of human-readable characters.
unsigned char * for a list of arbitrary bytes - almost always octets.
signed char * - very rare. Sometimes you might need a tiny integer. I will
resist mentioning my campaign for 64-bit ints.

unsigned char really ought to be "byte". Unfortunately a bad decision was
taken to treat characters and bytes the same way, and now we are stuck with
sizeof(char) == 1 byte.

If you start using unsigned char* for strings then, as you have found, you
will merrily break all the calls to string library functions. This can be
patched up by a cast, but the real answer is not to do that in the first
place.
Very rarely are you interested in the actual encoding of a character. A few
exceptions arise when you want to code lookup tables for speed, or write
low-level routines to convert from decimal to machine letter, or put text
into binary files in an agreed coding, but they are very few.
 
Michael B Allen

"Don't Do That." The compiler is telling you that the
square peg is a poor fit for the round hole, no matter how
hard you push on it.

Hi Eric,

Trying to put a square peg in a round hole does not fairly characterize
casting char * to unsigned char *.
... invading the namespace reserved to the implementation,
thus making the code non-portable to any implementation that
decides to use _T as one of its own identifiers. If you really
want to pursue this folly, change the macro name. And put
parens around the use of the argument, too.

I didn't invade the namespace, MS did. Which is to say that symbol is
unlikely to be used for anything other than what MS (and I) are using it
for.

But I don't see why I can't use a different symbol and retain
compatibility with the Windows platform. I will do that.
You'll also need to find substitutes for the *printf family,
for getenv, for the strto* family, for asctime and ctime, for
most of the locale mechanism, for ...

That's not a big deal. I suspect that in the end I would only end up
wrapping very few functions. I don't really use any of the above directly
as it is.

Note that if you need a truly internationalized solution (everyone should)
you can't use a lot of the traditional C string functions anyway. Strncpy
and ctype stuff is useless. Consider that web servers almost invariably
run in the C locale so anything that depends on the locale mechanism is
of limited use.
The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

I think you want some other programming language, possibly
Java. If you try to do this in C, you will waste an inordinate
amount of time and effort struggling against the language and
(especially) against the library.

I would love to use Java the language. Unfortunately its libraries,
host OS integration, multi-threading and networking capabilities and
just about everything else are not suitable for my purposes. C++ seems
over-designed to me, though I've never really tried to use it. The
C language itself is ideal for me. I don't think deficiencies in text
processing should deter me from using it.

So I take it you just use char * for text?

It doesn't bother you that char * isn't the appropriate type for what
is effectively a binary blob, especially when most of the str* functions
don't handle internationalized text anyway?

Mike
 
Keith Thompson

Michael B Allen said:
Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *. The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset, character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *. With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.
[...]

No, C string literals have type 'array[N] of char'; in most, but not
all, contexts, this is implicitly converted to 'char *'. (Consider
'sizeof "hello, world"'.)

My main point isn't that they're arrays rather than pointers, but that
they're arrays of (plain) char, not of signed char. Plain char is
equivalent to *either* signed char or unsigned char, but is still a
distinct type from either of them. It appears that plain char is
signed in your implementation.

I know this doesn't answer your actual question; hopefully someone
else can help with that.
 
pete

Michael said:
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *.
The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset,
character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *.

They are arrays of plain char,
which may be either a signed or unsigned type.
With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)s)

Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
while (n-- && *src) {
*dst++ = *src++;
...

And abolish the use of traditional string functions
(at least for "text").

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do
so if someone has a better idea
I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable
(but professional) let's hear it.

The solution is obvious: use arrays of char to contain strings.

Using arrays of unsigned char to hold strings
creates a problem for you, but solves nothing.

If I have a problem
that is caused by using arrays of char to hold strings,
I'm unaware of what the problem is.
 
Michael B Allen

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do
so if someone has a better idea
I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable
(but professional) let's hear it.

The solution is obvious: use arrays of char to contain strings.

Using arrays of unsigned char to hold strings
creates a problem for you, but solves nothing.

If I have a problem
that is caused by using arrays of char to hold strings,
I'm unaware of what the problem is.

Hi pete,

I accept that there's no technical problem with using char. But I just
can't get over the fact that char isn't the right type for text.

If you read data from a binary file, would you read it into a char buffer
or an unsigned char buffer?

Type char is not the correct type for text. It is merely adequate for
a traditional C 7 bit encoded "string". But char is not the right type
for binary blobs of "text" used in internationalized programs.

The only problem with using unsigned char is string literals and that
seems like a weak reason to make all downstream functions use char.

Also, technically speaking, if I used char, all internationalized string
functions would eventually have to cast char to unsigned char so that they
could decode, encode and interpret whole characters.

If compilers allowed the user to specify what the type for string literals
was, that would basically solve this "problem".

Mike
 
Eric Sosman

Michael said:
Hi Eric,

Trying to put a square peg in a round hole does not fairly characterize
casting char * to unsigned char *.

Sorry: my mistake. I ought to have said round peg and
square hole. My apologies.
I didn't invade the namespace, MS did. Which is to say that symbol is
unlikely to be use for anything other than what MS (and I) are using it
for.

But I don't see why I can't use a different symbol and retain
compatibility with the Windows platform. I will do that.

Sorry again; I have no idea what you're talking about.
Whatever it is doesn't seem to be C, in which identifiers
beginning with _ and a capital letter belong to the implementation
and not to the programmer.
That's not a big deal. I suspect that in the end I would only end up
wrapping very few functions. I don't really use any of the above directly
as it is.

Not even printf? Are you writing for freestanding environments
where most of the Standard library is absent?
Note that if you need a truly internationalized solution (everyone should)
you can't use a lot of the traditional C string functions anyway. Strncpy
and ctype stuff is useless.

I'll agree with you about strncpy.
Consider that web servers almost invariably
run in the C locale so anything that depends on the locale mechanism is
of limited use.

Well, that's really not a C problem, or at least not a "C-
only" problem. C's locale support is, admittedly, an afterthought
if not actually a wart, and doesn't generalize to multi-threaded
environments. But then, C itself has no notion of multiple threads,
so what can you expect?
The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?
I think you want some other programming language, possibly
Java. If you try to do this in C, you will waste an inordinate
amount of time and effort struggling against the language and
(especially) against the library.

I would love to use Java the language. Unfortunately its libraries,
host OS integration, multi-threading and networking capabilities and
just about everything else are not suitable for my purposes. C++ seems
over-designed to me, though I've never really tried to use it. The
C language itself is ideal for me. I don't think deficiencies in text
processing should deter me from using it.

Then go ahead; nobody's stopping you. But if you've made up
your mind to use C, then use C and not some Frankenstein's monster
made of parts from one language and parts from the other. If text
processing is important to you and C's text processing isn't rich
enough for your needs, then either seek another language or add
your own text-processing libraries to C. But don't try to retrofit
C's admittedly primitive text-processing to suit your more advanced
goals; all you're doing is putting lipstick on a pig.
So I take it you just use char * for text?

That I do.
It doesn't bother you that char * isn't the appropriate type for what
is effectively a binary blob especially when most of the str* functions
don't handle internationalized text anyway?

You haven't explained just why you find char* inadequate,
and the only virtue of unsigned char* you've mentioned is that it's
unsigned. I don't see how that helps with internationalization.

Are you looking for wchar_t, by any chance?
 
Keith Thompson

Michael B Allen said:
I accept that there's no technical problem with using char. But I just
can't get over the fact that char isn't the right type for text.

But that's exactly what it's *supposed* to be. If you're saying it
doesn't meet that requirement, I don't disagree. Personally, I think
it would make more sense in most environments for plain char to be
unsigned.
If you read data from a binary file, would you read it into a char buffer
or an unsigned char buffer?

Probably an unsigned char buffer, but a binary file could be anything.
If it contained 8-bit signed data, I'd use signed char.
Type char is not the correct type for text. It is merely adequate for
a traditional C 7 bit encoded "string". But char is not the right type
for binary blobs of "text" used in internationalized programs.

The only problem with using unsigned char is string literals and that
seems like a weak reason to make all downstream functions use char.

Also, technically speaking, if I used char all internationalized string
functions eventually have to cast char to unsigned char so that it could
decode and encode and interpret whole characters.

If compilers allowed the user to specify what the type for string literals
was, that would basically solve this "problem".

Not really; the standard functions that take strings would still
require pointers to plain char.

As I said, IMHO making plain char unsigned is the best solution in
most environments. I don't know why that hasn't caught on. Perhaps
there's too much badly written code that assumes plain char is signed.
 
pete

Michael said:
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *.
The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset,
character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *.
With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)s)

Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
while (n-- && *src) {
*dst++ = *src++;
...

And abolish the use of traditional string functions
(at least for "text").

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do
so if someone has a better
idea I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable
(but professional) let's hear it.

[1] I use the term "text" to mean stuff that may actually be displayed
to a user (possibly in a foreign country). I use the term "string"
to represent traditional 8 bit zero terminated char * arrays.

I think it might be simpler to retain the char interface,
and then cast inside your functions:

int
text_copy(const char *src, char *dst, int n)
{
	unsigned char *s1 = (unsigned char *)dst;
	const unsigned char *s2 = (const unsigned char *)src;

	while (n != 0 && *s2 != '\0') {
		*s1++ = *s2++;
		--n;
	}
	while (n-- != 0) {
		*s1++ = '\0';
	}
	return 0;
}
 
Eric Sosman

Keith said:
But that's exactly what it's *supposed* to be. If you're saying it
doesn't meet that requirement, I don't disagree. Personally, I think
it would make more sense i most environments for plain char to be
unsigned.


Probably an unsigned char buffer, but a binary file could be anything.
It if contained 8-bit signed data, I'd use signed char.


Not really; the standard functions that take strings would still
require pointers to plain char.

As I said, IMHO making plain char unsigned is the best solution in
most environments. I don't know why that hasn't caught on. Perhaps
there's too much badly written code that assumes plain char is signed.

The historical background for C's ambiguity is fairly
clear: The "load byte" instruction sign-extended on some
machines and zero-extended on others (and on some, simply
left the high-order bits of the destination register alone).
Had C mandated either sign-extension or zero-extension, it
would have added extra instructions to every single character
fetch on the un-favored architectures.

Nowadays it is a good trade to hide such minor matters
behind a veneer of "programmer friendliness," but the economics
(i.e., the relative cost of computer time and programmer time)
were different when C was devised. It would, I think, be an act
of supreme arrogance and stupidity to maintain that today's
economic balance is the end state, subject to no further change.
 
Keith Thompson

Eric Sosman said:
Keith Thompson wrote: [...]
As I said, IMHO making plain char unsigned is the best solution in
most environments. I don't know why that hasn't caught on. Perhaps
there's too much badly written code that assumes plain char is signed.

The historical background for C's ambiguity is fairly
clear: The "load byte" instruction sign-extended on some
machines and zero-extended on others (and on some, simply
left the high-order bits of the destination register alone).
Had C mandated either sign-extension or zero-extension, it
would have added extra instructions to every single character
fetch on the un-favored architectures.

Nowadays it is a good trade to hide such minor matters
behind a veneer of "programmer friendliness," but the economics
(i.e., the relative cost of computer time and programmer time)
were different when C was devised. It would, I think, be an act
of supreme arrogance and stupidity to maintain that today's
economic balance is the end state, subject to no further change.

I'm not (necessarily) suggesting that the standard should require
plain char to be unsigned. What I'm suggesting is that most current
implementations should probably choose to make plain char unsigned.
Many of them make it signed, perhaps for backward compatibility, but
IMHO it's a poor tradeoff.
 
Michael B Allen

Michael said:
Hello,

Early on I decided that all text (what most people call "strings" [1])
in my code would be unsigned char *.
The reasoning is that the elements
of these arrays are decidedly not signed. In fact, they may not even
represent complete characters. At this point I think of text as simple
binary blobs. What charset,
character encoding and termination they use
should not be exposed in the interface used to operate on them.

But now I have a dilemma. C string literals are signed char *.
With GCC
4 warning about every sign mismatch, my code is spewing warnings all
over the place and I'm trying to figure out what to do about it.

My current thought is to define a Windows style _T macro:

#define _T(s) ((unsigned char *)s)

Use "text" functions like:

int
text_copy(const unsigned char *src, unsigned char *dst, int n)
{
while (n-- && *src) {
*dst++ = *src++;
...

And abolish the use of traditional string functions
(at least for "text").

The code might then look like the following:

unsigned char buf[255];
text_copy(_T("hello, world"), buf, sizeof(buf));

What do you think?

If I do the above I have a lot of work to do
so if someone has a better
idea I'd really like to hear about it.

Mike

PS: If you have an opinion that is unfavorable
(but professional) let's hear it.

[1] I use the term "text" to mean stuff that may actually be displayed
to a user (possibly in a foreign country). I use the term "string"
to represent traditional 8 bit zero terminated char * arrays.

I think it might be simpler to retain the char interface,
and then cast inside your functions:

int
text_copy(const char *src, char *dst, int n)
{
unsigned char *s1 = ( unsigned char *)dst;
const unsigned char *s2 = (const unsigned char *)src;

Hi pete,

Ok, I'm giving in. I asked, I got an answer and you guys are right.

Even though char is wrong, it's just another little legacy wart with
no serious technical impact other than the fact that to inspect bytes
within the text one should cast to unsigned char first. So if casting
has to occur, doing it in the base functions is a lot more elegant than
casting every string literal throughout the entire codebase.

But in hope that someday compilers will provide an option for char to
be unsigned, I have started to replace all instances of the char type
with my own typedef so that when that day comes I can tweak one line of
code and have what I want.

Actually I see GCC has a -funsigned-char option that seems to be what
I want but it didn't seem to have any effect on the warnings.

Mike
 
Ian Collins

Michael said:
Actually I see GCC has a -funsigned-char option that seems to be what
I want but it didn't seem to have any effect on the warnings.
Could it be that it simply makes char unsigned?
 
Alan Curry

Actually I see GCC has a -funsigned-char option that seems to be what
I want but it didn't seem to have any effect on the warnings.

-funsigned-char affects the compiler's behavior, possibly causing your
program to behave differently, but it doesn't make your code correct. Correct
code works when compiled with either -fsigned-char or -funsigned-char.
The warning is designed to help you make your code correct, by alerting you
when you've done something which might not work the same if you changed from
-funsigned-char to -fsigned-char (or from gcc to some other compiler that
doesn't let you choose).

If you got different warnings depending on your -f[un]signed-char option,
you'd have to compile your code twice to see all the possible warnings. That
wouldn't be friendly.
 
Eric Sosman

Michael said:
[...]

Even though char is wrong, it's just another little legacy wart with
no serious technical impact other than the fact that to inspect bytes
within the text one should cast to unsigned char first. [...]

It is unnecessary to cast anything in order to "inspect"
a character in a string. *cptr == 'A' and *cptr == 'ß' work
just fine (on systems that have a ß character), and there's
no need to cast either *cptr or the constant.

Perhaps you're unhappy about the casting that *is* needed
for the <ctype.h> functions, and I share your unhappiness.
But that's not really a consequence of the sign ambiguity of
char; rather, it follows from the functions' having a domain
consisting of all char values *plus* EOF. Were it not for the
need to handle EOF -- a largely useless addition, IMHO -- there
would be no need to cast when using <ctype.h>.

However, that's far from the worst infelicity in the C
library. The original Standard tried (mostly) to codify
C-as-it-was, not to replace it with C-remade-in-trendy-mode.
The <ctype.h> functions -- and their treatment of EOF -- were
already well-established before the first Standard was written,
and the writers had little choice but to accept them.
 
Michael B Allen

Michael said:
[...]

Even though char is wrong, it's just another little legacy wart with
no serious technical impact other than the fact that to inspect bytes
within the text one should cast to unsigned char first. [...]

It is unnecessary to cast anything in order to "inspect"
a character in a string. *cptr == 'A' and *cptr == 'ß' work
just fine (on systems that have a ß character), and there's
no need to cast either *cptr or the constant.

Hi Eric,

The above code will not work with non-latin1 character encodings (most
importantly UTF-8). That will severely limit its portability from an i18n
perspective (e.g. no CJK). And even domestically you're going to run into
trouble soon. Standards related to Kerberos, LDAP, GSSAPI and many more
are basically saying they don't care about codepages anymore. Everything
is going to be UTF-8 (except on Windows which will of course continue
to use wchar_t).
Perhaps you're unhappy about the casting that *is* needed
for the <ctype.h> functions, and I share your unhappiness.
But that's not really a consequence of the sign ambiguity of
char; rather, it follows from the functions' having a domain
consisting of all char values *plus* EOF. Were it not for the
need to handle EOF -- a largely useless addition, IMHO -- there
would be no need to cast when using <ctype.h>.

Forget casting, the ctype functions don't even work at all if the high
bit is on. Ctype only works with ASCII.
However, that's far from the worst infelicity in the C
library. The original Standard tried (mostly) to codify
C-as-it-was, not to replace it with C-remade-in-trendy-mode.
The <ctype.h> functions -- and their treatment of EOF -- were
already well-established before the first Standard was written,
and the writers had little choice but to accept them.

Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.

ctype - useless for i18n
errno - a classic non-standard standard
locale - no context object so it can't be safely used in libraries
setjmp - not portable
signal - no comment necessary
stdio - no context object to keep state separate (e.g. can't mix wide
and non-wide I/O)
stdlib - malloc has no context object
string - useless for i18n

If we're ever going to create a new "standard" library for C the first
step is to admit that the one we have now is useless for anything but
hello world programs.

Mike
 
Eric Sosman

Michael B Allen wrote On 07/23/07 12:53,:
[...]
Perhaps you're unhappy about the casting that *is* needed
for the <ctype.h> functions, and I share your unhappiness.
But that's not really a consequence of the sign ambiguity of
char; rather, it follows from the functions' having a domain
consisting of all char values *plus* EOF. Were it not for the
need to handle EOF -- a largely useless addition, IMHO -- there
would be no need to cast when using <ctype.h>.


Forget casting, the ctype functions don't even work at all if the high
bit is on. Ctype only works with ASCII.

First, C does not assume ASCII character encodings,
and runs happily on systems that do not use ASCII. The
only constraints on the encoding are (1) that the available
characters include a specified set of "basic" characters,
(2) that the codes for the basic characters be non-negative,
and (3) that the codes for the characters '0' through '9'
be consecutive and ascending. Any encoding that meets
these requirements -- ASCII or not -- is acceptable for C.

Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.
Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.

If you think so, then why use C? You're planning on
throwing away the entire library and changing the handling
of text in fundamental ways (ways that go far beyond your
initial "I want unsigned text" plea). The result would be
a programming language in which existing C programs would
not run and perhaps would not compile; why are you so set
on calling this new and different language "C?" Call it
"D" or "Sanskrit" or "Baloney" if you like, but it ain't C.
 
Richard Heathfield

Michael B Allen said:

Forget casting, the ctype functions don't even work at all if the high
bit is on. Ctype only works with ASCII.

Strange, that - I've used it with EBCDIC, with the high bit set, and it
worked just fine. I wonder what I'm doing wrong.
If we're ever going to create a new "standard" library for C the first
step is to admit that the one we have now is useless for anything but
hello world programs.

The standard C library could be a lot, lot better, it's true, but it's
surprising just how much can be done with it if you try.
 
Michael B Allen

Michael B Allen wrote On 07/23/07 12:53,:
[...]
Perhaps you're unhappy about the casting that *is* needed
for the <ctype.h> functions, and I share your unhappiness.
But that's not really a consequence of the sign ambiguity of
char; rather, it follows from the functions' having a domain
consisting of all char values *plus* EOF. Were it not for the
need to handle EOF -- a largely useless addition, IMHO -- there
would be no need to cast when using <ctype.h>.


Forget casting, the ctype functions don't even work at all if the high
bit is on. Ctype only works with ASCII.

First, C does not assume ASCII character encodings,
and runs happily on systems that do not use ASCII. The
only constraints on the encoding are (1) that the available
characters include a specified set of "basic" characters,
(2) that the codes for the basic characters be non-negative,
and (3) that the codes for the characters '0' through '9'
be consecutive and ascending. Any encoding that meets
these requirements -- ASCII or not -- is acceptable for C.

True. I forgot about EBCDIC and such (thanks Richard).

But that is just a pedantic distraction from the real point, which is that
your code will not work with non-latin1 encodings and that is going to
seriously impact its portability.
Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.

#include <stdio.h>
#include <ctype.h>

#define CH 0xdf

int
main(void)
{
	printf("%c %d %x\n", CH, CH, CH);

	printf("isalnum=%d\n", isalnum(CH));
	printf("isalpha=%d\n", isalpha(CH));
	printf("iscntrl=%d\n", iscntrl(CH));
	printf("isdigit=%d\n", isdigit(CH));
	printf("isgraph=%d\n", isgraph(CH));
	printf("islower=%d\n", islower(CH));
	printf("isupper=%d\n", isupper(CH));
	printf("isprint=%d\n", isprint(CH));
	printf("ispunct=%d\n", ispunct(CH));
	printf("isspace=%d\n", isspace(CH));

	return 0;
}

$ LANG=en_US.ISO-8859-1 ./t
ß 223 df
isalnum=0
isalpha=0
iscntrl=0
isdigit=0
isgraph=0
islower=0
isupper=0
isprint=0
ispunct=0
isspace=0

Again, even if these functions did work they *still* wouldn't handle
non-latin1 encodings (e.g. UTF-8).
If you think so, then why use C? You're planning on
throwing away the entire library and changing the handling
of text in fundamental ways (ways that go far beyond your
initial "I want unsigned text" plea). The result would be
a programming language in which existing C programs would
not run and perhaps would not compile; why are you so set
on calling this new and different language "C?" Call it
"D" or "Sanskrit" or "Baloney" if you like, but it ain't C.

I think that you should consider the possibility that programming
requirements are changing and that discussing the history of C will have
no impact on that. Anyone who could move to Java or .NET already has. The
rest of us are doing systems programming that needs to be C (like me).

If standards mandate UTF-8 your techniques will have to change or you're
going to be doing a lot of painful character encoding conversions at
interface boundaries.

Mike
 
