I want unsigned char * string literals

Eric Sosman · Jul 23, 2007

Michael B Allen wrote On 07/23/07 14:10,:

[...]
Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.

#include <stdio.h>
#include <ctype.h>

#define CH 0xdf

int
main(void)
{
    printf("%c %d %x\n", CH, CH, CH);

    printf("isalnum=%d\n", isalnum(CH));
    printf("isalpha=%d\n", isalpha(CH));
    printf("iscntrl=%d\n", iscntrl(CH));
    printf("isdigit=%d\n", isdigit(CH));
    printf("isgraph=%d\n", isgraph(CH));
    printf("islower=%d\n", islower(CH));
    printf("isupper=%d\n", isupper(CH));
    printf("isprint=%d\n", isprint(CH));
    printf("ispunct=%d\n", ispunct(CH));
    printf("isspace=%d\n", isspace(CH));

    return 0;
}

$ LANG=en_US.ISO-8859-1 ./t
ß 223 df
isalnum=0
isalpha=0
iscntrl=0
isdigit=0
isgraph=0
islower=0
isupper=0
isprint=0
ispunct=0
isspace=0

You've heard of the <locale.h> mechanisms (you mentioned
them), but it doesn't seem that you know how or when to use
them. It's quite simple, really:

- Add #include <locale.h> near the top of the file

- Insert a setlocale() call somewhere before those
isxxx() queries. The names of locales are system-
dependent; on the machine in front of me right now
one appropriate call looks like

setlocale (LC_CTYPE, "iso_8859_1");

The, well, "characteristics" of character codes are a
function of the current locale, and change as the locale
changes. In the "C" locale, there are (for instance) only
52 alphabetic characters; the other 204 are non-alphabetic
regardless of what glyphs they may produce on an output
device. Change locale -- which is to say, change to a
different set of customs about the meanings of characters --
and you get (potentially) a different answer.
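
A minimal sketch of the two steps above, not code from the thread. Locale names are system-dependent, so the list of candidate names a caller passes in is an assumption; select_ctype_locale() and count_alpha() are hypothetical helper names. In the "C" locale the count is 52; after a successful switch to a Latin-1 locale it would normally be larger.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>

/* Try each candidate LC_CTYPE locale name in turn and return the first one
 * the system accepts, or NULL if none of them is available. */
const char *select_ctype_locale(const char *const *names, size_t n)
{
    size_t i;

    assert(names != NULL);
    for (i = 0; i < n; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}

/* Count the unsigned char values that isalpha() accepts in the
 * current locale. */
int count_alpha(void)
{
    int c, n = 0;

    for (c = 0; c <= UCHAR_MAX; c++) {
        if (isalpha(c))
            n++;
    }
    return n;
}
```

A caller might try names such as "en_US.ISO-8859-1" and "iso_8859_1", and stay in the "C" locale when select_ctype_locale() returns NULL.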

Again, even if these functions did work they *still* wouldn't handle
non-latin1 encodings (e.g. UTF-8).

C's one-char-is-one-character style does not fit well
with multi-byte encodings, and especially not with variable-
length encodings. Granted; no argument; you're in the right.
But making char unsigned will not cure this illness, nor
even cure the least troubling of its symptoms; it's simply
beside the point.

I think that you should consider the possibility that programming
requirements are changing and that discussing the history of C will have
no impact on that. Anyone who could move to Java or .NET already has. The
rest of us are doing systems programming that needs to be C (like me).

If it "needs to be C," then it "needs to be C," and
wishing that C were something radically different from what
it is isn't going to help you. Either learn to solve your
problems in C, or find another language -- which may mean
finding another "it."

If standards mandate UTF-8 your techniques will have to change or you're
going to be doing a lot of painful character encoding conversions at
interface boundaries.

... except that's exactly the place I'd *want* to do
them! If I want to add support for Korean or even Klingon
I'd much rather concentrate on half a dozen translation
modules than go grubbing through the entire system adding
KlingonKapability to every function that touches a character.

And, once again: The signedness or unsignedness of char
has nothing to do with solving this much larger problem.
A character consisting of a variable number of unsigned
bytes is no easier to deal with than one whose bytes are
signed. You still can't get the forty-second character
from a byte array with `string[41]'.
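
To illustrate the point about `string[41]', here is a small sketch that walks a UTF-8 byte array to find a character by position. It assumes the input is valid, NUL-terminated UTF-8; utf8_count() and utf8_at() are hypothetical names, not functions from any library mentioned in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Count characters (code points) in a string assumed to be valid UTF-8:
 * every byte that is not a continuation byte (10xxxxxx) starts one. */
size_t utf8_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    assert(s != NULL);
    for (; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Return a pointer to the first byte of character number `index`
 * (0-based), or NULL if the string is shorter.  This is a linear scan;
 * there is no constant-time indexing into a variable-length encoding. */
const char *utf8_at(const char *s, size_t index)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);
    for (; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) {
            if (index == 0)
                return (const char *)p;
            index--;
        }
    }
    return NULL;
}
```

The forty-second character would be utf8_at(string, 41), which may or may not be the same address as &string[41].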

Michael B Allen · Jul 23, 2007

Michael B Allen wrote On 07/23/07 14:10,:

[...]
Second, the <ctype.h> functions are required to accept
arguments whose values cover the entire range of unsigned
char (plus EOF). Half those values have the high bit set,
and the <ctype.h> functions cannot ignore that half.

#include <stdio.h>
#include <ctype.h>

#define CH 0xdf

int
main(void)
{
    printf("%c %d %x\n", CH, CH, CH);

    printf("isalnum=%d\n", isalnum(CH));
    printf("isalpha=%d\n", isalpha(CH));
    printf("iscntrl=%d\n", iscntrl(CH));
    printf("isdigit=%d\n", isdigit(CH));
    printf("isgraph=%d\n", isgraph(CH));
    printf("islower=%d\n", islower(CH));
    printf("isupper=%d\n", isupper(CH));
    printf("isprint=%d\n", isprint(CH));
    printf("ispunct=%d\n", ispunct(CH));
    printf("isspace=%d\n", isspace(CH));

    return 0;
}

$ LANG=en_US.ISO-8859-1 ./t
ß 223 df
isalnum=0
isalpha=0
iscntrl=0
isdigit=0
isgraph=0
islower=0
isupper=0
isprint=0
ispunct=0
isspace=0

You've heard of the <locale.h> mechanisms (you mentioned
them), but it doesn't seem that you know how or when to use
them. It's quite simple, really:

- Add #include <locale.h> near the top of the file

- Insert a setlocale() call somewhere before those
isxxx() queries. The names of locales are system-
dependent; on the machine in front of me right now
one appropriate call looks like

setlocale (LC_CTYPE, "iso_8859_1");

Ahh, yes. I just wrote that code in haste. I spaced on setlocale.

Still my understanding from being on the linux-utf8 mailing list (which
doesn't just discuss UTF-8) for a few years, is that some of these
functions even with the latin1 codes do not work correctly.

The, well, "characteristics" of character codes are a
function of the current locale, and change as the locale
changes. In the "C" locale, there are (for instance) only
52 alphabetic characters; the other 204 are non-alphabetic
regardless of what glyphs they may produce on an output
device. Change locale -- which is to say, change to a
different set of customs about the meanings of characters --
and you get (potentially) a different answer.

C's one-char-is-one-character style does not fit well
with multi-byte encodings, and especially not with variable-
length encodings. Granted; no argument; you're in the right.
But making char unsigned will not cure this illness, nor
even cure the least troubling of its symptoms; it's simply
beside the point.

I never said using unsigned char would "fix" anything. I don't know where
you got that in this conversation. I just want to use the right data
type for binary data and string literals were the only thing standing
in the way.

... except that's exactly the place I'd *want* to do
them! If I want to add support for Korean or even Klingon
I'd much rather concentrate on half a dozen translation
modules than go grubbing through the entire system adding
KlingonKapability to every function that touches a character.

This is where we clearly disagree. Translating between character encodings
at interface boundaries is a hack.

And not every function that touches a
character would need "KlingonKapability" (whatever that is). Most of the
time you're just working on ASCII characters anyway so that code
is not a whole lot different from before. It's only when you need to
do things like case comparison of non-ASCII characters that you need to
do extra work. But even then you just put that work into a function like
text_casecmp and you don't have to think about it much again.

You make it sound like I'm rewriting everything and have my own
translation tables and so on. That's not the case. As useless as
the C standard library is, I still have no choice but to use it. The
implementations are just less efficient and inelegant. For example to do
a caseless comparison of two strings you have to use mbtowc to convert
each character to wchar_t, use towupper, then compare and repeat for
the next character. It's kinda ugly but it's a lot better than tripping
over the Unicode speed bump every time you want to call a function that
expects a different character encoding.
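
A rough sketch of the caseless comparison described above, using the text_casecmp name from this post. It is an assumption about how such a function could look, not the poster's implementation: it uses mbrtowc() rather than mbtowc() so that each string keeps its own conversion state, and it falls back to strcmp() when either string is not valid in the current locale's encoding.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Compare two multibyte strings without regard to case, one character at
 * a time, in the encoding of the current LC_CTYPE locale.
 * Returns <0, 0 or >0 like strcmp(). */
int text_casecmp(const char *a, const char *b)
{
    mbstate_t sa, sb;
    size_t la, lb;

    assert(a != NULL && b != NULL);
    memset(&sa, 0, sizeof sa);
    memset(&sb, 0, sizeof sb);
    la = strlen(a) + 1;     /* include the terminator so mbrtowc() sees it */
    lb = strlen(b) + 1;

    for (;;) {
        wchar_t wa, wb;
        wint_t ua, ub;
        size_t na = mbrtowc(&wa, a, la, &sa);
        size_t nb = mbrtowc(&wb, b, lb, &sb);

        if (na >= (size_t)-2 || nb >= (size_t)-2)
            return strcmp(a, b);    /* invalid input: compare raw bytes */

        ua = towupper((wint_t)wa);
        ub = towupper((wint_t)wb);
        if (ua != ub)
            return ua < ub ? -1 : 1;
        if (na == 0)
            return 0;               /* both strings ended together */

        a += na; la -= na;
        b += nb; lb -= nb;
    }
}
```

The result depends on the locale in effect, so a program working with UTF-8 text would call setlocale() with a UTF-8 locale name before using it.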

And, once again: The signedness or unsignedness of char
has nothing to do with solving this much larger problem.
A character consisting of a variable number of unsigned
bytes is no easier to deal with than one whose bytes are
signed. You still can't get the forty-second character
from a byte array with `string[41]'.

Yeah, yeah, yeah. I agree that even though signed char is the wrong type
for binary data, there's no technical problem with using it throughout
higher level code. I never claimed there was.

The problem (which isn't really much of a problem) is that any function
that actually inspects the binary data representing text will almost
certainly want to do it using unsigned char. For example, codebases that
support UTF-8 throughout usually have a fast code path for UTF-8. A
function to decode one UTF-8 character into its Unicode value might
start out something like the following:

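int
utf8towc(const char *ssrc, const char *sslim, wchar_t *wc)
{
    /* Inspect the bytes as unsigned char so values >= 0x80 compare as expected. */
    const unsigned char *src = (const unsigned char *)ssrc;
    const unsigned char *slim = (const unsigned char *)sslim;

    if (*src < 0x80) {                      /* single-byte (ASCII) sequence */
        *wc = *src;
    } else if ((*src & 0xE0) == 0xC0) {     /* lead byte of a two-byte sequence */
        ...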

You can't do the above conditional comparisons on signed char so you
gotta cast.

Mike

Michael B Allen · Jul 23, 2007

Ahh, yes. I just wrote that code in haste. I spaced on setlocale.

Still my understanding from being on the linux-utf8 mailing list (which
doesn't just discuss UTF-8) for a few years, is that some of these
functions even with the latin1 codes do not work correctly.

Actually I think maybe I'm wrong about this. I think these functions do
work with latin1.

But of course they definitely do not work with multi-byte encodings like
UTF-8 or SHIFT-JIS or something like that. For that you would have to
convert to a wchar_t and test with isw*.

Mike

Keith Thompson · Jul 23, 2007

Michael B Allen said:
Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.
[...]
setjmp - not portable

The restrictions are a bit severe, but how is it not portable? Any
conforming hosted implementation has to support it correctly.

[...]

stdlib - malloc has no context object

Can you expand on this? What's a "context object"?

Keith Thompson · Jul 23, 2007

Eric Sosman said:
And, once again: The signedness or unsignedness of char
has nothing to do with solving this much larger problem.
A character consisting of a variable number of unsigned
bytes is no easier to deal with than one whose bytes are
signed. You still can't get the forty-second character
from a byte array with `string[41]'.

If you're just using ASCII (which is a 7-bit character set), it
doesn't matter whether plain char is signed or unsigned.

If you're using one of the 8-bit extended versions of ASCII, such as
ISO 8859-1, then making plain char unsigned can have some advantages.
For example, lower case 'e' with an acute accent is character 0xe9;
referring to that as -23 rather than 233 doesn't make a lot of sense.
(I understand the sign-extension performance issue; does that still
apply to modern systems?)
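
A small sketch of the usual way to sidestep the -23 versus 233 question when a byte has to be treated as a number or handed to <ctype.h>. The helper names are hypothetical; the only technique shown is the conversion to unsigned char.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of a byte in the range 0..UCHAR_MAX, whatever the signedness of
 * plain char on this implementation. */
int byte_value(char c)
{
    return (unsigned char)c;
}

/* Count the alphabetic bytes of s under the current locale.  The cast
 * keeps negative char values from reaching isalpha(), whose argument
 * must be representable as unsigned char (or be EOF). */
size_t count_alpha_bytes(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```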

Once you go beyond that, I'm not sure whether it matters or not.

Ben Pfaff · Jul 23, 2007

Michael B Allen said:
locale - no context object so it can't be safely used in libraries

POSIX is standardizing an extended set of locale functions with
what you call "context objects". It appears to me to be
implementable as a set of wrappers around the existing functions
for systems that don't support it, so this may actually be a
viable interface fairly soon. (For folks who are familiar with
gnulib--which is not c.l.c compliant code by any means--I'm
thinking about adding a module to support these new functions.)
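
For reference, the interface that POSIX.1-2008 specifies includes newlocale(), uselocale(), freelocale() and per-locale classification functions such as isalpha_l(). The sketch below assumes a system that provides them through <locale.h> and <ctype.h> (some platforms use a different header); is_alpha_in() is a hypothetical helper. It never touches the process-wide locale, which is the property a library needs.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Ask whether `byte` is alphabetic in the named locale without changing
 * the global locale.  Returns 1 or 0, or -1 if the locale cannot be
 * created on this system. */
int is_alpha_in(const char *locale_name, int byte)
{
    locale_t loc;
    int result;

    assert(locale_name != NULL && byte >= 0 && byte <= 255);
    loc = newlocale(LC_CTYPE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return -1;
    result = isalpha_l(byte, loc) != 0;   /* the locale object is the context */
    freelocale(loc);
    return result;
}
```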

setjmp - not portable

How so? I've successfully used it in code that I believe to be
portable, and some fairly portable libraries, e.g. libpng, use it
also.

Ben Pfaff · Jul 23, 2007

Keith Thompson said:
(I understand the sign-extension performance issue; does that still
apply to modern systems?)

What performance issue is there with sign extension?

santosh · Jul 23, 2007

Keith said:
Michael B Allen said:

Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.
[...]

stdlib - malloc has no context object

Can you expand on this? What's a "context object"?

He probably means that the Standard library functions are not guaranteed to
be reentrant.

Michael B Allen · Jul 23, 2007

POSIX is standardizing an extended set of locale functions with
what you call "context objects". It appears to me to be
implementable as a set of wrappers around the existing functions
for systems that don't support it, so this may actually be a
viable interface fairly soon. (For folks who are familiar with
gnulib--which is not c.l.c compliant code by any means--I'm
thinking about adding a module to support these new functions.)

Glad to hear it. Do you have pointers to info about what the API will
look like?

How so? I've successfully used it in code that I believe to be
portable, and some fairly portable libraries, e.g. libpng, use it
also.

I don't know. I just remember suggesting using setjmp for some Samba thing
(the CIFS server for UNIX) and I was told setjmp was taboo because it
was not portable. Those guys have a pretty big build farm so I wasn't
about to ask for an explanation.

Mike

Eric Sosman · Jul 23, 2007

Ben Pfaff wrote On 07/23/07 17:08,:

What performance issue is there with sign extension?

Extra code. For example, the PDP-11 MOVB instruction
performed sign extension when loading an 8-bit byte into
a 16-bit register (the hardware equivalent of "promotion").
To get unsigned semantics, you'd need more instructions to
do something about the high-order bits (maybe SUB R0,R0
followed by BISB byte,R0 -- two instructions instead of one).

On the other side of the fence, the S/360 IC instruction
copied a memory byte to the low-order 8 bits of a 32-bit
register, leaving the high-order 24 bits unchanged. It's
easy to get "unsigned" semantics -- XR the register with
itself and then IC the desired byte -- but extending the
sign would require additional work. (A 24-bit left shift
plus a 24-bit arithmetic right shift would do it, at a
cost of three instructions instead of two; if my S/360
assembly were not as rusty as it has now become, I might
think of a snazzier sequence.)

If dmr had mandated one treatment or the other, one or
the other of these machines -- and others facing similar
issues -- would have found compiled C code clunky if not
downright bloated. Nowadays, a compiler might spend effort
figuring out when the high-order bits are actually needed
and when they can be ignored, but compilers Back Then did
not have the CPU and memory resources modern compilers enjoy.
Anything much more than a bit of peephole optimization was
likely not to be done at all.

Ben Pfaff · Jul 23, 2007

Eric Sosman said:
Ben Pfaff wrote On 07/23/07 17:08,:

Extra code. For example, the PDP-11 MOVB instruction
performed sign extension when loading an 8-bit byte into
a 16-bit register (the hardware equivalent of "promotion").
To get unsigned semantics, you'd need more instructions to
do something about the high-order bits (maybe SUB R0,R0
followed by BISB byte,R0 -- two instructions instead of one).

Ah, I see.

I'm not sure that this concern applies to modern compilers on
modern x86 processors, for at least three reasons. First, the
x86 has an instruction for moving data and performing sign
extension at the same time (MOVSX). I don't know how often it's
applied by compilers, because the x86 also has several
instructions for doing a memory load or store and an arithmetic
operation at the same time, and those don't support sign- or
zero-extension as part of the instruction. But that leads into
the second reason, that, I imagine, an instruction for
sign-extension could overlap its execution with other
instructions most of the time, so that even if it makes the code
longer it doesn't necessarily make it slower.

Third, I suspect that compilers are often able to avoid the need
for sign-extension entirely (as you note later in a part of your
article that I did not quote).

(All this is idle speculation. I am not a compiler or x86
architecture expert.)

Michael B Allen · Jul 23, 2007

santosh said:
Keith Thompson said:

Michael B Allen said:

Ok. A little history is nice. But I really think these discussions
should be punctuated with saying that the C standard library is basically
useless at this point.
[...]

stdlib - malloc has no context object

Can you expand on this? What's a "context object"?

He probably means that the Standard library functions are not guaranteed to
be reentrant.

Hi santosh (and Keith),

By "context object" I just mean some place to put state. If malloc had
a context object you could create any number of separate allocators
- block allocators, allocators backed with shared memory, lockless
allocators, debugging and profiling allocators, allocators that will free
all objects allocated from it in one call (i.e. garbage collection),
allocators that allocate memory from an arbitrary chunk of memory (a
sort of sub-allocator) and so on. All of these could be used throughout
your code at the same time independently.
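
To make the idea concrete, here is a minimal sketch of one such allocator: a sub-allocator that hands out memory from a caller-supplied chunk and frees everything in one call. The struct arena type and the arena_* names are hypothetical, and the code assumes a C11 compiler (for max_align_t) and a suitably aligned buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context object: all allocator state lives here rather
 * than in globals, so any number of arenas can coexist. */
struct arena {
    unsigned char *base;    /* start of the backing chunk */
    size_t size;            /* total bytes available */
    size_t used;            /* bytes handed out so far */
};

/* Attach an arena to a caller-supplied chunk of memory.  The chunk is
 * assumed to be aligned for any object type. */
void arena_init(struct arena *a, void *buf, size_t size)
{
    assert(a != NULL && buf != NULL);
    a->base = buf;
    a->size = size;
    a->used = 0;
}

/* Allocate n bytes from the arena, or return NULL if it is exhausted. */
void *arena_alloc(struct arena *a, size_t n)
{
    const size_t align = _Alignof(max_align_t);
    size_t start;

    assert(a != NULL);
    start = (a->used + align - 1) / align * align;  /* round up */
    if (start > a->size || n > a->size - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* Release every object allocated from the arena in one call. */
void arena_reset(struct arena *a)
{
    assert(a != NULL);
    a->used = 0;
}
```

No locks are needed as long as each thread works with its own struct arena.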

[Note that this should not be confused with Object Oriented Programming
so please hold the "you want C++" responses. For OOP you need polymorphism
which is not obtained by simply adding a "context object".]

In general, malloc (like many of the C APIs) was just poorly designed. At
the time it was invented it was fine but now it's a good example of a
poorly designed API.

Incidentally, the current malloc is reentrant and thread-safe because
it uses locks (although I don't recall off the top of my head if that
is a standards requirement). But with a context object you wouldn't
need the locks and the code would still be reentrant and it would be
thread-safe provided your threads used their own context objects (or
if they use locks). So reentrance is one benefit but it is by no means the
only one.

Mike

Ben Pfaff · Jul 23, 2007

Michael B Allen said:
By "context object" I just mean some place to put state. If malloc had
a context object you could create any number of separate allocators
- block allocators, allocators backed with shared memory, lockless
allocators, debugging and profiling allocators, allocators that will free
all objects allocated from it in one call (i.e. garbage collection),
allocators that allocate memory from an arbitrary chunk of memory (a
sort of sub-allocator) and so on. All of these could be used throughout
your code at the same time independently.

Usually folks implement these things in separate libraries, that
sometimes delegate some of their functionality to malloc. Pool
allocators for example can be very effectively implemented as a
layer above malloc. In fact in implementing many of the things
that you describe it has never occurred to me that an extra
"context" argument to malloc would be useful. Context objects
are certainly useful in implementing those more advanced
concepts, but I don't think they're needed at the lowest level.

Michael B Allen · Jul 24, 2007

Usually folks implement these things in separate libraries, that
sometimes delegate some of their functionality to malloc. Pool
allocators for example can be very effectively implemented as a
layer above malloc. In fact in implementing many of the things
that you describe it has never occurred to me that an extra
"context" argument to malloc would be useful. Context objects
are certainly useful in implementing those more advanced
concepts, but I don't think they're needed at the lowest level.

Hi Ben,

I think you just agreed with me (not sure) but I just want to add that
the "context object" focus comes from the fact that you can use the same
functions with different context objects. So the context object would
encapsulate everything about the behavior of the allocator. That allows
the user to swap out the allocator with a different implementation but
without changing all of the allocation calls in your code.
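
One way to read "same functions, different context objects" is an allocator handle made of function pointers plus a context pointer. The sketch below is an assumption about what such an API could look like, with hypothetical names throughout; it shows a malloc-backed allocator, a counting (debugging) allocator, and client code that works unchanged with either.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator handle: behaviour plus the state it acts on. */
struct allocator {
    void *(*alloc)(void *ctx, size_t n);
    void (*release)(void *ctx, void *p);
    void *ctx;
};

/* Implementation 1: thin wrapper over malloc/free; ctx is unused. */
static void *std_alloc(void *ctx, size_t n) { (void)ctx; return malloc(n); }
static void std_release(void *ctx, void *p) { (void)ctx; free(p); }
const struct allocator std_allocator = { std_alloc, std_release, NULL };

/* Implementation 2: tracks the number of live blocks in its context. */
struct counter { long live; };

static void *counting_alloc(void *ctx, size_t n)
{
    struct counter *c = ctx;
    void *p = malloc(n);
    if (p != NULL)
        c->live++;
    return p;
}

static void counting_release(void *ctx, void *p)
{
    struct counter *c = ctx;
    if (p != NULL) {
        c->live--;
        free(p);
    }
}

struct allocator counting_allocator(struct counter *c)
{
    struct allocator al = { counting_alloc, counting_release, c };
    assert(c != NULL);
    return al;
}

/* Client code: does not know or care which allocator it was given. */
char *dup_string(const struct allocator *al, const char *s)
{
    size_t n;
    char *copy;

    assert(al != NULL && s != NULL);
    n = strlen(s) + 1;
    copy = al->alloc(al->ctx, n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}
```

Swapping the allocator then means passing a different struct allocator to dup_string(), not editing its body.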

Mike

Mark McIntyre · Jul 24, 2007

I accept that there's no technical problem with using char. But I just
can't get over the fact that char isn't the right type for text.

Huh? A char array is perfect for text.

Do you perchance mean wide characters? Considered wchar_t?

Type char is not the correct type for text. It is merely adequate for
a traditional C 7 bit encoded "string". But char is not the right type
for binary blobs of "text" used in internationalized programs.

Binary blobs are not text however. They're binary data. Unsigned char
arrays are good for that, but I suspect you want either wchar_t or
some specific binary representation of multibyte characters. If
/that's/ what you're after, unsigned char arrays are still good.

--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

Ben Pfaff · Jul 24, 2007

Michael B Allen said:
Usually folks implement these things in separate libraries, that
sometimes delegate some of their functionality to malloc. Pool
allocators for example can be very effectively implemented as a
layer above malloc. In fact in implementing many of the things
that you describe it has never occurred to me that an extra
"context" argument to malloc would be useful. Context objects
are certainly useful in implementing those more advanced
concepts, but I don't think they're needed at the lowest level.

I think you just agreed with me (not sure)[...]

No, I was trying to say that you can layer all of the allocators
you want on top of malloc, or make them independent of it,
instead of needing to give malloc multiple independent arenas,
etc., based on a context object.

Michael B Allen · Jul 24, 2007

Ben Pfaff said:
Michael B Allen said:

By "context object" I just mean some place to put state. If malloc had
a context object you could create any number of separate allocators
- block allocators, allocators backed with shared memory, lockless
allocators, debugging and profiling allocators, allocators that will free
all objects allocated from it in one call (i.e. garbage collection),
allocators that allocate memory from an arbitrary chunk of memory (a
sort of sub-allocator) and so on. All of these could be used throughout
your code at the same time independently.

Usually folks implement these things in separate libraries, that
sometimes delegate some of their functionality to malloc. Pool
allocators for example can be very effectively implemented as a
layer above malloc. In fact in implementing many of the things
that you describe it has never occurred to me that an extra
"context" argument to malloc would be useful. Context objects
are certainly useful in implementing those more advanced
concepts, but I don't think they're needed at the lowest level.

I think you just agreed with me (not sure)[...]

No, I was trying to say that you can layer all of the allocators
you want on top of malloc, or make them independent of it,
instead of needing to give malloc multiple independent arenas,
etc., based on a context object.

Hi Ben,

I never said that the existing malloc(3) function should be changed to
have a context object. I just said it was useless because it didn't have
one. And as such it shouldn't be used.

As for allocating backing memory from malloc(3) for one such
implementation of the API I'm describing that seems fine (if you're ok
with the locking overhead) but I'm not sure I understand the significance
of doing that wrt the topic of malloc being poorly designed.

Mike

Ben Pfaff · Jul 24, 2007

Michael B Allen said:
I never said that the existing malloc(3) function should be changed to
have a context object. I just said it was useless because it didn't have
one. And as such it shouldn't be used.

malloc is "useless" because it doesn't have a context object?
Please don't exaggerate. This is refuted by the existence of
hundreds of millions of lines of code that make use of malloc.

Eric Sosman · Jul 24, 2007

Michael said:
[...]
I never said that the existing malloc(3) function should be changed to
have a context object. I just said it was useless [...]

For the Nth time: Forget about C and find a language
more suited to your tastes. If you truly believe C is
useless, you're just wasting your time and our patience.
Go away! Be happy! Be happy somewhere else, please!
We who are about to be obsoleted salute thee; just leave
us to our misery and begone!

Michael B Allen · Jul 24, 2007

Eric Sosman said:
Michael B Allen said:

[...]
I never said that the existing malloc(3) function should be changed to
have a context object. I just said it was useless [...]

For the Nth time: Forget about C and find a language
more suited to your tastes. If you truly believe C is
useless, you're just wasting your time and our patience.
Go away! Be happy! Be happy somewhere else, please!
We who are about to be obsoleted salute thee; just leave
us to our misery and begone!

Oh please. I appreciate your input. It's usually good advice. But spare
me the drama. Just because I think The C Standard Library is useless
[1], that has little impact on using C The Language.

Mike

[1] Ok, yes, "useless" is an exaggeration simply because you *have*
to use the standard library to interface with the host. But otherwise
I don't use a lot of it (e.g. I literally don't use malloc *at all* -
I have my own allocators).
