Plain Char


Peter Nilsson

In a post regarding toupper(), Richard Heathfield once asked me to think
about what the conversion of a char to unsigned char would mean, and whether
it was sensible to actually do so. And pete has raised a doubt in my mind
on the same issue.

Either through ignorance or incompetence, I've been unable to resolve some
issues.


6.4.4.4p6 states...

The hexadecimal digits that follow the backslash and the letter
x in a hexadecimal escape sequence are taken to be part of the
construction of a single character for an integer character
constant or of a single wide character for a wide character
constant. The numerical value of the hexadecimal integer so
formed specifies the value of the desired character or wide
character.

6.4.4.4p9 states...

An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer.
...

What does this mean? Why does it use the phrase 'value of the
_representation_'?

It goes on to say...

If an integer character constant contains a single character or
escape sequence, its value is the one that results when an object
with type char whose value is that of the single character or
escape sequence is converted to type int.

What does this mean?

I'm thinking of when plain char is signed. For character constants
obviously in the range 0..CHAR_MAX, e.g. '\x50', I can expect the
value to be what the constant implies, namely 0x50 for the sample given.
But what happens when a character constant (using hex or octal escape)
is in the range CHAR_MAX+1..UCHAR_MAX?

What is the value of '\xe9' on an 8-bit char implementation? I would have
thought 233, but if plain char is signed, then it would seem that the value
is implementation defined. But in what way? Is the value 233 _converted_ to
char (as in 6.3.1), or is the value _as if_ an unsigned char object was read
through a char lvalue? [In which case '\x80' could be problematic on 8-bit
one's complement and sign-magnitude machines (and seemingly also on two's
complement machines under C99).]

Moving on to other sources of plain char values:

7.19.3p11 states...

...The byte input functions read characters from the stream as if by
successive calls to the fgetc function.

7.19.7.1 states...

...the fgetc function obtains that character as an unsigned char
converted to an int...

7.19.8.1 states...

The fread function reads, into the array pointed to by ptr, up to
nmemb elements whose size is specified by size, from the stream
pointed to by stream. For each object, size calls are made to the
fgetc function and the results stored, in the order read, in an
array of unsigned char exactly overlaying the object.

Now the specs for fgets make no mention of unsigned char arrays, and fgetc
does not write the read character to a buffer, it simply returns an int.
This suggests that fgets (notionally using fgetc) stores its characters by
assignment, and thus conversion (6.3.1).

So, there are at least two ways a plain char string can store the same source
'string' of characters. [Considering characters whose original unsigned char
values are outside the range of signed char.]

Writing a generic to_upper(char*) function would seem to be impossible. If
the function chooses to convert the original string pointer to an unsigned
char *, then it fails for strings read by fgets and possibly string
literals. If it converts single char values to unsigned then it possibly
fails for strings read via fread. [Unless the implementation-defined
conversion from unsigned char (or int) to (signed plain) char is a literal
reinterpretation of the low-order bits.]

Implementations are allowed to support extended character sets, and those
character codings may not be representable as positive values in plain char.
So how can a strictly portable program deal with such characters? Do string
manipulation functions have to know the potential source?

Fundamentally, how does a program reliably convert negative plain char
values back to the original non-negative int (or unsigned char) values?
[e.g. for locale-dependent functions like some toxxxx() functions.]
 

pete

Peter said:
In a post regarding toupper(),
Richard Heathfield once asked me to think
about what the conversion of a char to unsigned char would mean,
and whether it was sensible to actually do so.
And pete has raised a doubt in my mind on the same issue.

The idea was that
memcmp(s1, s2, shorter_string_length + 1)
should agree in sign with
strcmp(s1, s2).
Writing a generic to_upper(char*)
function would seem to be impossible.

It's up to you to define what to_upper(char*)
is supposed to do with what, and when it is undefined.

This is what I have for int to_upper(int c):

#include <limits.h>
#include <string.h>

#define UPPER "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
#define LOWER "abcdefghijklmnopqrstuvwxyz"

int to_upper(int c)
{
    char *upper;
    const char *const lower = LOWER;

    upper = CHAR_MAX >= c && c > '\0' ? strchr(lower, c) : NULL;
    return upper != NULL ? *(upper - lower + UPPER) : c;
}
If the function chooses to convert the original
string pointer to an unsigned char *,
then it fails for strings read by fgets and possibly string
literals.
If it converts single char values to unsigned then it possibly
fails for strings read via fread. [Unless the implementation defined
conversion from unsigned char (or int)
to (signed plain) char is a literal
reinterpretation of the low order bits.]

Implementations are allowed to support extended character sets,
and those character codings may not be representable as positive
values in plain char.
So how can a strictly portable program deal with such characters?

I don't think that strictly portable programs have to support
other locales besides the C locale.
Do string manipulation functions have to know the potential source?

Fundamentally, how does a program reliably convert negative plain char
values back to the original non-negative int (or unsigned char)
values?

If the argument to toupper isn't either representable
as unsigned char, or equal to EOF, then the behavior is undefined.
 

Dan Pop

Peter said:
In a post regarding toupper(), Richard Heathfield once asked me to think
about what the conversion of a char to unsigned char would mean, and whether
it was sensible to actually do so.

It's not sensible to obtain negative char values in the first place,
in a string/text processing context. There is NO portable use for the
negative char values (portable code needing them for arithmetic purposes
should use the signed char type instead).
Either through ignorance or incompetence, I've been unable to resolve some
issues.

There is a third possibility: the standard is a complete mess in this
area. The origin of the mess is historical: the string handling functions
expect pointers to plain char, but the actual processing involves unsigned
char values:

1 The sign of a nonzero value returned by the comparison functions
memcmp, strcmp, and strncmp is determined by the sign of the
difference between the values of the first pair of characters
(both interpreted as unsigned char) that differ in the objects
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
being compared.

This is consistent with the way fgetc() works and avoids all kinds of
complications that I mention below. So, any time you need a character
value from a string, use a pointer to unsigned char to obtain it.

If you want some headaches, consider the case with CHAR_BIT > 8 and
CHAR_MAX == 127. Accessing strings via char pointers effectively
means losing information in the process, so even the positive char
values thus obtained are problematic. Allow trap representations into
the padding bits for extra fun...
6.4.4.4p6 states...

The hexadecimal digits that follow the backslash and the letter
x in a hexadecimal escape sequence are taken to be part of the
construction of a single character for an integer character
constant or of a single wide character for a wide character
constant. The numerical value of the hexadecimal integer so
formed specifies the value of the desired character or wide
character.

6.4.4.4p9 states...

An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer.
...

What does this mean? Why does it use the phrase 'value of the
_representation_'?

It goes on to say...

If an integer character constant contains a single character or
escape sequence, its value is the one that results when an object
with type char whose value is that of the single character or
escape sequence is converted to type int.

What does this mean?

I'm thinking of when plain char is signed. For character constants
obviously in the range 0..CHAR_MAX, e.g. '\x50', I can expect the
value to be what the constant implies, namely 0x50 for the sample given.
But what happens when a character constant (using hex or octal escape)
is in the range CHAR_MAX+1..UCHAR_MAX?

What is the value of '\xe9' on an 8-bit char implementation? I would have
thought 233, but if plain char is signed, then it would seem that the value
is implementation defined. But in what way? Is the value 233 _converted_ to
char (as in 6.3.1), or is the value _as if_ an unsigned char object was read
through a char lvalue?

The latter, apparently. Otherwise, the wording is unnecessarily
complicated (no need to introduce an *object* of type char, when all you
mean is a type conversion between unsigned char and char, followed by
promotion to int). The mention of the object means that a hypothetical
byte must be used in obtaining the value of the constant.

I'm not saying that this is necessarily a sensible way of getting the
value of the character constant, but this is the only sensible
interpretation of what the standard actually says.
[In which case '\x80' could be problematic on 8-bit
one's complement and sign-magnitude machines (and seemingly also on two's
complement machines under C99).]

What's wrong with '\x80' on one's complement machines? Looks like a legit
representation of -127, unless I'm missing something. It's '\xff' that is
problematic on one's complement machines (-0 or trap representation).

The existence of these bit patterns is the reason I expressed my own
doubts WRT the sanity of this method, above.
Moving on to other sources of plain char values:

7.19.3p11 states...

...The byte input functions read characters from the stream as if by
successive calls to the fgetc function.

7.19.7.1 states...

...the fgetc function obtains that character as an unsigned char
converted to an int...

7.19.8.1 states...

The fread function reads, into the array pointed to by ptr, up to
nmemb elements whose size is specified by size, from the stream
pointed to by stream. For each object, size calls are made to the
fgetc function and the results stored, in the order read, in an
array of unsigned char exactly overlaying the object.

Now the specs for fgets make no mention of unsigned char arrays, and fgetc
does not write the read character to a buffer, it simply returns an int.
This suggests that fgets (notionally using fgetc) stores its characters by
assignment, and thus conversion (6.3.1).

Not necessarily: the fgets specification simply doesn't say how the
function is doing its job. It could convert its s parameter to pointer
to unsigned char and simply store the unsigned char values returned by
fgetc as ints. If a conversion to plain char were involved in the
process, we'd have problems with those values that yield trap
representations and/or negative zeros when converted to plain char.
On a one's complement implementation, the character value 255 may end
up as a null character in fgets' buffer, which is certainly not what you
want.
So, there are at least two ways a plain char string can store the same source
'string' of characters. [Considering characters whose original unsigned char
values are outside the range of signed char.]
???

Writing a generic to_upper(char*) function would seem to be impossible. If
the function chooses to convert the original string pointer to an unsigned
char *, then it fails for strings read by fgets and possibly string
literals.
Why?

If it converts single char values to unsigned then it possibly
fails for strings read via fread.

The values have been written as unsigned char values and it is unsafe to
read them back as signed char values (all three types of supported
representations can have "forbidden" bit patterns in the signed types).
[Unless the implementation-defined
conversion from unsigned char (or int)
to (signed plain) char is a literal
reinterpretation of the low-order bits.]

All three types of supported representations can have "forbidden" bit
patterns in the signed types. So, reinterpreting the low order bits is
not an option here.
Implementations are allowed to support extended character sets, and those
character codings may not be representable as positive values in plain char.
So how can a strictly portable program deal with such characters? Do string
manipulation functions have to know the potential source?

A portable program only uses unsigned char pointers to process data.
Fundamentally, how does a program reliably convert negative plain char
values back to the original non-negative int (or unsigned char) values?
[e.g. for locale-dependent functions like some toxxxx() functions.]

The program avoids dealing with negative plain char values in the first
place.

Another way of dealing with the issue is by postulating that only
characters from the basic execution character set can be portably
processed in strings. This avoids all the problems discussed above,
because all the character values are positive and representable as both
char and unsigned char.

Dan
 
