Peter Nilsson
In a post regarding toupper(), Richard Heathfield once asked me to think
about what the conversion of a char to unsigned char would mean, and whether
it was sensible to actually do so. And pete has raised a doubt in my mind
on the same issue.
Either through ignorance or incompetence, I've been unable to resolve some
issues.
6.4.4.4p6 states...
The hexadecimal digits that follow the backslash and the letter
x in a hexadecimal escape sequence are taken to be part of the
construction of a single character for an integer character
constant or of a single wide character for a wide character
constant. The numerical value of the hexadecimal integer so
formed specifies the value of the desired character or wide
character.
6.4.4.4p9 states...
An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a
single-byte execution character is the numerical value of the
representation of the mapped character interpreted as an integer.
...
What does this mean? Why does it use the phrase 'value of the
_representation_'?
It goes on to say...
If an integer character constant contains a single character or
escape sequence, its value is the one that results when an object
with type char whose value is that of the single character or
escape sequence is converted to type int.
What does this mean?
I'm thinking of the case where plain char is signed. For character constants
clearly in the range 0..CHAR_MAX, e.g. '\x50', I can expect the value to be
what the constant implies, namely 0x50 for the example given.
But what happens when a character constant (using hex or octal escape)
is in the range CHAR_MAX+1..UCHAR_MAX?
What is the value of '\xe9' on an 8-bit char implementation? I would have
thought 233, but if plain char is signed, then it would seem that the value
is implementation-defined. But in what way? Is the value 233 _converted_ to
char (as in 6.3.1), or is the value _as if_ an unsigned char object were read
through a char lvalue? [In which case '\x80' could be problematic on 8-bit
ones'-complement and sign-magnitude machines (and seemingly also on
two's-complement machines under C99).]
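To make the question concrete, here is a small probe of my own (the function
names are mine, not from the standard). With 8-bit unsigned plain char it
yields 233 for '\xe9'; with signed plain char on a two's-complement machine
it yields -23, i.e. the int value of a char object holding the bits 0xE9:

```c
#include <limits.h>

/* Does plain char behave as a signed type on this implementation? */
int char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* The value of the character constant '\xe9' as an int: 233 if plain
   char is unsigned, typically -23 (two's complement) if it is signed. */
int xe9_value(void)
{
    return '\xe9';
}
```

On every implementation I am aware of this agrees with (char)0xE9 converted
to int, but whether the standard actually requires that is the point at issue.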
Moving on to other sources of plain char values:
7.19.3p11 states...
...The byte input functions read characters from the stream as if by
successive calls to the fgetc function.
7.19.7.1 states...
...the fgetc function obtains that character as an unsigned char
converted to an int...
7.19.8.1 states...
The fread function reads, into the array pointed to by ptr, up to
nmemb elements whose size is specified by size, from the stream
pointed to by stream. For each object, size calls are made to the
fgetc function and the results stored, in the order read, in an
array of unsigned char exactly overlaying the object.
Now the spec for fgets makes no mention of unsigned char arrays, and fgetc
does not write the character it reads to a buffer; it simply returns an int.
This suggests that fgets (notionally using fgetc) stores its characters by
assignment, and thus by conversion (6.3.1).
So, there are at least two ways a plain char string can store the same source
'string' of characters. [Considering characters whose original unsigned char
values are outside the range of signed char.]
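The two storage paths can be sketched as follows (the function names are
hypothetical, mine only). The first models the notional fgets path, an
assignment and hence a 6.3.1 conversion, implementation-defined when the
value exceeds CHAR_MAX on a signed-char system; the second models the fread
path, where the byte is "stored ... in an array of unsigned char exactly
overlaying the object":

```c
#include <string.h>

/* The fgets-style path: the fgetc result (an unsigned char value,
   0..UCHAR_MAX) is assigned to a char, i.e. converted per 6.3.1. */
char store_by_assignment(unsigned char byte)
{
    char c = (char)byte;
    return c;
}

/* The fread-style path: the byte's representation is copied into the
   char object directly, with no value conversion at all. */
char store_by_overlay(unsigned char byte)
{
    char c;
    memcpy(&c, &byte, 1);
    return c;
}
```

On two's-complement implementations both produce the same representation for
0xE9, but only the overlay is guaranteed to preserve the byte; the
assignment's result is implementation-defined, which is exactly why the two
paths could, in principle, diverge.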
Writing a generic to_upper(char*) function would seem to be impossible. If
the function chooses to convert the original string pointer to an unsigned
char *, then it fails for strings read by fgets and possibly string
literals. If it converts single char values to unsigned then it possibly
fails for strings read via fread. [Unless the implementation-defined
conversion from unsigned char (or int) to (signed plain) char is a literal
reinterpretation of the low order bits.]
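For reference, here is a sketch of the "convert each element" variant
(to_upper is my own name, not a standard function). The cast to unsigned
char keeps the argument to toupper within its defined domain; whether that
cast recovers the original character coding for negative chars is precisely
the doubt raised above:

```c
#include <ctype.h>

/* Uppercase a string in place.  Each char is converted to unsigned
   char before the toupper call, since toupper's argument must be
   representable as an unsigned char or be EOF. */
void to_upper(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```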
Implementations are allowed to support extended character sets, and those
character codings may not be representable as positive values in plain char.
So how can a strictly portable program deal with such characters? Do string
manipulation functions have to know the potential source?
Fundamentally, how does a program reliably convert negative plain char
values back to the original non-negative int (or unsigned char) values?
[e.g. for locale-dependent functions like some toxxxx() functions.]
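The usual idiom is the one below, though as far as I can tell the standard
does not guarantee that it inverts the original storage on every
representation. Per 6.3.1.3 a negative value has UCHAR_MAX + 1 added, so on
two's-complement machines -23 maps back to 233 and the original byte is
recovered; on sign-magnitude or ones'-complement machines it is the *value*,
not the bit pattern, that gets converted:

```c
/* Convert a plain char back to the 0..UCHAR_MAX range expected by
   fputc, toupper, and friends.  Negative values have UCHAR_MAX + 1
   added by the unsigned char conversion (6.3.1.3). */
int to_byte_value(char c)
{
    return (unsigned char)c;
}
```

The result is then safe to pass to the toxxxx() functions, which is what one
wants for locale-dependent classification.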