What's the deal with the "toupper" family?

Flash Gordon · Jul 6, 2006

Eric said:
Andrew said:

Andrew Poelstra wrote:

[...]

I believe all of these are guaranteed:

char has no padding bits
char has no trap representations

Would you mind revealing where you find these guarantees?
If they are in the Standard, I have overlooked them.

The first has been mentioned in this group many times (although it
may pertain only to unsigned char), and the second seemed to me a
logical extension.

There are special guarantees for unsigned char, so that
it is possible to treat the representation of any object as
an array of unsigned char. This would not work if unsigned
char had trap representation or contained indeterminately-
valued padding bits.

This is covered in 6.2.6.2 para 1 of N1124 which describes padding bits
and explicitly states that unsigned char cannot have them.

With the range requirements for signed char, this means that signed char
can only have padding bits if CHAR_BIT is greater than 8, and char can
only have padding bits if it is signed and CHAR_BIT is greater than 8.

However, I am unaware of any similar guarantees for char,
either signed or plain. On an implementation where plain char
is unsigned one can deduce that it has no padding bits or traps
(argument: On such an implementation, plain char can represent
all the values unsigned char can, and since the latter "fills
the code space" the former must, too). But the argument doesn't
hold for signed char, or for plain char on an implementation
where CHAR_MIN<0.

Specifically, CHAR_MIN is allowed to be -127 on a 2s-complement system
with -128 being a trap. In addition, -0 is allowed to be a trap on
1s-complement and sign-magnitude implementations. Specifically, section
6.2.6.2 para 2 of N1124 describes this for all signed integer types with
no exception mentioned for char or signed char.

So you can have a trap representation for char even on a system with
CHAR_BIT==8, although I am not aware of any such system.
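
As a rough sketch of the reasoning above (not a definitive test), the limits in <limits.h> can at least tell you which case an implementation falls into. The function names here are made up for illustration, and the second function only encodes the argument made in this post: padding bits in plain char are possible only if it is signed and CHAR_BIT is greater than 8.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: nonzero if plain char is a signed type here. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Hypothetical helper: nonzero if, by the argument in this post,
   plain char could have padding bits on this implementation. */
int plain_char_may_have_padding(void)
{
    /* If plain char is unsigned it must cover the whole range of
       unsigned char, which is known to have no padding bits. */
    assert(CHAR_MIN < 0 || CHAR_MAX == UCHAR_MAX);
    return CHAR_MIN < 0 && CHAR_BIT > 8;
}
```

Note that nothing in <limits.h> reveals whether a particular bit pattern is a trap representation, so this only narrows the question down.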

Since fgetc "obtains that character as an unsigned char converted to
int", it is obviously possible for it to read a representation that would
be a trap for char. Since fgets and friends are defined in terms
of fgetc (section 7.19.3 para 11), the representation they store must
IMHO be that of the unsigned char, especially as there is the one bit
pattern that could be a trap for signed char.

So, going back to the original question, which has fallen off this
quote: if you have some form of byte array that has been read from a
file by fgetc, then I believe the technically correct method would be to
use an unsigned char pointer to read the values. With a char pointer you
could read a trap representation, and in any case, on 1s-complement or
sign-magnitude, reading with a char pointer and then casting to
unsigned char would change the bit pattern, which would IMHO be wrong.

If, on the other hand, you are passing a string literal a byte at a time
to isupper, toupper etc, then using a char pointer and casting to
unsigned char would IMHO be the correct thing.
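
A minimal sketch of the two cases just described, assuming nothing beyond <ctype.h>. The function names upcase_bytes and upcase_string are invented for illustration. The first treats the buffer as raw bytes (as stored by fgetc and friends) and reads them through an unsigned char pointer; the second takes values of plain char and converts each one to unsigned char.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: uppercase n bytes that came from a file.
   The bytes are read through an unsigned char pointer, so they are
   never interpreted as (possibly negative or trapping) char values. */
void upcase_bytes(void *buf, size_t n)
{
    unsigned char *p = buf;
    size_t i;

    assert(buf != NULL || n == 0);
    for (i = 0; i < n; i++)
        p[i] = (unsigned char)toupper(p[i]);
}

/* Hypothetical helper: uppercase a string held as plain char values,
   e.g. a copy of a string literal. Each value is converted to
   unsigned char before being passed to toupper. */
void upcase_string(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```

On the usual 8-bit two's complement implementations both functions produce the same bytes; the distinction only matters on the more exotic representations discussed above.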

All in all, I think it is a bit of a mess if char is signed when it
comes to the library functions. However, the standard committee probably
inherited a mess from the existing practice.

Peter Nilsson · Jul 7, 2006

Richard said:
Dik T. Winter said:

...and I think Peter is.

I said:
> It's up to the programmer to supply the correct character code value.
> ...To me, it generally makes more sense to do...
>
> toupper( * (unsigned char) &c )

[Later corrected to: toupper( * (unsigned char *) &c ) ]

>
> ...when c is a plain char.

^^^^^^^^^^^^^^^^^^^^^^^

I have _never_ said the technique should be applied to an int.

Peter Nilsson said: ...
char line[256]; ...
line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */ ...
On the other hand, conforming implementations for big-endian platforms
certainly exist, and are in widespread use, and your technique breaks
on such platforms, in a manner I have described upthread.

Care to explain why the above would break on such a platform?

I'm not saying it will.

Then please stop calling it a...:

"technique that is flawed not just in theory but also in practice"

"technique that can be shown to break on very real and widely-
used platforms."

Especially when you have yet to demonstrate that the above code fails
on _any_ C implementation, let alone real world ones.

Old Wolf · Jul 7, 2006

Frederick said:
toupper( *(unsigned char const *)&c )

Does anyone else agree with this?

No. If we take an extension of your hypothetical system:
char: 1 sign bit, then 8 padding bits, then 7 value bits
uchar: 16 value bits
Padding bits must be 0 if sign bit is 0; otherwise, can be anything.

This satisfies the C standard (AFAIK) because the
representation of a non-negative plain char has the same
representation as the unsigned char of the same value.

But for negative-valued chars, the pointer cast version
returns different results depending on what the padding
bits are, which is stupid.

The only reason you would use the above expression, is
if you knew the char had been created by stuffing the
representation for a plain char into the unsigned char.

This is not the case for the result of getchar(), which is
a conversion of the value of the plain char.

Another example, on a system with 8-bit chars and
sign-magnitude:

If the byte in question has bit pattern 10000010
then your method ends up calling toupper(130).

But the proper method calls toupper(254).

The character with bit pattern 10000010 would cause
getchar() to return 254.

So it comes down to: does the char contain a value
that came from getchar, or does it contain the
representation of an unsigned char?

Again, I think the latter shows poor design, as the
representation of an unsigned char could correspond to
a trap for signed char. In particular, what is the result
of toupper(128) ?
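
Since sign-magnitude machines are hard to come by, here is a small simulation of the 8-bit example above in portable C. It is only a sketch: the function names are invented, and the "char" is modelled as an 8-bit pattern held in an unsigned int rather than as a real char object.

```c
#include <assert.h>

/* Value that an 8-bit sign-magnitude char with this bit pattern
   would have (simulated; top bit is the sign). */
int sm8_value(unsigned pattern)
{
    int magnitude;

    assert(pattern <= 0xFFu);
    magnitude = (int)(pattern & 0x7Fu);
    return (pattern & 0x80u) ? -magnitude : magnitude;
}

/* What (unsigned char)c yields with 8-bit unsigned char:
   the value reduced modulo 256. */
unsigned by_value_conversion(unsigned pattern)
{
    int v = sm8_value(pattern);
    return (unsigned)(v < 0 ? v + 256 : v);
}

/* What *(unsigned char *)&c yields: the same bits, read as unsigned. */
unsigned by_reinterpretation(unsigned pattern)
{
    assert(pattern <= 0xFFu);
    return pattern;
}
```

For the pattern 10000010 (0x82) the simulated value is -2, so the value conversion gives 254 while the reinterpretation gives 130, matching the numbers above. For non-negative values the two agree.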

Richard Heathfield · Jul 7, 2006

Peter Nilsson said:

I said:
I said:

It's up to the programmer to supply the correct character code value.
...To me, it generally makes more sense to do...

toupper( * (unsigned char) &c )

[Later corrected to: toupper( * (unsigned char *) &c ) ]

...when c is a plain char.

^^^^^^^^^^^^^^^^^^^^^^^

Yes, you did, and I missed that on my first reading. Hence the confusion. I
have already apologised for this error elsethread, but in case you missed
it I am happy to do so again.

<snip>

Keith Thompson · Jul 7, 2006

Richard Heathfield said:
Peter Nilsson said:

I said:

It's up to the programmer to supply the correct character code value.
...To me, it generally makes more sense to do...

toupper( * (unsigned char) &c )

[Later corrected to: toupper( * (unsigned char *) &c ) ]

...when c is a plain char.

^^^^^^^^^^^^^^^^^^^^^^^

Yes, you did, and I missed that on my first reading. Hence the confusion. I
have already apologised for this error elsethread, but in case you missed
it I am happy to do so again.

<snip>

Nevertheless, using the proposed expression
toupper( * (unsigned char *) &c )
when c is of type int would be an easy mistake to make (and likely to
be missed on a machine of whichever endianness it is that would hide
the error).
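
To make the failure mode concrete, here is a sketch with invented function names. Both take an int holding a character code (as getchar() would return); one converts the value, the other applies the pointer cast to the int by mistake and so reads only the first byte of its representation.

```c
#include <assert.h>
#include <limits.h>

/* Value conversion: the result depends only on the value of c. */
unsigned char by_conversion(int c)
{
    assert(c >= 0 && c <= UCHAR_MAX);  /* a character code, not EOF */
    return (unsigned char)c;
}

/* The pointer cast misapplied to an int: reads whichever byte of the
   int happens to come first in memory. */
unsigned char by_pointer_cast(int c)
{
    assert(c >= 0 && c <= UCHAR_MAX);
    return *(unsigned char *)&c;
}
```

Assuming sizeof(int) > 1 and no padding bits in int, by_pointer_cast('a') gives 'a' where the low-order byte is stored first (little-endian) and typically 0 where the high-order byte is stored first (big-endian), so the mistake goes unnoticed on the former.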

Peter Nilsson · Jul 8, 2006

Richard Heathfield said:
Peter Nilsson said:

I originally wrote:

It's up to the programmer to supply the correct character code
value.
...To me, it generally makes more sense to do...

toupper( * (unsigned char) &c )
[Later corrected to: toupper( * (unsigned char *) &c ) ]

...when c is a plain char.
^^^^^^^^^^^^^^^^^^^^^^^

Yes, you did, and I missed that on my first reading. Hence the confusion. I
have already apologised for this error elsethread, but in case you missed
it I am happy to do so again.

I'm grateful for the apology and I'm sorry for having hounded you on
the issue.

Keith said:
Nevertheless, using the proposed expression
toupper( * (unsigned char *) &c )
when c is of type int would be an easy mistake to make

You say 'would be', but AFAIK, the number of people actively using the
method I posted is 1. I can tell you that I honestly can't recall ever
making that mistake. ;-)

I can't recall ever writing... int line[256]; ...instead of... char
line[256];
and with... int c = getchar(); ...I don't cast c in either form since
there's no need to do so.

Note that with an int c, I'm also careful to store the character as an
unsigned char byte, rather than using simple assignment to plain
char.

The fact that it's 'not the done thing', doesn't make it wrong.

pete · Jul 8, 2006

Mike S wrote:

OK, it's late and I might be missing something here, but aren't the
expressions

(unsigned char) c

and

*(unsigned char*) &c

semantically equivalent?

No.
For starters, (*(unsigned char*) &c) is an lvalue.
*(unsigned char*) &c = 0;
is valid C code.
(unsigned char) c = 0;
isn't valid C code.

Or is there a chance that they might evaluate
to a different result

Yes, in so many ways.
If the value of c is equal to -1,
then ((unsigned char) c) is equal to UCHAR_MAX,
regardless of the sizeof c,
or whether negative integers
are represented as two's complement,
one's complement or sign-magnitude.

If the value of c is -1 and (sizeof c) equals 1, then
*(unsigned char*) &c
could equal either
(UCHAR_MAX) or (UCHAR_MAX - 1) or (UCHAR_MAX / 2 + 2)

If (sizeof c) doesn't equal 1,
then the allowed possible values of (*(unsigned char*) &c) are many.
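
A short sketch of the sizeof c == 1 case, using signed char so that -1 is certainly representable. The function names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* (unsigned char)c: a conversion of the value. */
unsigned char value_of(signed char c)
{
    return (unsigned char)c;
}

/* *(unsigned char *)&c: a read of the object representation. */
unsigned char representation_of(const signed char *pc)
{
    assert(pc != NULL);
    return *(const unsigned char *)pc;
}
```

With c equal to -1, value_of(c) is UCHAR_MAX everywhere, while representation_of(&c) is UCHAR_MAX on two's complement, UCHAR_MAX - 1 on one's complement and UCHAR_MAX / 2 + 2 on sign-magnitude.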

pete · Jul 9, 2006

Peter said:
Let's say we have a German sharp S,
or a Spanish N with a curly thing on top of it,
[Tilde.]

and that its numeric value is negative. How do we go about
passing their value to toupper? Should we do the following?

toupper( (unsigned char)c );

That's the clc regular's method. To me, it generally makes more
sense to do...

toupper( * (unsigned char) &c )

...when c is a plain char.

fputc(c, stream) can only return either EOF or ((int)(unsigned char)c).

That's why the cast to (unsigned char) is appropriate
for the ctype functions.

Peter Nilsson · Jul 13, 2006

Old said:
...
This is not the case for the result of getchar(), which is
a conversion of the value of the plain char.

Huh? The getchar() function returns either EOF or a value in the
range of unsigned char. The unsigned char being the byte value
read from input.
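
For the int-from-input case, a minimal sketch (the helper name upcase_stream is invented) showing that no cast is needed: fgetc and getchar already return either EOF or a value in the range of unsigned char, which is exactly what toupper accepts.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: copy `in` to `out`, uppercasing as it goes.
   Returns the number of characters written. */
long upcase_stream(FILE *in, FILE *out)
{
    long count = 0;
    int c;  /* int, not char: must hold EOF as well as 0..UCHAR_MAX */

    assert(in != NULL && out != NULL);
    while ((c = fgetc(in)) != EOF) {
        /* c is already in unsigned char range, so it goes to
           toupper as-is. */
        if (fputc(toupper(c), out) == EOF)
            break;
        count++;
    }
    return count;
}
```

Calling upcase_stream(stdin, stdout) gives the getchar()/putchar() version of the same loop.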