casting to unsigned char for is*() and to*() functions

mr_semantics

I have been reading about the practice of casting values to unsigned
char while using the <ctype.h> functions. For example,

c = toupper ((unsigned char) c);

Now I understand that the standard says this about the <ctype.h>
functions:

"The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the argument has any other value,
the behavior is undefined."

I am having a hard time formulating my question, but basically it is
this: some people say to cast to unsigned char (as in the above
example), whereas I have seen some people argue that casting to
unsigned char is unnecessary, and that if it is done, a cast back to
int is necessary because functions like toupper() expect an int, e.g.,

toupper( (int)((unsigned char) c) );

So what is the right thing to do? Cast to unsigned char? Cast to
unsigned char and back to int?
 
Lew Pitcher

IIRC, the C standard says that all members of the execution character set
will be expressible as positive values.

Since the use of toupper() only makes sense within the scope of the
execution character set, and not for arbitrary char values outside of
that range, it is safe to say that toupper() only works on positive char
values, or EOF (which is a specific, often negative, char value).

Casting the parameter to an unsigned char
- may change the interpretation of the value of the parameter, if it is
  (a negative value) EOF
- has no effect on proper members of the execution character set

Casting this back to int, while properly correcting the type of the
parameter to int, otherwise has (to my knowledge) no other effect. The
damage has been done by the cast to unsigned char, and cannot be
corrected by the recasting to int.

In the end, toupper() was originally meant to take an int (as in the
return value of fgetc()), and that's what you should give it. If you are
working with char data items, convert them to int first, but otherwise
don't cast. I.e.


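#include <ctype.h>
#include <stdio.h>

int main(void)
{
    char datum = 'c', uc_datum;
    int file_datum, uc_file_datum;

    /* fgetc() returns an int, so its result is passed through as-is */
    file_datum = fgetc(stdin);
    uc_file_datum = toupper(file_datum);

    /* a char data item, converted to int before the call */
    uc_datum = toupper((int)datum);

    (void)uc_file_datum;   /* results unused in this fragment */
    (void)uc_datum;
    return 0;
}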

--

Lew Pitcher, IT Specialist, Enterprise Data Systems
Enterprise Technology Solutions, TD Bank Financial Group

(Opinions expressed here are my own, not my employer's)
 
Eric Sosman

If `c' is a plain `char', cast it to `unsigned char'.
The further cast to `int' is harmless but unnecessary: since
<ctype.h> provides a prototype that says toupper() takes an
`int' argument, the compiler will do the conversion anyhow.

The reason you need the cast is that converting directly
from plain `char' to `int' might not produce what toupper()
needs. Specifically, if `char' is a signed type and `c' has
a negative value, direct conversion will produce a negative
`int'. If this negative `int' happens to equal EOF, toupper()
will just return the EOF unaltered, and this might not be the
upper-case equivalent of `c'. If the negative `int' is
something other than EOF, all bets are off and you are in the
perilous realm of Undefined Behavior.

If `c' is an `int' obtained from something like getc(),
just pass it along without casting. getc() and its ilk
already return either EOF or a non-negative `unsigned char'
value, which is what toupper() et al. require.
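
As a rough illustration of the advice above, here is a minimal sketch of upper-casing a string held in plain `char' storage. The helper name upcase_string is made up for this example, and the final conversion back to `char' is implementation-defined for values above CHAR_MAX, although it round-trips on common implementations.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: upper-case a NUL-terminated string in place. */
void upcase_string(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* The cast maps a possibly negative plain char into the range
           0..UCHAR_MAX, which is what toupper() is defined for. */
        *s = (char)toupper((unsigned char)*s);
    }
}
```

Called on a buffer holding "abc", this leaves "ABC" behind; bytes outside the basic character set are changed only if the current locale defines an upper-case form for them.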
 
pete

Since toupper is undefined for values which are
not representable as unsigned char,
then if a cast to unsigned char will change the value,
then do that, if not, then don't bother.

fputc, by which all file output is described,
converts its int argument to unsigned char before output.

So, if you have a negative integer value like:
#define NEG_a ('a' - 1 - (unsigned char)-1)
you know that
putchar(NEG_a);
will output the 'a' character.

To make that negative number work with toupper:
putchar(toupper((unsigned char)NEG_a));
 
Kenneth Brody

Eric said:
I have been reading about the practise of casting values to unsigned
char while using the <ctype.h> functions. For example,

c = toupper ((unsigned char) c);

Now I understand that the standard says this about the <ctype.h>
functions: [...]
So what is the right thing to do? Cast to unsigned char? Cast to
unsigned char and back to int?

If `c' is a plain `char', cast it to `unsigned char'.
The further cast to `int' is harmless but unnecessary: since
<ctype.h> provides a prototype that says toupper() takes an
`int' argument, the compiler will do the conversion anyhow.

The reason you need the cast is that converting directly
from plain `char' to `int' might not produce what toupper()
needs. Specifically, if `char' is a signed type and `c' has
a negative value, direct conversion will produce a negative
`int'. If this negative `int' happens to equal EOF, toupper()
will just return the EOF unaltered, and this might not be the
upper-case equivalent of `c'. If the negative `int' is
something other than EOF, all bets are off and you are in the
perilous realm of Undefined Behavior.
[...]

For example, suppose toupper were defined as:

#define toupper(c) ( ((c)==EOF) ? EOF : toupper_xlate[c] )

where "toupper_xlate[]" is an array of the convert-to-upper-case values.
If c is a signed char, then in addition to 0xFF probably equivalent to
EOF, you also have 0x80 through 0xFE being sign-extended as negative
subscripts into the array. (BTDT)

--
+-------------------------+--------------------+-----------------------------+
| Kenneth J. Brody | www.hvcomputer.com | |
| kenbrody/at\spamcop.net | www.fptech.com | #include <std_disclaimer.h> |
+-------------------------+--------------------+-----------------------------+
 
Eric Sosman

Lew said:
IIRC, the C standard says that all members of the execution character set
will be expressible as positive values.

You seem to have missed the difference between "execution
character set" and "basic execution character set," described
in section 5.2.1. Section 6.2.5/3 guarantees that all the
basic characters are positive[*], but no such guarantee applies
to the "extended execution character set."

[*] Is this a defect in the Standard? '\0' is a member
of the basic execution character set, yet it is not positive.

Since the use of toupper() only makes sense within the scope of the
execution character set, and not for arbitrary char values outside of
that range, it is safe to say that toupper() only works on positive char
values, or EOF (which is a specific, often negative, char value).

toupper() applies only to the execution character set and
to EOF, true. But toupper() is not restricted to the basic
execution character set; it also applies to extended characters.
If you want to translate æ to Æ or ñ to Ñ or å to Å, you are
dealing with extended characters and must consider that they
could be negative.

By the way, EOF is not "often" negative but "always" negative,
and is not a `char' value but an `int' value. See 7.19.1/3.

Casting the parameter to an unsigned char
- may change the interpretation of the value of the parameter, if it is
(a negative value) EOF

Changing the value from possibly negative to guaranteed
positive is the purpose of the cast. I'm not sure why you
mention EOF here.

- has no effect on proper members of the execution character set

Has no effect on members of the basic execution set, but
can affect extended characters.

Casting this back to int, while properly correcting the type of the
parameter to int, otherwise has (to my knowledge) no other effect. The
damage has been done by the cast to unsigned char, and cannot be
corrected by the recasting to int.

... except "damage" is the wrong word, and there's nothing
that needs to be "corrected."

In the end, toupper() was originally meant to take an int (as in the
return value of fgetc()), and that's what you should give it. If you are
working with char data items, convert them to int first, but otherwise
don't cast. I.e.


#include <ctype.h>
#include <stdio.h>

{
char datum = 'c', uc_datum;
int file_datum, uc_file_datum;

file_datum = fgetc(stdin);
uc_file_datum = toupper(file_datum);

uc_datum = toupper((int)datum);

No; that's both pointless and wrong. "Pointless" because
the compiler would perform this conversion anyhow without the
cast, and "wrong" because a negative `datum' would invoke
undefined behavior unless it just happened to equal EOF.
 
Eric Sosman

Kenneth said:
For example, suppose toupper were defined as:

#define toupper(c) ( ((c)==EOF) ? EOF : toupper_xlate[c] )

where "toupper_xlate[]" is an array of the convert-to-upper-case values.
If c is a signed char, then in addition to 0xFF probably equivalent to
EOF, you also have 0x80 through 0xFE being sign-extended as negative
subscripts into the array. (BTDT)

That particular definition wouldn't work because it
isn't safe from side-effects -- consider `toupper(*p++)'.
The usual rescue is something like

#define toupper(c) _toupper_xlate[(c) - EOF]

... using a table that's "offset" by -EOF (usually 1)
positions.
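
For illustration only, the offset-table idea might be sketched as follows. The names my_upper_table, MY_TOUPPER and my_upper_init are invented for this example and are not taken from any real library; the table handles only the 26 unaccented letters, so it is a sketch of the indexing trick rather than a locale-aware toupper().

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>   /* EOF, size_t */

/* Slot 0 holds the entry for EOF; slot (v - EOF) holds the entry for v. */
static int my_upper_table[UCHAR_MAX + 1 - EOF];

/* The argument is evaluated exactly once, so MY_TOUPPER(*p++) is safe. */
#define MY_TOUPPER(c) (my_upper_table[(c) - EOF])

/* Hypothetical initializer: identity mapping plus a-z -> A-Z. */
void my_upper_init(void)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    int v;
    size_t i;

    my_upper_table[0] = EOF;              /* EOF maps to itself */
    for (v = 0; v <= UCHAR_MAX; v++)
        my_upper_table[v - EOF] = v;      /* identity by default */
    for (i = 0; lower[i] != '\0'; i++)
        my_upper_table[lower[i] - EOF] = upper[i];
    assert(MY_TOUPPER(EOF) == EOF);
}
```

Note that a negative argument other than EOF would still index outside the table, which is exactly why callers holding plain `char' data need the cast to `unsigned char'.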
 
Jack Klein

I have been reading about the practise of casting values to unsigned
char while using the <ctype.h> functions. For example,

c = toupper ((unsigned char) c);

Now I understand that the standard says this about the <ctype.h>
functions:

"The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the argument has any other value,
the behavior is undefined."

I am having a hard time formulating my question - basically it's like
this though - Some people say cast to unsigned char (as in the above
example), whereas I have seen some people argue that casting to
                              ^^^^^^^^^^^

Who are these "some people"? What are their qualifications to offer
advice on this subject?

On an implementation where plain char is signed (which is quite
common, especially on x86 systems, meaning Windows, most Linux, and
soon to be Macintosh), then this:

char ch = CHAR_MIN;

int uc = toupper(ch);

...produces undefined behavior, unless the macros CHAR_MIN and EOF
happen to be equal, which is not likely. The very words you quoted above
specifically say so.
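
To make the quoted rule concrete, here is a small sketch that checks whether a value lies in the domain of the <ctype.h> functions. Both helper names (is_valid_ctype_arg and to_ctype_arg) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>   /* EOF */

/* Hypothetical helper: nonzero if v may legally be passed to toupper() etc. */
int is_valid_ctype_arg(int v)
{
    return v == EOF || (v >= 0 && v <= UCHAR_MAX);
}

/* Hypothetical helper: map any plain char to a valid <ctype.h> argument. */
int to_ctype_arg(char c)
{
    int v = (unsigned char)c;   /* always lands in 0..UCHAR_MAX */
    assert(is_valid_ctype_arg(v));
    return v;
}
```

Where plain char is signed, is_valid_ctype_arg(CHAR_MIN) is false, while to_ctype_arg(CHAR_MIN) yields a value that is safe to pass on.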

So your "some people" are very foolish if they are offering advice
based on a fundamental misconception.

unsigned char is unnecessary, and if it is done, then a recast back to
int is necessary, because functions like toupper() expect an int, eg,

If the same foolish "some people" are saying this, then they are
actually beyond misconception and have arrived at ignorance. The only
advice they are qualified to give about C is perhaps how to spell the
name of the language. And even then I would double check their
answer.
 
CBFalconer

Eric said:
.... snip ...

If `c' is an `int' obtained from something like getc(),
just pass it along without casting. getc() and its ilk
already return either EOF or a non-negative `unsigned char'
value, which is what toupper() et al. require.

I think the point is that getc and friends do not return a char;
they all return an int. So unless the OP makes the beginner's
mistake of storing that value in a char, all is correct without any
special effort. Thus the usual pattern for filling a char array is:

while (EOF != (ch = getc(...))) {
/* make tests on ch */
/* optionally store ch in a char array */
}

and the tests on the intermediate storage ch need no special care,
provided that it is of type int.
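
A fleshed-out version of that loop might look like the sketch below. The function name read_upper and its buffer-size convention are assumptions made for this example only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: read up to cap-1 characters from fp, store their
   upper-case forms in buf, NUL-terminate, and return the count stored. */
size_t read_upper(FILE *fp, char *buf, size_t cap)
{
    int ch;            /* int, not char, so EOF stays distinguishable */
    size_t n = 0;

    assert(fp != NULL && buf != NULL && cap > 0);
    while (n + 1 < cap && EOF != (ch = getc(fp))) {
        /* ch is already EOF-or-unsigned-char-valued: no cast needed */
        buf[n++] = (char)toupper(ch);
    }
    buf[n] = '\0';
    return n;
}
```

The test on ch happens while it is still an int; only the already-converted result is narrowed to char for storage.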
 
James Daughtry

The reason you need the cast is that converting directly
from plain `char' to `int' might not produce what toupper()
needs.

Does the standard guarantee that casting signed char with a negative,
non-EOF value, to unsigned char will produce the expected character? It
seems to me that unless this guarantee is provided, the cast would give
you defined behavior but garbage results. That's only marginally better
than undefined behavior. As such, wouldn't it be better to simply avoid
the operation if the value is out of range?

if (c == EOF || (c >= 0 && c <= UCHAR_MAX))
c = toupper(c);
else {
/* Special treatment for c */
}
 
Eric Sosman

James said:
Does the standard guarantee that casting signed char with a negative,
non-EOF value, to unsigned char will produce the expected character?

The Standard does not govern your expectations. ;-)

It
seems to me that unless this guarantee is provided, the cast would give
you defined behavior but garbage results. That's only marginally better
than undefined behavior. As such, wouldn't it be better to simply avoid
the operation if the value is out of range?

if (c == EOF || (c >= 0 && c <= UCHAR_MAX))
c = toupper(c);
else {
/* Special treatment for c */
}

What would the "special treatment" be? Without toupper()
to aid you, how would you know to transform -25 to -57 so as to
turn ç into Ç? And how would you know that -65 should remain
unchanged because ¿ has no upper-case equivalent?
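
For what it is worth, the arithmetic of the cast itself is well defined: converting a negative value to `unsigned char' adds UCHAR_MAX + 1. The sketch below shows that step in isolation; the helper name code_from_signed is made up for this example, and whether the resulting code is the "expected" character remains a property of the implementation's character set.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: the value a signed char takes after conversion
   to unsigned char, i.e. c + UCHAR_MAX + 1 when c is negative. */
int code_from_signed(signed char c)
{
    int v = (unsigned char)c;
    assert(v == (c < 0 ? c + UCHAR_MAX + 1 : c));
    return v;
}
```

With 8-bit chars, -25 becomes 231 (0xE7) and -57 becomes 199 (0xC7), which are the ISO 8859-1 codes for ç and Ç.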

IMHO (and with benefit of hindsight, a luxury not afforded
to pioneers) it was a mistake to define the <ctype.h> functions
on "all values getc() can return," and things would have been
better if they'd been defined on "all values a `char' can have."
But done's done, the moving finger writes, and there's no use
crying over spilt milk. I'm sure that if someone today were to
redesign the C library from scratch and without the need to
accommodate existing code, the outcome would be nicer in many
ways than what we have now. (Maybe I shouldn't be so sure; look
at the insalubrities of C that have been perpetuated in Java, a
"from scratch" effort whose designers might have been expected
to have known better!) However, it's surpassingly unlikely that
the library will change except by addition and extension; the
fundamental decisions about things like <ctype.h> will stay.

Do you have (or were you born with) a vermiform appendix
whose only discernible purpose is to put you at risk of
peritonitis? It's your heritage -- and C has heritage, too.
 
Lawrence Kirby

Does the standard guarantee that casting signed char with a negative,
non-EOF value, to unsigned char will produce the expected character?

Strictly, no.

It
seems to me that unless this guarantee is provided, the cast would give
you defined behavior but garbage results. That's only marginally better
than undefined behavior. As such, wouldn't it be better to simply avoid
the operation if the value is out of range?

This is a muddy area. In practical terms the likelihood of it failing is
remote and the pain of implementing it portably (assuming that is possible
at all) is too great. So we just assume that it works. Really the onus is
on anybody writing or developing under an implementation where this
doesn't work to think long and hard about what they are doing. On
architectures where there can be an issue (such as non 2's complement
ones) there is a simple fix for the implementation - make plain char an
unsigned type.

And no, a wrong local result is nowhere near as bad as undefined
behaviour: the former can affect results for a particular value and
anything dependent on it, the latter can make the entire program
unpredictable.

if (c == EOF || (c >= 0 && c <= UCHAR_MAX))

c == EOF is a dangerous operation when c can be an unsigned type.

Lawrence
 
James Daughtry

make plain char an unsigned type.

Why do I get the feeling that this is the best solution for everyone to
begin with? Muddy areas do nothing but get your shoes messy, so it's
better to wander around them, no?

The Standard does not govern your expectations. ;-)

That's a shame! Here I've been expecting a well toasted bran muffin
with hot tea and orange juice every morning, but my implementation
fails to provide it consistently. Live and learn.

What would the "special treatment" be?

Special. But I'm pretty sure it falls under some kind of non-disclosure
agreement, as most special things do, so I can't really say... ;-)
 
Eric Sosman

James said:
Why do I get the feeling that this is the best solution for everyone to
begin with? Muddy areas do nothing but get your shoes messy, so it's
better to wander around them, no?

Remember what I said about hindsight being a luxury not
afforded to pioneers? We wouldn't have this mess if

- `char' had been unsigned from the git-go. However,
this would have penalized C implementations on some
machines (the historically important PDP-11 among
them), and the consequence might have been that we
wouldn't have this mess but also wouldn't have C.

... or ...

- getc() and the like didn't try to make the return
value carry both status (EOF) and data simultaneously.
It's this that forces them to return a perverted
sort of `char'-ish value instead of a value that's
actually expressible as a `char'.[*]

... or ...

- getc() and co. had not been designed to cater to the
convention that "negative return values are errors."
Actually, the Unix convention was that a -1 returned
by a system call indicated an error, but many coders
tested with `< 0' instead of with `== -1' -- it often
required less code, and memory used to be far more
precious than it is today. Defining EOF as CHAR_MIN-1
or CHAR_MAX+1 would have allowed getc() to return a
"natural" `char' value in a wider[*] `int' and avoided
all the present difficulties -- but all the `< 0' tests
would have broken.

... or ...

- The <ctype.h> functions had been defined to work on
`char' values instead of on the perverted values
returned by getc(). This would sometimes have meant
casting a getc()-returned `int' value to `char'
before handing it to toupper(), but that seems a lot
more natural (and easier to explain) than requiring
people to cast to `unsigned char' on the way from
`char' to `int'.

However, it didn't happen that way. And it's not going
to change -- some of these facilities may eventually be
supplanted by others better-suited to handling international
character sets, but nobody's going to respecify getc() or
toupper() at this late date. Hell, we can't even get up the
gumption to murder gets()! Plato regarded the world as an
imperfect expression of a perfect ideal; he'd have understood.

[*] The assumption that `int' is wider than `char' is no
longer universally true, which creates difficulties on some
systems: How can you choose an `int' value that's distinct
from all `char' values on a system where CHAR_MIN==INT_MIN
and CHAR_MAX==INT_MAX? I personally have not used such a
machine, but two possibilities occur to me: First, the I/O
facilities could continue to operate with characters of eight
(or so) bits, even though they occupy more memory. On input,
you'd get character values with a lot of high-order sign bits
(all zeroes or all ones), so EOF could be a `char' value that
corresponds to no actual I/O character. Output would probably
"chop" the high-order bits of a `char' that corresponded to
no character; the unpleasant consequence would be that there
would be `char' values that could not be written out and read
back in again.

The second approach would be to say that all `char' and
`int' values are "legitimate" as characters, so EOF would not
have a distinguished value. This would require coding along
the lines of

int ch;
if ((ch = getchar()) == EOF
&& (feof(stdin) || ferror(stdin))) ...

or even just

int ch = getchar();
if (feof(stdin) || ferror(stdin)) ...

... meaning that the use of in-band signalling for exceptional
conditions really hasn't helped; you wind up needing to test
the out-of-band channel anyhow.
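
As a sketch of the first variant, a reader that does not trust the EOF value on its own could be wrapped up like this. The function name read_char and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 and stores the character in *out,
   or returns 0 at end-of-file or on a read error. */
int read_char(FILE *fp, int *out)
{
    int ch;

    assert(fp != NULL && out != NULL);
    ch = getc(fp);
    /* An EOF-valued result counts as "no character" only if the
       stream's out-of-band indicators agree. */
    if (ch == EOF && (feof(fp) || ferror(fp)))
        return 0;
    *out = ch;
    return 1;
}
```

On ordinary systems, where `int' is wider than `char', the extra feof()/ferror() check is redundant but harmless.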
 
Dave Thompson

Lew said:
IIRC, the C standard says that all members of the execution character set
will be expressible as positive values.

You seem to have missed the difference between "execution
character set" and "basic execution character set," described
in section 5.2.1. Section 6.2.5/3 guarantees that all the
basic characters are positive[*], but no such guarantee applies
to the "extended execution character set."

[*] Is this a defect in the Standard? '\0' is a member
of the basic execution character set, yet it is not positive.

It was, until changed to "nonnegative" in TC1.

- David.Thompson1 at worldnet.att.net
 
