clarification on character handling

aegis · Aug 8, 2005

7.4#1 states
The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something to do with how tolower can be implemented,
but I can't think of anything concrete.

Peter Nilsson · Aug 8, 2005

aegis said:
7.4#1 states
The header <ctype.h> declares several functions useful for classifying
and mapping characters.166) In all cases the argument is an int, the
value of which shall be representable as an unsigned char or shall
equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

More to the point, what should it be if _not_ UB?

It obviously has something to do with how tolower can be implemented,
but I can't think of anything concrete.

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array. It's no different to tolower(32767) on an 8-bit
char system. Why would you _expect_ some defined behaviour?
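
To make the indexing concrete, here is a minimal sketch that only computes the index such a table lookup would use and reports whether it lands inside the array. It assumes EOF is -1, as the post does; the names TABLE_SIZE and table_index_in_range are hypothetical and not taken from any real library. No out-of-bounds access is performed.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical table size: one slot for EOF plus one per unsigned char
   value (257 when char has 8 bits). */
#define TABLE_SIZE (UCHAR_MAX + 2)

/* Returns 1 if the index a "(x) + 1" style lookup would use lies inside
   a table of TABLE_SIZE entries, 0 otherwise. Assumes EOF == -1, so that
   c - EOF equals c + 1. Nothing is actually read from an array. */
int table_index_in_range(int c)
{
    long long idx = (long long)c - EOF;
    return idx >= 0 && idx < TABLE_SIZE;
}
```

With this check, EOF and every value from 0 to UCHAR_MAX are in range, while -10 and 32767 are not, which is the situation the post describes.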

RAJU · Aug 8, 2005

Hi aegis,

The expected argument to tolower(c) is given in the specification.
What happens when an unexpected argument is passed is not specified. It
is left to the compiler writers' implementation, so it is
compiler/system dependent.

It is the programmer's responsibility to avoid these kinds of scenarios.
No error code is returned by these C functions. This is very common
in the C standard.

Regards,
Raju

CBFalconer · Aug 8, 2005

aegis said:
7.4#1 states
The header <ctype.h> declares several functions useful for
classifying and mapping characters.166) In all cases the argument
is an int, the value of which shall be representable as an
unsigned char or shall equal the value of the macro EOF. If the
argument has any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something to do with how tolower can be implemented,
but I can't think of anything concrete.

Many systems have an array of bits with masks, such that the array
can be indexed by the value of the character + 1. If the value of
EOF is -1 this maps into a normal 0 based array, if EOF is
something else appropriate code can correct. The bits have
significance as to whether the character is upper case, lower case,
printable, numeric, etc. A single index and mask can return the
appropriate characteristic.

Negative (-ve) input values other than EOF foul this up, and result
in illegal memory accesses.
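
A minimal sketch of the kind of table described here, assuming EOF is -1. The names ct_table, ct_init, ct_test and the CT_* masks are made up for illustration; the table is filled from the standard <ctype.h> functions at run time so that no particular character set is assumed.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical property masks, one bit per characteristic. */
enum { CT_UPPER = 1, CT_LOWER = 2, CT_DIGIT = 4, CT_SPACE = 8 };

/* Slot 0 is for EOF (assumed to be -1); slots 1..UCHAR_MAX+1 are for
   the character values 0..UCHAR_MAX. */
static unsigned char ct_table[UCHAR_MAX + 2];

/* Fill the table once from the standard classification functions. */
void ct_init(void)
{
    int c;
    ct_table[0] = 0; /* EOF has none of the properties */
    for (c = 0; c <= UCHAR_MAX; c++) {
        unsigned char bits = 0;
        if (isupper(c)) bits |= CT_UPPER;
        if (islower(c)) bits |= CT_LOWER;
        if (isdigit(c)) bits |= CT_DIGIT;
        if (isspace(c)) bits |= CT_SPACE;
        ct_table[c + 1] = bits;
    }
}

/* A single index and mask answers the question. The caller must pass
   EOF or a value in 0..UCHAR_MAX; anything else indexes outside the
   table, exactly as the post warns. */
int ct_test(int c, unsigned mask)
{
    return (ct_table[c + 1] & mask) != 0;
}
```

After ct_init(), a call such as ct_test('A', CT_UPPER) is one array read and one AND, which is why this layout is attractive to implementers.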

Richard Kettlewell · Aug 8, 2005

aegis said:
7.4#1 states
The header <ctype.h> declares several functions useful for
classifying and mapping characters.166) In all cases the argument is
an int, the value of which shall be representable as an unsigned
char or shall equal the value of the macro EOF. If the argument has
any other value, the behavior is undefined.

Why should something such as:
tolower(-10); invoke undefined behavior?

It obviously has something to do with how tolower can be implemented,
but I can't think of anything concrete.

I would say you have it backwards: the ways in which tolower can be
implemented are defined by the specification, and the specification
allows implementations to break on negative non-EOF input if that's
the most convenient thing for them.

Antoine Leca · Aug 8, 2005

aegis said:
Why should something such as:
tolower(-10); invoke undefined behavior?

Because historically it does (an out-of-bounds access), and it was not deemed
worthwhile to give it a reasonable behaviour (which one, by the way?).

Antoine

Antoine Leca · Aug 8, 2005

Sorry if I am too picky; I do not know what the original poster's point
was, but since he posted to both comp.lang.c and comp.std.c, perhaps he
wants to make a point about toxxx() vs. isxxx().

Peter Nilsson said:
The toxxxx() macros and functions are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

This is unlikely to work correctly on a large scale (and *_flags can't be
0); furthermore, your _flags[] array cannot be shared with toupper(), which
makes its name pretty misleading.

Also, implementations of tolower() and toupper() as macros using the
classification array lookup, like
#define tolower(x) ((x) ^ _flags[(x) + 1] & _upper_case_flag)
(with an adequately chosen _upper_case_flag, i.e. 0x20 for ASCII and 0x40
for EBCDIC) do not comply with the C standard, because the x argument is
evaluated twice.

The other obvious "solution",
#define tolower(x) (_locale_dependent_array_for_tolower[(x) + 1])
is difficult to get working correctly according to the specifications,
because you should return an int, including for EOF (which is negative) and
UCHAR_MAX (which is positive), so the type of the element of the array
cannot in general be a character type; and the resulting increase in width
wastes memory. As a result, many implementations do not provide tolower()
and toupper() as macros, only as functions.

Antoine
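
The double-evaluation problem can be observed directly. The following is only a sketch: DEMO_TOLOWER, demo_flags, next_char and count_evaluations are hypothetical names, and the 0x20 flag assumes an ASCII-like character set as in the post. The argument is a function call that counts how often it runs.

```c
#include <ctype.h>
#include <limits.h>

/* Hypothetical flag table: 0x20 for uppercase letters, 0 otherwise.
   Slot 0 is reserved for EOF (assumed to be -1). */
static unsigned char demo_flags[UCHAR_MAX + 2];
static int calls = 0;

void demo_init(void)
{
    int c;
    for (c = 0; c <= UCHAR_MAX; c++)
        demo_flags[c + 1] = isupper(c) ? 0x20 : 0;
}

/* Returns a fixed character and counts how often it is called. */
int next_char(void)
{
    calls++;
    return 'Q';
}

/* Macro in the style quoted above: its argument appears twice. */
#define DEMO_TOLOWER(x) ((x) ^ (demo_flags[(x) + 1] & 0x20))

/* Expands the macro with a side-effecting argument and reports how
   many times that argument was evaluated. */
int count_evaluations(void)
{
    int result;
    calls = 0;
    demo_init();
    result = DEMO_TOLOWER(next_char());
    (void)result;
    return calls;
}
```

Here count_evaluations() reports two calls for a single use of the macro, which is the behaviour the standard does not permit for a macro version of tolower().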

Keith Thompson · Aug 8, 2005

Peter Nilsson said:
More to the point, what should it be if _not_ UB?

If plain char is signed, it would be sensible to define the various
functions to work properly with signed values, including negative
values. All the characters of the basic character set are required to
be positive, but it would be nice to be able to do something like:

char c = some_arbitrary_value;
if (isupper(c)) {
do_something();
}
else {
do_something_else();
}

The need to cast the argument to unsigned char is well documented, but
IMHO counterintuitive.

The restriction to non-negative values and EOF makes things slightly
easier for the implementation, and slightly more difficult for the
programmer. This may have been a good tradeoff when the functions
were first defined; I don't think it is now.

I've seen implementations of <ctype.h> that work properly for values
from -128 to +255, covering both signed and unsigned characters.
There is an overlap between EOF (typically -1) and whatever character
is encoded as -1 (lowercase-y-with-diaeresis in Latin-1, I think), but
that's not a problem in the default locale, since all the functions
happen to return the same value for EOF and that character.

It obviously has something to do with how tolower can be implemented,
but I can't think of anything concrete.

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array. It's no different to tolower(32767) on an 8-bit
char system. Why would you _expect_ some defined behaviour?

This approach can handle negative values sensibly by changing the
offset value and making the array bigger.

Of course, since the standard doesn't require implementations to do
this, portable code still needs to make sure the argument is either
EOF or a non-negative value.
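
For reference, the portable idiom looks like the sketch below. The function name count_upper is invented for this example; the only point is the cast on the argument to isupper().

```c
#include <ctype.h>
#include <stddef.h>

/* Counts the uppercase letters in a string. The cast to unsigned char
   keeps every argument passed to isupper() in the range the standard
   requires, even when plain char is signed and holds a negative value. */
size_t count_upper(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (isupper((unsigned char)*s))
            n++;
    }
    return n;
}
```

Without the cast, a string containing bytes above 127 could pass negative values to isupper() on implementations where plain char is signed.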

Johan Borkhuis · Aug 9, 2005

Peter said:
Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.

Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

Kind regards,
Johan

--
o o o o o o o . . . _____J_o_h_a_n___B_o_r_k_h_u_i_s___
o _____ || http://www.borkhuis.com |
.][__n_n_|DD[ ====_____ | (e-mail address removed) |

>(________|__|_[_________]_|________________________________|

_/oo OOOOO oo` ooo ooo 'o!o!o o!o!o`
== VxWorks FAQ: http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html ==

Krishanu Debnath · Aug 9, 2005

Johan said:
Peter said:

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.

Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

Krishanu

Johan Borkhuis · Aug 9, 2005

Krishanu said:
Johan said:

Peter said:

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.

Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

If I look at the man-page for toupper it says:
If c is not an unsigned char value, or EOF, the behaviour of these
functions is undefined.
(I know it is not the standard, but I don't have the standard at hand,
and this is closest to a definition I can get)

If you first check for EOF and if not EOF return the value from the
array you comply with this statement.

Kind regards,
Johan

Chris Croughton · Aug 9, 2005

Johan said:
Johan said:

Peter said:

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.

Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

Incidentally, the #define you are all using is for islower(), not
tolower(): it looks the character up in a table and selects a bit.
A similar thing can be done for tolower() etc. using a lookup table
so that it doesn't result in multiple evaluation of the argument
(although it isn't safe to assume that the argument is evaluated only
once).

It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

That doesn't matter (the effect is "undefined" if the character is out
of range, so whether it crashes, returns an incorrect result or causes
demons to fly out of your nose is up to the implementation). More
importantly it fails on EOF (and of course the +1 in the index is now
not needed because (unsigned char)(x) can never be negative).

A better implementation, as someone else mentioned, is to map all of the
characters from CHAR_MIN to UCHAR_MAX into the array:

#define islower(x) (_flags[(x) - CHAR_MIN] & _lower_case_flag)

This still has the problem that EOF will typically map onto one of the
other characters with a negative representation in signed char, but
that's the risk you take. If you want to make sure that the character
(char)EOF is treated as a real character, you still need to cast it to
unsigned char first.

(Or better still would be to change the standard and force plain char to
be unsigned, but I doubt that will happen...)

Chris C
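
A rough sketch of such a widened table follows. The names wide_flags, wide_init, wide_islower, WIDE_LOWER and WIDE_SIZE are hypothetical, and the table is filled from the standard islower() so the example does not depend on a character set. Since CHAR_MIN is zero or negative, the index is formed by subtracting it.

```c
#include <ctype.h>
#include <limits.h>

#define WIDE_LOWER 0x01
/* One slot for every value from CHAR_MIN up to UCHAR_MAX. */
#define WIDE_SIZE (UCHAR_MAX - CHAR_MIN + 1)

static unsigned char wide_flags[WIDE_SIZE];

/* Fill the table. A negative value is classified like the unsigned
   char it converts to, so both spellings of a character agree. */
void wide_init(void)
{
    int c;
    for (c = CHAR_MIN; c <= UCHAR_MAX; c++) {
        if (islower((unsigned char)c))
            wide_flags[c - CHAR_MIN] = WIDE_LOWER;
    }
}

/* Valid for any c in CHAR_MIN..UCHAR_MAX. */
int wide_islower(int c)
{
    return (wide_flags[c - CHAR_MIN] & WIDE_LOWER) != 0;
}
```

Note that this only widens the accepted range when plain char is signed. In that case EOF (typically -1) shares a slot with the character whose char value is -1, as described above; when plain char is unsigned, CHAR_MIN is 0 and EOF falls outside this table altogether.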

Krishanu Debnath · Aug 9, 2005

Johan said:
Krishanu said:

Johan said:

Peter Nilsson wrote:

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.

Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

If I look at the man-page for toupper it says:
If c is not an unsigned char value, or EOF, the behaviour of these
functions is undefined.
(I know it is not the standard, but I don't have the standard at hand,

This is exactly what standard says.

and this is closest to a definition I can get)

If you first check for EOF and if not EOF return the value from the
array you comply with this statement.

*Yes*. Then why do you need an unsigned char cast?

If you don't give a value that toupper/tolower accepts (e.g. you give a
negative integer), you will get undefined behavior with *that*
implementation.

You are changing an undefined behavior to a **wrong output** with the
unsigned char cast.

Krishanu

Johan Borkhuis · Aug 9, 2005

Krishanu said:
This is exactly what standard says.

*Yes*. Then why do you need an unsigned char cast?

The main reason for the cast is to avoid negative index in an array.

If you don't give a value that toupper/tolower accepts (e.g. you give a
negative integer), you will get undefined behavior with *that*
implementation.

You can also consider a segmentation fault undefined behaviour.

You are changing an undefined behavior to a **wrong output** with the
unsigned char cast.

What is the definition of undefined behaviour? In this case the return
of something (AKA Garbage in Garbage out) can be considered undefined
behaviour (unless you consider the fact that because I defined it, it is
no longer undefined, and thus not according to the standard.....).
But as the output is undefined I don't think you can say that any output
can be considered wrong.

Kind regards,
Johan

pete · Aug 9, 2005

Krishanu said:
Johan said:

Krishanu said:

Johan Borkhuis wrote:

Peter Nilsson wrote:

Consider a simple look up table (and the fact that EOF is quite
often and deliberately set at -1). The toxxxx() macros and functions
are often implemented in this way...

unsigned char _flags[257] = { 0, .... };

#define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

If you try tolower(-10), then the element referenced is not within
the specified array.

Then why not change it to:
#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
This will make sure that you cannot get outside the boundaries of the
lookup table.

It'll make the tolower implementation buggy. Because in this case
tolower will successfully return if called with a negative integer
which maps to a valid uppercase letter after unsigned wrap.

If I look at the man-page for toupper it says:
If c is not an unsigned char value, or EOF, the behaviour of these
functions is undefined.
(I know it is not the standard, but I don't have the standard at hand,

This is exactly what standard says.

and this is closest to a definition I can get)

If you first check for EOF and if not EOF return the value from the
array you comply with this statement.

*Yes*. Then why do you need an unsigned char cast?

If you don't give a value that toupper/tolower accepts (e.g. you give a
negative integer), you will get undefined behavior with *that*
implementation.

You are changing an undefined behavior to a **wrong output** with the
unsigned char cast.

The ctype function output with unsigned char cast arguments
is reasonable, especially if you consider that fputc and
functions described in terms of fputc, like putchar,
use the value of their arguments converted to unsigned char.

Douglas A. Gwyn · Aug 9, 2005

aegis said:
Why should something such as:
tolower(-10); invoke undefined behavior?
It obviously has something to do with how tolower can be implemented,
but I can't think of anything concrete.

We discussed this not very long ago.

The obvious implementation is:
#define tolower(c) (__lowercase[(c)+1])
and if arbitrary integer values had to be accommodated
(large positive is also a problem), the table would be
far larger than necessary, for no benefit whatever for
correct programs. An alternative would be to use a
function call, with an explicit range check and then a
table look-up, which is much slower than the above.
That's the kind of trade-off that C is generally
unwilling to make, although it may be appropriate for
a more baby-proof PL.
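
For comparison, the range-checked alternative might look like the sketch below. The name checked_tolower is hypothetical, and returning an out-of-range argument unchanged is a choice made for this example, not something the standard prescribes. For brevity the sketch delegates to the library's tolower() rather than to its own table.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical wrapper: validates the argument before delegating.
   Every call pays for the comparisons, which is the cost the post
   refers to. */
int checked_tolower(int c)
{
    if (c == EOF || (c >= 0 && c <= UCHAR_MAX))
        return tolower(c);
    return c; /* out-of-range input is passed through unchanged */
}
```

A call such as checked_tolower(-10) then has defined behaviour, at the price of extra work on every call made by a correct program.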

kuyper · Aug 9, 2005

Krishanu Debnath wrote:
....

You are changing an undefined behavior to a **wrong output** with the
unsigned char cast.

There's no such thing as "wrong output" when the behavior is undefined.
In the C standard, "undefined behavior" means behavior for which the C
standard provides no definition. None. Not any. Whatsoever. Of any
kind. In particular, the C standard doesn't define the behavior in any
way which prohibits producing the result his unsigned char cast would
produce.

Douglas A. Gwyn · Aug 9, 2005

kuyper said:
There's no such thing as "wrong output" when the behavior is undefined.

I think he meant that the programmer is defining the behavior,
but that the defined behavior might not make sense. Note that
the original example (negative int values) didn't make sense
either.

I think the only valid concern is that tolower(char_type) might
be invoked mistakenly, for some negative (char) value. This
won't happen for the basic character set, nor for the most
common codesets for *defined* character codes, but could happen
on some platforms if random garbage values are passed to
tolower(). In practice this could occur when the character
codes come from a hostile user, for example. The most likely
actual risk is denial of service due to crashing the process
with an illegal memory reference.

The "more secure library" TR under current development by WG14
is meant to provide a "drop-in" (easy automated editing) way to
catch such abuses in existing, not-so-carefully-constructed
applications. The alternative is to do a better job in the
original design and coding.

pete · Aug 10, 2005

Douglas said:
I think he meant that the programmer is defining the behavior,
but that the defined behavior might not make sense.

But it does make sense.
If you have a negative integer value like:
('A' - 1 - (unsigned char)-1)

then
putchar('A' - 1 - (unsigned char)-1)
returns 'A'.

and
tolower((unsigned char)('A' - 1 - (unsigned char)-1))
returns 'a'
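
The arithmetic can be checked with a few lines of C. This is only a sketch; wrapped_value and as_unsigned_char are names invented here. Since (unsigned char)-1 is UCHAR_MAX, the expression equals 'A' - (UCHAR_MAX + 1), which is negative and converts back to 'A' as an unsigned char.

```c
#include <ctype.h>

/* The negative value from the post: 'A' - 1 - UCHAR_MAX. */
int wrapped_value(void)
{
    return 'A' - 1 - (unsigned char)-1;
}

/* Conversion to unsigned char reduces the value modulo UCHAR_MAX + 1,
   which is what fputc() and putchar() do with their argument. */
int as_unsigned_char(int v)
{
    return (unsigned char)v;
}
```

Passing as_unsigned_char(wrapped_value()) to tolower() is therefore the same as passing 'A'.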

Antoine Leca · Aug 10, 2005

Douglas A. Gwyn said:
I think the only valid concern is that tolower(char_type) might
be invoked mistakenly, for some negative (char) value. This
won't happen for the basic character set,

Agreed.

nor for the most common codesets for *defined* character codes,

Disagree.

One side of the problem is the definition of the character set. Due to:
1) the overcrowding of the 000-0177 range in ASCII
2) the widespread use of 8-bit bytes
many if not all extended character sets these days (usable in char and
compatible with the basic character set of the architecture) define
characters in the 08/00-15/15 range, that is, with the 8th bit set.

On the other hand, for various reasons, not all compilers/implementations
that allow use of these extended character sets switch char to be an
unsigned type. Of course, when the basic character set is EBCDIC, this is
required. But the standard is written in a way that allows the use of e.g.
iso-8859-1 as the character set while having CHAR_MAX==127 (and in fact
this is a very frequent setup in Western Europe).

And in such a case, 'ä' is negative... (and is different from the result of
getc() if ä is in the stream :-( )

Which leads to a whole set of complications involving many uses of unsigned
char casts.

As a result, I agree that a correctly programmed application should not fall
in the trap (and a current test here in Europe is to input ÿ to see how the
tested app reacts... 'ÿ' is -1 in iso-8859-1 codeset); but it is fairly easy
to be trapped, particularly when the application is ported.
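
The mismatch can be sketched without reading a real stream. The function names below are hypothetical, and the example assumes an 8-bit char with the iso-8859-1 code 0xE4 for 'ä'. It compares the same byte seen through a plain char with the value getc() would deliver, which is always the unsigned char value converted to int.

```c
#include <limits.h>
#include <string.h>

/* True when plain char is a signed type on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Stores the byte 0xE4 in a plain char and returns it as an int, the
   way an expression using that char would see it. */
int byte_as_plain_char(void)
{
    unsigned char raw = 0xE4;
    char c;
    memcpy(&c, &raw, 1);
    return c;
}

/* What getc() would return for the same byte: never negative. */
int byte_as_getc_result(void)
{
    return (unsigned char)0xE4;
}

/* The form that is safe to hand to the <ctype.h> functions. */
int byte_for_ctype(void)
{
    return (unsigned char)byte_as_plain_char();
}
```

Where plain char is signed, byte_as_plain_char() is negative and differs from byte_as_getc_result(), while byte_for_ctype() agrees with the stream value in both cases.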

but could happen on some platforms if random garbage values
are passed to tolower().

As I wrote, not only random garbage but also perfectly valid inputs on some
imperfect programs.

In practice this could occur when the character codes come
from a hostile user, for example.

Of course this leads to a risk, as you describe.
But I do not like the idea that what is genuinely a bug would be corrected
not because it harms anybody except the Americans/English-speaking people,
but only because some hostile hackers could turn it into a weapon...

;-) in case you missed it.

The "more secure library" TR under current development by WG14

Didn't it change its name to "safer"?
(http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1114.htm)

BTW, the "safer" library goes quite a bit further than tagging use of
negative value to tolower(). You can have some overview by reading
http://msdn.microsoft.com/library/8ef0s5kh.aspx or
http://msdn2.microsoft.com/library/wd3wzwts.aspx (MS is the sponsor of this
TR, so its implementation leads.)
In a nutshell, /many/ functions of the standard library are superseded, and
this may need a significant effort to bring an existing tree up to par.

Antoine
