clarification on character handling

Discussion in 'C Programming' started by aegis, Aug 8, 2005.

  1. aegis

    aegis Guest

    7.4#1 states
    The header <ctype.h> declares several functions useful for classifying
    and mapping characters.166) In all cases the argument is an int, the
    value of which shall be representable as an unsigned char or shall
    equal the value of the macro EOF. If the
    argument has any other value, the behavior is undefined.

    Why should something such as:
    tolower(-10); invoke undefined behavior?

    It obviously has something to do with how tolower can be implemented,
    but I can't think of anything concrete.

    --
    aegis
    aegis, Aug 8, 2005
    #1

  2. aegis wrote:
    > 7.4#1 states
    > The header <ctype.h> declares several functions useful for classifying
    > and mapping characters.166) In all cases the argument is an int, the
    > value of which shall be representable as an unsigned char or shall
    > equal the value of the macro EOF. If the
    > argument has any other value, the behavior is undefined.
    >
    > Why should something such as:
    > tolower(-10); invoke undefined behavior?


    More to the point, what should it be if _not_ UB?

    > It obviously has something with how tolower can be implemented,
    > but I can't think of anything concrete.


    Consider a simple look up table (and the fact that EOF is quite
    often and deliberately set at -1). The toxxxx() macros and functions
    are often implemented in this way...

    unsigned char _flags[257] = { 0, .... };

    #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)

    If you try tolower(-10), then the element referenced is not within
    the specified array. It's no different to tolower(32767) on an 8-bit
    char system. Why would you _expect_ some defined behaviour?

    --
    Peter
    Peter Nilsson, Aug 8, 2005
    #2

  3. RAJU

    Hi aegis,

    The expected argument to tolower(c) is given in the specification.
    What happens when an unexpected argument is passed is not specified. It's
    left to the compiler writers' own implementation, so the result is
    compiler/system dependent.

    It's the programmer's responsibility to avoid this kind of scenario. These
    C functions return no error code. This is very common
    in the C standard.

    Regards,
    Raju




    aegis wrote:
    > 7.4#1 states
    > The header <ctype.h> declares several functions useful for classifying
    > and mapping characters.166) In all cases the argument is an int, the
    > value of which shall be representable as an unsigned char or shall
    > equal the value of the macro EOF. If the
    > argument has any other value, the behavior is undefined.
    >
    > Why should something such as:
    > tolower(-10); invoke undefined behavior?
    >
    > It obviously has something with how tolower can be implemented,
    > but I can't think of anything concrete.
    >
    > --
    > aegis
    RAJU, Aug 8, 2005
    #3
  4. CBFalconer

    aegis wrote:
    >
    > 7.4#1 states
    > The header <ctype.h> declares several functions useful for
    > classifying and mapping characters.166) In all cases the argument
    > is an int, the value of which shall be representable as an
    > unsigned char or shall equal the value of the macro EOF. If the
    > argument has any other value, the behavior is undefined.
    >
    > Why should something such as:
    > tolower(-10); invoke undefined behavior?
    >
    > It obviously has something with how tolower can be implemented,
    > but I can't think of anything concrete.


    Many systems have an array of bits with masks, such that the array
    can be indexed by the value of the character + 1. If the value of
    EOF is -1 this maps into a normal 0 based array, if EOF is
    something else appropriate code can correct. The bits have
    significance as to whether the character is upper case, lower case,
    printable, numeric, etc. A single index and mask can return the
    appropriate characteristic.

    Negative (-ve) input values other than EOF foul this up, and result
    in illegal memory accesses.
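
A minimal sketch of the index-and-mask table described above. The names (my_flags, MY_UPPER, my_isupper and so on) are hypothetical, and the sketch assumes EOF is -1; no particular library is claimed to look like this.

```c
#include <limits.h>

/* Hypothetical flag bits; a real library picks its own names and values. */
enum { MY_UPPER = 0x01, MY_LOWER = 0x02, MY_DIGIT = 0x04 };

/* Slot 0 is for EOF (assumed to be -1 here); slots 1..UCHAR_MAX+1 hold the
   flags for character values 0..UCHAR_MAX. */
static unsigned char my_flags[UCHAR_MAX + 2];

/* Set one flag bit for every character in the given string. */
static void my_mark(const char *chars, unsigned char flag)
{
    for (; *chars != '\0'; chars++)
        my_flags[(unsigned char)*chars + 1] |= flag;
}

static void my_ctype_init(void)
{
    my_mark("ABCDEFGHIJKLMNOPQRSTUVWXYZ", MY_UPPER);
    my_mark("abcdefghijklmnopqrstuvwxyz", MY_LOWER);
    my_mark("0123456789", MY_DIGIT);
}

/* One index and one mask per query. Only valid for c == -1 or
   0 <= c <= UCHAR_MAX; anything else indexes outside my_flags. */
static int my_isupper(int c) { return my_flags[c + 1] & MY_UPPER; }
static int my_islower(int c) { return my_flags[c + 1] & MY_LOWER; }
static int my_isdigit(int c) { return my_flags[c + 1] & MY_DIGIT; }
```

With this layout, an argument of -10 would compute index -9, which is exactly the out-of-bounds access the standard declines to define.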

    --
    Chuck F () ()
    Available for consulting/temporary embedded and systems.
    <http://cbfalconer.home.att.net> USE worldnet address!
    CBFalconer, Aug 8, 2005
    #4
  5. "aegis" <> writes:
    > 7.4#1 states
    > The header <ctype.h> declares several functions useful for
    > classifying and mapping characters.166) In all cases the argument is
    > an int, the value of which shall be representable as an unsigned
    > char or shall equal the value of the macro EOF. If the argument has
    > any other value, the behavior is undefined.
    >
    > Why should something such as:
    > tolower(-10); invoke undefined behavior?
    >
    > It obviously has something with how tolower can be implemented,
    > but I can't think of anything concrete.


    I would say you have it backwards: the ways in which tolower can be
    implemented are defined by the specification, and the specification
    allows implementations to break on negative non-EOF input if that's
    the most convenient thing for them.

    --
    http://www.greenend.org.uk/rjk/
    Richard Kettlewell, Aug 8, 2005
    #5
  6. Antoine Leca

    In <news:>,
    aegis wrote:
    > Why should something such as:
    > tolower(-10); invoke undefined behavior?


    Because historically it does (out of bounds access), and it was not deemed
    worthwhile to give it a reasonable behaviour (which one, by the way?)


    Antoine
    Antoine Leca, Aug 8, 2005
    #6
  7. Antoine Leca

    Sorry if I am too picky. I do not know what the original poster's point
    was, but since he posted to both comp.lang.c and comp.std.c, he perhaps
    wants to make a point about toxxx() vs. isxxx().

    In <news:>,
    Peter Nilsson wrote:
    > The toxxxx() macros and functions are often implemented in this way...
    >
    > unsigned char _flags[257] = { 0, .... };
    >
    > #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)


    This is unlikely to work correctly on a large scale (and *_flags can't be
    0);
    furthermore your _flags[] array cannot be shared with toupper(), which makes
    its name pretty misleading.

    Also, implementations of tolower() and toupper() as macros using the
    classification array lookup, like
    #define tolower(x) ((x) ^ _flags[(x) + 1] & _upper_case_flag)
    (with an adequately chosen _upper_case_flag, i.e. 0x20 for ASCII and 0x40
    for EBCDIC) do not comply with the C standard, because the x argument is
    evaluated twice.

    The other obvious "solution",
    #define tolower(x) (_locale_dependent_array_for_tolower[(x) + 1])
    is difficult to get working correctly according to the specifications,
    because you should return an int, including for EOF (which is negative) and
    UCHAR_MAX (which is positive), so the type of the element of the array
    cannot in general be a character type; and the resulting increase in width
    wastes memory. As a result, many implementations do not provide tolower()
    and toupper() as macros, only as functions.
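
The double-evaluation problem mentioned above can be seen with a small sketch. TWICE_TOLOWER is a hypothetical macro (a range test rather than a table lookup, and valid only where 'A'..'Z' are contiguous); like the lookup-and-xor form, it names its argument more than once.

```c
/* Hypothetical macro that mentions its argument several times. */
#define TWICE_TOLOWER(x) \
    (((x) >= 'A' && (x) <= 'Z') ? (x) + ('a' - 'A') : (x))

static int calls;   /* how many times next_char() has run */

/* Stand-in for an argument with a side effect, such as *p++ or getc(fp). */
static int next_char(void)
{
    calls++;
    return 'Q';
}

/* Counts how often the argument expression runs in a single macro use. */
static int evaluations_for_one_use(void)
{
    calls = 0;
    (void)TWICE_TOLOWER(next_char());
    return calls;
}
```

A real function called as tolower(next_char()) would run next_char() once; the macro above runs it once per mention that is actually reached.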


    Antoine
    Antoine Leca, Aug 8, 2005
    #7
  8. "Peter Nilsson" <> writes:
    > aegis wrote:
    >> 7.4#1 states
    >> The header <ctype.h> declares several functions useful for classifying
    >> and mapping characters.166) In all cases the argument is an int, the
    >> value of which shall be representable as an unsigned char or shall
    >> equal the value of the macro EOF. If the
    >> argument has any other value, the behavior is undefined.
    >>
    >> Why should something such as:
    >> tolower(-10); invoke undefined behavior?

    >
    > More to the point, what should it be if _not_ UB?


    If plain char is signed, it would be sensible to define the various
    functions to work properly with signed values, including negative
    values. All the characters of the basic character set are required to
    be positive, but it would be nice to be able to do something like:

    char c = some_arbitrary_value;
    if (isupper(c)) {
        do_something();
    }
    else {
        do_something_else();
    }

    The need to cast the argument to unsigned char is well documented, but
    IMHO counterintuitive.
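
For reference, a short sketch of the cast idiom. count_upper is a hypothetical helper name; the only standard calls used are isupper() and the conversion to unsigned char.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts the uppercase letters in a string (hypothetical helper). */
static size_t count_upper(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* The cast maps a possibly negative plain char into the range
           0..UCHAR_MAX, which is what <ctype.h> requires. */
        if (isupper((unsigned char)*s))
            n++;
    }
    return n;
}
```

Without the cast, a string containing bytes above 127 would pass negative values to isupper() on platforms where plain char is signed.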

    The restriction to non-negative values and EOF makes things slightly
    easier for the implementation, and slightly more difficult for the
    programmer. This may have been a good tradeoff when the functions
    were first defined; I don't think it is now.

    I've seen implementations of <ctype.h> that work properly for values
    from -128 to +255, covering both signed and unsigned characters.
    There is an overlap between EOF (typically -1) and whatever character
    is encoded as -1 (lowercase-y-with-diaeresis in Latin-1, I think), but
    that's not a problem in the default locale, since all the functions
    happen to return the same value for EOF and that character.

    >> It obviously has something with how tolower can be implemented,
    >> but I can't think of anything concrete.

    >
    > Consider a simple look up table (and the fact that EOF is quite
    > often and deliberately set at -1). The toxxxx() macros and functions
    > are often implemented in this way...
    >
    > unsigned char _flags[257] = { 0, .... };
    >
    > #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
    >
    > If you try tolower(-10), then the element referenced is not within
    > the specified array. It's no different to tolower(32767) on an 8-bit
    > char system. Why would you _expect_ some defined behaviour?


    This approach can handle negative values sensibly by changing the
    offset value and making the array bigger.

    Of course, since the standard doesn't require implementations to do
    this, portable code still needs to make sure the argument is either
    EOF or a non-negative value.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    We must do something. This is something. Therefore, we must do this.
    Keith Thompson, Aug 8, 2005
    #8
  9. Peter Nilsson wrote:
    > Consider a simple look up table (and the fact that EOF is quite
    > often and deliberately set at -1). The toxxxx() macros and functions
    > are often implemented in this way...
    >
    > unsigned char _flags[257] = { 0, .... };
    >
    > #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
    >
    > If you try tolower(-10), then the element referenced is not within
    > the specified array.


    Then why not change it to:
    #define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
    This will make sure that you cannot get outside the boundaries of the
    lookup table.

    Kind regards,
    Johan

    --
    o o o o o o o . . . _____J_o_h_a_n___B_o_r_k_h_u_i_s___
    o _____ || http://www.borkhuis.com |
    .][__n_n_|DD[ ====_____ | |
    >(________|__|_[_________]_|________________________________|

    _/oo OOOOO oo` ooo ooo 'o!o!o o!o!o`
    == VxWorks FAQ: http://www.xs4all.nl/~borkhuis/vxworks/vxworks.html ==
    Johan Borkhuis, Aug 9, 2005
    #9
  10. Johan Borkhuis wrote:
    > Peter Nilsson wrote:
    > > Consider a simple look up table (and the fact that EOF is quite
    > > often and deliberately set at -1). The toxxxx() macros and functions
    > > are often implemented in this way...
    > >
    > > unsigned char _flags[257] = { 0, .... };
    > >
    > > #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
    > >
    > > If you try tolower(-10), then the element referenced is not within
    > > the specified array.

    >
    > Then why not change it to:
    > #define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
    > This will make sure that you cannot get outside the boundaries of the
    > lookup table.
    >


    It'll make the tolower implementation buggy. Because in this case
    tolower will successfully return if called with a negative integer
    which maps to a valid uppercase letter after unsigned wrap.

    Krishanu
    Krishanu Debnath, Aug 9, 2005
    #10
  11. Krishanu Debnath wrote:
    > Johan Borkhuis wrote:
    >
    >>Peter Nilsson wrote:
    >>
    >>>Consider a simple look up table (and the fact that EOF is quite
    >>>often and deliberately set at -1). The toxxxx() macros and functions
    >>>are often implemented in this way...
    >>>
    >>> unsigned char _flags[257] = { 0, .... };
    >>>
    >>> #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
    >>>
    >>>If you try tolower(-10), then the element referenced is not within
    >>>the specified array.

    >>
    >>Then why not change it to:
    >>#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
    >>This will make sure that you cannot get outside the boundaries of the
    >>lookup table.
    >>

    >
    >
    > It'll make the tolower implementation buggy. Because in this case
    > tolower will successfully return if called with a negative integer
    > which maps to a valid uppercase letter after unsigned wrap.


    If I look at the man-page for toupper it says:
    If c is not an unsigned char value, or EOF, the behaviour of these
    functions is undefined.
    (I know it is not the standard, but I don't have the standard at hand,
    and this is closest to a definition I can get)

    If you first check for EOF and if not EOF return the value from the
    array you comply with this statement.

    Kind regards,
    Johan

    Johan Borkhuis, Aug 9, 2005
    #11
  12. On 9 Aug 2005 00:00:58 -0700, Krishanu Debnath
    <> wrote:

    > Johan Borkhuis wrote:
    >> Peter Nilsson wrote:
    >> > Consider a simple look up table (and the fact that EOF is quite
    >> > often and deliberately set at -1). The toxxxx() macros and functions
    >> > are often implemented in this way...
    >> >
    >> > unsigned char _flags[257] = { 0, .... };
    >> >
    >> > #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
    >> >
    >> > If you try tolower(-10), then the element referenced is not within
    >> > the specified array.

    >>
    >> Then why not change it to:
    >> #define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
    >> This will make sure that you cannot get outside the boundaries of the
    >> lookup table.


    Incidentally, the #define you are all using is for islower(), not
    tolower(): it looks the character up in a table and selects a bit.
    But a similar thing can be done for tolower() etc. using a lookup table
    so that it doesn't result in multiple evaluation of the argument
    (although it isn't safe to assume that the argument is evaluated only
    once).

    > It'll make the tolower implementation buggy. Because in this case
    > tolower will successfully return if called with a negative integer
    > which maps to a valid uppercase letter after unsigned wrap.


    That doesn't matter (the effect is "undefined" if the character is out
    of range, so whether it crashes, returns an incorrect result or causes
    demons to fly out of your nose is up to the implementation). More
    importantly it fails on EOF (and of course the +1 in the index is now
    not needed because (unsigned char)(x) can never be negative).

    A better implementation, as someone else mentioned, is to map all of the
    characters from CHAR_MIN to UCHAR_MAX into the array:

    #define islower(x) (_flags[(x) - CHAR_MIN] & _lower_case_flag)

    This still has the problem that EOF will typically map onto one of the
    other characters with a negative representation in signed char, but
    that's the risk you take. If you want to make sure that the character
    (char)EOF is treated as a real character, you still need to cast it to
    unsigned char first.
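
A sketch of such a widened table, assuming hypothetical names (my_wide_flags, my_wide_islower). It uses SCHAR_MIN rather than CHAR_MIN so that negative values are covered even where plain char is unsigned, and it subtracts the (negative) minimum to get a zero-based index.

```c
#include <limits.h>

enum { MY_WIDE_LOWER = 0x02 };   /* hypothetical flag bit */

/* One slot for every value from SCHAR_MIN up to UCHAR_MAX. */
static unsigned char my_wide_flags[UCHAR_MAX - SCHAR_MIN + 1];

static void my_wide_init(void)
{
    const char *p;
    /* Basic-character-set letters are guaranteed non-negative, so each
       needs only one slot. Characters with the top bit set would need
       two slots each: their signed and their unsigned value. */
    for (p = "abcdefghijklmnopqrstuvwxyz"; *p != '\0'; p++)
        my_wide_flags[*p - SCHAR_MIN] |= MY_WIDE_LOWER;
}

/* In range for SCHAR_MIN <= c <= UCHAR_MAX, which includes EOF == -1. */
static int my_wide_islower(int c)
{
    return my_wide_flags[c - SCHAR_MIN] & MY_WIDE_LOWER;
}
```

This only moves the boundary: values below SCHAR_MIN or above UCHAR_MAX are still outside the table.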

    (Or better still would be to change the standard and force plain char to
    be unsigned, but I doubt that will happen...)

    Chris C
    Chris Croughton, Aug 9, 2005
    #12
  13. Johan Borkhuis wrote:
    > Krishanu Debnath wrote:
    > > Johan Borkhuis wrote:
    > >
    > >>Peter Nilsson wrote:
    > >>
    > >>>Consider a simple look up table (and the fact that EOF is quite
    > >>>often and deliberately set at -1). The toxxxx() macros and functions
    > >>>are often implemented in this way...
    > >>>
    > >>> unsigned char _flags[257] = { 0, .... };
    > >>>
    > >>> #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
    > >>>
    > >>>If you try tolower(-10), then the element referenced is not within
    > >>>the specified array.
    > >>
    > >>Then why not change it to:
    > >>#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
    > >>This will make sure that you cannot get outside the boundaries of the
    > >>lookup table.
    > >>

    > >
    > >
    > > It'll make the tolower implementation buggy. Because in this case
    > > tolower will successfully return if called with a negative integer
    > > which maps to a valid uppercase letter after unsigned wrap.

    >
    > If I look at the man-page for toupper it says:
    > If c is not an unsigned char value, or EOF, the behaviour of these
    > functions is undefined.
    > (I know it is not the standard, but I don't have the standard at hand,


    This is exactly what standard says.

    > and this is closest to a definition I can get)
    >
    > If you first check for EOF and if not EOF return the value from the
    > array you comply with this statement.
    >


    *Yes*. Then why do you need an unsigned char cast?

    If you don't give a value that toupper/tolower accepts (e.g. you pass a
    negative integer), you will get undefined behavior with *that*
    implementation.

    You are changing an undefined behavior to a **wrong output** with the
    unsigned char cast.

    Krishanu
    Krishanu Debnath, Aug 9, 2005
    #13
  14. Krishanu Debnath wrote:
    >>If I look at the man-page for toupper it says:
    >>If c is not an unsigned char value, or EOF, the behaviour of these
    >>functions is undefined.
    >>(I know it is not the standard, but I don't have the standard at hand,

    >
    >
    > This is exactly what standard says.
    >
    >
    >>and this is closest to a definition I can get)
    >>
    >>If you first check for EOF and if not EOF return the value from the
    >>array you comply with this statement.

    >
    > *Yes*. Then why do you need a unsigned char cast?


    The main reason for the cast is to avoid negative index in an array.

    > You don't give a value that toupper/tolower accepts (e.g. a negative
    > integer), you will get an undefined behavior with *that*
    > implementation.


    You can also consider a segmentation fault undefined behaviour.

    > You are changing an undefined behavior to a **wrong output** with the
    > unsigned char cast.


    What is the definition of undefined behaviour? In this case the return
    of something (AKA Garbage in Garbage out) can be considered undefined
    behaviour (unless you consider the fact that because I defined it, it is
    no longer undefined, and thus not according to the standard.....).
    But as the output is undefined I don't think you can say that any output
    can be considered wrong.

    Kind regards,
    Johan

    Johan Borkhuis, Aug 9, 2005
    #14
  15. pete

    Krishanu Debnath wrote:
    >
    > Johan Borkhuis wrote:
    > > Krishanu Debnath wrote:
    > > > Johan Borkhuis wrote:
    > > >
    > > >>Peter Nilsson wrote:
    > > >>
    > > >>>Consider a simple look up table (and the fact that EOF is quite
    > > >>>often and deliberately set at -1). The toxxxx() macros and functions
    > > >>>are often implemented in this way...
    > > >>>
    > > >>> unsigned char _flags[257] = { 0, .... };
    > > >>>
    > > >>> #define tolower(x) (_flags[(x) + 1] & _lower_case_flag)
    > > >>>
    > > >>>If you try tolower(-10), then the element referenced is not within
    > > >>>the specified array.
    > > >>
    > > >>Then why not change it to:
    > > >>#define tolower(x) (_flags[(unsigned char)(x) + 1] & _lower_case_flag)
    > > >>This will make sure that you cannot get outside the boundaries of the
    > > >>lookup table.
    > > >>
    > > >
    > > >
    > > > It'll make the tolower implementation buggy. Because in this case
    > > > tolower will successfully return if called with a negative integer
    > > > which maps to a valid uppercase letter after unsigned wrap.

    > >
    > > If I look at the man-page for toupper it says:
    > > If c is not an unsigned char value, or EOF, the behaviour of these
    > > functions is undefined.
    > > (I know it is not the standard, but I don't have the standard at hand,

    >
    > This is exactly what standard says.
    >
    > > and this is closest to a definition I can get)
    > >
    > > If you first check for EOF and if not EOF return the value from the
    > > array you comply with this statement.
    > >

    >
    > *Yes*. Then why do you need a unsigned char cast?
    >
    > You don't give a value that toupper/tolower accepts (e.g. a negative
    > integer), you will get an undefined behavior with *that*
    > implementation.
    >
    > You are changing an undefined behavior to a **wrong output** with the
    > unsigned char cast.


    The ctype function output with unsigned char cast arguments
    is reasonable, especially if you consider that fputc and
    functions described in terms of fputc, like putchar,
    use the value of their arguments converted to unsigned char.

    --
    pete
    pete, Aug 9, 2005
    #15
  16. aegis wrote:
    > Why should something such as:
    > tolower(-10); invoke undefined behavior?
    > It obviously has something with how tolower can be implemented,
    > but I can't think of anything concrete.


    We discussed this not very long ago.

    The obvious implementation is:
    #define tolower(c) (__lowercase[(c)+1])
    and if arbitrary integer values had to be accommodated
    (large positive is also a problem), the table would be
    far larger than necessary, for no benefit whatever for
    correct programs. An alternative would be to use a
    function call, with an explicit range check and then a
    table look-up, which is much slower than the above.
    That's the kind of trade-off that C is generally
    unwilling to make, although it may be appropriate for
    a more baby-proof PL.
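
The range-checked alternative could look roughly like the sketch below. The names (my_lower_map, my_checked_tolower) are hypothetical, and returning out-of-range arguments unchanged is a choice made for this sketch, not something the standard requires.

```c
#include <limits.h>

/* Maps each unsigned char value to its lowercase form (or to itself). */
static unsigned char my_lower_map[UCHAR_MAX + 1];

static void my_lower_map_init(void)
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    int i;

    for (i = 0; i <= UCHAR_MAX; i++)
        my_lower_map[i] = (unsigned char)i;
    for (i = 0; upper[i] != '\0'; i++)
        my_lower_map[(unsigned char)upper[i]] = (unsigned char)lower[i];
}

/* The explicit check costs a compare and branch on every call, which is
   the overhead the plain table lookup avoids. */
static int my_checked_tolower(int c)
{
    if (c < 0 || c > UCHAR_MAX)
        return c;   /* EOF and other out-of-range values pass through */
    return my_lower_map[c];
}
```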
    Douglas A. Gwyn, Aug 9, 2005
    #16
  17. aegis

    Guest

    Krishanu Debnath wrote:
    ....
    > You are changing an undefined behavior to a **wrong output** with the
    > unsigned char cast.


    There's no such thing as "wrong output" when the behavior is undefined.
    In the C standard, "undefined behavior" means behavior for which the C
    standard provides no definition. None. Not any. Whatsoever. Of any
    kind. In particular, the C standard doesn't define the behavior in any
    way which prohibits producing the result his unsigned char cast would
    produce.
    , Aug 9, 2005
    #17
  18. wrote:
    > Krishanu Debnath wrote:
    > > You are changing an undefined behavior to a **wrong output** with the
    > > unsigned char cast.

    > There's no such thing as "wrong output" when the behavior is undefined.


    I think he meant that the programmer is defining the behavior,
    but that the defined behavior might not make sense. Note that
    the original example (negative int values) didn't make sense
    either.

    I think the only valid concern is that tolower(char_type) might
    be invoked mistakenly, for some negative (char) value. This
    won't happen for the basic character set, nor for the most
    common codesets for *defined* character codes, but could happen
    on some platforms if random garbage values are passed to
    tolower(). In practice this could occur when the character
    codes come from a hostile user, for example. The most likely
    actual risk is denial of service due to crashing the process
    with an illegal memory reference.

    The "more secure library" TR under current development by WG14
    is meant to provide a "drop-in" (easy automated editing) way to
    catch such abuses in existing, not-so-carefully-constructed
    applications. The alternative is to do a better job in the
    original design and coding.
    Douglas A. Gwyn, Aug 9, 2005
    #18
  19. pete

    Douglas A. Gwyn wrote:
    >
    > wrote:
    > > Krishanu Debnath wrote:
    > > > You are changing an undefined behavior
    > > > to a **wrong output** with the
    > > > unsigned char cast.

    > > There's no such thing as "wrong output"
    > > when the behavior is undefined.

    >
    > I think he meant that the programmer is defining the behavior,
    > but that the defined behavior might not make sense.


    But it does make sense.
    If you have a negative integer value like:
    ('A' - 1 - (unsigned char)-1)

    then
    putchar('A' - 1 - (unsigned char)-1)
    returns 'A'.

    and
    tolower((unsigned char)('A' - 1 - (unsigned char)-1))
    returns 'a'
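
The arithmetic can be checked with a small sketch. shifted_A and as_uchar are hypothetical helper names, and the sketch assumes UCHAR_MAX fits in an int and the default "C" locale.

```c
#include <ctype.h>

/* The negative value from the expression above: 'A' - (UCHAR_MAX + 1). */
static int shifted_A(void)
{
    return 'A' - 1 - (unsigned char)-1;
}

/* Conversion to unsigned char reduces the value modulo UCHAR_MAX + 1. */
static int as_uchar(int v)
{
    return (unsigned char)v;
}
```

Converting shifted_A() to unsigned char adds UCHAR_MAX + 1 back, giving 'A', so tolower(as_uchar(shifted_A())) is an ordinary in-range call.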

    --
    pete
    pete, Aug 10, 2005
    #19
  20. Antoine Leca

    In <news:>, Douglas A. Gwyn wrote:
    > I think the only valid concern is that tolower(char_type) might
    > be invoked mistakenly, for some negative (char) value. This
    > won't happen for the basic character set,


    Agreed.

    > nor for the most common codesets for *defined* character codes,


    Disagree.

    One side of the problem is the definition of character set. Due to:
    1) the overcrowded state of the 000-0177 range in ASCII
    2) the wide use of 8-bit bytes
    many if not all extended character sets these days (usable in char and
    compatible with the basic character set of the architecture) define
    characters in the 08/00-15/15 range, that is, with the 8th bit set.

    On the other hand, for various reasons, not all compilers/implementations
    that allow use of these extended character sets switch char to be an
    unsigned type. Of course, when the basic character set is EBCDIC, this is
    required. But the standard is written in a way that allows the use of e.g.
    iso-8859-1 as the character set while having CHAR_MAX==127 (and in fact
    this is a very frequent setup in Western Europe.)


    And in such a case, 'ä' is negative... (and is different from the result of
    getc() if ä is in the stream :-( )


    Which leads to a whole set of complications involving many uses of unsigned
    char casts.

    As a result, I agree that a correctly programmed application should not fall
    in the trap (and a current test here in Europe is to input ÿ to see how the
    tested app reacts... 'ÿ' is -1 in iso-8859-1 codeset); but it is fairly easy
    to be trapped, particularly when the application is ported.
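
The 'ÿ' trap can be sketched as follows. The helper names are hypothetical, and the sketch assumes 8-bit bytes; converting 0xFF to a signed plain char is implementation-defined, though it commonly yields -1.

```c
#include <limits.h>
#include <stdio.h>

/* Nonzero if a plain char holding byte 0xFF ('ÿ' in ISO-8859-1) compares
   equal to EOF when used without a cast. */
static int ff_collides_with_eof(void)
{
    char c = (char)0xFF;   /* implementation-defined if char is signed */
    return (int)c == EOF;
}

/* The value getc() would report for the same byte: always non-negative. */
static int ff_as_getc_would_return(void)
{
    char c = (char)0xFF;
    return (unsigned char)c;
}
```

On a platform with signed plain char and EOF equal to -1, the first function reports a collision while the second still yields 255, which is why the cast to unsigned char matters before calling the <ctype.h> functions.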


    > but could happen on some platforms if random garbage values
    > are passed to tolower().


    As I wrote, not only random garbage but also perfectly valid inputs on some
    imperfect programs.

    > In practice this could occur when the character codes come
    > from a hostile user, for example.


    Of course this leads to a risk, as you describe.
    But I do not like the idea that what is genuinely a bug would be corrected
    not because it harms anybody except the Americans/English-speaking people,
    but only because some hostile hackers could turn it into a weapon...

    ;-) in case you missed it.


    > The "more secure library" TR under current development by WG14


    Didn't it change its name to "safer"?
    (http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1114.htm)

    BTW, the "safer" library goes quite a bit further than tagging use of
    negative value to tolower(). You can have some overview by reading
    http://msdn.microsoft.com/library/8ef0s5kh.aspx or
    http://msdn2.microsoft.com/library/wd3wzwts.aspx (MS is the sponsor of this
    TR, so its implementation leads.)
    In a nutshell, /many/ functions of the standard library are superseded, and
    bringing an existing tree up to par may need a significant effort.


    Antoine
    Antoine Leca, Aug 10, 2005
    #20