QoI issue: silencing a warning

Discussion in 'C Programming' started by Noob, Mar 17, 2014.

  1. Noob

    Noob Guest

    [ NOTE : cross-posted to comp.lang.c and comp.unix.programmer,
    please trim as you see fit ]

    Hello,

    My compiler (gcc 4.7) is being a little fussy about the following code:
    (trimmed to a minimum)

    #include <ctype.h>
    int foo(const char *ext)
    {
    int ext_NNN = isdigit(ext[1]) && isdigit(ext[2]) && isdigit(ext[3]);
    return ext_NNN;
    }

    $ gcc -Wall -c tutu.c
    tutu.c: In function 'foo':
    tutu.c:4:3: warning: array subscript has type 'char' [-Wchar-subscripts]
    tutu.c:4:3: warning: array subscript has type 'char' [-Wchar-subscripts]
    tutu.c:4:3: warning: array subscript has type 'char' [-Wchar-subscripts]

    On this platform, isdigit seems to be a (complex) macro, and the preprocessor
    digests my function into...

    int foo(const char *ext)
    {
    int ext_NNN = (((__ctype_ptr__+sizeof(""[ext[1]]))[(int)(ext[1])])&04) && (((__ctype_ptr__+sizeof(""[ext[2]]))[(int)(ext[2])])&04) && (((__ctype_ptr__+sizeof(""[ext[3]]))[(int)(ext[3])])&04);
    return ext_NNN;
    }

    Sweet baby Jesus... IOCCC winner here.

    If I want to make the warnings disappear, I see two obvious work-arounds:

    1) Set -Wno-char-subscripts
    2) Cast isdigit's parameter to int (or to unsigned or to unsigned char)
    3) Something else?

    My questions :

    - Is something wrong with my code, or is this just a QoI issue?

    - How would you silence the warnings?

    Regards.
     
    Noob, Mar 17, 2014
    #1
    1. Advertisements

  2. isdigit() is defined as taking an int argument, not a char; cast your
    argument to int and the warnings will disappear.

    S
     
    Stephen Sprunk, Mar 17, 2014
    #2
    1. Advertisements

  3. The bug is in your code. Have a look at the specification of isdigit
    (not the implementation!); you will see that its parameter must either
    be EOF or the value of an unsigned char.
     
    Richard Kettlewell, Mar 17, 2014
    #3
  4. Noob

    Eric Sosman Guest

    Almost right, but not quite. The <ctype.h> functions take
    int arguments, as you say, but their values must be either EOF
    or (and this is the important bit) the value of an *unsigned*
    char as converted to int. So the correct cast is

    int ext_NNN = isdigit((unsigned char) ext[1]) ...

    and so on.

    It is tempting to save some typing by using an `unsigned char*'
    in the first place, as in

    int foo(const char *ext) {
    const unsigned char *uxt = (const unsigned char*) ext;
    int ext_NNN = isdigit(uxt[1]) ...

    I *think* this is okay, but type-punning always makes me just a
    little bit queasy, so I'd prefer (and recommend) the first option.
     
    Eric Sosman, Mar 17, 2014
    #4
  5. Noob

    Noob Guest

    So it is wrong to write:

    char foo[] = "abcd";
    isdigit(foo[1]);


    Lemme see...

    7.4 Character handling <ctype.h>
    The header <ctype.h> declares several functions useful for classifying
    and mapping characters. In all cases the argument is an int, the value of
    which shall be representable as an unsigned char or shall equal the value
    of the macro EOF. If the argument has any other value, the behavior is
    undefined.


    I don't understand your objection, in light of parameter conversion.
    (IIUC, the plain char argument is implicitly converted to an int.)

    6.5.2.2 Function calls
    Semantics
    7 If the expression that denotes the called function has a type that does include a prototype,
    the arguments are implicitly converted, as if by assignment, to the types of the
    corresponding parameters, taking the type of each parameter to be the unqualified version
    of its declared type. The ellipsis notation in a function prototype declarator causes
    argument type conversion to stop after the last declared parameter. The default argument
    promotions are performed on trailing arguments.

    What did I miss?

    Regards.
     
    Noob, Mar 17, 2014
    #5
  6. [snip]

    N1570 7.4p1:
    The header <ctype.h> declares several functions useful for
    classifying and mapping characters. In all cases the argument
    is an int, the value of which shall be representable as an
    unsigned char or shall equal the value of the macro EOF. If
    the argument has any other value, the behavior is undefined.

    Plain char may be a signed type, which means that a value of type char
    may be negative (and unequal to EOF), causing undefined behavior when
    it's passed to isdigit().

    Just cast the argument to unsigned char:

    int ext_NNN = isdigit((unsigned char)ext[1]) &&
    isdigit((unsigned char)ext[2]) &&
    isdigit((unsigned char)ext[3]);

    You're probably about to say that it's ridiculous that a function
    intended to operate on characters can't properly handle an argument of
    type char -- and I agree. I think it would have made more sense for
    isdigit() and friends to take an argument of type (plain) char. The
    problem is that these function are required to handle the value EOF
    (typically -1), which is *also* a value char value (if plain char is
    signed). I'm not convinced that permitting EOF as an argument was
    worthwhile, but we're stuck with it.

    On the other hand, when -1 is a valid char value, it typically
    *isn't* a digit, or a lowercase or uppercase letter, etc.
    That's a reasonable warning, since char values can be negative and a
    negative subscript value is usually an error. (Not always; you could be
    indexing from a pointer into the middle of an array.)
     
    Keith Thompson, Mar 17, 2014
    #6

  7. Because it doesn't work when "char" is signed; the char \x255
    would be converted to -127 and that is not a valid paramter
    to isdigit.

    Casper
     
    Casper H.S. Dik, Mar 17, 2014
    #7
  8. Noob

    Noob Guest

    OK, I think I see the problem.

    1. If my platform's plain chars are signed (which they are)
    2. and if I pass a value greater than CHAR_MAX (e.g. '©')
    3. and if the implementation of isdigit uses an array

    then a negative index sends it into la-la-land.
    I think I'll write an isdigit wrapper (or even roll my own).

    int myisdigit(char c)
    {
    return '0' <= c && c <= '9';
    }

    I think that's a portable solution, right?
    You read my mind. This is insane.
    The warning is good. I was surprised they used an array, and that
    isdigit was a macro. But it helped catch the problem, so... great!

    Regards.
     
    Noob, Mar 17, 2014
    #8
  9. Noob

    James Kuyper Guest

    Yes, but plain char can be signed, so values thereof can be negative.
    While the values of the basic execution character set are guaranteed to
    be non-negative, that is not guaranteed for the extended character set.
    When a negative char value is converted to int, its value remains
    unchanged, and since it is negative, is inherently incapable of being
    represented as an unsigned char.
     
    James Kuyper, Mar 17, 2014
    #9
  10. That's an interesting corner case that's actually guaranteed to work
    correctly, but it's still a bad idea.
    Yes, the plain char argument is implicitly converted to int,
    so isdigit(foo[1]) passes an argument of the correct *type*.
    The problem is that, in general, it's not guaranteed to be a valid
    *value*.

    Plain char may be either signed or unsigned. If it's signed, then
    a char value can be negative, resulting in a negative int value
    when it's converted from char to int. Unless the value happens to
    be equal to EOF, passing it to isdigit() has undefined behavior.
    (And if it does happen to be equal to EOF, then you're probably
    not doing what you want to do -- though you'll almost certainly get
    the correct result anyway, since (char)EOF is not a decimal digit
    in any character set I've ever heard of.)

    But as for your example:

    char foo[] = "abcd";
    isdigit(foo[1]);

    the character 'b' is a member of the basic execution character set,
    and all such characters are guaranteed to have nonnegative values
    (N1570 6.2.5p3).

    Nevertheless, just for the sake of consistency, you should always
    cast a char value to unsigned char before passing it any of the is*()
    or to*() functions declared in <ctype.h>.
     
    Keith Thompson, Mar 17, 2014
    #10
  11. Noob

    James Kuyper Guest

    It's UCHAR_MAX that matters, not CHAR_MAX, and in either case, that's
    not an possible problem with your code. For valid values of i, ext <=
    CHAR_MAX, and CHAR_MAX < UCHAR_MAX.
    It's the possibility that ext < 0 that is the problem.
    That conclusion is valid, even though it conflicts with the premise you
    used to reach that conclusion.
    Rolling your own does work for isdigit(), but this is NOT an ideal
    solution for most of the other <ctype.h> functions, because their
    behavior is locale-dependent. It's better to use a simple wrapper:

    inline int myisupper(char c) { return isupper((unsigned char)c); }
     
    James Kuyper, Mar 17, 2014
    #11
  12. (char)255 (which would be '\xff') would most likely be converted to -1.

    And if EOF == -1, which is usually the case, then it *is* a valid
    argument to isdigit, which just adds to the confusion. But negative
    char values in general are not valid arguments to isdigit(), and can
    cause undefined behavior.

    IMHO forcing plain char to be unsigned would make a lot more sense. The
    signedness of plain char was made implementation-defined for performance
    reasons; I don't know to what extent those reasons still apply.
     
    Keith Thompson, Mar 17, 2014
    #12
  13. Sorry, I completely failed in binary math.
    Yeah; unfortunately you can't change that as it is now encoded in
    ABIs.

    Casper
     
    Casper H.S. Dik, Mar 17, 2014
    #13
  14. A value of type char cannot be greater than CHAR_MAX.

    The problem is that '©' is very likely *negative* on your system.

    That's assuming you're using something like the 8-bit Latin-1 encoding.
    My system is conigured to use UTF-8 by default, so when I enter a
    literal '©' into a program using my text editor, I get a warning about a
    multi-character character constant.

    But you're more likely to get such a value by reading data from some
    outside source rather than having it in a string or character literal.
    I believe so -- but some of the is*() functions have locale-specific
    behavior. If you want to wrap the is*() functions, I suggest using a
    cast:

    int myisdigit(char c) {
    return isdigit((unsigned char)c);
    }

    You could even use a macro:

    #define ISDIGIT(c) isdigit((unsigned char)(c))

    [...]
    Any standard library function can be implemented as a macro.
    The is*() and to*() functions declared in <ctype.h> happen to be
    fairly easy to implement that way.
     
    Keith Thompson, Mar 17, 2014
    #14
  15. Apart from the "greater than CHAR_MAX" part, that's an accurate
    description of the way the problem shows up on your system.
    You attempted to index before the beginning of the array, not past
    its end.

    More generally, passing any value that does not satisfy:

    (arg >= 0 && arg <= UCHAR_MAX) || arg == EOF

    has undefined behavior.
     
    Keith Thompson, Mar 17, 2014
    #15
  16. Noob

    David Brown Guest

    An alternative could be to use the "-funsigned-char" option to make
    plain char unsigned. But I don't know what other consequences that
    might have on x86.
     
    David Brown, Mar 17, 2014
    #16
  17. Noob

    Ian Collins Guest

    Whatever the implementation of isdigit, why would a char subscript be
    worthy of a diagnostic? Yes char is usually signed, but so are short,
    int and long.
     
    Ian Collins, Mar 17, 2014
    #17
  18. Possibly to catch specifically this error.
     
    Barry Margolin, Mar 17, 2014
    #18
  19. Probably because a programmer using a char as a subscript very likely
    assumed he could use it to implement a lookup table covering the entire
    range of char values. (Which is exactly what happened here, though
    indirectly.)

    Use of a short, int, or long as an index is less likely to indicate an
    attempt to use a lookup table covering the entire range of the type.

    It's prone to both false negatives and false positives, but most
    heuristics are.
     
    Keith Thompson, Mar 18, 2014
    #19
  20. (snip)
    Seems to me the best suggestion. When you could reasonably rely
    on character data being plain ASCII, there might have been some
    argument against it, especially on processors that included a way
    to convert signed char to int that didn't conveniently also allow
    unsigned char to int.

    As well as I know it, the PDP-11 was popular in the days that
    decision might have been made. Maybe VAX, also.

    x86 has a complicated history that traces back at least to the 8080.
    The 8086 16 bit registers (AX, BX, CX, DX) could also be referened
    as two 8 bit registers (AL, AH, BL, BH, etc). In that case, I believe
    it would be more natural (easier) to zero the high byte than to sign
    extend it, but it is about as easy either way.

    (The 8086 was designed to allow a simple source translation
    from 8080 assembly, assigning specific 8086 registers to matching 8080
    registers.)

    VAX has CVTBW and CVTBL for sign extended byte to (16 bit) word
    and (32 bit) longword. There are also MOVZBW and MOVZBL for
    zero extended move. Seems to me that it is about as easy either way.

    -- glen
     
    glen herrmannsfeldt, Mar 18, 2014
    #20
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.