VLA question

Discussion in 'C Programming' started by Philip Lantz, Jun 14, 2013.

  1. I'll admit that I didn't quite understand the relevance the first time;
    you added some clarification this time (plus some of the other points
    discussed have started to sink in), so now I think I get it.
    This is indeed an interesting property of such systems, and one with
    unexpectedly far-reaching implications.
    I'd wondered about that, since the usual excuse for fgetc() returning an
    int is to allow for EOF, which is presented by most introductory texts
    as being impossible to mistake for a valid character.
    That only holds if plain char is unsigned, right?

    It seems these seemingly-unrelated restrictions would not apply if plain
    char were signed, which would be the (IMHO only) logical choice if
    character literals were signed.
    Perhaps "insane" was a bit strong, but I see no rational excuse for the
    signedness of plain chars and character literals to differ; the two are
    logically linked, and only C's definition of the latter as "int" even
    allows such a bizarre case to exist in theory.

    IMHO, that C++ implicitly requires the signedness of the two to match,
    apparently without problems, is an argument in favor of adopting the
    same rule in C. As long as the signedness matches, none of the problems
    mentioned in this thread would come up--and potentially break code that
    was not written to account for this unlikely corner case.
    I'm in no position to complain about that.
    I'm not arguing for the _probable_ existence of such systems as much as
    admitting that I don't have enough experience with atypical systems to
    have much idea what's really out there on the fringes, other than
    various examples given here since I've been reading. The world had
    pretty much standardized on twos-complement systems with flat, 32-bit
    address spaces by the time I started using C; 64-bit systems were my
    first real-world experience with having to think about variations in the
    sizes of base types--and even then usually only pointers.

    S
     
    Stephen Sprunk, Jul 2, 2013
    #61

  2. On most systems, including the ones where C was first developed, that's
    perfectly true. But the C standard allows an implementation where that's
    not true to still be fully conforming. This does not "break" fgetc(), as
    some have claimed, since you can still use feof() and ferror() to
    determine whether an EOF value indicates success, failure, or
    end-of-file; but in principle it does make use of fgetc() less convenient.
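
    Something along those lines (a minimal sketch of the disambiguation; on
    ordinary systems the fall-through branch can never be reached):

        #include <stdio.h>

        /* Read one byte; separate data, end-of-file, and error even on an
           implementation where a successful fgetc() could return EOF. */
        int read_byte(FILE *fp, int *out)
        {
            int c = fgetc(fp);
            if (c == EOF) {
                if (feof(fp))
                    return 0;      /* genuine end-of-file */
                if (ferror(fp))
                    return -1;     /* read error          */
                /* neither flag set: EOF was an actual character value */
            }
            *out = c;
            return 1;              /* got a character */
        }
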
    Correct - most of what I've been saying has been explicitly about
    platforms where CHAR_MAX > INT_MAX, which would not be permitted if char
    were signed. "For any two integer types with the same signedness and
    different integer conversion rank (see 6.3.1.1), the range of values of
    the type with smaller integer conversion rank is a subrange of the
    values of the other type." (6.2.5p8)

    ....
    I agree that the C++ approach makes more sense - I'm taking issue only
    with your characterization of C code which relies upon the C approach as
    "broken". I also think it's unlikely that the C committee would decide
    to change this, even though I've argued that the breakage that could
    occur would be fairly minor.

    You've seen how many complicated ideas and words I've had to put
    together to construct my arguments for the breakage being minor. The
    committee would have to be even more rigorous in considering the same
    issues. The fact that there could be any breakage at all (and there can
    be) means that there would have to be some pretty significant
    compensating advantages for the committee to decide to make such a
    change. Despite agreeing with the C++ approach, I don't think the
    advantages are large enough to justify such a change.
     
    James Kuyper, Jul 2, 2013
    #62

  3. I was tired and in a hurry to go home, and didn't put enough thought
    into my response. Such an implementation would violate 6.2.5p8:

    "For any two integer types with the same signedness and different
    integer conversion rank (see 6.3.1.1), the range of values of the type
    with smaller integer conversion rank is a subrange of the values of the
    other type."
     
    James Kuyper, Jul 2, 2013
    #63
  4. On 06/29/2013 02:05 PM, Keith Thompson wrote:
    ....
    I've just posted an argument on a different branch of this thread that
    7.21.2p3 indirectly implies that on systems where UCHAR_MAX > INT_MAX,
    given an unsigned character c and a valid int i, we must have

    (unsigned char)(int)c == c

    and

    (int)(unsigned char)i == i

    Comment?
     
    James Kuyper, Jul 2, 2013
    #64
  5. Depending on your definition of valid character. My understanding
    is that an ASCII-7 system can use a signed 8-bit char, but EBCDIC
    eight-bit systems should use unsigned char. (No systems ever
    used the ASCII-8 code that IBM designed into S/360.)

    A Unicode-based system could use a 16-bit unsigned char, like
    Java does.

    "Valid character" doesn't mean anything whose bit pattern you can
    put out, but an actual character in the input character set.

    -- glen
     
    glen herrmannsfeldt, Jul 2, 2013
    #65
  6. For this purpose, a valid character is anything that can be returned by
    a successful call to fgetc(). Since I can fill a buffer with unsigned
    char values from 0 to UCHAR_MAX, and write that buffer to a binary
    stream, with a guarantee of being able to read the same values back, I must
    respectfully disagree with the following assertion:

    ....
    Do you think that the only purpose for fgetc() is to read text files?
    All C input, whether from text streams or binary, has behavior defined
    by the standard in terms of calls to fgetc(), whether or not actual
    calls to that function occur.
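
    A minimal sketch of that round trip, assuming a hosted implementation
    (tmpfile() opens a binary stream):

        #include <stdio.h>
        #include <limits.h>

        int main(void)
        {
            FILE *f = tmpfile();     /* temporary file, binary update mode */
            if (f == NULL)
                return 1;

            for (int v = 0; v <= UCHAR_MAX; v++)
                fputc(v, f);         /* write every unsigned char value */

            rewind(f);

            for (int v = 0; v <= UCHAR_MAX; v++) {
                int c = fgetc(f);
                if (c != v)
                    printf("mismatch: wrote %d, read %d\n", v, c);
            }
            /* On systems where UCHAR_MAX <= INT_MAX, none of these successful
               reads can have returned EOF, since EOF is negative. */
            fclose(f);
            return 0;
        }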
     
    James Kuyper, Jul 2, 2013
    #66
  7. [...]

    I wouldn't call '\xff' (or '\xffff' for CHAR_BIT==16) contrived.
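
    A quick check on a typical implementation with signed 8-bit plain char
    (the exact value of '\xff' is implementation-defined, so this is an
    assumption about common practice rather than a guarantee):

        #include <stdio.h>

        int main(void)
        {
            /* With signed 8-bit char, '\xff' is an int with value -1,
               while 0xff is 255 -- a negative character constant with no
               contrivance needed. */
            printf("'\\xff' = %d\n", '\xff');
            printf("0xff   = %d\n", 0xff);
            return 0;
        }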
     
    Keith Thompson, Jul 2, 2013
    #67
  8. I agree.

    I sometimes wonder how much thought the committee put into making
    everything consistent for "exotic" systems, particularly those with
    char and int having the same size (which implies CHAR_BIT >= 16).
    I'm fairly sure that most C programmers don't put much thought
    into it.

    For most systems, having fgetc() return EOF reliably indicates that
    there were no more characters to read, and that exactly one of feof()
    or ferror() will then return true, and I think most C programmers
    rely on that assumption. That assumption can be violated only if
    CHAR_BIT >= 16.

    Even with CHAR_BIT == 8, storing the (non-EOF) result of fgetc() into a
    char object depends on the conversion to char (which is
    implementation-defined if plain char is signed) being particularly well
    behaved.
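
    A sketch of the idiom in question, with the narrowing step marked (the
    result of that conversion is the implementation-defined part):

        #include <stdio.h>

        void copy_stream(FILE *in, FILE *out)
        {
            int c;                       /* keep the result in an int ... */
            while ((c = fgetc(in)) != EOF) {
                char ch = (char)c;       /* ... and narrow only after the EOF
                                            test; implementation-defined for
                                            values above CHAR_MAX when plain
                                            char is signed */
                fputc(ch, out);
            }
        }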

    Are there *any* systems with sizeof (int) == 1 (implying CHAR_BIT >= 16)
    that support stdio? I know that some implementations for DSPs have
    CHAR_BIT > 8, but are they all freestanding?

    I wonder if we (well, the committee) should consider adding some
    restrictions for hosted implementations, such as requiring INT_MAX >
    CHAR_MAX or specifying the results of out-of-range conversions to plain
    or signed char.
     
    Keith Thompson, Jul 2, 2013
    #68
  9. On 07/02/2013 03:24 PM, Keith Thompson wrote:
    ....
    That sounds like a good idea to me. However, if there are any existing
    implementations that would become non-conforming as a result of such a
    change, it could be difficult (and properly so) to get it approved.
     
    James Kuyper, Jul 2, 2013
    #69
  10. Why would anyone use that syntax for a character literal, rather than
    the shorter 0xff (or 0xffff)? That strikes me as contrived.

    There are certain cases where using the escape syntax is reasonable,
    such as '\n', but even '\0' is more simply written as just 0. String
    literals are another matter entirely, but those already have type
    (pointer to) char--another argument in favor of character literals
    having type char.

    S
     
    Stephen Sprunk, Jul 2, 2013
    #70
  11. For the same reasons I use false, '\0', L'\0', u'\0', U'\0', 0L, 0LL,
    0U, 0UL, 0ULL, 0.0F, 0.0, or 0.0L, depending upon the intended use of
    the value, even though all of those constants have the same value. The
    form of the constant makes its intended use clearer. As a side benefit, in
    some of those cases, it shuts up a warning message from the compiler,
    though that doesn't apply to '\0'.
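
    A few representative declarations, just to illustrate (the names are
    mine):

        #include <stdbool.h>
        #include <stddef.h>

        bool done = false;            /* boolean intent          */
        char terminator = '\0';       /* character intent        */
        wchar_t wide_nul = L'\0';     /* wide-character intent   */
        unsigned long flags = 0UL;    /* unsigned long intent    */
        double offset = 0.0;          /* floating-point intent   */
        /* Every initializer is zero; only its form differs, and the form
           documents how the object is meant to be used. */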
     
    James Kuyper, Jul 2, 2013
    #71
  12. (snip)
    I never used one, but I thought I remembered some Cray machines
    with word-addressed 64-bit words that did. Maybe only in museums
    by now.

    -- glen
     
    glen herrmannsfeldt, Jul 2, 2013
    #72
  13. (snip on EOF and valid characters)
    Doesn't really matter what I think, but it does matter what writers
    of compilers think.

    Are there compilers with 8 bit signed char using ASCII
    and EOF of -1?

    -- glen
     
    glen herrmannsfeldt, Jul 2, 2013
    #73
  14. You've just described a system called the PDP-11 -- which
    dmr said was *not* the birthplace of C, but how would he know?
     
    Eric Sosman, Jul 2, 2013
    #74
  15. Not really - I'm talking about conforming implementations of C. The only
    thing that matters for the truth of my statements is what the writers of
    the standard intended. If writers of compilers disagree, and act on that
    disagreement, they will produce compilers that don't conform. That would
    be a problem - but it wouldn't have any effect on whether or not my
    statements are correct.

    I'm confused, however, as to what form you think that disagreement might
    take. Do you know of any implementations that implement fgetc() in ways
    that will cause it to return EOF when processing a byte from a stream in
    a file, if that byte has a value that you would not consider a valid
    character? Such an implementation would be non-conforming, but I can't
    imagine any reason for creating such an implementation. Most programs
    that use C I/O functions to write and read data consisting of any
    non-character data type would malfunction if that were true.
    EOF == -1 is quite common. ASCII has yielded to extended ASCII, UTF-8,
    or other more exotic options on most modern implementations I'm familiar
    with, which means that the extended execution character set includes
    characters that would be negative if char is signed. The one part I'm
    not sure of is how common it is for char to be signed - but if there are
    such implementations, it's not a problem. That's because the behavior of
    fputc() and fgetc() is defined in terms of unsigned char, not plain
    char. As a result, setting EOF to -1 cannot cause a conflict with any
    hypothetical character that happens to have a value of -1. You have to
    write such a char to file using fputc((unsigned char)-1), and fgetc()
    should return (int)(unsigned char)(-1) upon reading such a character.
    Having a successful call to fgetc() return EOF is only possible if
    UCHAR_MAX > INT_MAX, which can't happen on systems with 8-bit signed char.
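
    A small sketch of that, assuming an ordinary implementation with 8-bit
    char and EOF == -1:

        #include <stdio.h>

        int main(void)
        {
            FILE *f = tmpfile();     /* binary stream */
            if (f == NULL)
                return 1;

            /* fputc converts its argument to unsigned char, so passing -1
               writes the byte 0xFF. */
            fputc(-1, f);
            rewind(f);

            int c = fgetc(f);
            /* fgetc returns that byte as an unsigned char converted to int,
               so c is 255 here -- not EOF, which is -1. */
            printf("read %d, EOF is %d\n", c, EOF);
            fclose(f);
            return 0;
        }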
     
    James Kuyper, Jul 2, 2013
    #75
  16. This is a clever line of reasoning. But the conclusions are
    wrong, for lots of different reasons.

    First, an implementation might simply fail any attempt to open a
    binary file. Or the open might succeed, but any attempt to write
    a negative argument might fail and indicate a write error. Such
    an implementation might be seen as abysmal but it still could be
    conforming. And it clearly satisfies 7.21.2p3, without imposing
    any further limitations on how conversions might work (or not).

    Second, the Standard doesn't guarantee that all unsigned char
    values can be transmitted: it is only the unsigned char values
    corresponding to int arguments that can be written, and hence
    only these that need compare equal when subsequently read. The
    word "transparently" might be taken as implying all unsigned char
    values will work, or it might be taken to mean what the rest of
    the paragraph spells out, i.e. that any values written will survive
    the round trip unchanged. The idea that the 'comparing equal'
    condition is meant to apply universally to all unsigned char
    values is an assumption not supported by any explicit wording.

    Third, even if an implementation allows reading and writing of
    binary files, and fputc works faithfully for all unsigned char
    values, conversion from unsigned char to int could still raise
    an implementation-defined signal (for values above INT_MAX).
    This could work if the default signal handler checked and did
    the right thing when converting arguments to fputc etc, and
    otherwise something else. (And the signal in question could
    be one not subject to change using signal().)

    Fourth, alternatively, ints could have a trap representation,
    where the implementation takes advantage of the freedom given by
    undefined behavior in such cases, to do the right thing when
    converting arguments to fputc (etc), or something else for other
    such conversions. Such implementations might be seen as rather
    perverse, but that doesn't make them non-conforming.

    Finally, and perhaps most obviously, there is no reason 7.21.2p3
    even necessarily applies, because freestanding implementations
    aren't required to implement <stdio.h>.
     
    Tim Rentsch, Jul 3, 2013
    #76
  17. All the Crays I used had CHAR_BIT==8. The T90 in particular was a
    word-addressed vector machine with 64-bit words. char* and void*
    pointers were 64-bit word pointers with a byte offset stored in the
    high-order 3 bits. String operations were surprisingly slow -- but of
    course the hardware was heavily optimized for massive floating-point
    operations.

    But that was for Unicos, Cray's Unix system, so it had to conform (more
    or less) both to C and to POSIX. I never used the earlier non-Unix COS
    system, and I don't know what it was like.
     
    Keith Thompson, Jul 3, 2013
    #77
  18. Yes, a lot of them, in fact most of the C compilers I've used fit that
    description.

    8-bit char (i.e., CHAR_BIT==8): I've never used a system with CHAR_BIT
    != 8.

    signed char: Plain char is signed on *most* compilers I've used.
    Exceptions are Cray systems, SGI MIPS systems running Irix, and IBM
    PowerPC systems running AIX.

    Most systems these days support various supersets of ASCII, ranging from
    Windows-1252 or Latin-1 to full Unicode. But the 7-bit ASCII subset is
    nearly universal. EBCDIC-based systems are an obvious exception (and
    they'd probably have to make plain char unsigned).

    I don't believe I've ever seen a value for EOF other than -1.
     
    Keith Thompson, Jul 3, 2013
    #78
  19. Thank you, I appreciate the positive comment. Of course I would
    expect (and hope!) that most people would agree on the objective
    portion, and differ only in assignment of the subjective weights.
    It's nice to hear that's how it worked out in this instance.
    My reaction is just the opposite. For starters, I think the gap
    has gotten wider rather than narrower. Moreover it is likely to
    continue growing in the future, because of the different design
    philosophies of the respective groups - the C group is generally
    conservative, the C++ group more open to accommodating new
    features.

    As to the "whole is greater than the sum of the parts" idea, I
    believe if individual changes don't stand on their own merits,
    then it's even worse to include them as a group. Let's take the
    'const'-defined array bound as an example. This language feature
    adds no significant expressive power to the language; it's
    simply another way of doing something that can already be done
    with about the same amount of code writing. There may be some
    things about it that are better, and some things that are worse,
    but certainly it isn't clearly better -- it's just different. So
    now what happens if rather than one of those we add 25 of them?
    There's no appreciable difference in how easy or hard programs
    are to write; but reading them gets harder, because there are
    more ways to write the same thing, and translating between them
    takes some effort. Meanwhile the language specification would
    get noticeably larger, and require more effort to read and digest
    (even not counting the effort needed to write). No real gain,
    and a bunch of cost.
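
    To make the example concrete (the C++-style reading is the hypothetical
    addition here):

        #define LEN 10           /* the existing ways of naming a bound in C */
        enum { LEN2 = 10 };

        void f(void)
        {
            const int n = 10;
            int a[LEN];          /* ordinary array */
            int b[LEN2];         /* ordinary array */
            int c[n];            /* in C this is a VLA; the proposed feature
                                    would treat n as a constant expression, as
                                    C++ does, making c an ordinary array */
            (void)a; (void)b; (void)c;
        }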

    An uncharitable view of your opinion is that it is simply
    disguised chauvinism for the C++ way of doing things. Do you
    have any arguments to offer for the merits of some of these
    proposed new features that don't reference C++ but are able to
    stand on their own?
     
    Tim Rentsch, Jul 5, 2013
    #79
  20. It looks like this will continue to be the case, given the active
    discussion of new features for the next C++ revision.

    There are a number of C++11 additions that would stand alone in C and
    improve it in similar ways to C++; some examples:

    Static (compile-time) assertions. Yes, you can do much the same with the
    preprocessor, but there are limits, and I believe static_assert makes the
    conditions clearer as program or function preconditions. They are also
    more concise. (A sketch follows at the end of this list.)

    Initialisations preventing narrowing. Removes another source of
    unexpected behaviour.

    General compile time constants with constexpr. This would be
    particularly useful in the embedded world where you want to minimise RAM use.

    nullptr. Removes another source of unexpected behaviour.

    Raw string literals. Removes another potential source of unexpected
    behaviour and hard to read code (how many slashes do I need?).

    alignas to standardise alignment.

    And for the bold, "auto" variable declarations.

    None of these are particularly radical, but they would make a
    programmer's life just that little bit easier.
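
    On the first of those, a minimal sketch of what the compile-time
    assertion looks like, using C11's _Static_assert (spelled static_assert
    via <assert.h>) next to the older preprocessor-era idiom:

        #include <assert.h>    /* C11: static_assert macro for _Static_assert */
        #include <limits.h>

        /* Pre-C11 idiom: a negative array size forces a compile-time error,
           but the diagnostic says nothing useful. */
        typedef char int_is_at_least_32_bits[(INT_MAX >= 0x7FFFFFFF) ? 1 : -1];

        /* C11 form: the condition and the message are stated directly. */
        static_assert(INT_MAX >= 0x7FFFFFFF, "int must be at least 32 bits");

        int main(void) { return 0; }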
     
    Ian Collins, Jul 5, 2013
    #80
