Advancing past the last element of an array

Discussion in 'C Programming' started by Johannes Schaub (litb), Dec 27, 2009.

  1. I have a question about the following C code:

    int a[3][1];
    int *ap = &a[0][0];

    I know that in C++ the following is perfectly fine:

    int ap1 = *ap;
    int ap2 = *(ap + 1);
    int ap3 = *(ap + 1 + 1);

    That is because the past-the-end pointer "ap + 1" happens to point to an
    unrelated integer that is stored at that address, so dereferencing it reads
    that integer. Adding 1 again then operates on *that* pointer, which is a
    pointer into the second subarray of "a", and the result points to the
    integer located at the past-the-end position of the second subarray.

    I know that the following two lines are undefined behavior in C++:

    int ap3 = *(ap + 2);
    int ap3_secondtry = *(ap + (1 + 1));

    That is because it adds 2 to the pointer into an array that only has one
    element.

    Now my question is: what is the situation in C? Is there some paragraph in
    the Standard that allows it? To me, it looks like everything but
    "int ap1 = *ap;" is undefined behavior in C, because the Standard seems to
    disallow dereferencing the past-the-end pointer.

    Any help is welcome!
     
    Johannes Schaub (litb), Dec 27, 2009
    #1

  2. Seebs Guest

    On 2009-12-27, Johannes Schaub (litb) <> wrote:
    > I have some question about this one in C:
    >
    > int a[3][1];
    > int *ap = &a[0][0];
    >
    > I know that in C++ the following is perfectly fine:
    >
    > int ap1 = *ap;
    > int ap2 = *(ap + 1);
    > int ap3 = *(ap + 1 + 1);


    Are you sure? I'd agree that it's almost certainly going to work. However,
    it's not as clear to me that it's "perfectly fine".

    > That is because the past-the-end pointer "ap + 1" happens to point to an
    > unrelated integer that just happens to be stored there, and dereferencing it
    > dereferences that pointer.


    In C, that's a bounds violation, because you're going past the bounds of
    the object to which you have a pointer. You have a pointer to the first of
    the three subarrays. While it happens that this is part of a larger object,
    it's still going past the bounds of the specific object from which you
    derived the pointer.

    You're certainly allowed to generate a pointer one past the end of an
    array, but you're not allowed to dereference it.

    > I know that the following two lines are undefined behavior in C++:


    > int ap3 = *(ap + 2);


    You are clearly crazy.

    There is no difference between "ap + 1 + 1" and "ap + 2".

    If you think there is, either the C++ standards committee is deeply
    insane, or you're very confused. I'm guessing both.

    > int ap3_secondtry = *(ap + (1 + 1));


    > That is because it adds 2 to the pointer into an array that only has one
    > element.


    Again, this is not different. There's no difference to be had. You're still
    going past the end of an array by the same amount.

    > Now my question is - how is the matter in C? Is there some paragraph in the
    > Standard that allows it? To me, it looks like all but "int ap1 = *ap;" is
    > undefined behavior in C, because it seems to disallow dereferencing the
    > past-the-end pointer.


    It does.

    If C++ allows dereferencing the one-past-the-end pointer, that gets you
    ap+1, but it doesn't make ap+1+1 different from ap+2. The mere fact that
    ap+1 happens to be the same address as &a[1][0] doesn't mean that you can
    then expect ap+1+1 to be dereferenceable; it's still going two past the
    end of an array.

    -s
    --
    Copyright 2009, all wrongs reversed. Peter Seebach /
    http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
    http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
     
    Seebs, Dec 27, 2009
    #2

  3. Seebs wrote:

    > On 2009-12-27, Johannes Schaub (litb) <> wrote:
    >> I have some question about this one in C:
    >>
    >> int a[3][1];
    >> int *ap = &a[0][0];
    >>
    >> I know that in C++ the following is perfectly fine:
    >>
    >> int ap1 = *ap;
    >> int ap2 = *(ap + 1);
    >> int ap3 = *(ap + 1 + 1);

    >
    > Are you sure? I'd agree that it's almost certainly going to work.
    > However, it's not as clear to me that it's "perfectly fine".
    >
    >> That is because the past-the-end pointer "ap + 1" happens to point to an
    >> unrelated integer that just happens to be stored there, and dereferencing
    >> it dereferences that pointer.

    >
    > In C, that's a bounds violation, because you're going past the bounds of
    > the object to which you have a pointer. You have a pointer to the first
    > of
    > the three subarrays. While it happens that this is part of a larger
    > object, it's still going past the bounds of the specific object from which
    > you derived the pointer.
    >
    > You're certainly allowed to generate a pointer one past the end of an
    > array, but you're not allowed to dereference it.
    >
    >> I know that the following two lines are undefined behavior in C++:

    >
    >> int ap3 = *(ap + 2);

    >
    > You are clearly crazy.
    >
    > There is no difference between "ap + 1 + 1" and "ap + 2".
    >
    > If you think there is, either the C++ standards committee is deeply
    > insane, or you're very confused. I'm guessing both.
    >


    I'm fairly sure it is a fact that in C++ "+ 1 + 1" is different from "+ 2"
    above (notice that the binding is ((a + b) + c) and not (a + (b + c))). The
    C++ Standard says:

    "If an object of type T is located at an address A, a pointer of type cv T*
    whose value is the address A is said to point to that object, regardless of
    how the value was obtained. [Note: for instance, the address one past the
    end of an array (5.7) would be considered to point to an unrelated object of
    the array’s element type that might be located at that address. ]"

    In my opinion, the note sufficiently clarifies that "ap + 1" above is the
    same as "&a[1][0]". And because C++ does not unconditionally forbid
    dereferencing a past-the-end pointer, I was of the opinion that it is valid.

    The reason I say "+ 2" or "+ (1 + 1)" is undefined behavior is that the
    Standard says one cannot add an integer that takes the pointer further than
    past-the-end. But once we have hit past-the-end, and point to an element of
    another array, we could increment again.

    My question was whether such a rule exists in C too. It has real practical
    relevance: if C guarantees it too, then we could go from &a[0][0] to
    &a[2][0] without undefined behavior in C, like we can in C++, by using
    "ap++" until we hit the end.

    If C does not provide this, is there a reason for that? Thanks for any
    pointers!
     
    Johannes Schaub (litb), Dec 28, 2009
    #3
  4. Seebs Guest

    On 2009-12-28, Johannes Schaub (litb) <> wrote:
    > I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+ 2"
    > above (notice that the binding is ((a + b) + c) and not (a + (b + c)).


    Hmm.

    > "If an object of type T is located at an address A, a pointer of type cv T*
    > whose value is the address A is said to point to that object, regardless of
    > how the value was obtained. [Note: for instance, the address one past the
    > end of an array (5.7) would be considered to point to an unrelated object of
    > the array's element type that might be located at that address. ]"


    If you did:
    int *x = &a[0][0];
    int *y = x + 1;
    int *z = y + 1;
    you might be able to argue that the +1s are each being resolved separately, and
    you're not just running two past the end of the array. But otherwise, I don't
    think I buy it. If this really is what the spec says, I'd guess it's an
    unintentional bug.

    However, if you want to discuss C++ rules, you'll get more informed
    opinions in a C++ newsgroup.

    > The reason that i say "+ 2" or "+ (1+1)" is undefined behavior is because it
    > says that one can not add an integer so it goes further than past-the-end.
    > But once we have hit past-the-end, and point to an element of another array,
    > we could increment again.


    Interesting.

    > My question was whether such things exist in C too. It has indeed practical
    > relevance, since if C guarantees it too, then we could go from &a[0][0] to
    > &a[2][0] without undefined behavior in C too, like we can in C++ using
    > "ap++" until we hit end.


    > If C does not provide this - is there a reason for that? Thanks for any
    > pointers!


    There is absolutely no such guarantee, and there is a very good reason:
    Because all such code is fundamentally, deeply, broken.

    C does not allow dereferencing outside the bounds of an object. The one
    thing you can do is calculate the address one past the end -- but you
    can't dereference it. C does not have the rule that, no matter how you
    get a pointer, if you have a pointer that compares equal to another
    pointer, they're the same -- because this would break bounds checking.

    My guess is that C++ has that for some stupid reason pertaining to
    operator overloading.

    But even if it works, you should never, ever, not in a million years, not
    under any circumstances, write code depending on this kind of idiocy.

    (There is in fact a very good practical reason for this -- one of the
    most common changes to see in 2D array code is a shift from a 2D array
    to a 1D array of pointers, in which case, iterating off one does NOT
    lead you to the next one...)

    -s
    --
    Copyright 2009, all wrongs reversed. Peter Seebach /
    http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
    http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
     
    Seebs, Dec 28, 2009
    #4
  5. Kaz Kylheku Guest

    On 2009-12-28, Johannes Schaub (litb) <> wrote:
    >
    > I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+ 2"


    Not for the built-in + operator over arithmetic types.

    > above (notice that the binding is ((a + b) + c) and not (a + (b + c)). It
    > says:


    > The reason that i say "+ 2" or "+ (1+1)" is undefined behavior is because it
    > says that one can not add an integer so it goes further than past-the-end.
    > But once we have hit past-the-end, and point to an element of another array,
    > we could increment again.


    That isn't true; you are past the end of the object from which you
    derived the pointer.
    This is undefined behavior.

    When you're dealing with multidimensional arrays, the compiler can
    generate code which assumes that no bounds are violated.

    So for instance if you have some machine instruction with, say, a 12 bit
    displacement field, and in the given situation, that bit field is wide
    enough to address a dimension of the array, then the compiler can just
    blindly generate that instruction, even if your program overflows the 12
    bit width.
     
    Kaz Kylheku, Dec 28, 2009
    #5
  6. Kaz Kylheku wrote:

    > On 2009-12-28, Johannes Schaub (litb) <> wrote:
    >>
    >> I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+
    >> 2"

    >
    > Not for the built-in + operator over arithmetic types.
    >
    >> above (notice that the binding is ((a + b) + c) and not (a + (b + c)). It
    >> says:

    >
    >> The reason that i say "+ 2" or "+ (1+1)" is undefined behavior is because
    >> it says that one can not add an integer so it goes further than
    >> past-the-end. But once we have hit past-the-end, and point to an element
    >> of another array, we could increment again.

    >
    > That isn't true; you are past the end of the object from which you
    > derived the pointer.
    > This is undefined behavior.
    >

    One-past-the-end is fine. Going past *that* is undefined if you do the
    addition in one operation (e.g. +2 instead of +1 + 1). As far as I know,
    nothing in the Standard says that it is undefined behavior to do this in
    two steps.

    I'm just unsure about C. But it seems that C actually does not allow it.

    > When you're dealing with multidimensional arrays, the compiler can
    > generate code which assumes that no bounds are violated.
    >
    > So for instance if you have some machine instruction with, say, a 12 bit
    > displacement field, and in the given situation, that bit field is wide
    > enough to address a dimension of the array, then the compiler can just
    > blindly generate that instruction, even if your program overflows the 12
    > bit width.


    There are exactly two valid values of object pointers: Either a byte in
    memory, or a null pointer value - there is no dedicated past-the-end value.
    If you are one past the end of one array that is just prior to another
    array, it follows you are also at the first element of the next array.

    The compiler could switch the segments if it hits past-the-end and it sees
    the addition would overflow the segment. This sounds like a practical reason
    for why "+1 + 1" is not UB: It just inserts checks after each addition
    whether it hit the end of a segment, and switches, if needed. But if you do
    "+2", you "jump over it", and do an overflow right there, with no chance for
    the compiler to switch segments.
     
    Johannes Schaub (litb), Dec 28, 2009
    #6
  7. Seebs wrote:

    > On 2009-12-28, Johannes Schaub (litb) <> wrote:
    >> I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+
    >> 2" above (notice that the binding is ((a + b) + c) and not (a + (b + c)).

    >
    > Hmm.
    >
    >> "If an object of type T is located at an address A, a pointer of type cv
    >> T* whose value is the address A is said to point to that object,
    >> regardless of how the value was obtained. [Note: for instance, the
    >> address one past the end of an array (5.7) would be considered to point
    >> to an unrelated object of the array's element type that might be located
    >> at that address. ]"

    >
    > If you did:
    > int *x = &a[0][0];
    > int *y = x + 1;
    > int *z = y + 1;
    > you might be able to argue that the +1s are each being resolved
    > separately, and
    > you're not just running two past the end of the array. But otherwise, I
    > don't
    > think I buy it. If this really is what the spec says, I'd guess it's an
    > unintentional bug.
    >


    Hmm, maybe I should ask this question in a C++ group. I thought this was
    exactly the intended behavior. But seeing how surprised you all are, I now
    have doubts about it :/
     
    Johannes Schaub (litb), Dec 28, 2009
    #7
  8. Kaz Kylheku Guest

    On 2009-12-28, Johannes Schaub (litb) <> wrote:
    >> When you're dealing with multidimensional arrays, the compiler can
    >> generate code which assumes that no bounds are violated.
    >>
    >> So for instance if you have some machine instruction with, say, a 12 bit
    >> displacement field, and in the given situation, that bit field is wide
    >> enough to address a dimension of the array, then the compiler can just
    >> blindly generate that instruction, even if your program overflows the 12
    >> bit width.

    >
    > There are exactly two valid values of object pointers: Either a byte in
    > memory, or a null pointer value - there is no dedicated past-the-end value.


    In C, there is a concept of pointer validity which takes into account
    /how/ that pointer was obtained. That information is not necessarily
    encoded in the pointer's run-time value. (Remember, in C, type
    information is not encoded in run-time values either; that doesn't mean
    you can violate the type system and still have a well-defined program.)

    Since the validity of a pointer includes how it was obtained,
    merely knowing where that pointer points is not enough of an assurance
    of correctness.

    Validity is important when it comes to code generation, and
    optimization. Code can be generated and optimized based on validity
    assumptions (that the program hasn't invoked any undefined behavior).

    > If you are one past the end of one array that is just prior to another
    > array, it follows you are also at the first element of the next array.


    It doesn't follow that you are legally at the first element of the
    array.

    If a prisoner climbs the fence, it follows that he's physically not in
    prison any more, not that he's legally a free man.

    C is not assembly language. What is well-defined or not at the language
    level is not governed by the object code generated by some compilers.

    There isn't just one C language so you have to be careful about what
    you mean; when you say that something is well-defined, do you mean
    ISO C, or do you mean some dialect accepted by some compilers?

    Both concepts of definedness are valid and useful, as is not
    confusing one for the other.
     
    Kaz Kylheku, Dec 28, 2009
    #8
  9. Kaz Kylheku wrote:

    > On 2009-12-28, Johannes Schaub (litb) <> wrote:
    >>> When you're dealing with multidimensional arrays, the compiler can
    >>> generate code which assumes that no bounds are violated.
    >>>
    >>> So for instance if you have some machine instruction with, say, a 12 bit
    >>> displacement field, and in the given situation, that bit field is wide
    >>> enough to address a dimension of the array, then the compiler can just
    >>> blindly generate that instruction, even if your program overflows the 12
    >>> bit width.

    >>
    >> There are exactly two valid values of object pointers: Either a byte in
    >> memory, or a null pointer value - there is no dedicated past-the-end
    >> value.

    >
    > In C, there is a concept of pointer validity which takes into account
    > /how/ that pointer was obtained. That information is not necessarily
    > encoded in the pointer's run-time value. (Remember, in C, type
    > information is not encoded in run-time values either; that doesn't mean
    > you can violate the type system and still have a well-defined program.)
    >
    > Since the validity of a pointer includes how it was obtained,
    > merely knowing where that pointer points is not enough of an assurance
    > of correctness.
    >
    > Validity is important when it comes to code generation, and
    > optimization. Code can be generated and optimized based on validity
    > assumptions (that the program hasn't invoked any undefined behavior).
    >

    I see now. In C, pointers seem to carry a relationship to the object they
    were derived from.

    >> If you are one past the end of one array that is just prior to another
    >> array, it follows you are also at the first element of the next array.

    >
    > It doesn't follow that you are legally at the first element of the
    > array.
    >
    > If a prisoner climbs the fence, it follows that he's physically not in
    > prison any more, not that he's legally a free man.
    >
    > C is not assembly language. What is well-defined or not at the language
    > level is not governed by the object code generated by some compilers.
    >
    > There isn't just one C language so you have to be careful about what
    > you mean; when you say that something is well-defined, do you mean
    > ISO C, or do you mean some dialect accepted by some compilers?
    >
    > Both concepts of definedness are valid and useful, as is not
    > confusing one for the other.


    I think this makes sense. I'm talking about ISO C99. So we cannot do this
    in C. Thanks for explaining it to me; I like the prisoner analogy.
     
    Johannes Schaub (litb), Dec 28, 2009
    #9
  10. Seebs Guest

    On 2009-12-28, Johannes Schaub (litb) <> wrote:
    > Hmm, maybe i should ask this question in a C++ group. I thought it's exactly
    > intended behavior. But reading how surprised you guys are, i've now doubts
    > about it :/


    It makes no sense for it to be intentionally specified to work.

    The rule about pointers outside the bounds of an object is that you're
    allowed to generate a pointer one past the end of an array, for purposes
    of comparing it to pointers into the array, or subtracting other addresses
    in the array from it to count offsets.

    In C, where you got the pointer matters.

    int ary[3][1] = { 0 };
    int *ap_1 = (int *) ary;
    int *ap_2 = (int *) ary[0];
    int x;
    x = ap_1[0]; // clearly well-defined
    x = ap_1[1]; // well-defined, reads from ary[1][0]
    x = ap_2[0]; // clearly well-defined
    x = ap_2[1]; // undefined, tries to read from ary[0][1]

    In short, even though ap_1 and ap_2 are the same address in memory, the
    compiler is allowed to note that one of them is the address of a block of
    three arrays of single integers, thus, an object of size 3*sizeof(int),
    and the other is the address of an array of one integer, thus, an object
    of size 1*sizeof(int).

    If you want a pointer to the whole object, don't derive it from one of
    the members.

    -s
    --
    Copyright 2009, all wrongs reversed. Peter Seebach /
    http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
    http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
     
    Seebs, Dec 28, 2009
    #10
  11. Tim Rentsch Guest

    Kaz Kylheku <> writes:

    > On 2009-12-28, Johannes Schaub (litb) <> wrote:
    >>> When you're dealing with multidimensional arrays, the compiler can
    >>> generate code which assumes that no bounds are violated.
    >>>
    >>> So for instance if you have some machine instruction with, say, a 12 bit
    >>> displacement field, and in the given situation, that bit field is wide
    >>> enough to address a dimension of the array, then the compiler can just
    >>> blindly generate that instruction, even if your program overflows the 12
    >>> bit width.

    >>
    >> There are exactly two valid values of object pointers: Either a byte in
    >> memory, or a null pointer value - there is no dedicated past-the-end value.

    >
    > In C, there is a concept of pointer validity which takes into account
    > /how/ that pointer was obtained. That information is not necessarily
    > encoded in the pointer's run-time value. (Remember, in C, type
    > information is not encoded in run-time values either; that doesn't mean
    > you can violate the type system and still have a well-defined program.)
    >
    > Since the validity of a pointer includes how it was obtained,
    > merely knowing where that pointer points is not enough of an assurance
    > of correctness. [...]


    Perhaps you would be so good as to tell the group in
    which section(s) of which ISO document(s) this concept
    is explained or described?
     
    Tim Rentsch, Jan 13, 2010
    #11
