Advancing past the last element of an array

  • Thread starter Johannes Schaub (litb)

Johannes Schaub (litb)

I have a question about the following in C:

int a[3][1];
int *ap = &a[0][0];

I know that in C++ the following is perfectly fine:

int ap1 = *ap;
int ap2 = *(ap + 1);
int ap3 = *(ap + 1 + 1);

That is because the past-the-end pointer "ap + 1" happens to point to an
unrelated integer that just happens to be stored there, and dereferencing it
dereferences that pointer. Adding +1 again adds +1 to *that* pointer which
is a pointer into the second subarray of "a". Which will point to the last
integer that just happens to be at the past-the-end position of the second
subarray.

I know that the following two lines are undefined behavior in C++:

int ap3 = *(ap + 2);
int ap3_secondtry = *(ap + (1 + 1));

That is because it adds 2 to the pointer into an array that only has one
element.

Now my question is: how does this stand in C? Is there some paragraph in the
Standard that allows it? To me, it looks like everything but "int ap1 = *ap;"
is undefined behavior in C, because C seems to disallow dereferencing the
past-the-end pointer.

Any help is welcome!
 
Seebs

I have some question about this one in C:

int a[3][1];
int *ap = &a[0][0];

I know that in C++ the following is perfectly fine:

int ap1 = *ap;
int ap2 = *(ap + 1);
int ap3 = *(ap + 1 + 1);

Are you sure? I'd agree that it's almost certainly going to work. However,
it's not as clear to me that it's "perfectly fine".
That is because the past-the-end pointer "ap + 1" happens to point to an
unrelated integer that just happens to be stored there, and dereferencing it
dereferences that pointer.

In C, that's a bounds violation, because you're going past the bounds of
the object to which you have a pointer. You have a pointer to the first of
the three subarrays. While it happens that this is part of a larger object,
it's still going past the bounds of the specific object from which you
derived the pointer.

You're certainly allowed to generate a pointer one past the end of an
array, but you're not allowed to dereference it.
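
The rule just stated can be sketched roughly as follows. This is only an illustration: `sum_ints` is a hypothetical helper name, not something from the thread. The one-past-the-end pointer is formed and compared against, but never dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n ints starting at first (hypothetical helper for illustration).
   `end` is the one-past-the-end pointer: it is computed and compared
   against, but never dereferenced. */
int sum_ints(const int *first, size_t n)
{
    assert(first != NULL);
    const int *end = first + n;   /* fine to form, even though it points past the last element */
    int total = 0;
    for (const int *p = first; p != end; ++p)
        total += *p;              /* p always points at a real element here */
    return total;
}
```

Called as `sum_ints(v, 3)` on `int v[3]`, the loop stops as soon as `p` reaches `end`, so the past-the-end address is never read.
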
I know that the following two lines are undefined behavior in C++:
int ap3 = *(ap + 2);

You are clearly crazy.

There is no difference between "ap + 1 + 1" and "ap + 2".

If you think there is, either the C++ standards committee is deeply
insane, or you're very confused. I'm guessing both.
int ap3_secondtry = *(ap + (1 + 1));
That is because it adds 2 to the pointer into an array that only has one
element.

Again, this is not different. There's no difference to be had. You're still
going past the end of an array by the same amount.
Now my question is - how is the matter in C? Is there some paragraph in the
Standard that allows it? To me, it looks like all but "int ap1 = *ap;" is
undefined behavior in C, because it seems to disallow dereferencing the
past-the-end pointer.

It does.

If C++ allows dereferencing the one-past-the-end pointer, that gets you
ap+1, but it doesn't make ap+1+1 different from ap+2. The mere fact that
ap+1 happens to be the same address as &a[1][0] doesn't mean that you can
then expect ap+1+1 to be dereferenceable; it's still going two past the
end of an array.

-s
 
Johannes Schaub (litb)

Seebs said:
I have some question about this one in C:

int a[3][1];
int *ap = &a[0][0];

I know that in C++ the following is perfectly fine:

int ap1 = *ap;
int ap2 = *(ap + 1);
int ap3 = *(ap + 1 + 1);

Are you sure? I'd agree that it's almost certainly going to work.
However, it's not as clear to me that it's "perfectly fine".
That is because the past-the-end pointer "ap + 1" happens to point to an
unrelated integer that just happens to be stored there, and dereferencing
it dereferences that pointer.

In C, that's a bounds violation, because you're going past the bounds of
the object to which you have a pointer. You have a pointer to the first of
the three subarrays. While it happens that this is part of a larger
object, it's still going past the bounds of the specific object from which
you derived the pointer.

You're certainly allowed to generate a pointer one past the end of an
array, but you're not allowed to dereference it.
I know that the following two lines are undefined behavior in C++:
int ap3 = *(ap + 2);

You are clearly crazy.

There is no difference between "ap + 1 + 1" and "ap + 2".

If you think there is, either the C++ standards committee is deeply
insane, or you're very confused. I'm guessing both.

I'm fairly sure it is a fact in C++ that "+ 1 + 1" is different from "+ 2"
above (notice that the binding is ((a + b) + c) and not (a + (b + c))). The
C++ Standard says:

"If an object of type T is located at an address A, a pointer of type cv T*
whose value is the address A is said to point to that object, regardless of
how the value was obtained. [Note: for instance, the address one past the
end of an array (5.7) would be considered to point to an unrelated object of
the array’s element type that might be located at that address. ]"

The note sufficiently clarifies, in my opinion, that "ap + 1" above is the
same as "&a[1][0]", and because C++ does not unconditionally forbid
dereferencing a past-the-end pointer, I was of the opinion that it is valid.

The reason I say "+ 2" or "+ (1+1)" is undefined behavior is that the
Standard says one cannot add an integer that takes the pointer further than
past-the-end. But once we have hit past-the-end, and point to an element of
another array, we could increment again.

My question was whether such a rule exists in C too. It does have practical
relevance: if C guarantees it too, then we could go from &a[0][0] to
&a[2][0] without undefined behavior in C, like we can in C++ using
"ap++" until we hit the end.

If C does not provide this, is there a reason for that? Thanks for any
pointers!
 
Seebs

I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+ 2"
above (notice that the binding is ((a + b) + c) and not (a + (b + c)).
Hmm.

"If an object of type T is located at an address A, a pointer of type cv T*
whose value is the address A is said to point to that object, regardless of
how the value was obtained. [Note: for instance, the address one past the
end of an array (5.7) would be considered to point to an unrelated object of
the array’s element type that might be located at that address. ]"

If you did:
int *x = &a[0][0];
int *y = x + 1;
int *z = y + 1;
you might be able to argue that the +1s are each being resolved separately, and
you're not just running two past the end of the array. But otherwise, I don't
think I buy it. If this really is what the spec says, I'd guess it's an
unintentional bug.

However, if you want to discuss C++ rules, you'll get more informed
opinions in a C++ newsgroup.
The reason that i say "+ 2" or "+ (1+1)" is undefined behavior is because it
says that one can not add an integer so it goes further than past-the-end.
But once we have hit past-the-end, and point to an element of another array,
we could increment again.
Interesting.

My question was whether such things exist in C too. It has indeed practical
relevance, since if C guarantees it too, then we could go from &a[0][0] to
&a[2][0] without undefined behavior in C too, like we can in C++ using
"ap++" until we hit end.
If C does not provide this - is there a reason for that? Thanks for any
pointers!

There is absolutely no such guarantee, and there is a very good reason:
Because all such code is fundamentally, deeply, broken.

C does not allow dereferencing outside the bounds of an object. The one
thing you can do is calculate the address one past the end -- but you
can't dereference it. C does not have the rule that, no matter how you
get a pointer, if you have a pointer that compares equal to another
pointer, they're the same -- because this would break bounds checking.

My guess is that C++ has that for some stupid reason pertaining to
operator overloading.

But even if it works, you should never, ever, not in a million years, not
under any circumstances, write code depending on this kind of idiocy.

(There is in fact a very good practical reason for this -- one of the
most common changes to see in 2D array code is a shift from a 2D array
to a 1D array of pointers, in which case, iterating off one does NOT
lead you to the next one...)
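
A rough sketch of why code that stays inside each row survives that change. The names `sum_rows`, `demo_separate_rows` and `demo_contiguous_rows` are made up for illustration. The function only ever indexes within one row at a time, so it does not care whether the rows are adjacent in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an nrows x ncols grid given as an array of row pointers.
   Each row is walked only within its own bounds, so the rows may be
   contiguous or completely separate objects. */
int sum_rows(int *const *rows, size_t nrows, size_t ncols)
{
    assert(rows != NULL);
    int total = 0;
    for (size_t i = 0; i < nrows; ++i)
        for (size_t j = 0; j < ncols; ++j)
            total += rows[i][j];
    return total;
}

/* Rows are three unrelated arrays: stepping off the end of one
   says nothing about where the next one lives. */
int demo_separate_rows(void)
{
    int r0[1] = {1}, r1[1] = {2}, r2[1] = {3};
    int *rows[3] = {r0, r1, r2};
    return sum_rows(rows, 3, 1);
}

/* Rows are the subarrays of a real 2D array: same code, same result. */
int demo_contiguous_rows(void)
{
    int a[3][1] = {{1}, {2}, {3}};
    int *rows[3] = {a[0], a[1], a[2]};
    return sum_rows(rows, 3, 1);
}
```

Both demo functions return the same sum, which is the point: nothing in `sum_rows` depends on one row being followed in memory by the next.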

-s
 
Kaz Kylheku

I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+ 2"

Not for the built-in + operator over arithmetic types.
above (notice that the binding is ((a + b) + c) and not (a + (b + c)). It
says:
The reason that i say "+ 2" or "+ (1+1)" is undefined behavior is because it
says that one can not add an integer so it goes further than past-the-end.
But once we have hit past-the-end, and point to an element of another array,
we could increment again.

That isn't true; you are past the end of the object from which you
derived the pointer.
This is undefined behavior.

When you're dealing with multidimensional arrays, the compiler can
generate code which assumes that no bounds are violated.

So for instance if you have some machine instruction with, say, a 12 bit
displacement field, and in the given situation, that bit field is wide
enough to address a dimension of the array, then the compiler can just
blindly generate that instruction, even if your program overflows the 12
bit width.
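
Since the compiler may assume each index stays inside its own dimension, code that cannot guarantee this has to check for itself. A minimal sketch, with the hypothetical names `cell_or`, `ROWS` and `COLS` chosen only for this example:

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 3, COLS = 1 };

/* Return g[i][j] when both indices are inside their own dimension,
   otherwise return fallback. The check happens before the access,
   so the array is never indexed out of bounds. */
int cell_or(int g[ROWS][COLS], size_t i, size_t j, int fallback)
{
    assert(g != NULL);
    if (i >= ROWS || j >= COLS)
        return fallback;
    return g[i][j];
}
```

With `int a[3][1]`, a call such as `cell_or(a, 0, 1, -1)` returns the fallback instead of evaluating `a[0][1]`, even though that address would coincide with `a[1][0]`.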
 
Johannes Schaub (litb)

Kaz said:
Not for the built-in + operator over arithmetic types.



That isn't true; you are past the end of the object from which you
derived the pointer.
This is undefined behavior.
One-past-the-end is fine. Going past *that* is undefined if you do the
addition in one operation (e.g. +2 instead of +1 + 1). As far as I know,
nothing in the Standard says it is undefined behavior to do this in two
steps.

I'm just unsure about C. But it seems it's actually not allowed by C.
When you're dealing with multidimensional arrays, the compiler can
generate code which assumes that no bounds are violated.

So for instance if you have some machine instruction with, say, a 12 bit
displacement field, and in the given situation, that bit field is wide
enough to address a dimension of the array, then the compiler can just
blindly generate that instruction, even if your program overflows the 12
bit width.

There are exactly two valid values of object pointers: Either a byte in
memory, or a null pointer value - there is no dedicated past-the-end value.
If you are one past the end of one array that is just prior to another
array, it follows you are also at the first element of the next array.

The compiler could switch the segments if it hits past-the-end and it sees
the addition would overflow the segment. This sounds like a practical reason
for why "+1 + 1" is not UB: It just inserts checks after each addition
whether it hit the end of a segment, and switches, if needed. But if you do
"+2", you "jump over it", and do an overflow right there, with no chance for
the compiler to switch segments.
 
Johannes Schaub (litb)

Seebs said:
I'm fairly sure this is a fact in C++ that "+ 1 + 1" is different from "+
2" above (notice that the binding is ((a + b) + c) and not (a + (b + c)).
Hmm.

"If an object of type T is located at an address A, a pointer of type cv
T* whose value is the address A is said to point to that object,
regardless of how the value was obtained. [Note: for instance, the
address one past the end of an array (5.7) would be considered to point
to an unrelated object of the array’s element type that might be located
at that address. ]"

If you did:
int *x = &a[0][0];
int *y = x + 1;
int *z = y + 1;
you might be able to argue that the +1s are each being resolved
separately, and you're not just running two past the end of the array.
But otherwise, I don't think I buy it. If this really is what the spec
says, I'd guess it's an unintentional bug.

Hmm, maybe I should ask this question in a C++ group. I thought this was
exactly the intended behavior. But seeing how surprised you all are, I now
have doubts about it :/
 
Kaz Kylheku

There are exactly two valid values of object pointers: Either a byte in
memory, or a null pointer value - there is no dedicated past-the-end value.

In C, there is a concept of pointer validity which takes into account
/how/ that pointer was obtained. That information is not necessarily
encoded in the pointer's run time value. (Remember, in C, type
information is not also encoded in run-time values; that doesn't mean
you can violate the type system and still have a well-defined program).

Since the validity of a pointer includes how it was obtained,
merely knowing where that pointer points is not enough of an assurance
of correctness.

Validity is important when it comes to code generation, and
optimization. Code can be generated and optimized based on validity
assumptions (that the program hasn't invoked any undefined behavior).
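
To make the distinction concrete, here is a small sketch (the function name `past_end_equals_next_row` is invented for this example). It compares the two addresses, which as far as I can tell C99 permits under its rules for the equality operators (6.5.9), but it only ever reads through the pointer that was derived from the second row.

```c
#include <assert.h>

/* Returns 1 when the one-past-the-end pointer of a[0] compares equal to
   the address of a[1][0]. Only the addresses are compared; nothing is
   read through the past-the-end pointer. */
int past_end_equals_next_row(void)
{
    int a[3][1] = {{1}, {2}, {3}};
    int *past_end = &a[0][0] + 1;   /* one past the end of a[0]: fine to form */
    int *next_row = &a[1][0];       /* derived from a[1]: fine to dereference */
    assert(*next_row == 2);         /* the read goes through next_row only */
    return past_end == next_row;
}
```

The two pointers compare equal, yet under the argument made here only `next_row` may be dereferenced, because only it was obtained from the object it points into.
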
If you are one past the end of one array that is just prior to another
array, it follows you are also at the first element of the next array.

It doesn't follow that you are legally at the first element of the
array.

If a prisoner climbs the fence, it follows that he's physically not in
prison any more, not that he's legally a free man.

C is not assembly language. What is well-defined or not at the language
level is not governed by the object code generated by some compilers.

There isn't just one C language so you have to be careful about what
you mean; when you say that something is well-defined, do you mean
ISO C, or do you mean some dialect accepted by some compilers?

Both concepts of definedness are valid and useful, as is not
confusing one for the other.
 
Johannes Schaub (litb)

Kaz said:
In C, there is a concept of pointer validity which takes into account
/how/ that pointer was obtained. That information is not necessarily
encoded in the pointer's run time value. (Remember, in C, type
information is not also encoded in run-time values; that doesn't mean
you can violate the type system and still have a well-defined program).

Since the validity of a pointer includes how it was obtained,
merely knowing where that pointer points is not enough of an assurance
of correctness.

Validity is important when it comes to code generation, and
optimization. Code can be generated and optimized based on validity
assumptions (that the program hasn't invoked any undefined behavior).
I see now. In C, pointers seem to carry a relationship to the object they
were derived from.
It doesn't follow that you are legally at the first element of the
array.

If a prisoner climbs the fence, it follows that he's physically not in
prison any more, not that he's legally a free man.

C is not assembly language. What is well-defined or not at the language
level is not governed by the object code generated by some compilers.

There isn't just one C language so you have to be careful about what
you mean; when you say that something is well-defined, do you mean
ISO C, or do you mean some dialect accepted by some compilers?

Both concepts of definedness are valid and useful, as is not
confusing one for the other.

I think this makes sense. I'm talking about ISO C99, so we cannot do this in
C. Thanks for explaining the matter; I like the prisoner analogy.
 
Seebs

Hmm, maybe i should ask this question in a C++ group. I thought it's exactly
intended behavior. But reading how surprised you guys are, i've now doubts
about it :/

It makes no sense for it to be intentionally specified to work.

The rule about pointers outside the bounds of an object is that you're
allowed to generate a pointer one past the end of an array, for purposes
of comparing it to pointers into the array, or subtracting other addresses
in the array from it to count offsets.
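
Both permitted uses can be seen in one short sketch. `index_of` is a hypothetical helper, not from the thread: the end pointer serves as the loop bound, and a subtraction between two pointers into the same array yields the offset.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first element equal to value, or n if it is
   absent. The one-past-the-end pointer is used for comparison and for
   subtraction, never for a read. */
size_t index_of(const int *first, size_t n, int value)
{
    assert(first != NULL);
    const int *end = first + n;
    const int *p = first;
    while (p != end && *p != value)   /* p != end is tested before *p */
        ++p;
    return (size_t)(p - first);       /* equals n when p reached end */
}
```

Returning `n` for "not found" mirrors the past-the-end idea: it is a valid position to name, but not one to read from.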

In C, where you got the pointer matters.

int ary[3][1] = { 0 };
int *ap_1 = (int *) ary;
int *ap_2 = (int *) ary[0];
int x;
x = ap_1[0]; // clearly well-defined
x = ap_1[1]; // well-defined, reads from ary[1][0]
x = ap_2[0]; // clearly well-defined
x = ap_2[1]; // undefined, tries to read from ary[0][1]

In short, even though ap_1 and ap_2 are the same address in memory, the
compiler is allowed to note that one of them is the address of a block of
three arrays of single integers, thus, an object of size 3*sizeof(int),
and the other is the address of an array of one integer, thus, an object
of size 1*sizeof(int).

If you want a pointer to the whole object, don't derive it from one of
the members.
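
One way to follow that advice, sketched under my own assumptions (the names `flatten_grid`, `sum_grid`, `GRID_ROWS` and `GRID_COLS` are invented here): take a pointer to the whole array with `&ary`, or simply index each dimension within its own bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { GRID_ROWS = 3, GRID_COLS = 1 };

/* Copy the complete grid into a flat buffer. `grid` points to the whole
   array object, so sizeof *grid covers all GRID_ROWS * GRID_COLS ints. */
void flatten_grid(int flat[GRID_ROWS * GRID_COLS],
                  int (*grid)[GRID_ROWS][GRID_COLS])
{
    assert(flat != NULL && grid != NULL);
    memcpy(flat, grid, sizeof *grid);
}

/* Sum the grid with ordinary nested indexing; each index stays inside
   its own dimension. */
int sum_grid(int grid[GRID_ROWS][GRID_COLS])
{
    int total = 0;
    for (size_t i = 0; i < GRID_ROWS; ++i)
        for (size_t j = 0; j < GRID_COLS; ++j)
            total += grid[i][j];
    return total;
}
```

With `int ary[3][1]`, the calls would be `flatten_grid(flat, &ary)` and `sum_grid(ary)`. Neither relies on a pointer derived from `ary[0]` reaching into `ary[1]`.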

-s
 
Tim Rentsch

Kaz Kylheku said:
There are exactly two valid values of object pointers: Either a byte in
memory, or a null pointer value - there is no dedicated past-the-end value.

In C, there is a concept of pointer validity which takes into account
/how/ that pointer was obtained. That information is not necessarily
encoded in the pointer's run time value. (Remember, in C, type
information is not also encoded in run-time values; that doesn't mean
you can violate the type system and still have a well-defined program).

Since the validity of a pointer includes how it was obtained,
merely knowing where that pointer points is not enough of an assurance
of correctness. [...]

Perhaps you would be so good as to tell the group in
which section(s) of which ISO document(s) this concept
is explained or described?
 
