Out-of-bounds Nonsense

  • Thread starter Frederick Gotham

Frederick Gotham

[ This post deals with both C and C++, but does not alienate either language
because the language feature being discussed is common to both languages. ]

Over on comp.lang.c, we've been discussing the accessing of array elements
via subscript indices which may appear to be out of range. In particular,
accesses similar to the following:

int arr[2][2];

arr[0][3] = 7;

Both the C Standard and the C++ Standard require that the four ints be
laid out in memory in ascending order with no padding in between, i.e.:

(best viewed with a monowidth font)

--------------------------------
| Offset (in ints) | Object    |
--------------------------------
|        0         | arr[0][0] |
|        1         | arr[0][1] |
|        2         | arr[1][0] |
|        3         | arr[1][1] |
--------------------------------
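For anyone who wants to check the layout claim on their own compiler, here is a minimal sketch in C. The helper names (elem_offset, layout_is_contiguous, demo) are made up for illustration and are not from the post. It only measures byte offsets through unsigned char pointers into the whole object, so it does not itself rely on the disputed arr[0][3] access.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of (*whole)[i][j] from the start of the whole int[2][2]
   object. Hypothetical helper, not from the original post. */
static size_t elem_offset(int (*whole)[2][2], int i, int j)
{
    const unsigned char *base = (const unsigned char *)whole;
    const unsigned char *elem = (const unsigned char *)&(*whole)[i][j];
    return (size_t)(elem - base);
}

/* Returns 1 if the four ints sit at consecutive int-sized offsets
   and the object has no extra padding, 0 otherwise. */
static int layout_is_contiguous(void)
{
    int arr[2][2];
    int i, j;

    if (sizeof arr != 4 * sizeof(int))
        return 0;
    for (i = 0; i < 2; ++i)
        for (j = 0; j < 2; ++j)
            if (elem_offset(&arr, i, j) != (size_t)(i * 2 + j) * sizeof(int))
                return 0;
    return 1;
}

/* Usage: call demo() from main. */
static void demo(void)
{
    assert(layout_is_contiguous());
}
```

Note that this only confirms where the objects live. Whether a pointer derived from arr[0] may be used to reach them is the separate question the thread is about.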

One can see plainly that there should be no problem with the little snippet
above because arr[0][3] should be the same as arr[1][1], but I've had people
over on comp.lang.c telling me that the behaviour of the snippet is undefined
because of an "out of bounds" array access. They've even backed this up with
a quote from the C Standard:

J.2 Undefined behavior:
The behavior is undefined in the following circumstances:
[...]
- An array subscript is out of range, even if an object is apparently
accessible with the given subscript (as in the lvalue expression
a[1][7] given the declaration int a[4][5]) (6.5.6).

Does anyone make the same claim of undefined behaviour for C++?

If it is claimed that the snippet's behaviour is undefined because the second
subscript index is out of range of the dimension, then this rationale can be
brought into doubt by the following breakdown. First let's look at the
expression statement:

arr[0][3] = 9;

The compiler, both in C and in C++, must interpret this as:

*( *(arr+0) + 3 ) = 9;

In the inner-most set of parentheses, "arr" decays to a pointer to its first
element, i.e. an R-value of the type int(*)[2]. The value 0 is then added to
this address, which has no effect. The address is then dereferenced, yielding
an L-value of the type int[2]. This expression then decays to a pointer to
its first element, yielding an R-value of the type int*. The value 3 is then
added to this address. (In terms of bytes, the address advances by
3 * sizeof(int).) This address is then dereferenced, yielding an L-value of
the type int. That int L-value is then assigned to.
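The steps in that breakdown can be checked piece by piece. The following is a rough sketch in C99 (the function name check_decay_steps and the local names are assumptions for illustration); it uses sizeof to observe the types before and after decay, and it deliberately stops at the one-past-the-end pointer of the row rather than forming the disputed address.

```c
#include <assert.h>
#include <stddef.h>

static int arr[2][2];

static void check_decay_steps(void)
{
    /* arr + 0: arr has decayed to int (*)[2], a pointer to a row. */
    int (*row_ptr)[2] = arr + 0;

    /* *row_ptr is an L-value of type int[2]; sizeof sees it undecayed. */
    assert(sizeof *row_ptr == 2 * sizeof(int));

    /* In most other contexts the int[2] decays to int*. */
    int *first = *row_ptr;
    assert(first == &arr[0][0]);
    assert(sizeof(*row_ptr + 0) == sizeof(int *));

    /* Stepping by 1 stays inside the row. Stepping by 2 gives the
       one-past-the-end pointer of the row, which may be formed but not
       dereferenced. The dispute is about going further (first + 3),
       which this sketch does not do. */
    assert(first + 1 == &arr[0][1]);
    assert((size_t)((char *)(first + 2) - (char *)first) == 2 * sizeof(int));
}
```

The sketch shows that the int[2] type is still visible at the moment of decay, which is the point where the two sides of the argument part ways.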

The only thing that sounds a little dodgy in the above paragraph is that an
L-value of the type int[2] is used as a stepping stone to access an element
whose index is greater than 1 -- but this shouldn't be a problem, because the
L-value decays to a simple R-value int pointer prior to the accessing of the
int object, so any dimension info should be lost by then.

To the C++ programmers: Is the snippet viewed as invoking undefined
behaviour? If so, why?

To the C programmers: How can you rationalise the assertion that it actually
does invoke undefined behaviour?

I'd like to remind both camps that, in other places, we're free to use our
memory however we please (given that it's suitably aligned, of course). For
instance, look at the following. The code is an absolute dog's dinner, but it
should work perfectly on all implementations:

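#include <assert.h>
#include <stdlib.h>

void Output(int); /* Defined elsewhere */

int main(void)
{
    assert( sizeof(double) > sizeof(int) );

    { /* Start */

        double *p;
        int *q;
        char unsigned const *pover;
        char unsigned const *ptr;

        p = malloc(5 * sizeof *p);
        if (!p) return EXIT_FAILURE;   /* allocation failed */
        q = (int*)p++;                 /* q: first double's storage; p: second double */
        pover = (char unsigned*)(p+4); /* one past the end of the allocation */
        ptr = (char unsigned*)p;
        p[3] = 2423.234;               /* the last of the five doubles */
        *q++ = -9;                     /* store an int in the first double's storage */

        /* Walk the remaining four doubles byte by byte */
        do Output(*ptr++);
        while (pover != ptr);

        return 0;

    } /* End */
}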

Another thing I would remind both camps of is that we can access any memory
as if it were simply an array of unsigned chars. That means we can access an
"int[2][2]" as if it were simply an object of the type
"char unsigned[sizeof(int[2][2])]".
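As a small illustration of that byte-wise view, here is a sketch in C that copies an int[2][2] through unsigned char pointers spanning the whole object. The names copy_bytes, roundtrip_ok and demo are hypothetical; the loop does by hand what memcpy would do.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n bytes one unsigned char at a time. */
static void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    for (i = 0; i < n; ++i)
        d[i] = s[i];
}

/* Returns 1 if all four ints survive the byte-wise copy. */
static int roundtrip_ok(void)
{
    int a[2][2] = { {1, 2}, {3, 4} };
    int b[2][2] = { {0, 0}, {0, 0} };

    copy_bytes(&b, &a, sizeof a);
    return b[0][0] == 1 && b[0][1] == 2 && b[1][0] == 3 && b[1][1] == 4;
}

/* Usage: call demo() from main. */
static void demo(void)
{
    assert(roundtrip_ok());
}
```

The byte pointers here are derived from the address of the whole array, so they range over all sizeof(int[2][2]) bytes without touching the question of int* arithmetic across rows.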

The reason I'm writing this is that, at the moment, it sounds like absolute
nonsense to me that the original snippet's behaviour is undefined, and so I
challenge those who support its alleged undefinedness.

I leave you with this:

int arr[2][2];

void *const pv = &arr;

int *const pi = (int*)pv; /* Cast used for C++ programmers! */

pi[3] = 8;
 

Kai-Uwe Bux

Frederick said:
Does anyone make the same claim of undefined behaviour for C++?

I think I have seen those claims in this newsgroup with regard to C++.

If it is claimed that the snippet's behaviour is undefined because the
second subscript index is out of range of the dimension, then this
rationale can be brought into doubt by the following breakdown. First
let's look at the expression statement:

arr[0][3] = 9;

The compiler, both in C and in C++, must interpret this as:

*( *(arr+0) + 3 ) = 9;

In the inner-most set of parentheses, "arr" decays to a pointer to its
first element, i.e. an R-value of the type int(*)[2]. The value 0 is then
added to this address, which has no effect. The address is then
dereferenced, yielding an L-value of the type int[2]. This expression then
decays to a pointer to its first element, yielding an R-value of the type
int*. The value 3 is then added to this address. (In terms of bytes, the
address advances by 3 * sizeof(int).) This address is then dereferenced,
yielding an L-value of the type int. That int L-value is then assigned to.

The only thing that sounds a little dodgy in the above paragraph is that
an L-value of the type int[2] is used as a stepping stone to access an
element whose index is greater than 1 -- but this shouldn't be a problem,

I think it might be.

because the L-value decays to a simple R-value int pointer prior to the
accessing of the int object, so any dimension info should be lost by then.

Why is it necessarily true that the array decays to a "simple" int
pointer? Do you have a clause in the standard for this? Moreover, what is
so "simple" about pointers anyway? I think the standard allows for what I
like to call "decorated pointers" that have type and bounds information
attached to them, i.e., a pointer obtained from an int[2] could have
bounds information built in that would trigger a segfault on an
out-of-bounds access. In that case, the simple int* you mention could
remember the bounds of the array from which it was obtained. Where in the
standard are the provisions that prevent this type of overzealous
bounds-checking?
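To make the "decorated pointer" idea concrete, here is a rough model in C. Everything in it (struct fat_int_ptr, decay_row, fat_add, fat_can_deref, demo) is hypothetical and is not a description of how any real compiler represents int*; it only shows how a pointer could carry the bounds of the int[2] it was obtained from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "decorated" int pointer: an address plus the bounds of
   the array it was derived from. Purely an illustration. */
struct fat_int_ptr {
    int *base;     /* first element of the originating array */
    size_t count;  /* number of elements in that array */
    size_t index;  /* current position relative to base */
};

/* Decay of an int[2] row into a decorated pointer. */
static struct fat_int_ptr decay_row(int (*row)[2])
{
    struct fat_int_ptr p;
    p.base = *row;
    p.count = 2;
    p.index = 0;
    return p;
}

/* Pointer addition keeps the bounds it started with. */
static struct fat_int_ptr fat_add(struct fat_int_ptr p, size_t n)
{
    p.index += n;
    return p;
}

/* Returns 1 if dereferencing p would stay inside the original row. */
static int fat_can_deref(struct fat_int_ptr p)
{
    return p.index < p.count;
}

/* Usage: call demo() from main. */
static void demo(void)
{
    int arr[2][2] = { {0, 0}, {0, 0} };
    struct fat_int_ptr p = decay_row(&arr[0]);

    assert(fat_can_deref(fat_add(p, 1)));   /* models arr[0][1] */
    assert(!fat_can_deref(fat_add(p, 3)));  /* models arr[0][3] */
}
```

Under this model the pointer for arr[0][3] refers to memory that exists inside arr, yet a checking implementation would still reject the access because the bounds were fixed when the row decayed.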

To the C++ programmers: Is the snippet viewed as invoking undefined
behaviour? If so, why?

Because you cannot deduce its behavior from the guarantees made by the
standard? I just note that you did not put any references into your
reasoning. That makes it very hard to check whether the standard actually
guarantees the things you need. Given that there is a prima facie
out-of-bounds access, I think you carry the burden of proof.

To the C programmers: How can you rationalise the assertion that it
actually does invoke undefined behaviour?

I have no idea about C. Sorry.

[snip]
 

Frederick Gotham

Whatever happened to the idea of contiguous memory? When I define the
following object:

int arr[2][2];

the type of the object "arr" is int[2][2].

It consists of four int objects which are laid out contiguously in memory.

Therefore, if we take the address of the first int, why can't we add to that
address to yield the addresses of the ints which are directly after it in
contiguous memory? Isn't that one of the fundamental faculties of pointers?
 

Victor Bazarov

Frederick said:
What ever happened to the idea of contiguous memory? When I define the
following object:

int arr[2][2];

, the type of the object "arr" is: int[2][2]

It consists of four int objects which are laid out contiguously in
memory.

Therefore, if we take the address of the first int, why can't we add
to that address to yield the addresses of the ints which are
directly after it in contiguous memory? Isn't that one of the
fundamental faculties of pointers?

I think the conflict here is between the habits of [some] programmers
and what the Standard actually can *guarantee*. To interpret an array
of 2 arrays of 2 ints (int[2][2]) as a single array of 4 ints (which
have the same memory layouts, supposedly), you need to use a cast (and
a nasty one, reinterpret_cast). It's fine (on most platforms), but
since there can exist platforms on which it isn't OK, the Standard,
trying to be as generic as possible, cannot define the behaviour
(doing so would prohibit a C++ implementation from existing on such
[rare] platforms) and chooses to leave the behaviour undefined.

Again, nothing is there on most implementations and hardware platforms
to stop you from doing

int *p = &arr[0][0];
int &arr_1_1 = *(p + 3);

except that in standard terms it's UB.
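For code that wants to stay clear of the question altogether, one option (a sketch only, with made-up names ROWS, COLS, flat, and cell) is to declare a single flat array and compute the index by hand, so that every index from 0 to 3 is in range of one array object:

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 2, COLS = 2 };

/* One flat array, so every index from 0 to ROWS*COLS-1 is in range. */
static int flat[ROWS * COLS];

/* Address of the element at (row, col), using row-major indexing. */
static int *cell(size_t row, size_t col)
{
    assert(row < (size_t)ROWS && col < (size_t)COLS);
    return &flat[row * COLS + col];
}
```

With this layout, *cell(1, 1) and flat[3] name the same int, and no pointer ever steps outside the array it was derived from.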

Do we really need to keep going about it?

V
 

Frederick Gotham

Victor Bazarov:
I think the conflict here is between the habits of [some] programmers
and what the Standard actually can *guarantee*. To interpret an array
of 2 arrays of 2 ints (int[2][2]) as a single array of 4 ints (which
have the same memory layouts, supposedly),


The Standard necessitates that they have the same layout.

A multi-dimensional array is merely an array of arrays. An array may have
no padding at its start or end, or between elements.

Therefore, even if we have an array of arrays of arrays of arrays of
arrays, all objects must be directly after one another with no padding in
between.

you need to use a cast (and
a nasty one, reinterpret_cast).


Indeed, one could write:

int arr[2][2];

int (&b)[4] = reinterpret_cast<int(&)[4]>(arr);

b[0] = 1;
b[1] = 2;
b[2] = 3;
b[3] = 4;

It's fine (on most platforms), but
since there can exist platforms on which it isn't OK, the Standard,
trying to be as generic as possible, cannot define the behaviour
(doing so would prohibit a C++ implementation from existing on such
[rare] platforms) and chooses to leave the behaviour undefined.

Again, nothing is there on most implementations and hardware platforms
to stop you from doing

int *p = &arr[0][0];
int &arr_1_1 = *(p + 3);

except that in standard terms it's UB.

Do we really need to keep going about it?


Yes, because I think it's bullshit, and I think the Standard needs to
change.
 

Victor Bazarov

Frederick said:
Victor Bazarov:
[..]
Do we really need to keep going about it?


Yes, because I think it's bullshit, and I think the Standard needs to
change.

Then go argue your case in comp.std.c++. Otherwise it's a waste of
bandwidth.
 

Michiel.Salters

Victor said:
int arr[2][2];
nothing is there on most implementations and hardware platforms
to stop you from doing

int *p = &arr[0][0];
int &arr_1_1 = *(p + 3);

except that in standard terms it's UB.

Is it? After all, int is a POD, and so is int[2][2]. I think that will
make it defined. But the wording that defines the behaviour is
POD-specific and won't work for std::string.

Regards,
Michiel Salters
 

Kai-Uwe Bux

Michiel said:
Is it? After all, int is a POD, and so is int[2][2]. I think that will
make it defined. But the wording that defines the behaviour is
POD-specific and won't work for std::string.

Could you provide chapter and verse for the language that saves the day for
PODs?


Best

Kai-Uwe Bux
 
