Array subscripting overflow behaviour

Netocrat

The code at the bottom illustrates the nature of a situation discussed on
comp.lang.asm.x86. Basically an unsigned integer is being used as a
negative index in a loop, and it works as though the integer were signed.
I want to confirm my interpretation of this behaviour as there have been
differing understandings of why it works.

On my system (P4 using gcc under linux), the results are as follows:

len before sign swap : 9
shlen before sign swap : 9
l1 before addition : 0x8048644
l1[0] before addition : a
l1 after addition : 0x804864d
len after sign swap : 4294967287
&l1[len] : 0x8048644
l1[len] : a
Looping using len
l1[len==4294967287]: a
l1[len==4294967288]: b
l1[len==4294967289]: c
l1[len==4294967290]: d
l1[len==4294967291]: e
l1[len==4294967292]: f
l1[len==4294967293]: g
l1[len==4294967294]: h
l1[len==4294967295]: i
shlen after sign swap : 65527
Segmentation fault

I expected those results for shlen, but I didn't expect that the access
through len would work.

I have looked at the draft C89 and C99 standards and they don't seem to
say anything about the interpretation of indexes except that they must be
integers (6.5.2.1 Array subscripting). I assume that sign is therefore
interpreted according to the type of the integer and no implicit cast is
applied.

So my interpretation according to the standard - given that
sizeof(unsigned int) is 4 and sizeof(char *) is also 4 in this case - is
as follows:

The code len = -len; sets len to (unsigned integer)-9 == 4294967287. The
output confirms this. Then &l1[len] expands to:
l1 + len * sizeof(char)
== l1 + 4294967287
Given that l1 is greater than 9, this should overflow.
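For the unsigned negation itself this is well defined: with a 32-bit
unsigned int, -9 is reduced modulo 2^32, i.e. to 4294967296 - 9 ==
4294967287. The pointer addition is another matter.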
From C89 Draft, 3.3.6 Additive operators:
"As with any other arithmetic overflow, if the result does not fit
in the space provided, the behavior is undefined." [referring to pointer
addition]

So the fact that the code happens to work in this case is not by virtue of
the standard as I interpret it. It is officially undefined behaviour.
Is this interpretation correct?

Code is below
--------------

#include <stdio.h>

int main(void)
{
    char *l1 = "abcdefghi";
    unsigned int len = 9;
    unsigned short shlen = len;

    printf("len before sign swap : %u\n", len);
    printf("shlen before sign swap : %hu\n", shlen);
    printf("l1 before addition : %p\n", (void *)l1);
    printf("l1[0] before addition : %c\n", l1[0]);
    l1 += len;      /* now points at the terminating '\0' */
    len = -len;     /* wraps to 4294967287 (UINT_MAX - 8) */
    shlen = -shlen; /* wraps to 65527 (USHRT_MAX - 8) */
    printf("l1 after addition : %p\n", (void *)l1);
    printf("len after sign swap : %u\n", len);
    printf("&l1[len] : %p\n", (void *)&l1[len]);
    printf("l1[len] : %c\n", l1[len]);
    printf("Looping using len\n");
    while (len != 0) {
        printf("l1[len==%u]: %c\n", len, l1[len]);
        len++;
    }
    printf("shlen after sign swap : %hu\n", shlen);
    printf("l1[shlen] : %c\n", l1[shlen]);
    printf("&l1[shlen] : %p\n", (void *)&l1[shlen]);
    printf("Looping using shlen\n");
    while (shlen != 0) {
        printf("l1[shlen==%hu]: %c\n", shlen, l1[shlen]);
        shlen++;
    }
    return 0;
}
 
Walter Roberson

Netocrat wrote:
> Basically an unsigned integer is being used as a
> negative index in a loop, and it works as though the integer were signed.
> The code len = -len; sets len to (unsigned integer)-9 == 4294967287. The
> output confirms this. Then &l1[len] expands to:
> l1 + len * sizeof(char)
> == l1 + 4294967287
> Given that l1 is greater than 9, this should overflow.

There is a subtle point here that you overlooked in your analysis.
Your conclusion looks right to me, but the path to it might be slightly
different.

You wrote that l1 + len * sizeof(char) is l1 + 4294967287,
but that's not exactly right. sizeof() has a result which is size_t
and that might be wider than int -- might, for example, be
64 bits. len * sizeof(char) would thus involve a silent type
conversion of len (an unsigned int) to a wider integral type
(e.g., a 64-bit long). As len is unsigned that isn't going to involve
sign propagation -- e.g., it wouldn't end up as 0xfffffffffffffff7
but rather as 0x00000000fffffff7. When l1 is added to that, l1 would
be widened to size_t as well. The result is going to be outside of the
object, but might be within the virtual memory allocated to the user,
so detection is not certain.
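To illustrate the extension behaviour (a minimal sketch; it assumes a
C99 compiler with a 32-bit unsigned int and a 64-bit unsigned long long):

#include <stdio.h>

int main(void)
{
    unsigned int u = -9;              /* wraps to 4294967287 (0xfffffff7) */
    unsigned long long widened = u;   /* value-preserving: zero-extended */
    long long from_signed = -9;       /* sign-extended instead */

    printf("widened     : 0x%016llx\n", widened);  /* 0x00000000fffffff7 */
    printf("from_signed : 0x%016llx\n",
           (unsigned long long)from_signed);       /* 0xfffffffffffffff7 */
    return 0;
}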

But that analysis assumes that there really is a multiplication
by sizeof(char) with the subsequent type widening. I'd want to have
a close look at the wording of the standard before saying whether that
is how it is defined.
 
pete

Netocrat wrote:
> So the fact that the code happens
> to work in this case is not by virtue of
> the standard as I interpret it. It is officially undefined behaviour.
> Is this interpretation correct?

Yes.

Some compilers have an option to output assembly source code
from a C source input. If your compiler supports that option,
then that can give you an easy way to turn a multi-file C program
into a mixed C and assembly program,
if you really like assembly that much.
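For example, with gcc (the file names here are made up):

gcc -S part1.c        # emits part1.s
gcc part1.s part2.c   # assembles part1.s and links it with the C file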
 
Netocrat


Cheers.

pete wrote:
> Some compilers have an option to output assembly source code from a C
> source input. If your compiler supports that option, then that can give
> you an easy way to turn a multi-file C program into a mixed C and
> assembly program, if you really like assembly that much.

Yes, I've been using gcc's -S option.
 
Eric Sosman

Netocrat said:
> The code at the bottom illustrates the nature of a situation discussed on
> comp.lang.asm.x86. Basically an unsigned integer is being used as a
> negative index in a loop, and it works as though the integer were signed.
> I want to confirm my interpretation of this behaviour as there have been
> differing understandings of why it works.
> [...]
> So the fact that the code happens to work in this case is not by virtue of
> the standard as I interpret it. It is officially undefined behaviour.
> Is this interpretation correct?

Yes. "It just happened to work" is a possible outcome
of undefined behavior.

The reason "it just happened to work" is (presumably)
because pointers and unsigned integers on your system have
the same width, and pointers are themselves really just
unsigned byte numbers. The wrap-around of the sum exactly
compensated for the earlier wrap-around of the index.

The difficulties you ran into with a 16-bit index (when
index-wrap and pointer-wrap had different values) are similar
to those you might encounter with 64-bit pointers and 32-bit
indices.
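Worked through in 32-bit arithmetic (a sketch; the addresses are just
the ones from your output, and flat byte addressing is assumed):

#include <stdio.h>

int main(void)
{
    unsigned int l1_start = 0x8048644u;     /* l1 before the addition */
    unsigned int l1_end   = l1_start + 9u;  /* l1 after l1 += len */

    /* the 32-bit index wrap exactly cancels the 32-bit address wrap */
    printf("0x%x\n", l1_end + 4294967287u); /* prints 0x8048644 */

    /* the 16-bit wrap does not: 65527 is just a large positive offset */
    printf("0x%x\n", l1_end + 65527u);      /* prints 0x8058644 */
    return 0;
}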

You'd probably get a different sort of error on systems
where pointers have more internal structure, e.g. IBM's AS/400.
 
Netocrat

Walter Roberson wrote:
> Netocrat said:
>> Basically an unsigned integer is being used as a negative index in a
>> loop, and it works as though the integer were signed.
>> The code len = -len; sets len to (unsigned integer)-9 == 4294967287.
>> The output confirms this. Then &l1[len] expands to: l1 + len *
>> sizeof(char)
>> == l1 + 4294967287
>> Given that l1 is greater than 9, this should overflow.
>
> There is a subtle point here that you overlooked in your analysis. Your
> conclusion looks right to me, but the path to it might be slightly
> different.
>
> You wrote that l1 + len * sizeof(char) is l1 + 4294967287 but that's not
> exactly right.

Well, it's a conceptual if not an actual equivalence... i.e. I don't know
that the sizeof operator is actually applied in the same sense that it
would be if written in code - with a specific result type. The most
specific wording I could find in the standard is in an example (C89
Draft; 3.3.2.1 Array subscripting) - it refers to determining "the size
of the object to which the pointer points".
> sizeof() has a result which is size_t and that might be
> wider than int -- might, for example, be 64 bits. len *
> sizeof(char) would thus involve a silent type conversion of len (an
> unsigned int) to a wider integral type (e.g., a 64-bit long). As len is
> unsigned that isn't going to involve sign propagation -- e.g., it wouldn't
> end up as 0xfffffffffffffff7 but rather as 0x00000000fffffff7. When l1 is
> added to that, l1 would be widened to size_t as well. The result is going
> to be outside of the object, but might be within the virtual memory
> allocated to the user, so detection is not certain.

Yes, that would be true if the sizeof() operator is applied in the same
sense as when explicitly used in code. In this case, though, on my
platform, size_t is also 4 bytes - in fact that's the initial type of the
variable that raised this discussion.
> But that analysis assumes that there really is a multiplication by
> sizeof(char) with the subsequent type widening. I'd want to have a close
> look at the wording of the standard before saying whether that is how it
> is defined.

I've looked at the draft but it isn't particularly clear.
 
Netocrat

Eric Sosman wrote:
> Netocrat said:
>> The code at the bottom illustrates the nature of a situation discussed
>> on comp.lang.asm.x86. Basically an unsigned integer is being used as a
>> negative index in a loop, and it works as though the integer were
>> signed. I want to confirm my interpretation of this behaviour as there
>> have been differing understandings of why it works. [...]
>> So the fact that the code happens to work in this case is not by virtue
>> of the standard as I interpret it. It is officially undefined
>> behaviour. Is this interpretation correct?
>
> Yes. "It just happened to work" is a possible outcome
> of undefined behavior.

That's what I thought.
The reason "it just happened to work" is (presumably)
because pointers and unsigned integers on your system have the same width,
and pointers are themselves really just unsigned byte numbers. The
wrap-around of the sum exactly compensated for the earlier wrap-around of
the index.

I believe that's what's happening. The person whose code triggered this
discussion (the pointer was actually int * not char *) described what was
going on thus:

    It is no coincidence that x86 add & subtract instructions are agnostic
    to sign. Two's complement forms the basis of x86 processors (and
    obviously all others) because the same adder which does unsigned
    arithmetic does signed arithmetic too. Subtracting 0 - 1 produces the
    same result whether signed or unsigned. Adding -1*4 to the pointer value
    likewise produces the same result whether -1 is -1 or 4294957295
    (0xFFFFFFFF). Due to arithmetic wrap-around, adding 0x100000000 is the
    same as adding 0 since the result has only 32 bits. Therefore adding
    0xFFFFFFFF and then adding 1 should be a no-op. That makes 0xFFFFFFFF
    semantically identical to -1 which is why they are the same value.

Apart from the minor error in the 6th digit of the decimal representation
of 0xFFFFFFFF, and the incorrect description of 2's complement as
universal, I can find no fault with this description. I just wanted to
confirm that it isn't guaranteed behaviour.
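As a quick check of the "semantically identical to -1" claim (a minimal
sketch, assuming a 32-bit unsigned int):

#include <stdio.h>

int main(void)
{
    unsigned int x = 100;

    printf("%u\n", x + 0xFFFFFFFFu);      /* 99: same as subtracting 1 */
    printf("%u\n", x + 0xFFFFFFFFu + 1u); /* 100: the pair is a no-op */
    return 0;
}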
> The difficulties you ran into with a 16-bit index (when
> index-wrap and pointer-wrap had different values) are similar to those you
> might encounter with 64-bit pointers and 32-bit indices.

I think you're right there.
 
Lawrence Kirby

Netocrat wrote:
> Basically an unsigned integer is being used as a
> negative index in a loop, and it works as though the integer were signed.
> The code len = -len; sets len to (unsigned integer)-9 == 4294967287. The
> output confirms this. Then &l1[len] expands to:
> l1 + len * sizeof(char)
> == l1 + 4294967287
> Given that l1 is greater than 9, this should overflow.

It is accessing outside the bounds of the array, which has undefined
behaviour. So anything can happen; what you saw is just one possible
example of "anything".
Walter Roberson wrote:
> There is a subtle point here that you overlooked in your analysis.
> Your conclusion looks right to me, but the path to it might be slightly
> different.
>
> You wrote that l1 + len * sizeof(char) is l1 + 4294967287
> but that's not exactly right. sizeof() has a result which is size_t
> and that might be wider than int -- might, for example, be
> 64 bits. len * sizeof(char) would thus involve a silent type
> conversion of len (an unsigned int) to a wider integral type
> (e.g., a 64-bit long). As len is unsigned that isn't going to involve
> sign propagation -- e.g., it wouldn't end up as 0xfffffffffffffff7
> but rather as 0x00000000fffffff7. When l1 is added to that, l1 would
> be widened to size_t as well. The result is going to be outside of the
> object, but might be within the virtual memory allocated to the user,
> so detection is not certain.

This is a red herring: &l1[len] does NOT expand to l1 + len *
sizeof(char). At the C source level l1[len] is equivalent to *(l1 + len),
so &l1[len] is just l1 + len. How the compiler evaluates it internally is
not related to C's integer types; in particular, size_t is not involved
at all.
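A minimal illustration of that source-level equivalence (nothing here
depends on size_t):

#include <stdio.h>

int main(void)
{
    int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int i = 3;

    /* all three spellings denote the same element */
    printf("%d %d %d\n", a[i], *(a + i), i[a]);  /* prints "3 3 3" */
    return 0;
}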

Given

int a[10];

then in the expression

a[i]

i is a value (a variable, the result of an expression) with integer type.
The actual integer type of i doesn't matter beyond establishing what the
value of i is. It might be char, it might be long long, but whatever it is
it must have a value between 0 and 9 inclusive. If it does, then a[i] will
select the appropriate element of a using whatever magic the compiler
chooses; if it doesn't then you have undefined behaviour and nothing
further can be said about it.

If you're looking at the object code produced by the compiler, i.e. what
"magic" it uses, then you will probably see it scaling the index value
according to the size of the element type. However this isn't done in the
context of C code, i.e. it has nothing to do with C's types. All that
matters is that the code selects the correct element given values 0-9. It
doesn't need to worry about other values; the code generated may do
something recognisable with those by accident. If we change the expression
to

(a+10)[i]

then all that changes is that the valid range of values for i is now
-10 to -1. (signed char)-1, (int)-1 and (long long)-1 are all perfectly
legitimate here and equivalent. If i is unsigned it can never contain a
valid value and the expression is always undefined.
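For instance (a sketch of that valid negative range):

#include <stdio.h>

int main(void)
{
    int a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int *p = a + 10;    /* one past the end: valid to compute */

    /* valid subscripts for p run from -10 to -1 */
    printf("%d %d\n", p[-10], p[-1]);   /* prints "0 9" */
    return 0;
}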

Lawrence
 
