Re: Pointer Arithmetic & UB

Discussion in 'C Programming' started by James Kuyper, Dec 10, 2012.

1. James Kuyper (Guest)

On 12/10/2012 12:14 PM, Edward Rutherford wrote:
> Hello
>
> Would the following code invoke an undefined behavior?

"invoke" is a bad term to use for this purpose; it implies that there's
some particular kind of behavior which is called "undefined behavior".
You should ask "Would the following code have undefined behavior?"

> char a[10];
> size_t i=20,j=15;
> *(a+i-j)=42;
>
> It potentially constructs the invalid pointer a+i as an intermediate
> value. But overall the access is inbounds.

Yes, it does have undefined behavior.
To make this seem more reasonable, consider a platform with the
following real-world characteristics: there are registers specialized
for storing addresses, and when an invalid address is stored in one of
those registers, the current process aborts immediately, as a safety
measure - it doesn't wait for the invalid address to be used. On such a
platform, a conforming implementation could translate your code so
that 'a' is allocated near the end of a block of valid memory addresses,
so that adding 20 to a gives an invalid address. It could generate
instructions that compute a+i in an address register before subtracting j from
it. Execution of those instructions would result in the register
being loaded with an invalid address, and the process aborting.
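
A minimal sketch of the safe spelling, keeping the names a, i and j from the question. The helper store_in_bounds and its parameter n are made up for illustration. The point is only that the index is computed in size_t arithmetic first, so no pointer beyond the array is ever formed:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the original post: stores 42 at a[i-j]
   without ever forming the pointer a+i. */
static void store_in_bounds(char *a, size_t n, size_t i, size_t j)
{
    /* The index is computed first, purely in size_t arithmetic. */
    size_t k = i - j;

    assert(k < n);   /* the final index must lie inside the array */
    a[k] = 42;       /* same element as *(a + (i - j)) */

    /* By contrast, *(a + i - j) = 42; would compute a + i first,
       which has undefined behavior whenever i > n. */
}
```

With char a[10], calling store_in_bounds(a, 10, 20, 15) writes to a[5].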

James Kuyper, Dec 10, 2012

2. Eric Sosman (Guest)

On 12/10/2012 12:55 PM, James Kuyper wrote:
> On 12/10/2012 12:14 PM, Edward Rutherford wrote:
>> Hello
>>
>> Would the following code invoke an undefined behavior?

>
> "invoke" is a bad term to use for this purpose; it implies that there's
> some particular kind of behavior which is called "undefined behavior".
> You should ask "Would the following code have undefined behavior?"
>
>> char a[10];
>> size_t i=20,j=15;
>> *(a+i-j)=42;
>>
>> It potentially constructs the invalid pointer a+i as an intermediate
>> value. But overall the access is inbounds.

>
> Yes, it does have undefined behavior.
> To make this seem more reasonable, consider a platform with the
> following real-world characteristics: [...]

A colleague who did some work on IBM's AS/400 (they've
changed the name; I forget the new one) told me that simply
trying to calculate an out-of-range pointer yielded a null
pointer as a result. In the O.P.'s case, the intermediate
steps would go something like

a // OK so far
a + i // too big: result = NULL
NULL - j // not sure, but surely not good
*(NULL - j) // really Really REALLY not good

--
Eric Sosman

Eric Sosman, Dec 10, 2012

3. Edward A. Falk (Guest)

In article <ka59e6\$mq7\$>,
Eric Sosman <> wrote:
>
> A colleague who did some work on IBM's AS/400 (they've
>changed the name; I forget the new one) told me that simply
>trying to calculate an out-of-range pointer yielded a null
>pointer as a result.

Heh; learn something new every day. I never would have guessed
that there was an actual architecture that would blow up with
this construct.

I assume that *(a+(i-j)) would be ok?

--
-Ed Falk,
http://thespamdiaries.blogspot.com/

Edward A. Falk, Dec 11, 2012
4. James Kuyper (Guest)

Context:
char a[10];
size_t i=20,j=15;
*(a+i-j)=42;

On 12/10/2012 07:46 PM, Edward A. Falk wrote:
...
> Heh; learn something new every day. I never would have guessed
> that there was an actual architecture that would blow up with
> this construct.
>
> I assume that *(a+(i-j)) would be ok?

That should be safe for all conforming implementations of C.
--
James Kuyper

James Kuyper, Dec 11, 2012
5. Noob (Guest)

Edward A. Falk wrote:

> I assume that *(a+(i-j)) would be ok?

Please correct me if I am wrong,

*(a+(i-j)) is strictly equivalent to a[i-j]

(I find the latter clearer.)
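
A small sketch of that equivalence, assuming i-j is a valid index into the array. The function name same_element is hypothetical:

```c
#include <stddef.h>

/* Returns 1 if the subscript form and the explicit pointer-arithmetic
   form designate the same element. */
static int same_element(char *a, size_t i, size_t j)
{
    char *p = &a[i - j];         /* subscript form */
    char *q = &*(a + (i - j));   /* explicit pointer arithmetic */

    return p == q;
}
```

The language defines E1[E2] as (*((E1)+(E2))), so the two spellings differ only in readability.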

Noob, Dec 11, 2012
6. Eric Sosman (Guest)

On 12/10/2012 7:46 PM, Edward A. Falk wrote:
> In article <ka59e6\$mq7\$>,
> Eric Sosman <> wrote:
>>
>> A colleague who did some work on IBM's AS/400 (they've
>> changed the name; I forget the new one) told me that simply
>> trying to calculate an out-of-range pointer yielded a null
>> pointer as a result.

>
> Heh; learn something new every day. I never would have guessed
> that there was an actual architecture that would blow up with
> this construct.
>
> I assume that *(a+(i-j)) would be ok?

Assuming `i-j' in range, yes.

More on my colleague's tale: The code maintained a buffer
in which items of various sizes accumulated, and which drained
to disk when it got too full or too old. To decide whether a
newly-offered item would fit, the code did something like

itemEndPtr = nextBufferSpacePtr + itemSize;
if (itemEndPtr < bufferStart + bufferSize) ...

This worked as intended on all the other target systems, but
failed on AS/400. I suspect the failure had something to do
with the fact that the buffer was in a shared memory area, so
stepping off the end also meant stepping outside of mapped
address space; the problem might not have shown up with the
`auto' array in your example.
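
One way such a check could be written without ever forming a pointer past the end of the buffer is to compare sizes instead of pointers. This is only a sketch: the variable names follow the fragment above, but the function item_fits is a hypothetical rewrite, not the colleague's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Returns nonzero if an item of itemSize bytes fits in what is left
   of the buffer. */
static int item_fits(const char *bufferStart, size_t bufferSize,
                     const char *nextBufferSpacePtr, size_t itemSize)
{
    /* Both pointers lie within the buffer (or one past its end), so
       this subtraction is well defined. */
    size_t used = (size_t)(nextBufferSpacePtr - bufferStart);

    assert(used <= bufferSize);

    /* Compare sizes; nextBufferSpacePtr + itemSize is never computed.
       The strict '<' mirrors the comparison in the fragment above. */
    return itemSize < bufferSize - used;
}
```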

Still, perhaps a salutary lesson for the folks who still
believe "All the world's a VAX^H^H^Hx86^H^H^Hx64^H^H^H..."

--
Eric Sosman

Eric Sosman, Dec 11, 2012
7. glen herrmannsfeldt (Guest)

Eric Sosman <> wrote:

(previous snip on pointer offsets)

>>> A colleague who did some work on IBM's AS/400 (they've
>>> changed the name; I forget the new one) told me that simply
>>> trying to calculate an out-of-range pointer yielded a null
>>> pointer as a result.

>> Heh; learn something new every day. I never would have guessed
>> that there was an actual architecture that would blow up with
>> this construct.

>> I assume that *(a+(i-j)) would be ok?

> Assuming `i-j' in range, yes.

> More on my colleague's tale: The code maintained a buffer
> in which items of various sizes accumulated, and which drained
> to disk when it got too full or too old. To decide whether a
> newly-offered item would fit, the code did something like

> itemEndPtr = nextBufferSpacePtr + itemSize;
> if (itemEndPtr < bufferStart + bufferSize) ...

Might fail in x86 (especially the 80286) in huge model.

You can't load arbitrary data into segment selector registers
in protected mode x86. In large model, though, any offset isn't
tested until an actual access is attempted. (The offset is in
an ordinary register, such as AX.)

In huge model, the system allocates a series of segments,
such that one can address through them in order.

Still, I believe that the compilers are careful not to load
a segment selector until needed to actually access something,
maybe partly to allow such faulty C code.

> This worked as intended on all the other target systems, but
> failed on AS/400. I suspect the failure had something to do
> with the fact that the buffer was in a shared memory area, so
> stepping off the end also meant stepping outside of mapped
> address space; the problem might not have shown up with the
> `auto' array in your example.

I believe that could happen with protected mode x86, too.

> Still, perhaps a salutary lesson for the folks who still
> believe "All the world's a VAX^H^H^Hx86^H^H^Hx64^H^H^H..."

In the 80286 days, I had OS/2 1.0 and then 1.2 running. Instead of using
malloc(), I would directly allocate segments from OS/2 of exactly
the needed length. The hardware will then interrupt for an access,
even read, either before or just after the end of the allocated
space. (Unless the register wraps, and it is back into the
allocated space again.)

As usual in C, a 2D array was allocated as an array of pointers,
each pointing to its own OS/2 allocated segment.

Fortunately, the C compilers were always good at not using segment
selector registers when copying pointers that might not point to
anything.

I don't know AS/400 that well, but there have been systems that relied
on the compiler to generate the appropriate code, instead of run-time
memory protection. I believe some Burroughs ALGOL systems worked that
way. (Maybe still do.)

As far as I know, they never had a C compiler, but if one did it might
also have problems with out of range pointers.

-- glen

glen herrmannsfeldt, Dec 11, 2012
8. Ken Brody (Guest)

On 12/10/2012 7:46 PM, Edward A. Falk wrote:
> In article <ka59e6\$mq7\$>,
> Eric Sosman <> wrote:
>>
>> A colleague who did some work on IBM's AS/400 (they've
>> changed the name; I forget the new one) told me that simply
>> trying to calculate an out-of-range pointer yielded a null
>> pointer as a result.

>
> Heh; learn something new every day. I never would have guessed
> that there was an actual architecture that would blow up with
> this construct.
>
> I assume that *(a+(i-j)) would be ok?

No. There is no requirement that the value of "i-j" be calculated prior to
adding it to "a". (Check the numerous threads here involving using
parentheses to "fix" UB in things involving such constructs as "i + (i++)".)
Operator precedence only guarantees how the expression is to be
interpreted, not the actual order of evaluation.

Ken Brody, Dec 12, 2012
9. Ken Brody (Guest)

On 12/10/2012 9:28 PM, James Kuyper wrote:
> Context:
> char a[10];
> size_t i=20,j=15;
> *(a+i-j)=42;
>
> On 12/10/2012 07:46 PM, Edward A. Falk wrote:
> ...
>> Heh; learn something new every day. I never would have guessed
>> that there was an actual architecture that would blow up with
>> this construct.
>>
>> I assume that *(a+(i-j)) would be ok?

>
> That should be safe for all conforming implementations of C.

Are you sure? Does anything in the Standard *require* that "i-j" be
evaluated prior to adding it to "a"?

Haven't we had this discussion earlier, related to other forms of UB, with
the questioner asking if adding parentheses would "fix" the problem?

Ken Brody, Dec 12, 2012
10. Keith Thompson (Guest)

Ken Brody <> writes:
> On 12/10/2012 7:46 PM, Edward A. Falk wrote:
>> In article <ka59e6\$mq7\$>,
>> Eric Sosman <> wrote:
>>>
>>> A colleague who did some work on IBM's AS/400 (they've
>>> changed the name; I forget the new one) told me that simply
>>> trying to calculate an out-of-range pointer yielded a null
>>> pointer as a result.

>>
>> Heh; learn something new every day. I never would have guessed
>> that there was an actual architecture that would blow up with
>> this construct.
>>
>> I assume that *(a+(i-j)) would be ok?

>
> No. There is no requirement that the value of "i-j" be calculated prior to
> adding it to "a". (Check the numerous threads here involving using
> parentheses to "fix" UB in things involving such constructs as "i + (i++)".)
> Operator precedence only guarantees how the expression is to be
> interpreted, not the actual order of evaluation.

True, but the expression `a+(i-j)` is evaluated *in the abstract
machine* by subtracting j from i and then adding the result to a.
A compiler is free to evaluate it by computing a+i and then
subtracting j from the result *only* if it can guarantee that the
result is the same, or if the canonical order has undefined behavior.

`INT_MAX + (1 - 1)` has well defined behavior.
`INT_MAX + 1 - 1` does not.
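
A sketch of the same point in code. Both function names are hypothetical, and add_would_overflow is just one common way to test a signed addition before performing it, not something the Standard provides:

```c
#include <limits.h>

/* Well defined for every int n: (n - n) is evaluated to 0 first, so
   the addition never goes past INT_MAX. */
static int grouped_sum(int n)
{
    return INT_MAX + (n - n);
}

/* Reports whether x + y would overflow, without performing the
   addition.  INT_MAX + n - n parses as (INT_MAX + n) - n, so for
   n == 1 this check would say "yes" for its first step. */
static int add_would_overflow(int x, int y)
{
    if (y > 0)
        return x > INT_MAX - y;
    if (y < 0)
        return x < INT_MIN - y;
    return 0;
}
```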

--
Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Keith Thompson, Dec 12, 2012
11. glen herrmannsfeldt (Guest)

Ken Brody <> wrote:
> On 12/10/2012 9:28 PM, James Kuyper wrote:
>> Context:
>> char a[10];
>> size_t i=20,j=15;
>> *(a+i-j)=42;

(snip)
>>> I assume that *(a+(i-j)) would be ok?

>> That should be safe for all conforming implementations of C.

> Are you sure? Does anything in the Standard *require* that "i-j" be
> evaluated prior to adding it to "a"?

On many systems, the result is the same until you try to dereference
the result.

Seems to me that on any system where the result isn't the same, that
the compiler better do it in the appropriate order.

With the appropriate wrap on overflow characteristic, fixed point
arithmetic is associative. If the compiler knows that, it can compute
i+(j-k) as (i+j)-k, knowing the result is the same.

If something else happens on overflow, the compiler shouldn't do that.

It gets more interesting with floating point.
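
A rough illustration of both cases, assuming a typical implementation with IEEE 754 doubles. The function names are made up for this sketch:

```c
/* Unsigned arithmetic wraps modulo 2^N, so regrouping an addition and
   a subtraction cannot change the result. */
static int unsigned_regroup_same(unsigned i, unsigned j, unsigned k)
{
    return i + (j - k) == (i + j) - k;
}

/* Floating-point addition rounds after every step, so regrouping can
   change the result; a compiler may not regroup it on its own. */
static int double_regroup_same(double x, double y, double z)
{
    double left = (x + y) + z;
    double right = x + (y + z);

    return left == right;
}
```

With 0.1, 0.2 and 0.3 as arguments, the two groupings usually differ in the last bit, while the unsigned version agrees for every input, including ones that wrap.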

> Haven't we had this discussion earlier, related to other forms of UB, with
> the questioner asking if adding parentheses would "fix" the problem?

-- glen

glen herrmannsfeldt, Dec 12, 2012
12. James Kuyper (Guest)

On 12/12/2012 02:36 PM, Ken Brody wrote:
> On 12/10/2012 9:28 PM, James Kuyper wrote:
>> Context:
>> char a[10];
>> size_t i=20,j=15;
>> *(a+i-j)=42;
>>
>> On 12/10/2012 07:46 PM, Edward A. Falk wrote:
>> ...
>>> Heh; learn something new every day. I never would have guessed
>>> that there was an actual architecture that would blow up with
>>> this construct.
>>>
>>> I assume that *(a+(i-j)) would be ok?

>>
>> That should be safe for all conforming implementations of C.

>
> Are you sure? Does anything in the Standard *require* that "i-j" be
> evaluated prior to adding it to "a"?

Yes, I'm sure; if you aren't, perhaps there's been a miscommunication of
some kind?

Check the grammar rules. The right operand of a binary '+' expression
must be a multiplicative-expression (6.5.6p1). '(' doesn't qualify;
neither does '(i', or '(i-' or '(i-j'; the only thing that can be parsed
as the right operand of the '+' operator in that expression is (i-j),
which parses as a primary-expression (6.5.1p1), and therefore as a
postfix-expression (6.5.2p1), a unary-expression (6.5.3p1), a
cast-expression (6.5.4p1), and a multiplicative-expression (6.5.5p1), in
that order.

For C99, I would have stopped the explanation at that point, considering
it sufficient. However, C2011 is more explicit about when the relative order
of two events is specified, and when it isn't, so there are a couple of
additional citations that are relevant. I believe that what they say was
inherently true even in C99, where it was not explicitly said:

"The value computations of the operands of an operator
are sequenced before the value computation of the result of the
operator." 6.5p1.

"An evaluation A happens before an evaluation B if A is sequenced before
B." 5.1.2.4p9

> Haven't we had this discussion earlier, related to other forms of UB, with
> the questioner asking if adding parentheses would "fix" the problem?

The problem with *(a+i-j) is that the standard mandates that 'i' be
added to 'a' before 'j' is subtracted from the result. Putting
parentheses around 'i - j' converts those three tokens into a single
primary-expression. That's why *(a+(i-j)) fixes the problem. It forces
the value computations for the subtraction expression to happen before
the value computations of the binary addition expression.

Kenneth gives i+(i++) as an example of a case where parentheses do
nothing to resolve the underlying problem. That is because the problem
is the absence of a sequence point separating 'i' from 'i++'.
Parentheses do not insert a sequence point, and therefore do NOT solve
that problem.
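
What does solve it is separating the read from the modification into distinct full expressions. A sketch, assuming the intent of i+(i++) was "add the old value to itself, then increment" (that reading is a guess, since the original expression has no defined meaning). The function name is hypothetical:

```c
/* Returns old + old, where old is the value of *i on entry, and
   leaves *i incremented by one. */
static int sum_then_increment(int *i)
{
    int old = *i;    /* the read is complete at the end of this statement */

    *i = old + 1;    /* the modification is a separate full expression */
    return old + old;
}
```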

James Kuyper, Dec 12, 2012
13. Eric Sosman (Guest)

On 12/12/2012 2:36 PM, Ken Brody wrote:
> On 12/10/2012 9:28 PM, James Kuyper wrote:
>> Context:
>> char a[10];
>> size_t i=20,j=15;
>> *(a+i-j)=42;
>>
>> On 12/10/2012 07:46 PM, Edward A. Falk wrote:
>> ...
>>> Heh; learn something new every day. I never would have guessed
>>> that there was an actual architecture that would blow up with
>>> this construct.
>>>
>>> I assume that *(a+(i-j)) would be ok?

>>
>> That should be safe for all conforming implementations of C.

>
> Are you sure? Does anything in the Standard *require* that "i-j" be
> evaluated prior to adding it to "a"?

No, but the Standard requires that the thing added to `a'
be the value of `i-j'. The "as if" rule still applies, so an
actual implementation might calculate something that might be
written as `a-j+i' or `i+a-j' or `a-(j-i)' or a host of other
possibilities. Still, the result -- including the definedness
of the result -- must be as for "`a' plus `i-j'".

> Haven't we had this discussion earlier, related to other forms of UB,
> with the questioner asking if adding parentheses would "fix" the problem?

Nitpick: Since this isn't UB, "other" is out of place.

The usual misunderstanding is that the association of
operators with their operands -- "expression tree order" --
dictates evaluation order, which it doesn't. (Except for
certain special operators like ||, and even then only in
part.)

--
Eric Sosman

Eric Sosman, Dec 12, 2012
14. James Kuyper (Guest)

On 12/12/2012 04:02 PM, Eric Sosman wrote:
...
> The usual misunderstanding is that the association of
> operators with their operands -- "expression tree order" --
> dictates evaluation order, which it doesn't. (Except for
> certain special operators like ||, and even then only in
> part.)

The expression tree does not impose an evaluation order on its branches
at the same level (with the exceptions that you noted), but it does
impose a requirement that the operands be evaluated before the
expression itself. I believe that this requirement has always been
implied by the semantics of each expression, but C2011 has made this
requirement explicit for all expressions in 6.5p1 and 5.1.2.4p18 (which I
just mis-cited in my response to Kenneth as 5.1.2.4p9).

James Kuyper, Dec 12, 2012
15. Ken Brody (Guest)

On 12/12/2012 2:30 PM, Ken Brody wrote:
> On 12/10/2012 7:46 PM, Edward A. Falk wrote:
>> In article <ka59e6\$mq7\$>,
>> Eric Sosman <> wrote:
>>>
>>> A colleague who did some work on IBM's AS/400 (they've
>>> changed the name; I forget the new one) told me that simply
>>> trying to calculate an out-of-range pointer yielded a null
>>> pointer as a result.

>>
>> Heh; learn something new every day. I never would have guessed
>> that there was an actual architecture that would blow up with
>> this construct.
>>
>> I assume that *(a+(i-j)) would be ok?

>
> No. There is no requirement that the value of "i-j" be calculated prior to
> adding it to "a". (Check the numerous threads here involving using
> parentheses to "fix" UB in things involving such constructs as "i + (i++)".)
> Operator precedence only guarantees how the expression is to be
> interpreted, not the actual order of evaluation.

As noted in the replies to my post, I stand corrected. Because of the
"as-if" rule, if evaluating "i-j" first would not cause an overflow in
"a+(i-j)", then the compiler must guarantee that any rearranging of the code
will give an identical result, even if an overflow does occur.

Ken Brody, Dec 12, 2012
16. Eric Sosman (Guest)

On 12/12/2012 4:13 PM, James Kuyper wrote:
> On 12/12/2012 04:02 PM, Eric Sosman wrote:
> ...
>> The usual misunderstanding is that the association of
>> operators with their operands -- "expression tree order" --
>> dictates evaluation order, which it doesn't. (Except for
>> certain special operators like ||, and even then only in
>> part.)

>
> The expression tree does not impose an evaluation order on its branches
> at the same level (with the exceptions that you noted), but it does
> impose a requirement that the operands be evaluated before the
> expression itself. I believe that this requirement has always been
> implied by the semantics of each expression, but C2011 has made this
> requirement explicit for all expressions in 6.5p1 and 5.1.2.4p18 (which I
> just mis-cited in my response to Kenneth as 5.1.2.4p9).

Although I haven't studied the C11 stuff in detail, I'd
be surprised (and disappointed!) if in

#define WHICH 1
...
int r = WHICH * (x + y) + (1 - WHICH) * (z - x);

... the Standard required that `z - x' be evaluated at all,
much less "before" the entire expression.

However, neither surprise nor disappointment is entirely
strange to me. Embarrassment is an old pal, too ...

--
Eric Sosman

Eric Sosman, Dec 13, 2012
17. James Kuyper (Guest)

On 12/12/2012 09:18 PM, Eric Sosman wrote:
...
> Although I haven't studied the C11 stuff in detail, I'd
> be surprised (and disappointed!) if in
>
> #define WHICH 1
> ...
> int r = WHICH * (x + y) + (1 - WHICH) * (z - x);
>
> ... the Standard required that `z - x' be evaluated at all,
> much less "before" the entire expression.

Well, the as-if rule always trumps any other requirements, when it
applies - if a strictly conforming program can't determine whether or
not sub-expressions were evaluated in the required order, evaluating
them in that order isn't really required. If it can't even determine
whether they were evaluated, they don't even have to be evaluated.

> However, neither surprise nor disappointment is entirely
> strange to me. Embarrassment is an old pal, too ...

Yep, I know him well myself.
--
James Kuyper

James Kuyper, Dec 13, 2012
18. Phil Carmody (Guest)

Eric Sosman <> writes:
> On 12/12/2012 4:13 PM, James Kuyper wrote:
> > On 12/12/2012 04:02 PM, Eric Sosman wrote:
> > ...
> >> The usual misunderstanding is that the association of
> >> operators with their operands -- "expression tree order" --
> >> dictates evaluation order, which it doesn't. (Except for
> >> certain special operators like ||, and even then only in
> >> part.)

> >
> > The expression tree does not impose an evaluation order on its branches
> > at the same level (with the exceptions that you noted), but it does
> > impose a requirement that the operands be evaluated before the
> > expression itself. I believe that this requirement has always been
> > implied by the semantics of each expression, but C2011 has made this
> > requirement explicit for all expressions in 6.5p1 and 5.1.2.4p18 (which I
> > just mis-cited in my response to Kenneth as 5.1.2.4p9).

>
> Although I haven't studied the C11 stuff in detail, I'd
> be surprised (and disappointed!) if in
>
> #define WHICH 1
> ...
> int r = WHICH * (x + y) + (1 - WHICH) * (z - x);
>
> ... the Standard required that `z - x' be evaluated at all,
> much less "before" the entire expression.

I am deliriously happy that the Standard requires that (the implementation
behave as if) `z - x' is evaluated. That would be, and is, consistent behaviour.

Pulling out the big cannon - if z and x are volatile, of course you [...]

If you meant to say

int r = WHICH ? (x+y) : (z-x);

then write that, not some other silly expression which does arithmetic rather
than conditional evaluation.
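
To make the difference observable, here is a sketch in which a function call with a side effect stands in for a volatile read. The names read_z, reads_of_z, arithmetic_form and conditional_form are all made up for illustration:

```c
static int reads_of_z = 0;   /* counts calls; stands in for volatile reads */

static int read_z(void)
{
    reads_of_z++;
    return 7;
}

/* Both products are operands of '+', so read_z() is called even when
   which == 1 and its result is multiplied by zero. */
static int arithmetic_form(int which, int x, int y)
{
    return which * (x + y) + (1 - which) * (read_z() - x);
}

/* Only the selected operand of ?: is evaluated, so read_z() is not
   called when which is nonzero. */
static int conditional_form(int which, int x, int y)
{
    return which ? (x + y) : (read_z() - x);
}
```

Both forms return the same value for which == 1, but only the arithmetic form performs the read.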

Phil
--
I'm not saying that google groups censors my posts, but there's a strong link
between me saying "google groups sucks" in articles, and them disappearing.

Oh - I guess I might be saying that google groups censors my posts.

Phil Carmody, Dec 17, 2012
19. Phil Carmody (Guest)

"christian.bau" <> writes:
> On Dec 11, 10:36 am, Noob <r...@127.0.0.1> wrote:
> > Edward A. Falk wrote:
> > > I assume that *(a+(i-j)) would be ok?

> >
> > Please correct me if I am wrong,
> >
> > *(a+(i-j)) is strictly equivalent to a[i-j]
> >
> > (I find the latter clearer.)

>
> Yes, it's the same. But there are also cases where * (a + i - j) would
> be fine and * (a + (i - j)) or a [i - j] wouldn't: If you have 64 bit
> pointers and 32 bit ints, then i - j might overflow, while a + i - j
> could be correct.

i and j are not int but size_t. What do you mean by "overflow" in that context?
Can you come up with a concrete example of failure which doesn't have UB
in the "correct" version?
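
For reference, a sketch of what size_t subtraction does when i < j. It wraps rather than overflowing, so the subtraction itself is always defined, though the wrapped value is useless as an index. The function name is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* size_t is an unsigned type, so i - j is computed modulo
   SIZE_MAX + 1 and never has undefined behavior by itself. */
static size_t index_difference(size_t i, size_t j)
{
    return i - j;
}
```

For i = 15 and j = 20 the result is SIZE_MAX - 4.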

Phil

Phil Carmody, Dec 17, 2012
20. Phil Carmody (Guest)

Ken Brody <> writes:
> On 12/10/2012 7:46 PM, Edward A. Falk wrote:
> > In article <ka59e6\$mq7\$>,
> > Eric Sosman <> wrote:
> >>
> >> A colleague who did some work on IBM's AS/400 (they've
> >> changed the name; I forget the new one) told me that simply
> >> trying to calculate an out-of-range pointer yielded a null
> >> pointer as a result.

> >
> > Heh; learn something new every day. I never would have guessed
> > that there was an actual architecture that would blow up with
> > this construct.
> >
> > I assume that *(a+(i-j)) would be ok?

>
> No. There is no requirement that the value of "i-j" be calculated
> prior to adding it to "a". (Check the numerous threads here involving
> using parentheses to "fix" UB in things involving such constructs as
> "i + (i++)".) Operator precedence only guarantees how the expression
> is to be interpreted, not the actual order of evaluation.

The "*(a+(i-j))" expression has *nothing* in common with the part of the
"i+(i++)" that pertains to UB. That brackets fail to do something in an
unrelated situation is basically irrelevant.

Phil

Phil Carmody, Dec 17, 2012