Is pointer arithmetic within a struct well defined?

Francois Grieu

Consider the following code:


typedef unsigned char byt;

struct {
byt fa[8];
byt fb[8];
} mys;

byt* const gp = mys.fa;

int f0(void) { return (int)(mys.fb-gp-2); }

int f1(void) { return (int)(mys.fb-2-gp); }



I recently encountered a platform where
f0() returns 6
but
f1() returns 262

I understand why the machine code generated for f1
produces that result, and what kind of optimization
triggers the generation of that code.

But I wonder exactly what chapter and verse (if any)
in the C standards makes f1 invoke UB; and if f0 is
safe or not.

TIA,
Francois Grieu
 
Nobody

But I wonder exactly what chapter and verse (if any) in the C standards
makes f1 invoke UB; and if f0 is safe or not.

Pointer subtraction is only defined when both pointers point to elements
of the same array. 6.5.6p9:

[#9] When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference of
the subscripts of the two array elements.

So neither f0() nor f1() are "safe".
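
For contrast, here is a minimal sketch of the case that 6.5.6p9 does cover: both operands point into one and the same array, or one past its end. The helper name span_within_array and the array a are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: distance, in elements, between two positions
   inside ONE array of 8 unsigned char. Both i and j must be in 0..8;
   index 8 is the one-past-the-end position, which may take part in
   arithmetic but must not be dereferenced. */
ptrdiff_t span_within_array(size_t i, size_t j)
{
    static unsigned char a[8];

    assert(i <= 8 && j <= 8);
    return (a + j) - (a + i);   /* same array object, so 6.5.6p9 applies */
}
```

Calling span_within_array(0, 8) gives 8, and span_within_array(6, 2) gives -4. Nothing here crosses from one array into another, which is exactly what f0() and f1() do.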
 
Barry Schwarz

Consider the following code:


typedef unsigned char byt;

struct {
byt fa[8];
byt fb[8];
} mys;

byt* const gp = mys.fa;

int f0(void) { return (int)(mys.fb-gp-2); }

int f1(void) { return (int)(mys.fb-2-gp); }



I recently encountered a platform where
f0() returns 6
but
f1() returns 262

I understand why the machine code generated for f1
produces that result, and what kind of optimization
triggers the generation of that code.

But I wonder exactly what chapter and verse (if any)
in the C standards makes f1 invoke UB; and if f0 is
safe or not.

If and only if there is no padding between fa and fb (as is most
likely the case), f0 should be safe (defined). In this case, the
expression mys.fb "effectively" evaluates to the address one byte
beyond the end of fa. Therefore, mys.fb-gp evaluates to the number of
elements between the two addresses and the rest of the expression is
evaluated using integers (ptrdiff_t?).

If there is padding, then the expression mys.fb-gp violates the first
constraint in n1256, paragraph 6.5.6-9.

In the case of f1, the expression mys.fb-2 does not involve fa and
violates the constraint in the next to last sentence of n1256,
paragraph 6.5.6-8.
 
Tim Rentsch

Francois Grieu said:
Consider the following code:

typedef unsigned char byt;

struct {
byt fa[8];
byt fb[8];
} mys;

byt* const gp = mys.fa;

int f0(void) { return (int)(mys.fb-gp-2); }

int f1(void) { return (int)(mys.fb-2-gp); }

I recently encountered a platform where
f0() returns 6
but
f1() returns 262

I understand why the machine code generated for f1
produces that result, and what kind of optimization
triggers the generation of that code.

But I wonder exactly what chapter and verse (if any)
in the C standards makes f1 invoke UB; and if f0 is
safe or not.

Both are undefined behavior, for slightly different reasons
but essentially the same one, which is that pointer arithmetic is
defined only within a single array. Adding an integer to
'mys.fb' is defined only when the integer's value is between
0 and 8, inclusive. Subtracting a pointer from 'mys.fb' is
defined only when the pointer being subtracted is in the
range of { &mys.fb[0] .. &mys.fb[8] }. The function f0()
violates the second of these conditions, the function f1()
violates the first. The key observation in both cases is
that use of 'mys.fb' limits the array in question to just
that member, not the entire struct. Similarly the variable
'gp' is limited to the member 'mys.fa', because that is
where its value came from.
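
To make the first condition concrete, here is a small sketch that refuses to form any pointer outside the range the rule allows for 'mys.fb'. The helper fb_at() is hypothetical, not something from the original code.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byt;

struct {
    byt fa[8];
    byt fb[8];
} mys;

/* Hypothetical helper: returns mys.fb + k only for the offsets the
   rule allows (0..8 inclusive) and NULL otherwise, so the
   out-of-range pointer is never formed in the first place. */
byt *fb_at(ptrdiff_t k)
{
    if (k < 0 || k > (ptrdiff_t)(sizeof mys.fb / sizeof mys.fb[0]))
        return NULL;
    return mys.fb + k;
}
```

With this, fb_at(-2) is rejected, which corresponds to the 'mys.fb-2' subexpression in f1(). Two results of fb_at() can always be subtracted from each other, because both stay within mys.fb.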

Unfortunately the Standard does not spell out very precisely
which array governs any particular pointer being operated on
(at least not precisely enough IMO). However the case here
is addressed more or less directly in the Defect Reports
that concern this question. The paragraphs on pointer
arithmetic are 6.5.6 p8 and p9 in n1256, although that may
not help much because of the ambiguity as to which array
governs the restrictions.

There is a way to do what you want with defined behavior, using
the offsetof() macro:

typedef struct { byt fa[8]; byt fb[8]; } T;
char *p_mys = (char *) &mys;
char *p_fa = p_mys + offsetof( T, fa );
char *p_fb = p_mys + offsetof( T, fb );
ASSERT( (p_fb - p_fa) % sizeof (byt) == 0 );

int result_f0 = (p_fb - p_fa - 2) / sizeof (byt);
int result_f1 = (p_fb - 2 - p_fa) / sizeof (byt);

Both of the last two expressions are well defined. (Yes it
is possible to simplify the expressions given above, taking
advantage of 'byt' being a character type, and 'fa' being
the first member of the struct, but the example is meant to
illustrate a general pattern.)

Note that in principle it is possible for the ASSERT() to
fail if sizeof (byt) > 1. Of course it is highly unlikely
that this will ever occur but it's a good practice to put in
something as a reminder, especially in the general case
where there may be members between 'fa' and 'fb'. Depending
on just what it is you're trying to do you might want to
test the stronger condition 'p_fa + sizeof mys.fa == p_fb',
which I believe is what is expected to be the case.
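
For anyone who wants to compile the pattern above, here is one way it might be filled out. It is only a sketch: the struct tag 'pair', the helper check_layout() and the use of assert() from <assert.h> in place of ASSERT() are my additions. The stronger no-gap check assumes there is no padding between fa and fb, which the Standard does not guarantee.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, ptrdiff_t */

typedef unsigned char byt;

/* The struct gets a tag here (an addition for this sketch) so that
   offsetof() names the same type the object is declared with. */
struct pair { byt fa[8]; byt fb[8]; };
struct pair mys;

/* Reminder checks from the post: the byte distance is a whole number
   of elements, and (stronger, assumed) fb starts right after fa. */
static void check_layout(char *p_fa, char *p_fb)
{
    assert((p_fb - p_fa) % (ptrdiff_t)sizeof(byt) == 0);
    assert(p_fa + sizeof mys.fa == p_fb);
}

int result_f0(void)
{
    char *p_mys = (char *)&mys;               /* whole-struct pointer */
    char *p_fa  = p_mys + offsetof(struct pair, fa);
    char *p_fb  = p_mys + offsetof(struct pair, fb);

    check_layout(p_fa, p_fb);
    return (int)((p_fb - p_fa - 2) / (ptrdiff_t)sizeof(byt));
}

int result_f1(void)
{
    char *p_mys = (char *)&mys;
    char *p_fa  = p_mys + offsetof(struct pair, fa);
    char *p_fb  = p_mys + offsetof(struct pair, fb);

    check_layout(p_fa, p_fb);
    /* p_fb - 2 stays inside the struct's bytes, unlike mys.fb - 2. */
    return (int)((p_fb - 2 - p_fa) / (ptrdiff_t)sizeof(byt));
}
```

On an implementation with no padding between the members, both functions return 6, the value f0() was observed to produce.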
 
Seebs

Note that in principle it is possible for the ASSERT() to
fail if sizeof (byt) > 1.

Since "byt" is "unsigned char", this is... well, it's not so much
that it's impossible, as that if it happens, you should run away
quickly. Because that's some kind of unholy monstrosity pretending
to be a C compiler.

-s
 
Seebs

Consider the following code:
typedef unsigned char byt;
struct {
byt fa[8];
byt fb[8];
} mys;
byt* const gp = mys.fa;
int f0(void) { return (int)(mys.fb-gp-2); }
int f1(void) { return (int)(mys.fb-2-gp); }

I have encountered this before. I think the answer is a very
resounding "maybe".

My opinion, which I am not 100% sure of, is:

The boundaries of the object depend on how you obtain the
pointer.

So:

struct mystruct {
byt fa[8];
byt fb[8];
} mys;
byt* const p1 = (byt *) &mys;
byt* const p2 = mys.fa;
byt* const p3 = p1 + offsetof(struct mystruct, fa);
byt* const p4 = mys.fb;
byt* const p5 = p1 + offsetof(struct mystruct, fb);

int f0(void) {
return (int) (p4 - p1);
}
int f1(void) {
return (int) (p4 - p2);
}
int f2(void) {
return (int) (p4 - p3);
}
int f3(void) {
return (int) (p5 - p1);
}
int f4(void) {
return (int) (p5 - p2);
}
int f5(void) {
return (int) (p5 - p3);
}

Of these, I'm pretty sure that f3 and f5 are defined, because the pointers
they subtract are clearly into the same object. I am not as sure about f4.
And I think f0-f2 are potentially undefined because they are trying to do
pointer arithmetic on a pointer into an array and another pointer not in
that array.

But I wonder exactly what chapter and verse (if any)
in the C standards makes f1 invoke UB; and if f0 is
safe or not.

I would, in general, use only pointers computed by obtaining the address of
the whole structure, then finding offsets within it, to try to do arithmetic
comparing members. I think this came up once with regards to some customer
code at work, where they had a structure that had a hunk of data, and a
block of bytes before it, and they wanted to do negative indexes into the
data block to assemble a contiguous, packed, header in front of it.
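
If all that is needed is the distance between two members, and not a pointer, one option is to stay in integer arithmetic altogether. This is a sketch only; the macro name MEMBER_DISTANCE and the function fa_to_fb_bytes() are made up.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byt;

struct mystruct {
    byt fa[8];
    byt fb[8];
};

/* Hypothetical macro: byte distance from member `from` to member `to`
   of struct type `type`. Only integer constants are subtracted, so
   no pointer arithmetic rule comes into play. */
#define MEMBER_DISTANCE(type, from, to) \
    ((ptrdiff_t)offsetof(type, to) - (ptrdiff_t)offsetof(type, from))

/* Distance in bytes from the start of fa to the start of fb. */
ptrdiff_t fa_to_fb_bytes(void)
{
    return MEMBER_DISTANCE(struct mystruct, fa, fb);
}
```

The result can then be added to a pointer obtained from (char *)&mys when an actual address inside the struct is required.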

-s
 
Tim Rentsch

Seebs said:
Since "byt" is "unsigned char", this is... well, it's not so much
that it's impossible, as that if it happens, you should run away
quickly. Because that's some kind of unholy monstrosity pretending
to be a C compiler.

I was trying to illustrate a general pattern, where the type
'byt' is not otherwise known and may not be a character type.
Obviously if 'byt' is (unsigned char) then sizeof (byt) == 1.
 
Tim Rentsch

Barry Schwarz said:
Consider the following code:


typedef unsigned char byt;

struct {
byt fa[8];
byt fb[8];
} mys;

byt* const gp = mys.fa;

int f0(void) { return (int)(mys.fb-gp-2); }

int f1(void) { return (int)(mys.fb-2-gp); }



I recently encountered a platform where
f0() returns 6
but
f1() returns 262

I understand why the machine code generated for f1
produces that result, and what kind of optimization
triggers the generation of that code.

But I wonder exactly what chapter and verse (if any)
in the C standards makes f1 invoke UB; and if f0 is
safe or not.

If and only if there is no padding between fa and fb (as is most
likely the case), f0 should be safe (defined). In this case, the
expression mys.fb "effectively" evaluates to the address one byte
beyond the end of fa. [snip elaboration]

The conclusion about f0 is not correct. It's true that if there is
no padding then mys.fb == mys.fa+8, but mys.fb still "belongs" to
the array mys.fb, and is not interchangeable with a pointer value
that points one past mys.fa[7]. The values are equal but what
Defect Reports call the "provenance" is different, so mys.fb can't
be combined with any pointer derived from mys.fa, even when there
is no padding.
 
Tim Rentsch

Seebs said:
Consider the following code:
typedef unsigned char byt;
struct {
byt fa[8];
byt fb[8];
} mys;
byt* const gp = mys.fa;
int f0(void) { return (int)(mys.fb-gp-2); }
int f1(void) { return (int)(mys.fb-2-gp); }

I have encountered this before. I think the answer is a very
resounding "maybe".

My opinion, which I am not 100% sure of, is:

The boundaries of the object depend on how you obtain the
pointer.

This viewpoint agrees with the responses given in the Defect
Report (or possibly DR's plural) on this issue.

So:

struct mystruct {
byt fa[8];
byt fb[8];
} mys;
byt* const p1 = (byt *) &mys;
byt* const p2 = mys.fa;
byt* const p3 = p1 + offsetof(struct mystruct, fa);
byt* const p4 = mys.fb;
byt* const p5 = p1 + offsetof(struct mystruct, fb);

int f0(void) {
return (int) (p4 - p1);
}
int f1(void) {
return (int) (p4 - p2);
}
int f2(void) {
return (int) (p4 - p3);
}
int f3(void) {
return (int) (p5 - p1);
}
int f4(void) {
return (int) (p5 - p2);
}
int f5(void) {
return (int) (p5 - p3);
}

Of these, I'm pretty sure that f3 and f5 are defined, because
the pointers they subtract are clearly into the same object.

I concur. Clearly defined.

And I think f0-f2 are potentially undefined because they are
trying to do pointer arithmetic on a pointer into an array and
another pointer not in that array.

I concur here also. The provenance of p4 clearly is not the
same as any of p1, p2, p3.

I am not as sure about f4.

My take is that f4 is undefined, and even that 'p2 - p1' is
undefined. The reasoning is simple: the array that 'p1' points
into is not the same as the array that either 'p2' or 'p4' points
into. Among other things, it has a different length, so it can't
be the same array. 6.5.6 p9 requires that both pointers "shall
point to elements of the same array object." Since they don't,
the behavior is undefined.

By the way, kudos for a nice careful analysis.

But I wonder exactly what chapter and verse (if any)
in the C standards makes f1 invoke UB; and if f0 is
safe or not.

I would, in general, use only pointers computed by obtaining the
address of the whole structure, then finding offsets within it,
to try to do arithmetic comparing members. [snip elaboration]

I agree with this recommendation.
 
Seebs

I was trying to illustrate a general pattern, where the type
'byt' is not otherwise known and may not be a character type.
Obviously if 'byt' is (unsigned char) then sizeof (byt) == 1.

Ahh, okay.

-s
 
Noob

Tim said:
Francois Grieu said:
Consider the following code:

typedef unsigned char byt;

struct {
byt fa[8];
byt fb[8];
} mys;

byt* const gp = mys.fa;

int f0(void) { return (int)(mys.fb-gp-2); }

int f1(void) { return (int)(mys.fb-2-gp); }

I recently encountered a platform where
f0() returns 6
but
f1() returns 262

I understand why the machine code generated for f1
produces that result, and what kind of optimization
triggers the generation of that code.

But I wonder exactly what chapter and verse (if any)
in the C standards makes f1 invoke UB; and if f0 is
safe or not.

Both are undefined behavior, for slightly different reasons
but essentially the same one, which is that pointer arithmetic is
defined only within a single array.

Slight digression, I thought it was OK to compute the
offset between two struct fields? Was I misled?

struct foo { int a,b,c,d,e,f; };
struct foo bar;
return &bar.e - &bar.b;

Regards.
 
Tim Rentsch

Noob said:
Tim said:
Francois Grieu said:
Consider the following code:

typedef unsigned char byt;

struct {
byt fa[8];
byt fb[8];
} mys;

byt* const gp = mys.fa;

int f0(void) { return (int)(mys.fb-gp-2); }

int f1(void) { return (int)(mys.fb-2-gp); }

I recently encountered a platform where
f0() returns 6
but
f1() returns 262

I understand why the machine code generated for f1
produces that result, and what kind of optimization
triggers the generation of that code.

But I wonder exactly what chapter and verse (if any)
in the C standards makes f1 invoke UB; and if f0 is
safe or not.

Both are undefined behavior, for slightly different reasons
but essentially the same one, which is that pointer arithmetic is
defined only within a single array.

Slight digression, I thought it was OK to compute the
offset between two struct fields? Was I misled?

struct foo { int a,b,c,d,e,f; };
struct foo bar;
return &bar.e - &bar.b;

You may be thinking of 6.5.8 p5, about relational operators, which
defines an ordering for structure members. Thus an expression like
'&bar.e > &bar.b' is well-defined and must be equal to 1. But as
far as pointer subtraction goes this case is just like the earlier
one, ie, undefined behavior, because of the rule in 6.5.6 p7:

For the purposes of these operators, a pointer to an object
that is not an element of an array behaves the same as a
pointer to the first element of an array of length one with
the type of the object as its element type.

Hence the pointers '&bar.e' and '&bar.b' point to different arrays,
and therefore do not satisfy the condition in 6.5.6 p9 that both
pointers point to elements of (or one past the last element of) the
same array object. There is no exception given for pointers to
members of a common struct.
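
A short sketch of the distinction, under the reading given above: the relational comparison is covered by 6.5.8 p5, while the distance is better taken from offsetof() than from subtracting the member addresses. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct foo { int a, b, c, d, e, f; };

/* 6.5.8 p5: pointers to members declared later compare greater than
   pointers to members declared earlier, so this yields 1. */
int e_is_after_b(const struct foo *p)
{
    return &p->e > &p->b;
}

/* Byte distance from b to e, taken from the member offsets instead
   of from the undefined expression &bar.e - &bar.b. */
ptrdiff_t bytes_b_to_e(void)
{
    return (ptrdiff_t)offsetof(struct foo, e)
         - (ptrdiff_t)offsetof(struct foo, b);
}
```

Dividing bytes_b_to_e() by sizeof (int) gives the count the original expression was after, provided the implementation puts no padding between the members.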
 
James Kuyper

On 11/30/2013 03:22 AM, Noob wrote:
....
Slight digression, I thought it was OK to compute the
offset between two struct fields? Was I misled?

struct foo { int a,b,c,d,e,f; };
struct foo bar;
return &bar.e - &bar.b;

"When two pointers are subtracted, both shall point to elements of the
same array object, or one past the last element of the array object;
...." (6.5.6p9)

There's no single array object of which bar.e and bar.b are both
elements. "If a ‘‘shall’’ or ‘‘shall not’’ requirement that appears
outside of a constraint or runtime constraint is violated, the behavior
is undefined." (4p2)

In practice, your code is very likely to do precisely what you expect it
to do; but it's a bad idea to rely upon that fact.
 
Eric Sosman

On 11/30/2013 03:22 AM, Noob wrote:
...

"When two pointers are subtracted, both shall point to elements of the
same array object, or one past the last element of the array object;
..." (6.5.6p9)

There's no single array object of which bar.e and bar.b are both
elements. "If a ‘‘shall’’ or ‘‘shall not’’ requirement that appears
outside of a constraint or runtime constraint is violated, the behavior
is undefined." (4p2)

Any thoughts on

return (char*)&bar.e - (char*)&bar.b;

? We know that bar consists of a sequence of bytes (6.2.6.1p2),
and C programmers have long felt free to access the bytes of that
"sequence" as if they were bytes of an array. Have programmers
been sinning by doing so?

6.2.6.1p4 is a little coy about the array-of-bytes viewpoint,
speaking only of what you'd see if you copied the bytes into an
actual array before looking at them. 7.21.8.{1,2} don't speak of
doing I/O on the actual bytes of objects, but on an array that
exactly overlays them. On the other hand, 7.24.2.{1,2} don't bother
with the overlaid-array subterfuge, but talk of working directly
on the objects' bytes. (An array is mentioned in 7.24.2.2, but it
serves a different role.)

Finally: If it is in fact incorrect to treat a struct instance
as an array of bytes, is the offsetof() macro of any practical use?
 
Seebs

Any thoughts on

return (char*)&bar.e - (char*)&bar.b;

? We know that bar consists of a sequence of bytes (6.2.6.1p2),
and C programmers have long felt free to access the bytes of that
"sequence" as if they were bytes of an array. Have programmers
been sinning by doing so?

That one doesn't change anything, because the pointers are still
based on b and e.

Basically: While it's true that the bytes of bar are a sequence of
bytes and you can use it that way, the bytes of b and the bytes of
e are both subsequences of that, and their own little sequences of
bytes. And they are two separate sequences.

Finally: If it is in fact incorrect to treat a struct instance
as an array of bytes, is the offsetof() macro of any practical use?

That is part of why we need it:

((char *) &bar) + offsetof(struct foo, e)
(char *)&bar.e

These are the same location, but the second is a pointer into an
int-sized array, and the first is a pointer into a struct foo sized
array. Only the first is treating the struct as an array of bytes;
the second is treating a single int as an array of bytes.

-s
 
Tim Rentsch

Eric Sosman said:
Any thoughts on

return (char*)&bar.e - (char*)&bar.b;

?

Strictly speaking I believe the behavior here is undefined. More
specifically, I am not aware of any text in the Standard, or in
Defect Reports, or in the Rationale document, that suggests the
behavior in this case is any more defined than the same expression
without the casts. Pragmatically it is of course highly likely to
work as expected, but AFAIAA there is nothing to indicate there
is any difference in the defined-ness of these two expressions.

We know that bar consists of a sequence of bytes (6.2.6.1p2),
and C programmers have long felt free to access the bytes of that
"sequence" as if they were bytes of an array. Have programmers
been sinning by doing so?

No. The safety of doing so is provided by the combination of
6.3.2.3 p7 and 6.5.6 p8. The semantics of pointer arithmetic is
defined in 6.5.6. The last sentence of 6.3.2.3 p7 does not change
or expand on those semantics, it simply says what the results will
be in relation to the original object, ie, that the storage of the
object exactly overlays the implied character array. As the
semantics given in 6.5.6 p8 are defined only when the pointer in
question points to the element of an array, there must be such an
array so that the statement in 6.3.2.3 p7 will in fact hold true.

6.2.6.1p4 is a little coy about the array-of-bytes viewpoint,
speaking only of what you'd see if you copied the bytes into an
actual array before looking at them. 7.21.8.{1,2} don't speak of
doing I/O on the actual bytes of objects, but on an array that
exactly overlays them. On the other hand, 7.24.2.{1,2} don't bother
with the overlaid-array subterfuge, but talk of working directly
on the objects' bytes. (An array is mentioned in 7.24.2.2, but it
serves a different role.)

6.2.6.1 p4 is only about how objects are represented, not about
character array overlays; in fact the phrasing used supports
the idea that objects can be accessed as character arrays when
it says '/e.g./, by memcpy' (my emphasis). The writing in the
other cited passages is further evidence that the Standard's
authors consider the two perspectives interchangeable.

Finally: If it is in fact incorrect to treat a struct instance
as an array of bytes, is the offsetof() macro of any practical use?

It is perfectly okay to treat a struct instance as an array of
bytes, as long as the provenance of the pointer used refers to the
entire structure and not just an individual member.
 
Eric Sosman

That one doesn't change anything, because the pointers are still
based on b and e.

Basically: While it's true that the bytes of bar are a sequence of
bytes and you can use it that way, the bytes of b and the bytes of
e are both subsequences of that, and their own little sequences of
bytes. And they are two separate sequences.

That's (more or less) the interpretation used by a bounds-
checking C implementation I once read of: Every pointer was
associated with an invisible "extent" datum, and every dereference
was decomposed into "base plus offset" and checked against the
extent. In that implementation -- and in C itself, if you're
right -- these two pointers could behave differently despite
having the same type and comparing equal.

It gives me the heebie-jeebies, though. Seems to open an
enormous can of unpleasant worms. For example,

struct foo { int x, y; } fa = { 42, -42 }, fb;
assert((void*)&fa == (void*)&fa.x); // Okay (6.7.2.1p15)
assert((void*)&fb == (void*)&fb.x); // Okay (ditto)
memcpy(&fb, &fa, sizeof fb); // Okay
memcpy(&fb, &fa.x, sizeof fb); // Undefined?!
memcpy(&fb.x, &fa, sizeof fb); // Undefined?!

(If the fact that memcpy need not be implemented in C distracts
you, don't let it: Just substitute a C-implemented work-alike.)
 
Seebs

It gives me the heebie-jeebies, though. Seems to open an
enormous can of unpleasant worms.

It does!

For example,

struct foo { int x, y; } fa = { 42, -42 }, fb;
assert((void*)&fa == (void*)&fa.x); // Okay (6.7.2.1p15)
assert((void*)&fb == (void*)&fb.x); // Okay (ditto)
memcpy(&fb, &fa, sizeof fb); // Okay
memcpy(&fb, &fa.x, sizeof fb); // Undefined?!
memcpy(&fb.x, &fa, sizeof fb); // Undefined?!

(If the fact that memcpy need not be implemented in C distracts
you, don't let it: Just substitute a C-implemented work-alike.)

I know it seems weird, but it makes sense to me, because this is exactly
what you have to have be true for the optimizer to be able to make informed
decisions about whether a write to one thing can alter another thing.

Consider, given the above:
unsigned char *p1 = (unsigned char *) &fa;
unsigned char *p2 = (unsigned char *) &fa.x;
unsigned char *p3 = (unsigned char *) &fa.y;

It's *useful* to be assured that a write through p2 can't legitimately alter
the bytes p3 points to. p2 is not a pointer into the whole struct, just into one
member of it. But you have to be able to indicate that sometimes you really
do mean to write anywhere in the struct through a pointer, thus p1 has to work.

-s
 
James Kuyper

On 11/30/2013 9:48 AM, James Kuyper wrote: ....

Any thoughts on

return (char*)&bar.e - (char*)&bar.b;

? We know that bar consists of a sequence of bytes (6.2.6.1p2),
and C programmers have long felt free to access the bytes of that
"sequence" as if they were bytes of an array. Have programmers
been sinning by doing so?

6.2.6.1p4 is a little coy about the array-of-bytes viewpoint,
speaking only of what you'd see if you copied the bytes into an
actual array before looking at them.

6.2.6.1p4 talks about copying an object, "e.g. by memcpy". I read that
"e.g." as implying that memcpy() doesn't have any special blessing
for performing such a copy; that any other routine with behavior
indistinguishable from that required by the standard for memcpy() could
also be used. Such a routine has to treat the input object as an array
of unsigned char, with each element of that array having a value that
can be copied to the output array. There's lots of things that you could
do in such a routine without violating those requirements, such as
rearranging the order of the accesses to the input array of unsigned
char. There's even more changes you could make that would make it no
longer meet the requirements for memcpy(), but would also not give it
undefined behavior, such as printing out those unsigned char
values. From these considerations I conclude that 6.2.6.1p4 implies that
any object can be freely treated, for arbitrary purposes, as an array of
unsigned char (with due respect to const and volatile qualifications) .
In particular, this implies that pointers to different parts of the
object can be compared for relative order, and subtracted to calculate
the number of bytes between them.

This is an awful lot to infer from a single "e.g." I believe the
inference to be correct, but I'd prefer it if the standard said these
things more explicitly.
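
As an illustration of the kind of routine meant here, a memcpy() work-alike in plain C might look like the sketch below. The name copy_bytes is made up. Whether the Standard fully blesses this is the very inference discussed above, so take it as an example of common practice and not as proof.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memcpy() work-alike written in plain C: it reads the
   source object and writes the destination object one unsigned char
   at a time. The two regions must not overlap. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}
```

Both pointers are derived from the whole object being copied, so every d[i] and s[i] stays inside the n bytes of that object.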
 
glen herrmannsfeldt

Tim Rentsch said:
Strictly speaking I believe the behavior here is undefined. More
specifically, I am not aware of any text in the Standard, or in
Defect Reports, or in the Rationale document, that suggests the
behavior in this case is any more defined than the same expression
without the casts. Pragmatically it is of course highly likely to
work as expected, but AFAIAA there is nothing to indicate there
is any difference in the defined-ness of these two expressions.

Some years ago I was interested in the possibility of generating
JVM code from a C compiler. That is, generating something similar
to what a Java compiler might generate. JVM has no operations for
subtracting reference objects. A C pointer would contain an object
reference and offset into the array, such that subtraction would
still work.
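
A rough C model of that representation, purely as a sketch (this is not JVM code, and the names fatptr and fatptr_diff are invented): each pointer carries the array it belongs to plus an index, and subtraction has an answer only when the two arrays are the same.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "fat pointer": the array it belongs to, plus an
   element index within that array. */
struct fatptr {
    const void *base;   /* identifies the array object */
    ptrdiff_t   index;  /* element index within it */
};

/* Subtraction is meaningful only for two pointers with the same
   base. Returns false, leaving *out untouched, otherwise. */
bool fatptr_diff(struct fatptr a, struct fatptr b, ptrdiff_t *out)
{
    if (a.base != b.base)
        return false;
    *out = a.index - b.index;
    return true;
}
```

Under such a scheme a pointer made from mys.fb and one made from mys.fa have different bases, so their difference has no answer at all. That matches the reading of 6.5.6 p9 given earlier in the thread.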

-- glen
 
