when is typecasting (unsigned char*) to (char*) dangerous?

Keith Thompson

Harald van Dijk said:
If those exclusions were dropped, then using memcpy (or rather, a
custom function written in standard C that behaves exactly like
memcpy) to copy an object holding a trap representation would be
invalid.

/* the standard function memcpy, but implemented in 100% standard C */
extern void *mymemcpy(void *dest, void *src, size_t n);

struct S
{
    int ptrIsValid;
    void *ptr;
};

{
    struct S s1, s2;
    s2.ptrIsValid = 0; /* ptr is left uninitialised */
    mymemcpy(&s1, &s2, sizeof(s1));
}

Without the exclusion in 6.2.6.1p5, if pointer types can have trap
representations, mymemcpy would potentially use a character type to
read a trap representation. This should be allowed, and by excluding
character types in that paragraph, this is allowed.

memcpy() can just use unsigned char, which is guaranteed not to have
padding bits or trap representations.
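
For illustration, here is a minimal sketch of what such a byte-wise copy might look like. It reuses the name mymemcpy from the quoted example, but the body is an assumption rather than anyone's actual implementation, and the const qualifier on src is borrowed from the standard memcpy signature rather than from the quoted declaration. Every access goes through an unsigned char lvalue, so no read or write depends on the type of the object being copied.

```c
#include <stddef.h>

/* Hypothetical stand-in for memcpy written in standard C.
   Copies n bytes one at a time through unsigned char lvalues.
   As with memcpy, the two regions are assumed not to overlap. */
void *mymemcpy(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (n-- > 0)
        *d++ = *s++;

    return dest;
}
```

With this definition, the call mymemcpy(&s1, &s2, sizeof(s1)) from the quoted example copies the bytes of s2.ptr without ever reading them through an lvalue of type void *.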
 
Harald van Dijk

memcpy() can just use unsigned char, which is guaranteed not to have
padding bits or trap representations.

Yes, but that was not the point I was trying to make. I'm reading
s2.ptr, which potentially holds a trap representation, and 6.2.6.1p5
disallows reading trap representations. It doesn't refer to
representations that do not represent a value in the type of the
lvalue expression, it refers to representations that do not represent
a value in the type of the object. The object has type void *, even if
it's accessed using unsigned char. That's why there needs to be a
specific exception for when the lvalue expression has character type.
 
Keith Thompson

Harald van Dijk said:
Yes, but that was not the point I was trying to make. I'm reading
s2.ptr, which potentially holds a trap representation, and 6.2.6.1p5
disallows reading trap representations. It doesn't refer to
representations that do not represent a value in the type of the
lvalue expression, it refers to representations that do not represent
a value in the type of the object. The object has type void *, even if
it's accessed using unsigned char. That's why there needs to be a
specific exception for when the lvalue expression has character type.

That's an interesting interpretation, but I'm still not quite convinced.

Here's the paragraph:

Certain object representations need not represent a value of
the object type. If the stored value of an object has such a
representation and is read by an lvalue expression that does
not have character type, the behavior is undefined. If such
a representation is produced by a side effect that modifies
all or any part of the object by an lvalue expression that
does not have character type, the behavior is undefined.
Such a representation is called a *trap representation*.

I still think that it refers to (or *should* refer to) a trap
representation for the type of the lvalue.

Consider:

void *ptr = malloc(sizeof (int));
assert(ptr != NULL);
/* code to set the bytes pointed to by ptr to a
trap representation for type int */
int n = *(int*)ptr;

The object created by the malloc() call isn't an object of type int;
it's just raw storage. If 6.2.6.1p5 doesn't imply that accessing
it via an lvalue of type int has undefined behavior, then what does?
 
Harald van Dijk

Here's the paragraph:

    Certain object representations need not represent a value of
    the object type. If the stored value of an object has such a
    representation and is read by an lvalue expression that does
    not have character type, the behavior is undefined. If such
    a representation is produced by a side effect that modifies
    all or any part of the object by an lvalue expression that
    does not have character type, the behavior is undefined.
    Such a representation is called a *trap representation*.

I still think that it refers to (or *should* refer to) a trap
representation for the type of the lvalue.

Consider:

    void *ptr = malloc(sizeof (int));
    assert(ptr != NULL);
    /* code to set the bytes pointed to by ptr to a
       trap representation for type int */
    int n = *(int*)ptr;

The object created by the malloc() call isn't an object of type int;
it's just raw storage.  If 6.2.6.1p5 doesn't imply that accessing
it via an lvalue of type int has undefined behavior, then what does?

If *ptr doesn't hold an int, then reading it as an int is a violation
of the aliasing rules.
 
Keith Thompson

Harald van Dijk said:
If *ptr doesn't hold an int, then reading it as an int is a violation
of the aliasing rules.

6.5p7:

An object shall have its stored value accessed only by an lvalue
expression that has one of the following types:

-- a type compatible with the effective type of the object,
[...]

Going back to paragraph 6:

For all other accesses to an object having no declared type, the
effective type of the object is simply the type of the lvalue used for
the access.

So the effective type of the object is int.
 
Harald van Dijk

If *ptr doesn't hold an int, then reading it as an int is a violation
of the aliasing rules.

6.5p7:

    An object shall have its stored value accessed only by an lvalue
    expression that has one of the following types:

    -- a type compatible with the effective type of the object,
    [...]

Going back to paragraph 6:

    For all other accesses to an object having no declared type, the
    effective type of the object is simply the type of the lvalue used for
    the access.

So the effective type of the object is int.

Depending on the omitted "code to set the bytes pointed to by ptr to a
trap representation for type int", there are three possibilities right
before n is initialised:
1) *ptr has no effective type
2) *ptr has an effective type that is int
3) *ptr has an effective type that is not int

You're right, 1) and 2) are effectively the same, the effective type
becomes int. In both of those cases, 6.2.6.1p5 says the behaviour is
undefined. For possibility 3), 6.5p7 says the behaviour is undefined.
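
As a rough illustration of how allocated storage acquires an effective type of int, here is a small sketch of the two well-defined routes described in 6.5p6: a store through an int lvalue, and a memcpy from an object whose type is int. The function names are hypothetical, and neither function involves a trap representation; they only show the cases where the later read through an int lvalue is fine.

```c
#include <stdlib.h>
#include <string.h>

/* Store through an int lvalue: the effective type of the allocated
   object becomes int for this and for subsequent non-modifying accesses. */
int read_after_int_store(void)
{
    int *p = malloc(sizeof *p);
    int n;

    if (p == NULL)
        return -1;
    *p = 42;   /* sets the effective type to int */
    n = *p;    /* read through an int lvalue: permitted by 6.5p7 */
    free(p);
    return n;
}

/* Copy with memcpy from an int object: the effective type of the
   destination is taken from the source object, which is int. */
int read_after_memcpy(void)
{
    int src = 7;
    void *p = malloc(sizeof src);
    int n;

    if (p == NULL)
        return -1;
    memcpy(p, &src, sizeof src);
    n = *(int *)p;
    free(p);
    return n;
}
```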
 
Lauri Alanko

I actually stumbled into this problem just now.

I need to represent UTF-8 strings in C, but although the common wisdom
is that UTF-8 is nicely compatible with legacy C code, it seems that
this isn't strictly true: we cannot portably cast an (unsigned char*)
buffer containing a UTF-8-encoded, 0-terminated string to (char*) and
operate on it with standard C functions. A UTF-8 encoded buffer may
contain the byte value 0x80, which, when cast to char, might be a
trap representation on platforms where CHAR_MIN is -127.

This is an awful shame, since there are byte values that never occur in
well-formed UTF-8: 0xc0, 0xc1 and 0xf5-0xff. If one of those had been
0x80, to my understanding there wouldn't have been a problem.
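
To make the list of impossible bytes concrete, here is a small sketch of a predicate over single byte values. The function name is hypothetical, and it checks only whether a byte value can appear anywhere in well-formed UTF-8, not whether a whole sequence is valid.

```c
/* Returns 1 if the byte value b can occur somewhere in well-formed
   UTF-8, 0 if it never can (0xC0, 0xC1, and 0xF5 through 0xFF). */
int utf8_byte_is_possible(unsigned char b)
{
    return !(b == 0xC0 || b == 0xC1 || b >= 0xF5);
}
```

Note that 0x80 passes this test, since it is an ordinary continuation byte.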

But if char might not be able to represent all the possible bytes of
UTF-8, how can C1X have UTF-8 encoded string literals? Maybe I'll
write a separate post about this to comp.std.c.


Lauri
 
Harald van Dijk

I actually stumbled into this problem just now.

I need to represent UTF-8 strings in C, but although the common wisdom
is that UTF-8 is nicely compatible with legacy C code, it seems that
this isn't strictly true: we cannot portably cast an (unsigned char*)
buffer containing a UTF-8-encoded, 0-terminated string to (char*) and
operate on it with standard C functions. A UTF-8 encoded buffer may
contain the byte value 0x80, which, when cast to char, might be a
trap representation on platforms where CHAR_MIN is -127.

The standard C functions in <string.h> treat their arguments as
unsigned char * (7.21.1p3 for the interested), so strlen() etc. work
even on the system you describe, of course assuming you expect a
string length in bytes.
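
A minimal sketch of that usage, assuming a buffer that holds a 0-terminated UTF-8 string. The wrapper name is hypothetical. The cast only converts the pointer; per 7.21.1p3, strlen itself interprets each byte as unsigned char.

```c
#include <string.h>

/* Length in bytes (not code points) of a 0-terminated UTF-8 buffer.
   The caller's code never reads the bytes through a plain char lvalue. */
size_t utf8_byte_length(const unsigned char *s)
{
    return strlen((const char *)s);
}
```

For example, the buffer { 0xC2, 0x80, 'a', 0 } encodes U+0080 followed by 'a', and its byte length is 3.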
 
Keith Thompson

Lauri Alanko said:
I actually stumbled into this problem just now.

I need to represent UTF-8 strings in C, but although the common wisdom
is that UTF-8 is nicely compatible with legacy C code, it seems that
this isn't strictly true: we cannot portably cast an (unsigned char*)
buffer containing a UTF-8-encoded, 0-terminated string to (char*) and
operate on it with standard C functions. A UTF-8 encoded buffer may
contain the byte value 0x80, which, when cast to char, might be a
trap representation on platforms where CHAR_MIN is -127.

This is an awful shame, since there are byte values that never occur in
well-formed UTF-8: 0xc0, 0xc1 and 0xf5-0xff. If one of those had been
0x80, to my understanding there wouldn't have been a problem.

But if char might not be able to represent all the possible bytes of
UTF-8, how can C1X have UTF-8 encoded string literals? Maybe I'll
write a separate post about this to comp.std.c.

C does seem to be a bit inconsistent about whether unsigned chars can
safely be aliased as plain chars. Plain char *can* have trap
representations, and if it does, a lot of common C idioms can break.

Practically speaking, implementations don't do this.

I think you can be reasonably safe if you do something like:

#include <limits.h>
#include <assert.h>
...
assert(CHAR_BIT == 8 &&
       ((CHAR_MIN == -128 && CHAR_MAX == +127) ||
        (CHAR_MIN == 0 && CHAR_MAX == +255)));

There are tricks you can use to get the effect of a compile-time
assertion as well.
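
One such trick, sketched here under the assumption of a pre-C11 compiler, is to declare an array type whose size is negative when the condition is false, which forces a compile-time error. The macro name and the typedef names are hypothetical. C11 offers _Static_assert for the same purpose.

```c
#include <limits.h>

/* Hypothetical macro: declares a typedef for a char array of size 1
   if cond is true, or of size -1 (a constraint violation, so the
   translation unit fails to compile) if cond is false. */
#define COMPILE_TIME_ASSERT(name, cond) \
    typedef char name[(cond) ? 1 : -1]

COMPILE_TIME_ASSERT(char_is_8_bits, CHAR_BIT == 8);
COMPILE_TIME_ASSERT(char_range_is_usual,
                    (CHAR_MIN == -128 && CHAR_MAX == 127) ||
                    (CHAR_MIN == 0 && CHAR_MAX == 255));
```

The typedef names only need to be unique within the translation unit; some compilers may warn about them being unused.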
 
Phil Carmody

Vincenzo Mercuri said:
Keith Thompson ha scritto:
[...]
If you're going to cast (not "typecast") an unsigned char* to char*,
surely you have a reason for doing so, presumably to access the
pointed-to memory as char.

Yes, I can't imagine a reason for making such a cast without
accessing the pointed-to memory...

When it's an arg for a call-back function, which you will use
by first casting back to the right type?

Phil
 
Shao Miller

thanks in advance for your help, tim

This one bothers me. We have:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1310.htm

And it appears to have been corrected for C11. But aren't there some
unusual bits in C99 without the assumption that 'signed char' has no
padding bits?

"Expressions" @ 6.5p6 has:

"The effective type of an object for an access to its stored value is
the declared type of the object, if any.75) If a value is stored into an
object having no declared type through an lvalue having a type that is
not a character type, then the type of the lvalue becomes the effective
type of the object for that access and for subsequent accesses that do
not modify the stored value. If a value is copied into an object having
no declared type using memcpy or memmove, or is copied as an array of
character type, then the effective type of the modified object for that
access and for subsequent accesses that do not modify the value is the
effective type of the object from which the value is copied, if it has
one. For all other accesses to an object having no declared type, the
effective type of the object is simply the type of the lvalue used for
the access."

This seems to suggest that an object's value _can_ be copied via:
- 'memcpy'
- 'memmove'
- treatment as an array of character type

If that value is a union value, and the union type goes:

union {
    unsigned char bytes[10];
};

and if 'signed char' is a "character type" whose range of values cannot
be mapped to provide coverage for the range of 'unsigned char' values,
how can the union value be copied via treatment as an array of this
particular character type?

As another example assuming no padding bits, suppose "sign and
magnitude" representation is used for 'signed char'. With the sign bit
set and the value bits zero, we seem[6.2.6.2p2] to be allowed one of:
- a "trap representation"
- a normal value called a "negative zero"

Well 6.2.6.1p5 suggests that character types are exempt from having
trap representations, so it looks like 'signed char' with "sign and
magnitude" _must_ support "negative zeroes." Fortunately, that doesn't
seem to mandate that _all_ signed types must support "negative zeroes."
(Right?)

Here's "General" @ 6.2.6.1p5:

"Certain object representations need not represent a value of the
object type. If the stored value of an object has such a representation
and is read by an lvalue expression that does not have character type,
the behavior is undefined. If such a representation is produced by a
side effect that modifies all or any part of the object by an lvalue
expression that does not have character type, the behavior is
undefined.41) Such a representation is called a trap representation."

The way I interpret this is: The first sentence talks about "certain
object representations." They are the overall subject for the entire
paragraph by their placement in the first sentence and by the later
references. They are qualified as being "[those representations that
do] not represent a value of the object type."

The second sentence refers to "such a representation," so is referring
to the very same representations as the first sentence.

The third sentence refers to "such a representation," so is referring to
the very same representations as the second and first sentences.

The fourth sentence refers to "such a representation," so is referring
to the very same representations as the third, second and first sentences.

Is this an agreeable interpretation? If so, it would seem to suggest that:
- Padding bits can be present in the object representation of a valid
object value (multiple representations for a value)
- Padding bits can be present in an object representation of an invalid
value and are part of that trap representation

It seems as though reading a stored object's value via any character
type is explicitly exempt from undefined behaviour. But if there are
padding bits in 'signed char' and no trap representations, doesn't that
suggest loss of information when copying the union value up above?

I'd like to ask: Although it's been "fixed" in C11, doesn't it already
follow from what we had in C99? Or is C99 sufficiently vague that
someone could claim that:
- sizeof (unsigned char) == sizeof (signed char)
- Rank of unsigned char < unsigned short < unsigned int [6.3.1.1p1, third
item]
- Rank of signed char < signed short < signed int
- Width of signed char < width of unsigned char
- Width of unsigned char < width of unsigned short < width of unsigned int
- Width of signed char < width of signed short < width of signed int
- Width of signed int < width of unsigned char?!

This would seem to result in some unusual circumstances, such as:
- not being able to write the maximum value for an 'unsigned char' as an
integer constant without some extended integer type supporting it[6.4.4.1p5]
- 'unsigned char' values promoting to 'unsigned int' but not to
'int'[6.3.1.1p2]
 
Harald van Dijk

This would seem to result in some unusual circumstances, such as:
- 'unsigned char' values promoting to 'unsigned int' but not to
'int'[6.3.1.1p2]

Without comment on the rest of the message, this unusual circumstance
is required on systems where sizeof(int) == 1. They are admittedly
uncommon, but they do exist in the real world, and to the best of my
knowledge, this is (or at least, was) intended to be allowed.
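
As a hedged sketch, a program can tell which way the promotion goes on a given implementation by comparing the limits from <limits.h>. The function name is hypothetical. On common hosted systems it returns 1; on a system of the kind described above, where sizeof(int) == 1, it would return 0.

```c
#include <limits.h>

/* Returns 1 if int can represent every unsigned char value, in which
   case unsigned char promotes to int (6.3.1.1p2). Returns 0 if it
   cannot, in which case unsigned char promotes to unsigned int. */
int uchar_promotes_to_int(void)
{
    return UCHAR_MAX <= INT_MAX;
}
```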
 
Keith Thompson

Shao Miller said:

That says:

It is clear from the standard (specifically 6.2.6.2p1) that
unsigned integer types in general are not allowed to have trap
representations, and that unsigned char is not allowed to have
any padding bits.

I don't believe that conclusion is correct. 6.2.6.2p1, both in C99 and
in C11 (N1570) defines a value of an unsigned type in terms of the value
bits, but unsigned types can also have padding bits, and as the footnote
says, "some combinations of padding bits might generate trap
representations". An arithmetic operation on an unsigned type cannot
generate a trap representation, but operations that act directly on the
representation can.

[...]
 
Jens Gustedt

Am 01/23/2012 09:37 AM, schrieb Keith Thompson:
That says:

It is clear from the standard (specifically 6.2.6.2p1) that
unsigned integer types in general are not allowed to have trap
representations, and that unsigned char is not allowed to have
any padding bits.

I don't believe that conclusion is correct.

The statement might not be correct in its claim for general unsigned
types but the second half phrase is a correct statement:

unsigned char is not allowed to have any padding bits

and the same is true for signed char, which is explicitly mentioned in
6.2.6.2p2.
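
For what it's worth, the absence of padding bits can be checked on a given implementation by counting the value bits implied by the limits in <limits.h> and comparing against CHAR_BIT. This is only a sketch, and the function names are hypothetical. For unsigned char the result must be 0; for signed char it is 0 whenever SCHAR_MAX has CHAR_BIT - 1 value bits.

```c
#include <limits.h>

/* Number of value bits needed to represent max, where max is a
   type's maximum value (one less than a power of two). */
static int count_value_bits(unsigned long max)
{
    int bits = 0;

    while (max != 0) {
        bits++;
        max >>= 1;
    }
    return bits;
}

/* Padding bits in unsigned char: all bits minus the value bits. */
int uchar_padding_bits(void)
{
    return CHAR_BIT - count_value_bits(UCHAR_MAX);
}

/* Padding bits in signed char: all bits minus the sign bit
   minus the value bits. */
int schar_padding_bits(void)
{
    return CHAR_BIT - 1 - count_value_bits(SCHAR_MAX);
}
```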

Jens
 
Tim Rentsch

Keith Thompson said:
And I still don't. If, hypothetically, the standard permitted objects
to be aliased using unsigned chars but not signed or plain chars, how
would that imply that "no access at all would be well-defined"?


Ah, thank you, that's one of the clues I was missing. The other is
6.2.6.1p5 (thanks to James Kuyper for catching that one); that says
explicitly that you can access an object via an lvalue of character
type.

Neither 6.5 p7 nor 6.2.6.1 p5 promises that access using a
character type must be defined behavior. What they do say is
that access using (many) other types is _un_defined behavior.
Simply failing to disallow something doesn't mean it's defined;
there also must be provided some definition that works under
the circumstances in question.
So let's assume that you have an object of type unsigned char with
the value SCHAR_MAX + 1, and you access it as a signed char --
but that representation is a trap representation
for signed char:

unsigned char u = SCHAR_MAX + 1;
signed char s = *(signed char*)&u;

My reading is that the behavior is undefined by omission. 6.2.6.1p5
says that storing a non-character trap representation has undefined
behavior; it explicitly excludes character types. 6.5p7 says that
an object shall have its stored value accessed *only* by an lvalue of
certain types, including character types, but that doesn't imply that
the behavior of such an access is defined. For example, accessing
an int object by an lvalue of type int is permitted by 6.5p7,
but has undefined behavior if the object holds a trap representation.

If the behavior is defined, what is it?

Assuming that the object representation in 'u' is a trap
representation for signed char, the behavior is undefined.
This conclusion follows directly from 6.5 p5 and the
definition of trap representation (and 6.3.2.1 p2, naturally).
 
Tim Rentsch

Keith Thompson said:
Harald van Dijk said:
Yes, but that was not the point I was trying to make. I'm reading
s2.ptr, which potentially holds a trap representation, and 6.2.6.1p5
disallows reading trap representations. It doesn't refer to
representations that do not represent a value in the type of the
lvalue expression, it refers to representations that do not represent
a value in the type of the object. The object has type void *, even if
it's accessed using unsigned char. That's why there needs to be a
specific exception for when the lvalue expression has character type.

That's an interesting interpretation, but I'm still not quite convinced.

Here's the paragraph:

Certain object representations need not represent a value of
the object type. If the stored value of an object has such a
representation and is read by an lvalue expression that does
not have character type, the behavior is undefined. If such
a representation is produced by a side effect that modifies
all or any part of the object by an lvalue expression that
does not have character type, the behavior is undefined.
Such a representation is called a *trap representation*.

I still think that it refers to (or *should* refer to) a trap
representation for the type of the lvalue. [snip example]

The point of 6.2.6.1p5 is that, when an object holds a trap
representation for its own type, reading it using another type
(that isn't a character type) is undefined behavior, _even if_
the particular object representation is not a trap representation
for the accessing type. For example:

int foo;
memcpy( &foo, &int_trap_representation, sizeof foo );
* (unsigned int *) &foo; /* BAM! */

The BAM line is undefined behavior, even if 'unsigned int'
has no trap representations. (If both types have no trap
representations then the access on the BAM line is well-
defined.)

Of course, trying to access an object that holds a trap
representation of the accessing (lvalue) type is always
undefined behavior, as I just explained in my last posting.
 
Tim Rentsch

Shao Miller said:
thanks in advance for your help, tim

This one bothers me. We have:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1310.htm

And it appears to have been corrected for C11. But aren't there some
unusual bits in C99 without the assumption that 'signed char' has no
padding bits?

[snip]

Responding generally,

1. signed char is a signed integer type, and all
signed integer types may have trap representations
(more specifically, one TR) considering just the value
bits;

2. Reading a trap representation of type signed char
(ie, signed char TR read with a signed char lvalue)
is undefined behavior;

3. C99 allows the possibility that signed char has
padding bits (although one presumes no existing
implementation actually has them); and

4. C11 tightened the requirements on signed char so
that padding bits are no longer allowed, but (1)
and (2) above still hold (disclaimer: unless there
is something I haven't seen yet in C11 that changes
that; I am not yet as well versed in that text as
the C99 version).
 
Tim Rentsch

Keith Thompson said:
Shao Miller said:

That says:

It is clear from the standard (specifically 6.2.6.2p1) that
unsigned integer types in general are not allowed to have trap
representations, and that unsigned char is not allowed to have
any padding bits.

I don't believe that conclusion is correct. 6.2.6.2p1, both in C99 and
in C11 (N1570) defines a value of an unsigned type in terms of the value
bits, but unsigned types can also have padding bits, [snip]

I assume he meant unsigned types cannot have trap representations
considering just the value bits. Of course unsigned types can
have TRs if they do have padding bits, but not if they don't
(as is always the case for unsigned char).
 
Keith Thompson

Tim Rentsch said:
Keith Thompson said:
Shao Miller said:
On 11/15/2011 14:59, tim wrote:
thanks in advance for your help, tim

This one bothers me. We have:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1310.htm

That says:

It is clear from the standard (specifically 6.2.6.2p1) that
unsigned integer types in general are not allowed to have trap
representations, and that unsigned char is not allowed to have
any padding bits.

I don't believe that conclusion is correct. 6.2.6.2p1, both in C99 and
in C11 (N1570) defines a value of an unsigned type in terms of the value
bits, but unsigned types can also have padding bits, [snip]

I assume he meant unsigned types cannot have trap representations
considering just the value bits. Of course unsigned types can
have TRs if they do have padding bits, but not if they don't
(as is always the case for unsigned char).

I think that's a strained interpretation. It says "unsigned integer
types in general are not allowed to have trap representations";
I take that to mean that unsigned integer types are not allowed to
have trap representations (which is clearly incorrect).

I've just e-mailed the author; I'll let you know if I get a response.
 
Tim Rentsch

Keith Thompson said:
Tim Rentsch said:
Keith Thompson said:
On 11/15/2011 14:59, tim wrote:
thanks in advance for your help, tim

This one bothers me. We have:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1310.htm

That says:

It is clear from the standard (specifically 6.2.6.2p1) that
unsigned integer types in general are not allowed to have trap
representations, and that unsigned char is not allowed to have
any padding bits.

I don't believe that conclusion is correct. 6.2.6.2p1, both in C99 and
in C11 (N1570) defines a value of an unsigned type in terms of the value
bits, but unsigned types can also have padding bits, [snip]

I assume he meant unsigned types cannot have trap representations
considering just the value bits. Of course unsigned types can
have TRs if they do have padding bits, but not if they don't
(as is always the case for unsigned char).

I think that's a strained interpretation.

I don't disagree. I put it forward only as a reasonably plausible
interpretation considering the circumstances. I prefer to give
people the benefit of the doubt, especially when (as is true in
this case) there is reason to expect they deserve it.

It says "unsigned integer
types in general are not allowed to have trap representations";
I take that to mean that unsigned integer types are not allowed to
have trap representations (which is clearly incorrect).

I've just e-mailed the author; I'll let you know if I get a response.

Thank you, I'm interested to hear the result.
 
