char and strict aliasing

  • Thread starter Paul Brettschneider

Paul Brettschneider

Hello all,

consider the following code:

typedef char T;
class test {
    T *data;
public:
    void f(T, T, T);
    void f2(T, T, T);
};

void test::f(T a, T b, T c)
{
    data[3] = a;
    data[4] = b;
    data[5] = c;
}

void test::f2(T a, T b, T c)
{
    T *d = data;
    d[3] = a;
    d[4] = b;
    d[5] = c;
}

g++ (v4.3, options "-fomit-frame-pointer -O3 -S -Wall") for x86 produces the
following nice code for f2:
    movq    (%rdi), %rax
    movb    %sil, 3(%rax)
    movb    %dl, 4(%rax)
    movb    %cl, 5(%rax)
    ret
but quite strange code for f:
    movq    (%rdi), %rax
    movb    %sil, 3(%rax)
    movq    (%rdi), %rax
    movb    %dl, 4(%rax)
    movq    (%rdi), %rax
    movb    %cl, 5(%rax)
    ret

Apparently the pointer data is reloaded after every store. I guess this is
due to the aliasing rules for char types: since a char* may alias anything,
data might point to itself, and to be correct it has to be reloaded after
every store. Indeed, replacing char with int gives the same code for f and
f2. IMO this is a bad language decision: it's highly inconsistent. Anyway,
having to live with it, I wonder how to implement a char type which
does not alias with everything.

Besides "char" I tried "unsigned char", "signed char", "uint8_t"
and "int8_t", all to no avail. Also the restrict keyword didn't help: g++
doesn't like it. As a last resort I tried a wrapper class:

typedef class my_char {
    char data;
public:
    my_char() { }
    my_char(char c) { data = c; }
    char operator=(char c) { return data = c; }
    char operator=(my_char c) { return data = c.data; }
    operator char() { return data; }
} T;

Amazingly, this produces byte for byte the same code as using a plain char.
g++ cannot be right about this one: does "class { char x; }" really have
the same aliasing rules as "char"?

Or am I missing something obvious?

TIA!
 

courpron

> Apparently the pointer data is reloaded after every store. I guess this is
> due to the aliasing rules for char types: for some strange reason data
> might point to itself and to be correct it has to be reloaded after every
> store.


Yes. The only pointer the f function receives is the implicit "this"
parameter, and the compiler has no other information about what "data"
points to. "data" may indeed point to itself.

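To make "data may point to itself" concrete, here is a minimal sketch. The names Holder and store_changes_pointer are made up for illustration and do not come from the original post. It points the member at its own storage and checks that a single char store then alters the bytes of the pointer, which is the case the compiler has to allow for in f:

```cpp
#include <cstring>

struct Holder {
    char* data;   // same shape as the member of `test` in the question
};

// Returns true if a one-byte store through `data` changed the bytes of
// the pointer member itself.
bool store_changes_pointer()
{
    Holder h;
    h.data = reinterpret_cast<char*>(&h.data);   // data points at its own storage

    char* saved = h.data;   // clean copy of the pointer value
    char* alias = h.data;   // local copy, as in f2
    alias[0] = static_cast<char>(alias[0] ^ 1);  // an ordinary char store

    // Compare object representations only; the modified pointer is never
    // read as a pointer or dereferenced.
    bool changed = std::memcmp(&h.data, &saved, sizeof saved) != 0;

    h.data = saved;         // restore a valid pointer value
    return changed;
}
```

Because the store goes through a char lvalue, the language allows it to touch the bytes of any object, including the pointer it was reached through.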
> Indeed replacing the char for an int gives the same code for f and
> f2. IMO this is a bad language decision: It's highly inconsistent.


C++, like C, can be used for low-level system programming. As such, it
needs a well-defined way to access raw data: you can access an object
of any type through a char* (but not through an int*, which is
undefined behavior). Consequently, alias analysis is limited wherever
char is involved. In your example, if you replace char by int, you
tell the compiler that "data" can only point to an int (so that "data"
can't point to itself).
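
As an illustration of the permitted direction, the following sketch reads the bytes of an arbitrary object through an unsigned char*. The helper name bytes_as_hex is hypothetical, and the output assumes 8-bit bytes:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the object representation of `value`
// as lowercase hex, two digits per byte, in memory order.
template <typename T>
std::string bytes_as_hex(const T& value)
{
    static const char digits[] = "0123456789abcdef";
    // Reading any object through unsigned char* is allowed by the aliasing rules.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
    std::string out;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out += digits[(p[i] >> 4) & 0x0F];
        out += digits[p[i] & 0x0F];
    }
    return out;
}
```

Doing the same walk through an int* over an object that is not an int is the case described above as undefined behavior.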

> Anyway,
> having to live with it, I have to wonder how to implement a char type which
> does not alias with everything.
>
> Besides "char" I tried "unsigned char", "signed char", "uint8_t"
> and "int8_t", all to no avail. Also the restrict keyword didn't help: g++
> doesn't like it. As a last measure I tried a wrapper class:


g++ likes the restrict keyword. It works as intended. There is simply
no (explicit) parameter to apply the restrict keyword to.

> Amazingly, this produces byte by byte the same code as using a simple char.
> g++ cannot be right about this one: Does "class { char x; }" really have
> the same aliasing rules as "char"?

In the end, operations on my_char still involve char.

I may check how the compiler in question behaves here, if I have time.

Alexandre Courpron.
 

James Kanze

> Apparently the pointer data is reloaded after every store. I
> guess this is due to the aliasing rules for char types: for
> some strange reason data might point to itself and to be
> correct it has to be reloaded after every store.
> Indeed replacing the char for an int gives the same code for f
> and f2. IMO this is a bad language decision: It's highly
> inconsistent.

It's a pragmatic compromise. Low level software (think of the
implementation of memcpy or a garbage collector) must be able to
access the raw memory underlying the objects; at this level, the
compiler really should consider all pointers as possible aliases
to anything. Optimization needs require aliasing to be
restricted as much as possible, and in application code, of
course, there should practically never be any such aliasing. The
C++ solution (inherited from C) is to allow char* and unsigned
char* (in C, only unsigned char*, I think) to alias anything,
since that covers most of the low level needs, and to restrict
the aliasing for other types. In practice, even this turned out
to be insufficient for optimization purposes, and C99 introduced
restrict.

Normally, I would expect a compiler to offer options to control
this: one to request it to ignore the types in possible aliasing
analysis (because there is code around which counts on e.g.
looking at a double through an unsigned short*), and another to
state that even char* won't alias another type (which is
non-conforming, but acceptable if you don't need the feature). If the
first is missing, the compiler is practically unusable for certain low
level tasks (although in general, it suffices to turn
optimization off); the latter is probably less important, but it
would help you here.
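
For code of the "looking at a double through an unsigned short*" kind, one alternative that stays within the aliasing rules is to copy the bytes out with std::memcpy. This is only a sketch: the function name double_bits is hypothetical, and it assumes that double and the 64-bit integer type have the same size, which the static check verifies.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes a 64-bit double");

// Returns the object representation of `d` as an integer, without
// reading the double through a pointer to an unrelated type.
std::uint64_t double_bits(double d)
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

Since no double is read through an lvalue of another type here, the code does not depend on a compiler option that relaxes type-based alias analysis.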

> Anyway, having to live with it, I have to wonder how to
> implement a char type which does not alias with everything.

struct MyChar { char ch ; } ;

A bit more awkward to use, but a MyChar* can only access a
MyChar.

> Besides "char" I tried "unsigned char", "signed char",
> "uint8_t" and "int8_t", all to no avail.

Well, uint8_t and int8_t are only typedef's. And in C++, I'm
not sure it's clear whether signed char is required or not, but
char and unsigned char certainly are. (Again, it's a
compromise. For the intended purpose, char and signed char
aren't usable in portable code. But most code doesn't have to
be that portable; in fact, most such low level code isn't, by
its very nature, portable. And correct or not, the use of char
for this is widespread, historically.)

> Also the restrict keyword didn't help: g++
> doesn't like it.

It's not legal C++. I would expect most C++ compilers to
support it, however, but only as an extension. So you'd lose
it if you turn extensions off (-std=c++98 or -ansi with g++). I
thought that this was the case with g++, but I've never had the
occasion to verify it.
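
As an illustration of the extension: GCC documents the spellings __restrict__ and __restrict for use in C++, and Clang accepts them as well. A minimal sketch, assuming one of those compilers. Buf and store3 are invented names, and whether a given compiler version actually drops the reloads is not something this sketch promises:

```cpp
struct Buf {
    char* data;
};

// `self` is restrict-qualified: the caller promises that the Buf object
// is not also modified through some other pointer during the call,
// in particular not through self->data.
void store3(Buf* __restrict__ self, char a, char b, char c)
{
    self->data[3] = a;
    self->data[4] = b;
    self->data[5] = c;
}
```

Calling store3 with a buffer that overlaps the Buf object would break that promise and be undefined behavior.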

> Amazingly, this produces byte by byte the same code as using a
> simple char. g++ cannot be right about this one: Does "class
> { char x; }" really have the same aliasing rules as "char"?

You'd have to show us the actual code you used. my_char* cannot
be used to access a pointer, so it should work.
 

Paul Brettschneider

Hello Alexandre and James, thanks for your replies!

James said:
> It's a pragmatic compromise. Low level software (think of the
> implementation of memcpy or a garbage collector) must be able to
> access the raw memory underlying the objects; at this level, the
> compiler really should consider all pointers as possible aliases
> to anything.

I understand that. But I would expect programmers of low level code like
garbage collectors to understand aliasing and be able to explicitly tell
the compiler when aliasing is possible. Of course some old weird code might
break. OTOH C++ breaks old C code anyway...

> Normally, I would expect a compiler to offer options to control
> this: one to request it to ignore the types in possible aliasing
> analysis (because there is code around which counts on e.g.
> looking at a double through an unsigned short*), and another to
> state that even char* won't alias another type (which is
> non-conforming, but acceptable if you don't need the feature).

Exactly.

> It's not legal C++. I would expect most C++ compilers to
> support it, however, but only as an extension. So you'd lose
> it if you turn extensions off (-std=c++98 or -ansi with g++). I
> thought that this was the case with g++, but I've never had the
> occasion to verify it.

My editor recognises it as a reserved word, but g++ doesn't like it, at least
not without some command line argument.

> You'd have to show us the actual code you used. my_char* cannot
> be used to access a pointer, so it should work.

Exactly the same code as above, but with the other typedef:

typedef class my_char {
    char data;
public:
    my_char() { }
    my_char(char c) { data = c; }
    char operator=(char c) { return data = c; }
    char operator=(my_char c) { return data = c.data; }
    operator char() { return data; }
} T;

class test {
    T *data;
public:
    void f(T, T, T);
    void f2(T, T, T);
};

void test::f(T a, T b, T c)
{
    data[3] = a;
    data[4] = b;
    data[5] = c;
}

void test::f2(T a, T b, T c)
{
    T *d = data;
    d[3] = a;
    d[4] = b;
    d[5] = c;
}

This gives byte for byte the same code as with "typedef char T;". Of course
I'm not sure that you can call this a bug, since after all the code is
correct; it's just not as efficient as it could be. By assuming more aliasing
than necessary, the compiler is always on the safe side. Still, it makes me
wonder where the aliasing rules are implemented in g++. You can even change
the wrapper class to (note the negations):

typedef class my_char {
    char data;
public:
    my_char() { }
    my_char(char c) { data = -c; }
    char operator=(char c) { return data = -c; }
    char operator=(my_char c) { return data = -c.data; }
    operator char() { return data; }
} T;

and get the following code:
f:
    movq    (%rdi), %rax
    negl    %esi
    negl    %edx
    negl    %ecx
    movb    %sil, 3(%rax)
    movq    (%rdi), %rax    #!!
    movb    %dl, 4(%rax)
    movq    (%rdi), %rax    #!!
    movb    %cl, 5(%rax)
    ret
f2:
    movq    (%rdi), %rax
    negl    %esi
    negl    %edx
    negl    %ecx
    movb    %sil, 3(%rax)
    movb    %dl, 4(%rax)
    movb    %cl, 5(%rax)
    ret

So g++ apparently assumes that a my_char*, a pointer to a class that behaves
completely differently from char, can point to a "class test".

But I guess this is getting highly compiler-specific and off-topic
here...
 

James Kanze

> Hello Alexandre and James, thanks for your reply!
>
> [...]
> I understand that. But I would expect programmers of low level
> code like garbage collectors to understand aliasing and be
> able to explicitly tell the compiler when aliasing is
> possible.

The language standard doesn't provide any real means of telling
the compiler anything, outside of the language. And it has
pretty much been a principle of the language not to provide such
means.

> Of course some old weird code might break. OTOH C++ breaks old
> C code anyway...

:).

The real problem in C++ is C compatibility. This is one of the
most fundamental parts of the basic object model, shared with C.
Modern C keeps the rule to avoid breaking older C, and C++ keeps
it to avoid breaking C compatibility. (Historically, C didn't
have such a rule. But the compilers at the time didn't do
enough alias checking to make it worthwhile. When C was being
standardized, in the late 1980's, it was becoming an issue, with
different compilers taking different positions. When I said
"pragmatic compromise", I really meant it: the C committee did
not want to innovate, introduce new keywords, etc., and worked
out a solution which guaranteed that most of the low level code
still worked, and that most of the optimizations---the people
most concerned with optimization are usually using float and
double---also worked, without making any fundamental additions
to the language.)

The best solution I've seen here is Modula-3, which had "safe"
modules (the default), and "unsafe" modules (explicitly declared
as such). In a safe module, the only pointers which could exist
were to dynamic objects, and a pointer to T could only point to
a T, or to something derived from a T, and all pointers were
garbage collected. In an unsafe module, practically anything
was allowed. Given the C++ object model, you'd probably have to
loosen the restrictions in "safe" modules somewhat, but I see
nothing wrong with saying that the compiler can assume no
cross-type aliasing except in unsafe modules. (If we ever get
modules, maybe we could arrange for three levels: "safe", with
guaranteed garbage collection, and pointers only allowed to
dynamically allocated objects, the default level, which could
correspond to the current situation, but possibly with no
support for cross-type aliasing, even when char* is involved,
and no reinterpret_cast, and "unsafe", where anything goes, and
the compiler must assume you've used every dirty trick
imaginable.)

[...]
In C++ code. It works fine in C code, at least if you specify
-std=c99. Which is correct: it is part of C99, but not C90, nor
C++98 or C++03. And I've not heard that it will be adopted into
the next C++ standard; when all is said and done, it's really
just an additional source of undefined behavior.

Long term, of course, it won't be necessary. Compilers are
getting better and better at inter-module optimization, and
there are already compilers (maybe only experimental) which can
detect the lack of aliasing across compilation unit boundaries,
and do this optimization, dependent on whether there actually is
aliasing or not. But for most users, that's probably "very long
term", rather than just "long term".

> My editor recognises it as reserved word, but g++ doesn't like
> it - at least not without some command line argument.

It's a keyword in C99. If I were writing a compiler, I'd at
least warn if you used it otherwise (e.g. as the name of a
variable). Whether C++ adopts it officially or not, I imagine
that most C++ compilers will eventually support it as an
extension.

> This gives byte for byte the same code as with "typedef char T;". Of
> course I'm not sure that you can call this a bug, since after
> all the code is correct; it's just not as efficient as it
> could be. By assuming more aliasing than necessary, the compiler
> is always on the safe side. Still, it makes me wonder where the
> aliasing rules are implemented in g++. You can even change the
> wrapper class to (note the negations):

Formally, the optimization is legal here. Practically, g++
probably determines that it is dealing with chars (optimizing
out the wrapper) before it applies the aliasing analysis. As
you say, it could be better, since this causes a possible
optimization to be missed, but it is certainly legal. Or
possibly it simply "pessimizes" aliasing analysis anytime it
sees a char in the expression. (This is probably the simplest
way of handling the C++ requirements.)
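
Until the compiler does better, the f2 pattern from the original post remains the portable way to avoid the reloads: copy the member pointer into a local once, then store through the local. A sketch of the same idea in a loop follows. The class name byte_buffer and the member fill are made up for illustration, and the sketch assumes the buffer really does not overlap the object itself:

```cpp
#include <cstddef>

class byte_buffer {
    char* data_;
    std::size_t size_;
public:
    byte_buffer(char* data, std::size_t size) : data_(data), size_(size) {}

    // Works on local copies of the members. Their addresses are never
    // taken, so a char store cannot be assumed to modify them and they
    // need not be reloaded inside the loop.
    void fill(char value)
    {
        char* d = data_;
        const std::size_t n = size_;
        for (std::size_t i = 0; i < n; ++i)
            d[i] = value;
    }
};
```

If the buffer did overlap the object, this version and one that stores through data_ directly would no longer behave the same, so the local copy also documents the no-overlap assumption.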
 
