I understand that several kinds of UB cannot be avoided (e.g., "char
*c=NULL; *c;") or can even be useful (e.g., "int *p = (int *)0x1234;"), but
statements such as "i = i++;" -
* have no useful value,
* indicate that the code containing them is broken (and its author most
likely doesn't know it),
* can be detected by the compiler.
Not all instances of this can be detected statically. Think about:
(*p) = (*q)++;
This can be well-defined if p and q point to different objects,
but is undefined if they point to the same object.
The values of p and q can vary at run-time, and can be made to depend
on input to the program.
So to issue the diagnostic at translation time, you have to have the input to
the program available, and be prepared to solve the halting problem.
But I agree with you. Unspecified orders of evaluation, in a primarily
imperative language, are complete nonsense, and atrociously irresponsible
engineering that partially keeps us in the dark ages.
Rather than inventing misfeatures and then trying to diagnose them,
we should specify the order of everything, so that there is no ambiguity.
There is a religious belief, completely unsubstantiated, that unspecified
evaluation orders are required for the generation of good code.
This is pure bunk because:
- actual evaluation can be considerably rearranged in the face of
required orders.
informal proof 1: there are already sequence points in C programs. If
optimizers could not move effects across abstract sequence points,
most optimizations would not be possible. Optimizations like
function inlining and loop unrolling ``obliterate semicolons''.
informal proof 2: programmers are encouraged to rewrite ambiguous-looking
code into multiple statements, with sequence points.
But wait, aren't we supposed to stuff everything into one expression
with lots of side effects to get the benefit of speed?
Maybe, if you're working with a PDP-11 C compiler from a 1979 Unix box.
- the few cases where this is true are now addressed with restrict
pointers.
suppose that side effects are nicely ordered left to right
(they aren't, of course, but consider an imaginary C dialect)
and you have this expression:
(*p) = (*q)++;
because this is well-defined, the compiler for our imaginary
dialect has to make it work properly. The problem is that p and q may or may
not point to the same object, and it has to work regardless. The compiler
for this strictly evaluated dialect could generate better code if it could
assume that p and q do not point to the same object, just like it does for
code like:
i = j++;
where i and j are known not to be aliases since they are separately
defined variables.
In the C99 language, we can make p and q restrict-qualified
pointers. By doing so, we promise to the language implementation
that these objects are not aliased.
So we have a way to tell the compiler: ``Please assume that the objects
accessed through these pointers are distinct, so that updating
one has no effect on the value of the other, or else I will eat my
unsigned shorts.''
But the C language being what it is, with its unspecified
evaluation orders, we don't actually need to indicate
that p and q are different objects. The (*p) = (*q)++ expression
encodes the assumption that they are!
In other words, ambiguity in expressions is also a way of promising to the
compiler that there is no aliasing. With it you can express ``since I am
updating several things here without a sequence point, or accessing some
things while modifying others, I am hereby promising that they are all
distinct things.''
Using a declared attribute of the pointer (restrict qualifier) is
a better way of achieving this. It can't hurt you if you don't use it,
and you don't have to jam multiple operations into one evaluation between two
sequence points to get the optimization benefit.
If p and q are declared as pointing to distinct objects, then this assumption
still helps optimization even if there are sequence points:
*p = *q;
(*q)++;
In spite of the sequence point, the compiler can assume that the
assignment to *p has no effect on *q. We are free to restructure
the code; we don't lose the no-aliasing assumption just because
we added a semicolon.