If...
i = 2;
a[ i ] = i++;
... then I claim that a[ i ] will be either a[ 2 ] or a[ 3 ],
depending on whether i++ gets evaluated first or last, but it must be
either one or the other.
What if they are done simultaneously?
If you are familiar with the Itanium, or with "out of order"
execution on various CPUs, consider the kind of CPU called a "VLIW",
or "Very Long Instruction Word", CPU. (The Itanium is actually a
sort of "hybrid VLIW", as it were.)
A VLIW CPU has multiple functional units (just like most modern
pipelining processors), but instead of feeding and controlling them
semi-automatically by reading a sequence of ostensibly linear
instructions -- "do this step, then when it is finished, do that
step, then when that is finished, do a third step" -- and attempting
to assemble them into multiple simultaneous actions, a VLIW CPU
simply reads a very long instruction word that gives a bunch of
actions to take simultaneously.
For instance, to do something like:
total = val1 * 1.07 + val2 * 3 + array[index];
on a conventional CPU, we might get:
shl %r4, 2, %r7 # %r4 = index (in %r7) * 4
load %f1, %r3(%r7) # sets %f1 to array[index]
load %f4, CONSTANT_POOL_3_0(%rPOOL)
mul %f4, %f5, %f4 # set %f4 to val2 * 3.0
load %f3, CONSTANT_POOL_1_07(%rPOOL)
mul %f3, %f2, %f3 # set %f3 to val1 * 1.07
add %f3, %f4, %f3 # %f3 = %f3 + %f4
add %f1, %f1, %f3 # %f1 = %f1 + %f3
stor %f1, addr(total) # total = <result>
for a total of 9 instructions. On the VLIW, however, we might
get something more like:
block {
%r4 = (%r7 << 2) + %r3 # actually uses two integer units
# (output of shifter fed as input
# to adder)
%f4 = CONSTANT_POOL_3.0 # one load/store unit
%f3 = CONSTANT_POOL_1.07 # another load/store unit
# four units run simultaneously
}
# wait for results
block {
%f1 = load(%r4) # one load/store unit
%f4 *= %f5 # one FPU unit
%f3 *= %f2 # another FPU unit
# three units run simultaneously
}
# wait for results
block {
%t1 = %f1 + %f3 + %f4 # actually uses two FPU units (output
# of adder1 fed as input to adder2)
store(%t1, addr(total)) # one load/store unit (using output of adder2)
}
for a total of 3 instructions.
The tricky part occurs when the VLIW compiler is fed code with
undefined behavior like a
= i++ -- we end up with an "invalid"
block that reads:
block {
%t1 = (%r7 << 2) + %r3 # a, with a in %r3 and i in %r7
store(%r7, %t1)
%r7 += 1
}
which tries to use one load/store unit to store the old value of
i (in %r7) to a (computed into the entity denoted %t1 here,
although really we just route the output of the adder into the
load/store unit) while at the same time trying to use a third
integer unit to increment %r7. Since registers have three or more
read ports and one write port, this is OK electrically, but the
result is unpredictable because the timing of the output of the
adder-and-shifter ALUs is not defined with respect to the timing
of the incrementing ALU and the timing of the load-store unit.
On early implementations, you get one a different answer from
intermediate implementations. Because debugging this turns out to
be so difficult, the final implementation of the chip includes
circutry to detect the "collision", causing a runtime trap. The
effect of a = i++ then becomes:
illegal instruction - core dumped
rather than setting a[2] or a[3] to either 2 or 3.