how would you optimize this:
if (a) {
x = v + size;
} else if (b) {
v = buf;
x = v + size;
} else {
fun();
}
I asked a C compiler, because it's probably better at this sort of thing
than I am, to produce the fastest possible code, then translated its
output back into C:
if (a) goto l7;
if (!b) goto l4;
v = buf;
x = v + size;
l3:
/* rest of the function goes here, including the code for the "return"
at the end of it */
l4:
fun();
goto l3;
l7:
x = v + size;
goto l3;
Why did it leave the "x = v + size" lines separate? Because it used a
different register to store v in the two codepaths. In the "if (b)"
codepath, the value of v was already in a register because it had just
assigned it from buf, so the assignment could be done in a faster way.
With small assignments like that, leaving the codepaths separate is
actually the fastest thing you can do.
Then I asked it for the shortest possible code, not the fastest, and got
this:
/* a is not in a register, so on x86 it needs to be compared against
a literal zero rather than just taking its boolean value */
if (a == 0) goto l2;
x = v + size;
goto l3;
l2:
if (!b) goto l4;
x = buf + size;
v = buf;
goto l3;
l4:
fun();
l3:
You might note that the "x = v + size" line /still/ hasn't been merged
in the two codepaths; in fact, it's actually different code in them,
now! Interestingly, it'd be possible to save bytes compared to the
compiler (x86 gcc 4.8.1 with -Os) by generating the assembler
corresponding to the following:
if (a != 0) goto l5;
if (!b) goto l4;
v = buf;
l5:
x = v + size;
goto l3;
l4:
fun();
l3:
which would be three bytes shorter, and AFAICT would work fine.
Just to make sure, I put the same minimum-size request to a different
compiler (clang 3.2 with -Os), and it produced this code:
if (a == 0) goto lbb0_2;
x = v + size;
goto lbb0_5;
lbb0_2:
if (!b) goto lbb0_4;
x = buf + size;
v = buf;
goto lbb0_5;
lbb0_4:
fun();
lbb0_5:
which is identical to the code produced by gcc, except with different
label names.
I'm not sure why compilers are doing it this way. If two compilers agree
on not merging the "x = v + size", then either neither optimizes merging
the start of an if block, or there's something I'm missing. (In case
anyone's wondering, "x = v + size" and "x = buf + size" both compile
into 3 bytes of machine code on my platform, with the registers the
compilers chose for the variables in question.)
Here's a comparison of the resulting machine code sizes for my test
program:
gcc's -O3: 21+28 = 49 bytes
gcc's -Os: 33 bytes
clang's -Os: 28 bytes
my optimization shown above, compiled via gcc -Os: 26 bytes
my optimization shown above, compiled via clang -Os: 30 bytes
The difference between clang's output and gcc's for the non-optimized
case was a little spurious; clang chose registers that were shorter for
the code in question but which required some extra bytes to use the
values of some of the variables afterwards, whereas gcc placed those
extra bytes inside the code in question.
What's most interesting, though, is what's happening with the code that
I optimized by hand; it gives the shortest output yet in gcc, but
actually makes it worse in clang! The worsening in clang is entirely
because it put a variable in the wrong register, costing four bytes of
moving variables around between registers. gcc's output also placed six
extra bytes inside the code in question that could have been outside (it
decided to store "size" in %eax rather than in memory, meaning that it
had to copy it to a temporary when it called fun()). In general, the
issues of trying to do register allocation well meant that the savings
from the shortened code came mostly from the fact that it saves one
"goto", rather than from the removal of the repeated assignment to x
(which is just three bytes on x86; its lea instruction does "a = b + c"
in three bytes for many combinations of registers).
For what it's worth, I think I can get it down to 23 bytes of assembler
by hand via massaging the compiler output, although I haven't verified
that this assembler version is correct, and I might have made a mistake
somewhere.
(For anyone wondering how I was counting bytes: I initialized all the
variables from stdin, and then wrote them all out to stdout, to prevent
any of them being optimized out, rather than using "volatile" or the
like in order to prevent optimizations, which would have defeated the
point of the exercise. Then I counted only the bit between the
initializations and the final writeout. gcc's -O3 output had the final
writeout in the middle, so I had to count both the code before and the
code after it.)
So, the upshot from this:
* Trying to optimize the code by hand is likely to have unexpected
effects on modern compilers. In general, writing it in the most
straightforward way increases the chance that it will be optimized
well.
* Trying to optimize for speed by hand is basically impossible nowadays;
the compiler has a better idea than you of what will run quickly.
* It's still possible to optimize for size by hand and beat the compiler,
if you know of an optimization that it doesn't know, or are just
better at allocating registers. (This last advantage only really
matters on x86; other systems tend to have less insane register
allocation requirements.)
* Trying to get a compiler to generate specific assembler from specific
C code is very difficult, or near impossible, if the original C code is
to remain platform-agnostic. (I have done this before, but it basically
involves trying lots of different C as compiler input until it
generates the output that you want.)
And the answer to the original question is "I wouldn't". I think the
cleanest solution involves goto, and when the cleanest solution uses
goto, it's probably not the sort of problem you want to solve.