CBFalconer said:
Stephen Sprunk wrote:
... snip ...
I am curious if the following has a substantially different speed
from your version:
-------
/* NOTE: c1 and c2 must point to strings of exactly 3 chars */
int CurEqual(char *c1, char *c2)
{
    /* catch people calling this function with bad params */
    assert((strlen(c1) == 3) && (strlen(c2) == 3));

    /* Note: & is safe here since == always returns 0 or 1,
       and it's likely to be faster than && */
    if ((c1[0] == c2[0]) & (c1[1] == c2[1]) & (c1[2] == c2[2]))
        printf("Same currency %s\n", c1);
    else
        printf("%s and %s are different\n", c1, c2);
}
Let's cut that down to the essentials, and make it compilable:
int ceq(const char *c1, const char *c2)
{
    return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}
with the same caveat about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bitwise ANDs.
A modern CPU will perform these operations substantially in *PARALLEL*.
In fact, the full schedule on an out-of-order (OoO) machine like an
Athlon (3-way superscalar, with free address offsets) looks something
like this:
Cycle 1: r0 <- load(c1)     ||| r1 <- load(c2)
Cycle 2: r2 <- load(c1+1)   ||| r3 <- load(c2+1)
Cycle 3: f1 <- cmp(r0, r1)  ||| r4 <- load(c1+2)  ||| r5 <- load(c2+2)
Cycle 4: r6 <- f1           ||| f2 <- cmp(r2, r3)
Cycle 5: r7 <- f2           ||| f3 <- cmp(r4, r5)
Cycle 6: r8 <- and(r6, r7)  ||| r9 <- f3
Cycle 7: ra <- and(r8, r9)  ||| return
Every cycle performs at least two operations, with only cycles 4 and 5
being somewhat less efficient due to the way the x86 architecture works
(you have to copy from the flags to a register). This tells us that
you should instead use the following:
static int cneq(const char *c1, const char *c2)
{
    return (c1[0] - c2[0]) | (c1[1] - c2[1]) | (c1[2] - c2[2]);
}
This leads us to:
Cycle 1: r0 <- load(c1)     ||| r1 <- load(c2)
Cycle 2: r2 <- load(c1+1)   ||| r3 <- load(c2+1)
Cycle 3: r6 <- sub(r0, r1)  ||| r4 <- load(c1+2)  ||| r5 <- load(c2+2)
Cycle 4: r7 <- sub(r2, r3)
Cycle 5: r8 <- sub(r4, r5)  ||| r9 <- ior(r6, r7)
Cycle 6: ra <- ior(r8, r9)  ||| return
which is a cycle faster.
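One detail the schedule glosses over: cneq has the opposite sense of
ceq. It returns nonzero when the strings *differ*, so a caller tests
equality by negating it. A minimal sketch (the currency values are
just examples):

#include <stdio.h>

static int cneq(const char *c1, const char *c2)
{
    return (c1[0] - c2[0]) | (c1[1] - c2[1]) | (c1[2] - c2[2]);
}

int main(void)
{
    const char *cur1 = "USD", *cur2 = "EUR";

    /* cneq() is zero exactly when all three characters match
       (the OR of the byte differences can only be zero if every
       difference is zero), so equality is spelled !cneq(). */
    if (!cneq(cur1, cur2))
        printf("Same currency %s\n", cur1);
    else
        printf("%s and %s are different\n", cur1, cur2);
    return 0;
}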
Now, let's analyze a normal comparison, without the restriction to a
known length, operating on the same length-3 strings:
int ceq(const char *c1, const char *c2)
{
    while (*c1 && (*c1 == *c2)) {
        c1++;
        c2++;
    }
    return *c1 == *c2;
}
In the best case, where *c1 != *c2 at the first character, and
assuming reasonable register allocation, there will be 2 dereferences,
one zero test, and one comparison.
And *one branch prediction*. In any event, let's take a look:
Cycle 1: r0 <- load(c1)     ||| r1 <- load(c1)
Cycle 2: r2 <- load(c2)
Cycle 3: f0 <- cmp(r0, 0)   ||| ifz(f0) goto LEnd;
{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }
Cycle 4/14: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 5/15: noop
Cycle 6/16: f2 <- cmp(r3, r4)
Cycle 7/17: r5 <- f2 ||| return
{ or: }
Cycle 4/14: f1 <- cmp(r1, r2) ||| ifnz(f1) goto LEnd;
{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }
Cycle 5/15: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 6/16: noop
Cycle 7/17: f2 <- cmp(r3, r4)
Cycle 8/18: r5 <- f2 ||| return
etc. So your *best* case costs basically 7 clocks anyway. (Once you
either start predicting wrong, or enter the loop, you are clearly much
worse off.) These are highly idealized models of the Athlon machine,
so real results may differ. It would be interesting to see real-world
measurements.
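For anyone who wants to try, a rough harness along these lines would
do. (A sketch only: the currency mix below is invented, clock() is
coarse, and the volatile sink merely keeps the calls from being
optimized away. A serious test would use a cycle counter.)

#include <stdio.h>
#include <time.h>

static int ceq(const char *c1, const char *c2)
{
    return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

static int cneq(const char *c1, const char *c2)
{
    return (c1[0] - c2[0]) | (c1[1] - c2[1]) | (c1[2] - c2[2]);
}

int main(void)
{
    /* an invented mix, heavy on a few common codes */
    static const char *codes[] = { "USD", "USD", "EUR", "USD", "JPY",
                                   "EUR", "GBP", "USD", "CHF", "EUR" };
    enum { N = sizeof codes / sizeof *codes };
    volatile long sink = 0;   /* defeats dead-code elimination */
    long i, reps = 10000000L;
    clock_t t0;

    t0 = clock();
    for (i = 0; i < reps; i++)
        sink += ceq(codes[i % N], codes[(i + 1) % N]);
    printf("ceq : %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    sink = 0;
    t0 = clock();
    for (i = 0; i < reps; i++)
        sink += cneq(codes[i % N], codes[(i + 1) % N]);
    printf("cneq: %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    return 0;
}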
[...] The loop is never executed. Note that for random
strings this is the MOST LIKELY result. Much faster.
That depends heavily on the actual distribution. The branch-prediction
penalty could *easily* dominate. For example, if there were only two
or three really common strings (USD, EUR, and JPY, say) but no
discernible pattern among the actual occurrences, then you could easily
get the branch prediction wrong around 30% of the time, which, at
roughly 10 clocks per mispredict, translates to an average of 3 clocks
extra.
In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.
I haven't done the benchmark, and clearly you haven't either. Care to
put money on this claim?
No. Clearly you are much more familiar with the internals of
modern CPUs than I am. However, considering the chances of
entering the loop at all, a priori (for random strings) there is
only one chance in 256 (for 8-bit bytes) of entering. In actuality
the chance is going to be much higher, maybe one in 100, yet not to
be sneezed at. In the probable case there are two decisions,
whether to enter the loop, and what to return. By putting these in
the appropriate default path, as an intelligent compiler can
(possibly by generating hints), all delays in the routine can be
avoided most of the time. In fact, by putting those decisions in
the calling sequence all control transfers can be avoided most of
the time! You can even express that in C as a conditional calling
macro and a specialized subroutine.
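For instance, such a macro/subroutine pair might look like this (a
sketch only; the names are made up):

/* Sketch of the "conditional calling macro" idea. The
   first-character test, which decides the result about 255 times
   out of 256 for random strings, happens inline at the call site;
   the out-of-line routine runs only when the first characters
   match. Beware: a and b are each evaluated more than once. */
#define CUREQ(a, b)  ((*(a) == *(b)) && cureq_rest((a), (b)))

/* Finish the comparison; on entry *c1 == *c2 is already known. */
static int cureq_rest(const char *c1, const char *c2)
{
    do {
        if (*c1 == '\0')
            return 1;       /* matched all the way to the NUL */
        c1++;
        c2++;
    } while (*c1 == *c2);
    return 0;
}

Most calls decide everything at the inline *(a) == *(b) test and never
transfer control at all, which is exactly the point.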
I tend to think of much simpler processors, i.e. the kind that are
really prevalent in the embedded world, and which make up the vast
majority. These are the ones that have relatively limited
performance, and need careful tending to get maximized results.
This, while interesting, is OT for c.l.c, yet it is the sort of
thing with which every programmer should familiarize themselves.
Otherwise it can be impossible to make some fundamental decisions
on methods and techniques, at least in cases where it matters.
--
"The power of the Executive to cast a man into prison without
formulating any charge known to the law, and particularly to
deny him the judgement of his peers, is in the highest degree
odious and is the foundation of all totalitarian government
whether Nazi or Communist." -- W. Churchill, Nov 21, 1943