Storage of char on a 64-bit machine

websnarf

Mikhail said:
Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).

You shouldn't fall for the "strawman" argument, especially when it's
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real-world code regardless of how good the compiler is. The only
chance the compiler has is to support cross-file inlining, constant
propagation, and a kind of global string analysis that just isn't going
to yield fruitful optimizations in general. And that would be just to
pull even with integer comparison. If you used memcmp() instead
of strcmp(), things *might* be different.

Benchmarking this is unnecessary. *Any* inner loop that contains a str*
function is always going to be far slower than any reasonably
considered alternative. That's a rule of thumb people should just
always be aware of. In many cases, the standard C library is actually
behind by whole orders of complexity. Here it's slower because of the
poorly parallelizable str* function design (it ends up pointlessly
scanning for four '\0' characters no matter what).
[...] It seems that, for limited cases like this -- when the strings are of
the same length and fit nicely into an integer type -- treating them as such
is hugely beneficial. And, contrary to authoritative assertions posted in
this thread, the compiler is NOT able to detect such cases.

These "authorities" are overstepping their bounds. They have decided
that this is a newsgroup only about the C standard. Practical C
programming is well beyond both the scope of this newsgroup and those
so-called authorities. Their brains have, over the years, become
hardwired in a very specific way -- if you ask the question "how do I
improve the performance of some C code," they will definitively answer
"you can't; C is not characterizable in terms of performance, and
furthermore your efforts will always be made irrelevant by how
good your compiler is, and besides, you are premature in your attempts
to optimize." It doesn't matter *how* you ask the question; they will
always answer it that way. So you should not take their
naysaying too seriously; they are just replaying a
keyboard/brain macro.

The issue with your "trick" is that it's not portable. But that's
typical of anything useful that most people do in the C language.
Otherwise, yes, of course it will lead to an enormous performance
improvement (after all, you are removing a str* call from an inner loop).

To make it portable, you probably want to isolate the *system*
through #ifdefs. The reason is that you need to identify the *systems*
that align things on at most 32-bit boundaries; the exceptions will be
characterized by system. Then there's the whole sizeof(int32_t) issue
-- you probably just want to deal with the three cases of 1, 2, and 4, and
forget the other ones. Then, since you know that these strings are exactly
three characters, you should use *memcmp()*, not strcmp() (it's a little
faster), in the fallback cases.
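A minimal sketch of what that #ifdef isolation might look like. The macro name and the platform test below are my own illustrative assumptions, not a vetted list of systems where the integer comparison is safe:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: use the int32_t comparison only where we have
 * decided it is safe, and fall back to memcmp() -- not strcmp(), since
 * the length is known -- everywhere else.  The platform test is an
 * illustrative assumption, not a vetted list. */
#if defined(__i386__) || defined(__x86_64__)
#define FAST_CODE_COMPARE 1
#endif

static int code_equal(const char c1[4], const char c2[4])
{
#ifdef FAST_CODE_COMPARE
    int32_t a, b;
    memcpy(&a, c1, sizeof a);   /* memcpy sidesteps alignment traps */
    memcpy(&b, c2, sizeof b);
    return a == b;
#else
    return memcmp(c1, c2, 4) == 0;   /* portable fallback */
#endif
}
```

Either branch compares all four bytes, so, unlike strcmp(), neither stops early at a '\0'.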
 
websnarf

Eric said:
Mikhail said:
So, comparing, say, 4-char arrays (like currency codes) can NOT be done in
the following way?

typedef union {
    char acCUR[4];
    int32_t iCUR;
} xCUR;
[...]

Having to call strcmp() in such cases seems like a bad waste to me, but I
don't see how the compiler could possibly optimize such code without the
trick above...

The compiler cannot do that optimization because it's not correct.
strcmp() stops executing its inner loop once it reads a '\0'. I.e.,
it's possible for the strcmp() results to be equal where the int32_t's are
not. Also, the compiler is allowed to align struct entries as it
likes. So on a 64-bit big-endian system, the int32_t might not
intersect with any of the 4 acCUR[] characters.

Agree with the first part but not with the second. Look
again: it's not a struct, but a union.

If the system is a 64-bit one, then it might force all entries to be
aligned to 64-bit boundaries. Then, if it's big-endian, it might put
the relevant bits of the int32_t entry into the high 4 bytes of the 8-byte
word. I.e., the acCUR contents might just be overlapping the
*padding* that's put in front of the 32-bit entry. So what exactly do
I have to look again at?
 
Eric Sosman

Eric said:
Mikhail Teterin wrote:

So, comparing, say, 4-char arrays (like currency codes) can NOT be done in
the following way?

typedef union {
    char acCUR[4];
    int32_t iCUR;
} xCUR;
[...]

[...] Also, the compiler is allowed to align struct entries as it
likes. So on a 64-bit big-endian system, the int32_t might not
intersect with any of the 4 acCUR[] characters.

Agree with the first part but not with the second. Look
again: it's not a struct, but a union.

If the system is a 64-bit one, then it might force all entries to be
aligned to 64-bit boundaries. Then, if it's big-endian, it might put
the relevant bits of the int32_t entry into the high 4 bytes of the 8-byte
word. I.e., the acCUR contents might just be overlapping the
*padding* that's put in front of the 32-bit entry. So what exactly do
I have to look again at?

You could look at 6.5.8/5:

"[...] All pointers to members of the same union
object compare equal."

... and at 6.7.2.1/14:

"[...] A pointer to a union object, suitably converted,
points to each of its members [...] and vice versa."

That is, there is no padding at the start of a union. The
four bytes (assuming CHAR_BIT==8) of the int32_t must occupy
the same four bytes as the char[4] array. If the union as a
whole is padded to eight bytes, the extra four are after the
int32_t and after the char[4], never before them.

Things would be different if he'd used int_least32_t or
int_fast32_t. With the exact-width int32_t he's on solid
ground (still assuming CHAR_BIT==8) as far as the overlap is
concerned, although he's on quicksand in other respects.
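Eric's reading of the standard can be checked mechanically. A small sketch (the function name is my own) that verifies both members of the union begin at offset zero:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef union {
    char    acCUR[4];
    int32_t iCUR;
} xCUR;

/* Every union member must begin at the start of the union object
 * (C99 6.7.2.1p14), so any padding can only come after the members,
 * never before them. */
static int union_overlaps(void)
{
    xCUR x;
    return offsetof(xCUR, acCUR) == 0
        && offsetof(xCUR, iCUR) == 0
        && (void *)&x.acCUR == (void *)&x.iCUR;
}
```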
 
CBFalconer

Mikhail said:
We did. And I ended up convinced, that only inertia (and the desire
to force a newcomer to obey the rules of the club), are what makes
this an issue in the first place.

People with news-readers that are not MIME-aware will just see
these textual attachments as part of the article.

MIME-aware news-readers will be able to handle them better this way...

Sorry, if it were one file, I would've inlined it, but three --
that's just too much trouble.

I am restraining myself. Don't you realize there are good reasons
for these rules? Many systems will automatically strip any
attachments from news articles, or even destroy the article
altogether.

No, we don't automatically get together and harass all newbies for
the pure unadulterated pleasure of plaguing them. Yes, you are
suffering from paranoia.

--
"The power of the Executive to cast a man into prison without
formulating any charge known to the law, and particularly to
deny him the judgement of his peers, is in the highest degree
odious and is the foundation of all totalitarian government
whether Nazi or Communist." -- W. Churchill, Nov 21, 1943
 
Mikhail Teterin

If that is true, then in fact this is a useful performance boost, but
it's platform-specific. Personally, I would just capture it like this:
*((int32_t *) &currency), rather than bothering with the union.

A plain string would not necessarily be aligned the way an int32_t
would be, though, would it? That can cause nasty SIGBUSes...
I am pretty sure there are real 64-bit systems (though likely they are
marginal) that will fail to do your trick correctly. Not AMD64, but
some old Crays or Sparc64s might in fact fail (they need to be big-endian
and align struct/union entries to 64 bits).

That's why I used the union -- doesn't it ensure correctness even on these
8-byte-aligning platforms? I thought my only problem was finding the
correct integer type to cover exactly the 4 characters.
And obviously those silly DSPs that don't support int32_t's would just
fail to compile your code.

Compile-time failure is acceptable -- it is the run time that worries me.
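For reference, a memcpy-based load (my suggestion, not something proposed in the thread) keeps the cast's speed without its alignment hazard; compilers routinely turn a fixed-size memcpy like this into a single 32-bit load:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: load 4 bytes from an arbitrarily aligned pointer without the
 * misaligned-access SIGBUS that *((int32_t *)p) risks on strict-alignment
 * machines.  The function name is illustrative. */
static int32_t load32(const char *p)
{
    int32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```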

-mi
 
Mikhail Teterin

Eric said:
Elsethread you have posted an attempt to quantify HOW MUCH,
and your results suggest that the particular "cool" trick you
favor will save ...

0.00000000713037 seconds per comparison.

It is 4-6 times faster -- and that relative, rather than absolute,
difference is what counts.

A human being cannot distinguish a millisecond from a microsecond.

Until the operation needs to be repeated a million times, that is. Because
a millisecond becomes 16 minutes, while a microsecond becomes only a second.

An EOD (end-of-day) process in a big bank/hedge fund makes (sometimes) many
millions of such currency comparisons, as well as assignments (which don't
_really_ require a memcpy/strcpy), etc.

Using pessimal code for the sake of maintaining portability to Digital
Signal Processing boards is even more wasteful than owning an SUV for the
sake of an imaginary (but possible) once-in-a-lifetime off-road ride.
Mikhail, I can see your error and understand it and sympathize

Thank you, thank you. Agreeing to disagree.

-mi
 
ena8t8si

Mikhail said:
We did. And I ended up convinced that only inertia (and the desire to force
a newcomer to obey the rules of the club) is what makes this an issue in
the first place.

What sort of response would have convinced you otherwise?

If the answer is, "There isn't one," then there doesn't
seem to be much point in talking to you, does there?
 
Walter Roberson

Mikhail Teterin said:
An EOD (end-of-day) process in a big bank/hedge fund makes (sometimes) many
millions of such currency comparisons, as well as assignments (which don't
_really_ require a memcpy/strcpy), etc.
Using pessimal code for the sake of maintaining portability to Digital
Signal Processing boards is even more wasteful than owning an SUV for the
sake of an imaginary (but possible) once-in-a-lifetime off-road ride.

In that end-of-day processing, is the bank/fund doing nothing when
the currencies do not match, or is the bank/fund doing a currency
conversion? If it is doing a conversion, then eliminate the check
by simply setting the conversion rate from every currency to itself
to 1, and then unconditionally using the conversion routine
with the conversion factor indexed by both currencies.
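That branch-elimination idea might be sketched as follows; the currency set and the rates are made up purely for illustration:

```c
#include <assert.h>

/* Sketch of the idea above: index rates by both currencies and make the
 * diagonal 1.0, so conversion is unconditional and no code comparison is
 * needed.  The currencies and rates here are illustrative only. */
enum { USD, EUR, JPY, NCUR };

static const double rate[NCUR][NCUR] = {
    /*         USD     EUR      JPY  */
    [USD] = { 1.0,    0.80,   110.0 },
    [EUR] = { 1.25,   1.0,    137.5 },
    [JPY] = { 0.0091, 0.0073,   1.0 },
};

static double convert(double amount, int from, int to)
{
    return amount * rate[from][to];   /* diagonal entry makes X->X a no-op */
}
```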

But the point is moot, because we know that in the OP's code, the
OP is not primarily concerned with speed: the primary concern for
the OP is ease of debugging, and the code is intended by the OP to
be as fast as practical provided the ease of debugging is preserved.
If speed were the primary concern, then the OP's code would have
converted to an arbitrary currency index and then would compare the
indices "millions of times".
 
Default User

Mikhail said:
I have not spent much time on this group, but, so far, I have not seen
anything that would make me truly saddened by your decision...

Then you are a fool.



Brian
 
Mikhail Teterin

We went from "don't do this, you idiot, the compiler is much better at
optimizing than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort."

I think this is considerable progress for one thread, and I will shut up
for a while...

-mi
 
Mark McIntyre

I have not spent much time on this group, but, so far, I have not seen
anything that would make me truly saddened by your decision...

As you say, you've not spent much time here if you think Keith's
contributions are worth little.

You are confirming my impression that the actual merits of text-only
attachments are secondary (or even fully irrelevant) to your decision, with
the annoyance over my violating the unwritten and unofficial "rules of the
club" being the primary...

You're incorrect.
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan
 
Walter Roberson

We went from "don't do this, you idiot, the compiler is much better at
optimizing than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort."

No we didn't. We went from "Don't do this, you idiot, that's
non-portable" to -you- saying "Yeah, but it's faster".

A couple of people then {perhaps unwisely} said "a good strcmp() can
probably do just as well", and you produced a benchmark that appeared
to show that your code is measurably faster... at least for the systems
you tried.

This was followed by a collective "So what if it executes faster?
The time difference is small enough to be inconsequential for most
systems, and in the meantime you've written unclear code that will
likely break on future systems." Paul Hsieh is the only one who
said that it's worth exploring, and it isn't clear that he is taking
into account the relative costs of maintenance and execution time.
I think, this is a considerable progress for one thread and will shut up for
a while...

You haven't really answered the costs-vs-portability concerns.
You brought up banks and funds, but I pointed out that if
the execution time of the comparison were really such a key factor,
then you wouldn't have used this approach at all.

Sooo... the space we are in right now is "Here's a non-portable
and less-clear hack that's often faster" and you wanting general
approval for your cleverness... or at the very least, you wanting
people to be content to have such hacks be topical here.

The only progress that I see that you've made is Paul Hsieh's
"don't take the naysaying too seriously". Paul is [to me] obviously
fairly knowledgeable in some areas, but it -looks- to me as if he
often ends up in arguments, and I get the -impression- that a fair
number of people don't pay careful attention to him (possibly
because of the -way- he says things...)
 
Keith Thompson

Mikhail Teterin said:
We went from "don't do this, you idiot, the compiler is much better at
optimizing than you will ever be" to "yeah, it is 4-6 times faster to do
things your way, but it is still not worth the effort."

I think this is considerable progress for one thread, and I will shut up
for a while...

Looking back in this thread's history, you weren't the OP, but you did
bring up something that triggered a lot of discussion:

| So, comparing, say, 4-char arrays (like currency codes) can NOT be done in
| the following way?
|
| typedef union {
|     char acCUR[4];
|     int32_t iCUR;
| } xCUR;
|
| int
| CurEqual(xCUR *c1, xCUR *c2)
| {
|     if (c1->iCUR == c2->iCUR)
|         printf("Same currency %s\n", c1->acCUR);
|     else
|         printf("%s and %s are different\n",
|             c1->acCUR, c2->acCUR);
| }
|
| Having to call strcmp() in such cases seems like a bad waste to me, but I
| don't see how the compiler could possibly optimize such code without the
| trick above...

In this case, it seems to me that there are solutions better than
either using strcmp() or pretending that a char[4] is an int32_t.

Presumably you want the operation of comparing two currency codes to
be as fast as possible, because that operation is a bottleneck in your
application.

You can probably achieve this by storing and comparing the currency
codes *as integers*. One simple way to do this is to compute the
numeric codes arithmetically from the string values. I think you said
that all the currency codes are two characters; if so, it's as simple
as:

numeric_code = (string_code[0] << CHAR_BIT) + string_code[1];

and the reverse conversion is also quite simple.

Note that the values of the numeric codes could vary with different
character sets, and if you store them in files, they could vary with
system byte ordering.

This avoids the need for strcmp() *and* it doesn't depend on type
punning.
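The thread's examples use three-character ISO codes ("USD", "EUR", ...), so a sketch extending Keith's arithmetic encoding to three characters might look like this; the function names are my own:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Sketch: pack a 3-character code into an integer arithmetically, as
 * suggested above.  Unlike the union trick, the value is independent of
 * byte order and alignment (though it still varies with character set).
 * Assumes CHAR_BIT == 8, so three characters fit in a uint32_t. */
static uint32_t code_to_int(const char *s)
{
    return ((uint32_t)(unsigned char)s[0] << (2 * CHAR_BIT))
         | ((uint32_t)(unsigned char)s[1] << CHAR_BIT)
         |  (uint32_t)(unsigned char)s[2];
}

/* The reverse conversion, for printing and debugging. */
static void int_to_code(uint32_t v, char s[4])
{
    s[0] = (char)(v >> (2 * CHAR_BIT));
    s[1] = (char)((v >> CHAR_BIT) & 0xFF);
    s[2] = (char)(v & 0xFF);
    s[3] = '\0';
}
```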
 
Old Wolf

Lew said:
Assume the code fragment...

{ /* A */
    char aa; int ab, ac;
    char ad;
    { /* B */
        char ae;
    }
}

In the nesting level I've called "A", the compiler /may/ optimize the
allocations of aa, ab, ac, and ad so that aa and ad are adjacent in
"memory". However, because variable ae is declared within a different
"level" of the code, I doubt that most compilers would "optimize" its
allocation to occupy another 8 bits within that 64-bit allocation that
aa and ad potentially occupy.

Every compiler I've ever used allocates all local variables,
on all "levels", at the start of the function (by advancing the
stack pointer a set amount). It doesn't declare new stack
frames for code blocks within the same function, or
anything like that.
 
Dik T. Winter

> We went from "don't do this, you idiot, the compiler is much better
> optimizing, than you will ever be" to "yeah, it is 4-6 times faster to do
> things your way, but it is still not worth the effort".

You quoted a gain of 0.00000000713037 seconds per comparison in EOD
processing at a bank. Even if in that EOD processing 1000 million of
such comparisons are made, your net gain is about 7 seconds. I think
that is peanuts compared to the total time required for EOD processing.

I would think that in the EOD process the comparisons are a small fraction
of the total time involved. Say 10% (which is, eh, quite large in my
opinion). If we reduce that by a factor of 5, the total time will be
reduced by only 8%.

But even *if* it is significant, there are better methods to increase
speed. Set up a table with all existing currencies, and use indexing.
When a transaction is entered, look up the index numbers for the
transaction (takes about 0.00003 seconds if there are two currencies
involved), and use those indices for the remainder.

Even when loading a full table of conversion rates between two currencies,
the finding of an index would be peanuts (0.0003 seconds gain with your
method). I would think that reading the rates themselves would take
much more time.

This is called micro-optimisation: optimising a part of the program
that does not use a significant amount of time compared to the
remainder. You should start at the bottlenecks, and for that you
have to measure what the bottlenecks are. Profile the program.
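The indexing step Dik describes might be sketched like this; the table contents and the linear search are illustrative (a real system would load its currency list from reference data):

```c
#include <assert.h>
#include <string.h>

/* Sketch: resolve each currency code to a small integer once, when the
 * transaction is entered; afterwards only the indices are compared.
 * The table and the linear search are illustrative choices. */
static const char *currencies[] = { "USD", "EUR", "JPY", "GBP" };

static int currency_index(const char *code)
{
    size_t i;
    for (i = 0; i < sizeof currencies / sizeof currencies[0]; i++)
        if (strcmp(currencies[i], code) == 0)
            return (int)i;
    return -1;   /* unknown code */
}
```

After this lookup, equality checks and rate-table accesses are plain integer operations.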
 
Edmond Dantes

Walter said:
No we didn't. We went from "Don't do this, you idiot, that's
non-portable" to -you- saying "Yeah, but it's faster".

There must be an equivalent of the old aphorism, "penny-wise, pound-foolish"
that can be applied to optimizing code.

Today, with rare exception, there is hardly the need to optimize assembly
code by hand. I could see it for, say, an embedded system running on
limited resources where it may make sense. But in general? Nope.

--
-- Edmond Dantes, CMC
 
Stephen Sprunk

Mikhail Teterin said:
Well, here are some benchmarks comparing the use of strcmp() to
compare short character strings (4 characters). [...]
It seems that, for limited cases like this -- when the strings are of
the same length and fit nicely into an integer type -- treating them as
such is hugely beneficial. And, contrary to authoritative assertions
posted in this thread, the compiler is NOT able to detect such cases.

Okay, I retract my statement. The compiler does a good job with strcmp(),
but it'll never be as fast as your version, because the compiler has no way
to know that the strings will _always_ be exactly of length 3, so it has to
test for '\0' several times. That difference in knowledge will easily
account for the 3x-5x difference in execution speed.

I am curious if the following has a substantially different speed from your
version:

-------
/* NOTE: c1 and c2 must point to strings of exactly 3 chars */
int
CurEqual(char *c1, char *c2)
{
    /* catch people calling this function with bad params */
    assert((strlen(c1)==3) && (strlen(c2)==3));

    /* Note: & is safe here since == always returns 0 or 1, and
       it's likely to be faster than && */
    if ((c1[0]==c2[0]) & (c1[1]==c2[1]) & (c1[2]==c2[2]))
        printf("Same currency %s\n", c1);
    else
        printf("%s and %s are different\n", c1, c2);
}
-------

This version looks, to me, like it's fully portable and should be
substantially faster than strcmp(), at least when NDEBUG is defined.

Mikhail Teterin

Stephen said:
if ((c1[0]==c2[0]) & (c1[1]==c2[1]) & (c1[2]==c2[2]))

Yes, it is faster than strcmp(), of course. But still slower than the
int-compare. I added two more functions to the test:

int
comp_mem(const char cur1[4], const char cur2[4])
{
    /* note: "sizeof cur1" would be sizeof(char *) here, since array
       parameters decay to pointers -- the length must be spelled out */
    return memcmp(cur1, cur2, 4) == 0;
}

int
comp_Sprunk(const char cur1[4], const char cur2[4])
{
    return cur1[0] == cur2[0] && cur1[1] == cur2[1]
        && cur1[2] == cur2[2];
}


Here are the results:

FreeBSD/i386 (Pentium D @3GHz):
str: used 1091883 microseconds
int: used 404868 microseconds
Sprunk: used 590683 microseconds
memcmp: used 4424082 microseconds

FreeBSD/amd64 (opteron 244):
str: used 1402967 microseconds
int: used 392841 microseconds
Sprunk: used 617995 microseconds
memcmp: used 1571644 microseconds

Solaris8/sparc (@900MHz)
str: used 730000 microseconds
int: used 130000 microseconds
Sprunk: used 160000 microseconds
memcmp: used 560000 microseconds

AIX/powerpc
str: used 3200000 microseconds
int: used 710000 microseconds
Sprunk: used 820000 microseconds
memcmp: used 2990000 microseconds

Linux/i386 (Xeon @3GHz)
str: used 1060000 microseconds
int: used 390000 microseconds
Sprunk: used 530000 microseconds
memcmp: used 3520000 microseconds

What puzzles me, actually, is the poor performance of memcmp() vs.
strcmp(), but that's it -- no other surprises...

This version looks, to me, like it's fully portable and should be
substantially faster than strcmp(), at least when NDEBUG is defined.

Agreed about the strcmp(), but portability is not something my version is
actually lacking either. I'm unaware of a general-purpose platform where
CHAR_BIT is not 8. And if int32_t is missing, well, that's a compile-time
error, not a run-time disaster...

-mi
 
CBFalconer

Stephen said:
... snip ...

I am curious if the following has a substantially different speed
from your version:

-------
/* NOTE: c1 and c2 must point to strings of exactly 3 chars */
int
CurEqual(char *c1, char *c2)
{
    /* catch people calling this function with bad params */
    assert((strlen(c1)==3) && (strlen(c2)==3));
    /* Note: & is safe here since == always returns 0 or 1,
       and it's likely to be faster than && */
    if ((c1[0]==c2[0]) & (c1[1]==c2[1]) & (c1[2]==c2[2]))
        printf("Same currency %s\n", c1);
    else
        printf("%s and %s are different\n", c1, c2);
}

Let's cut that down to the essentials and make it compilable:

int ceq(const char *c1, const char *c2)
{
    return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

Now, let's analyze a normal comparison, without the limitations on
length etc., operating on the same length-3 strings:

int ceq(const char *c1, const char *c2)
{
    while (*c1 && (*c1 == *c2)) {
        c1++; c2++;
    }
    return *c1 == *c2;
}

In the best case, *c1 != *c2, assuming reasonable register
allocations, there will be 2 dereferences, one zero test, and one
comparison. The loop is never executed. Note that for random
strings this is the MOST LIKELY result. Much faster.

In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.

You will have to do an unholy number of calls to these routines to
measure the difference with most systems' clock resolution. Have
fun. Don't forget that execution speed is data-dependent, so you
will need to generate bags and bags of random strings. I suggest
the Mersenne Twister for that, but you will have to compensate
for string-generation time.
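A minimal harness along those lines, with rand() standing in for the Mersenne Twister he suggests; N, the code alphabet, and the adjacent-pair workload are arbitrary choices of mine:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 100000   /* arbitrary batch size */

static char codes[N][4];

/* Generate the random strings up front, so generation time stays out of
 * the measured loop (compensating for it, as suggested above). */
static void make_codes(void)
{
    int i, j;
    for (i = 0; i < N; i++) {
        for (j = 0; j < 3; j++)
            codes[i][j] = (char)('A' + rand() % 26);
        codes[i][3] = '\0';
    }
}

static int eq_strcmp(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Time one comparison routine over adjacent pairs in the batch. */
static long bench(int (*eq)(const char *, const char *))
{
    clock_t t0 = clock();
    long matches = 0;
    int i;
    for (i = 0; i + 1 < N; i++)
        matches += eq(codes[i], codes[i + 1]);
    printf("%ld matches in %ld clock ticks\n",
           matches, (long)(clock() - t0));
    return matches;
}
```

Passing each candidate routine to bench() keeps the data identical across runs, so only the comparison code varies.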

 
websnarf

CBFalconer said:
Stephen Sprunk wrote:
... snip ...
I am curious if the following has a substantially different speed
from your version:

-------
/* NOTE: c1 and c2 must point to strings of exactly 3 chars */
int
CurEqual(char *c1, char *c2)
{
    /* catch people calling this function with bad params */
    assert((strlen(c1)==3) && (strlen(c2)==3));
    /* Note: & is safe here since == always returns 0 or 1,
       and it's likely to be faster than && */
    if ((c1[0]==c2[0]) & (c1[1]==c2[1]) & (c1[2]==c2[2]))
        printf("Same currency %s\n", c1);
    else
        printf("%s and %s are different\n", c1, c2);
}

Let's cut that down to the essentials and make it compilable:

int ceq(const char *c1, const char *c2)
{
    return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

A modern CPU will perform these operations substantially in *PARALLEL*.
In fact, the full schedule on an out-of-order machine like an Athlon (3-way
superscalar, with free offsets) looks something like this:

Cycle 1: r0 <- load(c1) ||| r1 <- load(c2)
Cycle 2: r2 <- load(c1+4) ||| r3 <- load(c2+4)
Cycle 3: f1 <- cmp(r0, r1) ||| r4 <- load(c1+8) ||| r5 <- load(c2+8)
Cycle 4: r6 <- f1 ||| f2 <- cmp(r2, r3)
Cycle 5: r7 <- f2 ||| f3 <- cmp(r4, r5)
Cycle 6: r8 <- and(r6, r7) ||| r9 <- f3
Cycle 7: ra <- and(r8, r9) ||| return

Every cycle performs at least two operations, with only cycles 4 and 5
being somewhat less efficient due to the way the x86 architecture works
(you have to copy from the flags to a register). This tells us that
you should instead use the expression:

static int cneq(const char *c1, const char *c2)
{
    return (c1[0] - c2[0]) | (c1[1] - c2[1]) | (c1[2] - c2[2]);
}

This leads us to:

Cycle 1: r0 <- load(c1) ||| r1 <- load(c2)
Cycle 2: r2 <- load(c1+4) ||| r3 <- load(c2+4)
Cycle 3: r6 <- sub(r0, r1) ||| r4 <- load(c1+8) ||| r5 <- load(c2+8)
Cycle 4: r7 <- sub(r2, r3)
Cycle 5: r8 <- sub(r4, r5) ||| r9 <- ior(r6, r7)
Cycle 6: ra <- ior(r8, r9) ||| return

which is a cycle faster.
Now, let's analyze a normal comparison, without the limitations on
length etc., operating on the same length-3 strings:

int ceq(const char *c1, const char *c2)
{
    while (*c1 && (*c1 == *c2)) {
        c1++; c2++;
    }
    return *c1 == *c2;
}

In the best case, *c1 != *c2, assuming reasonable register
allocations, there will be 2 dereferences, one zero test, and one
comparison.

And *one branch prediction*. In any event, let's take a look:

Cycle 1: r0 <- load(c1) ||| r1 <- load(c2)
Cycle 2: r2 <- load(c2+4)
Cycle 3: f0 <- cmp(r0, 0) ||| ifnz(f0) goto LEnd;

{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }

Cycle 4/14: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 5/15: noop
Cycle 6/16: f2 <- cmp(r3, r4)
Cycle 7/17: r5 <- f2 ||| return

{ or: }

Cycle 4/14: f1 <- cmp(r1, r2) ||| ifnz(f1) goto LEnd;

{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }

Cycle 5/15: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 6/16: noop
Cycle 7/17: f2 <- cmp(r3, r4)
Cycle 8/18: r5 <- f2 ||| return

etc. So your *best* case costs basically 7 clocks anyway. (Once you
either start predicting wrong or enter the loop, you are clearly much
worse off.) These are highly idealized models of the Athlon machine, so
real results may differ. It might be interesting to see the real-world
results.
[...] The loop is never executed. Note that for random
strings this is the MOST LIKELY result. Much faster.

It depends highly on the random distribution. The branch-prediction
penalty could *easily* dominate. For example, if there were only two
or three really common strings (USD, Euros, and Yen, say) but no
discernible pattern in those actual cases, then you could easily tend
to get the branch prediction wrong around 30% of the time, which
translates to an average of 3 extra clocks.
In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.

I haven't done the bench, and clearly you haven't either. Care to put
money on this claim?
 
