Storage of char in 64 bit machine


CBFalconer

CBFalconer said:
Stephen Sprunk wrote:
... snip ...
I am curious if the following has a substantially different speed
from your version:

-------
/* NOTE: c1 and c2 must point to strings of exactly 3 chars */
int
CurEqual(char *c1, char *c2)
{
    /* catch people calling this function with bad params */
    assert((strlen(c1)==3) && (strlen(c2)==3));
    /* Note: & is safe here since == always returns 0 or 1,
       and it's likely to be faster than && */
    if ((c1[0]==c2[0])&(c1[1]==c2[1])&(c1[2]==c2[2]))
        printf("Same currency %s\n", c1);
    else
        printf("%s and %s are different\n", c1, c2);
}

Let's cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
    return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

A modern CPU will perform these operations substantially in *PARALLEL*.
In fact the full schedule in an OOO machine like an Athlon (3-way
superscalar, with free offsets) looks something like this:

Cycle 1: r0 <- load(c1) ||| r1 <- load (c2)
Cycle 2: r2 <- load(c1+4) ||| r3 <- load (c2+4)
Cycle 3: f1 <- cmp(r0, r1) ||| r4 <- load (c1+8) ||| r5 <- load (c2+8)
Cycle 4: r6 <- f1 ||| f2 <- cmp (r2, r3)
Cycle 5: r7 <- f2 ||| f3 <- cmp (r4, r5)
Cycle 6: r8 <- and(r6, r7) ||| r9 <- f3
Cycle 7: ra <- and(r8, r9) ||| return

Every cycle performs at least two operations with only cycles 4 and 5
being somewhat less efficient due to the way the x86 architecture works
(you have to copy from the flags to the register.) This tells us that
you should instead use the expression:

static int cneq (const char * c1, const char * c2) {
    return (c1[0] - c2[0]) | (c1[1] - c2[1]) | (c1[2] - c2[2]);
}

This leads us to:

Cycle 1: r0 <- load(c1) ||| r1 <- load(c2)
Cycle 2: r2 <- load(c1+4) ||| r3 <- load(c2+4)
Cycle 3: r6 <- sub(r0, r1) ||| r4 <- load(c1+8) ||| r5 <- load(c2+8)
Cycle 4: r7 <- sub(r2, r3)
Cycle 5: r8 <- sub(r4, r5) ||| r9 <- ior(r6, r7)
Cycle 6: ra <- ior(r8, r9) ||| return

which is a cycle faster.
Now, let's analyze a normal comparison, without the limitations on
length etc., operating on the same length-3 strings:

int ceq(const char *c1, const char *c2)
{
    while (*c1 && (*c1 == *c2)) {
        c1++; c2++;
    }
    return *c1 == *c2;
}

In the best case, *c1 != *c2, assuming reasonable register
allocations, there will be 2 dereferences, one zero test, and one
comparison.

And *one branch prediction*. In any event, let's take a look:

Cycle 1: r0 <- load(c1) ||| r1 <- load (c1)
Cycle 2: r2 <- load(c2)
Cycle 3: f0 <- cmp(r0, 0) ||| ifnz(f0) goto LEnd;

{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }

Cycle 4/14: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 5/15: noop
Cycle 6/16: f2 <- cmp(r3, r4)
Cycle 7/17: r5 <- f2 ||| return

{ or: }

Cycle 4/14: f1 <- cmp(r1, r2) ||| ifnz(f1) goto LEnd;

{ Now at this point a branch prediction is made, which costs
either 0 or 10 clocks, and splits into two cases. }

Cycle 5/15: r3 <- load(c1) ||| r4 <- load(c2)
Cycle 6/16: noop
Cycle 7/17: f2 <- cmp(r3, r4)
Cycle 8/18: r5 <- f2 ||| return

etc. So your *best* case costs basically 7 clocks anyways. (Once you
either start predicting wrong, or enter the loop, you are clearly much
worse.) These are highly idealized models of the Athlon machine, so
real results may be different. It might be interesting to see the real
world results.
[...] The loop is never executed. Note that for random
strings this is the MOST LIKELY result. Much faster.

It highly depends on the random distribution. The branch prediction
penalty could *easily* dominate. For example, if there were only two
or three really common strings (USD, Euros, and Yen, say) but no
discernible pattern in those actual cases, then you could easily tend
to get the branch prediction wrong around 30% of the time, which
translates to an average of 3 clocks extra.
In the worst case, there is no difference, and all three components
have to be checked. For the same allocation assumptions, there
will be 7 dereferences, 4 zero tests, 3 comparisons, and 6 pointer
increments, followed by a final comparison.

I haven't done the bench, and clearly you haven't either. Care to put
money on this claim?

No. Clearly you are much more familiar with the internals of
modern CPUs than I am. However, considering the chances of
entering the loop at all, a priori (for random strings) there is
only one chance in 256 (for 8 bit bytes) of entering. In actuality
the chance is going to be much higher, maybe one in 100, yet not to
be sneezed at. In the probable case there are two decisions,
whether to enter the loop, and what to return. By putting these in
the appropriate default path, as an intelligent compiler can
(possibly by generating hints), all delays in the routine can be
avoided most of the time. In fact, by putting those decisions in
the calling sequence all control transfers can be avoided most of
the time! You can even express that in C as a conditional calling
macro and a specialized subroutine.
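
For illustration, a minimal sketch of that conditional-calling-macro
idea (CEQ and ceq_tail are illustrative names, not from the original
post):

/* The common case (first characters differ) is decided inline with
   no call and no loop; a small subroutine finishes the rare case. */
static int ceq_tail(const char *c1, const char *c2)
{
    while (*c1 && (*c1 == *c2)) {
        c1++;
        c2++;
    }
    return *c1 == *c2;
}

#define CEQ(c1, c2) (*(c1) == *(c2) && ceq_tail((c1), (c2)))

Most calls then reduce to a single compare at the call site, with the
subroutine (and its branches) entered only when the first characters
match.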

I tend to think of much simpler processors, i.e. the kind that are
really prevalent in the embedded world, and which make up the vast
majority. These are the ones that have relatively limited
performance, and need careful tending to get maximized results.

This, while interesting, is OT for c.l.c, yet it is the sort of
thing with which every programmer should familiarize themselves.
Otherwise it can be impossible to make some fundamental decisions
on methods and techniques, at least in cases where it matters.

--
"The power of the Executive to cast a man into prison without
formulating any charge known to the law, and particularly to
deny him the judgement of his peers, is in the highest degree
odious and is the foundation of all totalitarian government
whether Nazi or Communist." -- W. Churchill, Nov 21, 1943
 

websnarf

CBFalconer said:
No. Clearly you are much more familiar with the internals of
modern CPUs than I am. However, considering the chances of
entering the loop at all, a priori (for random strings) there is
only one chance in 256 (for 8 bit bytes) of entering.

The point of my analysis was to explain that this idea of yours is
wrong. Short-circuiting only works if there is a substantial amount of
work that you are saving by doing so. As the clock counts get smaller,
it's generally a better idea to make code straightline and
parallelizable and avoid any control flow.
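
As a sketch of that straightline style (the XOR/OR form is a standard
variant of the subtraction trick shown earlier, not code from the
thread):

/* Branch-free equality test: XOR each byte pair and OR the
   differences together, so there is no control flow at all.
   Returns 1 when all three characters match, 0 otherwise. */
static int ceq_branchless(const char *c1, const char *c2)
{
    return ((c1[0] ^ c2[0]) | (c1[1] ^ c2[1]) | (c1[2] ^ c2[2])) == 0;
}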
[...] In actuality
the chance is going to be much higher, maybe one in 100, yet not to
be sneezed at.

Huh? The O.P. has explained that these are currencies. So presumably
he's tracking transactions at a bank or something like that. Not that
I am an expert on such things, but I would imagine that there are
extreme biases in the input data that he sees. I.e., he will see a lot
more USD, Euro and Yen than he will of the Nicaraguan Cordoba for
example. If it's pretty much *all* USD, then the branch predictors
start working for you again. But as soon as you see a mix with a
handful of very significant entries, the branch predictors will just
end up failing. My intuition is that for certain international banks
or European banks (that are mixing the British pound and the euro,
for example) this is, in fact, a worst-case scenario.

You generally can't think of probabilities of real world data in
simplistic terms such as expectations about the range of the data type.
Even approximating them this way can so easily lead you astray. This
is the same reason that it's so important to really look at worst-case
scenarios in generic code.
[...] In the probable case there are two decisions,
whether to enter the loop, and what to return. By putting these in
the appropriate default path, as an intelligent compiler can
(possibly by generating hints), all delays in the routine can be
avoided most of the time. In fact, by putting those decisions in
the calling sequence all control transfers can be avoided most of
the time! You can even express that in C as a conditional calling
macro and a specialized subroutine.

I tend to think of much simpler processors, i.e. the kind that are
really prevalent in the embedded world, and which make up the vast
majority. These are the ones that have relatively limited
performance, and need careful tending to get maximized results.

You mean embedded processors from the past? MIPS, XScale, and PowerPC
are all super scalar and fully pipelined processors. There are
embodiments of these which are deeply pipelined and support some degree
of OOO. While some are not able to completely re-order the
instructions, good compilers would *try* to do that for them anyways.
Over time standard processors become embedded processors.
[...]
This, while interesting, is OT for c.l.c, yet it is the sort of
thing with which every programmer should familiarize themselves.

Which do you think they should learn? Your incorrect/inapplicable
analysis, or my somewhat detailed analysis? I think people should
first learn to benchmark as our other poster here seems keen on. They
can learn what I know if they really feel up to it, or they can just
stick with benchmarking and learn some rules of thumb along the way.
Otherwise it can be impossible to make some fundamental decisions
on methods and techniques, at least in cases where it matters.

If you want to make *correct* decisions, and somehow benchmarking is
not a good option for you, then you have little choice but to engage in
the kind of analysis that I do. But I too have to eventually put
things to the acid test -- because in real processors there are often
factors that are just not easy to see.
 

Malcolm

Edmond Dantes said:
There must be an equivalent of the old aphorism, "penny-wise,
pound-foolish", that can be applied to optimizing code.

Today, with rare exception, there is hardly the need to optimize assembly
code by hand. I could see it for, say, an embedded system running on
limited resources where it may make sense. But in general? Nope.
If you've got a very slow computer or a very fast computer, often you need
to optimise.
If you are using a 3GHz PC to run a medium-sized company's accounts on a
spreadsheet, it is pretty pointless.
 

Flash Gordon

Walter said:
No we didn't. We went from "Don't do this, you idiot, that's
non-portable" to -you- saying "Yeah, but it's faster".

A couple of people then {perhaps unwisely} said "a good strcmp() can
probably do just as well", and you produced a benchmark that appeared
to show that your code is measurably faster... at least for the systems
you tried.

I just tried the benchmark on my machine and got the same result.
However, when I changed bench.c to call strcmp directly, which is how I
would actually do the string comparison inline the results switched to
strcmp being far faster. When, to be fair, I changed the integer
comparison to being inline I got the *same* results for both. So I don't
think the comment about good strcmp implementations was unwise, I think
it was justifiable at least in part. Under the right conditions the
strcmp method *does* perform as efficiently because the compiler knows
what strcmp does, inlines it, and then optimises the inlined code in
context.

<OT>
I believe that this is the reason gcc has a number of "library"
functions implemented as built-ins rather than just calling the library.
The built-ins *are* inlined before the optimiser is invoked.
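
A minimal illustration (assuming gcc; behaviour varies by version and
flags -- is_usd is an invented example, not from the thread):

/* With optimisation enabled, gcc may expand this strcmp against a
   short constant string inline as a built-in; compiling the same
   file with -fno-builtin forces a real library call instead. */
#include <string.h>

int is_usd(const char *cur)
{
    return strcmp(cur, "USD") == 0;
}
</OT>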
This was followed by a collective "So what if it executes faster?
The time difference is small enough to be inconsequential for most
systems, and in the meantime you've written unclear code that will
likely break on future systems." Paul Hsieh is the only one who
said that it's worth exploring, and it isn't clear that he is taking
into account relative costs of maintenance and execution time.

There is also the point that even if it saves 3 hours each run it may
still not be worth it. My customers still have jobs that are run as
overnight jobs for very good reasons, they don't care how long it takes
as long as it is finished by morning. In other situations a 1ms saving
can be critical. In other words, you have to consider how valuable
execution time is as well as how much you save and what it costs you.
You haven't really answered to the costs vs portability concerns.
You mooted about banks and funds, but I pointed out that if
the execution time of the comparison was really such a key factor
then you wouldn't have used this approach at all.

Sooo... the space we are in right now is "Here's a non-portable
and less-clear hack that's often faster" and you wanting general
approval for your cleverness... or at the very least, you wanting
people to be content to have such hacks be topical here.

The reason most people object to hacks like these is, I believe,
that they make it needlessly hard to understand the code
without any corresponding *significant* benefit. Note that even when
they do improve performance (some tricks people have posted would make
performance *worse* on some systems) the improved performance is often
not enough for the user to notice.
The only progress that I see that you've made is Paul Hseih's
"don't take the naysaying too seriously". Paul is [to me] obviously
fairly knowledgeable in some areas, but it -looks- to me as if he
often ends up in arguments, and I get the -impression- that a fair
number of people don't pay careful attention to him (possibly
because of the -way- he says things...)

What gets me is his apparent assumption that because optimisations
work and produce better results on the systems he has experience
with, they are generally worthwhile. See my response to his
post pointing out that his *assumptions* about performance are wrong
even for a bog-standard notebook running Linux.
 

Flash Gordon

You shouldn't fall for the "strawman" argument, especially when it's
clearly wrong. Obviously int32_t comparison will annihilate strcmp()
in real world code regardless of how good the compiler is.

Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.
> The only
chance the compiler has is to support cross-file inlining,

True, but in reality you would be unlikely to wrap up your strcmp call
inside another function. I, at least, would just call strcmp directly to
compare strings.
> constant
propagation, and a kind of global string analysis that just isn't going
to yield fruitful optimizations in general. And that would be just to
try to pull even with integer comparison. If you used memcmp() instead
of strcmp(), things *might* be different.

Wrong. Calling strcmp directly since that is how one would actually use
it and inlining the other comparison for fairness I get identical
performance for both integer and string comparison.

Note that the whole point of this is to demonstrate that if you write
clear code and give the compiler a fair chance (not hobbling it by
introducing an additional level of function call you would not use in
the real world) the compiler *can* do amazing pieces of optimisation.
Benching this is unnecessary. *Any* inner loop that contains a str*
function is always going to be way slower than any reasonably
reconsidered alternative.

Wrong. See above.
> That's a rule of thumb people should just
always be aware of.

Like many rules of thumb from years ago it is wrong today.
> In many cases, the std C library is actually
behind by actual orders of complexity. Here it's slower because of the
poor, not really parallelizable str* function design (it ends up
pointlessly scanning for 4 '\0' characters no matter what).

Obviously it does manage to do something far cleverer on some systems if
you give it a chance, because it does on my system, which is nothing special.
[...] It seems, that for the limited cases like this -- when the strings are of
the same length and fit nicely into an integer type -- treating them as such
is hugely beneficial. And, contrary to authoritative assertions posted in
this thread, the compiler is NOT able to detect such cases.

You are obviously wrong. See comments above.
These "authorities" are overstepping their bounds. They have decided
that this is a newsgroup just about the C standard. Practical C
programming is way beyond both the scope of this newsgroup and those
so-called authorities. Their brains have, over the years, become
hardwired in a very specific way -- if you ask the question "how do I

<snip>

and yours is hardwired into believing you can always outstrip the
compiler by doing dirty tricks, even though optimisers have moved on a
long way, as illustrated when you actually give the compiler a chance to
optimise by using strcmp directly instead of indirectly.
The issue with your "trick" is that it's not portable.

That is only part of the issue. It also makes the code needlessly hard
to read or maintain and is *not* guaranteed to provide the suggested
performance gain.
> But that's
typical of anything useful that most people do in the C language.

As has been pointed out before, there are vast amounts of very useful
things that *can* and *are* done in standard portable C.
Otherwise, yes of course it will lead to an enormous performance
improvement (after all you are removing a str* from an inner loop.)

You show your lack of knowledge about modern optimisers when they are
given a chance by writing clear code.
To make it portable, you probably want to try to isolate the *system*
through ifdef's. The reason is that you need to isolate *systems* that
align things on at most 32-bit boundaries. The exceptions will be
characterized by system. Then there's the whole sizeof(int32_t) issue
-- you probably just want to deal with 3 cases of 1, 2, 4 and forget
the other ones. Then, since you know that these strings are exactly 3
characters, you should use *memcmp()*, not strcmp() (it's a little
faster) on the fallback cases.

I agree that when you do need to use system-specific tricks you isolate
them. However it is entirely possible for strcmp to be as fast as
memcmp; see my comments about performance elsewhere in this post.
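
A sketch of that isolation (the platform test and the CUR_EQ name are
illustrative assumptions, reusing the xCUR union from elsewhere in the
thread):

/* Take the integer path only on systems known to tolerate it;
   fall back to plain memcmp everywhere else. */
#include <string.h>

#if defined(__i386__) || defined(__x86_64__)
#define CUR_EQ(a, b) ((a).iCUR == (b).iCUR)
#else
#define CUR_EQ(a, b) (memcmp((a).acCUR, (b).acCUR, 4) == 0)
#endif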
 

Flash Gordon

Mikhail said:
Stephen said:
The implementation is likely to have a very, very clever strcmp() that
will perform at least as well as your code (possibly doing the same thing
internally, if it's known to be safe) and likely even better if the
compiler is reasonably modern due special knowledge and treatment of
common functions/idioms.

Well, here are some benchmarks comparing the use of strcmp() to compare
short character strings (4 characters).

FreeBSD6.1/i386 using `gcc version 3.4.4 [FreeBSD] 20050518'
with "-pedantic -Wall -O5 -pipe -march=pentium4 -DNDEBUG -DNODEBUG
-fomit-frame-pointer":

./bench 100000000
str: used 1119486 microseconds
int: used 406449 microseconds

Of the above, Sun's cc/libmil definitely has the special "knowledge
and treatment of common functions/idioms" such as strcmp(), but even there
using strcmp() was 5 times slower...

Try giving the compiler a fighting chance of optimising the code as a
whole. After all, you would not normally wrap up your strcmp calls in
another layer of function call, would you?
It seems, that for the limited cases like this -- when the strings are of
the same length and fit nicely into an integer type -- treating them as such
is hugely beneficial. And, contrary to authoritative assertions posted in
this thread, the compiler is NOT able to detect such cases.

Oh yes it is if you actually give the optimiser a fighting chance. Try
this version of your code (yes people, I know it contains Unix
specifics, I could not be bothered to remove them from Mikhail's code):

/* bench.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sysexits.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

#include "comp.h"

static void
reporttime(label, pu1, pu2)
const char *label;
const struct rusage *pu1, *pu2;
{
    printf("%s: used %ld microseconds\n", label,
        (pu2->ru_utime.tv_sec -
         pu1->ru_utime.tv_sec)*1000000 +
        (pu2->ru_utime.tv_usec -
         pu1->ru_utime.tv_usec));
}

int
main(int argc, char *argv[])
{
    xCUR cur1 = { "USD" }, cur2 = { "UHR" };
    long i, iterations, t1 = 0, t2 = 0;
    struct rusage ru1, ru2;

    switch (argc) {
    case 1:
        iterations = 10000000;
        break;
    case 2:
        iterations = atol(argv[1]);
        if (iterations > 0)
            break;
        /* FALLTHROUGH */
    default:
        fprintf(stderr, "Usage:\n\t%s [iterations]\n", argv[0]);
        exit(EX_USAGE);
    }

    if (getrusage(RUSAGE_SELF, &ru1)) {
        perror("getrusage");
        exit(EX_OSERR);
    }

    for (i = 0; i < iterations; i++)
        t1 += (strcmp(cur1.acCUR, cur2.acCUR) == 0);

    if (getrusage(RUSAGE_SELF, &ru2)) {
        perror("getrusage");
        exit(EX_OSERR);
    }

    reporttime("str", &ru1, &ru2);

    if (getrusage(RUSAGE_SELF, &ru1)) {
        perror("getrusage");
        exit(EX_OSERR);
    }

    for (i = 0; i < iterations; i++)
        t2 += (cur1.iCUR == cur2.iCUR);

    if (getrusage(RUSAGE_SELF, &ru2)) {
        perror("getrusage");
        exit(EX_OSERR);
    }
    reporttime("int", &ru1, &ru2);
    printf("Force strcmp to happen by printing %ld %ld\n", t1, t2);
    return EX_OK;
}

/* comp.h */
#include <inttypes.h>
#include <limits.h>

typedef union {
    char acCUR[4];
#if CHAR_BIT == 8
    uint32_t iCUR;
#elif CHAR_BIT == 16
    uint64_t iCUR;
#else
#error "I'm not prepared for this platform's value of CHAR_BIT"
#endif
} xCUR;

int comp_str(const char cur1[4], const char cur2[4]);
int comp_int(const xCUR *cur1, const xCUR *cur2);
 

websnarf

Flash said:
Wrong. I've changed the benchmark slightly and can show strcmp being no
slower.

I tried your code, and was not able to reproduce your results on any of
the following compilers: gcc 3.4.4, WATCOM C/C++ 11.0c, MSVC 7.x,
Borland C++, Intel C 4.0. This is all on an Athlon CPU. Your silly
theory that inlining would matter is irrelevant -- some of those
compilers successfully inlined strcmp, but it made no difference, the
integer equality was always faster.

The only interesting thing to note is that Microsoft's compiler was
able to hoist the integer equality test right out of the inner loop,
and thus compute it as a constant. So it was always basically
infinitely faster. The difference was closest with Borland C/C++ which
did a very bad integer test and was only 62% faster. Watcom, which did
not inline strcmp was 10 times faster with the integer comparison.
Intel, the fairest, inlined but didn't hoist the integer comparison and
yielded a 7x performance improvement.

If you are seeing something different, you need to explain more about
your platform and environment, and show the results of the objdump -d
on the bench.o file. There does not appear to be any normal way that
strcmp can be anywhere close to the integer comparison.
 

Chris Torek

CBFalconer said:
Let's cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
    return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

A modern CPU will perform these operations substantially in *PARALLEL*.

Indeed. However, I suspect most programmers would have written it
as:

return c1[0] == c2[0] && c1[1] == c2[1] && c1[2] == c2[2];

and the double "&"s would prohibit parallel evaluation (since
any of c1+1, c1+2, c2+1, or c2+2 could run off the end of a
page boundary and hence cause a page fault when followed). (If
the CPU has a speculative fetch where such a fault is not
taken until the value is used, we can go back to the parallel
loading method.)

All of this is somewhat moot, since the original code could be
rewritten to use memcmp():

if (memcmp(c1, c2, 4) == 0)

which could then be (but is not, in the versions I tested) optimized
by the compiler into a "load 4 bytes and compare" instruction --
in fact, the very one the OP was getting via his union trick. All
that is required is a little work on the compiler: make memcmp()
a builtin like memcpy(), and then notice small constant sizes (and
aligned addresses, if required by the CPU) and that the result is
tested only for equality to zero. That is, given:

if (memcmp(c1, c2, k) == 0) /* or != 0 */

where k is a small constant, we can easily generate good code inline
without a library call; but if the test is:

if (memcmp(c1, c2, k) > 0) /* or < 0, etc. */

the job becomes harder (it is now byte-order dependent, if we try
to use integer comparison instructions). Note that memcmp() is
always allowed to access all the bytes of the region, so on x86-64:

memcmp(c1, c2, 12) == 0 /* or != 0 */

can turn into:

((*(long long *)c1 - *(long long *)c2) |
(((int *)c1)[2] - ((int *)c2)[2])) == 0 /* or != 0 */

which should still be faster than a library call.
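
A small self-contained illustration of that byte-order point (assuming
8-bit chars and a little-endian CPU such as x86; the values are
invented):

/* Equality agrees between memcmp and an integer load, but ordering
   need not: the low-address byte is most significant to memcmp and
   least significant to a little-endian integer load. */
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

int main(void)
{
    char a[4] = {1, 2, 0, 0}, b[4] = {2, 1, 0, 0};
    uint32_t ia, ib;

    memcpy(&ia, a, 4);   /* the "load 4 bytes" step */
    memcpy(&ib, b, 4);

    printf("memcmp says a < b: %d\n", memcmp(a, b, 4) < 0); /* 1 */
    printf("ints   say  a < b: %d\n", ia < ib); /* 0 on little-endian */
    return 0;
}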
 

Flash Gordon

I tried your code, and was not able to reproduce your results on any of
the following compilers: gcc 3.4.4, WATCOM C/C++ 11.0c, MSVC 7.x,
Borland C++, Intel C 4.0. This is all on an Athlon CPU. Your silly
theory that inlining would matter is irrelevant -- some of those
compilers successfully inlined strcmp, but it made no difference, the
integer equality was always faster.

It was not a silly theory; it was the result of profiling. It's not my
fault if none of the compilers you use are as good as the one I use.

As I said elsewhere, you assume that your experience applies to
everything else. It does not.
The only interesting thing to note is that Microsoft's compiler was
able to hoist the integer equality test right out of the inner loop,
and thus compute it as a constant. So it was always basically
infinitely faster. The difference was closest with Borland C/C++ which
did a very bad integer test and was only 62% faster. Watcom, which did
not inline strcmp was 10 times faster with the integer comparison.
Intel, the fairest, inlined but didn't hoist the integer comparison and
yeilded a 7x performance improvement.

So? I have a compiler which produces different results thus countering
your claim that the integer comparison is *always* faster. One only
needs one counterexample to disprove a claim of always, and no number
of examples where integer comparison is faster will change that.
If you are seeing something different, you need to explain more about
your platform and environment, and show the results of the objdump -d
on the bench.o file. There does not appear to be any normal way that
strcmp can be anywhere close to the integer comparison.

I'm not claiming it is using a "normal" way, only that it finds *a* way,
however devious. Quite possibly it is making use of the fact that it
knows the strings are exactly 4 bytes.

It is a bog-standard Ubuntu install on a Fujitsu Siemens Amilo Pro 2040V
notebook.

markg@brenda:~/tmp$ gcc --version
gcc (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'm not prepared to post long assembler dumps to this group since this
is a C group. However I do have a web site so I've shoved the dump of
the executable (I did not bother producing a .o file) up at
http://home.flash-gordon.me.uk/asm.txt for now. You are just lucky I did
it on an x86 based machine rather than a PPC based one!

Please note that I don't want to get in to discussing the intricacies of
x86 assembler or what gcc has done. I'm providing this for your interest
and as a demonstration of how a compiler/optimiser can be devious
potentially meaning that you don't get the performance boost you expect
on all systems under all conditions. The fact that you don't see the
same effect just shows how unpredictable these things are.

By the way, when I left the integer trick as a function the strcmp code
was *faster* than the int comparison trick. So to use the int comparison
trick you would have to implement it as a macro rather than a function
in a library to give it a good chance.
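
In macro form that might look like this (CUR_EQ_INT is an illustrative
name, assuming the xCUR union from the benchmark):

/* As a macro the comparison is expanded at every call site, even
   when the compiler cannot inline across translation units. */
#define CUR_EQ_INT(pa, pb) ((pa)->iCUR == (pb)->iCUR)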
 

Stephen Sprunk

Chris Torek said:
CBFalconer said:
Let's cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
    return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

A modern CPU will perform these operations substantially in
*PARALLEL*.

Indeed. However, I suspect most programmers would have written it
as:

return c1[0] == c2[0] && c1[1] == c2[1] && c1[2] == c2[2];

and the double "&"s would prohibit parallel evaluation (since
any of c1+1, c1+2, c2+1, or c2+2 could run off the end of a
page boundary and hence cause a page fault when followed). (If
the CPU has a speculative fetch where such a fault is not
taken until the value is used, we can go back to the parallel
loading method.)

True, most novices probably would have used && instead of &; heck,
Mikhail seems fairly bright but he managed to turn the &s I wrote into
&&s. I used & intentionally to prevent short-circuit evaluation, which
would actually be harmful in this scenario.

Most modern CPUs will do speculative loads for the six chars, but doing
the tests in series, with branches in between, will still kill
performance due to lower IPC and horrible branch misprediction penalties
(up to 20 cycles per branch on some CPUs). Not that this is strictly
on-topic anyways, but IMHO discussing how to make portable code as fast
as unportable code should be.

S
 

websnarf

Flash said:
It was not a silly theory; it was the result of profiling. It's not my
fault if none of the compilers you use are as good as the one I use.

I'm not sure you understand how unlikely that is.
As I said else where. You assume that your experience applies to
everything else. It does not.

I'm waiting for a counter example -- this doesn't count as one, BTW.
So? I have a compiler which produces different results thus countering
your claim that the integer comparison is *always* faster. One only
needs one counter example to disprove a claim of always, and no number
of example where integer comparison is faster will change that.

I only mean to say that "extraordinary claims demand extraordinary
evidence". I can only ask this if I establish that your claim is
extraordinary, which I think I did.
I'm not claiming it is using a "normal" way only that it find *a* way
however devious. Quite possibly it is making use of the fact that it
knows the strings are exactly 4 bytes.

That itself would still be considered "normal". In my thinking on
this, I have incorporated a lot of consideration for what the compiler
can and cannot do. By "not normal" I am talking about tricks like some
compilers used to "detect SpecCPU" (a famous benchmark) and replace
code with hand tuned transformations that are not ordinarily within the
capabilities of any compiler.
It is a bog-standard Ubuntu install on a Fujitsu Siemens Amilo Pro 2040V
notebook.

markg@brenda:~/tmp$ gcc --version
gcc (GCC) 4.0.3 (Ubuntu 4.0.3-1ubuntu5)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'm not prepared to post long assembler dumps to this group since this
is a C group. However I do have a web site so I've shoved the dump of
the executable (I did not bother producing a .o file) up at
http://home.flash-gordon.me.uk/asm.txt for now.

You don't get it do you? From my previous post, I was just shy of
calling you a liar. With this post, I now fully confirm it.

That assembly code has no significant loops in it of any kind (it does
call gets for some reason). But you aren't just missing a .o file;
that main doesn't call out anywhere. So this *is* all the code, but it
does *NOT* run the benchmark in question.
[...] You are just lucky I did
it on an x86 based machine rather than a PPC based one!

It would slow me down a bit, but it would serve the exact same purpose
-- to prove that you are lying.
Please note that I don't want to get in to discussing the intricacies of
x86 assembler or what gcc has done.

Deep analysis is not necessary. Look for the fragment of code under
<main>, look for any j?? instructions and notice that there aren't any
(so there are no loops in main). Then look at all the lines with call
hex-number <symboliclabel>. Each symboliclabel points into standard
library functions. (A quick check of these labels earlier in the code
shows they are all direct jumps into the standard library, so no tricky
shenanigans are happening here.)
[...] I'm providing this for your interest
and as a demonstration of how a compiler/optimiser can be devious
potentially meaning that you don't get the performance boost you expect
on all systems under all conditions.

This is "deviousness" of the original source construction and claims
being made, and nothing more. I.e., the assembly does not correspond
to the source you claim it does, Mr. Fraud Gordon.
[...] The fact that you don't see the
same effect just shows how unpredictable these things are.

You just don't have any idea. If forced to reveal the assembly, you
cannot trick me. You simply don't possess the skill. You *believed* I
would not check this assembly, or perhaps thought I was all talk and no
show. You don't even realize that I actually compiled your source, ran
all the benches *AND* disassembled each one to confirm my understanding
of what was going on on *5* compilers. If you understood this, perhaps
you would not have tried to perpetrate such a blatant attempt at a con.

Of course, nobody else has chimed in (all you have to do is compile and
run his source to see he's obviously wrong/lying). Perhaps this
explains your level of expectation. I don't know why you thought you
would be able to get away with this, but I'm the wrong person to try to
pull a trick like this on.
 

Flash Gordon

I'm not sure you understand how unlikely that is.


I'm waiting for a counter example -- this doesn't count as one, BTW.

I made a simple, honest mistake. I forgot that I had built another
sample of code from here in the same directory. Since I normally don't
bother specifying an output name it writes it as a.out, thus I put up
the wrong sample.
I only mean to say that "extraordinary claims demand extraordinary
evidence". I can only ask this if I establish that your claim is
extraordinary, which I think I did.

I put up the wrong dump. That is now rectified.
That itself would still be considered "normal". In my thinking on
this, I have incorporated a lot of consideration for what the compiler
can and cannot do. By not "normal" I am talking about tricks like some
compilers used to "detect SpecCPU" (a famous benchmark) and replace
code with hand tuned transformations that are not ordinarily within the
capabilities of any compiler.

Well, whatever gcc is doing on my system it is within the bounds of an
"ordinary" compiler since I do not have a custom copy of gcc.
You don't get it do you? From my previous post, I was just shy of
calling you a liar. With this post, I now fully confirm it.

No, it shows that like anyone else I can make an honest mistake. Do you
never make mistakes then?
That assembly code has no significant loops in it of any kind (it does
call gets for some reason). But you aren't just missing a .o file,
that main doesn't call out anywhere. So this *is* all the code, but it
does *NOT* run the benchmark in question.

Agreed. I did the wrong code by mistake.
[...] You are just lucky I did
it on an x86 based machine rather than a PPC based one!

It would slow me down a bit, but it would serve the exact same purpose
-- to prove that you are lying.

No. It was a simple mistake.

[...] The fact that you don't see the
same effect just shows how unpredictable these things are.

You just don't have any idea. If forced to reveal the assembly, you
cannot trick me. You simply don't possess the skill. You *believed* I
would not check this assembly, or perhaps thought I was all talk and no
show. You don't even realize that I actually compiled your source, ran
all the benches *AND* disassembled each one to confirm my understanding
of what was going on on *5* compilers. If you understood this, perhaps
you would not have tried to perpetrate such a blatant attempt at a con.

I have no intention of trying to trick you. I already know that you are
far better at x86 assembler than I am ever likely to be since I stopped
writing assembler a few years back and even then it was not x86.
Of course, nobody else has chimed in (all you have to do is compile and
run his source to see he's obviously wrong/lying). Perhaps this
explains your level of expectation. I don't know why you thought you
would be able to get away with this, but I'm the wrong person to try to
pull a trick like this on.

See above. I made a mistake and posted the wrong executable.

This time I've put the correct executable up as
http://home.flash-gordon.me.uk/a.out and the objdump output at
http://home.flash-gordon.me.uk/asm.txt

If you want you can look at the PPC equivalent generated on my Gentoo
box. On that I am certain it is cheating heavily since I get a reported
execution time of 0. Those are at
http://home.flash-gordon.me.uk/ppcasm.txt and
http://home.flash-gordon.me.uk/p.out

I could even give you a shell account on that machine (which is the web
server) and you could build the code yourself. You would just have to
give me a public key and I would create an account for you, if you don't
believe what I'm telling you now that I've posted links to the correct code.

Sorry for the mistake, but it really was an honest one. Whatever I may
think of some of the things you post I do respect your assembler
knowledge and know that it is greater than mine.
 

websnarf

Flash said:
I made a simple, honest mistake. I forgot that I had built another
sample of code from here in the same directory. Since I normally don't
bother specifying an output name it writes it as a.out, thus I put up
the wrong sample.

Ok fine. But this only shows the problem with your methodology.
Well, whatever gcc is doing on my system it is within the bounds of an
"ordinary" compiler since I do not have a custom copy of gcc.

Indeed, it's exploiting your bad benchmarking methodology.
Agreed. I did the wrong code by mistake.

You didn't check it either. There was a call to gets in there, and
that didn't seem to ring an alarm bell of any kind for you.
[...] If you want you can look at the PPC equivalent generated on my Gentoo
box. On that I am certain it is cheating heavily since I get a reported
execution time of 0.

And this doesn't make you stop and think even for one second?

Here are the relevant loops in the x86 build:

mov 0xffffffec(%ebp),%eax
cmp %eax,0xfffffff0(%ebp)
sete %al
movzbl %al,%eax
xor %edx,%edx

80485cf: add %eax,%ebx
add $0x1,%edx
cmp %edx,%esi
jne 80485cf


lea 0xffffffec(%ebp),%eax
lea 0xfffffff0(%ebp),%edx
mov %eax,0x4(%esp)
mov %edx,(%esp)
call 80483ac <strcmp@plt>
test %eax,%eax
sete %al
movzbl %al,%eax
xor %edx,%edx

8048694: add %eax,%edi
add $0x1,%edx
cmp %edx,%esi
jne 8048694

These are equivalent to:

int tmp = cur1.iCUR == cur2.iCUR;
for (i = 0; i < iterations; i++)
    t2 += (long) tmp;

and

int tmp = strcmp(cur1.acCUR, cur2.acCUR) == 0;
for (i = 0; i < iterations; i++)
    t1 += (long) tmp;

respectively. Now let's cogitate on these code fragments for a second.
Ok, so gcc 4.x has decided that higher level hoisting is a feature
worth implementing. No doubt this is why Mikhail Teterin made the
function calls external in the first place (so that gcc couldn't pull
that kind of trick.)

In any event, we can see clearly that this benchmark doesn't actually
compare the int== speed versus a call to strcmp(). (But you can
eyeball 5 instructions versus 9 instructions plus a call to an
external function in the assembly above. If you managed to put those
into an inner loop you would get results more in line with what everyone
else has been getting.) So all you've proven is that if you change
what you are measuring, you can make the measurements whatever you
like.
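
One way to keep the comparison inside the timed loop, as a sketch (the
volatile qualifier is an addition for illustration, not from the
thread; xCUR is the union from comp.h):

/* Reading the operands through volatile objects forces a real load
   on every iteration, so the equality test cannot be hoisted out
   and computed once as a constant. */
volatile xCUR vcur1 = { "USD" }, vcur2 = { "UHR" };

long bench_int(long iterations)
{
    long i, t = 0;
    for (i = 0; i < iterations; i++)
        t += (vcur1.iCUR == vcur2.iCUR);
    return t;
}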
 

ena8t8si

Stephen said:
Chris Torek said:
CBFalconer wrote:
Let's cut that down to the essentials, and make it compilable:

int ceq(const char *c1, const char *c2)
{
    return (*c1 == *c2) & (c1[1] == c2[1]) & (c1[2] == c2[2]);
}

with the same comments about & as yours. This must perform at
least 4 indexing operations, 6 dereferences, 3 comparisons, and 2
bit ands.

A modern CPU will perform these operations substantially in
*PARALLEL*.

Indeed. However, I suspect most programmers would have written it
as:

return c1[0] == c2[0] && c1[1] == c2[1] && c1[2] == c2[2];

and the double "&"s would prohibit parallel evaluation (since
any of c1+1, c1+2, c2+1, or c2+2 could run off the end of a
page boundary and hence cause a page fault when followed). (If
the CPU has a speculative fetch where such a fault is not
taken until the value is used, we can go back to the parallel
loading method.)

True, most novices probably would have used && instead of &; heck,
Mikhail seems fairly bright but he managed to turn the &s I wrote into
&&s. I used & intentionally to prevent short-circuit evaluation, which
would actually be harmful in this scenario.

Most modern CPUs will do speculative loads for the six chars, but doing
the tests in series, with branches in between, will still kill
performance due to lower IPC and horrible branch misprediction penalties
(up to 20 cycles per branch on some CPUs). Not that this is strictly
on-topic anyways, but IMHO discussing how to make portable code as fast
as unportable code should be.

Undoubtedly true for some processors, but not for all.
My benchmarks showed the && version running slightly
faster than the & version in all cases (by 10%-20%
or so, depending on the case). FWIW.
 

Dave Thompson

Mikhail Teterin <[email protected]> writes:
| So, comparing, say, 4-char arrays (like currency codes) can NOT be done in
| the following way?
|
| typedef union {
| char acCUR[4];
| int32_t iCUR;
| } xCUR;
In this case, it seems to me that there are solutions better than
either using strcmp() or pretending that a char[4] is an int32_t.
You can probably achieve this by storing and comparing the currency
codes *as integers*. One simple way to do this is to compute the
numeric codes arithmetically from the string values. I think you said
that all the currency codes are two characters; if so, it's as simple

He said they are 3 characters plus null, hence the int32=4*char test
would -- if not for that danged UB -- always give the same equality
result as strcmp, i.e. there will never be any possibly-garbage
trailing byte(s) ignored by strcmp but included in the int32.

Unsurprisingly, as the applicable official standard for interchange,
ISO 4217, defines codes of three characters: a two-character country
code (mostly ISO 3166) plus one letter, allowing for multiple (i.e.
replacement) currencies; and IME it is often used internally as well.
More relevant to the discussion here, ISO 4217 also provides numeric
codes which will fit easily in a (minimal standard) 16-bit short,
which are equally official but IME not as widely used.
as:

numeric_code = (string_code[0] << CHAR_BIT) + string_code[1];
This avoids the need for strcmp() *and* it doesn't depend on type
punning.

But I suspect it isn't very likely to be more efficient than strcmp(),
which was the stated goal. (Setting aside the usual discussions,
already had, about whether/when/how to optimize.)
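
For the three-character codes, the quoted two-character scheme extends
naturally -- a sketch assuming 8-bit chars (cur_key is an invented
name, not from the thread):

/* Pack the three characters big-endian style into one integer key.
   Two codes are equal exactly when their keys are, and the keys even
   order the same way strcmp would for these fixed-length codes. */
#include <limits.h>

long cur_key(const char code[3])
{
    return ((long)(unsigned char)code[0] << (2 * CHAR_BIT)) |
           ((long)(unsigned char)code[1] << CHAR_BIT) |
           (long)(unsigned char)code[2];
}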

- David.Thompson1 at worldnet.att.net
 
