memcmp() - compare buffer for 0

M

Mark Adler

For example:

int isallz(char *buf, size_t len)
{
long *wide;

while ((size_t)buf & (sizeof(long) - 1))
if (len--, *buf++)
return 0;
wide = (long *)buf;
while (len >= sizeof(long))
if (len -= sizeof(long), *wide++)
return 0;
buf = (char *)wide;
while (len)
if (len--, *buf++)
return 0;
return 1;
}

And of course, there's an error in there as an exercise for the reader
to find. Anyway, it illustrates the approach.

Mark
 
B

Ben Bacarisse

bartek szurgot said:
Just another few data points: I see the plain loop version as being 14
times *slower* than the memcmp version for 512MB, provided I don't
optimise! At gcc -O1 and above the memcmp takes twice the time the
plain loops does. The reason seems to be that gcc replaces memcmp with
a repz cmpsb instruction (plus housekeeping gubbins) and these string
instructions have a reputation for being slow.

The memcmp method is still four times faster than the loop (on this
data, on this machine, with the library, with... etc) provided gcc can
be told not to inline it to cmpsb instruction.

To summarise:
for loop memcmp
gcc unoptimised 13.333s 0.932s
gcc optimising 4.160s 8.250s

I'd post the CPU, gcc version and so on but I think the real messages is
*measure*. You can still be surprised after <mumble> years in this
field.

i was surprised with your results even more than with my own. :) i
already removed test code i've wrote before posting my last message and
so i had to rewrite it once more... and bang! it looks like i had a bug
(probably #ifndef instead of #ifdef for switching implementations with
-D), because now i'm getting similar computation times i got before, but
both versions are faster for memcmp(). namely:

for() memcmp()
-g3 ~1.2 ~0.11
-O3 ~0.4 ~0.11


It would seem that your compiler is not replacing memcmp when
optimising. Taking this into account, the two sets of results agree
pretty well -- there's a nearly constant factor between your results and
mine.

I could leave it at that and pretend that my machine is old and ten
times slower than yours, but I'll come clean -- to get accurate times I
execute the test 10 times and I did not divide by the 10! I usually
don't bother to divide because I normally only care about ratios, not
actual MB/s.
i think this can be ok, since it would be reasonable that call to ext.
library lasts the same amount of time, regardless of release/debug
builds and hand-made code to be faster, when optimized.

source code for this is:
http://pastebin.com/sP9hexnW#
(note: this code is only a PoC - it does not check all errors)

It's C++. Probably makes no difference. I still get slower times when
optimising memcmp:

for loop memcmp
g++ 1.4 0.10
g++ -O2 0.43 0.83
can you check your code as well, please?

I think the code is alright -- the units were wrong.
http://pastebin.com/cGZxpSia
 
M

MaciekL

Is there another way to compare the input buffer for being equal to
some pattern (all zeros in this case), i.e. ti avoid having additional
array zerobuf?

I'm surprised that no one offered the obvious answer that uses memcmp()
without another array:

if (MAX_LEN> 0&& buf[0] == 0&& memcmp(buf, buf + 1, MAX_LEN - 1) == 0)
puts("buf is all zeros");
else
puts("buf has at least one non-zero value");

Though it's probably faster to compare each entry to zero, or even
faster to recast to an array of large integers and compare those to
zero.

Usual alignment caveats apply!

int is_all_bytes_set(void const * data, int value, unsigned int size)
{
memset(&value, value, sizeof(value));
if (size > sizeof(value))
{
char const * ptr = (char const *)data;
return !(memcmp(ptr, &value, sizeof(value)) ||
memcmp(ptr, ptr + sizeof(value), size - sizeof(value)));
}
else
{
return !memcmp(&value, data, size);
}
}
 
B

bartek szurgot

It would seem that your compiler is not replacing memcmp when
optimising. Taking this into account, the two sets of results agree
pretty well -- there's a nearly constant factor between your results and
mine.

I could leave it at that and pretend that my machine is old and ten
times slower than yours, but I'll come clean -- to get accurate times I
execute the test 10 times and I did not divide by the 10! I usually
don't bother to divide because I normally only care about ratios, not
actual MB/s.

ok - this explains a lot. :)

It's C++. Probably makes no difference. I still get slower times when
optimising memcmp:

for loop memcmp
g++ 1.4 0.10
g++ -O2 0.43 0.83


I think the code is alright -- the units were wrong.
http://pastebin.com/cGZxpSia

after compiling & comparing execution times of both codes i still
couldn't understand the results. why, in (logically) identical programs
for()-version gives similar times (buf vs. buf++ was the only
difference), while memcmp() differs about 6x! :/

after doing some digging and exercising i found what looks like a gcc
performance problem.
http://pastebin.com/nMcD7f6z#
when compiling with "inline", time is far different then when no inline
is present. it does not affect debug version though. in short:

inline non-inline
-g3 ~0.8 ~0.8
-O3 ~0.14 ~0.8

i got this results for gcc-4.6. gcc-4.5 compiled code runs always
(deb/rel) in ~0.14 (i.e. fine). you probably observe similar effect
on your machine.
 
B

BartC

bartek szurgot said:
On 10/11/2011 07:31 PM, Ben Bacarisse wrote:
after doing some digging and exercising i found what looks like a gcc
performance problem.
http://pastebin.com/nMcD7f6z#
when compiling with "inline", time is far different then when no inline
is present. it does not affect debug version though. in short:

Benchmarking with gcc always gave me problems, especially when comparing
with other languages.

Anything over -O1 optimisation tended to remove essential bits of code whose
execution time I was trying to compare! So I only used -O1.

Although it's difficult with your essentially one-line program to see
exactly what it could leave out...
 
B

bartek szurgot

Anything over -O1 optimisation tended to remove essential bits of code
whose execution time I was trying to compare! So I only used -O1.

it is common for the compilers to remove dead code. it is actually a
nice feature - why computing something not used at all? there is a
simple trick to overcome this during measurements - output some
(randomly chosen) part of the result. then compiler cannot optimize-out
code, in such a situation, since it is used (printed on the screen). on
the other hand such a simple operation (usually) does not influence
measurements significantly.
 
B

Ben Bacarisse

bartek szurgot said:
after compiling & comparing execution times of both codes i still
couldn't understand the results. why, in (logically) identical programs
for()-version gives similar times (buf vs. buf++ was the only
difference), while memcmp() differs about 6x! :/


I thought I explained that when I posted the first times. I certainly
had an explanation that for what was happening on my machine with my
version of gcc.
after doing some digging and exercising i found what looks like a gcc
performance problem.
http://pastebin.com/nMcD7f6z#
when compiling with "inline", time is far different then when no inline
is present. it does not affect debug version though. in short:

inline non-inline
-g3 ~0.8 ~0.8
-O3 ~0.14 ~0.8

i got this results for gcc-4.6. gcc-4.5 compiled code runs always
(deb/rel) in ~0.14 (i.e. fine). you probably observe similar effect
on your machine.


No, I don't, but I don't think it matters. For one thing, you are using
C++ and I am using C. We might have different versions of gcc and/or
the C library. Who knows? The key points, for me, are:

(a) Often, there is no useful notion of "faster" code. The same code
will perform differently in different environments or even with
different compiler flags.

(b) Even in one environment, what's fast will depend on the data. We'd
probably be having a different discussion if you'd chosen a 30 byte
buffer.

(c) It's useful to have a feel for the relative costs of certain basic
operations, but clarity of expression will often trump speed, unless
measurement shows that readability must be sacrificed to get a program
performing as required.[1]

(d) In most cases, it's algorithmic changes that will get you really
valuable performance gains.

[1] I've just spent some time writing a slow Reed-Solomon decoder. All
the existing code I looked at was super fast but, for me, almost
unintelligible. That was probably the right thing for those authors to
do, since there are many situations where fast decoding is essential.
But I wanted to be able to see what the code was really doing, so I have
de-optimised like mad. The result is probably 10 (maybe more!) times
slower than the fastest software decoders, but I now largely understand
it, and it is fast enough because the program spends a tiny faction of
its run time decoding a small amount of data
 
J

Jorgen Grahn

Benchmarking with gcc always gave me problems, especially when comparing
with other languages.

Anything over -O1 optimisation tended to remove essential bits of code whose
execution time I was trying to compare! So I only used -O1.

So you'd rather have meaningless results? If your benchmark doesn't
survive the optimizer, it's not relevant to real code.

You *have* to make sure the things you are calculating actually get
used, as far as the compiler can tell. One way is feeding it into some
void touch(void*) function in another translation unit -- there's no
way around that as far as I can see.

/Jorgen
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,756
Messages
2,569,535
Members
45,008
Latest member
obedient dusk

Latest Threads

Top