Why large performance degradation?


Kevin Wan

Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

Thanks in advance!

array_copy_dumb.c:

/* array_copy_dumb.c */
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

int main(void)
{
    char source[65536], destination[65536];
    int i, j;
    for (j = 0; j < 100; ++j)
        DUMBCOPY; /* 1 */

    return 0;
}

array_copy_smart.c:

/* array_copy_smart.c */
#include <string.h>

#define SMARTCOPY memcpy(destination, source, 65536)

int main(void)
{
    char source[65536], destination[65536];
    int i, j;
    for (j = 0; j < 100; ++j)
        SMARTCOPY;

    return 0;
}

$ gcc -O3 array_copy_dumb.c -o array_copy_dumb
$ time array_copy_dumb
real 4.99
user 4.86
sys 0.00
$ gcc -O3 array_copy_smart.c -o array_copy_smart
$ time array_copy_smart
real 0.15
user 0.15
sys 0.00

I just know it's due to cache! But would anyone explain it in detail?

Thanks again!
 

Carsten Hansen

Kevin Wan said:
Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

[snip]



Any decent C library will have a highly optimized version of memcpy.
Typically it will be implemented in assembly language.
At the very least it will copy 4 bytes at a time where possible. Some
implementations copy 8 bytes at a time using floating-point instructions,
others use SIMD to move 16 bytes at a time, and some play tricks with
cache lines.

Your program has undefined behavior as you copy uninitialized data.

Carsten Hansen
 

Tim Prince

Kevin said:
Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

[snip]

I just know it's due to cache! But would anyone explain it in detail?

If you know so much, and aren't willing to read up on it, nor to tell
anything about the architecture, why are you asking?
But do consider that memcpy() surely performs the copy 4, 8, 16 or more
bytes at a time, as appropriate to the architecture for which it was built.
 

Martijn

Tim said:
But do consider that memcpy() surely performs the copy 4, 8, 16 or
more bytes at a time, as appropriate to the architecture for which it
was built.

Apart from that: it may be faster on your particular implementation, but it
may not be on another. Some reasons:

- memcpy may not use array indexing, but increment pointers instead
(indexing needs extra calculations to form the address of each source and
destination element) - you could try this in your own code and see what
happens

- memcpy may copy from back to front (on some systems, a comparison with 0
is faster than one with an arbitrary number - only a very minor gain, but
over 65536 comparisons it may help)

- memcpy may make direct use of registers, whereas in your case i may not
be placed in a register by the compiler (hence some extra copying to and
from memory)

- memcpy may use movsb, movsw, movsd (Intel only) or similar assembly
instructions

Hope this helped,
 

Nils Petter Vaskinn

Hello,

Would anyone explain why there is a consistent large performance
degradation with the dumb copy?
#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

65536 copy operations.

#define SMARTCOPY memcpy(destination, source, 65536)

One call to memcpy, which has probably been optimized.


You'll have to look at the implementation of memcpy to see what it does
differently.

hth
NPV
 

Dan Pop

Kevin Wan said:
Would anyone explain why there is a consistent large performance
degradation with the dumb copy?

#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

#define SMARTCOPY memcpy(destination, source, 65536)


There are many ways of accelerating memcpy on modern architectures, while
your dumb loop can't be much optimised by a compiler (unless the compiler
writer has put some effort into recognising it as a dumb memcpy and
replacing it with a smart one).

The first thing to do is to replace the byte-by-byte copying by a
register-by-register copying. The next thing is to play tricks with the
cache, so that the input data is already in the cache by the time you
need it. There may be even more processor-specific tricks for
accelerating the memory accesses when performed in a purely sequential
way. The processor may even have a specialised instruction for performing
this operation as fast as possible. To exploit all these things, memcpy
is typically implemented in assembly.

Dan
 

Randy Howard

Dan Pop said:
There are many ways of accelerating memcpy on modern architectures, while
your dumb loop can't be much optimised by a compiler (unless the compiler
writer has spent some effort into recognising it as a dumb memcpy and
replaced it by a smart memcpy).

I had a situation not too long ago where memcmp() on MSVC 6.0 was a
huge performance bottleneck. gcc seems to be smart enough to use
block compare instructions for appropriate Intel CPUs, where the MS
compiler was not. This isn't normally important, but this app did
some very large block memcmp()'s, and it became a problem. Also,
knowing that the size of the blocks being compared would always
be evenly divisible by 4 allowed it to be really optimized on Intel.

Rewriting an inline assembly version for WIN32 resulted in a 30%
performance improvement. Linux/gcc code compiled from the same source
(minus the memcmp() replacement) ran w/basically identical performance
with no manual intervention.
 

Randy Howard

In any case, the moral of the story is "Don't do it yourself if the
language gives you a way of doing it".

Usually. Not always. But there isn't much point in deviating from the
above unless you have performance measurements showing that a particular
library routine is a bottleneck. In those cases, worry about trying
to find a better method.
 
