Why large performance degradation?

Discussion in 'C Programming' started by Kevin Wan, Jul 28, 2003.

  1. Kevin Wan

    Kevin Wan Guest

    Hello,

    Would anyone explain why there is a consistent large performance
    degradation with the dumb copy?

    Thanks in advance!

    array_copy_dumb.c:

    /* array_copy_dumb.c */
    #define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

    int main(void)
    {
        char source[65536], destination[65536];
        int i, j;
        for (j = 0; j < 100; ++j)
            DUMBCOPY; /* 1 */

        return 0;
    }

    array_copy_smart.c:

    /* array_copy_smart.c */
    #include <string.h>

    #define SMARTCOPY memcpy(destination, source, 65536)

    int main(void)
    {
        char source[65536], destination[65536];
        int i, j;
        for (j = 0; j < 100; ++j)
            SMARTCOPY;

        return 0;
    }

    $ gcc -O3 array_copy_dumb.c -o array_copy_dumb
    $ time array_copy_dumb
    real 4.99
    user 4.86
    sys 0.00
    $ gcc -O3 array_copy_smart.c -o array_copy_smart
    $ time array_copy_smart
    real 0.15
    user 0.15
    sys 0.00

    I just know it's due to cache! But would anyone explain it in detail?

    Thanks again!
     
    Kevin Wan, Jul 28, 2003
    #1

  2. "Kevin Wan" <> wrote in message
    news:...
    > Hello,
    >
    > Would anyone explain why there is a consistent large performance
    > degradation with the dumb copy?
    >
    [quoted code and timings snipped]



    Any decent C library will have a highly optimized version of memcpy,
    typically implemented in assembly language. At the very least it will
    copy 4 bytes at a time where possible. Some implementations copy 8
    bytes at a time using floating-point instructions, others use SIMD to
    copy 16 bytes at a time, and some play tricks with cache lines.

    Your program has undefined behavior, as it copies uninitialized data.
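    A corrected version of the posted benchmark is easy to sketch: give the
    source array defined values before copying, and verify that both the dumb
    loop and memcpy produce the same result. (This is just an illustrative fix
    of the program above; timing would still be done externally with time.)

    ```c
    /* array_copy_fixed.c - the posted benchmark with the source array
       initialized, so nothing uninitialized is read, and both copies
       checked for correctness. */
    #include <assert.h>
    #include <string.h>

    #define N 65536

    static char source[N], destination[N];

    int main(void)
    {
        int i, j;

        for (i = 0; i < N; ++i)
            source[i] = (char)i;            /* defined values to copy */

        for (j = 0; j < 100; ++j)           /* the "dumb" byte-by-byte loop */
            for (i = 0; i < N; ++i)
                destination[i] = source[i];
        assert(memcmp(destination, source, N) == 0);

        memset(destination, 0, N);
        for (j = 0; j < 100; ++j)           /* the "smart" library call */
            memcpy(destination, source, N);
        assert(memcmp(destination, source, N) == 0);

        return 0;
    }
    ```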

    Carsten Hansen
     
    Carsten Hansen, Jul 28, 2003
    #2

  3. Tim Prince

    Tim Prince Guest

    Kevin Wan wrote:

    > Hello,
    >
    > Would anyone explain why there is a consistent large performance
    > degradation with the dumb copy?
    >
    [quoted code and timings snipped]
    >
    > I just know it's due to cache! But would anyone explain it in detail?

    If you know so much, and aren't willing to read up on it, nor to tell
    anything about the architecture, why are you asking?
    But, do consider that memcpy() surely performs this 4, 8, 16 or more bytes
    at a time, as appropriate to the architecture for which it was built.
    --
    Tim Prince
     
    Tim Prince, Jul 28, 2003
    #3
  4. Martijn

    Martijn Guest

    Tim Prince wrote:
    > But, do consider that memcpy() surely performs this 4, 8, 16 or more
    > bytes at a time, as appropriate to the architecture for which it was
    > built.


    Apart from that, what is faster on your particular implementation may
    not be faster on another. Some reasons:

    - memcpy may not use array indexing, but increment pointers instead
    (saving the extra memory accesses and calculations needed to form the
    source and destination addresses) - you could try this in your own
    code and see what happens

    - memcpy may copy from back to front (on some systems a comparison
    with 0 is faster than one with an arbitrary number - only a very minor
    gain, but over 65536 comparisons it may help)

    - memcpy may make direct use of registers, whereas in your case the
    compiler may not keep i in a register (hence some extra copying to and
    from memory)

    - memcpy may use movsb, movsw, movsd (Intel x86) or similar assembly
    instructions
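    The first point, the pointer-incrementing variant, is easy to try against
    the indexed loop (a sketch only; on most modern compilers both forms
    generate the same machine code):

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Byte copy with pointer increments instead of array indexing:
       no per-element index arithmetic, just two pointer bumps. */
    static void copy_ptr(char *dst, const char *src, size_t n)
    {
        while (n--)
            *dst++ = *src++;
    }

    int main(void)
    {
        char src[64], dst[64];
        for (int i = 0; i < 64; ++i)
            src[i] = (char)(i ^ 0x55);
        copy_ptr(dst, src, sizeof dst);
        assert(memcmp(dst, src, sizeof dst) == 0);
        return 0;
    }
    ```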

    Hope this helped,

    --
    Martijn Haak
    http://www.serenceconcepts.nl
     
    Martijn, Jul 28, 2003
    #4
  5. On Sun, 27 Jul 2003 20:38:06 -0700, Kevin Wan wrote:

    > Hello,
    >
    > Would anyone explain why there is a consistent large performance
    > degradation with the dumb copy?
    >


    > #define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]

    65536 copy operations


    > #define SMARTCOPY memcpy(destination, source, 65536)

    1 call to memcpy that has probably been optimized


    You'll have to look at the implementation of memcpy to see what it does
    differently.

    hth
    NPV
     
    Nils Petter Vaskinn, Jul 28, 2003
    #5
  6. Dan Pop

    Dan Pop Guest

    In <> (Kevin Wan) writes:

    >Would anyone explain why there is a consistent large performance
    >degradation with the dumb copy?
    >
    >#define DUMBCOPY for (i = 0; i < 65536; ++i) destination[i] = source[i]
    >
    >#define SMARTCOPY memcpy(destination, source, 65536)


    There are many ways of accelerating memcpy on modern architectures,
    while your dumb loop can't be optimised much by a compiler (unless the
    compiler writer has put some effort into recognising it as a dumb
    memcpy and replacing it with a smart one).

    The first thing to do is to replace the byte-by-byte copying by a
    register-by-register copying. The next thing is to play tricks with the
    cache, so that the input data is already in the cache by the time you
    need it. There may be even more processor-specific tricks for
    accelerating the memory accesses when performed in a purely sequential
    way. The processor may even have a specialised instruction for performing
    this operation as fast as possible. To exploit all these things, memcpy
    is typically implemented in assembly.
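    The register-by-register idea can be sketched in portable C. The
    hypothetical copy_words below moves sizeof(long) bytes per iteration and
    finishes the tail byte by byte; real library versions layer alignment
    handling, cache tricks, and processor-specific instructions on top of
    this.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Sketch of word-at-a-time copying. The fixed-size memcpy calls
       compile down to single word loads/stores while staying
       alignment-safe. */
    static void copy_words(char *dst, const char *src, size_t n)
    {
        size_t i = 0;
        for (; i + sizeof(long) <= n; i += sizeof(long))
            memcpy(dst + i, src + i, sizeof(long)); /* one word per step */
        for (; i < n; ++i)                          /* leftover tail bytes */
            dst[i] = src[i];
    }

    int main(void)
    {
        char src[100], dst[100];
        for (int i = 0; i < 100; ++i)
            src[i] = (char)(i * 7);
        copy_words(dst, src, sizeof dst);
        assert(memcmp(dst, src, sizeof dst) == 0);
        return 0;
    }
    ```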

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
     
    Dan Pop, Jul 28, 2003
    #6
  7. Randy Howard

    Randy Howard Guest

    In article <bg3nee$2id$>, says...
    > There are many ways of accelerating memcpy on modern architectures, while
    > your dumb loop can't be much optimised by a compiler (unless the compiler
    > writer has spent some effort into recognising it as a dumb memcpy and
    > replaced it by a smart memcpy).


    I had a situation not too long ago where memcmp() on MSVC 6.0 was a
    huge performance bottleneck. gcc seems to be smart enough to use
    block compare instructions for appropriate Intel CPUs, where the MS
    compiler was not. This isn't normally important, but this app did
    some very large block memcmp()'s, and it became a problem. Also,
    knowing that the size of the blocks being compared would always
    be evenly divisible by 4 allowed it to really be optimized on Intel.

    Rewriting an inline assembly version for WIN32 resulted in a 30%
    performance improvement. Linux/gcc code compiled from the same source
    (minus the memcmp() replacement) ran w/basically identical performance
    with no manual intervention.
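    Randy's divisible-by-4 trick can be illustrated with a hypothetical
    blocks_equal helper: when the size is known to be a multiple of 4, the
    comparison can proceed in 32-bit chunks. (A sketch of the idea, not his
    actual code; unlike memcmp it only reports equality, not ordering.)

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* Equality test over blocks whose size is a multiple of 4,
       comparing 32-bit chunks instead of single bytes. */
    static int blocks_equal(const void *a, const void *b, size_t n)
    {
        const unsigned char *pa = a, *pb = b;
        for (size_t i = 0; i < n; i += 4) {
            uint32_t wa, wb;
            memcpy(&wa, pa + i, 4);  /* alignment-safe 32-bit loads */
            memcpy(&wb, pb + i, 4);
            if (wa != wb)
                return 0;
        }
        return 1;
    }

    int main(void)
    {
        char x[16] = "abcdefghijklmno";
        char y[16] = "abcdefghijklmno";
        assert(blocks_equal(x, y, 16));
        y[9] = 'Z';
        assert(!blocks_equal(x, y, 16));
        return 0;
    }
    ```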
     
    Randy Howard, Jul 29, 2003
    #7
  8. Randy Howard

    Randy Howard Guest

    In article <>,
    says...
    > In any case, the moral of the story is "Don't do it yourself if the
    > language gives you a way of doing it".


    Usually. Not always. But, there isn't much point in not following the
    above, unless you have performance measurements that show a particular
    library routine is a bottleneck. In those cases, worry about trying
    to find a better method.
     
    Randy Howard, Jul 29, 2003
    #8
