More optimisation

Discussion in 'C Programming' started by kid joe, Jun 16, 2009.

  1. kid joe

    kid joe Guest

    Hi all,

    I know this is fairly system specific but I was hoping some people here
    with a wide experience of C implementations on different platforms could
    pass on their knowledge.

    If I want to perform actions on several large arrays, what's the best way
    to do that? Let's say, for a silly example, I have arrays a, b, c, all of
    size n, and I want to multiply everything in a by 2, everything in b by 3
    and everything in c by 4.

    Method 1:
    for(long int i = 0; i < n; i++) {
        a[i] *= 2;
        b[i] *= 3;
        c[i] *= 4;
    }

    Method 2:
    for(long int i = 0; i < n; i++)
        a[i] *= 2;
    for(long int i = 0; i < n; i++)
        b[i] *= 3;
    for(long int i = 0; i < n; i++)
        c[i] *= 4;

    Now you could argue that Method 1 should be faster because there are fewer
    branching instructions, and branching is expensive. But then you could
    argue that Method 2 should be better because there's better data locality,
    so better use of cache.

    I know people will say to profile it on my system! :) But I'd be interested
    in people's thoughts on this on a variety of different hardware.

    Cheers,
    Joe


     
    kid joe, Jun 16, 2009
    #1

  2. kid joe

    user923005 Guest

    On Jun 16, 1:58 pm, kid joe <> wrote:
    > Hi all,
    >
    > I know this is fairly system specific but I was hoping some people here
    > with a wide experience of C implementations on different platforms could
    > pass on their knowledge.
    >
    > If I want to perform actions on several large arrays, what's the best way
    > to do that? Let's say, for a silly example, I have arrays a, b, c, all of
    > size n, and I want to multiply everything in a by 2, everything in b by 3
    > and everything in c by 4.
    >
    > Method 1:
    > for(long int i = 0; i < n; i++) {
    >   a[i] *= 2;
    >   b[i] *= 3;
    >   c[i] *= 4;
    > }
    >
    > Method 2:
    > for(long int i = 0; i < n; i++)
    >   a[i] *= 2;
    > for(long int i = 0; i < n; i++)
    >   b[i] *= 3;
    > for(long int i = 0; i < n; i++)
    >   c[i] *= 4;
    >
    > Now you could argue that Method 1 should be faster because there are fewer
    > branching instructions, and branching is expensive. But then you could
    > argue that Method 2 should be better because there's better data locality,
    > so better use of cache.
    >
    > I know people will say to profile it on my system! :) But I'd be interested
    > in people's thoughts on this on a variety of different hardware.


    My opinion is that these optimizations are very, very silly.

    You should not perform these optimizations unless:
    1. The routine does not meet performance specifications
    2. The profile reveals that initialization of arrays is the bottleneck

    Array initialization is linear for a 1-D array, and so it is probably
    the least of your worries.

    It's like buying Pirelli P3 tires for an old Volkswagen beetle.
    It's costly, it's silly, and it won't make the beetle go any faster.

    P.S.
    You are also right about the profile. I guess that different machines
    and different compilers will give different answers. And if you
    change the data types you may get different answers yet again.

    P.P.S.
    The first rule of optimization is "DON'T DO IT."
    The second rule of optimization (for experts only) is "Don't do it
    *yet*."

    These are serious rules. A practice of premature optimization because
    you think something might go faster due to tweaky little hacks is a
    very, very, very, very bad coding practice. Not only will it make the
    code more expensive, it is also going to be a complete waste of time
    in almost all cases.

    If you want to make code go faster, the very last place you should
    ever imagine going is hopping on board the good ship "tweaky hacks."
    Instead, improve the algorithm. Chances are very good you can find an
    algorithm in the literature that will improve the code. If you cannot
    find one, consider thinking about how to improve the algorithm
    yourself.

    I guess you will ignore this advice, since it did not stick last
    time. That's too bad.
     
    user923005, Jun 16, 2009
    #2

  3. In article <h1911r$vsv$>, kid joe <> wrote:
    >Hi all,
    >
    >I know this is fairly system specific but I was hoping some people here
    >with a wide experience of C implementations on different platforms could
    >pass on their knowledge.
    >
    >If I want to perform actions on several large arrays, what's the best way
    >to do that? Let's say, for a silly example, I have arrays a, b, c, all of
    >size n, and I want to multiply everything in a by 2, everything in b by 3
    >and everything in c by 4.
    >
    >Method 1:
    >for(long int i = 0; i < n; i++) {


    Of course you know, all you will get in the way of responses is people
    pointing out that the above is a syntax error in C (C89, of course,
    since that's all there is).

    Or you may, if you are lucky, get responses telling you, in patronizing
    detail, how you need to flag the above as "C99" (or gcc, or whatever) only.
     
    Kenny McCormack, Jun 16, 2009
    #3
  4. kid joe

    Lie Ryan Guest

    The fastest would be to use SSE, I think...
     
    Lie Ryan, Jun 16, 2009
    #4
  5. kid joe

    Tim Prince Guest

    kid joe wrote:

    >
    > I know this is fairly system specific but I was hoping some people here
    > with a wide experience of C implementations on different platforms could
    > pass on their knowledge.
    >
    > If I want to perform actions on several large arrays, what's the best way
    > to do that? Let's say, for a silly example, I have arrays a, b, c, all of
    > size n, and I want to multiply everything in a by 2, everything in b by 3
    > and everything in c by 4.
    >
    > Method 1:
    > for(long int i = 0; i < n; i++) {
    >     a[i] *= 2;
    >     b[i] *= 3;
    >     c[i] *= 4;
    > }
    >
    > Method 2:
    > for(long int i = 0; i < n; i++)
    >     a[i] *= 2;
    > for(long int i = 0; i < n; i++)
    >     b[i] *= 3;
    > for(long int i = 0; i < n; i++)
    >     c[i] *= 4;
    >
    > Now you could argue that Method 1 should be faster because there are fewer
    > branching instructions, and branching is expensive. But then you could
    > argue that Method 2 should be better because there's better data locality,
    > so better use of cache.

    For several years, all desktop CPUs (and laptops of similar families)
    have supported at least 4 write combine buffers per logical or physical
    processor, thus efficiently supporting 4 or more parallel paths to
    memory. There are some still on the market which don't efficiently
    support vectorized store without 16-byte alignment, so then you might
    have an argument for splitting it up.
    Without restrict or other indications to the compiler that the arrays
    don't alias, you will kill performance with multiple arrays per loop.

    Certain compilers, with appropriate options and syntax, will choose
    these optimizations for you.
    The loop length threshold where OpenMP parallelism pays off will be
    lower with more work in the loop, if you remove the other possible
    obstacles to optimization.
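    To make the restrict point concrete: a minimal sketch (hypothetical helper,
    not from the thread; requires C99) where the compiler is told the three
    arrays do not alias, so it may vectorize even with all three in one loop:

    ```c
    #include <stddef.h>
    #include <stdio.h>

    /* restrict promises the compiler that a, b and c never overlap,
       removing the aliasing obstacle Tim mentions. */
    void scale_all(size_t n, double *restrict a, double *restrict b,
                   double *restrict c)
    {
        for (size_t i = 0; i < n; i++) {
            a[i] *= 2;
            b[i] *= 3;
            c[i] *= 4;
        }
    }

    int main(void)
    {
        double a[4] = {1, 1, 1, 1}, b[4] = {1, 1, 1, 1}, c[4] = {1, 1, 1, 1};
        scale_all(4, a, b, c);
        printf("%g %g %g\n", a[0], b[0], c[0]);  /* 2 3 4 */
        return 0;
    }
    ```

    Violating the promise (passing overlapping pointers) is undefined
    behaviour, so restrict shifts the burden of proof to the caller.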
     
    Tim Prince, Jun 17, 2009
    #5
  6. kid joe wrote:
    > Method 1:
    > for(long int i = 0; i < n; i++) {
    >     a[i] *= 2;
    >     b[i] *= 3;
    >     c[i] *= 4;
    > }
    >
    > Method 2:
    > for(long int i = 0; i < n; i++)
    >     a[i] *= 2;
    > for(long int i = 0; i < n; i++)
    >     b[i] *= 3;
    > for(long int i = 0; i < n; i++)
    >     c[i] *= 4;
    >
    > Now you could argue that Method 1 should be faster because there are fewer
    > branching instructions, and branching is expensive. But then you could
    > argue that Method 2 should be better because there's better data locality,
    > so better use of cache.


    I expect the better data locality in method 2 would give better
    performance unless n is very small; your loops aren't complex enough to
    keep a modern desktop/server CPU busy in between waiting for cache line
    fills.

    Branching is _not_ expensive on such CPUs (though it may be on old or
    embedded CPUs); what is really expensive is a _mispredicted_ branch, and
    in simple for loops like the above, the branch should be correctly
    predicted for every iteration except the last (and perhaps the first).
    The rest of the branches are effectively free. The increments and
    comparisons before the branches will just fill in pipeline bubbles and
    are also effectively free. However, those costs may reassert themselves
    if your actual loops are more complicated than shown above and there
    aren't any bubbles to fill.

    S

    --
    Stephen Sprunk "Stupid people surround themselves with smart
    CCIE #3723 people. Smart people surround themselves with
    K5SSS smart people who disagree with them." --Isaac Jaffe
     
    Stephen Sprunk, Jun 17, 2009
    #6
  7. kid joe

    Kaz Kylheku Guest

    On 2009-06-16, kid joe <> wrote:
    > Hi all,
    >
    > I know this is fairly system specific but I was hoping some people here
    > with a wide experience of C implementations on different platforms could
    > pass on their knowledge.
    >
    > If I want to perform actions on several large arrays, whats the best way
    > to do that?


    The best way is to look at the memory hierarchy of the processor you are using.
    Try to avoid needlessly re-loading the same parts of an array from one level of
    caching to another; e.g. from main memory to the L2 cache, or from L2 to L1.

    > Let's say, for a silly example, I have arrays a, b, c, all of size n,
    > and I want to multiply everything in a by 2, everything in b by 3 and
    > everything in c by 4.
    >
    > Method 1:
    > for(long int i = 0; i < n; i++) {
    >     a[i] *= 2;
    >     b[i] *= 3;
    >     c[i] *= 4;
    > }
    >
    > Method 2:
    > for(long int i = 0; i < n; i++)
    >     a[i] *= 2;
    > for(long int i = 0; i < n; i++)
    >     b[i] *= 3;
    > for(long int i = 0; i < n; i++)
    >     c[i] *= 4;
    >
    > Now you could argue that Method 1 should be faster because there are fewer
    > branching instructions, and branching is expensive. But then you could
    > argue that Method 2 should be better because there's better data locality,
    > so better use of cache.


    This caching analysis is false. Neither method uses caching particularly well.
    This program simply doesn't benefit from caching very much, because it
    processes large (or presumably large) arrays sequentially. From a caching point
    of view, it makes little difference whether we process each of three large
    arrays separately, or march through them together.

    The problem is that our program never looks at a value once it has
    updated it. Caching provides the most benefit when you access the same thing
    you have recently accessed. If you keep accessing new material, caching
    provides less of a benefit. The organization of the cache system can still
    speed up the accesses when the pattern is linear, because memory access is
    parallelized: entire cache lines are being sucked into the cache in a burst
    mode memory access, and in parallel through a wide bus, rather than individual
    words. So the fact that you are processing words in order is a kind of locality
    which helps to take advantage of the parallelism inherent in the organization
    of the cache in the lines.

    The one caching consideration may be that the arrays are subject to worst-case
    cache aliasing. Suppose that a[], b[] and c[] are situated at such addresses
    that their corresponding parts hash to the same cache line, and suppose that
    your processor's cache is of the simplest possible type: direct mapped. So
    a[0], b[0] and c[0] land on the same cache line, and so on. This means that the
    first example will thrash horribly. The very first access a[0] *= 2 will load an
    entire cache line. So you have a[1], a[2] through, say, a[7] ready for fast
    access. But you don't take advantage of it: the next access after a[0] is not
    a[1] but b[0]. And in our worst-case situation, this maps to the same cache
    line as a[0], with a direct mapping. Oops! The cache line is dirty since we
    just stored a[0], so the whole line has to be written out first. After being
    written out, the cache line is loaded with b[0] through b[7]. The b[0] is
    updated in the cache line, and now c[0] is accessed. Oops! The entire cache
    line is flushed and re-loaded again with c[0] through c[7]. Then the thrashing
    repeats with a[1], over the same three cache lines, and so on.

    This potential problem does not threaten the second example, since you are
    marching through large areas of memory in order; you are not accessing three
    different large areas of memory at some fixed displacement from each other.

    The first example will run about as fast as the second if the cache lines do
    not conflict. That is to say, if the parallel elements of a[], b[] and c[] map
    to three separate cache lines, you are okay. It doesn't matter whether your
    cache has 32 lines in total or 256. You just need three unique ones at any time
    so that the loop can suck data without interference.

    Aliasing problems can occur at multiple levels in the memory hierarchy,
    and also in other kinds of caches, like TLBs for virtual address translation.

    Suppose you have a direct mapped TLB, and the code alternates its accesses
    between two virtual pages that clash to the same TLB entry. Poof; you are
    playing a game of TLB miss-and-reload ping pong.
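    One hedged way to dodge the worst case Kaz describes is to pad each
    allocation by one cache line, so parallel elements no longer map to the
    same set. A minimal sketch (not from the thread; the 64-byte line size is
    an assumption you would adjust per CPU):

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    #define LINE 64  /* assumed cache-line size */

    int main(void)
    {
        size_t n = 1 << 20;
        /* Pad each array's slot by one extra line: with a power-of-two array
           size, a stride of n*8 + LINE puts a[i], b[i] and c[i] in different
           sets of a direct-mapped cache instead of colliding. */
        size_t stride = n * sizeof(double) + LINE;
        char *raw = malloc(3 * stride);
        if (!raw)
            return 1;
        double *a = (double *)raw;
        double *b = (double *)(raw + stride);
        double *c = (double *)(raw + 2 * stride);

        for (size_t i = 0; i < n; i++)
            a[i] = b[i] = c[i] = 1.0;
        for (size_t i = 0; i < n; i++) {  /* the "Method 1" loop */
            a[i] *= 2;
            b[i] *= 3;
            c[i] *= 4;
        }
        printf("%g %g %g\n", a[0], b[0], c[0]);  /* 2 3 4 */
        free(raw);
        return 0;
    }
    ```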
     
    Kaz Kylheku, Jun 17, 2009
    #7
  8. On 16 Jun 2009 at 21:33, user923005 wrote:
    > These are serious rules. A practice of premature optimization because
    > you think something might go faster due to tweaky little hacks is a
    > very, very, very, very bad coding practice.


    This is true up to a point.

    However, *all other things being equal* it's obviously better to choose
    code that's likely to run faster on most platforms (as opposed to the
    Heathfield method of choosing code likely to run slower on most
    platforms).

    I think the OP was saying: "OK, I've got this situation where I could do
    this in two different ways. Both are equally clear and readable, both
    are correct and maintainable, so why not let probable speed on probable
    architectures be the determining factor as to which one I pick?"

    And why not indeed? Even if you see efficiency (or in Heathfield's case,
    inefficiency) as the very last entry in your lexicographic order of the
    desirable properties of code, it still beats tossing a coin.
     
    Antoninus Twink, Jun 17, 2009
    #8
  9. kid joe

    Hamiral Guest

    kid joe wrote:
    > Method 2:
    > for(long int i = 0; i < n; i++)
    >     a[i] *= 2;
    > for(long int i = 0; i < n; i++)
    >     b[i] *= 3;
    > for(long int i = 0; i < n; i++)
    >     c[i] *= 4;


    I'd say method 2 is faster, but if you like micro-optimization, I'd
    do it like this:

    long int i;
    int *tmp_ptr; /* assuming a, b and c are int arrays */

    tmp_ptr = a;
    for(i=0; i<n; i++) {
        *tmp_ptr <<= 1; // equivalent to *= 2
        tmp_ptr++;
    }
    tmp_ptr = b;
    for(i=0; i<n; i++) {
        *tmp_ptr *= 3;
        tmp_ptr++;
    }
    tmp_ptr = c;
    for(i=0; i<n; i++) {
        *tmp_ptr <<= 2; // equivalent to *= 4
        tmp_ptr++;
    }

    But maybe assigning tmp_ptr each time makes you lose the speed you gained
    by using bit shifts and pointer incrementation...

    As a final note, I'd say you should rely on your compiler. Unless you're
    on very low-end hardware, I really doubt you can optimize better than
    your compiler does...

    Ham
     
    Hamiral, Jun 17, 2009
    #9
  10. kid joe wrote:
    > Hi all,
    >
    > I know this is fairly system specific but I was hoping some people here
    > with a wide experience of C implementations on different platforms could
    > pass on their knowledge.
    >
    > If I want to perform actions on several large arrays, what's the best way
    > to do that? Let's say, for a silly example, I have arrays a, b, c, all of
    > size n, and I want to multiply everything in a by 2, everything in b by 3
    > and everything in c by 4.
    >
    > Method 1:
    > for(long int i = 0; i < n; i++) {
    >     a[i] *= 2;
    >     b[i] *= 3;
    >     c[i] *= 4;
    > }
    >
    > Method 2:
    > for(long int i = 0; i < n; i++)
    >     a[i] *= 2;
    > for(long int i = 0; i < n; i++)
    >     b[i] *= 3;
    > for(long int i = 0; i < n; i++)
    >     c[i] *= 4;



    If your calculations are more expensive than *2 *3 *4, it could be worth
    running multiple threads if you have multiple cores/CPUs.
    Method 2 can be run in parallel, method 1 not.
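    Edwin's one-thread-per-array idea can be sketched with pthreads (a minimal
    illustration, not from the thread; struct and function names are invented):

    ```c
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000

    struct job { double *p; double factor; long n; };

    /* Each thread runs one of Method 2's loops over its own array. */
    static void *scale_thread(void *arg)
    {
        struct job *j = arg;
        for (long i = 0; i < j->n; i++)
            j->p[i] *= j->factor;
        return NULL;
    }

    int main(void)
    {
        static double a[N], b[N], c[N];
        for (long i = 0; i < N; i++)
            a[i] = b[i] = c[i] = 1.0;

        /* The three loops of Method 2 run in parallel, one per array;
           no locking is needed because the arrays are disjoint. */
        struct job jobs[3] = { { a, 2, N }, { b, 3, N }, { c, 4, N } };
        pthread_t tid[3];
        for (int t = 0; t < 3; t++)
            pthread_create(&tid[t], NULL, scale_thread, &jobs[t]);
        for (int t = 0; t < 3; t++)
            pthread_join(tid[t], NULL);

        printf("%g %g %g\n", a[0], b[0], c[0]);  /* 2 3 4 */
        return 0;
    }
    ```

    Compile with -lpthread. Whether this wins depends on n and on memory
    bandwidth: three threads scanning three large arrays can saturate the bus.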

    Just my $0.02, benchmark if in doubt.

    Gr,
    Edwin
     
    Edwin van den Oetelaar, Jun 17, 2009
    #10
  11. Hamiral wrote:
    > kid joe wrote:
    >> Method 2:
    >> for(long int i = 0; i < n; i++)
    >>     a[i] *= 2;
    >> for(long int i = 0; i < n; i++)
    >>     b[i] *= 3;
    >> for(long int i = 0; i < n; i++)
    >>     c[i] *= 4;

    >
    > I'd say method 2 is faster, but if you like micro-optimization, I'd
    > do it like this:
    >
    > long int i;
    > int *tmp_ptr; /* assuming a, b and c are int arrays */
    >
    > tmp_ptr = a;
    > for(i=0; i<n; i++) {
    >     *tmp_ptr <<= 1; // equivalent to *= 2
    >     tmp_ptr++;
    > }


    A left-shift is _not_ always the same thing as multiplying by two,
    particularly when it comes to signed values. Besides, if that
    transformation were safe and indeed faster, the compiler would already
    be doing it; they're very good at micro-optimizations like that, so the
    only real wins come from algorithmic changes.

    Also, if you're going to switch from an array-index loop to a pointer
    loop, you might as well use the pointer as the loop variable/condition
    and get rid of the index entirely.

    S

    --
    Stephen Sprunk "Stupid people surround themselves with smart
    CCIE #3723 people. Smart people surround themselves with
    K5SSS smart people who disagree with them." --Isaac Jaffe
     
    Stephen Sprunk, Jun 17, 2009
    #11
  12. kid joe

    Hamiral Guest

    Stephen Sprunk wrote:
    > A left-shift is _not_ always the same thing as multiplying by two,
    > particularly when it comes to signed values.


    Oh yes, forgot that. I'm so used to dealing with unsigned values...

    > Besides, if that
    > transformation were safe and indeed faster, the compiler would already
    > be doing it; they're very good at micro-optimizations like that, so the
    > only real wins come from algorithmic changes.


    Yes, but you say that just as if you didn't read the last paragraph of
    my post...

    > Also, if you're going to switch from an array-index loop to a pointer
    > loop, you might as well use the pointer as the loop variable/condition
    > and get rid of the index entirely.


    Yes, you could do it like this then :

    int *tmp_ptr, *end_ptr;
    tmp_ptr = a;
    end_ptr = tmp_ptr + n*sizeof(*tmp_ptr);
    for(; tmp_ptr <= end_ptr; tmp_ptr++) {
        *tmp_ptr = whatever;
    }

    I'm not sure about the end condition though... <= or < ?

    Anyway, as I said earlier, the compiler most probably does that kind of
    optimization much better than any human being...
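    For the record, one plausible corrected form of that pointer loop (an
    editorial sketch, not from the thread): pointer arithmetic already scales
    by the element size, so the sizeof factor over-shoots, and the end pointer
    is one past the last element, so the comparison must be strict:

    ```c
    #include <stdio.h>

    int main(void)
    {
        int a[8];
        long n = 8;

        int *tmp_ptr = a;
        int *end_ptr = a + n;      /* one past the end; no sizeof factor,
                                      since pointer arithmetic scales itself */
        for (; tmp_ptr < end_ptr; tmp_ptr++)  /* strictly <, never <= */
            *tmp_ptr = 5;

        printf("%d %d\n", a[0], a[7]);  /* 5 5 */
        return 0;
    }
    ```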

    Ham
     
    Hamiral, Jun 17, 2009
    #12
  13. kid joe

    Ben Pfaff Guest

    Eric Sosman <> writes:

    > Kelsey Bjarnason wrote:
    >> [...]
    >>
    >> Frankly, I'd be surprised if there were any discernible difference
    >> other than possibly on an implementation designed to be
    >> pathological.

    >
    > In my one-off test, the second method ran 2.5 times faster than
    > the first. I guess that makes gcc and Opteron "pathological."


    I'd be surprised if the performance differences didn't depend
    significantly on the value of n. What n did you use?
    --
    "The fact that there is a holy war doesn't mean that one of the sides
    doesn't suck - usually both do..."
    --Alexander Viro
     
    Ben Pfaff, Jun 17, 2009
    #13
  14. Lie Ryan wrote:

    > The fastest would be to use SSE, I think...


    Not if it's used naively. Mixing SSE assembler instructions into C code
    makes it impossible for the optimizer to rearrange or substitute those
    instructions with code better suited to the situation. Both the
    GNU and the Intel compiler collections provide proprietary extensions in the
    form of builtin functions (intrinsics) to access a CPU's vector unit (also
    architecture dependent), which, when used, allow the optimizer to create a
    code path that best satisfies its metric. Unless you are one of
    the top-notch experts on the given CPU, it will be hard to reach that level
    with hand-written assembler.
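    As a concrete taste of the intrinsics route Wolfgang describes, a minimal
    SSE2 sketch for the "multiply by 2" array (an illustration, not from the
    thread; assumes an x86 target with SSE2 and an even n, and uses unaligned
    loads so the pointer need not be 16-byte aligned):

    ```c
    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdio.h>

    /* Multiply n doubles by 2, two lanes at a time. The compiler still sees
       ordinary function calls, so it can schedule around them. */
    static void scale2_sse(double *a, long n)
    {
        const __m128d two = _mm_set1_pd(2.0);
        for (long i = 0; i < n; i += 2) {
            __m128d v = _mm_loadu_pd(&a[i]);
            _mm_storeu_pd(&a[i], _mm_mul_pd(v, two));
        }
    }

    int main(void)
    {
        double a[4] = { 1.0, 2.0, 3.0, 4.0 };
        scale2_sse(a, 4);
        printf("%g %g %g %g\n", a[0], a[1], a[2], a[3]);  /* 2 4 6 8 */
        return 0;
    }
    ```

    With -O2 and a plain loop, gcc may well emit equivalent code on its own,
    which is Wolfgang's point: the intrinsics only pay off when the compiler
    can't find the vectorization itself.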


    Wolfgang
     
    Wolfgang Draxinger, Jun 17, 2009
    #14
  15. kid joe wrote:

    > Now you could argue that Method 1 should be faster because there are fewer
    > branching instructions, and branching is expensive. But then you could
    > argue that Method 2 should be better because there's better data locality,
    > so better use of cache.


    And one might argue that the out-of-order execution dispatcher recognizes the 3
    consecutive loops and dispatches the code in three parallel pipelines, thus
    executing it in a fashion that resembles method 1.

    The only way to know for sure is: profile the code on as many architectures
    and compiler versions as you can get. Maybe even put #ifdef blocks around it,
    creating different code paths which are enabled at runtime depending on
    the actual architecture the code is executed on (link into
    separate .so/.dll files and load the best matching variant upon program
    start).

    In my 3D engine, about 2% of the core algorithms' code is conditionally
    selected that way, depending on whether it's executed on an Intel or AMD CPU;
    it would work with other architectures as well, but those are rather rare in
    this application field.


    Wolfgang
     
    Wolfgang Draxinger, Jun 17, 2009
    #15
  16. kid joe

    user923005 Guest

    On Jun 17, 1:54 am, Antoninus Twink <> wrote:
    > On 16 Jun 2009 at 21:33, user923005 wrote:
    >
    > > These are serious rules.  A practice of premature optimization because
    > > you think something might go faster due to tweaky little hacks is a
    > > very, very, very, very bad coding practice.

    >
    > This is true up to a point.
    >
    > However, *all other things being equal* it's obviously better to choose
    > code that's likely to run faster on most platforms (as opposed to the
    > Heathfield method of choosing code likely to run slower on most
    > platforms).
    >
    > I think the OP was saying: "OK, I've got this situation where I could do
    > this in two different ways. Both are equally clear and readable, both
    > are correct and maintainable, so why not let probable speed on probable
    > architectures be the determining factor as to which one I pick?"
    >
    > And why not indeed? Even if you see efficiency (or in Heathfield's case,
    > inefficiency) as the very last entry in your lexicographic order of the
    > desirable properties of code, it still beats tossing a coin.


    I agree with you that we should start with the code that runs the
    fastest.
    However, I think that the right way to choose code that runs faster is
    to choose a better algorithm.
    Tweaky hacks are a bad idea.
     
    user923005, Jun 17, 2009
    #16
  17. kid joe

    Hamiral Guest

    Eric Sosman wrote:
    > incorrect program is no gain. (Can you spot *both* errors?)


    No, I can't ;)
    But I usually don't micro-optimize. And my code always has basic bugs
    before I compile, test and correct it. So maybe I shouldn't post at all?
    I just wanted to show some ways of optimizing, and I never intended
    to provide code that he could just paste into his own program without
    understanding or testing it.

    > Usually, yes. But the O.P.'s question was about two different
    > ways to organize the same calculation, and my exhaustive survey
    > (sample size N=1) showed that one of them ran 2.5 times faster than
    > the other *and* that none of the compilers tested (sample size N=1)
    > managed to optimize the slower version to match the faster. We
    > humans still have a wee bit of time left before the machines triumph
    > altogether.


    I never pretended my proposition was better than yours... Just wanted to
    follow-up on the "optimization topic"...

    hhhhhermm, seems to me I should think a bit more before posting here,
    some people are taking it too seriously...

    Ham
     
    Hamiral, Jun 17, 2009
    #17
  18. kid joe

    Kaz Kylheku Guest

    On 2009-06-17, Eric Sosman <> wrote:
    > Ben Pfaff wrote:
    >> Eric Sosman <> writes:
    >>
    >>> Kelsey Bjarnason wrote:
    >>>> [...]
    >>>>
    >>>> Frankly, I'd be surprised if there were any discernible difference
    >>>> other than possibly on an implementation designed to be
    >>>> pathological.
    >>> In my one-off test, the second method ran 2.5 times faster than
    >>> the first. I guess that makes gcc and Opteron "pathological."


    Is this on Linux with glibc?

    >> I'd be surprised if the performance differences didn't depend
    >> significantly on the value of n. What n did you use?

    >
    > Three mallocated arrays of 1024*1024 doubles apiece (I
    > reported on this yesterday). There's no point in optimizing
    > when n is small; speeding up a 10ns operation by a factor of
    > 2.5 (huzzah!) saves a whopping 6ns ...


    If this is glibc, then the malloc will use mmap, since these allocations
    are larger than the mmap threshold. Your requested size is a multiple of
    the page size, so malloc will add an extra page to the request.

    So even if the operating system's memory mapper places these maps adjacently,
    there won't be an 8 megabyte step size between them, but rather an 8 megabyte
    plus 1 page step size. That, and given that the Opteron's Level 2 TLB is
    four-way set associative, as far as I can google up, the performance hit is
    probably not attributable to TLB cache aliasing.

    Furthermore there are two L1 TLBs integrated with the L1 cache. These
    hold 32 entries (for regular 4K pages) and are /fully/ associative, whereas our
    program never works with more than 3 data pages at a time.

    The L2 is 16-way associative, so you can work with 16 different pieces of
    memory that clash to the same set.

    There shouldn't be L1 cache line aliasing either, because the virtual addresses
    of the parallel elements in the two arrays (e.g. a[0] and b[0]) should differ
    in bits 12, 13 and 14, bits which are part of the cache line index.

    What could be affecting performance is bank aliasing. The Opteron's L1 cache is
    divided into 8 banks, which split each 64-byte cache line 8 ways into 8 units
    of 8 bytes. The least significant 3 bits of the address select a byte. The
    next bits after that select the bank.

    Thus for a given i, a[i], b[i] and c[i] are almost certainly colliding into the
    same L1 cache bank, whereas a[i], a[i+1], a[i+2] map to different banks. This
    collision reduces opportunities for parallelism. The Opteron has two ports into
    the cache, and simultaneous accesses are possible through these ports---if the
    accesses go to different banks.

    Could that make the entire 2.5 factor difference? Probably not, but it
    probably makes a contribution.
     
    Kaz Kylheku, Jun 17, 2009
    #18
  19. kid joe

    BartC Guest

    "Eric Sosman" <> wrote in message
    news:1245256387.661108@news1nwk...
    > Kelsey Bjarnason wrote:
    >> [...]
    >>
    >> Frankly, I'd be surprised if there were any discernible difference other
    >> than possibly on an implementation designed to be pathological.

    >
    > In my one-off test, the second method ran 2.5 times faster than
    > the first. I guess that makes gcc and Opteron "pathological."


    In my tests method 2 ran 0 to 100% slower than method 1, in accordance with
    /my/ theory that method 2 has 3 times as much loop control as method 1.

    --
    Bart
     
    BartC, Jun 17, 2009
    #19
  20. kid joe

    Hamiral Guest

    Eric Sosman wrote:
    > Kaz Kylheku wrote:
    >> On 2009-06-17, Eric Sosman <> wrote:
    >>> Ben Pfaff wrote:
    >>>> Eric Sosman <> writes:
    >>>>
    >>>>> Kelsey Bjarnason wrote:
    >>>>>> [...]
    >>>>>>
    >>>>>> Frankly, I'd be surprised if there were any discernible difference
    >>>>>> other than possibly on an implementation designed to be
    >>>>>> pathological.
    >>>>> In my one-off test, the second method ran 2.5 times faster than
    >>>>> the first. I guess that makes gcc and Opteron "pathological."

    >>
    >> Is this on Linux with glibc?

    >
    > Alas, no: Windows Vaster. (My excuse: I'm working with another
    > company's product, one that requires a Windows-based management
    > station even when running on a different platform entirely. So a
    > Windows desktop is pretty much forced upon me.) The C implementation
    > is the gcc-based DJGPP environment, originally developed for MS-DOS.


    I just compiled and profiled it as well, with gcc 4.3.3 on GNU/Linux
    Ubuntu, on a regular Intel machine.
    Method 2 won:

      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
    62.50      0.05      0.05        1    50.00    50.00  method1
    37.50      0.08      0.03        1    30.00    30.00  method2

    BUT

    This first benchmark was a bit simplistic.

    So I made a real one, where method1 and method2 are successively called
    50 times, and then their calls are interlaced 50 times.

    Here is the gprof result (initarrays is just what its name says):

      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
    51.46      2.99      2.99      100    29.90    29.90  method2
    47.50      5.75      2.76      100    27.60    27.60  method1
     1.03      5.81      0.06        1    60.00    60.00  initarrays

    So we can say that the difference becomes negligible in the real world.

    If you're interested, here is my benchmark, full of bugs and
    inaccuracies, of course, and no doubt someone will point out that
    this is *not* real world... But was the original problem real world? ;)

    ==========================
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1024*1024)

    void initarrays(double *a, double *b, double *c)
    {
        long int i;
        for(i=0; i<N; i++) {
            a[i] = 2;
            b[i] = 2;
            c[i] = 2;
        }
    }

    void method1(double *a, double *b, double *c)
    {
        long int i;

        for(i=0; i<N; i++) {
            a[i] *= 1;
            b[i] *= 2;
            c[i] *= 3;
        }
    }

    void method2(double *a, double *b, double *c)
    {
        long int i;

        for(i=0; i<N; i++) {
            a[i] *= 1;
        }
        for(i=0; i<N; i++) {
            b[i] *= 2;
        }
        for(i=0; i<N; i++) {
            c[i] *= 3;
        }
    }

    int main(void)
    {
        time_t start, end;
        int i;
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);

        initarrays(a, b, c);

        printf("Running method 1 50 times...\n");
        start = time(NULL);
        for(i=0; i<50; i++) {
            method1(a, b, c);
        }
        end = time(NULL);
        printf("%lu seconds\n", (unsigned long)(end - start));
        printf("\n");

        printf("Running method 2 50 times...\n");
        start = time(NULL);
        for(i=0; i<50; i++) {
            method2(a, b, c);
        }
        end = time(NULL);
        printf("%lu seconds\n", (unsigned long)(end - start));

        printf("\n");

        printf("Interlacing method 1 & 2 50 times...\n");
        start = time(NULL);
        for(i=0; i<50; i++) {
            method1(a, b, c);
            method2(a, b, c);
        }
        end = time(NULL);
        printf("%lu seconds\n", (unsigned long)(end - start));

        free(a);
        free(b);
        free(c);

        return 0;
    }
    ==========================


    Ham
     
    Hamiral, Jun 18, 2009
    #20
