Performance Question ...

Discussion in 'C++' started by Konrad Mühler, Jun 10, 2008.

  1. Hi,

    I have a list of objects. I iterate over the list and read the values of
    each object in many operations, like:

    x = myList[i].value1 + myList[i].value2
    etc.

    My question:
    Is it efficient to always use myList[i], or should I get a pointer to
    the current element and use that instead, like

    currentElement->value1 ...

    Is it bad for performance to always let the list return the i-th element?

    Thank you
    Konrad
     
    Konrad Mühler, Jun 10, 2008
    #1

  2. Konrad Mühler

    dizzy Guest

    Konrad Mühler wrote:

    > Hi,
    >
    > I have a list of objects. I iterate over the list and read the values of
    > each object in many operations, like:
    >
    > x = myList[i].value1 + myList[i].value2
    > etc.


    So you don't have a std::list of objects, as std::list has no op[] (BECAUSE
    such an op[] would be too expensive). What does your list::op[] do? What is
    its algorithmic complexity?

    > My question:
    > Is it efficient to always use myList[i], or should I get a pointer to
    > the current element and use that instead, like


    Depends on your implementation of the list. If your list is actually something
    like a std::vector, then an op[] is fast enough. If your data set is much
    larger than your data cache, then that will be a performance bottleneck no
    matter whether you use references/pointers/iterators to elements or a fast op[].
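
    For instance, if it is something like a std::vector, the two styles the OP
    asks about might look like the sketch below (my illustration, with an
    assumed element type; whether caching the element actually helps is exactly
    what profiling should decide):

    #include <vector>
    #include <cstddef>

    struct Item { int value1, value2; };  // assumed element type

    int sum_indexed(const std::vector<Item>& v)
    {
        int x = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            x += v[i].value1 + v[i].value2;  // operator[] evaluated twice per element
        return x;
    }

    int sum_cached(const std::vector<Item>& v)
    {
        int x = 0;
        for (std::size_t i = 0; i < v.size(); ++i) {
            const Item& e = v[i];            // evaluate operator[] once, reuse the reference
            x += e.value1 + e.value2;
        }
        return x;
    }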

    > currentElement->value1 ...
    >
    > Is it bad for performance to always let the list return the i-th element?


    Have you benchmarked/profiled it? What results does it have on your
    particular combination of platform, implementation, compiler (version),
    optimization flags?

    Make your decision according to that.
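
    As a starting point, a minimal clock()-based harness for that could look
    like the sketch below (mine, not a prescription; work() is a hypothetical
    stand-in for the loop under test, repeated to outlast the coarse clock()
    granularity):

    #include <ctime>
    #include <iostream>

    static int work()
    {
        int x = 0;
        for (int i = 0; i < 100000; ++i)
            x += i % 7;                    // stand-in for the code under test
        return x;
    }

    int main()
    {
        const int reps = 100;              // repeat so the total time is measurable
        int sink = 0;                      // consume results so the loop isn't removed
        std::clock_t t0 = std::clock();
        for (int r = 0; r < reps; ++r)
            sink += work();
        double sec = double(std::clock() - t0) / CLOCKS_PER_SEC;
        std::cout << "total: " << sec << " s (sink: " << sink << ")\n";
        return 0;
    }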

    --
    Dizzy
     
    dizzy, Jun 10, 2008
    #2

  3. Konrad Mühler

    Mirco Wahab Guest

    Konrad Mühler wrote:
    > I have a list of objects. I iterate over the list and read the values of
    > each object in many operations, like:
    > x = myList[i].value1 + myList[i].value2
    > etc.
    > My question:
    > Is it efficient to always use myList[i], or should I get a pointer to
    > the current element and use that instead, like
    > currentElement->value1 ...
    > Is it bad for performance to always let the list return the i-th element?


    That depends on the compiler. Newer gcc versions will
    optimize the differences away, whereas older ones
    (and maybe Visual Studio) will not.

    try it:

    ==>
    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <iostream>

    struct listitem { int value1, value2, value3; };

    int test_array(listitem *list, int nitems)
    {
        int x = 0;
        for (int j = 0; j < nitems; ++j)
            for (int i = 0; i < nitems; ++i)
                x += (list[i].value1 - list[j].value1)
                   + (list[i].value2 - list[j].value2);
        return x;
    }

    int test_pointer(listitem *list, int nitems)
    {
        int x = 0;
        for (listitem *pj = list; pj < list + nitems; ++pj)
            for (listitem *pi = list; pi < list + nitems; ++pi)
                x += (pi->value1 - pj->value1) + (pi->value2 - pj->value2);
        return x;
    }

    double timethis(const char *s, int (*p)(listitem*, int), listitem *list, int nitems)
    {
        std::cout << s << std::endl;
        clock_t t = clock();
        int x = p(list, nitems);  // fool the optimizer! (x is always zero)
        return double(clock() - t + x) / CLOCKS_PER_SEC;
    }

    int main(int argc, char *argv[])
    {
        int nitems = argc > 1 ? atoi(argv[1]) : 20000;
        listitem *list = new listitem[nitems];
        memset(list, 0x1A, sizeof(listitem) * nitems);

        double ta = 0, tp = 0;
        tp += timethis("pointer", test_pointer, list, nitems);
        ta += timethis("array", test_array, list, nitems);
        tp += timethis("pointer", test_pointer, list, nitems);
        ta += timethis("array", test_array, list, nitems);

        std::cout << "length: " << nitems << std::endl
                  << "array: " << ta << " sec" << std::endl
                  << "pointer: " << tp << " sec" << std::endl;

        delete [] list;
        return 0;
    }
    <==

    Regards

    Mirco
     
    Mirco Wahab, Jun 10, 2008
    #3
  4. Hi,

    Konrad Mühler wrote:
    > I have a list of objects. I iterate over the list and read the values of
    > each object in many operations, like:
    >
    > x = myList[i].value1 + myList[i].value2
    > etc.
    >
    > My question:
    > Is it efficient to always use myList[i], or should I get a pointer to
    > the current element and use that instead, like
    >
    > currentElement->value1 ...
    >
    > Is it bad for performance to always let the list return the i-th element?


    What is the complexity of myList.operator[]?
    The complexity of your loop will be one order higher (unless you have
    something less efficient somewhere else in the loop).

    The STL containers usually do not provide operator[] unless it has low
    complexity, usually O(1). So if you do not have your own implementation
    of operator[], you will likely end up with O(n) for the whole loop, which
    is the minimum for a loop over the entire container content. In that case
    the performance impact is minimal.

    However, if your operator[] is O(log n) (like SGI's rope
    implementation) or O(n) like a linked list, you will get O(n*log n) or
    O(n²) respectively, which is bad.
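
    To illustrate the O(n²) case (my sketch, not Marcel's code): std::list has
    no operator[], but simulating indexed access with std::advance shows what
    an O(n) subscript does to the loop, while the plain iterator pass stays
    linear:

    #include <list>
    #include <iterator>
    #include <cstddef>

    int sum_indexed(const std::list<int>& l)   // O(n^2): restart from begin() every time
    {
        int x = 0;
        for (std::size_t i = 0; i < l.size(); ++i) {
            std::list<int>::const_iterator it = l.begin();
            std::advance(it, i);               // O(i) walk for every "index"
            x += *it;
        }
        return x;
    }

    int sum_iterated(const std::list<int>& l)  // O(n): a single pass with an iterator
    {
        int x = 0;
        for (std::list<int>::const_iterator it = l.begin(); it != l.end(); ++it)
            x += *it;
        return x;
    }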


    Marcel
     
    Marcel Müller, Jun 10, 2008
    #4
  5. Konrad Mühler

    Jerry Coffin Guest

    In article <g2lsvb$fv2$-freiberg.de>, -
    halle.de says...
    > Konrad Mühler wrote:
    > > I have a list of objects. I iterate over the list and read the values of


    [ ... ]

    > That depends on the compiler. Newer gcc versions will
    > optimize the differences away, whereas older ones
    > (and maybe Visual Studio) will not.


    This depends heavily upon interpretation. Some people only use "list" to
    mean "linked list". In that case, I find it difficult to believe that a
    compiler is capable of optimizing out the differences between subscript-
    style notation and an explicit iterator (a pointer, in this case).

    Other people use "list" to mean almost any sort of linear data
    structure. In this case, (e.g. using an array) optimizing away the
    difference between array-style and pointer-style access has been routine
    among nearly all compilers for years -- all versions of "Visual Studio"
    (and versions of Microsoft C++ and Microsoft C since long before they
    started using the "Visual" brand) have known how to do this kind of
    optimization.

    Likewise, while there may have been _some_ early version of gcc that
    couldn't do it, I doubt that version was ever in wide use -- I'm pretty
    sure by the time it was labeled "version 1.0", it could do this.

    > try it:


    [ code using dynamically allocated array elided ... ]

    Just keep in mind that this may not answer the question being asked
    by the OP -- it all depends on what he means by "list".

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Jun 10, 2008
    #5
  6. Konrad Mühler

    Mirco Wahab Guest

    Jerry Coffin wrote:
    > In article <g2lsvb$fv2$-freiberg.de>, -
    > halle.de says...
    >> That depends on the compiler. Newer gcc versions will
    >> optimize the differences away, whereas older ones
    >> (and maybe Visual Studio) will not.

    >
    > This depends heavily upon interpretation. Some people only use "list" to
    > mean "linked list". In that case, I find it difficult to believe that a
    > compiler is capable of optimizing out the differences between subscript-
    > style notation and an explicit iterator (a pointer, in this case).


    Right. And this was my interpretation of the OP's list.
    How does this:

    >>> x = myList[i].value1 + myList[i].value2

    look to you?

    > Other people use "list" to mean almost any sort of linear data
    > structure. In this case, (e.g. using an array) optimizing away the
    > difference between array-style and pointer-style access has been routine
    > among nearly all compilers for years -- all versions of "Visual Studio"
    > (and versions of Microsoft C++ and Microsoft C since long before they
    > started using the "Visual" brand) have known how to do this kind of
    > optimization.


    Not entirely true. The code posted by me *will* show significant
    differences for pointer vs. array formulation between gcc 3.x and
    4.x *and* between VC++6 and VC++9. I tested it ;-)

    > [ code using dynamically allocated array elided ... ]
    > Just keep in mind that this may not answer the question being asked
    > by the OP -- it all depends on what he means by "list".


    OK, I'm not absolutely sure (but neither are you) ;-)

    From interpreting the OP's code, this seems (at least
    to me) a valid interpretation.

    Regards & Thanks

    Mirco
     
    Mirco Wahab, Jun 10, 2008
    #6
  7. Konrad Mühler

    Jerry Coffin Guest

    In article <g2mknj$md3$-freiberg.de>, -
    halle.de says...

    [ ... ]

    > Right. And this was my interpretation of the OP's list.
    > How does this:
    >
    > >>> x = myList[i].value1 + myList[i].value2
    >
    > look to you?


    This looks like it requires operator[] to be valid for whatever sort of
    thing myList is -- but that doesn't tell us a thing. Overloading
    operator[] for a linked list is trivial.
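
    For example, such a trivial -- but linear-time -- operator[] on a
    hand-rolled singly linked list might look like this (my sketch, just to
    show that the notation says nothing about the cost):

    #include <cstddef>

    struct Node { int value1, value2; Node* next; };

    struct MyList {
        Node* head;
        Node& operator[](std::size_t i) const  // trivial to write, but O(i) per call
        {
            Node* p = head;
            while (i--)
                p = p->next;                   // walks i links on every subscript
            return *p;
        }
    };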

    > > Other people use "list" to mean almost any sort of linear data
    > > structure. In this case, (e.g. using an array) optimizing away the
    > > difference between array-style and pointer-style access has been routine
    > > among nearly all compilers for years -- all versions of "Visual Studio"
    > > (and versions of Microsoft C++ and Microsoft C since long before they
    > > started using the "Visual" brand) have known how to do this kind of
    > > optimization.

    >
    > Not entirely true. The code posted by me *will* show significant
    > differences for pointer vs. array formulation between gcc 3.x and
    > 4.x *and* between VC++6 and VC++9. I tested it ;-)


    I glanced through your code, and frankly I wasn't much impressed. First
    of all, what you timed was substantially different from what the OP
    asked about. Second, you called the functions through a pointer
    (repeatedly) and included the time to call the function via a pointer in
    the overall time for the function. Third, you included the first run
    through the data in the time for executing the code -- on the first run,
    the data will frequently be in main memory, but on the second and
    subsequent runs, it's likely to be in the cache, so the first run will
    often be substantially slower than the second and subsequent runs.

    Here's a bit of code that gets rid of at least those specific problems:

    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <iostream>
    #include <vector>
    #include <algorithm>

    struct listitem { int value1, value2; };

    listitem gen_rand() {
        listitem ret;
        ret.value1 = rand();
        ret.value2 = rand();
        return ret;
    }

    int test_array(listitem const *list, size_t nitems)
    {
        int x = 0;
        for (size_t i = 0; i < nitems; ++i)
            x += list[i].value1 + list[i].value2;
        return x;
    }

    int test_pointer(listitem const *list, size_t nitems)
    {
        int x = 0;
        for (listitem const *pi = list; pi < list + nitems; ++pi)
            x += pi->value1 + pi->value2;
        return x;
    }

    int main(int argc, char *argv[]) {
        const int rep_count = 20;
        const int size = 2000000;

        static listitem list[size];

        size_t i;

        for (i = 0; i < size; i++)
            list[i] = gen_rand();

        int val1 = test_array(list, size); // prime the cache.
        int val2 = test_pointer(list, size);

        clock_t time1 = 0;

        // do the real tests
        for (i = 0; i < rep_count; i++) {
            clock_t start1 = clock();
            val1 = test_array(list, size);
            clock_t stop1 = clock();
            time1 += stop1 - start1;
        }

        clock_t time2 = 0;

        for (i = 0; i < rep_count; i++) {
            clock_t start2 = clock();
            val2 = test_pointer(list, size);
            clock_t stop2 = clock();
            time2 += stop2 - start2;
        }

        std::cout << "Array value: " << val1;
        std::cout << "\nPointer value: " << val2;

        std::cout << "\nArray Time: " << time1;
        std::cout << "\nPointer Time : " << time2;

        return 0;
    }

    Compiled with VC++ 6.0, using the default "Release" settings for a
    console application, the results I get from a few runs of this are:

    C:\C\source\Release>test_ptr
    Array value: 1141859694
    Pointer value: 1141859694
    Array Time: 110
    Pointer Time : 93
    C:\C\source\Release>test_ptr
    Array value: 1141859694
    Pointer value: 1141859694
    Array Time: 94
    Pointer Time : 109
    C:\C\source\Release>test_ptr
    Array value: 1141859694
    Pointer value: 1141859694
    Array Time: 109
    Pointer Time : 94
    C:\C\source\Release>test_ptr
    Array value: 1141859694
    Pointer value: 1141859694
    Array Time: 94
    Pointer Time : 109

    The only differences we're seeing are of a single clock tick, and
    there's no apparent bias in either direction. For all practical
    purposes the two are identical in speed. Looking at the assembly
    language produced for each shows that it's not just for all practical
    purposes:

    test_array:
    ; _list$ = ecx
    ; _nitems$ = edx
    ; Line 18
    push esi
    ; Line 19
    xor eax, eax
    ; Line 21
    xor esi, esi
    test edx, edx
    jbe SHORT $L10248
    push edi
    npad 6
    $L10246:
    ; Line 22
    mov edi, DWORD PTR [ecx+esi*8+4]
    add edi, DWORD PTR [ecx+esi*8]
    add esi, 1
    add eax, edi
    cmp esi, edx
    jb SHORT $L10246
    pop edi
    $L10248:
    pop esi
    ; Line 24
    ret 0

    test_pointer:
    ; _list$ = ecx
    ; _nitems$ = edx
    ; Line 29
    lea edx, DWORD PTR [ecx+edx*8]
    xor eax, eax
    cmp ecx, edx
    jae SHORT $L10257
    push esi
    npad 6
    $L10255:
    ; Line 30
    mov esi, DWORD PTR [ecx+4]
    add esi, DWORD PTR [ecx]
    add ecx, 8
    add eax, esi
    cmp ecx, edx
    jb SHORT $L10255
    pop esi
    $L10257:
    ; Line 32
    ret 0

    I also did a quick test by stripping the source code down to the parts
    we care about:

    struct listitem { int value1, value2; };

    int test_array(listitem const *list, int nitems)
    {
        int x = 0;
        int i;
        for (i = 0; i < nitems; ++i)
            x += list[i].value1 + list[i].value2;
        return x;
    }

    int test_pointer(listitem const *list, int nitems)
    {
        int x = 0;
        listitem const *pi;
        for (pi = list; pi < list + nitems; ++pi)
            x += pi->value1 + pi->value2;
        return x;
    }

    and then compiled the result with MS C/C++ 7.0 (the predecessor of MS
    Visual C++ 1.0). Here's the assembly language it produces for the two
    inner loops:

    test_array:
    $F344:
    ; Line 8
    mov ax,WORD PTR [bx-2]
    add ax,WORD PTR [bx]
    add cx,ax
    add bx,4
    dec dx
    jne $F344

    test_pointer:
    $F355:
    ; Line 17
    mov ax,WORD PTR [si+2]
    add ax,WORD PTR [si]
    add dx,ax
    add si,4
    cmp WORD PTR [bp-4],si ;list
    ja $F355

    In this case, there may be a small difference between the two -- but
    it's the array version that's faster. Even on a machine from that time
    frame, the difference would be too small to care much about though --
    given the memory constraints of MS-DOS, the time difference for the
    largest array you could create would still have been less than a
    millisecond.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Jun 10, 2008
    #7
  8. Konrad Mühler

    Mirco Wahab Guest

    Hi Jerry,

    thanks for your meaningful response.

    Jerry Coffin wrote:
    > In article <g2mknj$md3$-freiberg.de>, -
    > halle.de says...
    >> How does this:
    >>
    >>>>> x = myList[i].value1 + myList[i].value2
    >>
    >> look to you?

    >
    > This looks like it requires operator[] to be valid for whatever sort of
    > thing myList is -- but that doesn't tell us a thing. Overloading
    > operator[] for a linked list is trivial.


    OK

    >>> [...] optimizing away the
    >>> difference between array-style and pointer-style access has been routine
    >>> among nearly all compilers for years -- all versions of "Visual Studio"
    >>> (and versions of Microsoft C++ and Microsoft C since long before they
    >>> started using the "Visual" brand) have known how to do this kind of
    >>> optimization.

    >> Not entirely true. The code posted by me *will* show significant
    >> differences for pointer vs. array formulation between gcc 3.x and
    >> 4.x *and* between VC++6 and VC++9. I tested it ;-)

    >
    > I glanced through your code, and frankly I wasn't much impressed.


    What the ... ;-)

    > First of all, what you timed was substantially different from what the OP
    > asked about. Second, you called the functions through a pointer
    > (repeatedly) and included the time to call the function via a pointer in
    > the overall time for the function. Third, you included the first run
    > through the data in the time for executing the code -- on the first run,
    > the data will frequently be in main memory, but on the second and
    > subsequent runs, it's likely to be in the cache, so the first run will
    > often be substantially slower than the second and subsequent runs.


    All three points are somehow unfounded, imho.
    1) What was timed was simply interleaved sequential access to
    a block of data that is not too trivial regarding L2
    access of the underlying architecture. Taking out one
    "dimension", as you did, also makes things too simple
    for any optimizing compiler and wouldn't support my view ...,
    2) calling through a pointer was done 2x per test scenario, which
    has to be compared with the NxN loop calculation in the function;
    that leaves us with an error ratio of (only) 2 / (20,000 x 20,000), even if
    we assume the indirect call takes as many cycles as the inner-loop
    calculation (which it obviously doesn't),
    3) this doesn't really count, because the data block initialization is
    done immediately before the first test function invocation. BTW,
    due to the "indirect call dispatcher design", this was easy to
    add (and I added it - it didn't change any results, see enclosed source).

    > Here's a bit of code that gets rid of at least those specific problems:
    > #include <cstdlib>
    > [snipped]
    > ...
    > struct listitem { int value1, value2; };
    > ...

    Why not include some other data item in between?

    struct listitem { int value1, dummy, value2; };

    > int test_array(listitem const *list, size_t nitems)
    > {
    > int x=0;
    >
    > for(size_t i=0; i<nitems; ++i)
    > x += list[i].value1 + list[i].value2;
    > return x;
    > }


    This is just a sequential block access which can
    be translated to assembler using only one index
    register, one counter, and one accumulator.

    > [snipped]
    > std::cout << "Array value: " << val1;
    > std::cout << "\nPointer value: " << val2;
    >
    > std::cout << "\nArray Time: " << time1;
    > std::cout << "\nPointer Time : " << time2;


    On any of my machines, your code (using your
    parameters) just runs through within a few clock
    ticks. But I see your point. On such simple
    constructs, almost any compiler can deduce
    the optimal access sequence and can easily
    avoid repeated index offset multiplications.

    > Compiled with VC++ 6.0, using the default "Release" settings for a
    > console application, the results I get from a few runs of this are:
    > ...
    > [snipped]


    ohh, this is nice:

    > test_array:
    > ... $L10246:
    > mov edi, DWORD PTR [ecx+esi*8+4]
    > add edi, DWORD PTR [ecx+esi*8]
    > add esi, 1
    > add eax, edi
    > cmp esi, edx
    > jb SHORT $L10246
    > ...


    and

    > test_pointer:
    > ... $L10255:
    > mov esi, DWORD PTR [ecx+4]
    > add esi, DWORD PTR [ecx]
    > add ecx, 8
    > add eax, esi
    > cmp ecx, edx
    > jb SHORT $L10255
    > ...


    This makes my point absolutely clear: the
    array access is *not* optimized away by
    your compiler (maybe this is the asm from the
    "debug" settings?). If the above is really
    faster on a machine, that's a problem of
    the instruction decoding/reordering or
    whatever.

    > I also did a quick test by stripping the source code down to the parts
    > we care about:
    > [simplified code snipped]
    > ...
    > and then compiled the result with MS C/C++ 7.0 (the predecessor of MS
    > Visual C++ 1.0). Here's the assembly language it produces for the two
    > inner loops:


    OK

    > test_array:
    > $F344:
    > ; Line 8
    > mov ax,WORD PTR [bx-2]
    > add ax,WORD PTR [bx]
    > add cx,ax
    > add bx,4
    > dec dx
    > jne $F344
    >
    > test_pointer:
    > $F355:
    > ; Line 17
    > mov ax,WORD PTR [si+2]
    > add ax,WORD PTR [si]
    > add dx,ax
    > add si,4
    > cmp WORD PTR [bp-4],si ;list
    > ja $F355


    Yes, that's simply using the adjacent positions
    of struct members:

    mov ax, WORD PTR [bx-2]
    add ax, WORD PTR [bx]

    and advancing the pointer into the data block
    by the struct size (16 bit ints?)

    add bx,4

    The compiler obviously did convert the
    "array access" into a "pointer access".

    > In this case, there may be a small difference between the two -- but
    > it's the array version that's faster. Even on a machine from that time
    > frame, the difference would be too small to care much about though --
    > given the memory constraints of MS-DOS, the time difference for the
    > largest array you could create would still have been less than a
    > millisecond.


    Let's see what the new gcc 4.3.1 thinks about the inner loop
    if full optimization is set (g++-4.3 -O3 -dA -c stuff.cxx -S):

    [array]
    .L6:
    # basic block 3
    addl 4(%ecx,%edx,8), %eax # displacement(base, index, scale)
    addl (%ecx,%edx,8), %eax
    addl $1, %edx
    cmpl %edx, %ebx
    jg .L6

    [pointer]
    .L14:
    # basic block 3
    addl 4(%edx), %eax
    addl (%edx), %eax
    addl $8, %edx
    cmpl %ecx, %edx
    jb .L14

    What do we see here? The "array-loop" uses a different
    addressing scheme (indirect register w/base and displacement)
    compared to the "pointer-loop". This proves my point again,
    the pointer loop will be optimized into simpler and
    shorter machine instructions. Due to modern processor
    architectures and instruction translation/reordering,
    one can't deduce from the above code which is faster -
    but this might *possibly* pop up on some architectures.

    I modified my code and reran systematic comparisons
    on some boxes:

    --- 8< -----------------------------------------------------------

    [C2Q-9300/3.33GHz gcc-4.3.1 -O3 -march=pentiumpro]
    array: 1.1 sec pointer: 0.91 sec arr/ptr: 120.879%

    [C2Q-9300/3.33GHz icc-10.1 -fast]
    array: 1.1 sec pointer: 1.09 sec arr/ptr: 100.917%

    [C2Q-6600/3.15GHz gcc-4.3.1 -O3 -march=pentiumpro]
    array: 1.09 sec pointer: 0.91 sec arr/ptr: 119.78%

    [C2Q-6600/3.15GHz icc-10.1 -fast]
    array: 1.1 sec pointer: 1.11 sec arr/ptr: 99.0991%

    [P4/2.6GHz gcc-4.3.1 -O3 -march=pentiumpro]
    array: 1.61 sec pointer: 1.74 sec arr/ptr: 92.5287%

    [P4/2.6GHz icc 10.1 -O3 -xN]
    array: 1.97 sec pointer: 1.97 sec arr/ptr: 100%

    [Athlon64/2.3GHz VC++6 /O2 /Zp4]
    array: 1.875 sec pointer: 1.922 sec arr/ptr: 97.5546%

    [Athlon64/2.3GHz VC++9 /O2 /Oy /Ot /Zp4]
    array: 2.376 sec pointer: 1.89 sec arr/ptr: 125.714%

    [Athlon64/2.3GHz gcc-mingw-3.4.2 -O3 -march=pentiumpro]
    array: 3.656 sec pointer: 3.297 sec arr/ptr: 110.889%

    --- 8< -----------------------------------------------------------

    (using the code in the appendix). The *same* program shows
    different behaviour, e.g. on P4 vs. Core2Q, obviously
    due to the different chip internals ...

    Regards

    Mirco

    code, modified: ==>

    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <iostream>

    struct listitem { int value1, dummy, value2; };

    double timethis(const char *s, int (*p)(listitem*, int), listitem *list, int n)
    {
        std::cout << s << ' ' << std::flush; // print some progress message ...
        clock_t t = clock();   // the indirect call does not contribute to run-time,
        int x = p(list, n);    // it's just a dispatcher (and prevents inlining)
        return double(clock() - t + x) / CLOCKS_PER_SEC; // fool the optimizer! (x == zero)
    }

    int test_array(listitem *list, int nitems)
    {
        int x = 0;
        for (int j = 0; j < nitems; ++j)
            for (int i = 0; i < nitems; ++i)
                x += (list[i].value1 - list[j].value1)
                   + (list[i].value2 - list[j].value2);
        return x;
    }

    int test_pointer(listitem *list, int nitems)
    {
        int x = 0;
        for (listitem *pj = list; pj < list + nitems; ++pj)
            for (listitem *pi = list; pi < list + nitems; ++pi)
                x += (pi->value1 - pj->value1) + (pi->value2 - pj->value2);
        return x;
    }

    int main(int argc, char *argv[])
    {
        int nitems = argc > 1 ? atoi(argv[1]) : 22222;
        listitem *list = new listitem[nitems];
        memset(list, 0x1A, sizeof(listitem) * nitems);
        timethis("warmup", test_array, list, nitems);

        double ta = 0, tp = 0;
        tp += timethis("pointer", test_pointer, list, nitems);
        ta += timethis("array", test_array, list, nitems);
        tp += timethis("pointer", test_pointer, list, nitems);
        ta += timethis("array", test_array, list, nitems);

        std::cout << "length: " << nitems << std::endl
                  << "array: " << ta << " sec" << '\t'
                  << "pointer: " << tp << " sec" << '\t'
                  << "arr/ptr: " << (ta/tp)*100. << "%" << std::endl;

        delete [] list;
        return 0;
    }
    <==
     
    Mirco Wahab, Jun 11, 2008
    #8
  9. Konrad Mühler

    Jerry Coffin Guest

    In article <g2o6p6$14gf$-freiberg.de>, -
    halle.de says...

    [ ... ]

    > All three points are somehow unfounded, imho.
    > 1) What was timed was simply interleaved sequential access to
    > a block of data that is not too trivial regarding L2
    > access of the underlying architecture. Taking out one
    > "dimension", as you did, also makes things too simple
    > for any optimizing compiler and wouldn't support my view ...,


    The OP asked a fairly specific question, including a definition of the
    structure of interest. Adding more to that structure simply because the
    one he cares about doesn't "support your view" seems disingenuous at
    best.

    > 2) calling through a pointer was done 2x per test scenario, which
    > has to be compared with the NxN loop calculation in the function;
    > that leaves us with an error ratio of (only) 2 / (20,000 x 20,000), even if
    > we assume the indirect call takes as many cycles as the inner-loop
    > calculation (which it obviously doesn't),


    The problem isn't so much the indirection itself as the fact
    that when making an indirect call, the compiler often can't (or at least
    doesn't) carry out optimizations that it would carry out otherwise. If
    the function would normally be called indirectly, that's fine and
    wonderful -- but there's no indication that would be the case here. As
    such, the testing method may be having a substantial effect on the
    result.
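
    A small sketch of the effect (my illustration, with assumed names): when
    the pointer arrives as a parameter, the compiler cannot see the call
    target inside the timing function, so the call usually stays a real call
    and blocks inlining, unlike the direct form:

    #include <cstddef>

    struct listitem { int value1, value2; };

    static int sum(const listitem* a, std::size_t n)
    {
        int x = 0;
        for (std::size_t i = 0; i < n; ++i)
            x += a[i].value1 + a[i].value2;
        return x;
    }

    // Direct call: the compiler sees sum() here and may inline and
    // specialize the loop at the call site.
    int time_direct(const listitem* a, std::size_t n)
    {
        return sum(a, n);
    }

    // Indirect call: the target is only known at run time, so (barring
    // LTO or constant propagation of the pointer) it stays an opaque call.
    int time_indirect(int (*fp)(const listitem*, std::size_t),
                      const listitem* a, std::size_t n)
    {
        return fp(a, n);
    }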

    > 3) this doesn't really count, because the data block initialization is
    > done immediately before the first test function invocation. BTW,
    > due to the "indirect call dispatcher design", this was easy to
    > add (and I added it - it didn't change any results, see enclosed source).
    >
    > > Here's a bit of code that gets rid of at least those specific problems:
    > > #include <cstdlib>
    > > [snipped]
    > > ...
    > > struct listitem { int value1, value2; };
    > > ...

    > Why not include some other data item in between?


    Because the OP specified exactly this structure as being what he cares
    about.

    > struct listitem { int value1, dummy, value2; };
    >
    > > int test_array(listitem const *list, size_t nitems)
    > > {
    > > int x=0;
    > >
    > > for(size_t i=0; i<nitems; ++i)
    > > x += list[i].value1 + list[i].value2;
    > > return x;
    > > }

    >
    > This is just a sequential block access which can
    > be translated to assembler using only one index
    > register, one counter, and one accumulator.


    Sure -- and it corresponds directly to what the OP said he cared about.

    [ ... ]

    > This makes my point absolutely clear: the
    > array access is *not* optimized away by
    > your compiler (maybe this is the asm from the
    > "debug" settings?). If the above is really
    > faster on a machine, that's a problem of
    > the instruction decoding/reordering or
    > whatever.


    Yes and no -- I decided it got too long, but probably should have left
    in a more detailed analysis of the code. The fact is that yes, there is
    a difference in the generated code. The other fact is that there's not
    even a theoretical difference in speed. The parts that are different come
    down to either adding one each iteration and then multiplying by 8, or
    adding 8 each iteration. The addition is the same speed either way. The
    multiplication by 8 doesn't change the amount of time either. On the
    original 8088/8086, there would have been a difference -- calculating an
    effective address took a variable amount of time, and the more complex a
    calculation used, the longer it took. On those processors, you really
    would have gained something by transforming one form to the other -- and
    compilers at the time did.

    On every x86 since (at least) the 286, there has been an effective address
    unit that can carry out the calculation in constant time. That being the
    case, there's no difference in speed based on the complexity of the
    effective address calculation -- addressing like [esi], [esi+ebx],
    [esi+ebx*8] all take _exactly_ the same amount of time to generate on
    every reasonably modern processor.

    [ ... ]

    > mov ax, WORD PTR [bx-2]
    > add ax, WORD PTR [bx]
    >
    > and advancing the pointer into the data block
    > by the struct size (16 bit ints?)


    Yes -- I said it was old! It generated code for MS-DOS and 16-bit
    Windows.

    >
    > add bx,4
    >
    > The compiler obviously did convert the
    > "array access" into a "pointer access".


    Quite true -- back when processors were sufficiently primitive that it
    made a difference, they did the optimization. Nowadays, they don't
    because it's an ineffective waste of time.

    [ ... ]

    > What do we see here? The "array-loop" uses a different
    > addressing scheme (indirect register w/base and displacement)
    > compared to the "pointer-loop". This proves my point again,
    > the pointer loop will be optimized into simpler and
    > shorter machine instructions. Due to modern processor
    > architectures and instruction translation/reordering,
    > one can't deduce from the above code which is faster -
    > but this might *possibly* pop up on some architectures.


    Pardon my being blunt, but "Nonsense". People who write compilers have
    known about how to do this transformation for _years_. The reason they
    don't do it (at least on the x86) is simple: it's no longer effective.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Jun 11, 2008
    #9
  10. Konrad Mühler

    Mirco Wahab Guest

    Jerry Coffin wrote:
    > In article <g2o6p6$14gf$-freiberg.de>, -
    > halle.de says...
    >> All three points are somehow unfounded, imho.

    >
    > The OP asked a fairly specific question, including a definition of the
    > structure of interest. Adding more to that structure simply because the
    > one he cares about doesn't "support your view" seems disingenuous at
    > best.


    No, he didn't. It was simply an abstract question regarding the
    efficiency of array vs. pointer access into data aggregates,
    possibly simple structs (we don't even know). How can you
    say it was /a fairly specific question/?

    Also, he didn't give any /definition of the structure of interest/
    anywhere. He used a generalized idiom that wouldn't even compile.
    What's your point in making such statements?

    > The problem isn't so much the indirection itself as the fact
    > that when making an indirect call, the compiler often can't (or at least
    > doesn't) carry out optimizations that it would carry out otherwise. If
    > the function would normally be called indirectly, that's fine and
    > wonderful -- but there's no indication that would be the case here. As
    > such, the testing method may be having a substantial effect on the
    > result.


    That's entirely the point. The compiler *shouldn't* do anything
    that we don't want it to do. Why would we create a test case
    that probably destroys the interesting part through unpredictable
    *global* optimizations? So the indirect call served exactly
    the purpose I intended. Fine.

    >> Why not to include some other data item in between?

    >
    > Because the OP specified exactly this structure as being what he cares
    > about.


    No, he used some symbolic code in order to make his point.
    He even added /etc./ (whatever that means ;-)

    >> This is just a sequential block access which can
    >> be translated to assembler using only one index
    >> register, one counter, and one accumulator.

    >
    > Sure -- and it corresponds directly to what the OP said he cared about.


    Yes and no ;-)

    >> This makes my point absolutely clear: the
    >> array access is *not* optimized away by
    >> your compiler (maybe this is the asm from the
    >> "debug" settings?). If the above is really
    >> faster on a machine, that's a problem of
    >> the instruction decoding/reordering or
    >> whatever.

    >
    > Yes and no -- I decided it got too long, but probably should have left
    > in a more detailed analysis of the code. The fact is that yes, there is
    > a difference in the generated code. The other fact is that there's not
    > even a theoretical difference in speed. The parts that are different are
    > between adding one each iteration, then multiplying by 8, or adding 8
    > either iteration. The addition is the same speed either way. The
    > multiplication by 8 doesn't change the amount of time either.


    That's not really conclusive for me. As you probably know, the scaled
    index variant (mov edi, DWORD PTR [ecx+esi*8+4]) is at least one byte
    longer in opcode length than the unscaled one (mov esi, DWORD PTR [ecx+4]),
    according to the Intel instruction set manual. So your assertion
    may well be wrong.

    > On the original 8088/8086, there would have been a difference -- calculating an
    > effective address took a variable amount of time, and the more complex a
    > calculation used, the longer it took. On those processors, you really
    > would have gained something by transforming one form to the other -- and
    > compilers at the time did.
    > On every x86 since (at least) the 286, there has been an effective address
    > unit that can carry out the calculation in constant time.


    I'm not exactly sure you are correct here, but I dimly remember
    that scaled address generation was expensive up until the Pentium
    or even the P6 (frequent AGI stalls on the P5?).

    > That being the case, there's no difference in speed based on the complexity of the
    > effective address calculation -- addressing like [esi], [esi+ebx],
    > [esi+ebx*8] all take _exactly_ the same amount of time to generate on
    > every reasonably modern processor.


    You have all sorts of issues here: pipelining, instruction size
    differences, used-up registers (modifications under way in the
    pipeline), and so on. /all take _exactly_ the same amount of time/
    is nothing I would bet my hand on.

    >> What do we see here? The "array-loop" uses a different
    >> addressing scheme (indirect register w/base and displacement)
    >> compared to the "pointer-loop". This proves my point again,
    >> the pointer loop will be optimized into simpler and
    >> shorter machine instructions. Due to modern processor
    >> architectures and instruction translation/reordering,
    >> one can't deduce from the above code which is faster -
    >> but this might *possibly* pop up on some architectures.

    >
    > Pardon my being blunt, but "Nonsense". People who write compilers have
    > known about how to do this transformation for _years_. The reason they
    > don't do it (at least on the x86) is simple: it's no longer effective.


    Nice argument. As you know, there are several implementations of
    the x86 instruction set, and from running the programs posted
    earlier I can see a 10-25% difference in effectiveness between
    addressing modes with the same program on different CPUs.
    And this *has* to do with the OP's question: will there
    be /bad performance/ or not.

    What exactly is the *nonsense* you found in your critique
    of my paragraph above?

    Regards & Thanks

    Mirco
     
    Mirco Wahab, Jun 11, 2008
    #10
  11. Konrad Mühler

    Jerry Coffin Guest

    In article <g2p47e$1d1q$-freiberg.de>, -
    halle.de says...

    [ ... ]

    > That's entirely the point. The compiler *shouldn't* do anything
    > that we don't want it to do. Why would we create a test case
    > that probably destroys the interesting part through unpredictable
    > *global* optimizations? So the indirect call served exactly
    > the purpose I intended. Fine.


    In that case, it appears that your purpose was to "prove" a foregone
    conclusion rather than to provide information that could possibly be of
    use to anybody else.

    > >> Why not include some other data item in between?

    > >
    > > Because the OP specified exactly this structure as being what he cares
    > > about.

    >
    > No, he used some symbolic code in order to make his point.
    > He even added /etc./ (whatever that means ;-)


    So why do you assume that he must have been asking about something he
    didn't even mention?

    [ ... ]

    > That's not really conclusive for me. As you probably know, the scaled
    > index variant (mov edi, DWORD PTR [ecx+esi*8+4]) is at least one byte
    > longer in opcode length than the unscaled one (mov esi, DWORD PTR [ecx+4]),
    > according to the Intel instruction set manual. So your assertion
    > may well be wrong.


    Well, at least you've had the courtesy to not draw a conclusion -- a
    wise decision, since you don't seem to know much of what you're talking
    about.

    [ ... ]

    > I'm not exactly sure you are correct here, but I dimly remember
    > that scaled address generation was expensive up until the Pentium
    > or even the P6 (frequent AGI stalls on the P5?).


    Nonsense. AGI had to do with having two instructions in which a register
    was the destination of one instruction, and then used as the source
    address in the next. For example, code like:

    mov ebx, 1234h
    mov eax, [ebx]

    would have caused an AGI stall between the two instructions. First of
    all, however, the CPUs for which this is true are well and truly
    obsolete -- even if you get your computers by dumpster diving, you're
    going to end up with something much newer than this. Second, scaling the
    address doesn't contribute to AGI stalls in any way.

    [ ... ]

    > > Pardon my being blunt, but "Nonsense". People who write compilers have
    > > known about how to do this transformation for _years_. The reason they
    > > don't do it (at least on the x86) is simple: it's no longer effective.

    >
    > Nice argument. As you know, there are several implementations of
    > the x86 instruction set, and from running the programs posted
    > earlier I can see a 10-25% difference in effectiveness between
    > addressing modes with the same program on different CPUs.
    > And this *has* to do with the OP's question: will there
    > be /bad performance/ or not.
    >
    > What exactly is the *nonsense* you found in your critique
    > of my paragraph above?


    The fact that when you get down to it, all you're doing is throwing
    around FUD. You bring up things like AGI stalls and possible differences
    in architecture, not because they're real but because you've apparently
    reached a conclusion through prejudice and are doing whatever you can to
    protect it.

    It's certainly true that under some circumstance or other, you might see
    some difference in speed between two loops that do the same thing -- but
    this remains equally true whether one uses pointer-style notation and
    the other subscripting notation, or both use the same notation. So far
    you've provided precisely zero real support for the claim that one will
    consistently execute more efficiently than the other. Quite the
    contrary, I've proven that in the one case where there really was a
    provable difference, it favored the one you said should be slower.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Jun 11, 2008
    #11
  12. Konrad Mühler

    Mirco Wahab Guest

    Jerry Coffin wrote:
    > It's certainly true that under some circumstance or other, you might see
    > some difference in speed between two loops that do the same thing -- but
    > this remains equally true whether one uses pointer-style notation and
    > the other subscripting notation, or both use the same notation. So far
    > you've provided precisely zero real support for the claim that one will
    > consistently execute more efficiently than the other. Quite the
    > contrary, I've proven that in the one case where there really was a
    > provable difference, it favored the one you said should be slower.


    This doesn't make much sense if you don't read others' results and/or
    conclusions thoroughly, and just keep repeating your assumptions
    while throwing around accusations of 'prejudice' and using improper
    wording.

    Bye bye & thanks for your time spent

    Mirco
     
    Mirco Wahab, Jun 11, 2008
    #12
