Performance Question ...

Discussion in 'C++' started by Konrad Mühler, Jun 10, 2008.

  1. Hi,

    I have a list of objects. I iterate over the list and read the values of
    each object in many operations, like:

    x = myList[i].value1 + myList[i].value2
    etc.

    My question:
    Is it efficient to always use myList[i], or should I get a pointer to
    the current element and use that instead, like

    currentElement->value1 ...

    Is it bad for performance to always let the list return the i-th element?

    Thank you
    Konrad
     
    Konrad Mühler, Jun 10, 2008
    #1

  2. Konrad Mühler

    dizzy Guest

    Konrad Mühler wrote:

    > Hi,
    >
    > I have a list of objects. I iterate over the list and read the values of
    > each object in many operations, like:
    >
    > x = myList[i].value1 + myList[i].value2
    > etc.


    So you don't have a std::list of objects, as std::list has no op[] (BECAUSE
    such an op[] would be too expensive). What does your list::op[] do? What is
    its algorithmic complexity?

    > My question:
    > Is it efficient to always use myList[i], or should I get a pointer to
    > the current element and use that instead, like


    Depends on your implementation of the list. If your list is actually something
    like a std::vector, then an op[] is fast enough. If your data set is much
    larger than your data cache, then that will be a performance bottleneck no
    matter whether you use references/pointers/iterators to elements or a fast op[].
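
    For instance, if it is something like a std::vector, the two styles the OP
    asks about might look like the sketch below (my illustration, with an
    assumed element type; whether caching the element actually helps is exactly
    what profiling should decide):

    #include <vector>
    #include <cstddef>

    struct Item { int value1, value2; };  // assumed element type

    int sum_indexed(const std::vector<Item>& v)
    {
        int x = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            x += v[i].value1 + v[i].value2;  // operator[] evaluated twice per element
        return x;
    }

    int sum_cached(const std::vector<Item>& v)
    {
        int x = 0;
        for (std::size_t i = 0; i < v.size(); ++i) {
            const Item& e = v[i];            // evaluate operator[] once, reuse the reference
            x += e.value1 + e.value2;
        }
        return x;
    }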

    > currentElement->value1 ...
    >
    > Is it bad for performance to always let the list return the i-th element?


    Have you benchmarked/profiled it? What results does it have on your
    particular combination of platform, implementation, compiler (version),
    optimization flags?

    Make your decision according to that.
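
    As a starting point, a minimal clock()-based harness for that could look
    like the sketch below (mine, not a prescription; work() is a hypothetical
    stand-in for the loop under test, repeated to outlast the coarse clock()
    granularity):

    #include <ctime>
    #include <iostream>

    static int work()
    {
        int x = 0;
        for (int i = 0; i < 100000; ++i)
            x += i % 7;                    // stand-in for the code under test
        return x;
    }

    int main()
    {
        const int reps = 100;              // repeat so the total time is measurable
        int sink = 0;                      // consume results so the loop isn't removed
        std::clock_t t0 = std::clock();
        for (int r = 0; r < reps; ++r)
            sink += work();
        double sec = double(std::clock() - t0) / CLOCKS_PER_SEC;
        std::cout << "total: " << sec << " s (sink: " << sink << ")\n";
        return 0;
    }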

    --
    Dizzy
     
    dizzy, Jun 10, 2008
    #2

  3. Konrad Mühler

    Mirco Wahab Guest

    Konrad Mühler wrote:
    > I have a list of objects. I iterate over the list and read the values of
    > each object in many operations, like:
    > x = myList[i].value1 + myList[i].value2
    > etc.
    > My question:
    > Is it efficient to always use myList[i], or should I get a pointer to
    > the current element and use that instead, like
    > currentElement->value1 ...
    > Is it bad for performance to always let the list return the i-th element?


    That depends on the compiler. Newer gcc versions will
    optimize the differences away, whereas older ones
    (and maybe Visual Studio) will not.

    try it:

    ==>
    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <iostream>

    struct listitem { int value1, value2, value3; };

    int test_array(listitem *list, int nitems)
    {
        int x = 0;
        for (int j = 0; j < nitems; ++j)
            for (int i = 0; i < nitems; ++i)
                x += (list[i].value1 - list[j].value1)
                   + (list[i].value2 - list[j].value2);
        return x;
    }

    int test_pointer(listitem *list, int nitems)
    {
        int x = 0;
        for (listitem *pj = list; pj < list + nitems; ++pj)
            for (listitem *pi = list; pi < list + nitems; ++pi)
                x += (pi->value1 - pj->value1) + (pi->value2 - pj->value2);
        return x;
    }

    double timethis(const char *s, int (*p)(listitem*, int), listitem *list, int nitems)
    {
        std::cout << s << std::endl;
        clock_t t = clock();
        int x = p(list, nitems);  // fool the optimizer! (x is always zero)
        return double(clock() - t + x) / CLOCKS_PER_SEC;
    }

    int main(int argc, char *argv[])
    {
        int nitems = argc > 1 ? atoi(argv[1]) : 20000;
        listitem *list = new listitem[nitems];
        memset(list, 0x1A, sizeof(listitem) * nitems);

        double ta = 0, tp = 0;
        tp += timethis("pointer", test_pointer, list, nitems);
        ta += timethis("array", test_array, list, nitems);
        tp += timethis("pointer", test_pointer, list, nitems);
        ta += timethis("array", test_array, list, nitems);

        std::cout << "length: " << nitems << std::endl
                  << "array: " << ta << " sec" << std::endl
                  << "pointer: " << tp << " sec" << std::endl;

        delete [] list;
        return 0;
    }
    <==

    Regards

    Mirco
     
    Mirco Wahab, Jun 10, 2008
    #3
  4. Hi,

    Konrad Mühler wrote:
    > I have a list of objects. I iterate over the list and read the values of
    > each object in many operations, like:
    >
    > x = myList[i].value1 + myList[i].value2
    > etc.
    >
    > My question:
    > Is it efficient to always use myList[i], or should I get a pointer to
    > the current element and use that instead, like
    >
    > currentElement->value1 ...
    >
    > Is it bad for performance to always let the list return the i-th element?


    What is the complexity of myList.operator[]?
    The complexity of your loop will be one order higher (unless you have
    something less efficient somewhere else in the loop).

    The STL containers usually do not provide operator[] unless it has low
    complexity, usually O(1). So if you do not have your own implementation
    of operator[], you will likely end up with O(n) for the whole loop, which
    is the minimum for a loop over the entire container content. In that case
    the performance impact is minimal.

    However, if your operator[] is O(log n) (like SGI's rope
    implementation) or O(n) like a linked list, you will get O(n*log n) or
    O(n²) respectively, which is bad.
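
    To illustrate the O(n²) case (my sketch, not Marcel's code): std::list has
    no operator[], but simulating indexed access with std::advance shows what
    an O(n) subscript does to the loop, while the plain iterator pass stays
    linear:

    #include <list>
    #include <iterator>
    #include <cstddef>

    int sum_indexed(const std::list<int>& l)   // O(n^2): restart from begin() every time
    {
        int x = 0;
        for (std::size_t i = 0; i < l.size(); ++i) {
            std::list<int>::const_iterator it = l.begin();
            std::advance(it, i);               // O(i) walk for every "index"
            x += *it;
        }
        return x;
    }

    int sum_iterated(const std::list<int>& l)  // O(n): a single pass with an iterator
    {
        int x = 0;
        for (std::list<int>::const_iterator it = l.begin(); it != l.end(); ++it)
            x += *it;
        return x;
    }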


    Marcel
     
    Marcel Müller, Jun 10, 2008
    #4
  5. Konrad Mühler

    Jerry Coffin Guest

    In article <g2lsvb$fv2$-freiberg.de>, -
    halle.de says...
    > Konrad Mühler wrote:
    > > I have a list of objects. I iterate over the list and read the values of


    [ ... ]

    > That depends on the compiler. Newer gcc versions will
    > optimize the differences away, whereas older ones
    > (and maybe Visual Studio) will not.


    This depends heavily upon interpretation. Some people only use "list" to
    mean "linked list". In that case, I find it difficult to believe that a
    compiler is capable of optimizing out the differences between subscript-
    style notation and an explicit iterator (a pointer, in this case).

    Other people use "list" to mean almost any sort of linear data
    structure. In this case, (e.g. using an array) optimizing away the
    difference between array-style and pointer-style access has been routine
    among nearly all compilers for years -- all versions of "Visual Studio"
    (and versions of Microsoft C++ and Microsoft C since long before they
    started using the "Visual" brand) have known how to do this kind of
    optimization.

    Likewise, while there may have been _some_ early version of gcc that
    couldn't do it, I doubt that version was ever in wide use -- I'm pretty
    sure by the time it was labeled "version 1.0", it could do this.

    > try it:


    [ code using dynamically allocated array elided ... ]

    Just keep in mind that this may not answer the question being asked
    by the OP -- it all depends on what he means by "list".

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Jun 10, 2008
    #5
  6. Konrad Mühler

    Mirco Wahab Guest

    Jerry Coffin wrote:
    > In article <g2lsvb$fv2$-freiberg.de>, -
    > halle.de says...
    >> That depends on the compiler. Newer gcc versions will
    >> optimize the differences away, whereas older ones
    >> (and maybe Visual Studio) will not.

    >
    > This depends heavily upon interpretation. Some people only use "list" to
    > mean "linked list". In that case, I find it difficult to believe that a
    > compiler is capable of optimizing out the differences between subscript-
    > style notation and an explicit iterator (a pointer, in this case).


    Right. And this was my interpretation of the OP's list.
    How does this:

    >>> x = myList[i].value1 + myList[i].value2

    look to you?

    > Other people use "list" to mean almost any sort of linear data
    > structure. In this case, (e.g. using an array) optimizing away the
    > difference between array-style and pointer-style access has been routine
    > among nearly all compilers for years -- all versions of "Visual Studio"
    > (and versions of Microsoft C++ and Microsoft C since long before they
    > started using the "Visual" brand) have known how to do this kind of
    > optimization.


    Not entirely true. The code posted by me *will* show significant
    differences for pointer vs. array formulation between gcc 3.x and
    4.x *and* between VC++6 and VC++9. I tested it ;-)

    > [ code using dynamically allocated array elided ... ]
    > Just keep in mind that this may not answer the question being asked
    > by the OP -- it all depends on what he means by "list".


    OK, I'm not absolutely sure (but neither are you) ;-)

    From interpreting the OP's code, this seems (at least
    to me) a valid interpretation.

    Regards & Thanks

    Mirco
     
    Mirco Wahab, Jun 10, 2008
    #6
  7. Konrad Mühler

    Jerry Coffin Guest

    In article <g2mknj$md3$-freiberg.de>, -
    halle.de says...

    [ ... ]

    > Right. And this was my interpretation of the OP's list.
    > How does this:
    >
    > >>> x = myList[i].value1 + myList[i].value2
    >
    > look to you?


    This looks like it requires operator[] to be valid for whatever sort of
    thing myList is -- but that doesn't tell us a thing. Overloading
    operator[] for a linked list is trivial.
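
    For example, such a trivial -- but linear-time -- operator[] on a
    hand-rolled singly linked list might look like this (my sketch, just to
    show that the notation says nothing about the cost):

    #include <cstddef>

    struct Node { int value1, value2; Node* next; };

    struct MyList {
        Node* head;
        Node& operator[](std::size_t i) const  // trivial to write, but O(i) per call
        {
            Node* p = head;
            while (i--)
                p = p->next;                   // walks i links on every subscript
            return *p;
        }
    };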

    > > Other people use "list" to mean almost any sort of linear data
    > > structure. In this case, (e.g. using an array) optimizing away the
    > > difference between array-style and pointer-style access has been routine
    > > among nearly all compilers for years -- all versions of "Visual Studio"
    > > (and versions of Microsoft C++ and Microsoft C since long before they
    > > started using the "Visual" brand) have known how to do this kind of
    > > optimization.

    >
    > Not entirely true. The code posted by me *will* show significant
    > differences for pointer vs. array formulation between gcc 3.x and
    > 4.x *and* between VC++6 and VC++9. I tested it ;-)


    I glanced through your code, and frankly I wasn't much impressed. First
    of all, what you timed was substantially different from what the OP
    asked about. Second, you called the functions through a pointer
    (repeatedly) and included the time to call the function via a pointer in
    the overall time for the function. Third, you included the first run
    through the data in the time for executing the code -- on the first run,
    the data will frequently be in main memory, but on the second and
    subsequent runs, it's likely to be in the cache, so the first run will
    often be substantially slower than the second and subsequent runs.

    Here's a bit of code that gets rid of at least those specific problems:

    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <iostream>
    #include <vector>
    #include <algorithm>

    struct listitem { int value1, value2; };

    listitem gen_rand() {
        listitem ret;
        ret.value1 = rand();
        ret.value2 = rand();
        return ret;
    }

    int test_array(listitem const *list, size_t nitems)
    {
        int x = 0;
        for (size_t i = 0; i < nitems; ++i)
            x += list[i].value1 + list[i].value2;
        return x;
    }

    int test_pointer(listitem const *list, size_t nitems)
    {
        int x = 0;
        for (listitem const *pi = list; pi < list + nitems; ++pi)
            x += pi->value1 + pi->value2;
        return x;
    }

    int main(int argc, char *argv[]) {
        const int rep_count = 20;
        const int size = 2000000;

        static listitem list[size];

        size_t i;

        for (i = 0; i < size; i++)
            list[i] = gen_rand();

        int val1 = test_array(list, size); // prime the cache.
        int val2 = test_pointer(list, size);

        clock_t time1 = 0;

        // do the real tests
        for (i = 0; i < rep_count; i++) {
            clock_t start1 = clock();
            val1 = test_array(list, size);
            clock_t stop1 = clock();
            time1 += stop1 - start1;
        }

        clock_t time2 = 0;

        for (i = 0; i < rep_count; i++) {
            clock_t start2 = clock();
            val2 = test_pointer(list, size);
            clock_t stop2 = clock();
            time2 += stop2 - start2;
        }

        std::cout << "Array value: " << val1;
        std::cout << "\nPointer value: " << val2;

        std::cout << "\nArray Time: " << time1;
        std::cout << "\nPointer Time : " << time2;

        return 0;
    }

    Compiled with VC++ 6.0, using the default "Release" settings for a
    console application, the results I get from a few runs of this are:

    C:\C\source\Release>test_ptr
    Array value: 1141859694
    Pointer value: 1141859694
    Array Time: 110
    Pointer Time : 93
    C:\C\source\Release>test_ptr
    Array value: 1141859694
    Pointer value: 1141859694
    Array Time: 94
    Pointer Time : 109
    C:\C\source\Release>test_ptr
    Array value: 1141859694
    Pointer value: 1141859694
    Array Time: 109
    Pointer Time : 94
    C:\C\source\Release>test_ptr
    Array value: 1141859694
    Pointer value: 1141859694
    Array Time: 94
    Pointer Time : 109

    The only differences we're seeing are of a single clock tick, and
    there's no apparent bias in either direction. For all practical
    purposes the two are identical in speed. Looking at the assembly
    language produced for each shows that it's not just for all practical
    purposes:

    test_array:
    ; _list$ = ecx
    ; _nitems$ = edx
    ; Line 18
    push esi
    ; Line 19
    xor eax, eax
    ; Line 21
    xor esi, esi
    test edx, edx
    jbe SHORT $L10248
    push edi
    npad 6
    $L10246:
    ; Line 22
    mov edi, DWORD PTR [ecx+esi*8+4]
    add edi, DWORD PTR [ecx+esi*8]
    add esi, 1
    add eax, edi
    cmp esi, edx
    jb SHORT $L10246
    pop edi
    $L10248:
    pop esi
    ; Line 24
    ret 0

    test_pointer:
    ; _list$ = ecx
    ; _nitems$ = edx
    ; Line 29
    lea edx, DWORD PTR [ecx+edx*8]
    xor eax, eax
    cmp ecx, edx
    jae SHORT $L10257
    push esi
    npad 6
    $L10255:
    ; Line 30
    mov esi, DWORD PTR [ecx+4]
    add esi, DWORD PTR [ecx]
    add ecx, 8
    add eax, esi
    cmp ecx, edx
    jb SHORT $L10255
    pop esi
    $L10257:
    ; Line 32
    ret 0

    I also did a quick test by stripping the source code down to the parts
    we care about:

    struct listitem { int value1, value2; };

    int test_array(listitem const *list, int nitems)
    {
        int x = 0;
        int i;
        for (i = 0; i < nitems; ++i)
            x += list[i].value1 + list[i].value2;
        return x;
    }

    int test_pointer(listitem const *list, int nitems)
    {
        int x = 0;
        listitem const *pi;
        for (pi = list; pi < list + nitems; ++pi)
            x += pi->value1 + pi->value2;
        return x;
    }

    and then compiled the result with MS C/C++ 7.0 (the predecessor of MS
    Visual C++ 1.0). Here's the assembly language it produces for the two
    inner loops:

    test_array:
    $F344:
    ; Line 8
    mov ax,WORD PTR [bx-2]
    add ax,WORD PTR [bx]
    add cx,ax
    add bx,4
    dec dx
    jne $F344

    test_pointer:
    $F355:
    ; Line 17
    mov ax,WORD PTR [si+2]
    add ax,WORD PTR [si]
    add dx,ax
    add si,4
    cmp WORD PTR [bp-4],si ;list
    ja $F355

    In this case, there may be a small difference between the two -- but
    it's the array version that's faster. Even on a machine from that time
    frame, the difference would be too small to care much about though --
    given the memory constraints of MS-DOS, the time difference for the
    largest array you could create would still have been less than a
    millisecond.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Jun 10, 2008
    #7
  8. Konrad Mühler

    Mirco Wahab Guest

    Hi Jerry,

    thanks for your meaningful response.

    Jerry Coffin wrote:
    > In article <g2mknj$md3$-freiberg.de>, -
    > halle.de says...
    >> How does this:
    >>
    >>>>> x = myList[i].value1 + myList[i].value2
    >>
    >> look to you?

    >
    > This looks like it requires operator[] to be valid for whatever sort of
    > thing myList is -- but that doesn't tell us a thing. Overloading
    > operator[] for a linked list is trivial.


    OK

    >>> [...] optimizing away the
    >>> difference between array-style and pointer-style access has been routine
    >>> among nearly all compilers for years -- all versions of "Visual Studio"
    >>> (and versions of Microsoft C++ and Microsoft C since long before they
    >>> started using the "Visual" brand) have known how to do this kind of
    >>> optimization.

    >> Not entirely true. The code posted by me *will* show significant
    >> differences for pointer vs. array formulation between gcc 3.x and
    >> 4.x *and* between VC++6 and VC++9. I tested it ;-)

    >
    > I glanced through your code, and frankly I wasn't much impressed.


    What the ... ;-)

    > First of all, what you timed was substantially different from what the OP
    > asked about. Second, you called the functions through a pointer
    > (repeatedly) and included the time to call the function via a pointer in
    > the overall time for the function. Third, you included the first run
    > through the data in the time for executing the code -- on the first run,
    > the data will frequently be in main memory, but on the second and
    > subsequent runs, it's likely to be in the cache, so the first run will
    > often be substantially slower than the second and subsequent runs.


    All three points are somehow unfounded, imho.
    1) What was timed was simply interleaved sequential access to
    a block of data that is not too trivial regarding L2
    access of the underlying architecture. Taking out one
    "dimension", as you did, also makes things too simple
    for any optimizing compiler and wouldn't support my view ...,
    2) calling through a pointer was done 2x per test scenario, which
    has to be compared with the NxN loop calculation in the function;
    that leaves us with an error ratio of (only) 2 / (20,000 x 20,000), even if
    we assume the indirect call takes as many cycles as the inner-loop
    calculation (which it obviously doesn't),
    3) this doesn't really count, because the data block initialization is
    done immediately before the first test function invocation. BTW,
    due to the "indirect call dispatcher design", this was easy to
    add (and I added it - it didn't change any results, see enclosed source).

    > Here's a bit of code that gets rid of at least those specific problems:
    > #include <cstdlib>
    > [snipped]
    > ...
    > struct listitem { int value1, value2; };
    > ...

    Why not include some other data item in between?

    struct listitem { int value1, dummy, value2; };

    > int test_array(listitem const *list, size_t nitems)
    > {
    > int x=0;
    >
    > for(size_t i=0; i<nitems; ++i)
    > x += list[i].value1 + list[i].value2;
    > return x;
    > }


    This is just a sequential block access which can
    be translated to assembler using only one index
    register, one counter, and one accumulator.

    > [snipped]
    > std::cout << "Array value: " << val1;
    > std::cout << "\nPointer value: " << val2;
    >
    > std::cout << "\nArray Time: " << time1;
    > std::cout << "\nPointer Time : " << time2;


    On any of my machines, your code (using your
    parameters) just runs through within a few clock
    ticks. But I see your point. On such simple
    constructs, almost any compiler can deduce
    the optimal access sequence and can easily
    avoid repeated index offset multiplications.

    > Compiled with VC++ 6.0, using the default "Release" settings for a
    > console application, the results I get from a few runs of this are:
    > ...
    > [snipped]


    ohh, this is nice:

    > test_array:
    > ... $L10246:
    > mov edi, DWORD PTR [ecx+esi*8+4]
    > add edi, DWORD PTR [ecx+esi*8]
    > add esi, 1
    > add eax, edi
    > cmp esi, edx
    > jb SHORT $L10246
    > ...


    and

    > test_pointer:
    > ... $L10255:
    > mov esi, DWORD PTR [ecx+4]
    > add esi, DWORD PTR [ecx]
    > add ecx, 8
    > add eax, esi
    > cmp ecx, edx
    > jb SHORT $L10255
    > ...


    This makes my point absolutely clear: the
    array access is *not* optimized away by
    your compiler (maybe this is the asm from the
    "debug" settings?). If the above is really
    faster on a machine, that's a problem of
    the instruction decoding/reordering or
    whatever.

    > I also did a quick test by stripping the source code down to the parts
    > we care about:
    > [simplified code snipped]
    > ...
    > and then compiled the result with MS C/C++ 7.0 (the predecessor of MS
    > Visual C++ 1.0). Here's the assembly language it produces for the two
    > inner loops:


    OK

    > test_array:
    > $F344:
    > ; Line 8
    > mov ax,WORD PTR [bx-2]
    > add ax,WORD PTR [bx]
    > add cx,ax
    > add bx,4
    > dec dx
    > jne $F344
    >
    > test_pointer:
    > $F355:
    > ; Line 17
    > mov ax,WORD PTR [si+2]
    > add ax,WORD PTR [si]
    > add dx,ax
    > add si,4
    > cmp WORD PTR [bp-4],si ;list
    > ja $F355


    Yes, that's simply using the adjacent positions
    of struct members:

    mov ax, WORD PTR [bx-2]
    add ax, WORD PTR [bx]

    and advancing the pointer into the data block
    by the struct size (16 bit ints?)

    add bx,4

    The compiler obviously did convert the
    "array access" into a "pointer access".

    > In this case, there may be a small difference between the two -- but
    > it's the array version that's faster. Even on a machine from that time
    > frame, the difference would be too small to care much about though --
    > given the memory constraints of MS-DOS, the time difference for the
    > largest array you could create would still have been less than a
    > millisecond.


    Let's see what the new gcc 4.3.1 thinks about the inner loop
    if full optimization is set (g++-4.3 -O3 -dA -c stuff.cxx -S):

    [array]
    .L6:
    # basic block 3
    addl 4(%ecx,%edx,8), %eax # displacement(base, index, scale)
    addl (%ecx,%edx,8), %eax
    addl $1, %edx
    cmpl %edx, %ebx
    jg .L6

    [pointer]
    .L14:
    # basic block 3
    addl 4(%edx), %eax
    addl (%edx), %eax
    addl $8, %edx
    cmpl %ecx, %edx
    jb .L14

    What do we see here? The "array-loop" uses a different
    addressing scheme (indirect register w/base and displacement)
    compared to the "pointer-loop". This proves my point again,
    the pointer loop will be optimized into simpler and
    shorter machine instructions. Due to modern processor
    architectures and instruction translation/reordering,
    one can't deduce from the above code which is faster -
    but this might *possibly* pop up on some architectures.

    I modified my code and reran systematic comparisons
    on some boxes:

    --- 8< -----------------------------------------------------------

    [C2Q-9300/3.33GHz gcc-4.3.1 -O3 -march=pentiumpro]
    array: 1.1 sec pointer: 0.91 sec arr/ptr: 120.879%

    [C2Q-9300/3.33GHz icc-10.1 -fast]
    array: 1.1 sec pointer: 1.09 sec arr/ptr: 100.917%

    [C2Q-6600/3.15GHz gcc-4.3.1 -O3 -march=pentiumpro]
    array: 1.09 sec pointer: 0.91 sec arr/ptr: 119.78%

    [C2Q-6600/3.15GHz icc-10.1 -fast]
    array: 1.1 sec pointer: 1.11 sec arr/ptr: 99.0991%

    [P4/2.6GHz gcc-4.3.1 -O3 -march=pentiumpro]
    array: 1.61 sec pointer: 1.74 sec arr/ptr: 92.5287%

    [P4/2.6GHz icc 10.1 -O3 -xN]
    array: 1.97 sec pointer: 1.97 sec arr/ptr: 100%

    [Athlon64/2.3GHz VC++6 /O2 /Zp4]
    array: 1.875 sec pointer: 1.922 sec arr/ptr: 97.5546%

    [Athlon64/2.3GHz VC++9 /O2 /Oy /Ot /Zp4]
    array: 2.376 sec pointer: 1.89 sec arr/ptr: 125.714%

    [Athlon64/2.3GHz gcc-mingw-3.4.2 -O3 -march=pentiumpro]
    array: 3.656 sec pointer: 3.297 sec arr/ptr: 110.889%

    --- 8< -----------------------------------------------------------

    (using the code in the appendix). The *same* program shows
    different behaviour, e.g. on P4 vs. Core2Q, obviously
    due to the different chip internals ...

    Regards

    Mirco

    code, modified: ==>

    #include <cstdlib>
    #include <cstring>
    #include <ctime>
    #include <iostream>

    struct listitem { int value1, dummy, value2; };

    double timethis(const char *s, int (*p)(listitem*, int), listitem *list, int n)
    {
        std::cout << s << ' ' << std::flush; // print some progress message ...
        clock_t t = clock();   // the indirect call does not contribute to run-time,
        int x = p(list, n);    // it's just a dispatcher (and prevents inlining)
        return double(clock() - t + x) / CLOCKS_PER_SEC; // fool the optimizer! (x == zero)
    }

    int test_array(listitem *list, int nitems)
    {
        int x = 0;
        for (int j = 0; j < nitems; ++j)
            for (int i = 0; i < nitems; ++i)
                x += (list[i].value1 - list[j].value1)
                   + (list[i].value2 - list[j].value2);
        return x;
    }

    int test_pointer(listitem *list, int nitems)
    {
        int x = 0;
        for (listitem *pj = list; pj < list + nitems; ++pj)
            for (listitem *pi = list; pi < list + nitems; ++pi)
                x += (pi->value1 - pj->value1) + (pi->value2 - pj->value2);
        return x;
    }

    int main(int argc, char *argv[])
    {
        int nitems = argc > 1 ? atoi(argv[1]) : 22222;
        listitem *list = new listitem[nitems];
        memset(list, 0x1A, sizeof(listitem) * nitems);
        timethis("warmup", test_array, list, nitems);

        double ta = 0, tp = 0;
        tp += timethis("pointer", test_pointer, list, nitems);
        ta += timethis("array", test_array, list, nitems);
        tp += timethis("pointer", test_pointer, list, nitems);
        ta += timethis("array", test_array, list, nitems);

        std::cout << "length: " << nitems << std::endl
                  << "array: " << ta << " sec" << '\t'
                  << "pointer: " << tp << " sec" << '\t'
                  << "arr/ptr: " << (ta/tp)*100. << "%" << std::endl;

        delete [] list;
        return 0;
    }
    <==
     
    Mirco Wahab, Jun 11, 2008
    #8
  9. Konrad Mühler

    Jerry Coffin Guest

    In article <g2o6p6$14gf$-freiberg.de>, -
    halle.de says...

    [ ... ]

    > All three points are somehow unfounded, imho.
    > 1) What was timed was simply interleaved sequential access to
    > a block of data that is not too trivial regarding L2
    > access of the underlying architecture. Taking out one
    > "dimension", as you did, also makes things too simple
    > for any optimizing compiler and wouldn't support my view ...,


    The OP asked a fairly specific question, including a definition of the
    structure of interest. Adding more to that structure simply because the
    one he cares about doesn't "support your view" seems disingenuous at
    best.

    > 2) calling through a pointer was done 2x per test scenario, which
    > has to be compared with the NxN loop calculation in the function;
    > that leaves us with an error ratio of (only) 2 / (20,000 x 20,000), even if
    > we assume the indirect call takes as many cycles as the inner-loop
    > calculation (which it obviously doesn't),


    The problem isn't so much the indirection itself as the fact
    that when making an indirect call, the compiler often can't (or at least
    doesn't) carry out optimizations that it would carry out otherwise. If
    the function would normally be called indirectly, that's fine and
    wonderful -- but there's no indication that would be the case here. As
    such, the testing method may be having a substantial effect on the
    result.
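
    A small sketch of the effect (my illustration, with assumed names): when
    the pointer arrives as a parameter, the compiler cannot see the call
    target inside the timing function, so the call usually stays a real call
    and blocks inlining, unlike the direct form:

    #include <cstddef>

    struct listitem { int value1, value2; };

    static int sum(const listitem* a, std::size_t n)
    {
        int x = 0;
        for (std::size_t i = 0; i < n; ++i)
            x += a[i].value1 + a[i].value2;
        return x;
    }

    // Direct call: the compiler sees sum() here and may inline and
    // specialize the loop at the call site.
    int time_direct(const listitem* a, std::size_t n)
    {
        return sum(a, n);
    }

    // Indirect call: the target is only known at run time, so (barring
    // LTO or constant propagation of the pointer) it stays an opaque call.
    int time_indirect(int (*fp)(const listitem*, std::size_t),
                      const listitem* a, std::size_t n)
    {
        return fp(a, n);
    }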

    > 3) this doesn't really count, because the data block initialization is
    > done immediately before the first test function invocation. BTW,
    > due to the "indirect call dispatcher design", this was easy to
    > add (and I added it - it didn't change any results, see enclosed source).
    >
    > > Here's a bit of code that gets rid of at least those specific problems:
    > > #include <cstdlib>
    > > [snipped]
    > > ...
    > > struct listitem { int value1, value2; };
    > > ...

    > Why not include some other data item in between?


    Because the OP specified exactly this structure as being what he cares
    about.

    > struct listitem { int value1, dummy, value2; };
    >
    > > int test_array(listitem const *list, size_t nitems)
    > > {
    > > int x=0;
    > >
    > > for(size_t i=0; i<nitems; ++i)
    > > x += list[i].value1 + list[i].value2;
    > > return x;
    > > }

    >
    > This is just a sequential block access which can
    > be translated to assembler using only one index
    > register, one counter, and one accumulator.


    Sure -- and it corresponds directly to what the OP said he cared about.

    [ ... ]

    > This makes my point absolutely clear: the
    > array access is *not* optimized away by
    > your compiler (maybe this is the asm from the
    > "debug" settings?). If the above is really
    > faster on a machine, that's a problem of
    > the instruction decoding/reordering or
    > whatever.


    Yes and no -- I decided it got too long, but probably should have left
    in a more detailed analysis of the code. The fact is that yes, there is
    a difference in the generated code. The other fact is that there's not
    even a theoretical difference in speed. The parts that are different come
    down to either adding one each iteration and then multiplying by 8, or
    adding 8 each iteration. The addition is the same speed either way. The
    multiplication by 8 doesn't change the amount of time either. On the
    original 8088/8086, there would have been a difference -- calculating an
    effective address took a variable amount of time, and the more complex a
    calculation used, the longer it took. On those processors, you really
    would have gained something by transforming one form to the other -- and
    compilers at the time did.

    On every x86 since (at least) the 286, there has been an effective address
    unit that can carry out the calculation in constant time. That being the
    case, there's no difference in speed based on the complexity of the
    effective address calculation -- addressing like [esi], [esi+ebx],
    [esi+ebx*8] all take _exactly_ the same amount of time to generate on
    every reasonably modern processor.

    [ ... ]

    > mov ax, WORD PTR [bx-2]
    > add ax, WORD PTR [bx]
    >
    > and advancing the pointer into the data block
    > by the struct size (16 bit ints?)


    Yes -- I said it was old! It generated code for MS-DOS and 16-bit
    Windows.

    >
    > add bx,4
    >
    > The compiler obviously did convert the
    > "array access" into a "pointer access".


    Quite true -- back when processors were sufficiently primitive that it
    made a difference, they did the optimization. Nowadays, they don't
    because it's an ineffective waste of time.

    [ ... ]

    > What do we see here? The "array-loop" uses a different
    > addressing scheme (indirect register w/base and displacement)
    > compared to the "pointer-loop". This proves my point again,
    > the pointer loop will be optimized into simpler and
    > shorter machine instructions. Due to modern processor
    > architectures and instruction translation/reordering,
    > one can't deduce from the above code which is faster -
    > but this might *possibly* pop up on some architectures.


    Pardon my being blunt, but "Nonsense". People who write compilers have
    known about how to do this transformation for _years_. The reason they
    don't do it (at least on the x86) is simple: it's no longer effective.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Jun 11, 2008
    #9
  10. Konrad Mühler

    Mirco Wahab Guest

    Jerry Coffin wrote:
    > In article <g2o6p6$14gf$-freiberg.de>, -
    > halle.de says...
    >> All three points are somehow unfounded, imho.

    >
    > The OP asked a fairly specific question, including a definition of the
    > structure of interest. Adding more to that structure simply because the
    > one he cares about doesn't "support your view" seems disingenuous at
    > best.


    No, he didn't. It was simply an abstract question regarding the
    efficiency of array vs. pointer access into data aggregates,
    possibly simple structs (we don't even know). How can you
    say it was /a fairly specific question/?

    Also, he didn't give any /definition of the structure of interest/
    anywhere. He used a generalized idiom that wouldn't even compile.
    What's your point in making such statements?

    > The problem isn't so much the indirection itself as the fact
    > that when making an indirect call, the compiler often can't (or at least
    > doesn't) carry out optimizations that it would carry out otherwise. If
    > the function would normally be called indirectly, that's fine and
    > wonderful -- but there's no indication that would be the case here. As
    > such, the testing method may be having a substantial effect on the
    > result.


    That's entirely the point. The compiler *shouldn't* do anything
    that we don't want it to do. Why would we create a test case
    that probably destroys the interesting part through unpredictable
    *global* optimizations? So the indirect call served exactly
    the purpose I intended. Fine.

    >> Why not to include some other data item in between?

    >
    > Because the OP specified exactly this structure as being what he cares
    > about.


    No, he used some symbolic code in order to make his point.
    He even added /etc./ (whatever that means ;-)

    >> This is just a sequential block access which can
    >> be translated to assembler using only one index
    >> register, one counter, and one accumulator.

    >
    > Sure -- and it corresponds directly to what the OP said he cared about.


    Yes and no ;-)

    >> This makes my point absolutely clear: the
    >> array access is *not* optimized away by
    >> your compiler (maybe this is the asm from the
    >> "debug" settings?). If the above is really
    >> faster on a machine, that's a problem of
    >> the instruction decoding/reordering or
    >> whatever.

    >
    > Yes and no -- I decided it got too long, but probably should have left
    > in a more detailed analysis of the code. The fact is that yes, there is
    > a difference in the generated code. The other fact is that there's not
    > even a theoretical difference in speed. The parts that are different are
    > between adding one each iteration, then multiplying by 8, or adding 8
    > either iteration. The addition is the same speed either way. The
    > multiplication by 8 doesn't change the amount of time either.


    That's not really conclusive for me. As you probably know, the scaled
    index variant (mov edi, DWORD PTR [ecx+esi*8+4]) is at least one byte
    longer in opcode length than the unscaled one (mov esi, DWORD PTR [ecx+4]),
    according to the Intel instruction set manual. So your assertion
    may well be wrong.

    > On the original 8088/8086, there would have been a difference -- calculating an
    > effective address took a variable amount of time, and the more complex a
    > calculation used, the longer it took. On those processors, you really
    > would have gained something by transforming one form to the other -- and
    > compilers at the time did.
    > On every x86 since (at least) the 286, there has been an effective address
    > unit that can carry out the calculation in constant time.


    I'm not exactly sure you are correct here, but I dimly remember
    that scaled address generation was expensive up until the Pentium
    or even the P6 (frequent AGI stalls on the P5?).

    > That being the case, there's no difference in speed based on the complexity of the
    > effective address calculation -- addressing like [esi], [esi+ebx],
    > [esi+ebx*8] all take _exactly_ the same amount of time to generate on
    > every reasonably modern processor.


    You have all sorts of issues here: pipelining, instruction size
    differences, used-up registers (modifications under way in the
    pipeline), and so on. /all take _exactly_ the same amount of time/
    is nothing I would bet my hand on.

    >> What do we see here? The "array-loop" uses a different
    >> addressing scheme (indirect register w/base and displacement)
    >> compared to the "pointer-loop". This proves my point again,
    >> the pointer loop will be optimized into simpler and
    >> shorter machine instructions. Due to modern processor
    >> architectures and instruction translation/reordering,
    >> one can't deduce from the above code which is faster -
    >> but this might *possibly* pop up on some architectures.

    >
    > Pardon my being blunt, but "Nonsense". People who write compilers have
    > known about how to do this transformation for _years_. The reason they
    > don't do it (at least on the x86) is simple: it's no longer effective.


    Nice argument. As you know, there are several implementations of
    the x86 instruction set, and from running the programs posted
    earlier I can see a 10-25% difference in effectiveness between
    addressing modes with the same program on different CPUs.
    And this *has* to do with the OP's question: will there
    be /bad performance/ or not.

    What exactly is the *nonsense* you found in your critique
    of my paragraph above?

    Regards & Thanks

    Mirco
     
    Mirco Wahab, Jun 11, 2008
    #10
  11. Konrad Mühler

    Jerry Coffin Guest

    In article <g2p47e$1d1q$-freiberg.de>, -
    halle.de says...

    [ ... ]

    > That's entirely the point. The compiler *shouldn't* do anything
    > that we don't want it to do. Why would we create a test case
    > that probably destroys the interesting part through unpredictable
    > *global* optimizations? So the indirect call served exactly
    > the purpose I intended. Fine.


    In that case, it appears that your purpose was to "prove" a foregone
    conclusion rather than to provide information that could possibly be of
    use to anybody else.

    > >> Why not include some other data item in between?

    > >
    > > Because the OP specified exactly this structure as being what he cares
    > > about.

    >
    > No, he used some symbolic code in order to make his point.
    > He even added /etc./ (whatever that means ;-)


    So why do you assume that he must have been asking about something he
    didn't even mention?

    [ ... ]

    > That's not really conclusive for me. As you probably know, the scaled
    > index variant (mov edi, DWORD PTR [ecx+esi*8+4]) is at least one byte
    > longer in opcode length than the unscaled one (mov esi, DWORD PTR [ecx+4]),
    > according to the Intel instruction set manual. So your assertion
    > may well be wrong.


    Well, at least you've had the courtesy to not draw a conclusion -- a
    wise decision, since you don't seem to know much of what you're talking
    about.

    [ ... ]

    > I'm not exactly sure you are correct here, but I dimly remember
    > that scaled address generation was expensive up until the Pentium
    > or even the P6 (frequent AGI stalls on the P5?).


    Nonsense. AGI had to do with having two instructions in which a register
    was the destination of one instruction, and then used as the source
    address in the next. For example, code like:

    mov ebx, 1234h
    mov eax, [ebx]

    would have caused an AGI stall between the two instructions. First of
    all, however, the CPUs for which this is true are well and truly
    obsolete -- even if you get your computers by dumpster diving, you're
    going to end up with something much newer than this. Second, scaling the
    address doesn't contribute to AGI stalls in any way.

    [ ... ]

    > > Pardon my being blunt, but "Nonsense". People who write compilers have
    > > known about how to do this transformation for _years_. The reason they
    > > don't do it (at least on the x86) is simple: it's no longer effective.

    >
    > Nice argument. As you know, there are several implementations of
    > the x86 instruction set, and from running the programs posted
    > earlier I can see a 10-25% difference in effectiveness between
    > addressing modes with the same program on different CPUs.
    > And this *has* to do with the OP's question: will there
    > be /bad performance/ or not.
    >
    > What exactly is the *nonsense* you found in your critique
    > of my paragraph above?


    The fact that when you get down to it, all you're doing is throwing
    around FUD. You bring up things like AGI stalls and possible differences
    in architecture, not because they're real but because you've apparently
    reached a conclusion through prejudice and are doing whatever you can to
    protect it.

    It's certainly true that under some circumstance or other, you might see
    some difference in speed between two loops that do the same thing -- but
    this remains equally true whether one uses pointer-style notation and
    the other subscripting notation, or both use the same notation. So far
    you've provided precisely zero real support for the claim that one will
    consistently execute more efficiently than the other. Quite the
    contrary, I've proven that in the one case where there really was a
    provable difference, it favored the one you said should be slower.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Jun 11, 2008
    #11
  12. Konrad Mühler

    Mirco Wahab Guest

    Jerry Coffin wrote:
    > It's certainly true that under some circumstance or other, you might see
    > some difference in speed between two loops that do the same thing -- but
    > this remains equally true whether one uses pointer-style notation and
    > the other subscripting notation, or both use the same notation. So far
    > you've provided precisely zero real support for the claim that one will
    > consistently execute more efficiently than the other. Quite the
    > contrary, I've proven that in the one case where there really was a
    > provable difference, it favored the one you said should be slower.


    This doesn't make much sense if you don't read others' results and/or
    conclusions thoroughly, and just keep repeating your assumptions
    while throwing around accusations of 'prejudice' and using improper
    wording.

    Bye bye & thanks for your time spent

    Mirco
     
    Mirco Wahab, Jun 11, 2008
    #12
