array index and pointer, which is faster?

R

Richard Heathfield

Paul Hsieh said:

In his searching and sorting book [Knuth] presents the algorithms and
measures their performance complexity with painstaking detail. Its
odd for someone who claims optimization is the root of all evil.

Knuth made no such claim.

But if we his: "We should forget about small efficiencies, say about
97% of the time: premature optimization is the root of all evil."

That's a bit closer to the mark, and the difference is a significant one.

<snip>
 
J

James Harris

p is an array and N is a large number. Which of the following two
loops is faster?

Case1:

for(i=0; i<N; i++){
p = i;

}

Case2:

for(i=0; i<N; i++){
*p = i;
p++;

}

It seems that Case 1 needs multiplication/shifting to calculate the
address/index and the second case needs only an ADD operation. But I
tested it, the speed is almost the same. Can anybody tell me why?
Thanks.


As others have hinted/said since the overall effect of either
algorithm is the same then in theory the relative speeds depend on the
effectiveness of the compiler optimisations. Different optimisation
settings may change the code you get. In this case it looks as though
little optimisation has happened in either case.

The main reason the speeds are similar in this case is that no
mutiplication is needed to address the elements of the array. Compare
a) the addressing, b) the loop maintenance. In code fragment 1 they
are

a)
mov dword ptr [eax+ecx*4],ecx
b)
add ecx,1
cmp ecx,5F5E100h ;cmp i with N
jl main+20h (401020h)

and in code fragment 2 they are

a)
mov dword ptr [eax],ecx
b)
add ecx,1
add eax,4
cmp ecx,5F5E100h
jl main+20h (401020h)

If memory serves, on the Pentium and above there is no penalty in
using a register multiplied by 2, 4 or 8 in an address expression so
[eax+ecx*4] in the first case executes as quickly as [eax] in the
second case.

That just leaves the loop maintenance. Neither of these does anything
complex so each instruction should operate in about half a cycle due
to pairing. The fist one looks like a 2- or 3-cycle loop. The second
looks like a 3-cycle loop. The limiting factor in either case may
therefore be the speed at which the memory can be written which itself
may depend on whether some or all of the array memory is already
cached.

Of course, the array indexing loop could be made lower overhead by
decrementing

a)
mov dword ptr [eax+ecx*4],ecx
b)
dec ecx
jnz main+20h (401020h)

or it could be unrolled (multiple mov instructions in each iteration
and fewer iterations) or, depending on the CPU, some data lines could
be prefetched into cache.
For your reference, the assembly code is as follows:

For Case 1
for(i=0; i<N; i++){
0040101B xor ecx,ecx
0040101D lea ecx,[ecx]
p = i;
00401020 mov dword ptr [eax+ecx*4],ecx
00401023 add ecx,1
00401026 cmp ecx,5F5E100h
0040102C jl main+20h (401020h)
//p++;
}

for Case 2, it is:
for(i=0; i<N; i++){
0040101B xor ecx,ecx
0040101D lea ecx,[ecx]
*p = i;
00401020 mov dword ptr [eax],ecx
00401022 add ecx,1
p++;
00401025 add eax,4
00401028 cmp ecx,5F5E100h
0040102E jl main+20h (401020h)
}
 
H

Herbert Rosenau

p is an array and N is a large number. Which of the following two
loops is faster?

This is of no interest. A halfways modern compiler will generate
exactly the same code in any case and more often it may neither use
what you defined in case1 or case 2 but another instead:

A: for (pe = p+N, i = 0; p < pe; *p++ = i++) ;
or
B: for (pe = p+N, i = 0; (*p++ = i) < pe; i++) ;

The code the optimiser will generate depends on the instruction set
the mashine it rouns on is defined. Some mashines will do this

_ip stands for instruction pointer in this C like assembly (1 line per
instruction):

i = 0;
pe = p+N;
_ip++; /* left out 1 instruction */
*p++ = i;
i += 1;
p < pe ? _ip -= 3; /* continues with i++ when condition true */
/* loop done */
i -= 1; /* correct i because i will be used later */

The best one can do is to write the loop to be most readable to the
human reader and let the compiler do the best possible optimation on
it. The optimiser will find out best way for the mashine the compiler
is designed for.

20 years ago I had prefered A on one mashine, B on another. Today I
prefere a because it is a bit more readable to human.

--
Tschau/Bye
Herbert

Visit http://www.ecomstation.de the home of german eComStation
eComStation 1.2R Deutsch ist da!
 
C

christian.bau

On a related note, considering micro-optimization is what allows one
to realize that heap sort has gotten faster over time because of
changing micro-processor technology:

   http://www.pobox.com/~qed/sort.html

I had a look at this. Just wanted to mention some interesting things
from your page and the linked article to heapsort that people usually
won't find in a text book:

1. Quicksort with randomised pivot doesn't give minimal number of
comparison. Using the median of three numbers as the pivot is better.
Even better (with random input data) would be to do a few comparisons,
check how many went each way, and possibly decide that you picked a
bad pivot value.

2. Heapsort has some rather trivial improvements that reduce the
comparisons considerably.

3. With modern architectures, the number of comparisons isn't
everything. The number of _mispredicted_ comparisons is also
important. (This tends to make heapsort faster than expected. It also
means that changes to Quicksort to reduce the number of comparisons
might be counterproductive, because the number of mispredicted
comparisons grows).

4. Not mentioned on your page: If comparisons are cheap compared to
moving of elements, then you don't actually want to minimise the
number of comparisons in quicksort. Using a pivot that splits the
array unevenly will reduce the number of moves.
 
M

Micah Cowan

christian.bau said:
1. Quicksort with randomised pivot doesn't give minimal number of
comparison. Using the median of three numbers as the pivot is better.
Even better (with random input data) would be to do a few comparisons,
check how many went each way, and possibly decide that you picked a
bad pivot value.

Is that true? How are the three numbers determined? If those three
included the two worst values, wouldn't that lead to worst-case
performance?
 
C

CBFalconer

Micah said:
Is that true? How are the three numbers determined? If those
three included the two worst values, wouldn't that lead to
worst-case performance?

No selection guarantees avoiding a worst case O(n*n) performance.
The usual technique, with three values, is to pick the end values
of the segment, and the (physically) center value. Then select the
median of those.

For a slightly slower sort, but guaranteed O(NlogN) performance,
try heapsort or mergesort. Heap is preferable for arrays, merge
for linked lists. mergesort is also stable.
 
P

pete

Paul said:
Donald Knuth said, paraphrasing Hoare[1],
And do you pray to the esteemed Mr. Knuth as well?

No. Why? Is there a problem with his email?
why have you not questioned Knuth
about his obsession over sorting?

Why don't you?

Send your comments either by email to (e-mail address removed)
or by old-fashioned mail to

Donald E. Knuth
Computer Science Department
Gates Building 4B
Stanford University
Stanford, CA 94305-9045 USA.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,432
Messages
2,571,680
Members
48,796
Latest member
Greg L.

Latest Threads

Top