Why is std::sort ~20% faster than hand-coded introsort?


ikarus

Hello C++ Gurus!

I'm comparing sorting algorithms for study purposes. I've compared STL
std::sort and a hand-coded introsort on arrays of millions (tens of
millions) of integers, tested on random, sorted, and reverse-sorted
input. In all tests on such huge arrays, std::sort is faster by ~1/4
than my version, so it seems there is some call optimization.

My 'hand-coded' version is based on D. Musser's suggestions and on an
STL header I found in different STL installations (GCC and MSVS 2008).
Maximum optimization flags were used in both cases, for g++ and cl.
But the 'hand-coded' version is still slower.

Why? Maybe the production std::sort uses some SIMD optimization
internally, or hand-written assembler?

Thanks in advance.
Alex.

Pseudocode (C++, may contain bugs) of my introsort version:

// CODE

template <typename T>
inline void swap(T *a, T *b)
{
    T temp = *a;
    *a = *b;
    *b = temp;
}

template <typename T>
inline void cmpswap(T *a, T *b)
{
    if (*a > *b) {
        T temp = *a;
        *a = *b;
        *b = temp;
    }
}

template <typename T>
inline T *partition(T *first, T *last)
{
    T *v = last-1;
    first--;

    do {
        while (*(++first) > *v);
        while (*last-- < *v) if (first == last) break;

        if (first >= last) break;
        swap(first, last);
    } while (true);

    swap(v, first);
    return ++first;
}

template <typename T>
inline void median3(T *first, T *last)
{
    T *r = last-1, *rr = last-2;
    swap(first + (last - first)/2, rr);
    cmpswap(first, rr);
    cmpswap(first, r);
    cmpswap(rr, r);
}

template <typename T>
inline void _introsort(T *first, T *last, int recurs_depth)
{
    while (last - first > M) {
        if (recurs_depth--) return hsort(first, last);

        median3(first, last);
        T *k = partition(first+1, last-1);
        introsort(k, last, recurs_depth);
        last = k;
    }
}

template <typename T>
void introsort(T *first, T *last, int recurs_depth)
{
    _introsort(first, last, recurs_depth);
    insertion(first, last);
}

// CODE END
 

Juha Nieminen

ikarus said:

Because std::sort() skips subpartitioning partitions smaller than a
certain size, and instead sorts those partitions with insertion sort.
That accounts for approximately that 20% speed increase.

I don't know exactly how small the partition must be before it skips
subpartitioning it, but according to my own tests something between 15
and 25 elements seems optimal (although it depends on the element type).

If you add this additional step to your sorting algorithm you will
most probably see the ~20% speed increase.
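
A minimal sketch of that cutoff scheme, assuming a plain Lomuto partition and a hypothetical threshold of 16 (both are just starting points to tune, not what any particular library uses):

```cpp
#include <algorithm>
#include <cstddef>

const std::ptrdiff_t CUTOFF = 16;  // assumed threshold; tune per platform

// Plain insertion sort for the small ranges.
void insertion_sort(int* first, int* last)
{
    for (int* i = first + 1; i < last; ++i) {
        int v = *i;
        int* j = i;
        while (j > first && *(j - 1) > v) { *j = *(j - 1); --j; }
        *j = v;
    }
}

// Lomuto partition with a middle-element pivot; returns the pivot's
// final position.
int* partition_range(int* first, int* last)
{
    std::swap(*(first + (last - first) / 2), *(last - 1));  // pivot to end
    int pivot = *(last - 1);
    int* store = first;
    for (int* p = first; p < last - 1; ++p)
        if (*p < pivot) std::swap(*p, *store++);
    std::swap(*store, *(last - 1));  // pivot into place
    return store;
}

// Quicksort that hands every range of CUTOFF or fewer elements to
// insertion sort instead of partitioning further.
void quicksort_with_cutoff(int* first, int* last)
{
    while (last - first > CUTOFF) {
        int* p = partition_range(first, last);
        quicksort_with_cutoff(p + 1, last);  // recurse right,
        last = p;                            // iterate left
    }
    insertion_sort(first, last);
}
```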
 

Juha Nieminen

Juha said:
Because std::sort() skips subpartitioning partitions smaller than a
certain size, and instead sorts those partitions with insertion sort.

Oh, and by the way: When I say insertion sort, I mean insertion sort.
I do *not* mean bubble sort, selection sort or any other type of sort.
I *do* mean insertion sort. Nothing else.

(Yes, I have seen tons of misconceptions about this subject. Insertion
sort is, in practice, the fastest comparison-based sort for very small
arrays. Don't even bother trying anything else.)
 

Jerry Coffin

ikarus said:
Hello C++ Gurus!

I'm comparing sorting algorithms for study purposes. I've compared STL
std::sort and a hand-coded introsort on arrays of millions (tens of
millions) of integers, tested on random, sorted, and reverse-sorted
input. In all tests on such huge arrays, std::sort is faster by ~1/4
than my version, so it seems there is some call optimization.

While it's impossible to say for _sure_ without even knowing which exact
library implementation you're looking at, the obvious guess would be
that the library is using a "modified introsort", which is essentially
the same as a modified quicksort, but (obviously enough) with the extra
book-keeping and switching of an introsort added.

The idea of a modified quicksort is pretty simple: below some number of
items, the "simple" sorts (insertion sort, selection sort, bubble sort)
are generally faster than a quicksort. Therefore, when you reach that
threshold, you're better off switching to something else (and the
threshold is rarely very critical).

A further optimization is due to Robert Sedgewick: he noted that you
still have a fair amount of overhead repeatedly calling the simple sort
on all those small partitions. He also noted that with an insertion sort
(unlike anything else) the speed doesn't really depend on the number of
items being sorted, but strictly upon the distances they need to be
moved during sorting -- therefore, rather than calling it once every
time a partition reaches a certain size threshold, you just _stop_ doing
more sorting when the size reaches the threshold, and then when all the
partitions have been reduced to that size, you call an insertion sort
_once_ for the entire collection. Even though you're using an insertion
sort on a large collection, its speed is proportional to the sizes of the
partitions (when you stopped partitioning) rather than to the size of
the entire collection.

The difference this makes will depend _heavily_ upon the speed of
function calls. For example, it makes a MUCH larger difference on an x86
than it does on a SPARC. Based on the 20% difference you noted, it's a
fair guess that you were probably testing on an x86 or something very
similar. More importantly, that overhead also has an effect on the point
at which you should stop partitioning -- there's no one size that's
right for all architectures (though, as I mentioned above, the threshold
isn't usually critical).
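
Sedgewick's "sort once at the end" variant described above can be sketched like this (the Lomuto partition and the threshold of 16 are illustrative choices, not what any particular library does):

```cpp
#include <algorithm>
#include <cstddef>

const std::ptrdiff_t THRESHOLD = 16;  // assumed stop size; tune per platform

int* partition_range(int* first, int* last)
{
    std::swap(*(first + (last - first) / 2), *(last - 1));  // pivot to end
    int pivot = *(last - 1);
    int* store = first;
    for (int* p = first; p < last - 1; ++p)
        if (*p < pivot) std::swap(*p, *store++);
    std::swap(*store, *(last - 1));  // pivot into final position
    return store;
}

// Phase 1: partition as usual, but simply *leave* every range of
// THRESHOLD or fewer elements unsorted.
void quicksort_loose(int* first, int* last)
{
    while (last - first > THRESHOLD) {
        int* p = partition_range(first, last);
        quicksort_loose(p + 1, last);
        last = p;
    }
}

// Phase 2: a single insertion sort over the whole array.  Because every
// partition already holds the right set of elements, no element has to
// move farther than the width of its own small partition.
void insertion_sort(int* first, int* last)
{
    for (int* i = first + 1; i < last; ++i) {
        int v = *i;
        int* j = i;
        while (j > first && *(j - 1) > v) { *j = *(j - 1); --j; }
        *j = v;
    }
}

void modified_quicksort(int* first, int* last)
{
    quicksort_loose(first, last);
    insertion_sort(first, last);
}
```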
 

Kai-Uwe Bux

Juha said:
Because std::sort() skips subpartitioning partitions smaller than a
certain size, and instead sorts those partitions with insertion sort.
That accounts for approximately that 20% speed increase.

I don't know exactly how small the partition must be before it skips
subpartitioning it, but according to my own tests something between 15
and 25 elements seems optimal (although it depends on the element type).

If you add this additional step to your sorting algorithm you will
most probably see the ~20% speed increase.

I think he already has that. In the code, the threshold would be M:

template <typename T>
inline void _introsort(T *first, T *last, int recurs_depth)
{
    while (last - first > M) {
        ...
    }
}

template <typename T>
void introsort(T *first, T *last, int recurs_depth)
{
    _introsort(first, last, recurs_depth);
    insertion(first, last);
}


Best

Kai-Uwe Bux
 

Kai-Uwe Bux

ikarus said:
Hello C++ Gurus!

I'm comparing sorting algorithms for study purposes. I've compared STL
std::sort and a hand-coded introsort on arrays of millions (tens of
millions) of integers, tested on random, sorted, and reverse-sorted
input. In all tests on such huge arrays, std::sort is faster by ~1/4
than my version, so it seems there is some call optimization.

My 'hand-coded' version is based on D. Musser's suggestions and on an
STL header I found in different STL installations (GCC and MSVS 2008).
Maximum optimization flags were used in both cases, for g++ and cl.
But the 'hand-coded' version is still slower.

Why? Maybe the production std::sort uses some SIMD optimization
internally, or hand-written assembler?

Thanks in advance.
Alex.

Pseudocode (C++, may contain bugs) of my introsort version:

// CODE

template <typename T>
inline void swap(T *a, T *b)
{
    T temp = *a;
    *a = *b;
    *b = temp;
}

template <typename T>
inline void cmpswap(T *a, T *b)
{
    if (*a > *b) {
        T temp = *a;
        *a = *b;
        *b = temp;
    }
}

template <typename T>
inline T *partition(T *first, T *last)
{
    T *v = last-1;
    first--;

    do {
        while (*(++first) > *v);
        while (*last-- < *v) if (first == last) break;

        if (first >= last) break;
        swap(first, last);
    } while (true);

    swap(v, first);
    return ++first;
}

template <typename T>
inline void median3(T *first, T *last)
{
    T *r = last-1, *rr = last-2;
    swap(first + (last - first)/2, rr);
    cmpswap(first, rr);
    cmpswap(first, r);
    cmpswap(rr, r);
}

template <typename T>
inline void _introsort(T *first, T *last, int recurs_depth)
{
    while (last - first > M) {
        if (recurs_depth--) return hsort(first, last);

Is this if() right? As written, the condition is true for any nonzero
recurs_depth, so it appears you fall back to heap sort almost
immediately.

        median3(first, last);
        T *k = partition(first+1, last-1);
        introsort(k, last, recurs_depth);
        last = k;
    }
}

template <typename T>
void introsort(T *first, T *last, int recurs_depth)
{
    _introsort(first, last, recurs_depth);
    insertion(first, last);
}

// CODE END


Best

Kai-Uwe Bux
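
The guard Kai-Uwe questions is indeed inverted: `recurs_depth--` is true for any nonzero value, so heap sort fires almost at once. A second problem is that the recursion goes through the `introsort` wrapper, re-running the insertion sort on every subrange. A self-contained corrected sketch (the helper names, the threshold, and the 2*log2(n) depth budget are illustrative; the fallback here just uses std::make_heap/std::sort_heap):

```cpp
#include <algorithm>
#include <cstddef>

const std::ptrdiff_t M = 16;  // insertion-sort threshold, as in the original

void insertion_sort(int* first, int* last)
{
    for (int* i = first + 1; i < last; ++i) {
        int v = *i;
        int* j = i;
        while (j > first && *(j - 1) > v) { *j = *(j - 1); --j; }
        *j = v;
    }
}

// Lomuto partition; returns the pivot's final position.
int* partition_range(int* first, int* last)
{
    std::swap(*(first + (last - first) / 2), *(last - 1));  // pivot to end
    int pivot = *(last - 1);
    int* store = first;
    for (int* p = first; p < last - 1; ++p)
        if (*p < pivot) std::swap(*p, *store++);
    std::swap(*store, *(last - 1));
    return store;
}

void introsort_loop(int* first, int* last, int depth)
{
    while (last - first > M) {
        if (depth-- == 0) {               // depth budget exhausted:
            std::make_heap(first, last);  // fall back to heap sort
            std::sort_heap(first, last);
            return;
        }
        int* p = partition_range(first, last);
        introsort_loop(p + 1, last, depth);  // recurse into the *loop*, not
        last = p;                            // the wrapper, so the final
    }                                        // insertion sort runs only once
}

int depth_limit(std::ptrdiff_t n)
{
    int d = 0;
    while (n > 1) { n >>= 1; d += 2; }  // roughly 2*log2(n)
    return d;
}

void introsort_fixed(int* first, int* last)
{
    introsort_loop(first, last, depth_limit(last - first));
    insertion_sort(first, last);
}
```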
 

James Kanze

Jerry said:
While it's impossible to say for _sure_ without even knowing
which exact library implementation you're looking at, the
obvious guess would be that the library is using a "modified
introsort", which is essentially the same as a modified
quicksort, but (obviously enough) with the extra book-keeping
and switching of an introsort added.
The idea of a modified quicksort is pretty simple: below some
number of items, the "simple" sorts (insertion sort, selection
sort, bubble sort) are generally faster than a quicksort.
Therefore, when you reach that threshold, you're better off
switching to something else (and the threshold is rarely very
critical).
A further optimization is due to Robert Sedgewick: he noted
that you still have a fair amount of overhead repeatedly
calling the simple sort on all those small partitions. He also
noted that with an insertion sort (unlike anything else) the
speed doesn't really depend on the number of items being
sorted, but strictly upon the distances they need to be moved
during sorting -- therefore, rather than calling it once every
time a partition reaches a certain size threshold, you just
_stop_ doing more sorting when the size reaches the threshold,
and then when all the partitions have been reduced to that
size, you call an insertion sort _once_ for the entire
collection. Even though you're using an insertion sort on a
large collection, its speed is proportional to the sizes of the
partitions (when you stopped partitioning) rather than to the
size of the entire collection.
The difference this makes will depend _heavily_ upon the speed
of function calls. For example, it makes a MUCH larger
difference on an x86 than it does on a SPARC. Based on the 20%
difference you noted, it's a fair guess that you were probably
testing on an x86 or something very similar. More importantly,
that overhead also has an effect on the point at which you
should stop partitioning -- there's no one size that's right
for all architectures (though, as I mentioned above, the
threshold isn't usually critical).

Two other possibilities to consider:

-- Except for random data, quicksort is very sensitive to how
the pivot is chosen. If he's constantly seeing roughly the
same difference (including with already sorted arrays), then
this probably isn't the case, but it can certainly explain
significant differences in some specific cases.

-- Using the modified quicksort, it's possible to gain a little
time once you go to do the final insertion sort by first
searching for the smallest value (which can only be in the
first partition, so you don't have to search far), and
putting it in place first. Having done that, you can drop
the test for having reached the beginning of the array in
the loop of the insertion sort, since you're guaranteed to
find a value which precedes the one you're inserting before
reaching it.
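
That last trick can be sketched as follows (a standalone illustration: here the minimum is found with std::min_element over the whole range, whereas in the real algorithm you would only need to scan the first partition):

```cpp
#include <algorithm>
#include <cstddef>

// Insertion sort with a sentinel: once the smallest element sits at the
// front, the inner loop no longer needs a "reached the beginning" test,
// because a smaller-or-equal element is guaranteed to stop it.
void sentinel_insertion_sort(int* first, int* last)
{
    if (last - first < 2) return;
    std::iter_swap(first, std::min_element(first, last));  // min to front
    for (int* i = first + 1; i < last; ++i) {
        int v = *i;
        int* j = i;
        while (*(j - 1) > v) { *j = *(j - 1); --j; }  // no j > first check
        *j = v;
    }
}
```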
 

ikarus

Thanks to everybody for the suggestions. But as I mentioned, I'm
already using most of the suggested improvements, except searching for
the minimal element only in the leftmost 'subproblem' to set it up as
a 'sentinel' for the insertion sort used in the last stage. I've
already tried playing with the M coefficient, which is the array size
at which insertion sort comes into play. It doesn't give any notable
improvement; std::sort is still faster by 4 times on average (not 20%,
sorry for misleading you) on different sets of data.

The reference implementation of introsort was one from D. Musser's
papers, another from the STL algorithm headers of the MSVS 2008 C++
implementation, which looks to me like plain introsort, but I'm not
sure that is what actually gets used during compilation; maybe it is
only there for compatibility and some other internal code is used
(I'll check this).

Jerry said:
The difference this makes will depend _heavily_ upon the speed of
function calls. For example, it makes a MUCH larger difference on an x86
than it does on a SPARC. Based on the 20% difference you noted, it's a
fair guess that you were probably testing on an x86 or something very
similar.

Yes, I agree, it seems to me too that the problem is some 'call'
overhead, but I'm using templates with the inline keyword and all the
needed optimization flags (/Ox for MS's cl, -O3 for MinGW g++). I also
looked at the generated assembler, and yes, most (all) of the
functions used are inlined.

I can't see any inline assembler or similar optimizations in the STL
implementation which I'm using as reference, but it is faster by 4
times.

So how can we avoid this overhead using only C++ language constructs?
 

Martin Eisenberg

Jerry said:
A further optimization is due to Robert Sedgewick: he noted that
you still have a fair amount of overhead repeatedly calling the
simple sort on all those small partitions.

Do you have a reference handy?


Thanks,
Martin
 

sean_in_raleigh

Why not just read the source of the GNU std::sort() and see what
they're doing differently from you?

Sean
 

Bernd Strieder

Hello,

ikarus said:
I'm comparing sorting algorithms for study purposes.
But the 'hand-coded' version is still slower.

Why? Maybe the production std::sort uses some SIMD optimization
internally, or hand-written assembler?

Pseudocode (C++, may contain bugs) of my introsort version:

// CODE

template <typename T>
inline void swap(T *a, T *b)
{
    T temp = *a;
    *a = *b;
    *b = temp;
}

First, some stylistic critique: the swap of standard C++ is supposed
to be used differently; reusing that name with that interface is
highly confusing.

A possible cause for the slowdown: for nontrivial classes T with a
proper swap, this implementation is suboptimal. You are doing three
full copies of T instances, possibly involving three allocations and
deallocations. If there is a std::swap for T objects, that one should
be used. This alone could well account for a 20% difference, because
the number of swaps is of the same order as the number of comparisons
in many sorting algorithms, so a suboptimal swap adds directly to the
overall constant of the algorithm, e.g. if T is std::string. If T is a
simple type like int, then there is probably no big problem with your
swap. I haven't seen the types you used for T, so you will have to try
it and see whether this plays a role.

Bernd Strieder
 
