Why is std::sort ~20% faster than introsort coded by hand?

Discussion in 'C++' started by ikarus, Sep 14, 2008.

  1. ikarus

    ikarus Guest

    Hello C++ Gurus!

    I'm comparing sorting algorithms for study purposes. I've compared STL
    std::sort and a hand-coded introsort on arrays of millions (tens of
    millions) of integers. It was tested on random, sorted, and reverse-sorted
    arrays. In all tests on such huge arrays std::sort is faster than my
    version by ~1/4, so it seems there is some call optimization.

    My 'hand-coded' version is based on D. Musser's suggestions and on an STL
    header which I found in different STL installations (GCC and MSVS 2008).
    The maximum optimization flags were used in both cases, for g++ and cl,
    but the 'hand-coded' version is still slower.

    Why? Maybe the production std::sort version uses some SIMD optimization
    internally, or assembler optimization?

    Thanks in advance.
    Alex.

    Pseudocode (C++, may contain bugs) of the introsort version:

    // CODE

    // M is the cutoff below which partitions are left unsorted for the final
    // insertion sort pass; hsort() and insertion() are the heap sort and
    // insertion sort routines (not shown).

    template <typename T>
    inline void swap(T *a, T *b)
    {
        T temp = *a;
        *a = *b;
        *b = temp;
    }

    template <typename T>
    inline void cmpswap(T *a, T *b)
    {
        if (*a > *b) {
            T temp = *a;
            *a = *b;
            *b = temp;
        }
    }

    template <typename T>
    inline T *partition(T *first, T *last)
    {
        T *v = last-1;
        first--;

        do {
            while (*(++first) > *v);
            while (*last-- < *v) if (first == last) break;

            if (first >= last) break;
            swap(first, last);
        } while (true);

        swap(v, first);
        return ++first;
    }

    template <typename T>
    inline void median3(T *first, T *last)
    {
        T *r = last-1, *rr = last-2;
        swap(first + (last - first)/2, rr);
        cmpswap(first, rr);
        cmpswap(first, r);
        cmpswap(rr, r);
    }

    template <typename T>
    inline void _introsort(T *first, T *last, int recurs_depth)
    {
        while (last - first > M) {
            if (recurs_depth--) return hsort(first, last);

            median3(first, last);
            T *k = partition(first+1, last-1);
            introsort(k, last, recurs_depth);
            last = k;
        }
    }

    template <typename T>
    void introsort(T *first, T *last, int recurs_depth)
    {
        _introsort(first, last, recurs_depth);
        insertion(first, last);
    }

    // CODE END
     
    ikarus, Sep 14, 2008
    #1

  2. ikarus wrote:
    > Why?


    Because std::sort() skips subpartitioning partitions smaller than a
    certain size, and instead sorts those partitions with insertion sort.
    That accounts for approximately that 20% speed increase.

    I don't know exactly how small the partition must be before it skips
    subpartitioning it, but according to my own tests something between 15
    and 25 elements seems optimal (although it depends on the element type).

    If you add this additional step to your sorting algorithm you will
    most probably see the ~20% speed increase.
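
    For illustration, here is a rough sketch of that idea (just my own sketch,
    not the actual library code; the helper names and the cutoff value of 16
    are arbitrary):

    #include <algorithm>   // std::swap

    template <typename T>
    void insertion_sort(T *first, T *last)      // sorts [first, last)
    {
        if (last - first < 2) return;
        for (T *i = first + 1; i < last; ++i) {
            T v = *i;
            T *j = i;
            for (; j > first && v < *(j - 1); --j)   // shift larger elements up
                *j = *(j - 1);
            *j = v;
        }
    }

    const int CUTOFF = 16;   // somewhere around 15-25 seems to work well

    template <typename T>
    void quicksort_with_cutoff(T *first, T *last)
    {
        if (last - first <= CUTOFF) {
            insertion_sort(first, last);        // small partition: stop recursing
            return;
        }
        T *pivot = first + (last - first) / 2;
        std::swap(*pivot, *(last - 1));         // park the pivot at the end
        T *store = first;
        for (T *p = first; p < last - 1; ++p)   // simple Lomuto partition
            if (*p < *(last - 1))
                std::swap(*p, *store++);
        std::swap(*store, *(last - 1));         // pivot into its final place
        quicksort_with_cutoff(first, store);
        quicksort_with_cutoff(store + 1, last);
    }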
     
    Juha Nieminen, Sep 14, 2008
    #2

  3. Juha Nieminen wrote:
    > Because std::sort() skips subpartitioning partitions smaller than a
    > certain size, and instead sorts those partitions with insertion sort.


    Oh, and by the way: When I say insertion sort, I mean insertion sort.
    I do *not* mean bubble sort, selection sort or any other type of sort.
    I *do* mean insertion sort. Nothing else.

    (Yes, I have seen tons of misconceptions about this subject. Insertion
    sort is the most optimal comparison-based sorting algorithm for very
    small arrays. Don't even bother trying anything else.)
     
    Juha Nieminen, Sep 14, 2008
    #3
  4. ikarus

    Jerry Coffin Guest

    ikarus wrote:
    > Hello C++ Gurus!
    >
    > I'm comparing sorting algorithms for study purposes. I've compared STL
    > std::sort and a hand-coded introsort on arrays of millions (tens of
    > millions) of integers. It was tested on random, sorted, and reverse-sorted
    > arrays. In all tests on such huge arrays std::sort is faster than my
    > version by ~1/4, so it seems there is some call optimization.


    While it's impossible to say for _sure_ without even knowing which exact
    library implementation you're looking at, the obvious guess would be
    that the library is using a "modified introsort", which is essentially
    the same as a modified quicksort, but (obviously enough) with the extra
    book-keeping and switching of an introsort added.

    The idea of a modified quicksort is pretty simple: below some number of
    items, the "simple" sorts (insertion sort, selection sort, bubble sort)
    are generally faster than a quicksort. Therefore, when you reach that
    threshold, you're better off switching to something else (and the
    threshold is rarely very critical).

    A further optimization is due to Robert Sedgewick: he noted that you
    still have a fair amount of overhead repeatedly calling the simple sort
    on all those small partitions. He also noted that with an insertion sort
    (unlike anything else) the speed doesn't really depend on the number of
    items being sorted, but strictly upon the distances they need to be
    moved during sorting -- therefore, rather than calling it once every
    time a partition reaches a certain size threshold, you just _stop_ doing
    more sorting when the size reaches the threshold, and then when all the
    partitions have been reduced to that size, you call an insertion sort
    _once_ for the entire collection. Even though you're using an insertion
    sort on a large collection, its speed is proportional to the sizes of the
    partitions (when you stopped partitioning) rather than to the size of
    the entire collection.

    The difference this makes will depend _heavily_ upon the speed of
    function calls. For example, it makes a MUCH larger difference on an x86
    than it does on a SPARC. Based on the 20% difference you noted, it's a
    fair guess that you were probably testing on an x86 or something very
    similar. More importantly, that overhead also has an effect on the point
    at which you should stop partitioning -- there's no one size that's
    right for all architectures (though, as I mentioned above, the threshold
    isn't usually critical).

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Sep 14, 2008
    #4
  5. ikarus

    Kai-Uwe Bux Guest

    Juha Nieminen wrote:

    > ikarus wrote:
    >> Why?

    >
    > Because std::sort() skips subpartitioning partitions smaller than a
    > certain size, and instead sorts those partitions with insertion sort.
    > That accounts for approximately that 20% speed increase.
    >
    > I don't know exactly how small the partition must be before it skips
    > subpartitioning it, but according to my own tests something between 15
    > and 25 elements seems optimal (although it depends on the element type).
    >
    > If you add this additional step to your sorting algorithm you will
    > most probably see the ~20% speed increase.


    I think he already has that. In his code, the threshold would be M:

    template <typename T>
    inline void _introsort(T *first, T *last, int recurs_depth)
    {
        while (last - first > M) {
            ...
        }
    }

    template <typename T>
    void introsort(T *first, T *last, int recurs_depth)
    {
        _introsort(first, last, recurs_depth);
        insertion(first, last);
    }


    Best

    Kai-Uwe Bux
     
    Kai-Uwe Bux, Sep 14, 2008
    #5
  6. ikarus

    Kai-Uwe Bux Guest

    ikarus wrote:

    > Hello C++ Gurus!
    >
    > I'm comparing sorting algorithms for study purposes. I've compared STL
    > std::sort and a hand-coded introsort on arrays of millions (tens of
    > millions) of integers. It was tested on random, sorted, and reverse-sorted
    > arrays. In all tests on such huge arrays std::sort is faster than my
    > version by ~1/4, so it seems there is some call optimization.
    >
    > My 'hand-coded' version is based on D. Musser's suggestions and on an STL
    > header which I found in different STL installations (GCC and MSVS 2008).
    > The maximum optimization flags were used in both cases, for g++ and cl,
    > but the 'hand-coded' version is still slower.
    >
    > Why? Maybe the production std::sort version uses some SIMD optimization
    > internally, or assembler optimization?
    >
    > Thanks in advance.
    > Alex.
    >
    > Pseudocode (C++, may contain bugs) of the introsort version:
    >
    > // CODE
    >
    > template <typename T>
    > inline void swap(T *a, T *b)
    > {
    >     T temp = *a;
    >     *a = *b;
    >     *b = temp;
    > }
    >
    > template <typename T>
    > inline void cmpswap(T *a, T *b)
    > {
    >     if (*a > *b) {
    >         T temp = *a;
    >         *a = *b;
    >         *b = temp;
    >     }
    > }
    >
    > template <typename T>
    > inline T *partition(T *first, T *last)
    > {
    >     T *v = last-1;
    >     first--;
    >
    >     do {
    >         while (*(++first) > *v);
    >         while (*last-- < *v) if (first == last) break;
    >
    >         if (first >= last) break;
    >         swap(first, last);
    >     } while (true);
    >
    >     swap(v, first);
    >     return ++first;
    > }
    >
    > template <typename T>
    > inline void median3(T *first, T *last)
    > {
    >     T *r = last-1, *rr = last-2;
    >     swap(first + (last - first)/2, rr);
    >     cmpswap(first, rr);
    >     cmpswap(first, r);
    >     cmpswap(rr, r);
    > }
    >
    > template <typename T>
    > inline void _introsort(T *first, T *last, int recurs_depth)
    > {
    >     while (last - first > M) {
    >         if (recurs_depth--) return hsort(first, last);


    Is this if() right? It appears that you are using heap sort unless your
    guess for recurs_depth is exactly 1.
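
    Presumably the intended test is the other way around, something like this
    sketch:

    // fall back to heap sort only once the recursion-depth budget is used up
    if (recurs_depth-- == 0) {
        hsort(first, last);
        return;
    }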

    >         median3(first, last);
    >         T *k = partition(first+1, last-1);
    >         introsort(k, last, recurs_depth);
    >         last = k;
    >     }
    > }
    >
    > template <typename T>
    > void introsort(T *first, T *last, int recurs_depth)
    > {
    >     _introsort(first, last, recurs_depth);
    >     insertion(first, last);
    > }
    >
    > // CODE END



    Best

    Kai-Uwe Bux
     
    Kai-Uwe Bux, Sep 14, 2008
    #6
  7. ikarus

    James Kanze Guest

    On Sep 14, 4:25 pm, Jerry Coffin wrote:


    > > I'm comparing sorting algorithms for study purposes. I've compared
    > > STL std::sort and a hand-coded introsort on arrays of millions (tens
    > > of millions) of integers. It was tested on random, sorted, and
    > > reverse-sorted arrays. In all tests on such huge arrays std::sort is
    > > faster than my version by ~1/4, so it seems there is some call
    > > optimization.


    > While it's impossible to say for _sure_ without even knowing
    > which exact library implementation you're looking at, the
    > obvious guess would be that the library is using a "modified
    > introsort", which is essentially the same as a modified
    > quicksort, but (obviously enough) with the extra book-keeping
    > and switching of an introsort added.


    > The idea of a modified quicksort is pretty simple: below some
    > number of items, the "simple" sorts (insertion sort, selection
    > sort, bubble sort) are generally faster than a quicksort.
    > Therefore, when you reach that threshold, you're better off
    > switching to something else (and the threshold is rarely very
    > critical).


    > A further optimization is due to Robert Sedgewick: he noted
    > that you still have a fair amount of overhead repeatedly
    > calling the simple sort on all those small partitions. He also
    > noted that with an insertion sort (unlike anything else) the
    > speed doesn't really depend on the number of items being
    > sorted, but strictly upon the distances they need to be moved
    > during sorting -- therefore, rather than calling it once every
    > time a partition reaches a certain size threshold, you just
    > _stop_ doing more sorting when the size reaches the threshold,
    > and then when all the partitions have been reduced to that
    > size, you call an insertion sort _once_ for the entire
    > collection. Even though you're using an insertion sort on a
    > large collection, its speed is proportional to the sizes of the
    > partitions (when you stopped partitioning) rather than to the
    > size of the entire collection.


    > The difference this makes will depend _heavily_ upon the speed
    > of function calls. For example, it makes a MUCH larger
    > difference on an x86 than it does on a SPARC. Based on the 20%
    > difference you noted, it's a fair guess that you were probably
    > testing on an x86 or something very similar. More importantly,
    > that overhead also has an effect on the point at which you
    > should stop partitioning -- there's no one size that's right
    > for all architectures (though, as I mentioned above, the
    > threshold isn't usually critical).


    Two other possibilities to consider:

    -- Except for random data, quicksort is very sensitive to how
    the pivot is chosen. If he's constantly seeing roughly the
    same difference (including with already sorted arrays), then
    this probably isn't the case, but it can certainly explain
    significant differences in some specific cases.

    -- Using the modified quicksort, it's possible to gain a little
    time once you go to do the final insertion sort by first
    searching for the smallest value (which can only be in the
    first partition, so you don't have to search far), and
    putting it in place first. Having done that, you can drop
    the test for having reached the beginning of the array in
    the loop of the insertion sort, since you're guaranteed to
    find a value which precedes the one you're inserting before
    reaching it.
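
    A sketch of that second point (my own illustration, not any particular
    library's code; insertion_sort_with_sentinel is a made-up name):

    template <typename T>
    void insertion_sort_with_sentinel(T *first, T *last)
    {
        if (last - first < 2) return;

        // Move the smallest element to the front. Here we scan the whole
        // range for simplicity; after a threshold-limited quicksort you would
        // only need to search the first partition.
        T *min = first;
        for (T *p = first + 1; p < last; ++p)
            if (*p < *min) min = p;
        for (T *p = min; p > first; --p) {      // rotate it down to the front
            T tmp = *p; *p = *(p - 1); *(p - 1) = tmp;
        }

        // Unguarded insertion sort: *first is the minimum, so the inner loop
        // always stops before running off the front of the array.
        for (T *i = first + 1; i < last; ++i) {
            T v = *i;
            T *j = i;
            while (v < *(j - 1)) {              // no "j > first" test needed
                *j = *(j - 1);
                --j;
            }
            *j = v;
        }
    }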

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Sep 14, 2008
    #7
  8. ikarus

    ikarus Guest

    Thanks to everybody for the suggestions. But as I mentioned, I am already
    using most of the suggested improvements, except for searching for the
    minimal element only in the leftmost 'subproblem' to set it up as a
    'sentinel' for the insertion sort used in the last stage. I have already
    tried playing with the M coefficient, which is the array size at which
    insertion sort comes into play. It doesn't give any notable improvement;
    std::sort is still faster by about 4 times on average (not 20%, sorry for
    misleading you) on different sets of data.

    The reference implementation of introsort was one from D. Musser's docs,
    another from the STL <algorithm> headers of the MSVS 2008 C++
    implementation, which looks to me like plain introsort, but I'm not sure
    it is actually used during compilation; maybe it is only there for
    compatibility and some other internal code is used (I'll check this).

    > The difference this makes will depend _heavily_ upon the speed of
    > function calls. For example, it makes a MUCH larger difference on an x86
    > than it does on a SPARC. Based on the 20% difference you noted, it's a
    > fair guess that you were probably testing on an x86 or something very
    > similar.


    Yes, I agree, it seems to me too that the problem is some 'call'
    overhead, but I'm using templates with the inline keyword and all the
    needed optimization flags (/Ox for MS's cl, -O3 for mingw g++). I also
    looked at the generated assembler, and yes, most (all) of the functions
    used are inlined.

    I can't see any inline assembler or similar optimizations in the STL
    reference implementation which I'm using as a reference, but it is
    faster by 4 times.

    So how can we avoid this overhead using only C++ language constructs?
     
    ikarus, Sep 15, 2008
    #8
  9. Jerry Coffin wrote:

    > A further optimization is due to Robert Sedgewick: he noted that
    > you still have a fair amount of overhead repeatedly calling the
    > simple sort on all those small partitions.


    Do you have a reference handy?


    Thanks,
    Martin

    --
    Quidquid latine scriptum est, altum videtur.
     
    Martin Eisenberg, Sep 15, 2008
    #9
  10. ikarus

    Guest

    Why not just read the source of the GNU std::sort() and see what it does
    differently from your version?

    Sean
     
    , Sep 15, 2008
    #10
  11. Hello,

    ikarus wrote:

    > I'm comparing sorting algorithms for study purposes.


    > But the 'hand-coded' version is still slower.
    >
    > Why? Maybe the production std::sort version uses some SIMD optimization
    > internally, or assembler optimization?


    > Pseudocode (C++, may contain bugs) of the introsort version:
    >
    > // CODE
    >
    > template <typename T>
    > inline void swap(T *a, T *b)
    > {
    >     T temp = *a;
    >     *a = *b;
    >     *b = temp;
    > }


    First, some stylistic criticism: the swap of standard C++ is supposed to
    be used differently; reusing that name with that interface is highly
    confusing.

    A possible cause for the 20% slowdown: for nontrivial classes T with a
    proper swap, this implementation of swap is suboptimal. You are doing 3
    full copies of T instances, possibly even involving 3 allocations and
    deallocations. If there is a std::swap for T objects, that one should be
    used. This alone could well account for a 20% difference, because the
    number of swaps is of the same order as the number of comparisons for
    many sorting algorithms, so a suboptimal swap can add noticeably to the
    overall constant of the algorithm, e.g. if T is std::string. If T is a
    simple type like int, then there is probably no big problem with your
    swap. I have not seen the types you used for T, so you will have to test
    whether this plays a role.
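
    A minimal sketch of what I mean (my own illustration, not your code;
    swap_elements is a made-up name):

    #include <algorithm>   // std::swap (in C++98/03)
    #include <string>

    template <typename T>
    inline void swap_elements(T *a, T *b)
    {
        using std::swap;   // a user-provided swap(T&, T&) found by ADL wins,
        swap(*a, *b);      // otherwise std::swap is used
    }

    int main()
    {
        std::string x = "a long string that would otherwise be deep-copied";
        std::string y = "another string";
        swap_elements(&x, &y);   // for std::string this just exchanges the
                                 // internal buffers, no allocations
        return 0;
    }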

    Bernd Strieder
     
    Bernd Strieder, Sep 15, 2008
    #11
  12. ikarus

    Jerry Coffin Guest

    Martin Eisenberg wrote:
    > Jerry Coffin wrote:
    >
    > > A further optimization is due to Robert Sedgewick: he noted that
    > > you still have a fair amount of overhead repeatedly calling the
    > > simple sort on all those small partitions.

    >
    > Do you have a reference handy?


    Sorry, but no -- I don't have _any_ references handy right now (I just
    sold my house, and am in the process of moving).

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, Sep 16, 2008
    #12
