Why is a vector find faster than a set with find()?

Discussion in 'C++' started by boltar2003@boltar.world, Aug 7, 2012.

  1. Guest

    Hello

    I'm storing a large (100K) list of integer ids in a set which I need to
    frequently search for a given id. When I try doing this with a set using the
    standard find() function it takes 3 times longer than doing the same with a
    vector!

    Why is this? Surely given that it's ordered a set could be searched using
    a binary chop whereas a vector must be a linear search. Even if the set is
    searched linearly why does it take 3 times as long?

    Thanks for any info

    B2003
    , Aug 7, 2012
    #1

  2. writes:

    > I'm storing a large (100K) list of integer ids in a set which I need
    > to frequently search for a given id. When I try doing this with a set
    > using the standard find() function it takes 3 times longer than doing
    > the same with a vector!


    It is kind of surprising with such a large number of items. A linear scan
    is certainly faster for small vectors/sets, but 10^5 elements should be
    enough to make algorithmic complexity the dominant factor.

    What do you count exactly? Does your timer include filling up the set?
    Can you post some (simplified) code exhibiting the phenomenon?

    > Why is this? Surely given that it's ordered a set could be searched
    > using a binary chop whereas a vector must be a linear search. Even if
    > the set is searched linearly why does it take 3 times as long?


    Here is a plausible explanation. A vector stores integers in a
    contiguous area of memory, and recent processors have pretty efficient
    hardware prefetchers, which give their best when scanning an array
    (basically what you are doing when traversing a vector). STL set, on the
    other hand, allocates a new block for each element, so find needs to
    follow pointers. There is no good reason for these pointers to exhibit
    any kind of regularity, so the prefetchers are ineffective, and you pay the
    cache-miss cost on every access (if set cells are about the size of a
    cache line). Yes, data locality matters...

    Still, it would be spectacular to observe this effect on 10^5 items.

    -- Alain.
    Alain Ketterlin, Aug 7, 2012
    #2

  3. Guest

    On Tue, 07 Aug 2012 15:36:19 +0200
    Alain Ketterlin <-strasbg.fr> wrote:
    > writes:
    >
    >> I'm storing a large (100K) list of integer ids in a set which I need
    >> to frequently search for a given id. When I try doing this with a set
    >> using the standard find() function it takes 3 times longer than doing
    >> the same with a vector!

    >
    >It is kind of surprising with such a large number of items. A linear scan
    >is certainly faster for small vectors/sets, but 10^5 elements should be
    >enough to make algorithmic complexity the dominant factor.
    >
    >What do you count exactly? Does your timer include filling up the set?
    >Can you post some (simplified) code exhibiting the phenomenon?


    No, it doesn't include filling up, but that's so fast it would hardly register.
    I traced the delay down to the search of the set and timed that. I tried a
    vector and it was 3 times faster. It's annoying because I wanted to use a
    generic function with find() in it, but now I'm having to explicitly check for
    a set and use the set's built-in find(), which is about 200 times faster and I'm
    presuming does a binary chop.

    >(basically what you are doing when traversing a vector). STL set, on the
    >other hand, allocates a new block for each element, so find needs to
    >follow pointers. There is no good reason for these pointers to exhibit


    Ah, I didn't realise that. I assumed a set was the same internally as a vector,
    except kept ordered. I thought only lists and queues used pointers.

    >Still, it would be spectacular to observe this effect on 10^5 items.


    My code was nothing more than:

    myiterator = find(mydata.begin(), mydata.end(), val);

    So it would be fairly easy for someone to reproduce.
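    For anyone who wants to reproduce it, a minimal timing harness along these lines is enough (a hypothetical sketch; time_linear_find is a made-up helper name, and the exact numbers will vary with compiler and machine):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <set>
#include <vector>

// Nanoseconds for one std::find (linear search) over any container.
// Worst case: pick the last/largest element so the whole range is scanned.
template <typename Container>
long long time_linear_find(const Container& c, int needle) {
    auto t0 = std::chrono::steady_clock::now();
    volatile bool found = (std::find(c.begin(), c.end(), needle) != c.end());
    auto t1 = std::chrono::steady_clock::now();
    (void)found;  // keep the search from being optimized away
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

    Filling a std::vector<int> and a std::set<int> with the same 100K ids and comparing time_linear_find() on each should show the gap being discussed.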

    B2003
    , Aug 7, 2012
    #3
  4. Joe Greer Guest

    wrote in news:jvr3p9$b7s$:

    > Hello
    >
    > I'm storing a large (100K) list of integer ids in a set which I need
    > to frequently search for a given id. When I try doing this with a set
    > using the standard find() function it takes 3 times longer than doing
    > the same with a vector!
    >
    > Why is this? Surely given that it's ordered a set could be searched
    > using a binary chop whereas a vector must be a linear search. Even if
    > the set is searched linearly why does it take 3 times as long?
    >
    > Thanks for any info
    >
    > B2003
    >
    >


    If I had to guess (and I do :) ), I would guess that it has to do with the
    CPU's cache and read-prediction logic making things fly for the list of
    integers. I suspect that things would start to change if you were dealing
    with a list of pointers to objects instead. If the class were big
    enough, I suspect even a list of a non-trivial class would also start to
    slow down.

    joe
    Joe Greer, Aug 7, 2012
    #4
  5. writes:

    > On Tue, 07 Aug 2012 15:36:19 +0200
    > Alain Ketterlin <-strasbg.fr> wrote:
    >> writes:
    >>
    >>> I'm storing a large (100K) list of integer ids in a set which I need
    >>> to frequently search for a given id. When I try doing this with a set
    >>> using the standard find() function it takes 3 times longer than doing
    >>> the same with a vector!

    >>
    >>It is kind of surprising with such a large number of items. A linear scan
    >>is certainly faster for small vectors/sets, but 10^5 elements should be
    >>enough to make algorithmic complexity the dominant factor.
    >>
    >>What do you count exactly? Does your timer include filling up the set?
    >>Can you post some (simplified) code exhibiting the phenomenon?

    >
    > No, it doesn't include filling up, but that's so fast it would hardly
    > register.


    Not sure. Inserting into a set may involve allocating memory, and that
    is costly.

    > I traced the delay down to the search of the set and timed that. I tried
    > a vector and it was 3 times faster. It's annoying because I wanted to
    > use a generic function with find() in it, but now I'm having to
    > explicitly check for a set and use the set's built-in find(), which is
    > about 200 times faster and I'm presuming does a binary chop.


    You lost me. What is 200 times faster, and what is 3 times longer? Do you
    mean: myset.find(...) is fast, but find(myset.begin(), myset.end(), ...)
    is slow? Yes, obviously (the two functions do not do the same thing).

    OK, forget all this nonsense about cache lines etc. Of course
    myset.find() is faster than find(myset.begin(), myset.end(), ...). If you
    want to stay generic, use a wrapper around std::find(), and specialize it
    for sets.
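    A minimal sketch of such a wrapper (find_in is a hypothetical name; the generic overload falls back to std::find, while the std::set overload forwards to the member find()):

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Generic fallback: linear search, works for any container.
template <typename Container, typename T>
typename Container::const_iterator find_in(const Container& c, const T& value) {
    return std::find(c.begin(), c.end(), value);
}

// More specialized overload for std::set: O(log n) tree search via member find().
template <typename T>
typename std::set<T>::const_iterator find_in(const std::set<T>& s, const T& value) {
    return s.find(value);
}
```

    Overload resolution prefers the std::set version because it is more specialized, so generic code gets the fast path automatically.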

    -- Alain.
    Alain Ketterlin, Aug 7, 2012
    #5
  6. Guest

    On Tue, 07 Aug 2012 16:23:17 +0200
    Alain Ketterlin <-strasbg.fr> wrote:
    >> No, it doesn't include filling up, but that's so fast it would hardly
    >> register.

    >
    >Not sure. Inserting into a set may involve allocating memory, and that
    >is costly.


    Well, it seems quick when I time it. *shrug*

    >> I traced the delay down to the search of the set and timed that. I tried
    >> a vector and it was 3 times faster. It's annoying because I wanted to
    >> use a generic function with find() in it, but now I'm having to
    >> explicitly check for a set and use the set's built-in find(), which is
    >> about 200 times faster and I'm presuming does a binary chop.

    >
    >You lost me. What is 200 times faster and what is 3 times longer? Do you


    myset.find() is about 200 times faster than find(myset.begin(), myset.end(), ...).
    Using the latter is also about 3 times slower on a set than on a vector.

    >mean: myset.find(...) is fast, but find(myset.begin(),myset.end(),...)
    >is slow?


    Yup.

    >Yes, obviously (both functions do not do the same thing).


    That's what I didn't realise. I assumed the generic find() would be
    specialised internally for the type of container it was operating on. I
    didn't realise it just did a dumb linear search on everything.

    B2003
    , Aug 7, 2012
    #6
  7. wrote:
    >>Yes, obviously (both functions do not do the same thing).

    >
    > That's what I didn't realise. I assumed the generic find() would be
    > specialised internally for the type of container it was operating on. I
    > didn't realise it just did a dumb linear search on everything.


    Care to show us a version of find() that searches faster when it's given
    iterators to a std::set?
    Juha Nieminen, Aug 7, 2012
    #7
  8. Guest

    On Tue, 7 Aug 2012 14:59:12 +0000 (UTC)
    Juha Nieminen <> wrote:
    > wrote:
    >>>Yes, obviously (both functions do not do the same thing).

    >>
    >> That's what I didn't realise. I assumed the generic find() would be
    >> specialised internally for the type of container it was operating on. I
    >> didn't realise it just did a dumb linear search on everything.

    >
    >Care to show us a version of find() that searches faster when it's given
    >iterators to a std::set?


    What?

    B2003
    , Aug 7, 2012
    #8
  9. Jorgen Grahn Guest

    On Tue, 2012-08-07, Juha Nieminen wrote:
    > wrote:
    >>>Yes, obviously (both functions do not do the same thing).

    >>
    >> That's what I didn't realise. I assumed the generic find() would be
    >> specialised internally for the type of container it was operating on. I
    >> didn't realise it just did a dumb linear search on everything.

    >
    > Care to show us a version of find() that searches faster when it's given
    > iterators to a std::set?


    I don't think he claims there could be such a thing, or even that he's
    criticizing our beloved STL! Hopefully he got enlightened instead.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
    Jorgen Grahn, Aug 7, 2012
    #9
  10. Jorgen Grahn Guest

    On Tue, 2012-08-07, Alain Ketterlin wrote:
    > writes:
    >
    >> I'm storing a large (100K) list of integer ids in a set which I need
    >> to frequently search for a given id. When I try doing this with a set
    >> using the standard find() function it takes 3 times longer than doing
    >> the same with a vector!

    >
    > It is kind of surprising with such a large number of items. Linear scan
    > is certainly faster with small vectors/sets, but 10^5 should be enough
    > to make algorithmic complexity be the major factor.

    ....
    > Still, it would be spectacular to observe this effect on 10^5 items.


    (Later postings showed that the OP didn't do tree-based lookups.)

    FWIW, I expect that effect to be observable up to 20--100 items, on a
    recent machine. I.e., if I really needed performance for semi-small
    data sets, I'd do a benchmark before choosing std::map or similar.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
    Jorgen Grahn, Aug 7, 2012
    #10
  11. wrote:
    > On Tue, 7 Aug 2012 14:59:12 +0000 (UTC)
    > Juha Nieminen <> wrote:
    >> wrote:
    >>>>Yes, obviously (both functions do not do the same thing).
    >>>
    >>> That's what I didn't realise. I assumed the generic find() would be
    >>> specialised internally for the type of container it was operating on. I
    >>> didn't realise it just did a dumb linear search on everything.

    >>
    >>Care to show us a version of find() that searches faster when it's given
    >>iterators to a std::set?

    >
    > What?


    My point is that if you actually try to implement a find() function that
    takes advantage of the fact that the iterators are std::set iterators,
    and hence does an O(log n) search, you'll quickly see that it's not
    trivial to do with those iterators only. (For one, you don't have access
    to the "left" and "right" pointers of the nodes, not in a standardized
    way.)

    Perhaps <set> *could* provide a specialized version of std::find() that
    it befriends, allowing it to access those internal pointers in order to
    do an O(log n) search. However, this isn't required by the standard, and
    thus you shouldn't rely on such a thing.
    Juha Nieminen, Aug 7, 2012
    #11
  12. Casey Carter Guest

    On 2012-08-07 08:45, wrote:
    > No, it doesn't include filling up, but that's so fast it would hardly register.
    > I traced the delay down to the search of the set and timed that. I tried a
    > vector and it was 3 times faster. It's annoying because I wanted to use a
    > generic function with find() in it, but now I'm having to explicitly check for
    > a set and use the set's built-in find(), which is about 200 times faster and I'm
    > presuming does a binary chop.


    If searches are an order of magnitude or two more common than
    inserts/deletes - which seems to be the case from your description - you
    will likely get the best results from using std::lower_bound to
    binary-search a sorted vector.
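    A sketch of that approach (assuming the vector is kept sorted; contains_sorted is a hypothetical helper name):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// O(log n) membership test on a sorted vector via binary search.
bool contains_sorted(const std::vector<int>& v, int value) {
    auto it = std::lower_bound(v.begin(), v.end(), value);
    return it != v.end() && *it == value;
}
```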
    Casey Carter, Aug 7, 2012
    #12
  13. rep_movsd Guest

    On Tuesday, August 7, 2012 6:34:41 PM UTC+5:30, (unknown) wrote:
    > Hello
    >
    > I'm storing a large (100K) list of integer ids in a set which I need to
    > frequently search for a given id. When I try doing this with a set using the
    > standard find() function it takes 3 times longer than doing the same with a
    > vector!
    >
    > Why is this? Surely given that it's ordered a set could be searched using
    > a binary chop whereas a vector must be a linear search. Even if the set is
    > searched linearly why does it take 3 times as long?
    >
    > Thanks for any info
    >
    > B2003


    You have a set of integers; the only sensible operations (all close to O(log N)) are:

    Add an int to the set
    Remove an int from the set
    Test whether an int is in the set
    Find the iterator pointing to a given int using set::find

    std::find makes no assumptions about the iterators it is given except that they support ++ and *, so it does a linear search.

    What exactly are you trying to achieve? The only thing I can think of is that you look up a given int in the set and then iterate forward from there to get a certain range.
    If you only need to test whether an int is in the set, then set::count() is your friend.
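    For the membership-only case, that is something like the following (has_id is a hypothetical name):

```cpp
#include <cassert>
#include <set>

// Membership test: std::set::count() returns 0 or 1, in O(log n).
bool has_id(const std::set<int>& ids, int id) {
    return ids.count(id) != 0;
}
```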
    rep_movsd, Aug 9, 2012
    #13
  14. James Kanze Guest

    On Tuesday, August 7, 2012 2:04:41 PM UTC+1, (unknown) wrote:

    > I'm storing a large (100K) list of integer ids in a set which
    > I need to frequently search for a given id. When I try doing
    > this with a set using the standard find() function it takes 3
    > times longer than doing the same with a vector!


    > Why is this? Surely given that it's ordered a set could be
    > searched using a binary chop whereas a vector must be a linear
    > search. Even if the set is searched linearly why does it take
    > 3 times as long?


    By definition, std::find does a linear search. That's part of
    its definition. If performance is important, and you're dealing
    with sorted data, there is std::lower_bound and company.

    Practically speaking, neither std::find nor std::lower_bound
    have any way of knowing whether the range they're iterating over
    is sorted or not. It's up to you, the user, to tell it whether
    the range is sorted or not, by choosing the appropriate
    function. If you lie, it's undefined behavior, because
    typically, the function has no way of knowing.

    Finally: using an std::vector, and keeping it sorted, is often
    superior to using std::set, because of greater locality. I'm
    currently working on some high performance software, and we've
    banned the use of node based containers for critical data,
    because the loss of locality kills performance. We've also
    banned them in large parts outside of the critical area, because
    memory is very, very tight, and they tend to use a lot more
    memory than std::vector as well. (Note that this is *not* a
    general recommendation. We only took these steps because we had
    to.)
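    The sorted-vector-as-set idea can be sketched like this (hypothetical helpers; insertion stays O(n) because elements shift, but lookups are O(log n) over contiguous, cache-friendly storage):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Insert while keeping the vector sorted and duplicate-free.
void insert_sorted(std::vector<int>& v, int value) {
    auto it = std::lower_bound(v.begin(), v.end(), value);
    if (it == v.end() || *it != value) v.insert(it, value);
}

// O(log n) lookup over contiguous storage.
bool contains(const std::vector<int>& v, int value) {
    return std::binary_search(v.begin(), v.end(), value);
}
```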

    --
    James Kanze
    James Kanze, Aug 11, 2012
    #14
  15. Melzzzzz Guest

    On Sat, 11 Aug 2012 11:16:06 -0700 (PDT)
    James Kanze <> wrote:

    > On Tuesday, August 7, 2012 2:04:41 PM UTC+1, (unknown) wrote:
    >
    > > I'm storing a large (100K) list of integer ids in a set which
    > > I need to frequently search for a given id. When I try doing
    > > this with a set using the standard find() function it takes 3
    > > times longer than doing the same with a vector!

    >
    > > Why is this? Surely given that it's ordered a set could be
    > > searched using a binary chop whereas a vector must be a linear
    > > search. Even if the set is searched linearly why does it take
    > > 3 times as long?

    >
    > By definition, std::find does a linear search. That's part of
    > its definition. If performance is important, and you're dealing
    > with sorted data, there is std::lower_bound and company.
    >
    > Practically speaking, neither std::find nor std::lower_bound
    > have any way of knowing whether the range they're iterating over
    > is sorted or not. It's up to you, the user, to tell it whether
    > the range is sorted or not, by choosing the appropriate
    > function. If you lie, it's undefined behavior, because
    > typically, the function has no way of knowing.
    >
    > Finally: using an std::vector, and keeping it sorted, is often
    > superior to using std::set, because of greater locality. I'm
    > currently working on some high performance software, and we've
    > banned the use of node based containers for critical data,
    > because the loss of locality kills performance. We've also
    > banned them in large parts outside of the critical area, because
    > memory is very, very tight, and they tend to use a lot more
    > memory than std::vector as well. (Note that this is *not* a
    > general recommendation. We only took these steps because we had
    > to.)
    >


    You could try a B-tree, as it is a combination of vector and tree.
    B-trees have better lookup than standard red-black trees, while
    erasing and inserting are slower due to the nature of vectors.
    Melzzzzz, Aug 12, 2012
    #15
  16. Melzzzzz Guest

    On Sat, 11 Aug 2012 19:35:05 +0100
    Leigh Johnston <> wrote:

    > On 11/08/2012 19:16, James Kanze wrote:
    > > On Tuesday, August 7, 2012 2:04:41 PM UTC+1, (unknown) wrote:
    > >
    > >> I'm storing a large (100K) list of integer ids in a set which
    > >> I need to frequently search for a given id. When I try doing
    > >> this with a set using the standard find() function it takes 3
    > >> times longer than doing the same with a vector!

    > >
    > >> Why is this? Surely given that it's ordered a set could be
    > >> searched using a binary chop whereas a vector must be a linear
    > >> search. Even if the set is searched linearly why does it take
    > >> 3 times as long?

    > >
    > > By definition, std::find does a linear search. That's part of
    > > its definition. If performance is important, and you're dealing
    > > with sorted data, there is std::lower_bound and company.
    > >
    > > Practically speaking, neither std::find nor std::lower_bound
    > > have any way of knowing whether the range they're iterating over
    > > is sorted or not. It's up to you, the user, to tell it whether
    > > the range is sorted or not, by choosing the appropriate
    > > function. If you lie, it's undefined behavior, because
    > > typically, the function has no way of knowing.
    > >
    > > Finally: using an std::vector, and keeping it sorted, is often
    > > superior to using std::set, because of greater locality. I'm
    > > currently working on some high performance software, and we've
    > > banned the use of node based containers for critical data,
    > > because the loss of locality kills performance. We've also
    > > banned them in large parts outside of the critical area, because
    > > memory is very, very tight, and they tend to use a lot more
    > > memory than std::vector as well. (Note that this is *not* a
    > > general recommendation. We only took these steps because we had
    > > to.)

    >
    > You can improve the locality of the node based containers by using a
    > custom allocator; I got significant speed increases doing it this way
    > (but how much was down to improved locality and how much was down to
    > not using a general purpose allocator is unclear).
    >
    > /Leigh
    >


    I have tried too, with a custom allocator, but binary trees are inherently
    not cache-friendly: after a number of deletions and inserts the order
    of the nodes tends to become random, no matter how the allocator is written.
    Melzzzzz, Aug 12, 2012
    #16
  17. Melzzzzz Guest

    On Tue, 7 Aug 2012 13:04:41 +0000 (UTC)
    wrote:

    > Hello
    >
    > I'm storing a large (100K) list of integer ids in a set which I need
    > to frequently search for a given id. When I try doing this with a
    > set using the standard find() function it takes 3 times longer than
    > doing the same with a vector!


    That is because a set's iterators are complex. Incrementing a set's
    iterator is O(n), while incrementing a vector's iterator is O(1).


    >
    > Why is this? Surely given that it's ordered a set could be searched
    > using a binary chop whereas a vector must be a linear search. Even if
    > the set is searched linearly why does it take 3 times as long?


    Do you mean the set's member function find or the generic find?

    On my system the standard find with a vector is 200 times slower than the
    set's member function find, with 100k elements.
    Melzzzzz, Aug 12, 2012
    #17
  18. Melzzzzz Guest

    On Sun, 12 Aug 2012 03:40:32 +0200
    Melzzzzz <> wrote:

    > On Tue, 7 Aug 2012 13:04:41 +0000 (UTC)
    > wrote:
    >
    > > Hello
    > >
    > > I'm storing a large (100K) list of integer ids in a set which I
    > > need to frequently search for a given id. When I try doing this
    > > with a set using the standard find() function it takes 3 times
    > > longer than doing the same with a vector!

    >
    > That is because a set's iterators are complex. Incrementing a set's
    > iterator is O(n), while incrementing a vector's iterator is O(1).
    >
    >

    Pardon, O(log n)
    Melzzzzz, Aug 12, 2012
    #18
  19. Melzzzzz <> wrote:
    > That is because a set's iterators are complex. Incrementing a set's
    > iterator is O(log n)


    Actually it's not that simple.

    The number of operations needed to increment a set iterator may be O(log n)
    in the worst case. However, traversing the entire tree from beginning to end
    is still just O(n) (and not O(n log n), as one might hastily think). It might
    not be immediately intuitive why, but it's a good thinking exercise.

    The slowness of traversing a set from beginning to end compared to a
    vector does not come from the computational complexity being larger
    (it's the same for both). It's just that the set iterator has to do
    more work per increment. (Also, there's a higher probability of cache
    misses with a tree than with a vector.)
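    One way to convince yourself is to count pointer hops on a toy tree (a plain BST with parent pointers, not std::set's actual red-black tree): over a full in-order sweep each edge is crossed at most once downward and once upward, so the total is bounded by roughly 2n hops.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Toy BST node with a parent pointer, like the nodes a set iterator walks.
struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* parent = nullptr;
};

Node* insert(Node* root, int key) {
    if (!root) return new Node{key};
    Node* cur = root;
    for (;;) {
        Node*& child = (key < cur->key) ? cur->left : cur->right;
        if (!child) { child = new Node{key}; child->parent = cur; break; }
        cur = child;
    }
    return root;
}

Node* leftmost(Node* n) { while (n->left) n = n->left; return n; }

// In-order successor, counting every pointer followed in 'hops'.
Node* successor(Node* n, std::size_t& hops) {
    if (n->right) {
        n = n->right; ++hops;
        while (n->left) { n = n->left; ++hops; }
        return n;
    }
    Node* p = n->parent; ++hops;
    while (p && n == p->right) { n = p; p = p->parent; ++hops; }
    return p;
}

// Full in-order sweep: returns {nodes visited, total pointer hops}.
// (Nodes are leaked; fine for a throwaway demo.)
std::pair<std::size_t, std::size_t> sweep(Node* root) {
    std::size_t visited = 0, hops = 0;
    for (Node* n = leftmost(root); n; n = successor(n, hops)) ++visited;
    return {visited, hops};
}
```

    sweep() visits all n nodes with at most about 2n hops, i.e. amortized O(1) per increment even though a single increment can cost O(log n).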
    Juha Nieminen, Aug 12, 2012
    #19
  20. Melzzzzz Guest

    On Sun, 12 Aug 2012 07:16:57 +0000 (UTC)
    Juha Nieminen <> wrote:

    > Melzzzzz <> wrote:
    > > That is because a set's iterators are complex. Incrementing a set's
    > > iterator is O(log n)

    >
    > Actually it's not that simple.
    >
    > The amount of operations needed to increment a set iterator may be
    > O(log n) at worst. However, traversing the entire tree from beginning
    > to end is still just O(n) (and not O(n log n) as one could hastily
    > think). It might not be immediately intuitive why, but it's a good
    > thinking exercise.


    Are you sure? A tree iterator has to travel up and down through the nodes
    (take a look at the Treap iterator I posted in the thread about binary
    search trees). I haven't thought about it much, but I don't see how
    travelling through several nodes to reach each target node can add up
    to O(n) when traversing the whole tree.

    >
    > The slowness of traversing a set from beginning to end compared to a
    > vector does not come from the computational complexity being larger
    > (it's the same for both). It's just that the set iterator has to do
    > more things to increment the iterator. (Also, there's a higher
    > probability of cache misses with a tree than a vector.)


    Cache misses are another problem, yes.
    Melzzzzz, Aug 12, 2012
    #20
