# Why is a vector find faster than a set with find()?

Discussion in 'C++' started by boltar2003@boltar.world, Aug 7, 2012.

1. ### Guest

Hello

I'm storing a large (100K) list of integer ids in a set which I need to
frequently search for a given id. When I try doing this with a set using the
standard find() function it takes 3 times longer than doing the same with a
vector!

Why is this? Surely, given that it's ordered, a set could be searched using
a binary chop whereas a vector must be a linear search. Even if the set is
searched linearly, why does it take 3 times as long?

Thanks for any info

B2003

, Aug 7, 2012

2. ### Alain Ketterlin (Guest)

writes:

> I'm storing a large (100K) list of integer ids in a set which I need
> to frequently search for a given id. When I try doing this with a set
> using the standard find() function it takes 3 times longer than doing
> the same with a vector!

It is kind of surprising with such a large number of items. Linear scan
is certainly faster with small vectors/sets, but 10^5 should be enough
to make algorithmic complexity be the major factor.

What do you count exactly? Does your timer include filling up the set?
Can you post some (simplified) code exhibiting the phenomenon?

> Why is this? Surely given that its ordered a set could be searched
> using a binary chop whereas a vector must be a linear search. Even if
> the set is searched linearly why does it take 3 times as long?

Here is a plausible explanation. A vector stores integers in a
contiguous area of memory, and recent processors have pretty efficient
hardware prefetchers, which give their best when scanning an array
(basically what you are doing when traversing a vector). An STL set, on the
other hand, allocates a new block for each element, so find needs to
follow pointers. There is no good reason for these pointers to exhibit
any kind of regularity, so the prefetchers are inoperative, and you pay the
cache-miss cost on every access (if set cells are about the size of a
cache line). Yes, data locality matters...

Still, it would be spectacular to observe this effect on 10^5 items.

-- Alain.

Alain Ketterlin, Aug 7, 2012

3. ### Guest

On Tue, 07 Aug 2012 15:36:19 +0200
Alain Ketterlin <-strasbg.fr> wrote:
> writes:
>
>> I'm storing a large (100K) list of integer ids in a set which I need
>> to frequently search for a given id. When I try doing this with a set
>> using the standard find() function it takes 3 times longer than doing
>> the same with a vector!

>
>It is kind of surprising with such a large number of items. Linear scan
>is certainly faster with small vectors/sets, but 10^5 should be enough
>to make algorithmic complexity be the major factor.
>
>What do you count exactly? Does your timer include filling up the set?
>Can you post some (simplified) code exhibiting the phenomenon?

No, it doesn't include filling up, but that's so fast it would hardly register.
I traced the delay down to the search of the set and timed that. Tried a
vector and it was 3 times faster. It's annoying because I wanted to use a
generic function with find() in it, but now I'm having to explicitly check for
a set and use the set's built-in find(), which is about 200 times faster and I'm
presuming does do a binary chop.

>(basically what you are doing when traversing a vector). STL set, on the
>other hand, allocates a new block for each element, so find needs to
>follow pointers. There is no good reason for these pointers to exhibit

Ah, didn't realise that. I assumed a set was the same internally as a vector,
except kept ordered. I thought only lists and queues used pointers.

>Still, it would be spectacular to observe this effect on 10^5 items.

My code was nothing more than:

myiterator = find(mydata.begin(), mydata.end(), val);

So it would be fairly easy for someone to reproduce.

B2003

, Aug 7, 2012
4. ### Joe Greer (Guest)

wrote in news:jvr3p9$b7s$:

> Hello
>
> I'm storing a large (100K) list of integer ids in a set which I need
> to frequently search for a given id. When I try doing this with a set
> using the standard find() function it takes 3 times longer than doing
> the same with a vector!
>
> Why is this? Surely given that its ordered a set could be searched
> using a binary chop whereas a vector must be a linear search. Even if
> the set is searched linearly why does it take 3 times as long?
>
> Thanks for any info
>
> B2003
>
>

If I had to guess (and I do), I would guess that it has to do with the
CPU's cache and read-prediction logic making things fly for the list of
integers. I suspect that things would start to change if you were dealing
with a list of pointers to objects instead. If the class were big
enough, I suspect even a list of a non-trivial class would also start to
slow down.

joe

Joe Greer, Aug 7, 2012
5. ### Alain Ketterlin (Guest)

writes:

> On Tue, 07 Aug 2012 15:36:19 +0200
> Alain Ketterlin <-strasbg.fr> wrote:
>> writes:
>>
>>> I'm storing a large (100K) list of integer ids in a set which I need
>>> to frequently search for a given id. When I try doing this with a set
>>> using the standard find() function it takes 3 times longer than doing
>>> the same with a vector!

>>
>>It is kind of surprising with such a large number of items. Linear scan
>>is certainly faster with small vectors/sets, but 10^5 should be enough
>>to make algorithmic complexity be the major factor.
>>
>>What do you count exactly? Does your timer include filling up the set?
>>Can you post some (simplified) code exhibiting the phenomenon?

>
> No, it doesn't include filling up, but that's so fast it would hardly
> register.

Not sure. Inserting into a set may involve allocating memory, and that
is costly.

> I traced the delay down to the search of the set and timed that. Tried
> a vector and it was 3 times faster. It's annoying because I wanted to
> use a generic function with find() in it, but now I'm having to
> explicitly check for a set and use the set's built-in find(), which is
> about 200 times faster and I'm presuming does do a binary chop.

You lost me. What is 200 times faster and what is 3 times longer? Do you
mean: myset.find(...) is fast, but find(myset.begin(),myset.end(),...)
is slow? Yes, obviously (both functions do not do the same thing).

OK, forget all this nonsense about cache lines etc. Of course
myset.find() is faster than find(myset.begin(), myset.end(),...). If you
want to stay generic, use a wrapper to std::find(), and specialize it
for sets.

-- Alain.

Alain Ketterlin, Aug 7, 2012
6. ### Guest

On Tue, 07 Aug 2012 16:23:17 +0200
Alain Ketterlin <-strasbg.fr> wrote:
>> No, it doesn't include filling up, but that's so fast it would hardly
>> register.

>
>Not sure. Inserting into a set may involve allocating memory, and that
>is costly.

Well, it seems quick when I time it. *shrug*

>> I traced the delay down to the search of the set and timed that. Tried
>> a vector and it was 3 times faster. It's annoying because I wanted to
>> use a generic function with find() in it, but now I'm having to
>> explicitly check for a set and use the set's built-in find(), which is
>> about 200 times faster and I'm presuming does do a binary chop.

>
>You lost me. What is 200 times faster and what is 3 times longer? Do you

myset.find() is about 200 times faster than find(myset.begin(), myset.end(), ...).
Using the latter is also about 3 times slower on a set than on a vector.

>mean: myset.find(...) is fast, but find(myset.begin(),myset.end(),...)
>is slow?

Yup.

>Yes, obviously (both functions do not do the same thing).

That's what I didn't realise. I assumed the generic find() would be
specialised internally for the type of container it was operating on. I
didn't realise it just did a dumb linear search on everything.

B2003

, Aug 7, 2012
7. ### Juha Nieminen (Guest)

wrote:
>>Yes, obviously (both functions do not do the same thing).

>
> That's what I didn't realise. I assumed the generic find() would be
> specialised internally for the type of container it was operating on. I
> didn't realise it just did a dumb linear search on everything.

Care to show us a version of find() that searches faster when it's given
iterators to a std::set?

Juha Nieminen, Aug 7, 2012
8. ### Guest

On Tue, 7 Aug 2012 14:59:12 +0000 (UTC)
Juha Nieminen <> wrote:
> wrote:
>>>Yes, obviously (both functions do not do the same thing).

>>
>> That's what I didn't realise. I assumed the generic find() would be
>> specialised internally for the type of container it was operating on. I
>> didn't realise it just did a dumb linear search on everything.

>
>Care to show us a version of find() that searches faster when it's given
>iterators to a std::set?

What?

B2003

, Aug 7, 2012
9. ### Jorgen Grahn (Guest)

On Tue, 2012-08-07, Juha Nieminen wrote:
> wrote:
>>>Yes, obviously (both functions do not do the same thing).

>>
>> That's what I didn't realise. I assumed the generic find() would be
>> specialised internally for the type of container it was operating on. I
>> didn't realise it just did a dumb linear search on everything.

>
> Care to show us a version of find() that searches faster when it's given
> iterators to a std::set?

I don't think he claims there could be such a thing, or even that he's
criticizing our beloved STL! Hopefully he got enlightened instead.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Jorgen Grahn, Aug 7, 2012
10. ### Jorgen Grahn (Guest)

On Tue, 2012-08-07, Alain Ketterlin wrote:
> writes:
>
>> I'm storing a large (100K) list of integer ids in a set which I need
>> to frequently search for a given id. When I try doing this with a set
>> using the standard find() function it takes 3 times longer than doing
>> the same with a vector!

>
> It is kind of surprising with such a large number of items. Linear scan
> is certainly faster with small vectors/sets, but 10^5 should be enough
> to make algorithmic complexity be the major factor.

[...]
> Still, it would be spectacular to observe this effect on 10^5 items.

(Later postings showed that the OP didn't do tree-based lookups.)

FWIW, I expect that effect to be observable up to 20--100 items, on a
recent machine. I.e., if I really needed performance for semi-small
data sets, I'd do a benchmark before choosing std::map or similar.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Jorgen Grahn, Aug 7, 2012
11. ### Juha Nieminen (Guest)

wrote:
> On Tue, 7 Aug 2012 14:59:12 +0000 (UTC)
> Juha Nieminen <> wrote:
>> wrote:
>>>>Yes, obviously (both functions do not do the same thing).
>>>
>>> That's what I didn't realise. I assumed the generic find() would be
>>> specialised internally for the type of container it was operating on. I
>>> didn't realise it just did a dumb linear search on everything.

>>
>>Care to show us a version of find() that searches faster when it's given
>>iterators to a std::set?

>
> What?

My point is that if you actually try to implement a find() function that
tries to take advantage of the fact that the iterators are std::set
iterators, and hence tries to make an O(log n) search, you'll quickly see
that it's not that trivial to do with those iterators only. (For one,
you don't have access to the "left" and "right" pointers of the nodes,
not in a standardized way.)

Perhaps <set> *could* provide a specialized version of std::find() that
it befriends, allowing it to access those internal pointers in order to
make an O(log n) search. However, this isn't required by the standard, and
thus you shouldn't rely on such a thing.

Juha Nieminen, Aug 7, 2012
12. ### Casey Carter (Guest)

On 2012-08-07 08:45, wrote:
> No, it doesn't include filling up, but that's so fast it would hardly register.
> I traced the delay down to the search of the set and timed that. Tried a
> vector and it was 3 times faster. It's annoying because I wanted to use a
> generic function with find() in it, but now I'm having to explicitly check for
> a set and use the set's built-in find(), which is about 200 times faster and I'm
> presuming does do a binary chop.

If searches are an order of magnitude or two more common than
inserts/deletes - which seems to be the case from your description - you
will likely get best results from using std::lower_bound to binary
search a sorted vector.

Casey Carter, Aug 7, 2012
13. ### rep_movsd (Guest)

On Tuesday, August 7, 2012 6:34:41 PM UTC+5:30, (unknown) wrote:
> Hello
>
> I'm storing a large (100K) list of integer ids in a set which I need to
> frequently search for a given id. When I try doing this with a set using the
> standard find() function it takes 3 times longer than doing the same with a
> vector!
>
> Why is this? Surely given that it's ordered a set could be searched using
> a binary chop whereas a vector must be a linear search. Even if the set is
> searched linearly why does it take 3 times as long?
>
> Thanks for any info
>
> B2003

You have a set of integers; the only sensible operations (which are close to
O(log N)) are:

Removing an int from the set
Testing whether an int is in the set
Finding the iterator pointing to a given int using set::find

std::find makes no assumptions about the iterators it's given, except that
they have ++ and *, and it does a linear search.

What exactly are you trying to achieve? The only thing I can think of is that
you look up a given int in the set and then iterate forward from there to get
a certain range, perhaps.
If you only need to test whether the int is in the set, then set::count() is
your friend.

rep_movsd, Aug 9, 2012
14. ### James Kanze (Guest)

On Tuesday, August 7, 2012 2:04:41 PM UTC+1, (unknown) wrote:

> I'm storing a large (100K) list of integer ids in a set which
> I need to frequently search for a given id. When I try doing
> this with a set using the standard find() function it takes 3
> times longer than doing the same with a vector!

> Why is this? Surely given that its ordered a set could be
> searched using a binary chop whereas a vector must be a linear
> search. Even if the set is searched linearly why does it take
> 3 times as long?

By definition, std::find does a linear search. That's part of
its definition. If performance is important, and you're dealing
with sorted data, there is std::lower_bound and company.

Practically speaking, neither std::find nor std::lower_bound
have any way of knowing whether the range they're iterating over
is sorted or not. It's up to you, the user, to tell it whether
the range is sorted or not, by choosing the appropriate
function. If you lie, it's undefined behavior, because
typically, the function has no way of knowing.

Finally: using an std::vector, and keeping it sorted, is often
superior to using std::set, because of greater locality. I'm
currently working on some high performance software, and we've
banned the use of node based containers for critical data,
because the loss of locality kills performance. We've also
banned them in large parts outside of the critical area, because
memory is very, very tight, and they tend to use a lot more
memory than std::vector as well. (Note that this is *not* a
general recommendation. We only took these steps because we had
to.)

--
James Kanze

James Kanze, Aug 11, 2012
15. ### Melzzzzz (Guest)

On Sat, 11 Aug 2012 11:16:06 -0700 (PDT)
James Kanze <> wrote:

> On Tuesday, August 7, 2012 2:04:41 PM UTC+1, (unknown) wrote:
>
> > I'm storing a large (100K) list of integer ids in a set which
> > I need to frequently search for a given id. When I try doing
> > this with a set using the standard find() function it takes 3
> > times longer than doing the same with a vector!

>
> > Why is this? Surely given that its ordered a set could be
> > searched using a binary chop whereas a vector must be a linear
> > search. Even if the set is searched linearly why does it take
> > 3 times as long?

>
> By definition, std::find does a linear search. That's part of
> its definition. If performance is important, and you're dealing
> with sorted data, there is std::lower_bound and company.
>
> Practically speaking, neither std::find nor std::lower_bound
> have any way of knowing whether the range they're iterating over
> is sorted or not. It's up to you, the user, to tell it whether
> the range is sorted or not, by choosing the appropriate
> function. If you lie, it's undefined behavior, because
> typically, the function has no way of knowing.
>
> Finally: using an std::vector, and keeping it sorted, is often
> superior to using std::set, because of greater locality. I'm
> currently working on some high performance software, and we've
> banned the use of node based containers for critical data,
> because the loss of locality kills performance. We've also
> banned them in large parts outside of the critical area, because
> memory is very, very tight, and they tend to use a lot more
> memory than std::vector as well. (Note that this is *not* a
> general recommendation. We only took these steps because we had
> to.)
>

You could try a B-tree, as it is a combination of vector and tree.
B-trees have better lookup than standard red-black trees,
while erasing and inserting is slower due to the nature
of vectors.

Melzzzzz, Aug 12, 2012
16. ### Melzzzzz (Guest)

On Sat, 11 Aug 2012 19:35:05 +0100
Leigh Johnston <> wrote:

> On 11/08/2012 19:16, James Kanze wrote:
> > On Tuesday, August 7, 2012 2:04:41 PM UTC+1, (unknown) wrote:
> >
> >> I'm storing a large (100K) list of integer ids in a set which
> >> I need to frequently search for a given id. When I try doing
> >> this with a set using the standard find() function it takes 3
> >> times longer than doing the same with a vector!

> >
> >> Why is this? Surely given that its ordered a set could be
> >> searched using a binary chop whereas a vector must be a linear
> >> search. Even if the set is searched linearly why does it take
> >> 3 times as long?

> >
> > By definition, std::find does a linear search. That's part of
> > its definition. If performance is important, and you're dealing
> > with sorted data, there is std::lower_bound and company.
> >
> > Practically speaking, neither std::find nor std::lower_bound
> > have any way of knowing whether the range they're iterating over
> > is sorted or not. It's up to you, the user, to tell it whether
> > the range is sorted or not, by choosing the appropriate
> > function. If you lie, it's undefined behavior, because
> > typically, the function has no way of knowing.
> >
> > Finally: using an std::vector, and keeping it sorted, is often
> > superior to using std::set, because of greater locality. I'm
> > currently working on some high performance software, and we've
> > banned the use of node based containers for critical data,
> > because the loss of locality kills performance. We've also
> > banned them in large parts outside of the critical area, because
> > memory is very, very tight, and they tend to use a lot more
> > memory than std::vector as well. (Note that this is *not* a
> > general recommendation. We only took these steps because we had
> > to.)

>
> You can improve the locality of the node based containers by using a
> custom allocator; I got significant speed increases doing it this way
> (but how much was down to improved locality and how much was down to
> not using a general purpose allocator is unclear).
>
> /Leigh
>

I have tried too, with a custom allocator, but binary trees are inherently
not cache friendly: after a number of deletions and inserts, the order
of the nodes tends to become random, no matter how the allocator is written.

Melzzzzz, Aug 12, 2012
17. ### Melzzzzz (Guest)

On Tue, 7 Aug 2012 13:04:41 +0000 (UTC)
wrote:

> Hello
>
> I'm storing a large (100K) list of integer ids in a set which I need
> to frequently search for a given id. When I try doing this with a
> set using the standard find() function it takes 3 times longer than
> doing the same with a vector!

That is because a set's iterators are complex. Incrementing a set's iterator
is O(n), while incrementing a vector's iterator is O(1).

>
> Why is this? Surely given that its ordered a set could be searched
> using a binary chop whereas a vector must be a linear search. Even if
> the set is searched linearly why does it take 3 times as long?

Do you mean the set's member function find, or the generic find?

On my system, standard find with a vector is 200 times slower than the set's
member function find, with 100k elements.

Melzzzzz, Aug 12, 2012
18. ### Melzzzzz (Guest)

On Sun, 12 Aug 2012 03:40:32 +0200
Melzzzzz <> wrote:

> On Tue, 7 Aug 2012 13:04:41 +0000 (UTC)
> wrote:
>
> > Hello
> >
> > I'm storing a large (100K) list of integer ids in a set which I
> > need to frequently search for a given id. When I try doing this
> > with a set using the standard find() function it takes 3 times
> > longer than doing the same with a vector!

>
> That is because a set's iterators are complex. Incrementing a set's
> iterator is O(n), while incrementing a vector's iterator is O(1).
>
>

Pardon, O(log n)

Melzzzzz, Aug 12, 2012
19. ### Juha Nieminen (Guest)

Melzzzzz <> wrote:
> That is because a set's iterators are complex. Incrementing a set's
> iterator is O(log n)

Actually it's not that simple.

The number of operations needed to increment a set iterator may be O(log n)
at worst. However, traversing the entire tree from beginning to end is
still just O(n) (and not O(n log n) as one could hastily think). It might
not be immediately intuitive why, but it's a good thinking exercise. (Hint:
a full in-order walk crosses each edge of the tree at most twice.)

The slowness of traversing a set from beginning to end compared to a
vector does not come from the computational complexity being larger
(it's the same for both). It's just that the set iterator has to do
more things to increment the iterator. (Also, there's a higher probability
of cache misses with a tree than a vector.)

Juha Nieminen, Aug 12, 2012
20. ### Melzzzzz (Guest)

On Sun, 12 Aug 2012 07:16:57 +0000 (UTC)
Juha Nieminen <> wrote:

> Melzzzzz <> wrote:
> > That is because a set's iterators are complex. Incrementing a set's
> > iterator is O(log n)

>
> Actually it's not that simple.
>
> The amount of operations needed to increment a set iterator may be
> O(log n) at worst. However, traversing the entire tree from beginning
> to end is still just O(n) (and not O(n log n) as one could hastily
> think). It might not be immediately intuitive why, but it's a good
> thinking exercise.

Are you sure? A tree iterator has to travel up and down through the nodes
(take a look at the Treap iterator I posted in the thread about binary search
trees). I haven't thought about it much, but traversing through several nodes
to reach the target node could not be O(n) when traversing the whole tree, I
think.

>
> The slowness of traversing a set from beginning to end compared to a
> vector does not come from the computational complexity being larger
> (it's the same for both). It's just that the set iterator has to do
> more things to increment the iterator. (Also, there's a higher
> probability of cache misses with a tree than a vector.)

Cache misses are another problem, yes.

Melzzzzz, Aug 12, 2012