Container for Fast Lookup

Pallav singh

Hi all,

I have data coming in at a variable rate, and the basic requirement is that
lookup and insertion should be very fast. I used std::map for that, but when
the data got very large (10 GB stored in the map), its performance degraded.

The time to fetch data by key increases exponentially.

When I used C++0x's unordered_map to store the data, the lookup time dropped
significantly.

Is there any other container, provided by Boost or the ACE framework, that I
can use to reduce my lookup time further?

Thanks
Pallav Singh
 
Dombo

On 13-Aug-11 11:42, Pallav singh wrote:
Hi all,

I have data coming in at a variable rate, and the basic requirement is that
lookup and insertion should be very fast. I used std::map for that, but when
the data got very large (10 GB stored in the map), its performance degraded.

The time to fetch data by key increases exponentially.

Considering the computational complexity of std::map lookups (O(log n)), I
wouldn't expect the lookup time to increase exponentially with the size of
the data set.

Probably you are experiencing the effects of caching and/or virtual memory,
which would explain why performance degrades dramatically when the data set
gets large. When there is a huge number of elements in the map, chances are
that the nodes are spread out over memory. A lookup can then hit memory
addresses all over the place, which causes many cache misses and/or page
faults. These days the way you access memory (locality of reference) can
have a bigger impact on performance than the computational complexity.
When I used C++0x's unordered_map to store the data, the lookup time dropped
significantly.

Is there any other container, provided by Boost or the ACE framework, that I
can use to reduce my lookup time further?

As far as lookup performance is concerned, hash-based associative containers
are about as good as it gets, assuming the hash function is good. Possibly
you may be able to provide a better hash function for your data; a minimal
sketch of what that looks like is below.
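
For concreteness, here is a minimal sketch of plugging a custom hash into
unordered_map. MyKey, its fields, and the mixing constant are hypothetical
stand-ins for illustration, not anything from the original post; a real hash
should be tuned to the actual key distribution:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Hypothetical composite key; replace with the real key type.
    struct MyKey {
        std::uint32_t id;
        std::uint32_t shard;
        bool operator==(const MyKey& other) const {
            return id == other.id && shard == other.shard;
        }
    };

    // Custom hash functor: combines both fields so keys spread evenly
    // across buckets (0x9e3779b9 is the usual golden-ratio mixing constant).
    struct MyKeyHash {
        std::size_t operator()(const MyKey& k) const {
            std::size_t h = std::hash<std::uint32_t>()(k.id);
            h ^= std::hash<std::uint32_t>()(k.shard)
                 + 0x9e3779b9 + (h << 6) + (h >> 2);
            return h;
        }
    };

    int main() {
        // The third template parameter selects the custom hash.
        std::unordered_map<MyKey, std::string, MyKeyHash> table;
        table[MyKey{42, 7}] = "value";
        return 0;
    }

How much this helps depends entirely on how badly the default or current
hash distributes the real keys across buckets.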

If that doesn't help, the only other option I can think of is to tackle the
problem at a more fundamental (design) level.
 
Jorgen Grahn

Hi all,

I have data coming in at a variable rate, and the basic requirement is that
lookup and insertion should be very fast. I used std::map for that, but when
the data got very large (10 GB stored in the map), its performance degraded.

The time to fetch data by key increases exponentially.

But what about insertions and deletions? I'd expect those to be even
worse.
When I used C++0x's unordered_map to store the data, the lookup time dropped
significantly.

Is there any other container, provided by Boost or the ACE framework, that I
can use to reduce my lookup time further?

I think I would do an experiment with a BerkeleyDB of some kind stored
on disk, to see if that's better for huge data sets. Maybe also try going
from an unordered_map<Key, Value> to an unordered_map<Key, Value*>, to
see if that is more cache-friendly; a sketch of that change follows.
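
A minimal sketch of that indirection, using unique_ptr rather than a raw
Value* so ownership is explicit. Record and its blob field are hypothetical
stand-ins for the poster's actual value type:

    #include <memory>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical large payload; the point is that it does NOT live
    // inside the hash node itself.
    struct Record {
        std::vector<char> blob;   // e.g. several KB per entry
    };

    // Each node now holds only the key and one pointer, so the table's
    // own memory stays small and dense.
    std::unordered_map<std::string, std::unique_ptr<Record>> table;

    void insert(const std::string& key, std::unique_ptr<Record> rec) {
        table[key] = std::move(rec);
    }

    const Record* lookup(const std::string& key) {
        auto it = table.find(key);
        return it != table.end() ? it->second.get() : nullptr;
    }

This only pays off when the value type is much bigger than the key, since
otherwise the extra pointer chase costs more than the smaller nodes save.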

/Jorgen
 
Leo Equinox Gaspard

On 13/08/2011 11:42, Pallav singh wrote:
Hi all,

I have data coming in at a variable rate, and the basic requirement is that
lookup and insertion should be very fast. I used std::map for that, but when
the data got very large (10 GB stored in the map), its performance degraded.

The time to fetch data by key increases exponentially.

When I used C++0x's unordered_map to store the data, the lookup time dropped
significantly.

Is there any other container, provided by Boost or the ACE framework, that I
can use to reduce my lookup time further?

Thanks
Pallav Singh

As you have 10 GB of data, your OS is likely to swap, which would explain
the performance issue (unless you have 10 GB of RAM?). Moreover, even if
you have the RAM needed, RAM access isn't as fast as cache access.
So maybe you should use something like a B-tree with the number of
children per node adapted to your processor cache: you would get
O(log_K N) lookups (with K the number of children per node and N the
number of elements) while reducing RAM accesses by matching the node
size to your cache.

But I'm not sure whether retrieving, say, 1 MB of data from RAM to
processor cache performs better or worse than retrieving 100 KB ten
times from RAM to cache.

However, if your OS is swapping (if you don't have enough memory), then
a B-tree is very likely to improve performance compared to std::map.

Compared to std::unordered_map, I'm not sure which one is best.
A B-tree is an improvement over std::map for large amounts of data,
but an unordered_map can be improved too: for example, reserve enough
buckets when creating the unordered_map, or provide a better hash function.

Then, some libraries may offer you more data structures.

So, my advice:
- Try B-trees.
- For unordered_map:
  - Reserve enough buckets to avoid re-hashing (see the sketch after this
    list).
  - Provide a good hash function (examine the space your keys live in,
    and try to map it onto as large a range of integers as possible,
    with as few collisions as possible).
- Finally, there may be other data structures I don't know about yet.
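
For the bucket-reservation point, here is a minimal sketch, assuming
(hypothetically) that you can estimate the element count up front:

    #include <cstdint>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::uint64_t, std::uint64_t> table;

        const std::size_t expected = 100000000;  // hypothetical estimate

        // Optional: trade memory for shorter chains. Set this before
        // reserve(), since reserve() sizes the bucket array using the
        // current max_load_factor.
        table.max_load_factor(0.5f);

        // Allocate buckets for `expected` elements up front, so no
        // rehash happens until that count is exceeded.
        table.reserve(expected);

        for (std::uint64_t i = 0; i < 1000; ++i)
            table[i] = i * 2;   // inserts without triggering a rehash
        return 0;
    }

With 10 GB of data the bucket array itself will be large, but one up-front
allocation is far cheaper than repeated rehashing during the insertion burst.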

Cheers & hth.,
Leo
 
Juha Nieminen

Nick Keighley said:
not really

I don't think caching and virtual memory would cause std::map to slow
down from logarithmic to exponential. However, swapping sounds exactly
like it could.
 
Jorgen Grahn

I don't think caching and virtual memory would cause std::map to slow
down from logarithmic to exponential. However, swapping sounds exactly
like it could.

Are you two /really/ talking about exponential, or are you using the
word to mean "ridiculously slow"?

Even if you have a tree that's 40 levels deep and everything has been
swapped to disk, the number of disk hits needed to find a certain node
is proportional to the depth -- that is, it increases logarithmically
with the size of the tree. (A balanced binary tree 40 levels deep already
holds about 2^40, roughly a trillion, nodes.)

/Jorgen
 
Leo Equinox Gaspard

On 13/08/2011 14:22, Jorgen Grahn wrote:
[...]
I think I would do an experiment with a BerkeleyDB of some kind stored
on disk, to see if that's better for huge data sets. Maybe also try going
from an unordered_map<Key, Value> to an unordered_map<Key, Value*>, to
see if that is more cache-friendly.

/Jorgen

I don't think an unordered_map caches well, because of its random-like
distribution. Unless you are always requesting the same elements, the
processor cache can't usefully decide which of the thousands of entries
in your map it should keep.

BTW, to get better advice, you (Pallav singh) could say how big each
datum is (10 GB of 8-byte records isn't handled by the same data
structure as 10 GB of 1 GB records).

Leo
 
Juha Nieminen

Jorgen Grahn said:
Are you two /really/ talking about exponential, or are you using the
word to mean "ridiculously slow"?

Ok, it's probably not technically speaking O(2^n) with swapping, just
extremely slow. My point was, however, that the cause of this is not
caching or virtual memory, but disk swapping.
 
Jorgen Grahn

On 13/08/2011 14:22, Jorgen Grahn wrote:
[...]
[...] Maybe also try going
from an unordered_map<Key, Value> to an unordered_map<Key, Value*>, to
see if that is more cache-friendly.

/Jorgen

I don't think an unordered_map caches well, because of its random-like
distribution. Unless you are always requesting the same elements, the
processor cache can't usefully decide which of the thousands of entries
in your map it should keep.

True, but my idea was that unordered_map<Key, Value*> might be better
than unordered_map<Key, Value>, which was what he was already using.

Leo Equinox Gaspard said:
BTW, to get better advice, you (Pallav singh) could say how big each
datum is (10 GB of 8-byte records isn't handled by the same data
structure as 10 GB of 1 GB records).

Yes, moving to unordered_map<Key, Value*> can only be useful if his
value type is much bigger than his key. I forgot to point that out.

/Jorgen
 
