Container for Fast Lookup

Pallav singh

Hi all,

I have data coming in at a variable rate, and the basic requirement is that
lookup and insertion should be very fast. I used std::map for that, but when
the data got very large (10 GB stored in the map), its performance degraded.

The time to fetch data by key increases exponentially.

When I used C++0x's unordered_map to store the data, the lookup time dropped
significantly.

Is there any other container, provided by Boost or the ACE framework, that I
can use to reduce my lookup time further?

Thanks
Pallav Singh
 
Dombo

On 13-Aug-11 11:42, Pallav singh wrote:
Hi all,

I have data coming in at a variable rate, and the basic requirement is that
lookup and insertion should be very fast. I used std::map for that, but when
the data got very large (10 GB stored in the map), its performance degraded.

The time to fetch data by key increases exponentially.

Considering the computational complexity of std::map lookups (O(log n)), I
wouldn't expect the lookup time to increase exponentially with the size of
the data set.

Probably you are experiencing the effects of caching and/or virtual memory,
which would explain why performance degrades dramatically when the data set
gets large. When there is a huge number of elements in the map, chances are
that the nodes are spread out over memory. A lookup can then hit memory
addresses all over the place, which causes many cache misses and/or page
faults. These days the way you access memory (locality of reference) can
have a bigger impact on performance than the computational complexity.
When I used C++0x's unordered_map to store the data, the lookup time dropped
significantly.

Is there any other container, provided by Boost or the ACE framework, that I
can use to reduce my lookup time further?

As far as lookup performance is concerned, hash-based associative containers
are about as good as it gets, assuming the hash function is good. Possibly
you may be able to provide a better hash function for your data; a minimal
sketch of what that looks like is below.
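
For concreteness, here is a minimal sketch of plugging a custom hash into
unordered_map. MyKey, its fields, and the mixing constant are hypothetical
stand-ins for illustration, not anything from the original post; a real hash
should be tuned to the actual key distribution:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Hypothetical composite key; replace with the real key type.
    struct MyKey {
        std::uint32_t id;
        std::uint32_t shard;
        bool operator==(const MyKey& other) const {
            return id == other.id && shard == other.shard;
        }
    };

    // Custom hash functor: combines both fields so keys spread evenly
    // across buckets (0x9e3779b9 is the usual golden-ratio mixing constant).
    struct MyKeyHash {
        std::size_t operator()(const MyKey& k) const {
            std::size_t h = std::hash<std::uint32_t>()(k.id);
            h ^= std::hash<std::uint32_t>()(k.shard)
                 + 0x9e3779b9 + (h << 6) + (h >> 2);
            return h;
        }
    };

    int main() {
        // The third template parameter selects the custom hash.
        std::unordered_map<MyKey, std::string, MyKeyHash> table;
        table[MyKey{42, 7}] = "value";
        return 0;
    }

How much this helps depends entirely on how badly the default or current
hash distributes the real keys across buckets.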

If that doesn't help, the only other option I can think of is to tackle the
problem at a more fundamental (design) level.
 
Jorgen Grahn

Hi all,

I have data coming in at a variable rate, and the basic requirement is that
lookup and insertion should be very fast. I used std::map for that, but when
the data got very large (10 GB stored in the map), its performance degraded.

The time to fetch data by key increases exponentially.

But what about insertions and deletions? I'd expect those to be even
worse.
When I used C++0x's unordered_map to store the data, the lookup time dropped
significantly.

Is there any other container, provided by Boost or the ACE framework, that I
can use to reduce my lookup time further?

I think I would do an experiment with a BerkeleyDB of some kind stored
on disk, to see if that's better for huge data sets. Maybe also try going
from an unordered_map<Key, Value> to an unordered_map<Key, Value*>, to
see if that is more cache-friendly; a sketch of that change follows.
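
A minimal sketch of that indirection, using unique_ptr rather than a raw
Value* so ownership is explicit. Record and its blob field are hypothetical
stand-ins for the poster's actual value type:

    #include <memory>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical large payload; the point is that it does NOT live
    // inside the hash node itself.
    struct Record {
        std::vector<char> blob;   // e.g. several KB per entry
    };

    // Each node now holds only the key and one pointer, so the table's
    // own memory stays small and dense.
    std::unordered_map<std::string, std::unique_ptr<Record>> table;

    void insert(const std::string& key, std::unique_ptr<Record> rec) {
        table[key] = std::move(rec);
    }

    const Record* lookup(const std::string& key) {
        auto it = table.find(key);
        return it != table.end() ? it->second.get() : nullptr;
    }

This only pays off when the value type is much bigger than the key, since
otherwise the extra pointer chase costs more than the smaller nodes save.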

/Jorgen
 
Leo Equinox Gaspard

On 13/08/2011 11:42, Pallav singh wrote:
Hi all,

I have data coming in at a variable rate, and the basic requirement is that
lookup and insertion should be very fast. I used std::map for that, but when
the data got very large (10 GB stored in the map), its performance degraded.

The time to fetch data by key increases exponentially.

When I used C++0x's unordered_map to store the data, the lookup time dropped
significantly.

Is there any other container, provided by Boost or the ACE framework, that I
can use to reduce my lookup time further?

Thanks
Pallav Singh

As you have 10 GB of data, your OS is likely to swap, which would explain
the performance issue (unless you have 10 GB of RAM?). Moreover, even if
you have the RAM needed, RAM access isn't as fast as cache access.
So maybe you should use something like a B-tree with the number of
children per node adapted to your processor cache: you would get
O(log_K N) lookups (with K the number of children per node and N the
number of elements) while reducing RAM accesses by matching the node
size to your cache.

But I'm not sure whether retrieving, say, 1 MB of data from RAM to
processor cache performs better or worse than retrieving 100 KB ten
times from RAM to cache.

However, if your OS is swapping (if you don't have enough memory), then
a B-tree is very likely to improve performance compared to std::map.

Compared to std::unordered_map, I'm not sure which one is best.
A B-tree is an improvement over std::map for large amounts of data,
but an unordered_map can be improved too: for example, reserve enough
buckets when creating the unordered_map, or provide a better hash function.

Then, some libraries may offer you more data structures.

So, my advice:
- Try B-trees.
- For unordered_map:
  - Reserve enough buckets to avoid re-hashing (see the sketch after this
    list).
  - Provide a good hash function (examine the space your keys live in,
    and try to map it onto as large a range of integers as possible,
    with as few collisions as possible).
- Finally, there may be other data structures I don't know about yet.
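
For the bucket-reservation point, here is a minimal sketch, assuming
(hypothetically) that you can estimate the element count up front:

    #include <cstdint>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::uint64_t, std::uint64_t> table;

        const std::size_t expected = 100000000;  // hypothetical estimate

        // Optional: trade memory for shorter chains. Set this before
        // reserve(), since reserve() sizes the bucket array using the
        // current max_load_factor.
        table.max_load_factor(0.5f);

        // Allocate buckets for `expected` elements up front, so no
        // rehash happens until that count is exceeded.
        table.reserve(expected);

        for (std::uint64_t i = 0; i < 1000; ++i)
            table[i] = i * 2;   // inserts without triggering a rehash
        return 0;
    }

With 10 GB of data the bucket array itself will be large, but one up-front
allocation is far cheaper than repeated rehashing during the insertion burst.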

Cheers & hth.,
Leo
 
Juha Nieminen

Nick Keighley said:
not really

I don't think caching and virtual memory would cause std::map to slow
down from logarithmic to exponential. However, swapping sounds exactly
like it could.
 
Jorgen Grahn

I don't think caching and virtual memory would cause std::map to slow
down from logarithmic to exponential. However, swapping sounds exactly
like it could.

Are you two /really/ talking about exponential, or are you using the
word to mean "ridiculously slow"?

Even if you have a tree that's 40 levels deep and everything has been
swapped to disk, the number of disk hits needed to find a certain node
is proportional to the depth -- that is, it increases logarithmically
with the size of the tree. (A balanced binary tree 40 levels deep already
holds about 2^40, roughly a trillion, nodes.)

/Jorgen
 
Leo Equinox Gaspard

On 13/08/2011 14:22, Jorgen Grahn wrote:
[...]
I think I would do an experiment with a BerkeleyDB of some kind stored
on disk, to see if that's better for huge data sets. Maybe also try going
from an unordered_map<Key, Value> to an unordered_map<Key, Value*>, to
see if that is more cache-friendly.

/Jorgen

I don't think an unordered_map caches well, because of its random-like
distribution. Unless you are always requesting the same elements, the
processor cache can't usefully decide which of the thousands of entries
in your map it should keep.

BTW, to get better advice, you (Pallav singh) could say how big each
datum is (10 GB of 8-byte records isn't handled by the same data
structure as 10 GB of 1 GB records).

Leo
 
Juha Nieminen

Jorgen Grahn said:
Are you two /really/ talking about exponential, or are you using the
word to mean "ridiculously slow"?

Ok, it's probably not technically speaking O(2^n) with swapping, just
extremely slow. My point was, however, that the cause of this is not
caching or virtual memory, but disk swapping.
 
Jorgen Grahn

On 13/08/2011 14:22, Jorgen Grahn wrote:
[...]
[...] Maybe also try going
from an unordered_map<Key, Value> to an unordered_map<Key, Value*>, to
see if that is more cache-friendly.

/Jorgen

I don't think an unordered_map caches well, because of its random-like
distribution. Unless you are always requesting the same elements, the
processor cache can't usefully decide which of the thousands of entries
in your map it should keep.

True, but my idea was that unordered_map<Key, Value*> might be better
than unordered_map<Key, Value>, which was what he was already using.

Leo Equinox Gaspard said:
BTW, to get better advice, you (Pallav singh) could say how big each
datum is (10 GB of 8-byte records isn't handled by the same data
structure as 10 GB of 1 GB records).

Yes, moving to unordered_map<Key, Value*> can only be useful if his
value type is much bigger than his key. I forgot to point that out.

/Jorgen
 
