"PriorityMap"


Andreas Leitgeb

I'm in search of a data structure for the following use case:

Some algorithm spits out *lots* of pairs (k,v) of type (int,long).
I only want either the ten(*) pairs with the lowest keys, or all
pairs with the lowest ten(*) keys (whichever is easier to achieve).
The keys (k) are not necessarily unique; in fact I usually get a
couple of different values (v) for each k, and I'm interested only
in the v's that came with those ten(*) smallest k's. The actual
k's do not matter afterwards.

There are far too many to save them all and filter afterwards.

PriorityQueue and TreeMap do not seem to make it easy to
efficiently prune them beyond the first ten(*) entries
after each add().

Since it's called in a long loop, efficiency is an issue, but
the algorithm itself is also non-trivial, so I can afford some
extra CPU cycles.

(*): I'd prefer if the "ten" does not need to be hardwired.
 

Patricia Shanahan

Andreas said:
PriorityQueue and TreeMap do not seem to make it easy to
efficiently prune them beyond the first ten(*) entries
after each add().

I'd use a PriorityQueue but with either a Comparable element type or a
Comparator that uses descending order of the key values. That way, the
head item is the one with the largest key, and the one that should be
removed when an insertion increases the queue size beyond 10.

Patricia
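Patricia's suggestion could be sketched like this (the `Pair` record, the class name, and the `offer`/`values` methods are my own illustration, not code from the thread):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Keeps the n pairs with the smallest keys, using a reversed-order heap. */
class SmallestN {
    record Pair(int key, long value) {}

    private final int limit;
    // Descending order by key: the head is the pair with the LARGEST key,
    // i.e. exactly the one to evict when the queue grows past the limit.
    private final PriorityQueue<Pair> heap;

    SmallestN(int limit) {
        this.limit = limit;
        this.heap = new PriorityQueue<>(limit + 1,
                Comparator.comparingInt(Pair::key).reversed());
    }

    void offer(int key, long value) {
        if (heap.size() < limit) {
            heap.add(new Pair(key, value));
        } else if (key < heap.peek().key()) {
            heap.add(new Pair(key, value));
            heap.poll();                // evict the current largest key
        }
        // otherwise: key is no better than the worst kept pair, ignore it
    }

    List<Long> values() {
        List<Long> out = new ArrayList<>();
        for (Pair p : heap) out.add(p.value());
        return out;
    }
}
```

Note this keeps the ten lowest *pairs*, not all pairs for the ten lowest *keys* (the distinction Andreas draws later in the thread).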
 

Eric Sosman

Andreas said:
I'm in search of a data structure for the following use case:
Some algorithm spits out *lots* of pairs (k,v) of type (int,long).
[...]
There are far too many to save them all and filter afterwards.

Since you have "*lots*" of (k,v) pairs, it would probably
be nice to avoid the overhead of creating an Integer and a Long
and a Map.Entry (or similar) for each pair. I'd suggest just
coding the thing up for the purpose at hand -- a simple heap
made out of an int[] and a long[] seems attractive. Should be
pretty fast, too: Once you've been through a few thousand pairs,
most of the remaining pairs will be bigger than the heap's root
and will be rejected on the first comparison.
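Eric's parallel-array heap might look like the following sketch (class and method names are mine; it is a max-heap on the keys, so the root is the worst pair currently kept):

```java
import java.util.Arrays;

/** Fixed-size max-heap on parallel primitive arrays; keeps the n smallest keys. */
class PrimitiveTopN {
    private final int[] keys;
    private final long[] vals;
    private int size;

    PrimitiveTopN(int n) { keys = new int[n]; vals = new long[n]; }

    void offer(int k, long v) {
        if (size < keys.length) {            // still filling: sift the new pair up
            int i = size++;
            keys[i] = k; vals[i] = v;
            while (i > 0) {
                int parent = (i - 1) / 2;
                if (keys[parent] >= keys[i]) break;
                swap(i, parent);
                i = parent;
            }
        } else if (k < keys[0]) {            // beats the current worst: replace root
            keys[0] = k; vals[0] = v;
            siftDown();
        }
        // otherwise: rejected after a single primitive comparison, no boxing
    }

    private void siftDown() {
        int i = 0;
        while (true) {
            int l = 2 * i + 1, r = l + 1, big = i;
            if (l < size && keys[l] > keys[big]) big = l;
            if (r < size && keys[r] > keys[big]) big = r;
            if (big == i) break;
            swap(i, big);
            i = big;
        }
    }

    private void swap(int a, int b) {
        int tk = keys[a]; keys[a] = keys[b]; keys[b] = tk;
        long tv = vals[a]; vals[a] = vals[b]; vals[b] = tv;
    }

    long[] values() { return Arrays.copyOf(vals, size); }
}
```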
 

Daniel Pitts

Eric said:
Andreas said: [...]

Since you have "*lots*" of (k,v) pairs, it would probably
be nice to avoid the overhead of creating an Integer and a Long
and a Map.Entry (or similar) for each pair. [...]

Personally, I'd try three approaches, and benchmark to pick the best:

a) add every result to a SortedMap (or sorted set) and then do an
iteration to find the first ten.
b) Create a Pair class, keep a List of all the Pairs. Sort and then get
a sublist
c) Create a List, as you add things, utilize Collections.binarySearch,
and insert into the corresponding location, truncating the list if it
becomes too long.

All three of these approaches have different merits, depending on the
structure and amount of data. If I had to bet before testing, I'd put
my money on (c), but it is also the most complicated to code (though
not by much at all).
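Option (c) could be sketched like this (a hypothetical `SortedTopN` class; `Collections.binarySearch` with a `Comparator` returns `-(insertionPoint) - 1` when the key is absent, which the code below decodes):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Keeps a sorted list of the n pairs with the smallest keys. */
class SortedTopN {
    record Pair(int key, long value) {}

    private static final Comparator<Pair> BY_KEY =
            Comparator.comparingInt(Pair::key);

    private final int limit;
    private final List<Pair> pairs = new ArrayList<>();

    SortedTopN(int limit) { this.limit = limit; }

    void offer(int key, long value) {
        Pair p = new Pair(key, value);
        int pos = Collections.binarySearch(pairs, p, BY_KEY);
        if (pos < 0) pos = -pos - 1;        // decode the insertion point
        if (pos >= limit) return;            // would land past the cutoff anyway
        pairs.add(pos, p);
        if (pairs.size() > limit)            // truncate if the list got too long
            pairs.remove(pairs.size() - 1);
    }

    List<Pair> result() { return pairs; }
}
```

An ArrayList insert shifts elements, but with a limit of ten the shift is tiny, and most offers exit after the binary search alone.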

I'd be curious to see your results.
 

Logan Shaw

Eric said:
[...] I'd suggest just coding the thing up for the purpose at
hand -- a simple heap made out of an int[] and a long[] seems
attractive. Should be pretty fast, too: Once you've been through
a few thousand pairs, most of the remaining pairs will be bigger
than the heap's root and will be rejected on the first comparison.

Yeah, assuming you don't have the bad luck of somehow occasionally
getting all the pairs in decreasing order.

Another approach is to just ditch the heap and do a linear search.
In other words, keep track of the largest int value in the int[]
array, and if you find one smaller, find the one to evict using a
brute-force linear search. Since this will happen so infrequently,
and since 10 is very small in comparison to millions, the cost
should be negligible. An advantage of this approach is that it's
easy to make it "stable", which I'm defining as: if two pairs with
equal keys arrive but they both can't fit in the output of ten
pairs, you keep the first one you saw.
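A sketch of that linear-scan variant (class name and details are my own; the `>=` comparison on the cached worst key is what makes it stable in the sense above):

```java
import java.util.Arrays;

/** Unordered parallel arrays; evicts by brute-force scan on the rare hit. */
class LinearTopN {
    private final int[] keys;
    private final long[] vals;
    private int size;
    private int worst = Integer.MIN_VALUE;   // largest key currently kept

    LinearTopN(int n) { keys = new int[n]; vals = new long[n]; }

    void offer(int k, long v) {
        if (size < keys.length) {            // still filling: just append
            keys[size] = k; vals[size] = v; size++;
            if (k > worst) worst = k;
            return;
        }
        if (k >= worst) return;              // ">=" keeps the first-seen pair
        int evict = 0;                       // rare path: scan for the worst slot
        for (int i = 1; i < size; i++)
            if (keys[i] > keys[evict]) evict = i;
        keys[evict] = k; vals[evict] = v;
        worst = keys[0];                     // rescan for the new largest key
        for (int i = 1; i < size; i++)
            if (keys[i] > worst) worst = keys[i];
    }

    long[] values() { return Arrays.copyOf(vals, size); }
}
```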

Also, you could use a heap, but optimize the common case by duplicating
the value at the top of the heap as a separate variable of type int.
Then you get the full generality and good asymptotic performance,
but 99.9% of the time when you are comparing to the top of the heap,
you avoid creating an Integer object.

- Logan
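Logan's cached-top optimization might look like this sketch (the class name, the `long[]{key, value}` element encoding, and the bounded-size policy are my own assumptions; the point is that the common rejection path touches only a primitive int):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Heap whose root key is mirrored in a primitive int, so the common
 *  "reject immediately" test avoids touching any boxed object. */
class CachedTopHeap {
    private final int limit;
    private final PriorityQueue<long[]> heap;  // {key, value}, largest key at head
    private int rootKey = Integer.MAX_VALUE;   // mirror of heap.peek()[0]

    CachedTopHeap(int limit) {
        this.limit = limit;
        this.heap = new PriorityQueue<>(limit + 1,
                Comparator.comparingLong((long[] p) -> p[0]).reversed());
    }

    void offer(int key, long value) {
        // The common case: full heap, key no better than the root.
        // One primitive comparison, nothing allocated.
        if (heap.size() == limit && key >= rootKey) return;
        heap.add(new long[] { key, value });
        if (heap.size() > limit) heap.poll();  // evict the largest key
        rootKey = (int) heap.peek()[0];        // refresh the cached root
    }

    long[] values() {
        return heap.stream().mapToLong(p -> p[1]).toArray();
    }
}
```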
 

Andreas Leitgeb

Patricia Shanahan said:
I'd use a PriorityQueue but with either a Comparable element type or a
Comparator that uses descending order of the key values. That way, the
head item is the one with the largest key, and the one that should be
removed when an insertion increases the queue size beyond 10.

Cool idea! I didn't think of that, but now it seems obvious :)

Thanks a lot!
 

Andreas Leitgeb

Daniel Pitts said:
Personally, I'd try three approaches, and benchmark to pick the best:
a) add every result to a SortedMap (or sorted set) and then do an
iteration to find the first ten.

While I excluded the possibility of saving all elements, your
mention of SortedMap helped me a lot. I wasn't aware of TreeMap's
"lastKey()" method before.

After further inspection of my problem, I noticed that I really
needed them filtered by rank of key only, not by total number
of pairs. Unfortunately this knocked out Patricia's cool idea of
using a reversed PriorityQueue.

I now use a TreeMap (which is also a SortedMap) that maps Integer
to LinkedList<Long>, and for each pair (key,val) I do the following:
- If key is already in the map, just add() val to the associated list.
- Otherwise, if key is better than the SortedMap's lastKey(), put()
  the key with a new list of just the val, and then remove the
  lastKey() entry if the map exceeded the wanted size.
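These steps can be sketched as follows (the class name and the `offer`/`values` methods are my own framing of the description above; the under-capacity case, which the description leaves implicit, is handled by the size check):

```java
import java.util.Collection;
import java.util.LinkedList;
import java.util.TreeMap;

/** Keeps all values whose key is among the n smallest distinct keys. */
class RankedByKey {
    private final int limit;                 // number of distinct keys kept
    private final TreeMap<Integer, LinkedList<Long>> map = new TreeMap<>();

    RankedByKey(int limit) { this.limit = limit; }

    void offer(int key, long value) {
        LinkedList<Long> list = map.get(key);
        if (list != null) {                  // key already kept: just append
            list.add(value);
            return;
        }
        // Reject keys that are no better than the current worst, once full.
        if (map.size() == limit && key >= map.lastKey()) return;
        LinkedList<Long> fresh = new LinkedList<>();
        fresh.add(value);
        map.put(key, fresh);
        if (map.size() > limit)              // drop the worst key's whole list
            map.remove(map.lastKey());
    }

    /** Value lists in ascending key order. */
    Collection<LinkedList<Long>> values() { return map.values(); }
}
```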

I'm quite happy with the current performance, and will address
other parts of my program before optimizing this part further.

Thanks to all who participated!
 
