Selecting Container for "Top 20" Application

J

Jorgen Grahn

for(..300K items..)
{
get(item);
pq.push(item);
if (pq.size() > 20)
pq.pop()
}


The priority queue always has the "highest"* value available for
immediate removal. Let's say there are 20 items in the priority
queue, stuff a new item in, and then you have 21, with the "highest"
one of those immediately removable. Since we only want the top 20,
just pull that extra one off and discard it.

Ah, of course! I was mentally placing the elements in the wrong
order, so that pop() would prematurely remove a candidate for the
top-20 list rather than an obvious non-candidate.

You're right, this is the obvious way to do it.

And perhaps I should be more used to dealing with priority queues.
I *do* know what it is, but I couldn't apply it correctly to this
problem.

/Jorgen
 
J

James Kanze

I am looking for an STL container that will efficiently handle a "Top
20" list. Specifically, I have >300,000 data records that I must scan
to find and store the highest 20 values.
None of the available STL containers seems well suited for this
application. For example:
- array, even though fixed in size, seems cumbersome to use: no delete
or erase function is available; insertion at some point is difficult;
etc. Replacement of the "end value" and sorting might work, but it's
tedious and slow...
- vector, while having insert & erase functions, incurs repeated size
expansion (expensive!) and truncation to maintain a fixed size.
Appending a new value if it's greater than the 20th element, followed by
sorting and truncation might be less work, but it might be very slow to
execute.
- queue/deque, set, and stack seem inappropriate for this application,
and the others (map, list, etc.) are completely unusable here.
Am I missing something? Is/are there reasonably efficient ways to
use array or vector that are appropriate? TIA

The most obvious solution is std::vector, but keeping a maximum
of 20 elements in it, and using std::push_heap and std::pop_heap
to manage it. All of the algorithms you need are already
present in the standard library.
 
L

Luca Risolia

Sebastian said:
Finding top-elements via heap-extract: m*O(log n)

So a heap can be the better choice. It would be interesting to measure,
if this is true with n=300k and m=20.

Oh right, std::vector and std::make/push/pop_heap() might be a better choice,
but probably std::priority_queue is not ideal for the user needs, as it would
require to remove the elements from the container each time to get the "top
20" list.
 
J

Jorgen Grahn

Oh right, std::vector and std::make/push/pop_heap() might be a better choice,
but probably std::priority_queue is not ideal for the user needs, as it would
require to remove the elements from the container each time to get the "top
20" list.

Not /each/ time. Most elements can be discarded immediately without
inserting them, unless the input is already sorted.

/Jorgen
 
S

Sebastian Pfützner

The most obvious solution is std::vector, but keeping a maximum
of 20 elements in it, and using std::push_heap and std::pop_heap
to manage it. All of the algorithms you need are already
present in the standard library.

This also has the advantage that you don't need to remove the
top-elements from the vector as with the priority_queue. std::pop_heap
just swaps the top_element into the back. Do this 20 times and you have
your top-elements in [last - 20, last). You can simply reinert those
elements back into the heap. You also follow the same procedure when you
need to change the value of elements. (pop, change back element, push).
 
S

Sebastian Pfützner

The most obvious solution is std::vector, but keeping a maximum
of 20 elements in it, and using std::push_heap and std::pop_heap
to manage it. All of the algorithms you need are already
present in the standard library.

This also has the advantage that you don't need to remove the
top-elements from the vector as with the priority_queue. std::pop_heap
just swaps the top_element into the back. Do this 20 times and you have
your top-elements in [last - 20, last). You can simply reinert those
elements back into the heap. You also follow the same procedure when you
need to change the value of elements. (pop, change back element, push).

To clarify: I would store all elements in this vector and keep the
heap-sort-order in sync. This has the advantage that it is still
efficient if the top-elements are computed more than once after elements
are added/removed/changed. James' proposal does not have this property,
because the heap hast to be rebuild every time by iterating over all
elements even if only some elements changed.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top