std::priority_queue: Abysmally bad performance versus std::set, can't figure out why.


Jeremy Murphy

Hi everyone,

I recently implemented A* for some research, and because I need both the pq interface AND an iterable interface to the open list, I (somewhat naively) started using std::set. Now that the basic implementation is done, out of curiosity more than anything else, I decided to go back and try std::priority_queue for the special-case problem where I don't need to iterate over the open list. (I also realized that the boost::heap pq implementations allow iterating, albeit with poor efficiency.)

To my surprise and horror, when I used std::priority_queue in place of std::set, the time and space performance is abysmally bad. So bad that I can't imagine why. For example, a problem of size n = 17 takes ~3 s and ~24 MB using std::set versus ~10 s and ~400 MB using std::priority_queue. At n = 20, the pq implementation actually tries to allocate > 3 GB of RAM and usually gets OOM-killed.

I am stumped as to what is going on, because the solver yields identical answers and debugging output for both implementations. I have tried both std::vector and std::deque as the underlying container.

Now here I have to apologize that my minimal working example is not very minimal at all. I tried to recreate the behaviour without all the A* business, but with no success. Here is a link to the library in question:

http://code.google.com/p/policy-based-search/

There are two example problems, of which the TSP is the only serious one and is where the problem is observable. The master branch is std::set, the pq_interface branch uses std::priority_queue. The TSP example requires one parameter, the problem size, e.g.: ./TSP 17.

I don't really expect people to have time to delve through my code, but if you do I would of course be incredibly humbled and appreciative. (I sincerely hope that you don't need to understand the TSP example itself and that just inspecting the library will suffice.) Even if you just have some ideas off the top of your head as to why std::priority_queue would behave like this, that would help. So that you don't have to go looking at the code to understand the basics, the queue element type is a 'search node' and is defined roughly as:

struct Node {
    Node(std::vector<unsigned int> S, std::shared_ptr<Node> P) : state(S), parent(P) {}
    std::vector<unsigned int> state;
    std::shared_ptr<Node> parent;
};

and the queue is then defined:

typedef std::shared_ptr<Node> Element;
typedef std::priority_queue<Element, std::vector<Element>, Comp> PriorityQueue;
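
For anyone who wants something compilable without the library, here is a minimal self-contained sketch of how these pieces fit together. Comp and cost() below are made-up stand-ins, not the library's actual comparator:

#include <memory>
#include <queue>
#include <vector>

struct Node {
    Node(std::vector<unsigned int> S, std::shared_ptr<Node> P) : state(S), parent(P) {}
    std::vector<unsigned int> state;
    std::shared_ptr<Node> parent;
};

// Made-up stand-in for the real cost function (A* would use f = g + h).
unsigned int cost(const Node &n) { return static_cast<unsigned int>(n.state.size()); }

// Made-up comparator: greater-than so that top() yields the cheapest node.
struct Comp {
    bool operator()(const std::shared_ptr<Node> &a, const std::shared_ptr<Node> &b) const {
        return cost(*a) > cost(*b);
    }
};

typedef std::shared_ptr<Node> Element;
typedef std::priority_queue<Element, std::vector<Element>, Comp> PriorityQueue;

int main() {
    PriorityQueue open;
    open.push(std::make_shared<Node>(std::vector<unsigned int>{0}, nullptr));
    while (!open.empty()) {
        Element best = open.top();   // pop the best node...
        open.pop();
        // ...expand it and push its successors (omitted here), then repeat.
    }
}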

Roughly, the algorithm pushes a bunch of nodes (the number depends on the problem size), pops one, and repeats. Sorry about the length of my explanation; I tried to say only what I thought necessary, but that amounted to a lot. :) Thanks in advance for any advice, small or large. Cheers.

Jeremy
 

Jeremy Murphy

So, I've done some quick analysis with the Massif tool in valgrind to compare std::set vs std::priority_queue in my A* implementation.

The tool really slows down processing, so I have only run it on a small problem, n = 13.
Both analysis results attribute the majority of heap usage to std::__shared_count<...> (the long template specification refers to Node). A second common heap element is a vector<unsigned int>. Both of these originate from shared_ptr_base.h, so they no doubt account for Node allocation.
Then each graph has a unique heap element. The set graph has a heap element attributed to std::_Rb_tree_iterator<...>, which comes from stl_set.h, so it is no doubt the overhead from using std::set. The pq graph has a unique section that is effectively just attributed to the search function; it grows in powers of 2, so I can only assume it is the std::vector of shared_ptrs backing the pq.

(MiB)   Total   __shared_count<...>   vector<uint>   _Rb_tree   Node
set     2.8     1.41                  0.52           0.926      0.0
pq      3.0     1.85                  0.69           0.0        0.512

So this is quite interesting for two reasons. Although the total difference is only ~200 KiB, the sum of the differences between the common heap elements is ~457 KiB, which says that the pq implementation is storing more Nodes than the set, right? But the unique heap elements are smaller for the pq than for the set, which agrees with theory and other tests that a std::priority_queue has less overhead than a std::set.

But wtf is going on?? In light of this, I just added some more debugging to track how many Nodes are allocated, and it is the same in each implementation. Argh! How can that be?
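
(For anyone curious how to get such a count, one simple approach is a static counter mixed into Node. This is just a sketch of the idea, not the library's code:)

#include <atomic>
#include <cstddef>
#include <cstdio>

// Sketch only: count how many Nodes were ever constructed and how many are
// still alive, so the two open-list implementations can be compared directly.
struct NodeCounter {
    static std::atomic<std::size_t> constructed;
    static std::atomic<std::size_t> alive;
    NodeCounter() { ++constructed; ++alive; }
    NodeCounter(const NodeCounter &) { ++constructed; ++alive; }
    ~NodeCounter() { --alive; }
};

std::atomic<std::size_t> NodeCounter::constructed(0);
std::atomic<std::size_t> NodeCounter::alive(0);

// struct Node : private NodeCounter { ... };  // mix the counter into Node

int main() {
    std::printf("constructed = %zu, alive = %zu\n",
                NodeCounter::constructed.load(), NodeCounter::alive.load());
    return 0;
}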

Jeremy
 

Öö Tiib

To my surprise and horror, when I used std::priority_queue in place of std::set, the time and space performance is abysmally bad. So bad that I can't imagine why. ....

struct Node {
    Node(std::vector<unsigned int> S, std::shared_ptr<Node> P) : state(S), parent(P) {}
    std::vector<unsigned int> state;
    std::shared_ptr<Node> parent;
};

and the queue is then defined:

typedef std::shared_ptr<Node> Element;
typedef std::priority_queue<Element, std::vector<Element>, Comp> PriorityQueue;

I can't really say what is going on ... perhaps make a bit smaller example that demos
the problem.

My experience is directly opposite: priority_queue washes the floor with sets,
intrusive_sets, maps, sorted lists etc. in contexts where you always pop the best
and alternate pops and pushes rather rapidly. Also, the vector that priority_queue
adapts is way more memory efficient than std::set.

So ... it feels like you have a defect somewhere but I can't show where, sorry.
One note ... I tend to remove usage of shared_ptr in performance-critical algorithms.
It is heavy and performs way worse than some index, iterator, unique_ptr or raw pointer.
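
For illustration only (the names here are made up), one way to apply that advice is to keep every node in a pool that outlives the search and store parents as plain indices instead of shared_ptrs:

#include <cstddef>
#include <deque>
#include <vector>

// Sketch of the "index instead of shared_ptr" idea: every Node lives in one
// pool for the whole search, and `parent` is an index into that pool.
struct Node {
    std::vector<unsigned int> state;
    std::size_t parent;   // index of the parent node; size_t(-1) marks the root
};

struct NodePool {
    std::deque<Node> nodes;   // deque: elements stay put as the pool grows

    std::size_t add(std::vector<unsigned int> state, std::size_t parent) {
        nodes.push_back(Node{state, parent});
        return nodes.size() - 1;
    }
    const Node &operator[](std::size_t i) const { return nodes[i]; }
};

int main() {
    NodePool pool;
    std::size_t root  = pool.add(std::vector<unsigned int>{0}, std::size_t(-1));
    std::size_t child = pool.add(std::vector<unsigned int>{0, 1}, root);
    (void)child;
    return 0;
}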
 

red floyd

I can't really say what is going on ... perhaps make a bit smaller example that demos
the problem.

My experience is directly opposite: priority_queue washes the floor with sets,
intrusive_sets, maps, sorted lists etc. in contexts where you always pop the best
and alternate pops and pushes rather rapidly. Also, the vector that priority_queue
adapts is way more memory efficient than std::set.

So ... it feels like you have a defect somewhere but I can't show where, sorry.
One note ... I tend to remove usage of shared_ptr in performance-critical algorithms.
It is heavy and performs way worse than some index, iterator, unique_ptr or raw pointer.
Maybe OP's queue is reallocating the vector as he adds to the
priority queue, while std::set is node based, so that the memory
usage is higher for priority queue.
 

Öö Tiib

Maybe OP's queue is reallocating the vector as he adds to the
priority queue, while std::set is node based, so that the memory
usage is higher for priority queue.

Hmm. Indeed. It is easy to find out if that is the problem by switching
to an underlying deque:

typedef std::priority_queue<Element, std::deque<Element>, Comp> PriorityQueue;

If that improves performance then the problem is that memory is getting
rather madly fragmented by the algorithm. Fortunately there are always
ways to do things a lot less dynamically.
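
For example (purely illustrative; make_reserved_pq is a made-up helper): std::priority_queue has no reserve() of its own, but since C++11 you can hand it a container that has already reserved capacity, so pushes do not reallocate until the estimate is exceeded:

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Build a priority_queue whose underlying vector has pre-reserved capacity.
template <typename Element, typename Comp>
std::priority_queue<Element, std::vector<Element>, Comp>
make_reserved_pq(std::size_t expected_size) {
    std::vector<Element> storage;
    storage.reserve(expected_size);
    // C++11 constructor: priority_queue(const Compare&, Container&&)
    return std::priority_queue<Element, std::vector<Element>, Comp>(Comp(), std::move(storage));
}

int main() {
    std::priority_queue<int, std::vector<int>, std::less<int> > pq =
        make_reserved_pq<int, std::less<int> >(1 << 20);
    pq.push(42);
    return 0;
}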
 

Jeremy Murphy

Thanks, everyone, for your replies, and sorry about the long delay in mine.
Like most baffling programming problems, the issue did turn out to be my own stupidity: I was using a comparison function that is not a total order over the nodes. That works as expected in a std::priority_queue, but not in a std::set: the set treats two elements as equivalent, and keeps only one of them, when neither orders before the other, so its enforcement of uniqueness was silently discarding nodes from the open list.
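
Here is a tiny standalone demo of the difference (the comparator is made up, but it has the same kind of flaw):

#include <iostream>
#include <queue>
#include <set>
#include <vector>

// A comparator that only looks at part of the key. std::set treats two
// elements as equivalent (and silently keeps just one) when neither orders
// before the other, while std::priority_queue stores every pushed element.
struct ByFirst {
    bool operator()(const std::vector<unsigned int> &a,
                    const std::vector<unsigned int> &b) const {
        return a[0] < b[0];
    }
};

int main() {
    std::set<std::vector<unsigned int>, ByFirst> s;
    std::priority_queue<std::vector<unsigned int>,
                        std::vector<std::vector<unsigned int>>, ByFirst> q;

    std::vector<unsigned int> a{1, 7}, b{1, 9};   // equivalent under ByFirst
    s.insert(a);  s.insert(b);    // second insert is rejected
    q.push(a);    q.push(b);      // both are kept

    std::cout << "set: " << s.size() << ", pq: " << q.size() << '\n';   // set: 1, pq: 2
}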

Thanks for your help, cheers.

Jeremy
 
