? about priority_queue how make objects unique

Rich S. · Oct 5, 2005

Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.

mlimber · Oct 5, 2005

Rich said:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.

Are you getting duplicates because there are duplicates in your file(s)
or because there is an error in the program? (If the former, you might
consider using a std::unique, perhaps in conjunction with a
std::vector, to find the first N unique records. See the docs here:
http://www.sgi.com/tech/stl/unique.html .)

Cheers! --M

Kai-Uwe Bux · Oct 5, 2005

Rich said:
Hi

In an earlier posting I was asking how to read thru millions of data
records keeping the top 2000 (where the top values are based upon a field
in the record) and niklasb suggested using a priority_queue like so

------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT
greater than the top and then pushing obj but I'm convinced thats the best
way.

If you care about uniqueness, then maybe std::set is the way to go. Consider
someting like this:
#include <set>

template < typename T, unsigned long N >
struct top_N {

std::set<T> elements;

void insert ( T const & t ) {
elements.insert( t );
if ( elements.size() > N ) {
elements.erase( elements.begin() );
}
}

}; // struct top_N

Best

Kai-Uwe Bux

Kristo · Oct 5, 2005

Rich said:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

[snip old message]

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

That's too much work IMHO, and probably inefficient. Just let the set
do that for you.

Kristo

Mark P · Oct 6, 2005

Rich said:
Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

------------------- start old message ---------------------
A more efficient way than what you describe would be
to use priority_queue. Define the predicate in such
a way that the top() always returns the lowest scoring
object.

Let q be the priority_queue and suppose you want to
find the top N scores (where N is 2000 in your example),
the pseudo-code would be something like this:

q.reserve(N);

// add the first N elements to the priority queue
for (i = 0; i < N; ++i)
{
if (!ReadRecord(&obj))
break;

q.push(obj);
}

// invariant: q contains highest N elements
// q.top() is the lowest of the highest N elements
while (ReadRecord(&obj))
{
if (obj > q.top())
{
q.pop();
q.push(obj);
}
}

------------------- end of old message ---------------------

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

I was thinking of running a loop around the pop until the obj is NOT greater
than the top and then pushing obj but I'm convinced thats the best way.

Thanks for any help.

Regards.

I agree with previous replies that a std::set is a good approach (just
make sure to check the return value of insert to see if an item was
actually inserted, if you want to ensure always 2000 records). Another
option is a hashtable to record which elts. are currently in the queue.

niklasb · Oct 6, 2005

Kristo said:
Rich said:

Hi

In an earlier posting I was asking how to read thru millions of data records
keeping the top 2000 (where the top values are based upon a field in the
record) and niklasb suggested using a priority_queue like so

Click to expand...

[snip old message]

This is working well for me but I'm getting cases where I have duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

Click to expand...

What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.

I believe the underlying container for priority_queue must be a
sequence container rather than an associative container. However,
you could use set _instead_ of priority_queue. The algorithm
doesn't change very much.

As before, this is just off the top of my head. I haven't compiled
it, much lest tested it, so use at your own risk and all that.

std::set<MyType> highest;
MyType obj;
while (ReadRecord(&obj))
{
// Is obj one of the highest so far?
if (highest.count() < count || obj > *highest.begin())
{
// Insert it into the set; this has no effect if the set
// already contains an identical value.
highest.insert(obj);

// If too many elements discard the lowest one.
if (highest.count() > count)
{
highest.erase(highest.begin());
}
}
}

Rich S. · Oct 6, 2005

mlimber said:
Are you getting duplicates because there are duplicates in your file(s)
or because there is an error in the program? (If the former, you might
consider using a std::unique, perhaps in conjunction with a
std::vector, to find the first N unique records. See the docs here:
http://www.sgi.com/tech/stl/unique.html .)

Cheers! --M

yes duplicates in my data which I expected. I switched to a set based
solution and it works good so far.

Thanks for the tip about unique that'll be handy one day.

Rich S. · Oct 6, 2005

Kristo said:
Kristo said:

Rich said:

Hi

In an earlier posting I was asking how to read thru millions of data
records
keeping the top 2000 (where the top values are based upon a field in
the
record) and niklasb suggested using a priority_queue like so

Click to expand...

[snip old message]

This is working well for me but I'm getting cases where I have
duplicate
records in the top 2000 and I'm not quite sure how to fix that problem.

Click to expand...

What if you used a set as the underlying container for your
priority_queue? That would automagically prevent duplicates from being
inserted.

Click to expand...

I believe the underlying container for priority_queue must be a
sequence container rather than an associative container. However,
you could use set _instead_ of priority_queue. The algorithm
doesn't change very much.

As before, this is just off the top of my head. I haven't compiled
it, much lest tested it, so use at your own risk and all that.

std::set<MyType> highest;
MyType obj;
while (ReadRecord(&obj))
{
// Is obj one of the highest so far?
if (highest.count() < count || obj > *highest.begin())
{
// Insert it into the set; this has no effect if the set
// already contains an identical value.
highest.insert(obj);

// If too many elements discard the lowest one.
if (highest.count() > count)
{
highest.erase(highest.begin());
}
}
}

I switched to a set based solution and it works pretty good.

Thanks

Rich S. · Oct 6, 2005

Mark P said:
I agree with previous replies that a std::set is a good approach (just
make sure to check the return value of insert to see if an item was
actually inserted, if you want to ensure always 2000 records). Another
option is a hashtable to record which elts. are currently in the queue.

I switched to a set but was thinking about a map solution for speed.

Rich S. · Oct 6, 2005

Kai-Uwe Bux said:
If you care about uniqueness, then maybe std::set is the way to go.
Consider
someting like this:
#include <set>

template < typename T, unsigned long N >
struct top_N {

std::set<T> elements;

void insert ( T const & t ) {
elements.insert( t );
if ( elements.size() > N ) {
elements.erase( elements.begin() );
}
}

}; // struct top_N

Best

Kai-Uwe Bux

Thanks, I did switch to a set solution and it works well.

Question about loggers	26	Mar 7, 2012
Getting carousel to continually rotate	0	Sep 27, 2017
How to design an object to get references to other needed objects?	1	Feb 25, 2009
Wondering about JCR	12	Dec 19, 2009
Please Help Newbie ( While Loop Validation Problem)	16	Feb 11, 2007
Logic check please	5	Sep 21, 2005
Can't make this page work	6	Mar 8, 2006
first ruby program, how do I make it faster?	1	Nov 15, 2006

? about priority_queue how make objects unique

Rich S.

mlimber

Kai-Uwe Bux

Kristo

Mark P

niklasb

Rich S.

Rich S.

Rich S.

Rich S.

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads