? about priority_queue how make objects unique

Discussion in 'C++' started by Rich S., Oct 5, 2005.

  1. Rich S.

    Rich S. Guest

    Hi

    In an earlier posting I was asking how to read thru millions of data records
    keeping the top 2000 (where the top values are based upon a field in the
    record) and niklasb suggested using a priority_queue like so


    ------------------- start old message ---------------------
    A more efficient way than what you describe would be
    to use priority_queue. Define the predicate in such
    a way that the top() always returns the lowest scoring
    object.

    Let q be the priority_queue and suppose you want to
    find the top N scores (where N is 2000 in your example),
    the pseudo-code would be something like this:

    q.reserve(N);

    // add the first N elements to the priority queue
    for (i = 0; i < N; ++i)
    {
    if (!ReadRecord(&obj))
    break;

    q.push(obj);
    }

    // invariant: q contains highest N elements
    // q.top() is the lowest of the highest N elements
    while (ReadRecord(&obj))
    {
    if (obj > q.top())
    {
    q.pop();
    q.push(obj);
    }
    }

    ------------------- end of old message ---------------------

    This is working well for me but I'm getting cases where I have duplicate
    records in the top 2000 and I'm not quite sure how to fix that problem.

    I was thinking of running a loop around the pop until the obj is NOT greater
    than the top and then pushing obj but I'm convinced thats the best way.

    Thanks for any help.

    Regards.
     
    Rich S., Oct 5, 2005
    #1
    1. Advertising

  2. Rich S.

    mlimber Guest

    Rich S. wrote:
    > Hi
    >
    > In an earlier posting I was asking how to read thru millions of data records
    > keeping the top 2000 (where the top values are based upon a field in the
    > record) and niklasb suggested using a priority_queue like so
    >
    >
    > ------------------- start old message ---------------------
    > A more efficient way than what you describe would be
    > to use priority_queue. Define the predicate in such
    > a way that the top() always returns the lowest scoring
    > object.
    >
    > Let q be the priority_queue and suppose you want to
    > find the top N scores (where N is 2000 in your example),
    > the pseudo-code would be something like this:
    >
    > q.reserve(N);
    >
    > // add the first N elements to the priority queue
    > for (i = 0; i < N; ++i)
    > {
    > if (!ReadRecord(&obj))
    > break;
    >
    > q.push(obj);
    > }
    >
    > // invariant: q contains highest N elements
    > // q.top() is the lowest of the highest N elements
    > while (ReadRecord(&obj))
    > {
    > if (obj > q.top())
    > {
    > q.pop();
    > q.push(obj);
    > }
    > }
    >
    > ------------------- end of old message ---------------------
    >
    > This is working well for me but I'm getting cases where I have duplicate
    > records in the top 2000 and I'm not quite sure how to fix that problem.
    >
    > I was thinking of running a loop around the pop until the obj is NOT greater
    > than the top and then pushing obj but I'm convinced thats the best way.
    >
    > Thanks for any help.
    >
    > Regards.


    Are you getting duplicates because there are duplicates in your file(s)
    or because there is an error in the program? (If the former, you might
    consider using a std::unique, perhaps in conjunction with a
    std::vector, to find the first N unique records. See the docs here:
    http://www.sgi.com/tech/stl/unique.html .)

    Cheers! --M
     
    mlimber, Oct 5, 2005
    #2
    1. Advertising

  3. Rich S.

    Kai-Uwe Bux Guest

    Rich S. wrote:

    > Hi
    >
    > In an earlier posting I was asking how to read thru millions of data
    > records keeping the top 2000 (where the top values are based upon a field
    > in the record) and niklasb suggested using a priority_queue like so
    >
    >
    > ------------------- start old message ---------------------
    > A more efficient way than what you describe would be
    > to use priority_queue. Define the predicate in such
    > a way that the top() always returns the lowest scoring
    > object.
    >
    > Let q be the priority_queue and suppose you want to
    > find the top N scores (where N is 2000 in your example),
    > the pseudo-code would be something like this:
    >
    > q.reserve(N);
    >
    > // add the first N elements to the priority queue
    > for (i = 0; i < N; ++i)
    > {
    > if (!ReadRecord(&obj))
    > break;
    >
    > q.push(obj);
    > }
    >
    > // invariant: q contains highest N elements
    > // q.top() is the lowest of the highest N elements
    > while (ReadRecord(&obj))
    > {
    > if (obj > q.top())
    > {
    > q.pop();
    > q.push(obj);
    > }
    > }
    >
    > ------------------- end of old message ---------------------
    >
    > This is working well for me but I'm getting cases where I have duplicate
    > records in the top 2000 and I'm not quite sure how to fix that problem.
    >
    > I was thinking of running a loop around the pop until the obj is NOT
    > greater than the top and then pushing obj but I'm convinced thats the best
    > way.


    If you care about uniqueness, then maybe std::set is the way to go. Consider
    someting like this:
    #include <set>

    template < typename T, unsigned long N >
    struct top_N {

    std::set<T> elements;

    void insert ( T const & t ) {
    elements.insert( t );
    if ( elements.size() > N ) {
    elements.erase( elements.begin() );
    }
    }

    }; // struct top_N


    Best

    Kai-Uwe Bux
     
    Kai-Uwe Bux, Oct 5, 2005
    #3
  4. Rich S.

    Kristo Guest

    Rich S. wrote:
    > Hi
    >
    > In an earlier posting I was asking how to read thru millions of data records
    > keeping the top 2000 (where the top values are based upon a field in the
    > record) and niklasb suggested using a priority_queue like so


    [snip old message]

    > This is working well for me but I'm getting cases where I have duplicate
    > records in the top 2000 and I'm not quite sure how to fix that problem.


    What if you used a set as the underlying container for your
    priority_queue? That would automagically prevent duplicates from being
    inserted.

    > I was thinking of running a loop around the pop until the obj is NOT greater
    > than the top and then pushing obj but I'm convinced thats the best way.


    That's too much work IMHO, and probably inefficient. Just let the set
    do that for you.

    Kristo
     
    Kristo, Oct 5, 2005
    #4
  5. Rich S.

    Mark P Guest

    Rich S. wrote:
    > Hi
    >
    > In an earlier posting I was asking how to read thru millions of data records
    > keeping the top 2000 (where the top values are based upon a field in the
    > record) and niklasb suggested using a priority_queue like so
    >
    >
    > ------------------- start old message ---------------------
    > A more efficient way than what you describe would be
    > to use priority_queue. Define the predicate in such
    > a way that the top() always returns the lowest scoring
    > object.
    >
    > Let q be the priority_queue and suppose you want to
    > find the top N scores (where N is 2000 in your example),
    > the pseudo-code would be something like this:
    >
    > q.reserve(N);
    >
    > // add the first N elements to the priority queue
    > for (i = 0; i < N; ++i)
    > {
    > if (!ReadRecord(&obj))
    > break;
    >
    > q.push(obj);
    > }
    >
    > // invariant: q contains highest N elements
    > // q.top() is the lowest of the highest N elements
    > while (ReadRecord(&obj))
    > {
    > if (obj > q.top())
    > {
    > q.pop();
    > q.push(obj);
    > }
    > }
    >
    > ------------------- end of old message ---------------------
    >
    > This is working well for me but I'm getting cases where I have duplicate
    > records in the top 2000 and I'm not quite sure how to fix that problem.
    >
    > I was thinking of running a loop around the pop until the obj is NOT greater
    > than the top and then pushing obj but I'm convinced thats the best way.
    >
    > Thanks for any help.
    >
    > Regards.
    >
    >


    I agree with previous replies that a std::set is a good approach (just
    make sure to check the return value of insert to see if an item was
    actually inserted, if you want to ensure always 2000 records). Another
    option is a hashtable to record which elts. are currently in the queue.
     
    Mark P, Oct 6, 2005
    #5
  6. Rich S.

    Guest

    Kristo wrote:
    > Rich S. wrote:
    > > Hi
    > >
    > > In an earlier posting I was asking how to read thru millions of data records
    > > keeping the top 2000 (where the top values are based upon a field in the
    > > record) and niklasb suggested using a priority_queue like so

    >
    > [snip old message]
    >
    > > This is working well for me but I'm getting cases where I have duplicate
    > > records in the top 2000 and I'm not quite sure how to fix that problem.

    >
    > What if you used a set as the underlying container for your
    > priority_queue? That would automagically prevent duplicates from being
    > inserted.


    I believe the underlying container for priority_queue must be a
    sequence container rather than an associative container. However,
    you could use set _instead_ of priority_queue. The algorithm
    doesn't change very much.

    As before, this is just off the top of my head. I haven't compiled
    it, much lest tested it, so use at your own risk and all that.

    std::set<MyType> highest;
    MyType obj;
    while (ReadRecord(&obj))
    {
    // Is obj one of the highest so far?
    if (highest.count() < count || obj > *highest.begin())
    {
    // Insert it into the set; this has no effect if the set
    // already contains an identical value.
    highest.insert(obj);

    // If too many elements discard the lowest one.
    if (highest.count() > count)
    {
    highest.erase(highest.begin());
    }
    }
    }
     
    , Oct 6, 2005
    #6
  7. Rich S.

    Rich S. Guest

    "mlimber" <> wrote in message
    news:...
    > Rich S. wrote:
    >> Hi
    >>
    >> In an earlier posting I was asking how to read thru millions of data
    >> records
    >> keeping the top 2000 (where the top values are based upon a field in the
    >> record) and niklasb suggested using a priority_queue like so
    >>
    >>
    >> ------------------- start old message ---------------------
    >> A more efficient way than what you describe would be
    >> to use priority_queue. Define the predicate in such
    >> a way that the top() always returns the lowest scoring
    >> object.
    >>
    >> Let q be the priority_queue and suppose you want to
    >> find the top N scores (where N is 2000 in your example),
    >> the pseudo-code would be something like this:
    >>
    >> q.reserve(N);
    >>
    >> // add the first N elements to the priority queue
    >> for (i = 0; i < N; ++i)
    >> {
    >> if (!ReadRecord(&obj))
    >> break;
    >>
    >> q.push(obj);
    >> }
    >>
    >> // invariant: q contains highest N elements
    >> // q.top() is the lowest of the highest N elements
    >> while (ReadRecord(&obj))
    >> {
    >> if (obj > q.top())
    >> {
    >> q.pop();
    >> q.push(obj);
    >> }
    >> }
    >>
    >> ------------------- end of old message ---------------------
    >>
    >> This is working well for me but I'm getting cases where I have duplicate
    >> records in the top 2000 and I'm not quite sure how to fix that problem.
    >>
    >> I was thinking of running a loop around the pop until the obj is NOT
    >> greater
    >> than the top and then pushing obj but I'm convinced thats the best way.
    >>
    >> Thanks for any help.
    >>
    >> Regards.

    >
    > Are you getting duplicates because there are duplicates in your file(s)
    > or because there is an error in the program? (If the former, you might
    > consider using a std::unique, perhaps in conjunction with a
    > std::vector, to find the first N unique records. See the docs here:
    > http://www.sgi.com/tech/stl/unique.html .)
    >
    > Cheers! --M
    >

    yes duplicates in my data which I expected. I switched to a set based
    solution and it works good so far.

    Thanks for the tip about unique that'll be handy one day.
     
    Rich S., Oct 6, 2005
    #7
  8. Rich S.

    Rich S. Guest

    <> wrote in message
    news:...
    > Kristo wrote:
    >> Rich S. wrote:
    >> > Hi
    >> >
    >> > In an earlier posting I was asking how to read thru millions of data
    >> > records
    >> > keeping the top 2000 (where the top values are based upon a field in
    >> > the
    >> > record) and niklasb suggested using a priority_queue like so

    >>
    >> [snip old message]
    >>
    >> > This is working well for me but I'm getting cases where I have
    >> > duplicate
    >> > records in the top 2000 and I'm not quite sure how to fix that problem.

    >>
    >> What if you used a set as the underlying container for your
    >> priority_queue? That would automagically prevent duplicates from being
    >> inserted.

    >
    > I believe the underlying container for priority_queue must be a
    > sequence container rather than an associative container. However,
    > you could use set _instead_ of priority_queue. The algorithm
    > doesn't change very much.
    >
    > As before, this is just off the top of my head. I haven't compiled
    > it, much lest tested it, so use at your own risk and all that.
    >
    > std::set<MyType> highest;
    > MyType obj;
    > while (ReadRecord(&obj))
    > {
    > // Is obj one of the highest so far?
    > if (highest.count() < count || obj > *highest.begin())
    > {
    > // Insert it into the set; this has no effect if the set
    > // already contains an identical value.
    > highest.insert(obj);
    >
    > // If too many elements discard the lowest one.
    > if (highest.count() > count)
    > {
    > highest.erase(highest.begin());
    > }
    > }
    > }
    >


    I switched to a set based solution and it works pretty good.

    Thanks
     
    Rich S., Oct 6, 2005
    #8
  9. Rich S.

    Rich S. Guest

    "Mark P" <> wrote in message
    news:_l_0f.9161$...
    > Rich S. wrote:
    >> Hi
    >>
    >> In an earlier posting I was asking how to read thru millions of data
    >> records keeping the top 2000 (where the top values are based upon a field
    >> in the record) and niklasb suggested using a priority_queue like so
    >>
    >>
    >> ------------------- start old message ---------------------
    >> A more efficient way than what you describe would be
    >> to use priority_queue. Define the predicate in such
    >> a way that the top() always returns the lowest scoring
    >> object.
    >>
    >> Let q be the priority_queue and suppose you want to
    >> find the top N scores (where N is 2000 in your example),
    >> the pseudo-code would be something like this:
    >>
    >> q.reserve(N);
    >>
    >> // add the first N elements to the priority queue
    >> for (i = 0; i < N; ++i)
    >> {
    >> if (!ReadRecord(&obj))
    >> break;
    >>
    >> q.push(obj);
    >> }
    >>
    >> // invariant: q contains highest N elements
    >> // q.top() is the lowest of the highest N elements
    >> while (ReadRecord(&obj))
    >> {
    >> if (obj > q.top())
    >> {
    >> q.pop();
    >> q.push(obj);
    >> }
    >> }
    >>
    >> ------------------- end of old message ---------------------
    >>
    >> This is working well for me but I'm getting cases where I have duplicate
    >> records in the top 2000 and I'm not quite sure how to fix that problem.
    >>
    >> I was thinking of running a loop around the pop until the obj is NOT
    >> greater than the top and then pushing obj but I'm convinced thats the
    >> best way.
    >>
    >> Thanks for any help.
    >>
    >> Regards.
    >>
    >>

    >
    > I agree with previous replies that a std::set is a good approach (just
    > make sure to check the return value of insert to see if an item was
    > actually inserted, if you want to ensure always 2000 records). Another
    > option is a hashtable to record which elts. are currently in the queue.


    I switched to a set but was thinking about a map solution for speed.
     
    Rich S., Oct 6, 2005
    #9
  10. Rich S.

    Rich S. Guest

    "Kai-Uwe Bux" <> wrote in message
    news:di1ce7$hcq$...
    > Rich S. wrote:
    >
    >> Hi
    >>
    >> In an earlier posting I was asking how to read thru millions of data
    >> records keeping the top 2000 (where the top values are based upon a field
    >> in the record) and niklasb suggested using a priority_queue like so
    >>
    >>
    >> ------------------- start old message ---------------------
    >> A more efficient way than what you describe would be
    >> to use priority_queue. Define the predicate in such
    >> a way that the top() always returns the lowest scoring
    >> object.
    >>
    >> Let q be the priority_queue and suppose you want to
    >> find the top N scores (where N is 2000 in your example),
    >> the pseudo-code would be something like this:
    >>
    >> q.reserve(N);
    >>
    >> // add the first N elements to the priority queue
    >> for (i = 0; i < N; ++i)
    >> {
    >> if (!ReadRecord(&obj))
    >> break;
    >>
    >> q.push(obj);
    >> }
    >>
    >> // invariant: q contains highest N elements
    >> // q.top() is the lowest of the highest N elements
    >> while (ReadRecord(&obj))
    >> {
    >> if (obj > q.top())
    >> {
    >> q.pop();
    >> q.push(obj);
    >> }
    >> }
    >>
    >> ------------------- end of old message ---------------------
    >>
    >> This is working well for me but I'm getting cases where I have duplicate
    >> records in the top 2000 and I'm not quite sure how to fix that problem.
    >>
    >> I was thinking of running a loop around the pop until the obj is NOT
    >> greater than the top and then pushing obj but I'm convinced thats the
    >> best
    >> way.

    >
    > If you care about uniqueness, then maybe std::set is the way to go.
    > Consider
    > someting like this:
    > #include <set>
    >
    > template < typename T, unsigned long N >
    > struct top_N {
    >
    > std::set<T> elements;
    >
    > void insert ( T const & t ) {
    > elements.insert( t );
    > if ( elements.size() > N ) {
    > elements.erase( elements.begin() );
    > }
    > }
    >
    > }; // struct top_N
    >
    >
    > Best
    >
    > Kai-Uwe Bux


    Thanks, I did switch to a set solution and it works well.
     
    Rich S., Oct 6, 2005
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Tino

    clearing a priority_queue

    Tino, Jul 9, 2003, in forum: C++
    Replies:
    4
    Views:
    1,175
    Stuart Golodetz
    Jul 9, 2003
  2. newsock

    queue, deque, priority_queue

    newsock, Oct 24, 2003, in forum: C++
    Replies:
    15
    Views:
    2,506
    Tim Clacy
    Nov 4, 2003
  3. Dave
    Replies:
    7
    Views:
    551
    Jerry Coffin
    Apr 22, 2004
  4. ToshiBoy
    Replies:
    6
    Views:
    877
    ToshiBoy
    Aug 12, 2008
  5. Token Type
    Replies:
    9
    Views:
    385
    Chris Angelico
    Sep 9, 2012
Loading...

Share This Page