Keeping track of the N largest values

Discussion in 'Python' started by Roy Smith, Dec 25, 2010.

  1. Roy Smith

    Roy Smith Guest

    I'm processing a stream of N numbers and want to keep track of the K
    largest. There's too many numbers in the stream (i.e. N is too large)
    to keep in memory at once. K is small (100 would be typical).

    From a theoretical point of view, I should be able to do this in N log K
    time. What I'm doing now is essentially:

    top = [-1] # Assume all x are >= 0
    for x in input():
    if x <= top[0]:
    continue
    top.append(x)
    if len(top) > K:
    top.sort()
    top.pop(0)

    I can see pathological cases (say, all input values the same) where
    running time would be N K log K, but on average (N >> K and random
    distribution of values), this should be pretty close to N.

    Is there a better way to do this, either from a theoretical running time
    point of view, or just a nicer way to code this in Python?
     
    Roy Smith, Dec 25, 2010
    #1
    1. Advertising

  2. Roy Smith

    Peter Otten Guest

    Roy Smith wrote:

    > I'm processing a stream of N numbers and want to keep track of the K
    > largest. There's too many numbers in the stream (i.e. N is too large)
    > to keep in memory at once. K is small (100 would be typical).
    >
    > From a theoretical point of view, I should be able to do this in N log K
    > time. What I'm doing now is essentially:
    >
    > top = [-1] # Assume all x are >= 0
    > for x in input():
    > if x <= top[0]:
    > continue
    > top.append(x)
    > if len(top) > K:
    > top.sort()
    > top.pop(0)
    >
    > I can see pathological cases (say, all input values the same) where
    > running time would be N K log K, but on average (N >> K and random
    > distribution of values), this should be pretty close to N.
    >
    > Is there a better way to do this, either from a theoretical running time
    > point of view, or just a nicer way to code this in Python?


    http://docs.python.org/library/heapq.html#heapq.nlargest
     
    Peter Otten, Dec 25, 2010
    #2
    1. Advertising

  3. Roy Smith

    Robert Kern Guest

    On 12/25/10 10:42 AM, Roy Smith wrote:
    > I'm processing a stream of N numbers and want to keep track of the K
    > largest. There's too many numbers in the stream (i.e. N is too large)
    > to keep in memory at once. K is small (100 would be typical).
    >
    >> From a theoretical point of view, I should be able to do this in N log K

    > time. What I'm doing now is essentially:
    >
    > top = [-1] # Assume all x are>= 0
    > for x in input():
    > if x<= top[0]:
    > continue
    > top.append(x)
    > if len(top)> K:
    > top.sort()
    > top.pop(0)
    >
    > I can see pathological cases (say, all input values the same) where
    > running time would be N K log K, but on average (N>> K and random
    > distribution of values), this should be pretty close to N.
    >
    > Is there a better way to do this, either from a theoretical running time
    > point of view, or just a nicer way to code this in Python?


    heapq.nlargest()

    http://docs.python.org/library/heapq#heapq.nlargest

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
     
    Robert Kern, Dec 25, 2010
    #3
  4. Roy Smith

    Roy Smith Guest

    In article <Xns9E59A44D7CC49duncanbooth@127.0.0.1>,
    Duncan Booth <> wrote:

    > Roy Smith <> wrote:
    >
    > >
    > > I'm processing a stream of N numbers and want to keep track of the K
    > > largest. There's too many numbers in the stream (i.e. N is too large)
    > > to keep in memory at once. K is small (100 would be typical).
    > > ...
    > > Is there a better way to do this, either from a theoretical running
    > > time point of view, or just a nicer way to code this in Python?

    >
    > How about:
    >
    > from heapq import nlargest
    > top = nlargest(K, input())
    >
    > It uses a heap so avoids completely resorting each time.


    Hmmm, that looks like it would solve the problem as stated, but there's
    another twist. In addition to finding the K largest values, I *also*
    need to do some other processing on all the values (which I didn't show
    in the original example, incorrectly thinking it not germane to my
    question). The problem with nlargest() is that it doesn't give me a
    hook to do that.

    PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
    proportional to n, and not to the iterator length. That's the only
    reasonable conclusion, but the docs don't actually come out and say it.
     
    Roy Smith, Dec 25, 2010
    #4
  5. Roy Smith

    Paul Rubin Guest

    Roy Smith <> writes:
    >> from heapq import nlargest
    >> top = nlargest(K, input())


    > In addition to finding the K largest values, I *also* need to do some
    > other processing on all the values .... The problem with nlargest()
    > is that it doesn't give me a hook to do that.


    def processed_input():
    for x in input():
    process(x)
    yield x

    top = nlargest(K, processed_input())

    You could also write that more consisely with genexps but it's a bit
    clumsy.
     
    Paul Rubin, Dec 25, 2010
    #5
  6. Roy Smith

    John Nagle Guest

    On 12/25/2010 8:04 AM, Peter Otten wrote:
    > Roy Smith wrote:
    >
    >> I'm processing a stream of N numbers and want to keep track of the K
    >> largest. There's too many numbers in the stream (i.e. N is too large)
    >> to keep in memory at once. K is small (100 would be typical).

    >
    > http://docs.python.org/library/heapq.html#heapq.nlargest


    Incidentally, if it happens that the data is already in
    a database, MySQL will do that.

    SELECT val FROM tab ORDER BY val DESC LIMIT N;

    will, for small N, keep only N values. For large N,
    it sorts.

    That's for an un-indexed field. If "val" is an
    index, this is a very fast operation, since the tree
    building has already been done.

    John Nagle
     
    John Nagle, Dec 25, 2010
    #6
  7. Roy Smith

    Peter Otten Guest

    Roy Smith wrote:

    > In article <Xns9E59A44D7CC49duncanbooth@127.0.0.1>,
    > Duncan Booth <> wrote:
    >
    >> Roy Smith <> wrote:
    >>
    >> >
    >> > I'm processing a stream of N numbers and want to keep track of the K
    >> > largest. There's too many numbers in the stream (i.e. N is too large)
    >> > to keep in memory at once. K is small (100 would be typical).
    >> > ...
    >> > Is there a better way to do this, either from a theoretical running
    >> > time point of view, or just a nicer way to code this in Python?

    >>
    >> How about:
    >>
    >> from heapq import nlargest
    >> top = nlargest(K, input())
    >>
    >> It uses a heap so avoids completely resorting each time.

    >
    > Hmmm, that looks like it would solve the problem as stated, but there's
    > another twist. In addition to finding the K largest values, I *also*
    > need to do some other processing on all the values (which I didn't show
    > in the original example, incorrectly thinking it not germane to my
    > question). The problem with nlargest() is that it doesn't give me a
    > hook to do that.


    If Paul's solution doesn't suffice -- the heapq module has the building
    blocks for a custom solution, e. g.:

    import heapq
    from functools import partial

    class NLargest(object):
    def __init__(self, n):
    self.n = n
    self.heap = []
    def tally(self, item):
    heap = self.heap
    if len(heap) >= self.n:
    heapq.heappushpop(heap, item)
    self.tally = partial(heapq.heappushpop, heap)
    else:
    heapq.heappush(heap, item)
    def __str__(self):
    return str(sorted(self.heap, reverse=True))

    if __name__ == "__main__":
    import random
    items = range(100)
    random.shuffle(items)
    accu = NLargest(10)
    for item in items:
    accu.tally(item)
    print item, accu


    > PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
    > proportional to n, and not to the iterator length. That's the only
    > reasonable conclusion, but the docs don't actually come out and say it.


    I would hope so.
     
    Peter Otten, Dec 26, 2010
    #7
  8. Roy Smith

    n00m Guest

    from bisect import insort_left

    K = 5
    top = []
    while 1:
    x = input()
    if len(top) < K:
    insort_left(top, x)
    elif x > top[0]:
    del top[0]
    insort_left(top, x)
    print top


    will be enough
     
    n00m, Dec 26, 2010
    #8
  9. Roy Smith

    Roy Smith Guest

    In article
    <>,
    n00m <> wrote:

    > from bisect import insort_left
    >
    > K = 5
    > top = []
    > while 1:
    > x = input()
    > if len(top) < K:
    > insort_left(top, x)
    > elif x > top[0]:
    > del top[0]
    > insort_left(top, x)
    > print top
    >
    >
    > will be enough


    Hmmm, that's an interesting idea. Thanks.
     
    Roy Smith, Dec 26, 2010
    #9
  10. Am 25.12.2010 16:42, schrieb Roy Smith:
    > I'm processing a stream of N numbers and want to keep track of the K
    > largest. There's too many numbers in the stream (i.e. N is too large)
    > to keep in memory at once. K is small (100 would be typical).
    >
    > > From a theoretical point of view, I should be able to do this in N log K

    > time. What I'm doing now is essentially:
    >
    > top = [-1] # Assume all x are>= 0
    > for x in input():
    > if x<= top[0]:
    > continue
    > top.append(x)
    > if len(top)> K:
    > top.sort()
    > top.pop(0)
    >
    > I can see pathological cases (say, all input values the same) where
    > running time would be N K log K, but on average (N>> K and random
    > distribution of values), this should be pretty close to N.
    >
    > Is there a better way to do this, either from a theoretical running time
    > point of view, or just a nicer way to code this in Python?

    Here is my version:

    l = []
    K = 10

    while 1:
    a = input()
    if len(l) == K:
    l.remove(min(l))
    l=[x for x in l if x < a] + [a] + [x for x in l if x > a]
    print l
     
    Stefan Sonnenberg-Carstens, Dec 26, 2010
    #10
  11. Am 26.12.2010 19:51, schrieb Stefan Sonnenberg-Carstens:
    > l = []
    > K = 10
    >
    > while 1:
    > a = input()
    > if len(l) == K:
    > l.remove(min(l))
    > l=[x for x in l if x < a] + [a] + [x for x in l if x > a]
    > print l

    A minor fault made it into my prog:

    l = [0]
    K = 10

    while 1:
    a = input()
    l=[x for x in l if x < a] + [a] + [x for x in l if x > a]
    if len(l) == K:
    l.remove(min(l))
    print l
     
    Stefan Sonnenberg-Carstens, Dec 27, 2010
    #11
  12. On Sat, Dec 25, 2010 at 7:42 AM, Roy Smith <> wrote:
    > I'm processing a stream of N numbers and want to keep track of the K
    > largest.  There's too many numbers in the stream (i.e. N is too large)
    > to keep in memory at once.  K is small (100 would be typical).
    >
    > >From a theoretical point of view, I should be able to do this in N log K

    > time.  What I'm doing now is essentially:
    >
    > top = [-1]    # Assume all x are >= 0
    > for x in input():
    >    if x <= top[0]:
    >        continue
    >    top.append(x)
    >    if len(top) > K:
    >        top.sort()
    >        top.pop(0)
    >
    > I can see pathological cases (say, all input values the same) where
    > running time would be N K log K, but on average (N >> K and random
    > distribution of values), this should be pretty close to N.
    >
    > Is there a better way to do this, either from a theoretical running time
    > point of view, or just a nicer way to code this in Python?
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >


    heapq is probably fine, but I've empirically found a treap faster than
    a heap under some conditions:

    http://stromberg.dnsalias.org/~strombrg/treap/
    http://stromberg.dnsalias.org/~strombrg/highest/
     
    Dan Stromberg, Dec 31, 2010
    #12
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. John
    Replies:
    0
    Views:
    361
  2. Utter Newbie
    Replies:
    0
    Views:
    468
    Utter Newbie
    Jul 28, 2003
  3. Manuel
    Replies:
    1
    Views:
    347
    John Saunders
    Dec 11, 2004
  4. =?Utf-8?B?dHBlcnJp?=
    Replies:
    4
    Views:
    2,007
    =?Utf-8?B?Rng=?=
    Jul 12, 2005
  5. xzzy

    keeping track of threads

    xzzy, Oct 6, 2005, in forum: ASP .Net
    Replies:
    2
    Views:
    461
Loading...

Share This Page