Keeping track of the N largest values


Roy Smith

I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?
 

Peter Otten

Roy said:
I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?

http://docs.python.org/library/heapq.html#heapq.nlargest
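For the problem as stated, that boils down to something like this (a minimal
sketch; `stream` here is a hypothetical stand-in for the real input, but any
iterable works, including a generator that is never fully held in memory):

```python
from heapq import nlargest

# Hypothetical stand-in for the real input stream.
stream = [17, 42, 3, 99, 8, 56, 23]
K = 3

top = nlargest(K, stream)  # -> [99, 56, 42], sorted largest-first
```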
 

Robert Kern

I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?

heapq.nlargest()

http://docs.python.org/library/heapq.html#heapq.nlargest

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Roy Smith

Duncan Booth said:
How about:

from heapq import nlargest
top = nlargest(K, input())

It uses a heap so avoids completely resorting each time.

Hmmm, that looks like it would solve the problem as stated, but there's
another twist. In addition to finding the K largest values, I *also*
need to do some other processing on all the values (which I didn't show
in the original example, incorrectly thinking it not germane to my
question). The problem with nlargest() is that it doesn't give me a
hook to do that.

PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
proportional to n, and not to the iterator length. That's the only
reasonable conclusion, but the docs don't actually come out and say it.
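(Writing the equivalent loop by hand makes that memory bound explicit. This
sketch is not the actual library source, just a plausible reimplementation:
only an n-element min-heap is ever held, however long the iterable is.)

```python
import heapq
from itertools import islice

def nlargest_sketch(n, iterable):
    # Behaves like heapq.nlargest(n, iterable), but shows why memory
    # stays proportional to n: we keep a min-heap of at most n items.
    it = iter(iterable)
    heap = list(islice(it, n))
    heapq.heapify(heap)
    for x in it:
        if x > heap[0]:
            heapq.heapreplace(heap, x)  # drop current smallest, keep x
    return sorted(heap, reverse=True)
```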
 

Paul Rubin

In addition to finding the K largest values, I *also* need to do some
other processing on all the values .... The problem with nlargest()
is that it doesn't give me a hook to do that.

def processed_input():
    for x in input():
        process(x)
        yield x

top = nlargest(K, processed_input())

You could also write that more concisely with genexps, but it's a bit
clumsy.
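The genexp version would look something like this (a sketch; `process` and
`data` are hypothetical stand-ins, and it assumes process() returns None, so
`or x` yields the value itself):

```python
from heapq import nlargest

seen = []               # hypothetical side effect: record every value

def process(x):
    seen.append(x)      # returns None

data = [4, 1, 7, 3, 9]  # stand-in for the real stream
K = 2

# process(x) is None, so (process(x) or x) evaluates to x for any x,
# letting the genexp run the side effect while yielding the value.
top = nlargest(K, (process(x) or x for x in data))
```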
 

Peter Otten

Roy said:
Hmmm, that looks like it would solve the problem as stated, but there's
another twist. In addition to finding the K largest values, I *also*
need to do some other processing on all the values (which I didn't show
in the original example, incorrectly thinking it not germane to my
question). The problem with nlargest() is that it doesn't give me a
hook to do that.

If Paul's solution doesn't suffice, the heapq module has the building
blocks for a custom solution, e.g.:

import heapq
from functools import partial

class NLargest(object):
    def __init__(self, n):
        self.n = n
        self.heap = []

    def tally(self, item):
        heap = self.heap
        if len(heap) >= self.n:
            heapq.heappushpop(heap, item)
            self.tally = partial(heapq.heappushpop, heap)
        else:
            heapq.heappush(heap, item)

    def __str__(self):
        return str(sorted(self.heap, reverse=True))

if __name__ == "__main__":
    import random
    items = range(100)
    random.shuffle(items)
    accu = NLargest(10)
    for item in items:
        accu.tally(item)
        print item, accu

PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
proportional to n, and not to the iterator length. That's the only
reasonable conclusion, but the docs don't actually come out and say it.

I would hope so.
 

n00m

from bisect import insort_left

K = 5
top = []
while 1:
    x = input()
    if len(top) < K:
        insort_left(top, x)
    elif x > top[0]:
        del top[0]
        insort_left(top, x)
    print top


will be enough
 

Roy Smith

n00m said:
from bisect import insort_left

K = 5
top = []
while 1:
    x = input()
    if len(top) < K:
        insort_left(top, x)
    elif x > top[0]:
        del top[0]
        insort_left(top, x)
    print top


will be enough

Hmmm, that's an interesting idea. Thanks.
 

Stefan Sonnenberg-Carstens

On 25.12.2010 16:42, Roy Smith wrote:
I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?

Here is my version:

l = []
K = 10

while 1:
    a = input()
    if len(l) == K:
        l.remove(min(l))
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    print l
 

Stefan Sonnenberg-Carstens

On 26.12.2010 19:51, Stefan Sonnenberg-Carstens wrote:
l = []
K = 10

while 1:
    a = input()
    if len(l) == K:
        l.remove(min(l))
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    print l

A minor fault made it into my prog:

l = [0]
K = 10

while 1:
    a = input()
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    if len(l) == K:
        l.remove(min(l))
    print l
 

Dan Stromberg

I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).
From a theoretical point of view, I should be able to do this in N log K
time.  What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?

heapq is probably fine, but I've empirically found a treap faster than
a heap under some conditions:

http://stromberg.dnsalias.org/~strombrg/treap/
http://stromberg.dnsalias.org/~strombrg/highest/
 
