Keeping track of the N largest values


Roy Smith

I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?
 

Peter Otten

Roy said:
I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?

http://docs.python.org/library/heapq.html#heapq.nlargest
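For the problem as stated, that boils down to something like this (a minimal
sketch; `stream` here is a hypothetical stand-in for the real input, but any
iterable works, including a generator that is never fully held in memory):

```python
from heapq import nlargest

# Hypothetical stand-in for the real input stream.
stream = [17, 42, 3, 99, 8, 56, 23]
K = 3

top = nlargest(K, stream)  # -> [99, 56, 42], sorted largest-first
```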
 

Robert Kern

I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?

heapq.nlargest()

http://docs.python.org/library/heapq.html#heapq.nlargest

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Roy Smith

Duncan Booth said:
How about:

from heapq import nlargest
top = nlargest(K, input())

It uses a heap so avoids completely resorting each time.

Hmmm, that looks like it would solve the problem as stated, but there's
another twist. In addition to finding the K largest values, I *also*
need to do some other processing on all the values (which I didn't show
in the original example, incorrectly thinking it not germane to my
question). The problem with nlargest() is that it doesn't give me a
hook to do that.

PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
proportional to n, and not to the iterator length. That's the only
reasonable conclusion, but the docs don't actually come out and say it.
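(Writing the equivalent loop by hand makes that memory bound explicit. This
sketch is not the actual library source, just a plausible reimplementation:
only an n-element min-heap is ever held, however long the iterable is.)

```python
import heapq
from itertools import islice

def nlargest_sketch(n, iterable):
    # Behaves like heapq.nlargest(n, iterable), but shows why memory
    # stays proportional to n: we keep a min-heap of at most n items.
    it = iter(iterable)
    heap = list(islice(it, n))
    heapq.heapify(heap)
    for x in it:
        if x > heap[0]:
            heapq.heapreplace(heap, x)  # drop current smallest, keep x
    return sorted(heap, reverse=True)
```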
 

Paul Rubin

In addition to finding the K largest values, I *also* need to do some
other processing on all the values .... The problem with nlargest()
is that it doesn't give me a hook to do that.

def processed_input():
    for x in input():
        process(x)
        yield x

top = nlargest(K, processed_input())

You could also write that more concisely with genexps, but it's a bit
clumsy.
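The genexp version would look something like this (a sketch; `process` and
`data` are hypothetical stand-ins, and it assumes process() returns None, so
`or x` yields the value itself):

```python
from heapq import nlargest

seen = []               # hypothetical side effect: record every value

def process(x):
    seen.append(x)      # returns None

data = [4, 1, 7, 3, 9]  # stand-in for the real stream
K = 2

# process(x) is None, so (process(x) or x) evaluates to x for any x,
# letting the genexp run the side effect while yielding the value.
top = nlargest(K, (process(x) or x for x in data))
```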
 

Peter Otten

Roy said:
Hmmm, that looks like it would solve the problem as stated, but there's
another twist. In addition to finding the K largest values, I *also*
need to do some other processing on all the values (which I didn't show
in the original example, incorrectly thinking it not germane to my
question). The problem with nlargest() is that it doesn't give me a
hook to do that.

If Paul's solution doesn't suffice, the heapq module has the building
blocks for a custom solution, e.g.:

import heapq
from functools import partial

class NLargest(object):
    def __init__(self, n):
        self.n = n
        self.heap = []

    def tally(self, item):
        heap = self.heap
        if len(heap) >= self.n:
            heapq.heappushpop(heap, item)
            self.tally = partial(heapq.heappushpop, heap)
        else:
            heapq.heappush(heap, item)

    def __str__(self):
        return str(sorted(self.heap, reverse=True))

if __name__ == "__main__":
    import random
    items = range(100)
    random.shuffle(items)
    accu = NLargest(10)
    for item in items:
        accu.tally(item)
        print item, accu

PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
proportional to n, and not to the iterator length. That's the only
reasonable conclusion, but the docs don't actually come out and say it.

I would hope so.
 

n00m

from bisect import insort_left

K = 5
top = []
while 1:
    x = input()
    if len(top) < K:
        insort_left(top, x)
    elif x > top[0]:
        del top[0]
        insort_left(top, x)
    print top


will be enough
 

Roy Smith

n00m said:
from bisect import insort_left

K = 5
top = []
while 1:
    x = input()
    if len(top) < K:
        insort_left(top, x)
    elif x > top[0]:
        del top[0]
        insort_left(top, x)
    print top


will be enough

Hmmm, that's an interesting idea. Thanks.
 

Stefan Sonnenberg-Carstens

On 25.12.2010 16:42, Roy Smith wrote:
I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time. What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?

Here is my version:

l = []
K = 10

while 1:
    a = input()
    if len(l) == K:
        l.remove(min(l))
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    print l
 

Stefan Sonnenberg-Carstens

On 26.12.2010 19:51, Stefan Sonnenberg-Carstens wrote:
l = []
K = 10

while 1:
    a = input()
    if len(l) == K:
        l.remove(min(l))
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    print l

A minor fault made it into my prog:

l = [0]
K = 10

while 1:
    a = input()
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    if len(l) == K:
        l.remove(min(l))
    print l
 

Dan Stromberg

I'm processing a stream of N numbers and want to keep track of the K
largest. There are too many numbers in the stream (i.e. N is too large)
to keep in memory at once. K is small (100 would be typical).
From a theoretical point of view, I should be able to do this in N log K
time.  What I'm doing now is essentially:

top = [-1]    # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, monotonically increasing input, where
every new value forces a re-sort) where running time would be N K log K,
but on average (N >> K and a random distribution of values), this should
be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?

heapq is probably fine, but I've empirically found a treap faster than
a heap under some conditions:

http://stromberg.dnsalias.org/~strombrg/treap/
http://stromberg.dnsalias.org/~strombrg/highest/
 
