Selecting k smallest or largest elements from a large list in Python (benchmarking)

Dmitry Chichkov

Given: a large list (10,000,000) of floating point numbers;
Task: fastest python code that finds k (small, e.g. 10) smallest
items, preferably with item indexes;
Limitations: in python, using only standard libraries (numpy & scipy
are OK);

I've tried several methods. With N = 10,000,000 and K = 10, the fastest
so far (without item indexes) was the pure-Python implementation
nsmallest_slott_bisect (using bisect/insort). And with indexes,
nargsmallest_numpy_argmin (argmin() on the numpy array k times).

Anyone up to the challenge beating my code with some clever selection
algorithm?

Current Table:
1.66864395142 mins_heapq(items, n):
0.946580886841 nsmallest_slott_bisect(items, n):
1.38014793396 nargsmallest(items, n):
10.0732769966 sorted(items)[:n]:
3.17916202545 nargsmallest_numpy_argsort(items, n):
1.31794500351 nargsmallest_numpy_argmin(items, n):
2.37499308586 nargsmallest_numpy_array_argsort(items, n):
0.524670124054 nargsmallest_numpy_array_argmin(items, n):

0.0525538921356 numpy argmin(items): 1892997
0.364673852921 min(items): 10.0000026786


Code:
----------------
import heapq
from random import randint, random
import time
from bisect import insort
from itertools import islice
from operator import itemgetter

def mins_heapq(items, n):
    nlesser_items = heapq.nsmallest(n, items)
    return nlesser_items

def nsmallest_slott_bisect(iterable, n, insort=insort):
    it = iter(iterable)
    mins = sorted(islice(it, n))
    for el in it:
        if el <= mins[-1]:  # NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()
    return mins

def nargsmallest(iterable, n, insort=insort):
    it = enumerate(iterable)
    mins = sorted(islice(it, n), key=itemgetter(1))
    loser = mins[-1][1]  # largest of the smallest
    for el in it:
        if el[1] <= loser:  # NOTE: equal sign is to preserve duplicates
            mins.append(el)
            mins.sort(key=itemgetter(1))
            mins.pop()
            loser = mins[-1][1]
    return mins

def nargsmallest_numpy_argsort(iter, k):
    distances = N.asarray(iter)
    return [(i, distances) for i in distances.argsort()[0:k]]

def nargsmallest_numpy_array_argsort(array, k):
    return [(i, array) for i in array.argsort()[0:k]]

def nargsmallest_numpy_argmin(iter, k):
    distances = N.asarray(iter)
    mins = []

def nargsmallest_numpy_array_argmin(distances, k):
    mins = []
    for i in xrange(k):
        j = distances.argmin()
        mins.append((j, distances[j]))
        distances[j] = float('inf')
    return mins


test_data = [randint(10, 50) + random() for i in range(10000000)]
K = 10

init = time.time()
mins = mins_heapq(test_data, K)
print time.time() - init, 'mins_heapq(items, n):', mins[:2]

init = time.time()
mins = nsmallest_slott_bisect(test_data, K)
print time.time() - init, 'nsmallest_slott_bisect(items, n):', mins[:2]

init = time.time()
mins = nargsmallest(test_data, K)
print time.time() - init, 'nargsmallest(items, n):', mins[:2]

init = time.time()
mins = sorted(test_data)[:K]
print time.time() - init, 'sorted(items)[:n]:', mins[:2]

import numpy as N
init = time.time()
mins = nargsmallest_numpy_argsort(test_data, K)
print time.time() - init, 'nargsmallest_numpy_argsort(items, n):', mins[:2]

init = time.time()
mins = nargsmallest_numpy_argmin(test_data, K)
print time.time() - init, 'nargsmallest_numpy_argmin(items, n):', mins[:2]


print
init = time.time()
mins = array.argmin()
print time.time() - init, 'numpy argmin(items):', mins

init = time.time()
mins = min(test_data)
print time.time() - init, 'min(items):', mins
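
For completeness, the k-smallest-with-indexes task can also be done with
nothing but the standard library by feeding (value, index) pairs to
heapq.nsmallest; a minimal sketch in Python 3 syntax (the name
nargsmallest_heapq is illustrative, not from the thread):

```python
import heapq

def nargsmallest_heapq(iterable, n):
    # Pair each value with its index; tuples compare by value first,
    # then by index, so heapq.nsmallest picks the n smallest values
    # and breaks ties by earliest index.
    return heapq.nsmallest(n, ((v, i) for i, v in enumerate(iterable)))

print(nargsmallest_heapq([5.0, 1.0, 3.0, 1.0, 4.0], 2))
# -> [(1.0, 1), (1.0, 3)]
```

This keeps the heap bounded at n entries, so memory stays O(n) even for
the 10,000,000-item input.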
 

Arnaud Delobelle

Dmitry Chichkov said:
Given: a large list (10,000,000) of floating point numbers;
Task: fastest python code that finds k (small, e.g. 10) smallest
items, preferably with item indexes;
Limitations: in python, using only standard libraries (numpy & scipy
are OK);

I've tried several methods. With N = 10,000,000 and K = 10, the fastest
so far (without item indexes) was the pure-Python implementation
nsmallest_slott_bisect (using bisect/insort). And with indexes,
nargsmallest_numpy_argmin (argmin() on the numpy array k times).

Anyone up to the challenge beating my code with some clever selection
algorithm?

Current Table:
1.66864395142 mins_heapq(items, n):
0.946580886841 nsmallest_slott_bisect(items, n):
1.38014793396 nargsmallest(items, n):
10.0732769966 sorted(items)[:n]:
3.17916202545 nargsmallest_numpy_argsort(items, n):
1.31794500351 nargsmallest_numpy_argmin(items, n):
2.37499308586 nargsmallest_numpy_array_argsort(items, n):
0.524670124054 nargsmallest_numpy_array_argmin(items, n):

0.0525538921356 numpy argmin(items): 1892997
0.364673852921 min(items): 10.0000026786

I think without numpy, nsmallest_slott_bisect is almost optimal. There
is a slight improvement:

1.33862709999 nsmallest_slott_bisect(items, n): [10.000011643188717, 10.000017791492528]
0.883894920349 nsmallest_slott_bisect2(items, n): [10.000011643188717, 10.000017791492528]

==== code ====

from bisect import insort
from itertools import islice

def nsmallest_slott_bisect(iterable, n, insort=insort):
    it = iter(iterable)
    mins = sorted(islice(it, n))
    for el in it:
        if el <= mins[-1]:  # NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()
    return mins

def nsmallest_slott_bisect2(iterable, n, insort=insort):
    it = iter(iterable)
    mins = sorted(islice(it, n))
    maxmin = mins[-1]
    for el in it:
        if el <= maxmin:  # NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()
            maxmin = mins[-1]
    return mins

import time
from random import randint, random

test_data = [randint(10, 50) + random() for i in range(10000000)]
K = 10

init = time.time()
mins = nsmallest_slott_bisect(test_data, K)
print time.time() - init, 'nsmallest_slott_bisect(items, n):', mins[:2]

init = time.time()
mins = nsmallest_slott_bisect2(test_data, K)
print time.time() - init, 'nsmallest_slott_bisect2(items, n):', mins[:2]
 

Peter Otten

Dmitry said:
Given: a large list (10,000,000) of floating point numbers;
Task: fastest python code that finds k (small, e.g. 10) smallest
items, preferably with item indexes;
Limitations: in python, using only standard libraries (numpy & scipy
are OK);

I've tried several methods. With N = 10,000,000 and K = 10, the fastest
so far (without item indexes) was the pure-Python implementation
nsmallest_slott_bisect (using bisect/insort). And with indexes,
nargsmallest_numpy_argmin (argmin() on the numpy array k times).

Anyone up to the challenge beating my code with some clever selection
algorithm?

If you don't care about stability, i.e. whether nlargest(2, [1, 2, 2.0])
returns [1, 2] or [1, 2.0], use the C implementation

_heapq.nsmallest() directly

$ python nsmallest_perf2.py
nsmallest --> 0.142784833908
nsmallest_slott_bisect --> 0.19291305542
$ cat nsmallest_perf2.py
from random import randint, random
import time
from bisect import insort
from itertools import islice
import _heapq

_funcs = []
def timeit(f):
    _funcs.append(f)

def time_all(*args):
    funcs = _funcs
    width = max(len(f.__name__) for f in funcs)
    prev = None
    for f in funcs:
        start = time.time()
        result = f(*args)
        end = time.time()
        print "%-*s --> %10s" % (width, f.__name__, end - start)
        if prev is None:
            prev = result
        else:
            assert prev == result

timeit(_heapq.nsmallest)

@timeit
def nsmallest_slott_bisect(n, iterable, insort=insort):
    it = iter(iterable)
    mins = sorted(islice(it, n))
    for el in it:
        if el <= mins[-1]:  # NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()
    return mins

test_data = [randint(10, 50) + random() for i in range(10**6)]
K = 10

time_all(K, test_data)

Peter
 

Peter Otten

Dmitry said:

A lot of the following doesn't run or returns incorrect results.
To give but one example:
def nargsmallest_numpy_argmin(iter, k):
distances = N.asarray(iter)
mins = []

Could you please provide an up-to-date version?

Peter

PS: for an easy way to ensure consistency see timeit/time_all in my previous
post.
 

Arnaud Delobelle

Dmitry said:
Given: a large list (10,000,000) of floating point numbers;
Task: fastest python code that finds k (small, e.g. 10) smallest
items, preferably with item indexes;
Limitations: in python, using only standard libraries (numpy & scipy
are OK);
I've tried several methods. With N = 10,000,000 and K = 10, the fastest
so far (without item indexes) was the pure-Python implementation
nsmallest_slott_bisect (using bisect/insort). And with indexes,
nargsmallest_numpy_argmin (argmin() on the numpy array k times).
Anyone up to the challenge beating my code with some clever selection
algorithm?

If you don't care about stability, i.e. whether nlargest(2, [1, 2, 2.0])
returns [1, 2] or [1, 2.0], use the C implementation

_heapq.nsmallest() directly

$ python nsmallest_perf2.py
nsmallest              --> 0.142784833908
nsmallest_slott_bisect --> 0.19291305542
$ cat nsmallest_perf2.py
from random import randint, random
import time
from bisect    import insort
from itertools import islice
import _heapq

_funcs = []
def timeit(f):
    _funcs.append(f)

def time_all(*args):
    funcs = _funcs
    width = max(len(f.__name__) for f in funcs)
    prev = None
    for f in funcs:
        start = time.time()
        result = f(*args)
        end = time.time()
        print "%-*s --> %10s" % (width, f.__name__, end - start)
        if prev is None:
            prev = result
        else:
            assert prev == result

timeit(_heapq.nsmallest)

@timeit
def nsmallest_slott_bisect(n, iterable, insort=insort):
    it   = iter(iterable)
    mins = sorted(islice(it, n))
    for el in it:
        if el <= mins[-1]: #NOTE: equal sign is to preserve duplicates
            insort(mins, el)
            mins.pop()
    return mins

test_data = [randint(10, 50) + random() for i in range(10**6)]
K = 10

time_all(K, test_data)

Peter

I get:

1.46s for _heapq.nsmallest
0.85s for nsmallest_slott_bisect2 (version I posted)

I am a bit surprised that mine is so slow compared with yours. I'll
do more tests later!
 

Peter Otten

Arnaud said:
I get:

1.46s for _heapq.nsmallest
0.85s for nsmallest_slott_bisect2 (version I posted)

I am a bit surprised that mine is so slow compared with yours. I'll
do more tests later!

Strange. I see a significant difference only for Python 3 (on 64-bit Linux):

$ python3 nsmallest_perf3.py
3.1.1+ (r311:74480, Nov 2 2009, 15:45:00)
[GCC 4.4.1]
pick the 10 smallest items out of 5000000
nsmallest --> 0.310983181
nsmallest_slott_bisect --> 1.02625894547
nsmallest_slott_bisect2 --> 0.612951040268

$ python nsmallest_perf3.py
2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
[GCC 4.4.1]
pick the 10 smallest items out of 5000000
nsmallest --> 0.743387937546
nsmallest_slott_bisect --> 0.961116075516
nsmallest_slott_bisect2 --> 0.746085882187

Peter
 

Dmitry Chichkov

Uh, I'm sorry about the confusion. The last three items are just O(N)
baselines: Python min(), numpy argmin(), numpy asarray().
I'll update the code. Thanks!
A lot of the following doesn't run or returns incorrect results.
To give but one example:
def nargsmallest_numpy_argmin(iter, k):
    distances = N.asarray(iter)
    mins = []

Could you please provide an up-to-date version?

Peter

PS: for an easy way to ensure consistency see timeit/time_all in my previous
post.
 

Dmitry Chichkov

By the way, improving n-ARG-smallest (that returns indexes as well as
values) is actually more desirable than just regular n-smallest:

== Result ==
1.38639092445 nargsmallest
3.1569879055 nargsmallest_numpy_argsort
1.29344892502 nargsmallest_numpy_argmin

Note that the numpy array constructor alone eats around 0.789 s of that.


== Code ==
from operator import itemgetter
from random import randint, random
from itertools import islice
from time import time

def nargsmallest(iterable, n):
    it = enumerate(iterable)
    mins = sorted(islice(it, n), key=itemgetter(1))
    loser = mins[-1][1]  # largest of the smallest
    for el in it:
        if el[1] <= loser:  # NOTE: equal sign is to preserve duplicates
            mins.append(el)
            mins.sort(key=itemgetter(1))
            mins.pop()
            loser = mins[-1][1]
    return mins

def nargsmallest_numpy_argsort(iter, k):
    distances = N.asarray(iter)
    return [(i, distances[i]) for i in distances.argsort()[0:k]]

def nargsmallest_numpy_argmin(iter, k):
    mins = []
    distances = N.asarray(iter)
    for i in xrange(k):
        j = distances.argmin()
        mins.append((j, distances[j]))
        distances[j] = float('inf')
    return mins

test_data = [randint(10, 50) + random() for i in range(10000000)]
K = 10

init = time()
mins = nargsmallest(test_data, K)
print time() - init, 'nargsmallest:', mins[:2]

import numpy as N
init = time()
mins = nargsmallest_numpy_argsort(test_data, K)
print time() - init, 'nargsmallest_numpy_argsort:', mins[:2]

init = time()
mins = nargsmallest_numpy_argmin(test_data, K)
print time() - init, 'nargsmallest_numpy_argmin:', mins[:2]
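
A note for later readers: NumPy has since gained argpartition (added in
NumPy 1.8, after this thread), which does this selection in a single O(N)
pass instead of k separate argmin scans. A sketch in Python 3 (the name
nargsmallest_argpartition is illustrative):

```python
import numpy as np

def nargsmallest_argpartition(values, k):
    # argpartition puts the indices of the k smallest elements
    # (in arbitrary order) in the first k slots, in O(N) time.
    arr = np.asarray(values)
    idx = np.argpartition(arr, k)[:k]
    idx = idx[np.argsort(arr[idx])]  # order just the k winners: O(k log k)
    return [(int(i), float(arr[i])) for i in idx]

print(nargsmallest_argpartition([5.0, 1.0, 3.0, 0.5, 4.0], 2))
# -> [(3, 0.5), (1, 1.0)]
```

Unlike the repeated-argmin approach it does not overwrite entries with
inf, so the input array is left untouched.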
 

Terry Reedy

Given: a large list (10,000,000) of floating point numbers;
Task: fastest python code that finds k (small, e.g. 10) smallest
items, preferably with item indexes;
Limitations: in python, using only standard libraries (numpy & scipy
are OK);

I've tried several methods. With N = 10,000,000 and K = 10, the fastest
so far (without item indexes) was the pure-Python implementation
nsmallest_slott_bisect (using bisect/insort). And with indexes,
nargsmallest_numpy_argmin (argmin() on the numpy array k times).

Anyone up to the challenge beating my code with some clever selection
algorithm?

Current Table:
1.66864395142 mins_heapq(items, n):
0.946580886841 nsmallest_slott_bisect(items, n):
1.38014793396 nargsmallest(items, n):
10.0732769966 sorted(items)[:n]:
3.17916202545 nargsmallest_numpy_argsort(items, n):
1.31794500351 nargsmallest_numpy_argmin(items, n):
2.37499308586 nargsmallest_numpy_array_argsort(items, n):
0.524670124054 nargsmallest_numpy_array_argmin(items, n):

0.0525538921356 numpy argmin(items): 1892997
0.364673852921 min(items): 10.0000026786

Your problem is underspecified;-).
Detailed timing comparisons are only valid for a particular Python
version running under a particular OS on particular hardware. So, to
actually run a contest, you would have to specify a version and OS and
offer to run entries on your machine, with as much else as possible
turned off, or else enlist a neutral judge to do so. And the timing
method should also be specified.
 

Dmitry Chichkov

Yes, you are right of course. But it is not really a contest. And if
you can improve the algorithm or implementation on "your Python version
running under your OS on your hardware", it may well improve
performance for other people under other OSes.
 
