sampling without replacement

braver · Apr 17, 2008

Greetings -- I am doing a sampling without replacement, taking out
random elements from an array. The picked element is then removed
from the array. When my arrays were on the order of 10,000 elements
long, everything was fast. But when I increased them to 1,000,000 it
suddenly was hours. I tracked it down to the line I first threw in to
cut out the element i:

a = a[:i] + a[i+1:]

It was indeed the weakest link. But when I replace it with e.g.

a.pop(i)

it is still slow.

I wonder what approach can be taken to speed it up? Basically, the
algorithm picks an element from the array at random and checks its
size in a different hashtable i=>size. If the size fits into an
existing accumulator, i.e. is below a threshold, we take it and delete
it from the array as above. Thus just random.sample is not enough as
we may not use an element we pick until we find the right one, element
by element, running a cumulative statistics.

Using an array is natural here as it represents "without replacement"
-- we take an element by removing it from the array. But in Python
it's very slow... What approaches are there to implement a shrinking
array with random deletions with the magnitude of millions of
elements?

Cheers,
Alexy

Paul Rubin · Apr 17, 2008

braver said:
Using an array is natural here as it represents "without replacement"
-- we take an element by removing it from the array. But in Python
it's very slow... What approaches are there to implement a shrinking
array with random deletions with the magnitude of millions of
elements?

The obvious way is use random.sample(), is there some reason you
don't do that?

Alternatively, when you select an element a[k] from the middle of the
array, instead of deleting that element and moving the rest down (del
a[k]), do something like:

k = random.randint(0,len(a)-1)
selection = a[k]
a[k] = a[-1]
a.pop()

That deletes the last element (avoiding moving them around) after
storing it in the newly freed slot. Of course it messes up the order
of the array, which won't matter for selecting random elements, but
might matter for some other reason in your program.

Mensanator · Apr 17, 2008

braver said:
braver said:

Using an array is natural here as it represents "without replacement"
-- we take an element by removing it from the array. But in Python
it's very slow... What approaches are there to implement a shrinking
array with random deletions with the magnitude of millions of
elements?

Click to expand...

The obvious way is use random.sample(), is there some reason you
don't do that?

Alternatively, when you select an element a[k] from the middle of the
array, instead of deleting that element and moving the rest down (del
a[k]), do something like:

k = random.randint(0,len(a)-1)
selection = a[k]
a[k] = a[-1]
a.pop()

That deletes the last element (avoiding moving them around) after
storing it in the newly freed slot. Of course it messes up the order
of the array, which won't matter for selecting random elements, but
might matter for some other reason in your program.

What about an initial random shuffle on the array and then
always pop the last element? Or for that matter, why even pop it,
why not just take succesively larger negative indexes as the
"last" element?

bearophileHUGS · Apr 18, 2008

Alexy>But in Python it's very slow...<

I'm the first one to say that CPython is slow, but almost any language
is slow if you use such wrong algorithms like you do.
There are many ways to solve your problem efficiently, one of such
ways, among the simpler ones is to to not modify the original list:
'e'

If you don't want to extract the same element twice across different
runs of the program, then you may create a class like this:

from random import shuffle, seed

class Sampler(object):
def __init__(self, items, init_seed=1):
self.items = list(items)
self.last = len(self.items) - 1
self.init_seed = init_seed
seed(init_seed)
shuffle(self.items)
def __repr__(self):
return repr(self.items[:self.last+1])
def next(self):
if self.last < 0:
raise StopIteration
self.last -= 1
return self.items[self.last+1]
def save(self, filename):
pass
# saves self.last and self.init_seed on disk

samp = Sampler("abcdefghijklm")
print samp
print samp.next()
print samp.next()
print samp.next()

That class code is raw, you may want to improve it in some ways.

Bye,
bearophile

A Friday Python Programming Pearl: random sampling	2	May 28, 2010
pygnomevfs get_local_path_from_uri replacement	0	Dec 22, 2012
Accessing array index addresses with custom datatype in a function	0	Jun 2, 2022
Converting an Array to a String in JavaScript	7	Sep 22, 2023
How to paste n+1 every single time without copying new line from excel	0	Jul 13, 2023
How to paste n+1 every single time without copying new line from excel	3	Jul 13, 2023
Replacement for _getch()?	9	Nov 22, 2013
How to generate random number without replacement?	23	Jun 1, 2010

sampling without replacement

braver

Paul Rubin

Mensanator

bearophileHUGS

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads