Multiple disjoint sample sets?

R

Roy Smith

I have a list of items. I need to generate n samples of k unique items
each. I not only want each sample set to have no repeats, but I also
want to make sure the sets are disjoint (i.e. no item repeated between
sets).

random.sample(items, k) will satisfy the first constraint, but not the
second. Should I just do random.sample(items, k*n), and then split the
resulting big list into n pieces? Or is there some more efficient way?

Typical values:

len(items) = 5,000,000
n = 10
k = 100,000
 
M

MRAB

I have a list of items. I need to generate n samples of k unique items
each. I not only want each sample set to have no repeats, but I also
want to make sure the sets are disjoint (i.e. no item repeated between
sets).

random.sample(items, k) will satisfy the first constraint, but not the
second. Should I just do random.sample(items, k*n), and then split the
resulting big list into n pieces? Or is there some more efficient way?

Typical values:

len(items) = 5,000,000
n = 10
k = 100,000
I don't know how efficient it would be, but couldn't you shuffle the
list and then use slicing to get the samples?
 
D

Dave Angel

I don't know how efficient it would be, but couldn't you shuffle the
list and then use slicing to get the samples?

I like that answer best, but just to offer another choice...

You start with a (presumably unique) list of items. After collecting
your first sample, you could subtract them from the list, and use the
smaller list for the next sample.

One way is to convert list to set, subtract, then convert back to list.
 
P

Peter Otten

Roy said:
I have a list of items. I need to generate n samples of k unique items
each. I not only want each sample set to have no repeats, but I also
want to make sure the sets are disjoint (i.e. no item repeated between
sets).

random.sample(items, k) will satisfy the first constraint, but not the
second. Should I just do random.sample(items, k*n), and then split the
resulting big list into n pieces? Or is there some more efficient way?

Typical values:

len(items) = 5,000,000
n = 10
k = 100,000

I would expect that your simple approach is more efficient than shuffling
the whole list.

Assuming there is a sample_iter(population) that generates unique items from
the population (which has no repetitions itself) you can create the samples
with

g = sample_iter(items)
samples = [list(itertools.islice(g, k) for _ in xrange(n)]

My ideas for such a sample_iter():

def sample_iter_mark(items):
n = len(items)
while True:
i = int(random()*n)
v = items
if v is not None:
yield v
items = None

This is destructive and will degrade badly as the number of None items
increases. For your typical values it seems to be OK though. You can make
this non-destructive by adding a bit array or a set (random.Random.sample()
has code that uses a set) to keep track of the seen items.

Another sample_iter() (which is also part of the random.Random.sample()
implementation):

def sample_iter_replace(items):
n = len(items)
for k in xrange(n):
i = int(random()*(n-k))
yield items
items = items[n-k-1]

You can micro-optimise that a bit to avoid the index calculation. Also,
instead of overwriting items you could swap them, so that no values would be
lost, only their initial order.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top