Multiple disjoint sample sets?

Discussion in 'Python' started by Roy Smith, Jan 11, 2013.

  1. Roy Smith

    Roy Smith Guest

    I have a list of items. I need to generate n samples of k unique items
    each. I not only want each sample set to have no repeats, but I also
    want to make sure the sets are disjoint (i.e. no item repeated between
    sets).

    random.sample(items, k) will satisfy the first constraint, but not the
    second. Should I just do random.sample(items, k*n), and then split the
    resulting big list into n pieces? Or is there some more efficient way?

    Typical values:

    len(items) = 5,000,000
    n = 10
    k = 100,000
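
    The sample-then-split idea from the question can be sketched like this
    (disjoint_samples is a name I made up for illustration; disjointness of
    the chunks follows from sampling without replacement):

    ```python
    import random

    def disjoint_samples(items, n, k):
        """Draw n pairwise-disjoint samples of k unique items each.

        Draws one big sample of k*n items without replacement, then
        splits it into n chunks of k.
        """
        big = random.sample(items, k * n)
        return [big[i * k:(i + 1) * k] for i in range(n)]

    samples = disjoint_samples(range(1000), n=4, k=50)
    ```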
    Roy Smith, Jan 11, 2013
    #1

  2. MRAB

    MRAB Guest

    On 2013-01-11 14:15, Roy Smith wrote:
    > I have a list of items. I need to generate n samples of k unique items
    > each. I not only want each sample set to have no repeats, but I also
    > want to make sure the sets are disjoint (i.e. no item repeated between
    > sets).
    >
    > random.sample(items, k) will satisfy the first constraint, but not the
    > second. Should I just do random.sample(items, k*n), and then split the
    > resulting big list into n pieces? Or is there some more efficient way?
    >
    > Typical values:
    >
    > len(items) = 5,000,000
    > n = 10
    > k = 100,000
    >

    I don't know how efficient it would be, but couldn't you shuffle the
    list and then use slicing to get the samples?
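
    That suggestion might look like this in code (the helper name is mine;
    shuffling a copy leaves the original list untouched):

    ```python
    import random

    def disjoint_samples_shuffle(items, n, k):
        # Shuffle a copy of the whole list, then carve off n slices of k
        # items each; slices of a shuffled list are automatically disjoint.
        pool = list(items)
        random.shuffle(pool)
        return [pool[i * k:(i + 1) * k] for i in range(n)]

    samples = disjoint_samples_shuffle(range(1000), n=4, k=50)
    ```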
    MRAB, Jan 11, 2013
    #2

  3. Dave Angel

    Dave Angel Guest

    On 01/11/2013 09:36 AM, MRAB wrote:
    > On 2013-01-11 14:15, Roy Smith wrote:
    >> I have a list of items. I need to generate n samples of k unique items
    >> each. I not only want each sample set to have no repeats, but I also
    >> want to make sure the sets are disjoint (i.e. no item repeated between
    >> sets).
    >>
    >> random.sample(items, k) will satisfy the first constraint, but not the
    >> second. Should I just do random.sample(items, k*n), and then split the
    >> resulting big list into n pieces? Or is there some more efficient way?
    >>
    >> Typical values:
    >>
    >> len(items) = 5,000,000
    >> n = 10
    >> k = 100,000
    >>

    > I don't know how efficient it would be, but couldn't you shuffle the
    > list and then use slicing to get the samples?


    I like that answer best, but just to offer another choice...

    You start with a (presumably unique) list of items. After collecting
    your first sample, you could subtract them from the list, and use the
    smaller list for the next sample.

    One way is to convert list to set, subtract, then convert back to list.
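
    A sketch of that subtract-as-you-go approach (the helper name is mine;
    the set round-trip assumes the items are hashable and, as noted above,
    unique):

    ```python
    import random

    def disjoint_samples_subtract(items, n, k):
        # Sample k items, remove them from the remaining pool via set
        # subtraction, and repeat n times.
        remaining = list(items)
        out = []
        for _ in range(n):
            sample = random.sample(remaining, k)
            remaining = list(set(remaining) - set(sample))
            out.append(sample)
        return out

    samples = disjoint_samples_subtract(range(1000), n=4, k=50)
    ```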



    --

    DaveA
    Dave Angel, Jan 11, 2013
    #3
  4. Peter Otten

    Peter Otten Guest

    Roy Smith wrote:

    > I have a list of items. I need to generate n samples of k unique items
    > each. I not only want each sample set to have no repeats, but I also
    > want to make sure the sets are disjoint (i.e. no item repeated between
    > sets).
    >
    > random.sample(items, k) will satisfy the first constraint, but not the
    > second. Should I just do random.sample(items, k*n), and then split the
    > resulting big list into n pieces? Or is there some more efficient way?
    >
    > Typical values:
    >
    > len(items) = 5,000,000
    > n = 10
    > k = 100,000


    I would expect that your simple approach is more efficient than shuffling
    the whole list.

    Assuming there is a sample_iter(population) that generates unique items from
    the population (which has no repetitions itself) you can create the samples
    with

    g = sample_iter(items)
    samples = [list(itertools.islice(g, k)) for _ in xrange(n)]

    My ideas for such a sample_iter():

    def sample_iter_mark(items):
        n = len(items)
        while True:
            i = int(random()*n)
            v = items[i]
            if v is not None:
                yield v
                items[i] = None

    This is destructive and will degrade badly as the number of None items
    increases. For your typical values it seems to be OK though. You can make
    this non-destructive by adding a bit array or a set (random.Random.sample()
    has code that uses a set) to keep track of the seen items.
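
    A non-destructive, set-tracking variant along those lines might look
    like this (sample_iter_set is a hypothetical name; written for
    Python 3, so range rather than xrange):

    ```python
    import itertools
    from random import random

    def sample_iter_set(items):
        """Yield items in random order without modifying the list.

        Tracks already-yielded indices in a set instead of marking
        list slots None, so the input list is left intact.
        """
        n = len(items)
        seen = set()
        while len(seen) < n:
            i = int(random() * n)
            if i not in seen:
                seen.add(i)
                yield items[i]

    g = sample_iter_set(list(range(1000)))
    first_sample = list(itertools.islice(g, 100))
    ```

    Like sample_iter_mark, this still degrades as the seen set fills up,
    since more random picks collide with already-used indices.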

    Another sample_iter() (which is also part of the random.Random.sample()
    implementation):

    def sample_iter_replace(items):
        n = len(items)
        for k in xrange(n):
            i = int(random()*(n-k))
            yield items[i]
            items[i] = items[n-k-1]

    You can micro-optimise that a bit to avoid the index calculation. Also,
    instead of overwriting items you could swap them, so that no values would be
    lost, only their initial order.
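
    The swap variant mentioned above could be sketched like this
    (sample_iter_swap is my name for it; range replaces xrange for
    Python 3):

    ```python
    import itertools
    from random import random

    def sample_iter_swap(items):
        # Like sample_iter_replace, but swaps the picked item into the
        # already-consumed tail instead of overwriting it, so the list
        # keeps all its values; only their order changes.
        n = len(items)
        for k in range(n):
            i = int(random() * (n - k))
            yield items[i]
            items[i], items[n - k - 1] = items[n - k - 1], items[i]

    pool = list(range(1000))
    sample = list(itertools.islice(sample_iter_swap(pool), 100))
    ```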
    Peter Otten, Jan 13, 2013
    #4