sampling without replacement

Discussion in 'Python' started by braver, Apr 17, 2008.

  1. braver

    braver Guest

    Greetings -- I am doing sampling without replacement, taking random
    elements out of an array. The picked element is then removed from
    the array. When my arrays were on the order of 10,000 elements
    long, everything was fast, but when I increased them to 1,000,000 it
    suddenly took hours. I tracked it down to the line I first threw in to
    cut out element i:

    a = a[:i] + a[i+1:]

    It was indeed the weakest link. But when I replace it with e.g.

    a.pop(i)

    it is still slow.

    I wonder what approach can be taken to speed it up? Basically, the
    algorithm picks an element from the array at random and looks up its
    size in a separate hashtable i => size. If the size fits into an
    existing accumulator, i.e. is below a threshold, we take the element
    and delete it from the array as above. So random.sample alone is not
    enough: we may not be able to use an element we pick, and have to keep
    going element by element, maintaining a running cumulative statistic,
    until we find one that fits.

    Using an array is natural here as it represents "without replacement"
    -- we take an element by removing it from the array. But in Python
    it's very slow... What approaches are there to implement a shrinking
    array with random deletions on the order of millions of elements?
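
    To make it concrete, here is roughly what my loop looks like, boiled
    down (the names sizes, threshold and n_wanted are placeholders for my
    real data, and the numbers are made up):

    import random

    a = list(range(1000000))                                # element ids
    sizes = dict((x, random.randint(1, 100)) for x in a)    # id => size
    threshold = 50
    n_wanted = 1000

    picked = []
    while a and len(picked) < n_wanted:
        i = random.randint(0, len(a) - 1)     # pick a random position
        elem = a[i]
        if sizes[elem] <= threshold:          # does it fit the accumulator?
            picked.append(elem)
            a = a[:i] + a[i+1:]               # the O(n) copy that takes hours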

    Cheers,
    Alexy
     
    braver, Apr 17, 2008
    #1

  2. Paul Rubin

    Paul Rubin Guest

    braver <> writes:
    > Using an array is natural here as it represents "without replacement"
    > -- we take an element by removing it from the array. But in Python
    > it's very slow... What approaches are there to implement a shrinking
    > array with random deletions on the order of millions of elements?


    The obvious way is to use random.sample(); is there some reason you
    don't do that?

    Alternatively, when you select an element a[k] from the middle of the
    array, instead of deleting that element and moving the rest down (del
    a[k]), do something like:

    k = random.randint(0,len(a)-1)
    selection = a[k]
    a[k] = a[-1]
    a.pop()

    That deletes the last element (avoiding moving them around) after
    storing it in the newly freed slot. Of course it messes up the order
    of the array, which won't matter for selecting random elements, but
    might matter for some other reason in your program.
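
    Wrapped up as a helper it might look like this (just a sketch;
    remove_at is a name I made up):

    import random

    def remove_at(a, k):
        # Remove and return a[k] in O(1) by swapping it with the last
        # element and popping; the order of the survivors is not preserved.
        a[k], a[-1] = a[-1], a[k]
        return a.pop()

    a = list(range(1000000))
    k = random.randint(0, len(a) - 1)
    selection = remove_at(a, k)    # constant time, unlike a = a[:k] + a[k+1:]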
     
    Paul Rubin, Apr 17, 2008
    #2

  3. Mensanator

    Mensanator Guest

    On Apr 17, 2:41 am, Paul Rubin <http://> wrote:
    > braver <> writes:
    > > Using an array is natural here as it represents "without replacement"
    > > -- we take an element by removing it from the array.  But in Python
    > > it's very slow...  What approaches are there to implement a shrinking
    > > array with random deletions on the order of millions of elements?

    >
    > The obvious way is to use random.sample(); is there some reason you
    > don't do that?
    >
    > Alternatively, when you select an element a[k] from the middle of the
    > array, instead of deleting that element and moving the rest down (del
    > a[k]), do something like:
    >
    >    k = random.randint(0,len(a)-1)
    >    selection = a[k]
    >    a[k] = a[-1]
    >    a.pop()
    >
    > That deletes the last element (avoiding moving them around) after
    > storing it in the newly freed slot.  Of course it messes up the order
    > of the array, which won't matter for selecting random elements, but
    > might matter for some other reason in your program.


    What about an initial random shuffle of the array, and then always
    popping the last element? Or, for that matter, why even pop it? Why
    not just take successively larger negative indexes as the
    "last" element?
     
    Mensanator, Apr 17, 2008
    #3
  4. bearophile

    bearophile Guest

    Alexy>But in Python it's very slow...<

    I'm the first one to say that CPython is slow, but almost any language
    is slow if you use the wrong algorithm, as you are doing here. There
    are many ways to solve your problem efficiently; one of the simpler
    ones is to not modify the original list at all:

    >>> from random import shuffle, seed
    >>> items = list("abcdefghijklm")
    >>> seed(10)
    >>> shuffle(items)
    >>> it_items = iter(items)
    >>> it_items.next()
    'i'
    >>> it_items.next()
    'd'
    >>> it_items.next()
    'l'
    >>> it_items.next()
    'b'
    >>> it_items.next()
    'j'
    >>> it_items.next()
    'a'
    >>> it_items.next()
    'e'

    If you don't want to extract the same element twice across different
    runs of the program, then you may create a class like this:

    from random import shuffle, seed

    class Sampler(object):
        def __init__(self, items, init_seed=1):
            self.items = list(items)
            self.last = len(self.items) - 1
            self.init_seed = init_seed
            seed(init_seed)
            shuffle(self.items)
        def __repr__(self):
            return repr(self.items[:self.last+1])
        def next(self):
            if self.last < 0:
                raise StopIteration
            self.last -= 1
            return self.items[self.last+1]
        def save(self, filename):
            # saves self.last and self.init_seed on disk
            pass

    samp = Sampler("abcdefghijklm")
    print samp
    print samp.next()
    print samp.next()
    print samp.next()

    That class code is raw; you may want to improve it in some ways.
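
    For example, adding __iter__ lets it be used directly in a for loop,
    since next() already raises StopIteration when the items run out
    (just a sketch; IterSampler is a name I made up):

    class IterSampler(Sampler):
        # __iter__ plus the existing next() satisfies the Python 2
        # iterator protocol, so the for loop stops on StopIteration.
        def __iter__(self):
            return self

    for item in IterSampler("abcdefghijklm"):
        print item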

    Bye,
    bearophile
     
    bearophile, Apr 18, 2008
    #4
