Sampling a population

Discussion in 'Python' started by Brian Quinlan, Jun 2, 2006.

  1. This is less a Python question and more an optimization/probability
    question. Imagine that you have a list of objects and their frequency in
    a population, e.g.

    lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]

    and you want to draw n items from that list (duplicates allowed), with
    that probability distribution.

    The fastest algorithm that I have been able to devise for doing so is
    O(n * log(len(lst))). Can anyone think of a solution with a better time
    complexity? If not, is there an obviously better way to do this
    (assuming n is big and the list size is small)?

    Here is the code:

    from random import random
    from bisect import bisect

    def draw(n, lst):
        ps = []
        last = 0
        for p in lst:
            ps.append(last + p)
            last += p

        # ps = [0.01, 0.06, 0.56, 0.86, 0.90, 1.00]

        chosen = [0] * len(lst) # track frequency
        for i in range(n):
            r = random()
            chosen[bisect(ps, r)] += 1 # binary search and increment

        result = [] # rescale probability based on frequency
        for c in chosen:
            result.append(float(c) / n)
        return result

    lst = [0.01, 0.05, 0.50, 0.30, 0.04, 0.10]
    print draw(10000, lst)
    Brian Quinlan, Jun 2, 2006
    #1

  2. Duncan Smith (Guest)

    Brian Quinlan wrote:
    > This is less a Python question and more an optimization/probability
    > question. Imagine that you have a list of objects and their frequency in
    > a population, e.g.
    >
    > lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]
    >
    > and you want to draw n items from that list (duplicates allowed), with
    > that probability distribution.
    >
    > The fastest algorithm that I have been able to devise for doing so is
    > O(n * log(len(lst))). Can anyone think of a solution with a better time
    > complexity? If not, is there an obviously better way to do this
    > (assuming n is big and the list size is small)?
    >
    > [code snipped]


    I would do something like the following (maybe with an additional check
    that the probabilities do not sum to less than 1),

    >>> from random import random
    >>> import operator
    >>> def draw(n, lst):
            # sort by descending probability so the inner loop usually
            # breaks out after only one or two comparisons
            lst.sort(key=operator.itemgetter(1), reverse=True)
            cumprobs = []
            this_cp = 0
            for p in lst:
                this_cp += p[1]
                cumprobs.append(this_cp)
            for _ in xrange(n):
                rnd = random()
                for i, cumprob in enumerate(cumprobs):
                    if rnd < cumprob:
                        yield lst[i][0]   # yield the item, not the (item, prob) pair
                        break


    >>> lst = [('a', 0.01), ('b', 0.05), ('c', 0.50), ('d', 0.30), ('e', 0.04), ('f', 0.10)]
    >>> list(draw(8, lst))

    ['d', 'd', 'c', 'e', 'c', 'd', 'c', 'f']
    >>>


    The sorting of the list means that iterating over the cumulative
    probabilities is minimised (for your density function the inner loop
    will be broken out of after the first iteration 50% of the time). (I've
    assumed that it doesn't matter to you that the list is sorted.) I'm not
    sure that this is in any way optimal.

    Duncan
    Duncan Smith, Jun 2, 2006
    #2

  3. Brian Quinlan wrote:
    > The fastest algorithm that I have been able to devise for doing so is
    > O(n * log(len(lst))). Can anyone think of a solution with a better time
    > complexity? If not, is there an obviously better way to do this
    > (assuming n is big and the list size is small)?



    If the list is indeed short, I'd say common sense speaks against complicating and
    optimizing further - you can only get a log(len(lst))-fold speedup, which in your case
    is more or less a small constant. _IF_ this part of the code later turns out to be a
    bottleneck, you might profit more by porting it to C than by searching for an O(n)
    solution, if one even exists.
    Mitja Trampus, Jun 2, 2006
    #3
  4. Paddy (Guest)

    Brian Quinlan wrote:
    > This is less a Python question and more an optimization/probability
    > question. Imagine that you have a list of objects and their frequency in
    > a population, e.g.
    >
    > lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]
    >
    > and you want to draw n items from that list (duplicates allowed), with
    > that probability distribution.
    >
    > The fastest algorithm that I have been able to devise for doing so is
    > O(n * log(len(lst))). Can anyone think of a solution with a better time
    > complexity? If not, is there an obviously better way to do this
    > (assuming n is big and the list size is small)?
    >


    Any way I tried to slice and dice it, I could not get any faster.
    draw2 and draw3 generate code on the fly. draw4 sneakily tries to
    trade memory and accuracy for speed but is even slower!

    First the times, then the code:

    $ ./timeit.py 'from probDistribution import draw as draw; draw(10000,
    [0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
    100 loops, best of 3: 13.4 msec per loop

    $ ./timeit.py 'from probDistribution import draw2 as draw; draw(10000,
    [0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
    100 loops, best of 3: 15.2 msec per loop

    $ ./timeit.py 'from probDistribution import draw3 as draw; draw(10000,
    [0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
    100 loops, best of 3: 16.2 msec per loop

    $ ./timeit.py 'from probDistribution import draw4 as draw; draw(10000,
    [0.01, 0.05, 0.50, 0.30, 0.04, 0.10])'
    10 loops, best of 3: 30.5 msec per loop


    === CODE probDistribution.py ===
    from random import random, randrange
    from bisect import bisect

    def draw(n, lst):
        ps = []
        last = 0
        for p in lst:
            ps.append(last + p)
            last += p

        # ps = [0.01, 0.06, 0.56, 0.86, 0.90, 1.00]

        chosen = [0] * len(lst) # track frequency
        for i in range(n):
            r = random()
            chosen[bisect(ps, r)] += 1 # binary search and increment

        result = [] # rescale probability based on frequency
        for c in chosen:
            result.append(float(c) / n)
        return result


    def draw2(n, lst):
        """ uses dynamic code generation of this form:
        chosen = [0] * 6
        for i in xrange(10000):
            r = random()
            if r < 0.01: chosen[0] += 1
            elif r < 0.06: chosen[1] += 1
            ...
            elif r < 0.90: chosen[4] += 1
            else: chosen[5] += 1
        """
        assert len(lst) > 1, "Corner case NOT covered"

        codestr = 'chosen = [0] * %i\n' % (len(lst),)
        codestr += 'for i in xrange(%i):\n r = random()\n' % (n,)

        last = 0.0
        lstmax = len(lst) - 1
        for i, p in enumerate(lst):
            last += p
            if i == 0:
                codestr += ' if r < %g: chosen[%i] +=1\n' % (last, i)
            elif i == lstmax:
                codestr += ' else: chosen[%i] +=1\n' % (i,)
            else:
                codestr += ' elif r < %g: chosen[%i] +=1\n' % (last, i)

        exec codestr

        result = [] # rescale probability based on frequency
        for c in chosen:
            result.append(float(c) / n)
        return result


    def draw3(n, lst):
        """ uses dynamic code generation of this form:
        chosen = [0] * 6
        for i in xrange(10000):
            r = random()
            chosen[-1 + (
                ((r<0.01) and 1)
                or ((r<0.06) and 2)
                ...
                or ((r<0.90) and 5)
                or 6
                )] += 1
        """
        assert len(lst) > 1, "Corner case NOT covered"

        codestr = 'chosen = [0] * %i\n' % (len(lst),)
        codestr += 'for i in xrange(%i):\n r = random()\n' % (n,)
        codestr += ' chosen[-1+ (\n'

        last = 0.0
        lstmax = len(lst) - 1
        for i, p in enumerate(lst):
            last += p
            if i == 0:
                codestr += ' ((r<%g) and 1)\n' % (last,)
            elif i == lstmax:
                codestr += ' or %i\n )] +=1\n' % (i+1,)
            else:
                codestr += ' or ((r<%g) and %i)\n' % (last, i+1)
        #return codestr
        exec codestr

        result = [] # rescale probability based on frequency
        for c in chosen:
            result.append(float(c) / n)
        return result


    def draw4(n, lst, precision=0.01, maxbins=10000):
        """ Memory/speed tradeoff by coarse quantizing of frequency values.
        """
        assert len(lst) > 1, "Corner case NOT covered"
        assert 0.0 < precision < 1.0 and (1.0/precision) < maxbins

        binmax = int(1.0/precision)
        chosenbin = [0] * binmax
        for i in xrange(n):
            chosenbin[randrange(binmax)] += 1

        left, right = 0, 0 # extract bin range for summation
        chosen = [0] * len(lst)
        last = 0.0
        for i, p in enumerate(lst):
            last += p
            right = int(last/precision)
            chosen[i] = sum(chosenbin[left:right])
            left = right

        result = [] # rescale probability based on frequency
        for c in chosen:
            result.append(float(c) / n)
        return result

    === END CODE ===
    Paddy, Jun 2, 2006
    #4
  5. Terry Reedy (Guest)

    "Paddy" <> wrote in message
    news:...
    > Brian Quinlan wrote:
    >> This is less a Python question and more an optimization/probability
    >> question. Imagine that you have a list of objects and their frequency in
    >> a population, e.g.
    >>
    >> lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]
    >>
    >> and you want to draw n items from that list (duplicates allowed), with
    >> that probability distribution.


    For a set of probabilities that uneven, you *might* do better to sort high
    to low and do a linear search. Or check the two common cases and then do a
    binary search on the rest.
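
    For illustration, a rough sketch of that second idea in the same Python 2
    style as the rest of the thread (the name draw_hybrid and the assumption
    that the probabilities are passed pre-sorted high to low are mine, not part
    of the post): handle the two most likely outcomes with direct comparisons
    and fall back to bisect for the remaining tail.

    from random import random
    from bisect import bisect

    def draw_hybrid(n, lst):
        # lst is a list of probabilities sorted high to low,
        # e.g. [0.50, 0.30, 0.10, 0.05, 0.04, 0.01]
        p0 = lst[0]            # most likely outcome
        p01 = lst[0] + lst[1]  # two most likely outcomes combined
        tail = []              # cumulative probabilities of the remaining items
        last = p01
        for p in lst[2:]:
            last += p
            tail.append(last)

        chosen = [0] * len(lst)
        for _ in xrange(n):
            r = random()
            if r < p0:
                chosen[0] += 1                     # one comparison for the common case
            elif r < p01:
                chosen[1] += 1                     # two comparisons for the next case
            else:
                chosen[2 + bisect(tail, r)] += 1   # binary search over the tail
        return [float(c) / n for c in chosen]

    print draw_hybrid(10000, [0.50, 0.30, 0.10, 0.05, 0.04, 0.01])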

    tjr
    Terry Reedy, Jun 2, 2006
    #5
  6. Roger Miller (Guest)

    For your example, since the probabilities are all multiples of 0.01 you
    could make a list of 100 elements. Set one of them to a, 5 of them to
    b, 50 of them to c, etc. Then just randomly sample from the table
    (which is O(1)). Of course if the probabilities can be arbitrary
    floating point values then this won't work. But if you can live with
    rounding to 3 or 4 digits this is probably the fastest and easiest
    approach.
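
    A minimal sketch of that lookup-table idea (the helper name make_table is
    made up for illustration; it assumes each probability is a multiple of
    1/size, here 0.01):

    from random import choice

    def make_table(lst, size=100):
        # expand each (item, probability) pair into round(p * size) table slots
        table = []
        for item, p in lst:
            table.extend([item] * int(round(p * size)))
        return table

    lst = [('a', 0.01), ('b', 0.05), ('c', 0.50), ('d', 0.30), ('e', 0.04), ('f', 0.10)]
    table = make_table(lst)
    # each draw is one uniform pick from the table, i.e. O(1) per sample
    samples = [choice(table) for _ in xrange(10000)]

    The table size grows as 1/precision, so this trades memory (and some
    rounding of the probabilities) for constant-time draws.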
    Roger Miller, Jun 2, 2006
    #6
