implement random selection in Python

Discussion in 'Python' started by Bruza, Nov 16, 2007.

  1. Bruza

    Bruza Guest

    I need to implement a "random selection" algorithm which takes a list
    of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    likely an object, "obj", should be selected based on its probability
    of
    "prob".To simplify the problem, assuming "prob" are integers, and the
    sum of all "prob" equals 100. For example,

    items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

    The algorithm will take a number "N", and a [(obj, prob),...] list as
    inputs, and randomly pick "N" objects based on the probabilities of
    the
    objects in the list.


    For N=1 this is pretty simply; the following code is sufficient to do
    the job.

    def foo(items):
    index = random.randint(0, 99)
    currentP = 0
    for (obj, p) in items:
    currentP += w
    if currentP > index:
    return obj

    But how about the general case, for N > 1 and N < len(items)? Is there
    some clever algorithm using Python standard "random" package to do
    the trick?

    Thanks,

    Ben
     
    Bruza, Nov 16, 2007
    #1
    1. Advertising

  2. Bruza

    Guest

    On Nov 15, 10:40�pm, Bruza <> wrote:
    > I need to implement a "random selection" algorithm which takes a list
    > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    > likely an object, "obj", should be selected based on its probability
    > of
    > "prob".To simplify the problem, assuming "prob" are integers, and the
    > sum of all "prob" equals 100. For example,
    >
    > � �items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    >
    > The algorithm will take a number "N", and a [(obj, prob),...] list as
    > inputs, and randomly pick "N" objects based on the probabilities of
    > the
    > objects in the list.
    >
    > For N=1 this is pretty simply; the following code is sufficient to do
    > the job.
    >
    > def foo(items):
    > � � index = random.randint(0, 99)
    > � � currentP = 0
    > � � for (obj, p) in items:
    > � � � �currentP += w
    > � � � �if currentP > index:
    > � � � � � return obj
    >
    > But how about the general case, for N > 1 and N < len(items)? Is there
    > some clever algorithm using Python standard "random" package to do
    > the trick?
    >
    > Thanks,
    >
    > Ben


    What do you think of this?

    import random
    N = 100
    items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    the_items = []
    for i,j in items:
    the_items.extend(*j)
    histogram = {}
    for i in xrange(N):
    chosen = random.choice(the_items)
    print chosen,
    if chosen in histogram:
    histogram[chosen] += 1
    else:
    histogram[chosen] = 1
    print
    print
    for i in histogram:
    print '%4s: %d' % (i,histogram)

    ## John Mary Jane Tom Tom Mary Mary Tom John John Tom John Tom
    ## John Mary Mary Mary John Tom Tom John Mary Mary Tom Mary
    ## Mary John Tom Jane Jane Jane John Tom Jane Tom Tom John Tom
    ## Tom Mary Tom Tom Mary Tom Mary Tom Tom Tom Tom Mary Mary Tom
    ## Mary Tom Mary Tom Tom Jane Tom Mary Tom Jane Tom Tom Tom Tom
    ## Tom Mary Tom Jane Tom Mary Mary Jane Mary John Mary Mary Tom
    ## Mary Mary Tom Mary John Tom Tom Tom Tom Mary Jane Mary Tom
    ## Mary Tom Tom Jane Tom Mary Mary Tom
    ##
    ## Jane: 11
    ## John: 12
    ## Mary: 32
    ## Tom: 45
     
    , Nov 16, 2007
    #2
    1. Advertising

  3. Bruza

    James Stroud Guest

    Bruza wrote:
    > I need to implement a "random selection" algorithm which takes a list
    > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    > likely an object, "obj", should be selected based on its probability
    > of
    > "prob".To simplify the problem, assuming "prob" are integers, and the
    > sum of all "prob" equals 100. For example,
    >
    > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    >
    > The algorithm will take a number "N", and a [(obj, prob),...] list as
    > inputs, and randomly pick "N" objects based on the probabilities of
    > the
    > objects in the list.
    >
    >
    > For N=1 this is pretty simply; the following code is sufficient to do
    > the job.
    >
    > def foo(items):
    > index = random.randint(0, 99)
    > currentP = 0
    > for (obj, p) in items:
    > currentP += w
    > if currentP > index:
    > return obj
    >
    > But how about the general case, for N > 1 and N < len(items)? Is there
    > some clever algorithm using Python standard "random" package to do
    > the trick?
    >
    > Thanks,
    >
    > Ben


    This will not get you an A on your homework:

    x = []
    [x.extend(*n) for (i,n) in items]
    random.choice(x)

    James


    --
    James Stroud
    UCLA-DOE Institute for Genomics and Proteomics
    Box 951570
    Los Angeles, CA 90095

    http://www.jamesstroud.com
     
    James Stroud, Nov 16, 2007
    #3
  4. Bruza

    Bruza Guest

    On Nov 15, 9:32 pm, "" <> wrote:
    > On Nov 15, 10:40�pm, Bruza <> wrote:
    >
    >
    >
    > > I need to implement a "random selection" algorithm which takes a list
    > > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    > > likely an object, "obj", should be selected based on its probability
    > > of
    > > "prob".To simplify the problem, assuming "prob" are integers, and the
    > > sum of all "prob" equals 100. For example,

    >
    > > � �items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

    >
    > > The algorithm will take a number "N", and a [(obj, prob),...] list as
    > > inputs, and randomly pick "N" objects based on the probabilities of
    > > the
    > > objects in the list.

    >
    > > For N=1 this is pretty simply; the following code is sufficient to do
    > > the job.

    >
    > > def foo(items):
    > > � � index = random.randint(0, 99)
    > > � � currentP = 0
    > > � � for (obj, p) in items:
    > > � � � �currentP += w
    > > � � � �if currentP > index:
    > > � � � � � return obj

    >
    > > But how about the general case, for N > 1 and N < len(items)? Is there
    > > some clever algorithm using Python standard "random" package to do
    > > the trick?

    >
    > > Thanks,

    >
    > > Ben

    >
    > What do you think of this?
    >
    > import random
    > N = 100
    > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    > the_items = []
    > for i,j in items:
    > the_items.extend(*j)
    > histogram = {}
    > for i in xrange(N):
    > chosen = random.choice(the_items)
    > print chosen,
    > if chosen in histogram:
    > histogram[chosen] += 1
    > else:
    > histogram[chosen] = 1
    > print
    > print
    > for i in histogram:
    > print '%4s: %d' % (i,histogram)
    >
    > ## John Mary Jane Tom Tom Mary Mary Tom John John Tom John Tom
    > ## John Mary Mary Mary John Tom Tom John Mary Mary Tom Mary
    > ## Mary John Tom Jane Jane Jane John Tom Jane Tom Tom John Tom
    > ## Tom Mary Tom Tom Mary Tom Mary Tom Tom Tom Tom Mary Mary Tom
    > ## Mary Tom Mary Tom Tom Jane Tom Mary Tom Jane Tom Tom Tom Tom
    > ## Tom Mary Tom Jane Tom Mary Mary Jane Mary John Mary Mary Tom
    > ## Mary Mary Tom Mary John Tom Tom Tom Tom Mary Jane Mary Tom
    > ## Mary Tom Tom Jane Tom Mary Mary Tom
    > ##
    > ## Jane: 11
    > ## John: 12
    > ## Mary: 32
    > ## Tom: 45


    No. That does not solve the problem. What I want is a function

    def randomPick(n, the_items):

    which will return n DISTINCT items from "the_items" such that
    the n items returned are according to their probabilities specified
    in the (item, pro) elements inside "the_items".

    If I understand correctly, both of the previous replies will output
    one item at a time independently, instead of returning n DISTINCT
    items at a time.
     
    Bruza, Nov 16, 2007
    #4
  5. Bruza

    Boris Borcic Guest

    Bruza wrote:
    > No. That does not solve the problem. What I want is a function
    >
    > def randomPick(n, the_items):
    >
    > which will return n DISTINCT items from "the_items" such that
    > the n items returned are according to their probabilities specified
    > in the (item, pro) elements inside "the_items".
    >
    > If I understand correctly, both of the previous replies will output
    > one item at a time independently, instead of returning n DISTINCT
    > items at a time.
    >


    from random import sample
    randomPick = lambda n,its : sample(eval('+'.join('[%r]*%r'%p for p in its)),n)

    hth :)
     
    Boris Borcic, Nov 16, 2007
    #5
  6. Bruza

    Guest

    , Nov 16, 2007
    #6
  7. Bruza

    Boris Borcic Guest

    Bruza wrote:
    > No. That does not solve the problem. What I want is a function
    >
    > def randomPick(n, the_items):
    >
    > which will return n DISTINCT items from "the_items" such that
    > the n items returned are according to their probabilities specified
    > in the (item, pro) elements inside "the_items".


    and in the initial post you wrote :

    > But how about the general case, for N > 1 and N < len(items)?


    The problem is you need to make "the n items returned are according
    to their probabilities" more precise. "According to their probabilities" implies
    n INDEPENDENT picks, but this is contradictory with the n picks having to
    provide DISTINCT results (what clearly constrains picks relative to each other).

    Of course there are obvious ways to combine the results of random choices of
    single items to obtain a set like you want, but it is not obvious that they are
    equivalent.

    This is the most simple-minded :

    def randomPick(n, the_items) :
    assert n<len(the_items)
    result = set()
    while len(result)<n :
    result.add(singlePick(the_items))
    return sorted(result)

    This is another (but it won't work with your version of singlePick as it is,
    although it would with that provided by the other posters) :

    def randomPick(n, the_items) :
    result = []
    items = dict(the_items)
    for k in range(n) :
    pick = singlePick(items.items())
    result.append(pick)
    del items[pick]
    return result

    I would be surprised if they had exactly the same statistical properties, IOW,
    if they did fit the same exact interpretation of "according to their probabilities".
     
    Boris Borcic, Nov 16, 2007
    #7
  8. Bruza

    Paul Rubin Guest

    Bruza <> writes:
    > But how about the general case, for N > 1 and N < len(items)? Is there
    > some clever algorithm using Python standard "random" package


    Yeah, I'm not sure what the name for it is, but there'ss a well known
    algorithm that's sort of an online verison of random.choice. The
    famous N=1 example where all probabilities are equal goes:

    # choose a random element of seq
    for k,s in enumerate(seq):
    if random() < 1.0/(k+1):
    choice = s

    you should be able to generalize this to N items with different
    probabilities.
     
    Paul Rubin, Nov 16, 2007
    #8
  9. Bruza

    Boris Borcic Guest

    Boris Borcic wrote:
    > Bruza wrote:
    >> No. That does not solve the problem. What I want is a function
    >>
    >> def randomPick(n, the_items):
    >>
    >> which will return n DISTINCT items from "the_items" such that
    >> the n items returned are according to their probabilities specified
    >> in the (item, pro) elements inside "the_items".

    >
    > and in the initial post you wrote :
    >
    > > But how about the general case, for N > 1 and N < len(items)?

    >
    > The problem is you need to make "the n items returned are according
    > to their probabilities" more precise. "According to their probabilities" implies
    > n INDEPENDENT picks, but this is contradictory with the n picks having to
    > provide DISTINCT results (what clearly constrains picks relative to each other).
    >
    > Of course there are obvious ways to combine the results of random choices of
    > single items to obtain a set like you want, but it is not obvious that they are
    > equivalent.
    >
    > This is the most simple-minded :
    >
    > def randomPick(n, the_items) :
    > assert n<len(the_items)
    > result = set()
    > while len(result)<n :
    > result.add(singlePick(the_items))
    > return sorted(result)
    >
    > This is another (but it won't work with your version of singlePick as it is,
    > although it would with that provided by the other posters) :
    >
    > def randomPick(n, the_items) :
    > result = []
    > items = dict(the_items)
    > for k in range(n) :
    > pick = singlePick(items.items())
    > result.append(pick)
    > del items[pick]
    > return result
    >
    > I would be surprised if they had exactly the same statistical properties, IOW,
    > if they did fit the same exact interpretation of "according to their probabilities".
    >
    >


    yet another one, constructing a list of n-sets first, and then picking one;
    like the other solutions, it doesn't optimize for repeated use.

    def pickn(items,n) :
    "yields all n-sublists of (destructed) items"
    if n<=len(items) :
    if n :
    item = items.pop()
    for res in pickn(items[:],n) :
    yield res
    for res in pickn(items,n-1) :
    res.append(item)
    yield res
    else :
    yield []


    def randomPick(n,the_items) :
    "randomly pick n distinct members of the_items"
    the_sets = list(pickn(the_items[:],n))
    divisor = len(the_sets)*float(n)/len(the_items)
    for k,s in enumerate(the_sets) :
    the_sets[k] = (sorted(who for who,_ in s),
    int(1+sum(p for _,p in s)/divisor)) # mhh...
    return singlePick(the_sets)
     
    Boris Borcic, Nov 16, 2007
    #9
  10. Bruza

    duncan smith Guest

    Bruza wrote:
    > I need to implement a "random selection" algorithm which takes a list
    > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    > likely an object, "obj", should be selected based on its probability
    > of
    > "prob".To simplify the problem, assuming "prob" are integers, and the
    > sum of all "prob" equals 100. For example,
    >
    > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    >
    > The algorithm will take a number "N", and a [(obj, prob),...] list as
    > inputs, and randomly pick "N" objects based on the probabilities of
    > the
    > objects in the list.
    >
    >
    > For N=1 this is pretty simply; the following code is sufficient to do
    > the job.
    >
    > def foo(items):
    > index = random.randint(0, 99)
    > currentP = 0
    > for (obj, p) in items:
    > currentP += w
    > if currentP > index:
    > return obj
    >
    > But how about the general case, for N > 1 and N < len(items)? Is there
    > some clever algorithm using Python standard "random" package to do
    > the trick?
    >


    I think you need to clarify what you want to do. The "probs" are
    clearly not probabilities. Are they counts of items? Are you then
    sampling without replacement? When you say N < len(items) do you mean N
    <= sum of the "probs"?

    Duncabn
     
    duncan smith, Nov 16, 2007
    #10
  11. Bruza

    Bruza Guest

    On Nov 16, 6:58 am, duncan smith <>
    wrote:
    > Bruza wrote:
    > > I need to implement a "random selection" algorithm which takes a list
    > > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    > > likely an object, "obj", should be selected based on its probability
    > > of
    > > "prob".To simplify the problem, assuming "prob" are integers, and the
    > > sum of all "prob" equals 100. For example,

    >
    > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

    >
    > > The algorithm will take a number "N", and a [(obj, prob),...] list as
    > > inputs, and randomly pick "N" objects based on the probabilities of
    > > the
    > > objects in the list.

    >
    > > For N=1 this is pretty simply; the following code is sufficient to do
    > > the job.

    >
    > > def foo(items):
    > > index = random.randint(0, 99)
    > > currentP = 0
    > > for (obj, p) in items:
    > > currentP += w
    > > if currentP > index:
    > > return obj

    >
    > > But how about the general case, for N > 1 and N < len(items)? Is there
    > > some clever algorithm using Python standard "random" package to do
    > > the trick?

    >
    > I think you need to clarify what you want to do. The "probs" are
    > clearly not probabilities. Are they counts of items? Are you then
    > sampling without replacement? When you say N < len(items) do you mean N
    > <= sum of the "probs"?
    >
    > Duncabn


    I think I need to explain on the probability part: the "prob" is a
    relative likelihood that the object will be included in the output
    list. So, in my example input of

    items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

    So, for any size of N, 'Tom' (with prob of 45) will be more likely to
    be included in the output list of N distinct member than 'Mary' (prob
    of 30) and much more likely than that of 'John' (with prob of 10).

    I know "prob" is not exactly the "probability" in the context of
    returning a multiple member list. But what I want is a way to "favor"
    some member in a selection process.

    So far, only Boris's solution is closest (but not quite) to what I
    need, which returns a list of N distinct object from the input
    "items". However, I tried with input of

    items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]

    and have a repeated calling of

    Ben
     
    Bruza, Nov 17, 2007
    #11
  12. Bruza

    Bruza Guest

    On Nov 16, 4:47 pm, Bruza <> wrote:
    > On Nov 16, 6:58 am, duncan smith <>
    > wrote:
    >
    >
    >
    > > Bruza wrote:
    > > > I need to implement a "random selection" algorithm which takes a list
    > > > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    > > > likely an object, "obj", should be selected based on its probability
    > > > of
    > > > "prob".To simplify the problem, assuming "prob" are integers, and the
    > > > sum of all "prob" equals 100. For example,

    >
    > > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

    >
    > > > The algorithm will take a number "N", and a [(obj, prob),...] list as
    > > > inputs, and randomly pick "N" objects based on the probabilities of
    > > > the
    > > > objects in the list.

    >
    > > > For N=1 this is pretty simply; the following code is sufficient to do
    > > > the job.

    >
    > > > def foo(items):
    > > > index = random.randint(0, 99)
    > > > currentP = 0
    > > > for (obj, p) in items:
    > > > currentP += w
    > > > if currentP > index:
    > > > return obj

    >
    > > > But how about the general case, for N > 1 and N < len(items)? Is there
    > > > some clever algorithm using Python standard "random" package to do
    > > > the trick?

    >
    > > I think you need to clarify what you want to do. The "probs" are
    > > clearly not probabilities. Are they counts of items? Are you then
    > > sampling without replacement? When you say N < len(items) do you mean N
    > > <= sum of the "probs"?

    >
    > > Duncabn

    >
    > I think I need to explain on the probability part: the "prob" is a
    > relative likelihood that the object will be included in the output
    > list. So, in my example input of
    >
    > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    >
    > So, for any size of N, 'Tom' (with prob of 45) will be more likely to
    > be included in the output list of N distinct member than 'Mary' (prob
    > of 30) and much more likely than that of 'John' (with prob of 10).
    >
    > I know "prob" is not exactly the "probability" in the context of
    > returning a multiple member list. But what I want is a way to "favor"
    > some member in a selection process.
    >
    > So far, only Boris's solution is closest (but not quite) to what I
    > need, which returns a list of N distinct object from the input
    > "items". However, I tried with input of
    >
    > items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]
    >
    > and have a repeated calling of
    >
    > Ben


    OOPS. I pressed the Send too fast.

    The problem w/ Boris's solution is that after repeated calling of
    randomPick(3,items), 'Jane' is not the most "frequent appearing"
    member in all the out list of 3 member lists...
     
    Bruza, Nov 17, 2007
    #12
  13. Bruza

    Jordan Guest

    Maybe it would help to make your problem statement a litte rigorous so
    we can get a clearer idea of whats required.

    One possible formulation:

    Given a list L of pairs of values, weightings: [ (v_0, w_0), (v_1,
    w_1), ....], and some N between 1 and length(L)

    you would like to randomly select a set of N (distinct) values, V,
    such that for any ints i and j,

    Prob (v_i is in V) / Prob (v_j is in V) = w_i / w_j

    This matches your expectations for N = 1. Intuitively though, without
    having put much thought into it, I suspect this might not be possible
    in the general case.

    You might then want to (substantially) relax thec ondition to

    Prob (v_i is in V) >= Prob (v_j is in V) iff w_i >= w_j

    but in that case its more an ordering of likelihoods rather than a
    weighting, and doesn't guarantee the right behaviour for N = 1, so i
    don't think thats really what you want.

    I can't think of any other obvious way of generalising the behaviour
    of the N = 1 case.

    - Jordan

    On Nov 17, 10:50 am, Bruza <> wrote:
    > On Nov 16, 4:47 pm, Bruza <> wrote:
    >
    >
    >
    >
    >
    > > On Nov 16, 6:58 am, duncan smith <>
    > > wrote:

    >
    > > > Bruza wrote:
    > > > > I need to implement a "random selection" algorithm which takes a list
    > > > > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    > > > > likely an object, "obj", should be selected based on its probability
    > > > > of
    > > > > "prob".To simplify the problem, assuming "prob" are integers, and the
    > > > > sum of all "prob" equals 100. For example,

    >
    > > > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

    >
    > > > > The algorithm will take a number "N", and a [(obj, prob),...] list as
    > > > > inputs, and randomly pick "N" objects based on the probabilities of
    > > > > the
    > > > > objects in the list.

    >
    > > > > For N=1 this is pretty simply; the following code is sufficient to do
    > > > > the job.

    >
    > > > > def foo(items):
    > > > > index = random.randint(0, 99)
    > > > > currentP = 0
    > > > > for (obj, p) in items:
    > > > > currentP += w
    > > > > if currentP > index:
    > > > > return obj

    >
    > > > > But how about the general case, for N > 1 and N < len(items)? Is there
    > > > > some clever algorithm using Python standard "random" package to do
    > > > > the trick?

    >
    > > > I think you need to clarify what you want to do. The "probs" are
    > > > clearly not probabilities. Are they counts of items? Are you then
    > > > sampling without replacement? When you say N < len(items) do you mean N
    > > > <= sum of the "probs"?

    >
    > > > Duncabn

    >
    > > I think I need to explain on the probability part: the "prob" is a
    > > relative likelihood that the object will be included in the output
    > > list. So, in my example input of

    >
    > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

    >
    > > So, for any size of N, 'Tom' (with prob of 45) will be more likely to
    > > be included in the output list of N distinct member than 'Mary' (prob
    > > of 30) and much more likely than that of 'John' (with prob of 10).

    >
    > > I know "prob" is not exactly the "probability" in the context of
    > > returning a multiple member list. But what I want is a way to "favor"
    > > some member in a selection process.

    >
    > > So far, only Boris's solution is closest (but not quite) to what I
    > > need, which returns a list of N distinct object from the input
    > > "items". However, I tried with input of

    >
    > > items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]

    >
    > > and have a repeated calling of

    >
    > > Ben

    >
    > OOPS. I pressed the Send too fast.
    >
    > The problem w/ Boris's solution is that after repeated calling of
    > randomPick(3,items), 'Jane' is not the most "frequent appearing"
    > member in all the out list of 3 member lists...- Hide quoted text -
    >
    > - Show quoted text -
     
    Jordan, Nov 17, 2007
    #13
  14. Bruza

    duncan smith Guest

    Bruza wrote:
    > On Nov 16, 4:47 pm, Bruza <> wrote:
    >> On Nov 16, 6:58 am, duncan smith <>
    >> wrote:
    >>
    >>
    >>
    >>> Bruza wrote:
    >>>> I need to implement a "random selection" algorithm which takes a list
    >>>> of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    >>>> likely an object, "obj", should be selected based on its probability
    >>>> of
    >>>> "prob".To simplify the problem, assuming "prob" are integers, and the
    >>>> sum of all "prob" equals 100. For example,
    >>>> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    >>>> The algorithm will take a number "N", and a [(obj, prob),...] list as
    >>>> inputs, and randomly pick "N" objects based on the probabilities of
    >>>> the
    >>>> objects in the list.
    >>>> For N=1 this is pretty simply; the following code is sufficient to do
    >>>> the job.
    >>>> def foo(items):
    >>>> index = random.randint(0, 99)
    >>>> currentP = 0
    >>>> for (obj, p) in items:
    >>>> currentP += w
    >>>> if currentP > index:
    >>>> return obj
    >>>> But how about the general case, for N > 1 and N < len(items)? Is there
    >>>> some clever algorithm using Python standard "random" package to do
    >>>> the trick?
    >>> I think you need to clarify what you want to do. The "probs" are
    >>> clearly not probabilities. Are they counts of items? Are you then
    >>> sampling without replacement? When you say N < len(items) do you mean N
    >>> <= sum of the "probs"?
    >>> Duncabn

    >> I think I need to explain on the probability part: the "prob" is a
    >> relative likelihood that the object will be included in the output
    >> list. So, in my example input of
    >>
    >> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    >>
    >> So, for any size of N, 'Tom' (with prob of 45) will be more likely to
    >> be included in the output list of N distinct member than 'Mary' (prob
    >> of 30) and much more likely than that of 'John' (with prob of 10).
    >>
    >> I know "prob" is not exactly the "probability" in the context of
    >> returning a multiple member list. But what I want is a way to "favor"
    >> some member in a selection process.
    >>
    >> So far, only Boris's solution is closest (but not quite) to what I
    >> need, which returns a list of N distinct object from the input
    >> "items". However, I tried with input of
    >>
    >> items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]
    >>
    >> and have a repeated calling of
    >>
    >> Ben

    >
    > OOPS. I pressed the Send too fast.
    >
    > The problem w/ Boris's solution is that after repeated calling of
    > randomPick(3,items), 'Jane' is not the most "frequent appearing"
    > member in all the out list of 3 member lists...
    >


    So it looks like you are sampling without replacement. For large N you
    don't want to be generating N individual observations. Do you want
    something like the following?

    >>> import randvars
    >>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 100)

    [34, 8, 40, 18]
    >>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 10000)

    [2984, 1003, 4511, 1502]
    >>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 1000000)

    [300068, 99682, 450573, 149677]
    >>>


    The algorithm I use is to generate independent Poisson variates for each
    category, with Poisson parameters proportional to the probabilities.
    This won't usually give you a total of N, but the sample can be adjusted
    by adding / subtracting individual observations. In fact, I choose a
    constant of proportionality that aims to get the total close to, but not
    greater than N. I repeatedly generate Poisson counts until the sample
    size is not greater than N, then adjust by adding individual
    observations. (This algorithm might be in Knuth, and might be due to
    Ahrens and Dieter. I can't remember the exact source.)

    An efficient way to generate the individual observations is to construct
    a cumulative distribution function and use binary search. I know the
    code for this has been posted before. An alias table should generally
    be faster, but my Python coded effort is only similarly fast.

    Unfortunately generating Poisson variates isn't straightforward, and the
    approach I use requires the generation of gamma and beta variates (both
    available in the random module though). This approach is due to Ahrens
    and Dieter and is in Knuth, Vol.2. So yes, there is a clever algorithm
    that uses functions in the random module (Knuth, Ahrens and Dieter are
    quite clever). But it's not simple.

    The numpy random module has a function for Poisson variates, which would
    make things easier for you (I don't off-hand know which algorithm is
    used). Then it's just a case of choosing an appropriate constant of
    proportionality for the Poisson parameters. Try N - k, where k is the
    number of categories, for k <= N**0.5; otherwise, N - N**0.5 - (k -
    N**0.5)**0.5.

    Duncan
     
    duncan smith, Nov 17, 2007
    #14
  15. On Fri, 16 Nov 2007 16:47:16 -0800, Bruza wrote:

    > I think I need to explain on the probability part: the "prob" is a
    > relative likelihood that the object will be included in the output list.
    > So, in my example input of
    >
    > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    >
    > So, for any size of N, 'Tom' (with prob of 45) will be more likely to be
    > included in the output list of N distinct member than 'Mary' (prob of
    > 30) and much more likely than that of 'John' (with prob of 10).



    Since there are only four items, you can't select more than four distinct
    items, because the fifth has to be a duplicate of one of the others. And
    the probability/likelihood doesn't work either: if you insist on getting
    distinct items, then your results are are much limited:

    N = 0, result: []

    N = 1, four possible results:
    ['Mary'] ['John'] ['Tom'] or ['Jane']

    N = 2, six possible results (assuming order doesn't matter):
    ['Mary', 'John'] ['Mary', 'Tom'] ['Mary', 'Jane'] ['John', 'Tom']
    ['John', 'Jane'] or ['Tom', 'Jane']

    N = 3, four possible results:
    ['Mary', 'John', 'Tom'] ['Mary', 'John', 'Jane']
    ['Mary', 'Tom', 'Jane'] or ['John', 'Tom', 'Jane']

    N = 4, one possible result:
    ['Mary', 'John', 'Tom', 'Jane']



    > I know "prob" is not exactly the "probability" in the context of
    > returning a multiple member list. But what I want is a way to "favor"
    > some member in a selection process.


    I don't think this is really well defined, but I'll take a stab in the
    dark at it.

    Let's take the example above for N = 3:

    items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
    # use the lowest common factor for simplicity
    items = [('Mary',6), ('John', 2), ('Tom', 9), ('Jane', 3)]

    # N = 3 has four possible results:
    R = ['Mary', 'John', 'Tom']
    S = ['Mary', 'John', 'Jane']
    T = ['Mary', 'Tom', 'Jane']
    U = ['John', 'Tom', 'Jane']

    You want to bias the four possible results above so that having Mary is
    three times more likely than having John. It sounds simple, but it isn't.
    To do so, you need to solve a whole series of simultaneous equations:

    Pr(Mary) = Pr(R) + Pr(S) + Pr(T)
    Pr(John) = Pr(R) + Pr(S) + Pr(U)
    Pr(Mary) = 3*Pr(John)

    And so on for all the other items.

    This is a HARD problem to solve, and for most sets of likelihoods you
    might give, there won't be a solution. For example, you can't have a set
    of results where Mary is three times more likely than John, John is twice
    as likely as Tom, and Tom is four times more likely than Mary. It simply
    can't happen.

    So we can't interpret the numbers as probabilities. We can't even
    interpret them more loosely as "likelihoods". Proof of that is to
    consider the case of N=4. There is only one possible result with four
    distinct items. So all of Mary, John, Tom and Jane are equally likely, in
    fact all of them are certain. The weightings you gave (30, 10, 45, 15)
    are meaningless.

    So, given that what you are asking for is impossible, can you explain
    what you are actually trying to accomplish? Maybe there's a more feasible
    alternative.


    --
    Steven.
     
    Steven D'Aprano, Nov 17, 2007
    #15
  16. Bruza

    Jordan Guest

    How about this variation on your intial attempt?

    # Untested!
    def randomPick(n, items):
    def pickOne():
    index = random.randint(0, 99)
    currentP = 0
    for (obj, p) in items:
    currentP += p
    if currentP > index:
    return obj
    selection = set()
    while len(selection) < n:
    selection.add(pickOne())
    return selection

    Of course the performance is likely to be far from stellar for, say,
    randomPick(2, [("Bob", 98), ("Jane", 1), ("Harry", 1)]), but at least
    its boundedly bad (so long as you retain the condition that the sum of
    the weightings is 100.)

    Not sure if it satisfies the conditions in my last post.... either do
    some empirical testing, or some mathematics, or maybe a bit of
    both.

    On Nov 17, 12:02 pm, Jordan <> wrote:
    > Maybe it would help to make your problem statement a litte rigorous so
    > we can get a clearer idea of whats required.
    >
    > One possible formulation:
    >
    > Given a list L of pairs of values, weightings: [ (v_0, w_0), (v_1,
    > w_1), ....], and some N between 1 and length(L)
    >
    > you would like to randomly select a set of N (distinct) values, V,
    > such that for any ints i and j,
    >
    > Prob (v_i is in V) / Prob (v_j is in V) = w_i / w_j
    >
    > This matches your expectations for N = 1. Intuitively though, without
    > having put much thought into it, I suspect this might not be possible
    > in the general case.
    >
    > You might then want to (substantially) relax thec ondition to
    >
    > Prob (v_i is in V) >= Prob (v_j is in V) iff w_i >= w_j
    >
    > but in that case its more an ordering of likelihoods rather than a
    > weighting, and doesn't guarantee the right behaviour for N = 1, so i
    > don't think thats really what you want.
    >
    > I can't think of any other obvious way of generalising the behaviour
    > of the N = 1 case.
    >
    > - Jordan
    >
    > On Nov 17, 10:50 am, Bruza <> wrote:
    >
    >
    >
    > > On Nov 16, 4:47 pm, Bruza <> wrote:

    >
    > > > On Nov 16, 6:58 am, duncan smith <>
    > > > wrote:

    >
    > > > > Bruza wrote:
    > > > > > I need to implement a "random selection" algorithm which takes a list
    > > > > > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
    > > > > > likely an object, "obj", should be selected based on its probability
    > > > > > of
    > > > > > "prob".To simplify the problem, assuming "prob" are integers, and the
    > > > > > sum of all "prob" equals 100. For example,

    >
    > > > > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

    >
    > > > > > The algorithm will take a number "N", and a [(obj, prob),...] list as
    > > > > > inputs, and randomly pick "N" objects based on the probabilities of
    > > > > > the
    > > > > > objects in the list.

    >
    > > > > > For N=1 this is pretty simply; the following code is sufficient to do
    > > > > > the job.

    >
    > > > > > def foo(items):
    > > > > > index = random.randint(0, 99)
    > > > > > currentP = 0
    > > > > > for (obj, p) in items:
    > > > > > currentP += w
    > > > > > if currentP > index:
    > > > > > return obj

    >
    > > > > > But how about the general case, for N > 1 and N < len(items)? Is there
    > > > > > some clever algorithm using Python standard "random" package to do
    > > > > > the trick?

    >
    > > > > I think you need to clarify what you want to do. The "probs" are
    > > > > clearly not probabilities. Are they counts of items? Are you then
    > > > > sampling without replacement? When you say N < len(items) do you mean N
    > > > > <= sum of the "probs"?

    >
    > > > > Duncabn

    >
    > > > I think I need to explain on the probability part: the "prob" is a
    > > > relative likelihood that the object will be included in the output
    > > > list. So, in my example input of

    >
    > > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

    >
    > > > So, for any size of N, 'Tom' (with prob of 45) will be more likely to
    > > > be included in the output list of N distinct member than 'Mary' (prob
    > > > of 30) and much more likely than that of 'John' (with prob of 10).

    >
    > > > I know "prob" is not exactly the "probability" in the context of
    > > > returning a multiple member list. But what I want is a way to "favor"
    > > > some member in a selection process.

    >
    > > > So far, only Boris's solution is closest (but not quite) to what I
    > > > need, which returns a list of N distinct object from the input
    > > > "items". However, I tried with input of

    >
    > > > items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]

    >
    > > > and have a repeated calling of

    >
    > > > Ben

    >
    > > OOPS. I pressed the Send too fast.

    >
    > > The problem w/ Boris's solution is that after repeated calling of
    > > randomPick(3,items), 'Jane' is not the most "frequent appearing"
    > > member in all the out list of 3 member lists...- Hide quoted text -

    >
    > > - Show quoted text -- Hide quoted text -

    >
    > - Show quoted text -
     
    Jordan, Nov 17, 2007
    #16
  17. Bruza

    Guest

    Knuth says to pick N distinct records from a collection where the
    probability is equal you should:

    first fill up N records by chosing the first seen.

    if less than N were in the collection, quit.

    otherwise, t = (the number of items looked at) or N to start.

    while your not at the end of the collection:
    increment t
    generate a random number between 0 and t => K
    if K < N:
    add it to your collection at position K (replacing the previous
    item).

    Now, to incorporate probability is another question . . .

    Find the largest probability. Assume it has 100% probability.
    Calculate the probability of each item accordingly. Take the randomly
    generated number and multiply it by the probability. Take the random
    number minus the (random number times to probability). If it falls in
    range, then you replace like usual. You should find that the max
    probability will always appear in the zeroth position if it is not
    replaced by another value. Now, this does not guarantee that the
    highest probability value will show up first in the list, since that
    is the same as sorting by the probability. It is just a way of
    increasing the probability of making the value fall in the range as
    the probability varies.

    I am not guaranteeing this even works. I am seeing that there is some
    collision among the numbers, but it will work for the most part.
     
    , Nov 17, 2007
    #17
  18. On Sat, 17 Nov 2007 13:51:16 -0800, wrote:

    > I am not guaranteeing this even works. I am seeing that there is some
    > collision among the numbers, but it will work for the most part.


    "Work for the most part" -- is that another way of saying "Apart from the
    bugs, this is bug-free"?


    --
    Steven.
     
    Steven D'Aprano, Nov 17, 2007
    #18
  19. Bruza

    Neil Cerutti Guest

    On 2007-11-17, Bruza <> wrote:
    > OOPS. I pressed the Send too fast.
    >
    > The problem w/ Boris's solution is that after repeated calling
    > of randomPick(3,items), 'Jane' is not the most "frequent
    > appearing" member in all the out list of 3 member lists...


    How does this solution fair against your spec?

    def sample_bp(seq, probabilities, k):
    """ Return k distinct random items from seq, with probabilities specified
    as weighted integers in a list.

    >>> random.seed(0)


    >>> sample_bp(['a', 'b'], [1, 5], 2)

    ['b', 'a']

    >>> sample_bp(['a', 'b', 'c'], [1, 5, 2], 3)

    ['b', 'a', 'c']

    """

    if k > len(seq):
    raise ValueError('sample size greater than population')
    probs = build_probs(probabilities)
    rv = []
    while k > 0:
    j = random_prob(probs)
    rv.append(probs[j][2])
    remove_prob(probs, j)
    k -= 1
    return [seq for i in rv]

    def build_probs(probabilities):
    """ Receives a list of integers, and returns list of ranges and original
    indices.

    >>> build_probs([8, 10, 7])

    [(0, 8, 0), (8, 18, 1), (18, 25, 2)]
    >>> build_probs([1, 5, 8, 2, 3, 7])

    [(0, 1, 0), (1, 6, 1), (6, 14, 2), (14, 16, 3), (16, 19, 4), (19, 26, 5)]


    """
    k = 0
    probs = []
    for i, p in enumerate(probabilities):
    if p < 0:
    raise ValueError('negative probability')
    probs.append((k, k+p, i))
    k = k+p
    return probs

    def random_prob(probs):
    """ Return the index of a weighted random element of prob.

    >>> random.seed(0)


    >>> for x in xrange(20):

    ... print random_prob(build_probs([1, 5, 8, 2, 3, 7])),
    5 5 2 2 2 2 5 2 2 3 5 2 2 5 4 2 5 5 5 5

    >>> random_prob([(0, 15, 0)])

    0

    """
    i = random.randrange(probs[-1][1])
    # Binary search for the element whose range contains i
    hi = len(probs)
    lo = 0
    while lo < hi:
    mid = (lo+hi)//2
    begin, end, _ = probs[mid]
    if i >= begin and i < end: return mid
    elif i >= end: lo = mid+1
    else: hi = mid

    def remove_prob(probs, i):
    """ Remove element j from the probability list, adjusting ranges as needed.

    >>> prob = [(0, 12, 0), (12, 15, 1), (15, 25, 2)]
    >>> remove_prob(prob, 1)
    >>> prob

    [(0, 12, 0), (12, 22, 2)]

    """
    begin, end, _ = probs
    diff = end - begin
    j = i+1
    while j < len(probs):
    begin, end, index = probs[j]
    probs[j] = (begin-diff, end-diff, index)
    j += 1
    del probs

    This was the most simple-minded approach I could think of, so it
    might serve as a reference against a more efficient approach.
    Although thorough testing of such a bizarre function boggles my
    mind.

    I toyed with sorting the probability list from greatest to
    lowest, which would make a linear search fast, but unfortunately
    it would slow down removing probabilities.

    --
    Neil Cerutti
    I make love to pressure. --Stephen Jackson
     
    Neil Cerutti, Nov 19, 2007
    #19
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Simon Niederberger
    Replies:
    2
    Views:
    16,553
    Christian Kaufhold
    Jan 7, 2005
  2. Andrew Crowe
    Replies:
    1
    Views:
    4,482
    Andrew Crowe
    Sep 13, 2004
  3. J. Clifford Dyer

    Re: implement random selection in Python

    J. Clifford Dyer, Nov 17, 2007, in forum: Python
    Replies:
    1
    Views:
    495
    Scott David Daniels
    Nov 20, 2007
  4. globalrev
    Replies:
    4
    Views:
    778
    Gabriel Genellina
    Apr 20, 2008
  5. VK
    Replies:
    15
    Views:
    1,189
    Dr J R Stockton
    May 2, 2010
Loading...

Share This Page