implement random selection in Python

Discussion in 'Python' started by Bruza, Nov 16, 2007.

1. BruzaGuest

I need to implement a "random selection" algorithm which takes a list
of [(obj, prob),...] as input. Each of the (obj, prob) represents how
likely an object, "obj", should be selected based on its probability
of
"prob".To simplify the problem, assuming "prob" are integers, and the
sum of all "prob" equals 100. For example,

items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

The algorithm will take a number "N", and a [(obj, prob),...] list as
inputs, and randomly pick "N" objects based on the probabilities of
the
objects in the list.

For N=1 this is pretty simply; the following code is sufficient to do
the job.

def foo(items):
index = random.randint(0, 99)
currentP = 0
for (obj, p) in items:
currentP += w
if currentP > index:
return obj

But how about the general case, for N > 1 and N < len(items)? Is there
some clever algorithm using Python standard "random" package to do
the trick?

Thanks,

Ben

Bruza, Nov 16, 2007

2. Guest

On Nov 15, 10:40ï¿½pm, Bruza <> wrote:
> I need to implement a "random selection" algorithm which takes a list
> of [(obj, prob),...] as input. Each of the (obj, prob) represents how
> likely an object, "obj", should be selected based on its probability
> of
> "prob".To simplify the problem, assuming "prob" are integers, and the
> sum of all "prob" equals 100. For example,
>
> ï¿½ ï¿½items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>
> The algorithm will take a number "N", and a [(obj, prob),...] list as
> inputs, and randomly pick "N" objects based on the probabilities of
> the
> objects in the list.
>
> For N=1 this is pretty simply; the following code is sufficient to do
> the job.
>
> def foo(items):
> ï¿½ ï¿½ index = random.randint(0, 99)
> ï¿½ ï¿½ currentP = 0
> ï¿½ ï¿½ for (obj, p) in items:
> ï¿½ ï¿½ ï¿½ ï¿½currentP += w
> ï¿½ ï¿½ ï¿½ ï¿½if currentP > index:
> ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ return obj
>
> But how about the general case, for N > 1 and N < len(items)? Is there
> some clever algorithm using Python standard "random" package to do
> the trick?
>
> Thanks,
>
> Ben

What do you think of this?

import random
N = 100
items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
the_items = []
for i,j in items:
the_items.extend(*j)
histogram = {}
for i in xrange(N):
chosen = random.choice(the_items)
print chosen,
if chosen in histogram:
histogram[chosen] += 1
else:
histogram[chosen] = 1
print
print
for i in histogram:
print '%4s: %d' % (i,histogram)

## John Mary Jane Tom Tom Mary Mary Tom John John Tom John Tom
## John Mary Mary Mary John Tom Tom John Mary Mary Tom Mary
## Mary John Tom Jane Jane Jane John Tom Jane Tom Tom John Tom
## Tom Mary Tom Tom Mary Tom Mary Tom Tom Tom Tom Mary Mary Tom
## Mary Tom Mary Tom Tom Jane Tom Mary Tom Jane Tom Tom Tom Tom
## Tom Mary Tom Jane Tom Mary Mary Jane Mary John Mary Mary Tom
## Mary Mary Tom Mary John Tom Tom Tom Tom Mary Jane Mary Tom
## Mary Tom Tom Jane Tom Mary Mary Tom
##
## Jane: 11
## John: 12
## Mary: 32
## Tom: 45

, Nov 16, 2007

3. James StroudGuest

Bruza wrote:
> I need to implement a "random selection" algorithm which takes a list
> of [(obj, prob),...] as input. Each of the (obj, prob) represents how
> likely an object, "obj", should be selected based on its probability
> of
> "prob".To simplify the problem, assuming "prob" are integers, and the
> sum of all "prob" equals 100. For example,
>
> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>
> The algorithm will take a number "N", and a [(obj, prob),...] list as
> inputs, and randomly pick "N" objects based on the probabilities of
> the
> objects in the list.
>
>
> For N=1 this is pretty simply; the following code is sufficient to do
> the job.
>
> def foo(items):
> index = random.randint(0, 99)
> currentP = 0
> for (obj, p) in items:
> currentP += w
> if currentP > index:
> return obj
>
> But how about the general case, for N > 1 and N < len(items)? Is there
> some clever algorithm using Python standard "random" package to do
> the trick?
>
> Thanks,
>
> Ben

This will not get you an A on your homework:

x = []
[x.extend(*n) for (i,n) in items]
random.choice(x)

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com

James Stroud, Nov 16, 2007
4. BruzaGuest

On Nov 15, 9:32 pm, "" <> wrote:
> On Nov 15, 10:40ï¿½pm, Bruza <> wrote:
>
>
>
> > I need to implement a "random selection" algorithm which takes a list
> > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
> > likely an object, "obj", should be selected based on its probability
> > of
> > "prob".To simplify the problem, assuming "prob" are integers, and the
> > sum of all "prob" equals 100. For example,

>
> > ï¿½ ï¿½items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

>
> > The algorithm will take a number "N", and a [(obj, prob),...] list as
> > inputs, and randomly pick "N" objects based on the probabilities of
> > the
> > objects in the list.

>
> > For N=1 this is pretty simply; the following code is sufficient to do
> > the job.

>
> > def foo(items):
> > ï¿½ ï¿½ index = random.randint(0, 99)
> > ï¿½ ï¿½ currentP = 0
> > ï¿½ ï¿½ for (obj, p) in items:
> > ï¿½ ï¿½ ï¿½ ï¿½currentP += w
> > ï¿½ ï¿½ ï¿½ ï¿½if currentP > index:
> > ï¿½ ï¿½ ï¿½ ï¿½ ï¿½ return obj

>
> > But how about the general case, for N > 1 and N < len(items)? Is there
> > some clever algorithm using Python standard "random" package to do
> > the trick?

>
> > Thanks,

>
> > Ben

>
> What do you think of this?
>
> import random
> N = 100
> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
> the_items = []
> for i,j in items:
> the_items.extend(*j)
> histogram = {}
> for i in xrange(N):
> chosen = random.choice(the_items)
> print chosen,
> if chosen in histogram:
> histogram[chosen] += 1
> else:
> histogram[chosen] = 1
> print
> print
> for i in histogram:
> print '%4s: %d' % (i,histogram)
>
> ## John Mary Jane Tom Tom Mary Mary Tom John John Tom John Tom
> ## John Mary Mary Mary John Tom Tom John Mary Mary Tom Mary
> ## Mary John Tom Jane Jane Jane John Tom Jane Tom Tom John Tom
> ## Tom Mary Tom Tom Mary Tom Mary Tom Tom Tom Tom Mary Mary Tom
> ## Mary Tom Mary Tom Tom Jane Tom Mary Tom Jane Tom Tom Tom Tom
> ## Tom Mary Tom Jane Tom Mary Mary Jane Mary John Mary Mary Tom
> ## Mary Mary Tom Mary John Tom Tom Tom Tom Mary Jane Mary Tom
> ## Mary Tom Tom Jane Tom Mary Mary Tom
> ##
> ## Jane: 11
> ## John: 12
> ## Mary: 32
> ## Tom: 45

No. That does not solve the problem. What I want is a function

def randomPick(n, the_items):

which will return n DISTINCT items from "the_items" such that
the n items returned are according to their probabilities specified
in the (item, pro) elements inside "the_items".

If I understand correctly, both of the previous replies will output
one item at a time independently, instead of returning n DISTINCT
items at a time.

Bruza, Nov 16, 2007
5. Boris BorcicGuest

Bruza wrote:
> No. That does not solve the problem. What I want is a function
>
> def randomPick(n, the_items):
>
> which will return n DISTINCT items from "the_items" such that
> the n items returned are according to their probabilities specified
> in the (item, pro) elements inside "the_items".
>
> If I understand correctly, both of the previous replies will output
> one item at a time independently, instead of returning n DISTINCT
> items at a time.
>

from random import sample
randomPick = lambda n,its : sample(eval('+'.join('[%r]*%r'%p for p in its)),n)

hth

Boris Borcic, Nov 16, 2007
6. Guest

, Nov 16, 2007
7. Boris BorcicGuest

Bruza wrote:
> No. That does not solve the problem. What I want is a function
>
> def randomPick(n, the_items):
>
> which will return n DISTINCT items from "the_items" such that
> the n items returned are according to their probabilities specified
> in the (item, pro) elements inside "the_items".

and in the initial post you wrote :

> But how about the general case, for N > 1 and N < len(items)?

The problem is you need to make "the n items returned are according
to their probabilities" more precise. "According to their probabilities" implies
n INDEPENDENT picks, but this is contradictory with the n picks having to
provide DISTINCT results (what clearly constrains picks relative to each other).

Of course there are obvious ways to combine the results of random choices of
single items to obtain a set like you want, but it is not obvious that they are
equivalent.

This is the most simple-minded :

def randomPick(n, the_items) :
assert n<len(the_items)
result = set()
while len(result)<n :
return sorted(result)

This is another (but it won't work with your version of singlePick as it is,
although it would with that provided by the other posters) :

def randomPick(n, the_items) :
result = []
items = dict(the_items)
for k in range(n) :
pick = singlePick(items.items())
result.append(pick)
del items[pick]
return result

I would be surprised if they had exactly the same statistical properties, IOW,
if they did fit the same exact interpretation of "according to their probabilities".

Boris Borcic, Nov 16, 2007
8. Paul RubinGuest

Bruza <> writes:
> But how about the general case, for N > 1 and N < len(items)? Is there
> some clever algorithm using Python standard "random" package

Yeah, I'm not sure what the name for it is, but there'ss a well known
algorithm that's sort of an online verison of random.choice. The
famous N=1 example where all probabilities are equal goes:

# choose a random element of seq
for k,s in enumerate(seq):
if random() < 1.0/(k+1):
choice = s

you should be able to generalize this to N items with different
probabilities.

Paul Rubin, Nov 16, 2007
9. Boris BorcicGuest

Boris Borcic wrote:
> Bruza wrote:
>> No. That does not solve the problem. What I want is a function
>>
>> def randomPick(n, the_items):
>>
>> which will return n DISTINCT items from "the_items" such that
>> the n items returned are according to their probabilities specified
>> in the (item, pro) elements inside "the_items".

>
> and in the initial post you wrote :
>
> > But how about the general case, for N > 1 and N < len(items)?

>
> The problem is you need to make "the n items returned are according
> to their probabilities" more precise. "According to their probabilities" implies
> n INDEPENDENT picks, but this is contradictory with the n picks having to
> provide DISTINCT results (what clearly constrains picks relative to each other).
>
> Of course there are obvious ways to combine the results of random choices of
> single items to obtain a set like you want, but it is not obvious that they are
> equivalent.
>
> This is the most simple-minded :
>
> def randomPick(n, the_items) :
> assert n<len(the_items)
> result = set()
> while len(result)<n :
> return sorted(result)
>
> This is another (but it won't work with your version of singlePick as it is,
> although it would with that provided by the other posters) :
>
> def randomPick(n, the_items) :
> result = []
> items = dict(the_items)
> for k in range(n) :
> pick = singlePick(items.items())
> result.append(pick)
> del items[pick]
> return result
>
> I would be surprised if they had exactly the same statistical properties, IOW,
> if they did fit the same exact interpretation of "according to their probabilities".
>
>

yet another one, constructing a list of n-sets first, and then picking one;
like the other solutions, it doesn't optimize for repeated use.

def pickn(items,n) :
"yields all n-sublists of (destructed) items"
if n<=len(items) :
if n :
item = items.pop()
for res in pickn(items[:],n) :
yield res
for res in pickn(items,n-1) :
res.append(item)
yield res
else :
yield []

def randomPick(n,the_items) :
"randomly pick n distinct members of the_items"
the_sets = list(pickn(the_items[:],n))
divisor = len(the_sets)*float(n)/len(the_items)
for k,s in enumerate(the_sets) :
the_sets[k] = (sorted(who for who,_ in s),
int(1+sum(p for _,p in s)/divisor)) # mhh...
return singlePick(the_sets)

Boris Borcic, Nov 16, 2007
10. duncan smithGuest

Bruza wrote:
> I need to implement a "random selection" algorithm which takes a list
> of [(obj, prob),...] as input. Each of the (obj, prob) represents how
> likely an object, "obj", should be selected based on its probability
> of
> "prob".To simplify the problem, assuming "prob" are integers, and the
> sum of all "prob" equals 100. For example,
>
> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>
> The algorithm will take a number "N", and a [(obj, prob),...] list as
> inputs, and randomly pick "N" objects based on the probabilities of
> the
> objects in the list.
>
>
> For N=1 this is pretty simply; the following code is sufficient to do
> the job.
>
> def foo(items):
> index = random.randint(0, 99)
> currentP = 0
> for (obj, p) in items:
> currentP += w
> if currentP > index:
> return obj
>
> But how about the general case, for N > 1 and N < len(items)? Is there
> some clever algorithm using Python standard "random" package to do
> the trick?
>

I think you need to clarify what you want to do. The "probs" are
clearly not probabilities. Are they counts of items? Are you then
sampling without replacement? When you say N < len(items) do you mean N
<= sum of the "probs"?

Duncabn

duncan smith, Nov 16, 2007
11. BruzaGuest

On Nov 16, 6:58 am, duncan smith <>
wrote:
> Bruza wrote:
> > I need to implement a "random selection" algorithm which takes a list
> > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
> > likely an object, "obj", should be selected based on its probability
> > of
> > "prob".To simplify the problem, assuming "prob" are integers, and the
> > sum of all "prob" equals 100. For example,

>
> > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

>
> > The algorithm will take a number "N", and a [(obj, prob),...] list as
> > inputs, and randomly pick "N" objects based on the probabilities of
> > the
> > objects in the list.

>
> > For N=1 this is pretty simply; the following code is sufficient to do
> > the job.

>
> > def foo(items):
> > index = random.randint(0, 99)
> > currentP = 0
> > for (obj, p) in items:
> > currentP += w
> > if currentP > index:
> > return obj

>
> > But how about the general case, for N > 1 and N < len(items)? Is there
> > some clever algorithm using Python standard "random" package to do
> > the trick?

>
> I think you need to clarify what you want to do. The "probs" are
> clearly not probabilities. Are they counts of items? Are you then
> sampling without replacement? When you say N < len(items) do you mean N
> <= sum of the "probs"?
>
> Duncabn

I think I need to explain on the probability part: the "prob" is a
relative likelihood that the object will be included in the output
list. So, in my example input of

items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

So, for any size of N, 'Tom' (with prob of 45) will be more likely to
be included in the output list of N distinct member than 'Mary' (prob
of 30) and much more likely than that of 'John' (with prob of 10).

I know "prob" is not exactly the "probability" in the context of
returning a multiple member list. But what I want is a way to "favor"
some member in a selection process.

So far, only Boris's solution is closest (but not quite) to what I
need, which returns a list of N distinct object from the input
"items". However, I tried with input of

items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]

and have a repeated calling of

Ben

Bruza, Nov 17, 2007
12. BruzaGuest

On Nov 16, 4:47 pm, Bruza <> wrote:
> On Nov 16, 6:58 am, duncan smith <>
> wrote:
>
>
>
> > Bruza wrote:
> > > I need to implement a "random selection" algorithm which takes a list
> > > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
> > > likely an object, "obj", should be selected based on its probability
> > > of
> > > "prob".To simplify the problem, assuming "prob" are integers, and the
> > > sum of all "prob" equals 100. For example,

>
> > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

>
> > > The algorithm will take a number "N", and a [(obj, prob),...] list as
> > > inputs, and randomly pick "N" objects based on the probabilities of
> > > the
> > > objects in the list.

>
> > > For N=1 this is pretty simply; the following code is sufficient to do
> > > the job.

>
> > > def foo(items):
> > > index = random.randint(0, 99)
> > > currentP = 0
> > > for (obj, p) in items:
> > > currentP += w
> > > if currentP > index:
> > > return obj

>
> > > But how about the general case, for N > 1 and N < len(items)? Is there
> > > some clever algorithm using Python standard "random" package to do
> > > the trick?

>
> > I think you need to clarify what you want to do. The "probs" are
> > clearly not probabilities. Are they counts of items? Are you then
> > sampling without replacement? When you say N < len(items) do you mean N
> > <= sum of the "probs"?

>
> > Duncabn

>
> I think I need to explain on the probability part: the "prob" is a
> relative likelihood that the object will be included in the output
> list. So, in my example input of
>
> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>
> So, for any size of N, 'Tom' (with prob of 45) will be more likely to
> be included in the output list of N distinct member than 'Mary' (prob
> of 30) and much more likely than that of 'John' (with prob of 10).
>
> I know "prob" is not exactly the "probability" in the context of
> returning a multiple member list. But what I want is a way to "favor"
> some member in a selection process.
>
> So far, only Boris's solution is closest (but not quite) to what I
> need, which returns a list of N distinct object from the input
> "items". However, I tried with input of
>
> items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]
>
> and have a repeated calling of
>
> Ben

OOPS. I pressed the Send too fast.

The problem w/ Boris's solution is that after repeated calling of
randomPick(3,items), 'Jane' is not the most "frequent appearing"
member in all the out list of 3 member lists...

Bruza, Nov 17, 2007
13. JordanGuest

Maybe it would help to make your problem statement a litte rigorous so
we can get a clearer idea of whats required.

One possible formulation:

Given a list L of pairs of values, weightings: [ (v_0, w_0), (v_1,
w_1), ....], and some N between 1 and length(L)

you would like to randomly select a set of N (distinct) values, V,
such that for any ints i and j,

Prob (v_i is in V) / Prob (v_j is in V) = w_i / w_j

This matches your expectations for N = 1. Intuitively though, without
having put much thought into it, I suspect this might not be possible
in the general case.

You might then want to (substantially) relax thec ondition to

Prob (v_i is in V) >= Prob (v_j is in V) iff w_i >= w_j

but in that case its more an ordering of likelihoods rather than a
weighting, and doesn't guarantee the right behaviour for N = 1, so i
don't think thats really what you want.

I can't think of any other obvious way of generalising the behaviour
of the N = 1 case.

- Jordan

On Nov 17, 10:50 am, Bruza <> wrote:
> On Nov 16, 4:47 pm, Bruza <> wrote:
>
>
>
>
>
> > On Nov 16, 6:58 am, duncan smith <>
> > wrote:

>
> > > Bruza wrote:
> > > > I need to implement a "random selection" algorithm which takes a list
> > > > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
> > > > likely an object, "obj", should be selected based on its probability
> > > > of
> > > > "prob".To simplify the problem, assuming "prob" are integers, and the
> > > > sum of all "prob" equals 100. For example,

>
> > > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

>
> > > > The algorithm will take a number "N", and a [(obj, prob),...] list as
> > > > inputs, and randomly pick "N" objects based on the probabilities of
> > > > the
> > > > objects in the list.

>
> > > > For N=1 this is pretty simply; the following code is sufficient to do
> > > > the job.

>
> > > > def foo(items):
> > > > index = random.randint(0, 99)
> > > > currentP = 0
> > > > for (obj, p) in items:
> > > > currentP += w
> > > > if currentP > index:
> > > > return obj

>
> > > > But how about the general case, for N > 1 and N < len(items)? Is there
> > > > some clever algorithm using Python standard "random" package to do
> > > > the trick?

>
> > > I think you need to clarify what you want to do. The "probs" are
> > > clearly not probabilities. Are they counts of items? Are you then
> > > sampling without replacement? When you say N < len(items) do you mean N
> > > <= sum of the "probs"?

>
> > > Duncabn

>
> > I think I need to explain on the probability part: the "prob" is a
> > relative likelihood that the object will be included in the output
> > list. So, in my example input of

>
> > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

>
> > So, for any size of N, 'Tom' (with prob of 45) will be more likely to
> > be included in the output list of N distinct member than 'Mary' (prob
> > of 30) and much more likely than that of 'John' (with prob of 10).

>
> > I know "prob" is not exactly the "probability" in the context of
> > returning a multiple member list. But what I want is a way to "favor"
> > some member in a selection process.

>
> > So far, only Boris's solution is closest (but not quite) to what I
> > need, which returns a list of N distinct object from the input
> > "items". However, I tried with input of

>
> > items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]

>
> > and have a repeated calling of

>
> > Ben

>
> OOPS. I pressed the Send too fast.
>
> The problem w/ Boris's solution is that after repeated calling of
> randomPick(3,items), 'Jane' is not the most "frequent appearing"
> member in all the out list of 3 member lists...- Hide quoted text -
>
> - Show quoted text -

Jordan, Nov 17, 2007
14. duncan smithGuest

Bruza wrote:
> On Nov 16, 4:47 pm, Bruza <> wrote:
>> On Nov 16, 6:58 am, duncan smith <>
>> wrote:
>>
>>
>>
>>> Bruza wrote:
>>>> I need to implement a "random selection" algorithm which takes a list
>>>> of [(obj, prob),...] as input. Each of the (obj, prob) represents how
>>>> likely an object, "obj", should be selected based on its probability
>>>> of
>>>> "prob".To simplify the problem, assuming "prob" are integers, and the
>>>> sum of all "prob" equals 100. For example,
>>>> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>>>> The algorithm will take a number "N", and a [(obj, prob),...] list as
>>>> inputs, and randomly pick "N" objects based on the probabilities of
>>>> the
>>>> objects in the list.
>>>> For N=1 this is pretty simply; the following code is sufficient to do
>>>> the job.
>>>> def foo(items):
>>>> index = random.randint(0, 99)
>>>> currentP = 0
>>>> for (obj, p) in items:
>>>> currentP += w
>>>> if currentP > index:
>>>> return obj
>>>> But how about the general case, for N > 1 and N < len(items)? Is there
>>>> some clever algorithm using Python standard "random" package to do
>>>> the trick?
>>> I think you need to clarify what you want to do. The "probs" are
>>> clearly not probabilities. Are they counts of items? Are you then
>>> sampling without replacement? When you say N < len(items) do you mean N
>>> <= sum of the "probs"?
>>> Duncabn

>> I think I need to explain on the probability part: the "prob" is a
>> relative likelihood that the object will be included in the output
>> list. So, in my example input of
>>
>> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>>
>> So, for any size of N, 'Tom' (with prob of 45) will be more likely to
>> be included in the output list of N distinct member than 'Mary' (prob
>> of 30) and much more likely than that of 'John' (with prob of 10).
>>
>> I know "prob" is not exactly the "probability" in the context of
>> returning a multiple member list. But what I want is a way to "favor"
>> some member in a selection process.
>>
>> So far, only Boris's solution is closest (but not quite) to what I
>> need, which returns a list of N distinct object from the input
>> "items". However, I tried with input of
>>
>> items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]
>>
>> and have a repeated calling of
>>
>> Ben

>
> OOPS. I pressed the Send too fast.
>
> The problem w/ Boris's solution is that after repeated calling of
> randomPick(3,items), 'Jane' is not the most "frequent appearing"
> member in all the out list of 3 member lists...
>

So it looks like you are sampling without replacement. For large N you
don't want to be generating N individual observations. Do you want
something like the following?

>>> import randvars
>>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 100)

[34, 8, 40, 18]
>>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 10000)

[2984, 1003, 4511, 1502]
>>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 1000000)

[300068, 99682, 450573, 149677]
>>>

The algorithm I use is to generate independent Poisson variates for each
category, with Poisson parameters proportional to the probabilities.
This won't usually give you a total of N, but the sample can be adjusted
by adding / subtracting individual observations. In fact, I choose a
constant of proportionality that aims to get the total close to, but not
greater than N. I repeatedly generate Poisson counts until the sample
observations. (This algorithm might be in Knuth, and might be due to
Ahrens and Dieter. I can't remember the exact source.)

An efficient way to generate the individual observations is to construct
a cumulative distribution function and use binary search. I know the
code for this has been posted before. An alias table should generally
be faster, but my Python coded effort is only similarly fast.

Unfortunately generating Poisson variates isn't straightforward, and the
approach I use requires the generation of gamma and beta variates (both
available in the random module though). This approach is due to Ahrens
and Dieter and is in Knuth, Vol.2. So yes, there is a clever algorithm
that uses functions in the random module (Knuth, Ahrens and Dieter are
quite clever). But it's not simple.

The numpy random module has a function for Poisson variates, which would
make things easier for you (I don't off-hand know which algorithm is
used). Then it's just a case of choosing an appropriate constant of
proportionality for the Poisson parameters. Try N - k, where k is the
number of categories, for k <= N**0.5; otherwise, N - N**0.5 - (k -
N**0.5)**0.5.

Duncan

duncan smith, Nov 17, 2007
15. Steven D'ApranoGuest

On Fri, 16 Nov 2007 16:47:16 -0800, Bruza wrote:

> I think I need to explain on the probability part: the "prob" is a
> relative likelihood that the object will be included in the output list.
> So, in my example input of
>
> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>
> So, for any size of N, 'Tom' (with prob of 45) will be more likely to be
> included in the output list of N distinct member than 'Mary' (prob of
> 30) and much more likely than that of 'John' (with prob of 10).

Since there are only four items, you can't select more than four distinct
items, because the fifth has to be a duplicate of one of the others. And
the probability/likelihood doesn't work either: if you insist on getting
distinct items, then your results are are much limited:

N = 0, result: []

N = 1, four possible results:
['Mary'] ['John'] ['Tom'] or ['Jane']

N = 2, six possible results (assuming order doesn't matter):
['Mary', 'John'] ['Mary', 'Tom'] ['Mary', 'Jane'] ['John', 'Tom']
['John', 'Jane'] or ['Tom', 'Jane']

N = 3, four possible results:
['Mary', 'John', 'Tom'] ['Mary', 'John', 'Jane']
['Mary', 'Tom', 'Jane'] or ['John', 'Tom', 'Jane']

N = 4, one possible result:
['Mary', 'John', 'Tom', 'Jane']

> I know "prob" is not exactly the "probability" in the context of
> returning a multiple member list. But what I want is a way to "favor"
> some member in a selection process.

I don't think this is really well defined, but I'll take a stab in the
dark at it.

Let's take the example above for N = 3:

items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
# use the lowest common factor for simplicity
items = [('Mary',6), ('John', 2), ('Tom', 9), ('Jane', 3)]

# N = 3 has four possible results:
R = ['Mary', 'John', 'Tom']
S = ['Mary', 'John', 'Jane']
T = ['Mary', 'Tom', 'Jane']
U = ['John', 'Tom', 'Jane']

You want to bias the four possible results above so that having Mary is
three times more likely than having John. It sounds simple, but it isn't.
To do so, you need to solve a whole series of simultaneous equations:

Pr(Mary) = Pr(R) + Pr(S) + Pr(T)
Pr(John) = Pr(R) + Pr(S) + Pr(U)
Pr(Mary) = 3*Pr(John)

And so on for all the other items.

This is a HARD problem to solve, and for most sets of likelihoods you
might give, there won't be a solution. For example, you can't have a set
of results where Mary is three times more likely than John, John is twice
as likely as Tom, and Tom is four times more likely than Mary. It simply
can't happen.

So we can't interpret the numbers as probabilities. We can't even
interpret them more loosely as "likelihoods". Proof of that is to
consider the case of N=4. There is only one possible result with four
distinct items. So all of Mary, John, Tom and Jane are equally likely, in
fact all of them are certain. The weightings you gave (30, 10, 45, 15)
are meaningless.

So, given that what you are asking for is impossible, can you explain
what you are actually trying to accomplish? Maybe there's a more feasible
alternative.

--
Steven.

Steven D'Aprano, Nov 17, 2007
16. JordanGuest

# Untested!
def randomPick(n, items):
def pickOne():
index = random.randint(0, 99)
currentP = 0
for (obj, p) in items:
currentP += p
if currentP > index:
return obj
selection = set()
while len(selection) < n:
return selection

Of course the performance is likely to be far from stellar for, say,
randomPick(2, [("Bob", 98), ("Jane", 1), ("Harry", 1)]), but at least
its boundedly bad (so long as you retain the condition that the sum of
the weightings is 100.)

Not sure if it satisfies the conditions in my last post.... either do
some empirical testing, or some mathematics, or maybe a bit of
both.

On Nov 17, 12:02 pm, Jordan <> wrote:
> Maybe it would help to make your problem statement a litte rigorous so
> we can get a clearer idea of whats required.
>
> One possible formulation:
>
> Given a list L of pairs of values, weightings: [ (v_0, w_0), (v_1,
> w_1), ....], and some N between 1 and length(L)
>
> you would like to randomly select a set of N (distinct) values, V,
> such that for any ints i and j,
>
> Prob (v_i is in V) / Prob (v_j is in V) = w_i / w_j
>
> This matches your expectations for N = 1. Intuitively though, without
> having put much thought into it, I suspect this might not be possible
> in the general case.
>
> You might then want to (substantially) relax thec ondition to
>
> Prob (v_i is in V) >= Prob (v_j is in V) iff w_i >= w_j
>
> but in that case its more an ordering of likelihoods rather than a
> weighting, and doesn't guarantee the right behaviour for N = 1, so i
> don't think thats really what you want.
>
> I can't think of any other obvious way of generalising the behaviour
> of the N = 1 case.
>
> - Jordan
>
> On Nov 17, 10:50 am, Bruza <> wrote:
>
>
>
> > On Nov 16, 4:47 pm, Bruza <> wrote:

>
> > > On Nov 16, 6:58 am, duncan smith <>
> > > wrote:

>
> > > > Bruza wrote:
> > > > > I need to implement a "random selection" algorithm which takes a list
> > > > > of [(obj, prob),...] as input. Each of the (obj, prob) represents how
> > > > > likely an object, "obj", should be selected based on its probability
> > > > > of
> > > > > "prob".To simplify the problem, assuming "prob" are integers, and the
> > > > > sum of all "prob" equals 100. For example,

>
> > > > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

>
> > > > > The algorithm will take a number "N", and a [(obj, prob),...] list as
> > > > > inputs, and randomly pick "N" objects based on the probabilities of
> > > > > the
> > > > > objects in the list.

>
> > > > > For N=1 this is pretty simply; the following code is sufficient to do
> > > > > the job.

>
> > > > > def foo(items):
> > > > > index = random.randint(0, 99)
> > > > > currentP = 0
> > > > > for (obj, p) in items:
> > > > > currentP += w
> > > > > if currentP > index:
> > > > > return obj

>
> > > > > But how about the general case, for N > 1 and N < len(items)? Is there
> > > > > some clever algorithm using Python standard "random" package to do
> > > > > the trick?

>
> > > > I think you need to clarify what you want to do. The "probs" are
> > > > clearly not probabilities. Are they counts of items? Are you then
> > > > sampling without replacement? When you say N < len(items) do you mean N
> > > > <= sum of the "probs"?

>
> > > > Duncabn

>
> > > I think I need to explain on the probability part: the "prob" is a
> > > relative likelihood that the object will be included in the output
> > > list. So, in my example input of

>
> > > items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]

>
> > > So, for any size of N, 'Tom' (with prob of 45) will be more likely to
> > > be included in the output list of N distinct member than 'Mary' (prob
> > > of 30) and much more likely than that of 'John' (with prob of 10).

>
> > > I know "prob" is not exactly the "probability" in the context of
> > > returning a multiple member list. But what I want is a way to "favor"
> > > some member in a selection process.

>
> > > So far, only Boris's solution is closest (but not quite) to what I
> > > need, which returns a list of N distinct object from the input
> > > "items". However, I tried with input of

>
> > > items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]

>
> > > and have a repeated calling of

>
> > > Ben

>
> > OOPS. I pressed the Send too fast.

>
> > The problem w/ Boris's solution is that after repeated calling of
> > randomPick(3,items), 'Jane' is not the most "frequent appearing"
> > member in all the out list of 3 member lists...- Hide quoted text -

>
> > - Show quoted text -- Hide quoted text -

>
> - Show quoted text -

Jordan, Nov 17, 2007
17. Guest

Knuth says to pick N distinct records from a collection where the
probability is equal you should:

first fill up N records by chosing the first seen.

if less than N were in the collection, quit.

otherwise, t = (the number of items looked at) or N to start.

while your not at the end of the collection:
increment t
generate a random number between 0 and t => K
if K < N:
item).

Now, to incorporate probability is another question . . .

Find the largest probability. Assume it has 100% probability.
Calculate the probability of each item accordingly. Take the randomly
generated number and multiply it by the probability. Take the random
number minus the (random number times to probability). If it falls in
range, then you replace like usual. You should find that the max
probability will always appear in the zeroth position if it is not
replaced by another value. Now, this does not guarantee that the
highest probability value will show up first in the list, since that
is the same as sorting by the probability. It is just a way of
increasing the probability of making the value fall in the range as
the probability varies.

I am not guaranteeing this even works. I am seeing that there is some
collision among the numbers, but it will work for the most part.

, Nov 17, 2007
18. Steven D'ApranoGuest

On Sat, 17 Nov 2007 13:51:16 -0800, wrote:

> I am not guaranteeing this even works. I am seeing that there is some
> collision among the numbers, but it will work for the most part.

"Work for the most part" -- is that another way of saying "Apart from the
bugs, this is bug-free"?

--
Steven.

Steven D'Aprano, Nov 17, 2007
19. Neil CeruttiGuest

On 2007-11-17, Bruza <> wrote:
> OOPS. I pressed the Send too fast.
>
> The problem w/ Boris's solution is that after repeated calling
> of randomPick(3,items), 'Jane' is not the most "frequent
> appearing" member in all the out list of 3 member lists...

How does this solution fair against your spec?

def sample_bp(seq, probabilities, k):
""" Return k distinct random items from seq, with probabilities specified
as weighted integers in a list.

>>> random.seed(0)

>>> sample_bp(['a', 'b'], [1, 5], 2)

['b', 'a']

>>> sample_bp(['a', 'b', 'c'], [1, 5, 2], 3)

['b', 'a', 'c']

"""

if k > len(seq):
raise ValueError('sample size greater than population')
probs = build_probs(probabilities)
rv = []
while k > 0:
j = random_prob(probs)
rv.append(probs[j][2])
remove_prob(probs, j)
k -= 1
return [seq for i in rv]

def build_probs(probabilities):
""" Receives a list of integers, and returns list of ranges and original
indices.

>>> build_probs([8, 10, 7])

[(0, 8, 0), (8, 18, 1), (18, 25, 2)]
>>> build_probs([1, 5, 8, 2, 3, 7])

[(0, 1, 0), (1, 6, 1), (6, 14, 2), (14, 16, 3), (16, 19, 4), (19, 26, 5)]

"""
k = 0
probs = []
for i, p in enumerate(probabilities):
if p < 0:
raise ValueError('negative probability')
probs.append((k, k+p, i))
k = k+p
return probs

def random_prob(probs):
""" Return the index of a weighted random element of prob.

>>> random.seed(0)

>>> for x in xrange(20):

... print random_prob(build_probs([1, 5, 8, 2, 3, 7])),
5 5 2 2 2 2 5 2 2 3 5 2 2 5 4 2 5 5 5 5

>>> random_prob([(0, 15, 0)])

0

"""
i = random.randrange(probs[-1][1])
# Binary search for the element whose range contains i
hi = len(probs)
lo = 0
while lo < hi:
mid = (lo+hi)//2
begin, end, _ = probs[mid]
if i >= begin and i < end: return mid
elif i >= end: lo = mid+1
else: hi = mid

def remove_prob(probs, i):
""" Remove element j from the probability list, adjusting ranges as needed.

>>> prob = [(0, 12, 0), (12, 15, 1), (15, 25, 2)]
>>> remove_prob(prob, 1)
>>> prob

[(0, 12, 0), (12, 22, 2)]

"""
begin, end, _ = probs
diff = end - begin
j = i+1
while j < len(probs):
begin, end, index = probs[j]
probs[j] = (begin-diff, end-diff, index)
j += 1
del probs

This was the most simple-minded approach I could think of, so it
might serve as a reference against a more efficient approach.
Although thorough testing of such a bizarre function boggles my
mind.

I toyed with sorting the probability list from greatest to
lowest, which would make a linear search fast, but unfortunately
it would slow down removing probabilities.

--
Neil Cerutti
I make love to pressure. --Stephen Jackson

Neil Cerutti, Nov 19, 2007