Smarter way of doing this?

M

Max M

I have written this function, "choose_by_probability()"

It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

The idea is to have two lists, one with a value, and another with a
probability. The value then gets randomly selected from its probability.

values = ['item 1','item 2','item 3',]
probabilities = [0.66, 0.33, 0.16]

print values[choose_by_probability(probabilities)]
etc...

I just wondered if anybody has a better way of doing it, as this seems
nastily squared to me?


regards Max M


###################

from random import random

def choose_by_probability(probabilities):
    "Randomly selects an index from a list of probabilities"
    rnd = random() * sum(probabilities)
    range_start = 0.0
    for i in range(len(probabilities)):
        probability = probabilities[i]
        range_end = range_start + probability
        if range_start < rnd < range_end:
            return i
        range_start = range_end


probabilities = []
for f in range(0,16):
    probabilities.append(1.0 / (f+1.0))

print probabilities

# writing it out sorted to do a visual test...
result = []
for i in range(100):
    result.append(choose_by_probability(probabilities))
result.sort()
print result
 
M

Mel Wilson

I have written this function, "choose_by_probability()"

It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

The idea is to have two lists, one with a value, and another with a
probability. The value then gets randomly selected from its probability.

values = ['item 1','item 2','item 3',]
probabilities = [0.66, 0.33, 0.16]

print values[choose_by_probability(probabilities)]
etc...

I just wondered if anybody has a better way of doing it, as this seems
nastily squared to me?

My usual way is:


import random

def choose_by_probability (values, probabilities):
    cumul = 0.0
    for prob, item in zip (probabilities, values):
        cumul += prob
        if prob > random.random()*cumul:
            selected = item
    return selected


def run_tests (n):
    D = {}
    for i in xrange (n):
        j = choose_by_probability (['item 1', 'item 2', 'item 3']
                                  , [0.66, 0.33, 0.17])
        D[j] = D.get (j, 0) + 1
    s = D.items()
    s.sort()
    for k, v in s:
        print k, float(v)/n
    print "Total:", sum([v for k, v in D.items()])

run_tests (10000)



Regards. Mel.
 
J

Jim Jewett

Max M said:
It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

You may wish to look at the GRADES example in the bisect documentation.
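Roughly, that pattern carries over to the probability case like this (a
sketch using a cumulative table, not code lifted from the docs):

from random import random
from bisect import bisect

def choose_by_probability(probabilities):
    "Returns a random index, weighted by the (unnormalised) probabilities"
    cumulative = []
    total = 0.0
    for p in probabilities:
        total += p
        cumulative.append(total)
    # like the grade lookup: find the first breakpoint above the draw
    return bisect(cumulative, random() * total)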

-jJ Take only memories. Leave not even footprints.
 
R

Raymond Hettinger

Max M said:
I have written this function, "choose_by_probability()"

It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

The idea is to have two lists, one with a value, and another with a
probability. The value then gets randomly selected from its probability.

values = ['item 1','item 2','item 3',]
probabilities = [0.66, 0.33, 0.16]

Here is a simplified version of the approach you took:

import random

def select(tab, random=random.random):
    cp = 0
    r = random()
    for i, p in enumerate(tab):
        cp += p
        if cp > r:
            return i
    raise Exception('probabilities do not add up to 1.0')

If you need to call this many times and for large table, there are
ways to speed this up.

* re-arrange the tables to make sure the largest probabilities come
first. Intentionally or not, your sample data is already arranged this
way (though the probabilities add up to more than 1.0 and should be
scaled back to 1.0).
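For instance, something along these lines would do the re-arranging (a
sketch, not part of the original post):

values = ['item 1', 'item 2', 'item 3']
probabilities = [0.66, 0.33, 0.16]

# sort both tables together, largest probability first
pairs = zip(probabilities, values)
pairs.sort()
pairs.reverse()
probabilities = [p for p, v in pairs]
values = [v for p, v in pairs]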


* pre-summarize the summations with a cumulative probability table:
[.50, .25, .15, .10] --> [.50, .75, .90, 1.00]

def cselect(ctab, random=random.random):
    r = random()
    for i, cp in enumerate(ctab):
        if cp > r:
            return i
    raise Exception


* if the table is cumulative and large, bisection offers a faster
search:

import bisect

def cselect(ctab, random=random.random, bisect=bisect.bisect):
    return bisect(ctab, random())


* if the probabilities come in integer ratios, you could reduce the
problem to an O(1) lookup:

[.5, .3, .1, .1] --> [0,0,0,0,0,1,1,1,2,3]

def lselect(ltab, choice=random.choice):
    return choice(ltab)
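
One way the lookup table itself could be built from integer weights (a
sketch; build_lookup is not from the post):

def build_lookup(weights):
    "Expands integer weights, e.g. [5, 3, 1, 1] -> [0]*5 + [1]*3 + [2] + [3]"
    ltab = []
    for i, w in enumerate(weights):
        ltab.extend([i] * w)
    return ltab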


IOW, if speed is what is important, then be sure to exploit any known
structure in the data (ordering, cumulativeness, integer ratios, etc.)
and don't compute anything twice (pre-compute the summation).


Raymond Hettinger
 
M

Max M

Jim said:
You may wish to look at the GRADES example in the bisect documentation.

First off, thanks for the replies!

I finally chose a combination of the suggested approaches, and ended up
with something I find is both more functional and nicer to look at than
my first try.

regards Max M


################################

from random import random
from bisect import bisect

class Selector:

    "Returns random elements by probability"

    def __init__(self, probabilities, elements):
        self.sum = sum(probabilities)
        self.decorated = zip(probabilities, elements)
        self.decorated.sort()

    def get(self):
        "Randomly returns an element by probability"
        rnd = random() * self.sum
        index = bisect(self.decorated, (rnd, 0))-1
        return self.decorated[index][-1]

    def get_range(self, n):
        "Randomly returns a range of elements by probability"
        return [self.get() for itm in range(n)]


if __name__ == '__main__':

    probabilities = []
    for f in range(0,10):
        probabilities.append(1.0 / (f+1.0))

    elements = ['a','b','c','d','e','f','g','h','i','j',]

    s = Selector(probabilities, elements)
    result = s.get_range(1000)
    result.sort()

    print result
 
M

Max M

Raymond said:
Here is a simplified version of the approach you took:

def select(tab, random=random.random):
    cp = 0
    r = random()
    for i, p in enumerate(tab):
        cp += p
        if cp > r:
            return i
    raise Exception('probabilities do not add up to 1.0')
* re-arrange the tables to make sure the largest probabilities come
first. Intentionally or not, your sample data is already arranged this
way (though the probabilities add up to more than 1.0 and should be
scaled back to 1.0).


They don't really have to be scaled to 1.0, as long as each element gets
selected with a frequency that corresponds to its probability.

IOW, if speed is what is important, then be sure to exploit any known
structure in the data (ordering, cumulativeness, integer ratios, etc.)
and don't compute anything twice (pre-compute the summation).


Lots of good stuff. Thanks.

Max M
 
R

Roberto A. F. De Almeida

Max M said:
I have written this function, "choose_by_probability()"

It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

The idea is to have two lists, one with a value, and another with a
probability. The value then gets randomly selected from its probability.

values = ['item 1','item 2','item 3',]
probabilities = [0.66, 0.33, 0.16]

print values[choose_by_probability(probabilities)]
etc...

I just wondered if anybody has a better way of doing it, as this seems
nastily squared to me?

In one line:

v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]

[v[i] for i,j in enumerate(p) if sum(p[:i]) > (random.random()*sum(p))][0]

R.
 
M

Max M

Roberto said:
In one line:

v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]

[v[i] for i,j in enumerate(p) if sum(p[:i]) > (random.random()*sum(p))][0]



uh, that's hard to read :)

But doesn't random.random()*sum(p) get executed for each probability in p?

Then each probability in p will be compared to a different random value.
That isn't fair...


regards Max m
 
R

Roberto Antonio Ferreira De Almeida

Max said:
Roberto said:
In one line:

v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]

[v[i] for i,j in enumerate(p) if sum(p[:i]) >
(random.random()*sum(p))][0]


uh, that's hard to read :)


Yes, I know. I just wanted to see if I could do it in one line. :)

But it just calculates the cumulative probability, and gets the first
one which passes a random value between 0 and the sum of the probabilities.
But doesn't random.random()*sum(p) get executed for each probability in p?

I actually don't know. You could pull random.random() out of the list
comprehension... :)
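
For example, something along these lines would draw the random value
only once (a sketch; the v[i] indexing and the inclusive cumulative sum
are my assumptions, not code from the post):

import random

v = ['item 1', 'item 2', 'item 3']
p = [0.66, 0.33, 0.16]

r = random.random() * sum(p)
print [v[i] for i, j in enumerate(p) if sum(p[:i+1]) > r][0]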

R.
 
M

Max M

Max said:
Jim Jewett wrote:
I finally chose a combination of the suggested approaches, and ended up
with something I find is both more functional and nicer to look at than
my first try.


Well, that version had a problem. Whenever a series of values is the
same, bisect() always selects the first one, so I never got a
statistically valid sample of elements.

I solved it by using accumulated probabilities instead. So the final
version is here, if anybody cares.

I used it in a Markov chain class, that is now pretty fast.

regards Max M



#########################

from random import random, randint
from bisect import bisect


class Selector:

    "Returns random elements by probability"

    def __init__(self, probabilities, elements):
        self.sum = sum(probabilities)
        accumulated_probabilities = []
        accumulated_probability = 0
        for probability in probabilities:
            accumulated_probability += probability
            accumulated_probabilities.append(accumulated_probability)
        self.decorated = zip(accumulated_probabilities, elements)
        self.decorated.sort()

    def get(self):
        "Randomly returns an element by probability"
        rnd = random() * self.sum
        index = bisect(self.decorated, (rnd, 0))
        return self.decorated[index][-1]

    def get_range(self, n):
        "Randomly returns a range of elements by probability"
        return [self.get() for itm in range(n)]


if __name__ == '__main__':

    probabilities = [1.0, 0.5, 0.25, 0.125, 0.0625]
    # probabilities = [0.1, 0.1, 1.0, 0.1, 0.1, ]
    elements = ['a', 'b', 'c', 'd', 'e']
    s = Selector(probabilities, elements)
    r = s.get_range(10000)
    r.sort()

    for element in elements:
        print element, r.count(element)


##########
 
J

Josiah Carlson

def get_range(self, n):
    "Randomly returns a range of elements by probability"
    return [self.get() for itm in range(n)]

Use xrange there instead of range: xrange doesn't instantiate a list, so
it is faster and uses less memory for large n.

- Josiah
 
A

Anton Vredegoor

Max M said:
I solved it by using accumulated probabilities instead. So the final
version is here, if anybody cares.

I used it in a Markov chain class, that is now pretty fast.

Yes it's fast, but I don't think it's smart :) Unfortunately I
haven't got a working solution as an alternative, but only some code
that "explains" what I mean:


from random import random

def choice_generator(population, probabilities):
    PC = zip(probabilities, population)
    while 1:
        p, c = max([(p*random(), c) for p, c in PC])
        yield c

This is short and fast. However, the distribution of the outcomes is
wrong, because the list of probabilities should be "adjusted" so that
in the end the *outcomes* are distributed according to the
"probabilities". Or should that be proportions?

My statistics skills are a bit rusty, but I think it's an interesting
problem. I'll get back when I have some more ideas. In the meantime
more mathematically educated readers are given a chance to beat me to
it :), or prove it can't be done.

Anton
 
M

Max M

Anton said:
I solved it by using accumulated probabilities instead. So the final
version is here, if anybody cares.

Yes it's fast, but I don't think it's smart :) Unfortunately I
haven't got a working solution as an alternative, but only some code
that "explains" what I mean:


def choice_generator(population, probabilities):
    PC = zip(probabilities, population)
    while 1:
        p, c = max([(p*random(), c) for p, c in PC])
        yield c

This is short and fast. However, the distribution of the outcomes is
wrong, because the list of probabilities should be "adjusted" so that
in the end the *outcomes* are distributed according to the
"probabilities". Or should that be proportions?


I don't understand what you mean. If I calculate the deviations from
what is expected, and use a large result set, I get very small deviations.

I am interested in getting a result as a proportion of the probability,
or more correctly in this case, the frequency. If I want it as
probabilities I just have to normalise the sum to 1.0.

This has the advantage that the frequencies can be expressed as integers
too. This is nice in my Markov chain class that counts words in text, etc.

In my example below each letter should occur 50% as often as the
previous letter.

Perhaps you mean that it should behave differently?

regards Max M



###################

probabilities = [16, 8, 4, 2, 1]
elements = ['a', 'b', 'c', 'd', 'e']
sample_size = 1000000

s = Selector(probabilities, elements)
r = s.get_range(sample_size)
r.sort()

previous = float(sample_size)
for element in elements:
    count = r.count(element)
    deviation = (previous/2.0-count) / count * 100
    previous = count
    print element, count, deviation

 
A

Anton Vredegoor

Max M said:
I don't understand what you mean. If I calculate the deviations from
what is expected, and use a large result set, I get very small deviations.

That's what I was afraid of, and I can't blame you for it :)
In my example below each letter should occur 50% as often as the
previous letter.

Perhaps you mean that it should behave differently?

No, your code does what it should do and my code was just some
obviously wrong example to show what *kind* of solution I was after.

For example, if my code is run with a list of probabilities like this:
[.1, .1, .1, .1, .1], it produces exactly the same distribution as your
code. However, with input like this: [2, 1] (*different* probabilities),
it starts to diverge from your code. What I would like to do is to
replace this list [2, 1] with the list of probabilities that would
cause my code to produce the same output as yours ...

So [2,1] ==> "some list of probabilities" and this *new* list is used
as multiplication factors in the line with :

"max([(p*random(),c) for p,c in PC])"

I hope I succeeded in just getting the idea across instead of giving
the impression that my code works better, which obviously it doesn't.
It *needs* some kind of transformation of the probabilities into
another list ...

Anton
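
One transformation that is known to make a max-based scheme like the one
above select items in proportion to their weights is to key each item by
random() ** (1.0 / p) rather than p * random(). A sketch of that idea (it
is not something posted in the thread, and it assumes strictly positive
weights):

from random import random

def choice_generator(population, probabilities):
    "Yields items with frequency proportional to their positive weights"
    PC = zip(probabilities, population)
    while 1:
        # the item with the largest random() ** (1.0/p) key is drawn with
        # probability p / sum(probabilities)
        key, c = max([(random() ** (1.0 / p), c) for p, c in PC])
        yield c

With probabilities = [2, 1] this yields roughly twice as many of the
first item as of the second.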
 
