Smarter way of doing this?

M

Max M

I have written this function, "choose_by_probability()"

It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

The idea is to have two lists, one with a value, and another with a
probability. The value then gets randomly selected from its probability.

values = ['item 1','item 2','item 3',]
probabilities = [0.66, 0.33, 0.16]

print values[choose_by_probability(probabilities)]
etc...

I just wondered if anybody has a better way of doing it, as this seems
nastily squared to me?


regards Max M


###################

from random import random

def choose_by_probability(probabilities):
    "Randomly selects an index from a list of probabilities"
    rnd = random() * sum(probabilities)
    range_start = 0.0
    for i in range(len(probabilities)):
        probability = probabilities[i]
        range_end = range_start + probability
        if range_start < rnd < range_end:
            return i
        range_start = range_end


probabilities = []
for f in range(0,16):
    probabilities.append(1.0 / (f+1.0))

print probabilities

# writing it out sorted to do a visual test...
result = []
for i in range(100):
    result.append(choose_by_probability(probabilities))
result.sort()
print result
 
M

Mel Wilson

I have written this function, "choose_by_probability()"

It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

The idea is to have two lists, one with a value, and another with a
probability. The value then gets randomly selected from its probability.

values = ['item 1','item 2','item 3',]
probabilities = [0.66, 0.33, 0.16]

print values[choose_by_probability(probabilities)]
etc...

I just wondered if anybody has a better way of doing it, as this seems
nastily squared to me?

My usual way is:


import random

def choose_by_probability (values, probabilities):
    cumul = 0.0
    for prob, item in zip (probabilities, values):
        cumul += prob
        if prob > random.random()*cumul:
            selected = item
    return selected


def run_tests (n):
    D = {}
    for i in xrange (n):
        j = choose_by_probability (['item 1', 'item 2', 'item 3']
                                  , [0.66, 0.33, 0.17])
        D[j] = D.get (j, 0) + 1
    s = D.items()
    s.sort()
    for k, v in s:
        print k, float(v)/n
    print "Total:", sum([v for k, v in D.items()])

run_tests (10000)



Regards. Mel.
 
J

Jim Jewett

Max M said:
It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

You may wish to look at the GRADES example in the bisect documentation.
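Roughly, that pattern carries over to the probability case like this (a
sketch using a cumulative table, not code lifted from the docs):

from random import random
from bisect import bisect

def choose_by_probability(probabilities):
    "Returns a random index, weighted by the (unnormalised) probabilities"
    cumulative = []
    total = 0.0
    for p in probabilities:
        total += p
        cumulative.append(total)
    # like the grade lookup: find the first breakpoint above the draw
    return bisect(cumulative, random() * total)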

-jJ Take only memories. Leave not even footprints.
 
R

Raymond Hettinger

Max M said:
I have written this function, "choose_by_probability()"

It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

The idea is to have two lists, one with a value, and another with a
probability. The value then gets randomly selected from its probability.

values = ['item 1','item 2','item 3',]
probabilities = [0.66, 0.33, 0.16]

Here is a simplified version of the approach you took:

import random

def select(tab, random=random.random):
    cp = 0
    r = random()
    for i, p in enumerate(tab):
        cp += p
        if cp > r:
            return i
    raise Exception('probabilities do not add up to 1.0')

If you need to call this many times and for large table, there are
ways to speed this up.

* re-arrange the tables to make sure the largest probabilities come
first. Intentionally or not, your sample data is already arranged this
way (though the probabilities add up to more than 1.0 and should be
scaled back to 1.0).
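For instance, something along these lines would do the re-arranging (a
sketch, not part of the original post):

values = ['item 1', 'item 2', 'item 3']
probabilities = [0.66, 0.33, 0.16]

# sort both tables together, largest probability first
pairs = zip(probabilities, values)
pairs.sort()
pairs.reverse()
probabilities = [p for p, v in pairs]
values = [v for p, v in pairs]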


* pre-summarize the summations with a cumulative probability table:
[.50, .25, .15, .10] --> [.50, .75, .90, 1.00]

def cselect(ctab, random=random.random):
    r = random()
    for i, cp in enumerate(ctab):
        if cp > r:
            return i
    raise Exception


* if the table is cumulative and large, bisection offers a faster
search:

import bisect

def cselect(ctab, random=random.random, bisect=bisect.bisect):
    return bisect(ctab, random())


* if the probabilities come in integer ratios, you could reduce the
problem to an O(1) lookup:

[.5, .3, .1, .1] --> [0,0,0,0,0,1,1,1,2,3]

def lselect(ltab, choice=random.choice):
    return choice(ltab)
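
One way the lookup table itself could be built from integer weights (a
sketch; build_lookup is not from the post):

def build_lookup(weights):
    "Expands integer weights, e.g. [5, 3, 1, 1] -> [0]*5 + [1]*3 + [2] + [3]"
    ltab = []
    for i, w in enumerate(weights):
        ltab.extend([i] * w)
    return ltab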


IOW, if speed is what is important, then be sure to exploit any known
structure in the data (ordering, cumulativeness, integer ratios, etc.)
and don't compute anything twice (pre-compute the summation).


Raymond Hettinger
 
M

Max M

Jim said:
You may wish to look at the GRADES example in the bisect documentation.

First off, thanks for the replies!

I finally chose a combination of the suggested approaches, and ended up
with something I find is both more functional and nicer to look at than
my first try.

regards Max M


################################

from random import random
from bisect import bisect

class Selector:

    "Returns random elements by probability"

    def __init__(self, probabilities, elements):
        self.sum = sum(probabilities)
        self.decorated = zip(probabilities, elements)
        self.decorated.sort()

    def get(self):
        "Randomly returns an element by probability"
        rnd = random() * self.sum
        index = bisect(self.decorated, (rnd, 0))-1
        return self.decorated[index][-1]

    def get_range(self, n):
        "Randomly returns a range of elements by probability"
        return [self.get() for itm in range(n)]


if __name__ == '__main__':

    probabilities = []
    for f in range(0,10):
        probabilities.append(1.0 / (f+1.0))

    elements = ['a','b','c','d','e','f','g','h','i','j',]

    s = Selector(probabilities, elements)
    result = s.get_range(1000)
    result.sort()

    print result
 
M

Max M

Raymond said:
Here is a simplified version of the approach you took:

def select(tab, random=random.random):
    cp = 0
    r = random()
    for i, p in enumerate(tab):
        cp += p
        if cp > r:
            return i
    raise Exception('probabilities do not add up to 1.0')
* re-arrange the tables to make sure the largest probabilities come
first. Intentionally or not, your sample data is already arranged this
way (though the probabilities add up to more than 1.0 and should be
scaled back to 1.0).


They don't really have to be scaled to 1.0, as long as each element gets
selected with a frequency that corresponds to its probability.

IOW, if speed is what is important, then be sure to exploit any known
structure in the data (ordering, cumulativeness, integer ratios, etc.)
and don't compute anything twice (pre-compute the summation).


Lots of good stuff. Thanks.

Max M
 
R

Roberto A. F. De Almeida

Max M said:
I have written this function, "choose_by_probability()"

It takes a list of probabilities in the range 0.00-1.00, and it must
then randomly select one of the probabilities and return its index.

The idea is to have two lists, one with a value, and another with a
probability. The value then gets randomly selected from its probability.

values = ['item 1','item 2','item 3',]
probabilities = [0.66, 0.33, 0.16]

print values[choose_by_probability(probabilities)]
etc...

I just wondered if anybody has a better way of doing it, as this seems
nastily squared to me?

In one line:

v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]

[v[i] for i,j in enumerate(p) if sum(p[:i]) > (random.random()*sum(p))][0]

R.
 
M

Max M

Roberto said:
In one line:

v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]

[v[i] for i,j in enumerate(p) if sum(p[:i]) > (random.random()*sum(p))][0]



uh, that's hard to read :)

But doesn't random.random()*sum(p) get executed for each probability in p?

Then each probability in p will be compared to a different random value.
That isn't fair...


regards Max m
 
R

Roberto Antonio Ferreira De Almeida

Max said:
Roberto said:
In one line:

v = ['item 1','item 2','item 3',]
p = [0.66, 0.33, 0.16]

[v[i] for i,j in enumerate(p) if sum(p[:i]) >
(random.random()*sum(p))][0]


uh, that's hard to read :)


Yes, I know. I just wanted to see if I could do it in one line. :)

But it just calculates the cumulative probability, and gets the first
one which passes a random value between 0 and the sum of the probabilities.
But doesn't random.random()*sum(p) get executed for each probability in p?

I actually don't know. You could pull random.random() out of the list
comprehension... :)
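
For example, something along these lines would draw the random value
only once (a sketch; the v[i] indexing and the inclusive cumulative sum
are my assumptions, not code from the post):

import random

v = ['item 1', 'item 2', 'item 3']
p = [0.66, 0.33, 0.16]

r = random.random() * sum(p)
print [v[i] for i, j in enumerate(p) if sum(p[:i+1]) > r][0]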

R.
 
M

Max M

Max said:
Jim Jewett wrote:
I finally chose a combination of the suggested approaches, and ended up
with something I find is both more functional and nicer to look at than
my first try.


Well, that version had a problem. Whenever a series of values is the
same, bisect() always selects the first one, so I never got a
statistically valid sample of elements.

I solved it by using accumulated probabilities instead. So the final
version is here, if anybody cares.

I used it in a Markov chain class, that is now pretty fast.

regards Max M



#########################

from random import random, randint
from bisect import bisect


class Selector:

    "Returns random elements by probability"

    def __init__(self, probabilities, elements):
        self.sum = sum(probabilities)
        accumulated_probabilities = []
        accumulated_probability = 0
        for probability in probabilities:
            accumulated_probability += probability
            accumulated_probabilities.append(accumulated_probability)
        self.decorated = zip(accumulated_probabilities, elements)
        self.decorated.sort()

    def get(self):
        "Randomly returns an element by probability"
        rnd = random() * self.sum
        index = bisect(self.decorated, (rnd, 0))
        return self.decorated[index][-1]

    def get_range(self, n):
        "Randomly returns a range of elements by probability"
        return [self.get() for itm in range(n)]


if __name__ == '__main__':

    probabilities = [1.0, 0.5, 0.25, 0.125, 0.0625]
    # probabilities = [0.1, 0.1, 1.0, 0.1, 0.1, ]
    elements = ['a', 'b', 'c', 'd', 'e']
    s = Selector(probabilities, elements)
    r = s.get_range(10000)
    r.sort()

    for element in elements:
        print element, r.count(element)


##########
 
J

Josiah Carlson

def get_range(self, n):
    "Randomly returns a range of elements by probability"
    return [self.get() for itm in range(n)]

Use xrange there instead of range: xrange doesn't instantiate a list, so
it is faster and uses less memory for large n.

- Josiah
 
A

Anton Vredegoor

Max M said:
I solved it by using accumulated probabilities instead. So the final
version is here, if anybody cares.

I used it in a Markov chain class, that is now pretty fast.

Yes it's fast, but I don't think it's smart :) Unfortunately I
haven't got a working solution as an alternative, but only some code
that "explains" what I mean:


from random import random

def choice_generator(population, probabilities):
    PC = zip(probabilities, population)
    while 1:
        p, c = max([(p*random(), c) for p, c in PC])
        yield c

This is short and fast. However, the distribution of the outcomes is
wrong, because the list of probabilities should be "adjusted" so that
in the end the *outcomes* are distributed according to the
"probabilities". Or should that be proportions?

My statistics skills are a bit rusty, but I think it's an interesting
problem. I'll get back when I have some more ideas. In the meantime
more mathematically educated readers are given a chance to beat me to
it :), or prove it can't be done.

Anton
 
M

Max M

Anton said:
I solved it by using accumulated probabilities instead. So the final
version is here, if anybody cares.

Yes it's fast, but I don't think it's smart :) Unfortunately I
haven't got a working solution as an alternative, but only some code
that "explains" what I mean:


def choice_generator(population, probabilities):
    PC = zip(probabilities, population)
    while 1:
        p, c = max([(p*random(), c) for p, c in PC])
        yield c

This is short and fast. However, the distribution of the outcomes is
wrong, because the list of probabilities should be "adjusted" so that
in the end the *outcomes* are distributed according to the
"probabilities". Or should that be proportions?


I don't understand what you mean. If I calculate the deviations from
what is expected, and use a large result set, I get very small deviations.

I am interested in getting a result as a proportion of the probability,
or more correctly in this case, the frequency. If I want it as
probabilities I just have to normalise the sum to 1.0.

This has the advantage that the frequencies can be expressed as integers
too. This is nice in my Markov chain class that counts words in text, etc.

In my example below each letter should occur 50% as often as the
previous letter.

Perhaps you mean that it should behave differently?

regards Max M



###################

probabilities = [16, 8, 4, 2, 1]
elements = ['a', 'b', 'c', 'd', 'e']
sample_size = 1000000

s = Selector(probabilities, elements)
r = s.get_range(sample_size)
r.sort()

previous = float(sample_size)
for element in elements:
    count = r.count(element)
    deviation = (previous/2.0-count) / count * 100
    previous = count
    print element, count, deviation

 
A

Anton Vredegoor

Max M said:
I don't understand what you mean. If I calculate the deviations from
what is expected, and use a large result set, I get very small deviations.

That's what I was afraid of, and I can't blame you for it :)
In my example below each letter should occur 50% as often as the
previous letter.

Perhaps you mean that it should behave differently?

No, your code does what it should do and my code was just some
obviously wrong example to show what *kind* of solution I was after.

For example, if my code is run with a list of probabilities like this:
[.1, .1, .1, .1, .1], it produces exactly the same distribution as your
code. However, with input like this: [2, 1] (*different* probabilities),
it starts to diverge from your code. What I would like to do is to
replace this list [2, 1] with the list of probabilities that would
cause my code to produce the same output as yours ...

So [2,1] ==> "some list of probabilities" and this *new* list is used
as multiplication factors in the line with :

"max([(p*random(),c) for p,c in PC])"

I hope I succeeded in just getting the idea across instead of giving
the impression that my code works better, which obviously it doesn't.
It *needs* some kind of transformation of the probabilities into
another list ...

Anton
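
One transformation that is known to make a max-based scheme like the one
above select items in proportion to their weights is to key each item by
random() ** (1.0 / p) rather than p * random(). A sketch of that idea (it
is not something posted in the thread, and it assumes strictly positive
weights):

from random import random

def choice_generator(population, probabilities):
    "Yields items with frequency proportional to their positive weights"
    PC = zip(probabilities, population)
    while 1:
        # the item with the largest random() ** (1.0/p) key is drawn with
        # probability p / sum(probabilities)
        key, c = max([(random() ** (1.0 / p), c) for p, c in PC])
        yield c

With probabilities = [2, 1] this yields roughly twice as many of the
first item as of the second.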
 
