Most pythonic way of weighted random selection

M

Manuel Ebert

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dear list,

who's got aesthetic advice for the following problem? I've got some
joint probabilities of two distinct events Pr(X=x, Y=y), stored in a
list of lists of floats, where every row represents a possible
outcome of X and every float in a row a possible outcome of Y (which
I will now call my matrix, altough I can't use numpy or numeric for
this), so e.g. m = [[0.2, 0.4, 0.05], [0.1, 0.05, 0.2]]. All lists in
the list are equally long and the values of the flattened list add up
to 1.0, i.e. sum([sum(row) for row in m]) == 1. In practice, this
'matrix' is about 20x40, i.e. a list with 20 lists á 40 floats each.

Now the task is to select one outcome for X and Y based on the joint
probabilites, and afterwards check that the outcomes fullfill certain
criteria. If it doesn't fulfill some criteria a new pair of outcomes
has to be selected, for other criteria it will still be selected with
a certain probability. My approach was to choose a random number, and
then walk through the list adding the values together until the
accumulated sum is greater than my random threshold:

import random
r = random.random()
s = 0.0
p = 0.2 # probability of selecting outcome if it doesn't fulfill
criterion 2
break_loop = False
while not break_loop:
for row_index in range(len(m)):
for col_index in range(len(row)):
s += m[row_index][col_index]
if s >= r:
if not fulfills_criterion_a(row_index, col_index):
break_loop = True
elif not fulfills_criterion_b(row_index, col_index):
if random.random() <= p:
return row_index, col_index
else:
break_loop = True
else:
return row_index, col_index
if break_loop: break
if break_loop: break
break_loop = False


Now that looks plain ugly, and I wonder whether you might find a
slightly more elegant way of doing it without using numpy and the like.

Bests,
Manuel
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iD8DBQFIuWoncZ70OCIgLecRArV4AJ9ynhC/McegMIYTWOOOW4p44t3rWgCbBjvm
1JRHy5kp1qIGLDaCTXXFcSs=
=X6Sv
-----END PGP SIGNATURE-----
 
D

duncan smith

Manuel said:
Dear list,

who's got aesthetic advice for the following problem? I've got some
joint probabilities of two distinct events Pr(X=x, Y=y), stored in a
list of lists of floats, where every row represents a possible outcome
of X and every float in a row a possible outcome of Y (which I will now
call my matrix, altough I can't use numpy or numeric for this), so e.g.
m = [[0.2, 0.4, 0.05], [0.1, 0.05, 0.2]]. All lists in the list are
equally long and the values of the flattened list add up to 1.0, i.e.
sum([sum(row) for row in m]) == 1. In practice, this 'matrix' is about
20x40, i.e. a list with 20 lists á 40 floats each.

Now the task is to select one outcome for X and Y based on the joint
probabilites, and afterwards check that the outcomes fullfill certain
criteria. If it doesn't fulfill some criteria a new pair of outcomes has
to be selected, for other criteria it will still be selected with a
certain probability. My approach was to choose a random number, and then
walk through the list adding the values together until the accumulated
sum is greater than my random threshold:

[snip]

For a list of cumulative probabilities you could use the bisect module
to find the insertion point for a 0,1 uniform variate (which you could
then map back to a cell index).

Alternatively you could use an alias table,
http://amath.colorado.edu/courses/7400/2004fall/002/Web/SS-10.ppt.

You could produce a probability table that takes into account your
criteria, so that you don't need to check them. i.e. Each cell that
does not satisfy criterion a has probability 0 of being selected. Each
other cell has a probability of being selected proportional to its
original value if it satisfies criterion b, or its original value * p
otherwise. Normalise the resulting values to sum to 1 and you're done
with the need to check against criteria.

So, preprocess your table, create a corresponding list of cumulative
probabilities and use the bisect model (is one possibility).

Duncan
 
S

Steven D'Aprano

Dear list,

who's got aesthetic advice for the following problem?

....

[ugly code removed]
Now that looks plain ugly, and I wonder whether you might find a
slightly more elegant way of doing it without using numpy and the like.

Never be afraid to factor out pieces of code into small functions.
Writing a single huge while loop that does everything is not only hard to
read, hard to write and hard to debug, but it can also run slower. (It
depends on a number of factors.)

Anyway, here's my attempt to solve the problem, as best as I can
understand it:


import random

def eq(x, y, tol=1e-10):
# floating point equality within some tolerance
return abs(x-y) <= tol

M = [[0.2, 0.4, 0.05], [0.1, 0.05, 0.2]]
# the sums of each row must sum to 1.0
assert eq(1.0, sum([sum(row) for row in M]))

# build a cumulative probability matrix
CM = []
for row in M:
for p in row:
CM.append(p) # initialize with the raw probabilities

for i in range(1, len(CM)):
CM += CM[i-1] # and turn into cumulative probabilities

assert CM[0] >= 0.0
assert eq(CM[-1], 1.0)

def index(data, p):
"""Return the index of the item in data
which is no smaller than float p.
"""
# Note: this uses a linear search. If it is too slow,
# you can re-write it using the bisect module.
for i, x in enumerate(data):
if x >= p:
return i
return len(data-1)

def index_to_rowcolumn(i, num_columns):
"""Convert a linear index number i into a (row, column) tuple."""
# When converting [ [a, b, c, ...], [...] ] into a single
# array [a, b, c, ... z] we have the identity:
# index number = row number * number of columns + column number
return divmod(i, num_columns)

# Now with these two helper functions, we can find the row and column
# number of the first entry in M where the cumulative probability
# exceeds some given value.

# You will need to define your own fulfills_criterion_a and
# fulfills_criterion_b, but here's a couple of mock functions
# for testing with:

def fulfills_criterion_a(row, column):
return random.random() < 0.5

fulfills_criterion_b = fulfills_criterion_a

def find_match(p=0.2):
while True:
r = random.random()
i = index(CM, r)
row, column = index_to_rowcolumn(i, len(M[0]))
if fulfills_criterion_a(row, column) or \
fulfills_criterion_b(row, column):
return row, column
else:
if random.random() <= p:
return row, column


And here's my test:
(1, 2)


Hope this helps.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top