choose value from custom distribution


elsa

Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data? Two
other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work. Also, it would be
good if I didn't have to decide beforehand what the possible range of
values is (which unfortunately I have to do with the approach I'm
currently working on).

Thanks in advance for your help,

elsa.
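
As an aside, the counts themselves can be accumulated without fixing the
range of values in advance by using collections.Counter rather than a
pre-sized list; a minimal sketch, assuming the raw ages arrive as a plain
iterable of integers:

from collections import Counter

# Raw observations (here: ages). In practice this could be any iterable,
# e.g. values read one at a time from a file.
ages = [2, 3, 3, 2, 4, 3, 2, 3]

# Counter maps each observed value to its frequency, so the range of
# possible values never has to be declared up front.
counts = Counter(ages)                      # Counter({3: 4, 2: 3, 4: 1})

# A list indexed by value can still be built afterwards if needed:
histogram = [counts.get(i, 0) for i in range(max(counts) + 1)]
print(histogram)                            # [0, 0, 3, 4, 1]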
 

Chris Rebert

elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data? Two
other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work. Also, it would be
good if I didn't have to decide beforehand what the possible range of
values is (which unfortunately I have to do with the approach I'm
currently working on).

http://stackoverflow.com/questions/526255/probability-distribution-in-python

There's quite possibly something for this in NumPy/SciPy (or at least
a more efficient recipe utilizing one of them). Hopefully someone will
chime in.

Cheers,
Chris
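
Along the lines of the NumPy suggestion, one possible recipe is
numpy.random.choice, which samples values directly given a probability
vector; a sketch only, assuming NumPy is installed and the counts are laid
out as in the original post:

import numpy as np

counts = np.array([0, 0, 10, 20, 5], dtype=float)
values = np.arange(len(counts))        # the ages 0..4
probs = counts / counts.sum()          # normalise counts to probabilities

# Draw 10 ages, each weighted by how often that age occurs in the data.
sample = np.random.choice(values, size=10, p=probs)
print(sample)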
 

Arnaud Delobelle

elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data? Two
other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work. Also, it would be
good if I didn't have to decide beforehand what the possible range of
values is (which unfortunately I have to do with the approach I'm
currently working on).

Thanks in advance for your help,

elsa.

If you want to keep it simple, you can do:
>>> import random
>>> t = [0,0,10,20,5]
>>> expanded = sum([[x]*f for x, f in enumerate(t)], [])
>>> random.sample(expanded, 10)
[3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
>>> random.sample(expanded, 10)
[3, 3, 4, 3, 2, 3, 3, 3, 2, 2]
>>> random.sample(expanded, 10)
[3, 3, 3, 3, 3, 2, 3, 2, 2, 3]

Is that what you need?
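
On Python 3.6 and later, random.choices accepts weights directly, which
gives the same kind of weighted draw without materialising the expanded
list at all; a small sketch using the same counts:

import random

t = [0, 0, 10, 20, 5]          # t[age] == number of people of that age
ages = range(len(t))

# Draws with replacement, weighting each age by its count, so the
# multi-million-entry expanded list is never built.
sample = random.choices(ages, weights=t, k=10)
print(sample)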
 

Chris Rebert

elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data?

Arnaud said:
If you want to keep it simple, you can do:
>>> import random
>>> t = [0,0,10,20,5]
>>> expanded = sum([[x]*f for x, f in enumerate(t)], [])
>>> random.sample(expanded, 10)
[3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
>>> random.sample(expanded, 10)
[3, 3, 4, 3, 2, 3, 3, 3, 2, 2]
>>> random.sample(expanded, 10)
[3, 3, 3, 3, 3, 2, 3, 2, 2, 3]

Is that what you need?

The OP explicitly ruled that out:

"Two other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work."

Cheers,
Chris
 

Ian

elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data? Two
other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work. Also, it would be
good if I didn't have to decide beforehand what the possible range of
values is (which unfortunately I have to do with the approach I'm
currently working on).

My suggestion is to build a cumulative sum list, draw a random position,
and find the corresponding index by binary search:

import bisect
import random

data = [0, 0, 10, 20, 5]
cumsum = []
for x in data:
    cumsum.append(cumsum[-1] + x if cumsum else x)
virtual_index = random.randrange(cumsum[-1])
actual_index = bisect.bisect_right(cumsum, virtual_index)

HTH,
Ian
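
For reuse, the cumulative-sum idea above can be packaged as a small
sampler; a sketch, using itertools.accumulate (Python 3.2+) for the running
sum, with an illustrative function name:

import bisect
import itertools
import random

def make_sampler(counts):
    """Return a function that draws one value, weighted by counts[value]."""
    cumsum = list(itertools.accumulate(counts))   # e.g. [0, 0, 10, 30, 35]
    total = cumsum[-1]
    def draw():
        # Pick a virtual position among all individuals, then map it back
        # to the value whose cumulative range contains that position.
        return bisect.bisect_right(cumsum, random.randrange(total))
    return draw

draw_age = make_sampler([0, 0, 10, 20, 5])
print([draw_age() for _ in range(10)])   # e.g. [3, 2, 3, 3, 2, 3, 4, 3, 2, 3]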
 

Peter Otten

Chris said:
elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data?

Arnaud said:
If you want to keep it simple, you can do:
>>> import random
>>> t = [0,0,10,20,5]
>>> expanded = sum([[x]*f for x, f in enumerate(t)], [])
>>> random.sample(expanded, 10)
[3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
>>> random.sample(expanded, 10)
[3, 3, 4, 3, 2, 3, 3, 3, 2, 2]
>>> random.sample(expanded, 10)
[3, 3, 3, 3, 3, 2, 3, 2, 2, 3]

Is that what you need?

The OP explicitly ruled that out:

"Two other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work."

Python can cope with a list of 5 million integer entries just fine on
average hardware. Eventually you may have to switch to Ian's cumulative sums
approach -- but not necessarily at 10**6.

This second objection seems invalid to me, too, and I think what Arnaud
provides is a useful counterexample.

However, if you (elsa) are operating near the limits of the available memory
on your machine, using sum() on lists is not a good idea. It does the
equivalent of

expanded = []
for x, f in enumerate(t):
    expanded = expanded + [x]*f

which creates a lot of "large" temporary lists, whereas you want the more
memory-friendly

expanded = []
for x, f in enumerate(t):
    expanded.extend([x]*f)
    # expanded += [x]*f
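
For comparison, the same expansion can also be written as a single list
comprehension, which likewise builds the result in one pass without the
per-value temporary lists:

t = [0, 0, 10, 20, 5]

# Each value x is repeated f times; no intermediate [x]*f lists are kept.
expanded = [x for x, f in enumerate(t) for _ in range(f)]

assert len(expanded) == sum(t)      # 35 entries for this example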
The internet is wrecking people's attention spans and reading
comprehension.

Maybe, but I can't google the control group that is always offline and I
have a hunch that facebook wouldn't work either ;)

Peter
 
