choose value from custom distribution


elsa

Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data? Two
other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work. Also, it would be
good if I didn't have to decide beforehand what the possible range of
values is (which unfortunately I have to do with the approach I'm
currently working on).

Thanks in advance for your help,

elsa.
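
As an aside, the counts themselves can be accumulated without fixing the
range of values in advance by using collections.Counter rather than a
pre-sized list; a minimal sketch, assuming the raw ages arrive as a plain
iterable of integers:

from collections import Counter

# Raw observations (here: ages). In practice this could be any iterable,
# e.g. values read one at a time from a file.
ages = [2, 3, 3, 2, 4, 3, 2, 3]

# Counter maps each observed value to its frequency, so the range of
# possible values never has to be declared up front.
counts = Counter(ages)                      # Counter({3: 4, 2: 3, 4: 1})

# A list indexed by value can still be built afterwards if needed:
histogram = [counts.get(i, 0) for i in range(max(counts) + 1)]
print(histogram)                            # [0, 0, 3, 4, 1]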
 

Chris Rebert

elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data? Two
other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work. Also, it would be
good if I didn't have to decide beforehand what the possible range of
values is (which unfortunately I have to do with the approach I'm
currently working on).

http://stackoverflow.com/questions/526255/probability-distribution-in-python

There's quite possibly something for this in NumPy/SciPy (or at least
a more efficient recipe utilizing one of them). Hopefully someone will
chime in.

Cheers,
Chris
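
Along the lines of the NumPy suggestion, one possible recipe is
numpy.random.choice, which samples values directly given a probability
vector; a sketch only, assuming NumPy is installed and the counts are laid
out as in the original post:

import numpy as np

counts = np.array([0, 0, 10, 20, 5], dtype=float)
values = np.arange(len(counts))        # the ages 0..4
probs = counts / counts.sum()          # normalise counts to probabilities

# Draw 10 ages, each weighted by how often that age occurs in the data.
sample = np.random.choice(values, size=10, p=probs)
print(sample)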
 

Arnaud Delobelle

elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data? Two
other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work. Also, it would be
good if I didn't have to decide beforehand what the possible range of
values is (which unfortunately I have to do with the approach I'm
currently working on).

Thanks in advance for your help,

elsa.

If you want to keep it simple, you can do:
>>> import random
>>> t = [0,0,10,20,5]
>>> expanded = sum([[x]*f for x, f in enumerate(t)], [])
>>> random.sample(expanded, 10)
[3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
>>> random.sample(expanded, 10)
[3, 3, 4, 3, 2, 3, 3, 3, 2, 2]
>>> random.sample(expanded, 10)
[3, 3, 3, 3, 3, 2, 3, 2, 2, 3]

Is that what you need?
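
On Python 3.6 and later, random.choices accepts weights directly, which
gives the same kind of weighted draw without materialising the expanded
list at all; a small sketch using the same counts:

import random

t = [0, 0, 10, 20, 5]          # t[age] == number of people of that age
ages = range(len(t))

# Draws with replacement, weighting each age by its count, so the
# multi-million-entry expanded list is never built.
sample = random.choices(ages, weights=t, k=10)
print(sample)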
 

Chris Rebert

elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data?

Arnaud said:
If you want to keep it simple, you can do:
>>> import random
>>> t = [0,0,10,20,5]
>>> expanded = sum([[x]*f for x, f in enumerate(t)], [])
>>> random.sample(expanded, 10)
[3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
>>> random.sample(expanded, 10)
[3, 3, 4, 3, 2, 3, 3, 3, 2, 2]
>>> random.sample(expanded, 10)
[3, 3, 3, 3, 3, 2, 3, 2, 2, 3]

Is that what you need?

The OP explicitly ruled that out:

"Two other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work."

Cheers,
Chris
 

Ian

elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data? Two
other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work. Also, it would be
good if I didn't have to decide beforehand what the possible range of
values is (which unfortunately I have to do with the approach I'm
currently working on).

My suggestion is to build a cumulative sum list, draw a random position,
and find the corresponding index by binary search:

import bisect
import random

data = [0, 0, 10, 20, 5]
cumsum = []
for x in data:
    cumsum.append(cumsum[-1] + x if cumsum else x)
virtual_index = random.randrange(cumsum[-1])
actual_index = bisect.bisect_right(cumsum, virtual_index)

HTH,
Ian
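
For reuse, the cumulative-sum idea above can be packaged as a small
sampler; a sketch, using itertools.accumulate (Python 3.2+) for the running
sum, with an illustrative function name:

import bisect
import itertools
import random

def make_sampler(counts):
    """Return a function that draws one value, weighted by counts[value]."""
    cumsum = list(itertools.accumulate(counts))   # e.g. [0, 0, 10, 30, 35]
    total = cumsum[-1]
    def draw():
        # Pick a virtual position among all individuals, then map it back
        # to the value whose cumulative range contains that position.
        return bisect.bisect_right(cumsum, random.randrange(total))
    return draw

draw_age = make_sampler([0, 0, 10, 20, 5])
print([draw_age() for _ in range(10)])   # e.g. [3, 2, 3, 3, 2, 3, 4, 3, 2, 3]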
 

Peter Otten

Chris said:
elsa said:
Hello,

I'm trying to find a way to collect a set of values from real data,
and then sample values randomly from this data - so, the data I'm
collecting becomes a kind of probability distribution. For instance, I
might have age data for some children. It's very easy to collect this
data using a list, where the index gives the value of the data, and
the number in the list gives the number of times that value occurs:

[0,0,10,20,5]

could mean that there are no people aged 0, no
people aged 1, 10 people aged 2, 20 people aged 3, and 5 people aged 4
in my data collection.

I then want to make a random sample that would be representative of
these proportions - is there any easy and fast way to select an entry
weighted by its value? Or are there any python packages that allow you
to easily create your own distribution based on collected data?

Arnaud said:
If you want to keep it simple, you can do:
>>> import random
>>> t = [0,0,10,20,5]
>>> expanded = sum([[x]*f for x, f in enumerate(t)], [])
>>> random.sample(expanded, 10)
[3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
>>> random.sample(expanded, 10)
[3, 3, 4, 3, 2, 3, 3, 3, 2, 2]
>>> random.sample(expanded, 10)
[3, 3, 3, 3, 3, 2, 3, 2, 2, 3]

Is that what you need?

The OP explicitly ruled that out:

"Two other things to bear in mind are that in reality I'm collating data
from up to around 5 million individuals, so just making one long list
with a new entry for each individual won't work."

Python can cope with a list of 5 million integer entries just fine on
average hardware. Eventually you may have to switch to Ian's cumulative sums
approach -- but not necessarily at 10**6.

This second objection seems invalid to me, too, and I think what Arnaud
provides is a useful counterexample.

However, if you (elsa) are operating near the limits of the available memory
on your machine, using sum() on lists is not a good idea. It does the
equivalent of

expanded = []
for x, f in enumerate(t):
    expanded = expanded + [x]*f

which creates a lot of "large" temporary lists, whereas you want the more
memory-friendly

expanded = []
for x, f in enumerate(t):
    expanded.extend([x]*f)
    # expanded += [x]*f
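
For comparison, the same expansion can also be written as a single list
comprehension, which likewise builds the result in one pass without the
per-value temporary lists:

t = [0, 0, 10, 20, 5]

# Each value x is repeated f times; no intermediate [x]*f lists are kept.
expanded = [x for x, f in enumerate(t) for _ in range(f)]

assert len(expanded) == sum(t)      # 35 entries for this example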
The internet is wrecking people's attention spans and reading
comprehension.

Maybe, but I can't google the control group that is always offline and I
have a hunch that facebook wouldn't work either ;)

Peter
 
