Hypergeometric distribution

R

Raven

Hi to all, I need to calculate the hpergeometric distribution:


choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int. I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist. Is
there any other libray or an algorithm to calculate
the hypergeometric distribution? The statistical package R can handle
such calculations but I don't want to use python R binding since I want
a standalone app.
Thanks a lot
Ale
 
R

Robert Kern

Raven said:
Hi to all, I need to calculate the hpergeometric distribution:


choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int. I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist. Is
there any other libray or an algorithm to calculate
the hypergeometric distribution?

Use logarithms.

Specifically,

from scipy import special

def logchoose(n, k):
lgn1 = special.gammaln(n+1)
lgk1 = special.gammaln(k+1)
lgnk1 = special.gammaln(n-k+1)
return lgn1 - (lgnk1 + lgk1)

def gauss_hypergeom(x, r, b, n):
return exp(logchoose(r, x) +
logchoose(b, n-x) -
logchoose(r+b, n))

Or you could use gmpy if you need exact rational arithmetic rather than floating
point.

--
Robert Kern
(e-mail address removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
G

Gerard Flanagan

Raven said:
Hi to all, I need to calculate the hpergeometric distribution:


choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int. I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist. Is
there any other libray or an algorithm to calculate
the hypergeometric distribution? The statistical package R can handle
such calculations but I don't want to use python R binding since I want
a standalone app.
Thanks a lot
Ale

Ale

I had this code lying about if it helps. I don't know if it's even
correct but it's non-recursive!

def Binomial( n, k ):
ret = 0
if k == 0:
ret = 1
elif k > 0:
a = range( n+1 )
a[0] = 1
for i in a[1:]:
a = 1
for j in range(i-1,0,-1):
a[j] = a[j] + a[j-1]
ret = a[k]
return ret

Gerard
 
S

Steven D'Aprano

Hi to all, I need to calculate the hpergeometric distribution:


choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int.

Are you sure about that? Python long ints can be as big as you have enough
memory for. My Python can print 10L**10000 to the console with a barely
detectable pause, and 10L**100000 with about a ten second delay. (Most of
that delay is printing it, not calculating it.)

25206 is the first integer whose factorial exceeds 10L**100000, so even if
you are calculating the binomial coefficient using the most naive
algorithm, calculating the factorials and dividing, you should easily be
able to generate it for a,b up to 20,000 unless you have a severe
shortage of memory.
I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist.

What exactly is your problem? What values of hypergeometric(x; r,b,n) fail
for you?
 
A

Alex Martelli

Raven said:
Hi to all, I need to calculate the hpergeometric distribution:


choose(r, x) * choose(b, n-x)
p(x; r,b,n) = -----------------------------
choose(r+b, n)

choose(r,x) is the binomial coefficient
I use the factorial to calculate the above formula but since I am using
large numbers, the result of choose(a,b) (ie: the binomial coefficient)
is too big even for large int. I've tried the scipy library, but this
library calculates
the hypergeometric using the factorials too, so the problem subsist. Is
there any other libray or an algorithm to calculate
the hypergeometric distribution? The statistical package R can handle
such calculations but I don't want to use python R binding since I want
a standalone app.

Have you tried with gmpy?


Alex
 
R

Raven

Thanks to all of you guys, I could resolve my problem using the
logarithms as proposed by Robert. I needed to calculate the factorial
for genomic data, more specifically for the number of genes in the
human genome i.e. about 30.000 and that is a big number :)
I didn't know gmpy
Thanks a lot, really

Ale
 
S

Steven D'Aprano

Thanks to all of you guys, I could resolve my problem using the
logarithms as proposed by Robert. I needed to calculate the factorial
for genomic data, more specifically for the number of genes in the
human genome i.e. about 30.000 and that is a big number :)
I didn't know gmpy
Thanks a lot, really

Are you *sure* the existing functions didn't work? Did you try them?
.... return math.log(x)/math.log(2)
........ n += log2(i)
....0.26649093627929688

That's about one quarter of a second to calculate 300,000 factorial
(approximately), and it shows that the calculations are well within
Python's capabilities.

Of course, converting this 1.5 million-plus digit number to a string takes
a bit longer:
1512846
6939.3762848377228

A quarter of a second to calculate, and almost two hours to convert to a
string. Lesson one: calculations on longints are fast. Converting them to
strings is not.

As far as your original question goes, try something like this:

(formula from memory, may be wrong)
.... x = 1
.... for i in range(r+1, n+1):
.... x *= i
.... for i in range(1, n-r+1):
.... x /= i
.... return x
....28.317800045013428

Less than thirty seconds to calculate a rather large binomial coefficient
exactly. How many digits?
7076

If you are calculating hundreds of hypergeometric probabilities, 30
seconds each could be quite painful, but it certainly shows that Python is
capable of doing it without resorting to logarithms which may lose some
significant digits. Although, in fairness, the log function doesn't seem
to lose much accuracy for arguments in the range you are dealing with.


How long does it take to calculate factorials?
.... # calculate n! and return the time taken in seconds
.... t = time.time()
.... L = 1
.... for i in range(1, n+1):
.... L *= i
.... return time.time() - t
....4255.2370519638062

Keep in mind, if you are calculating the hypergeometric probabilities
using raw factorials, you are doing way too much work.
 
R

Raven

Thanks Steven for your very interesting post.

This was a critical instance from my problem:
inf

The scipy.stats.distributions.hypergeom function uses the scipy.comb
function, so it returned nan since it tries to divide an infinite. I
did not tried to write a self-made function using standard python as I
supposed that the scipy functions reached python's limits but I was
wrong, what a fool :)
If you are calculating hundreds of hypergeometric probabilities, 30
seconds each could be quite painful, but it certainly shows that Python is
capable of doing it without resorting to logarithms which may lose some
significant digits. Although, in fairness, the log function doesn't seem
to lose much accuracy for arguments in the range you are dealing with.

Yes I am calculating hundreds of hypergeometric probabilities so I need
fast calculations

Ale
 
P

Paul Rubin

Raven said:
Yes I am calculating hundreds of hypergeometric probabilities so I need
fast calculations

Can you use Stirling's approximation to get the logs of the factorials?
 
S

Steven D'Aprano

Thanks Steven for your very interesting post.

This was a critical instance from my problem:

inf

Curious. It wouldn't surprise me if scipy was using floats, because 'inf'
is usually a floating point value, not an integer.

Using my test code from yesterday, I got:
11172777193562324917353367958024437473336018053487854593870
07090637489405604489192488346144684402362344409632515556732
33563523161308145825208276395238764441857829454464446478336
90173777095041891067637551783324071233625370619908633625448
31076677382448616246125346667737896891548166898009878730510
57476139515840542769956414204130692733629723305869285300247
645972456505830620188961902165086857407612722931651840L

Took about three seconds on my system.


Yes I am calculating hundreds of hypergeometric probabilities so I
need fast calculations

Another possibility, if you want exact integer maths rather than floating
point with logarithms, is to memoise the binomial coefficients. Something
like this:

# untested
def bincoeff(n,r, \
cache={}):
try:
return cache((n,r))
except KeyError:
x = 1
for i in range(r+1, n+1):
x *= i
for i in range(1, n-r+1):
x /= i
cache((n,r)) = x
return x
 
S

Scott David Daniels

Steven said:
Curious. It wouldn't surprise me if scipy was using floats, because 'inf'
is usually a floating point value, not an integer.

Using my test code from yesterday, I got:


Another possibility, if you want exact integer maths rather than floating
point with logarithms, is to memoise the binomial coefficients. Something
like this:

# untested
def bincoeff(n,r, \
cache={}):
try:
return cache((n,r))
except KeyError:
x = 1
for i in range(r+1, n+1):
x *= i
for i in range(1, n-r+1):
x /= i
cache((n,r)) = x
return x

Well, there is a much better optimization to use first:

def bincoeff1(n, r):
if r < n - r:
r = n - r
x = 1
for i in range(n, r, -1):
x *= i
for i in range(n - r, 1, -1):
x /= i
return x

Then, if you still need to speed it up:

def bincoeff2(n, r, cache={}):
if r < n - r:
r = n - r
try:
return cache[n, r]
except KeyError:
pass
x = 1
for i in range(n, r, -1):
x *= i
for i in range(n - r, 1, -1):
x /= i
cache[n, r] = x
return x


--Scott David Daniels
(e-mail address removed)
 
R

Raven

Well, what to say? I am very happy for all the solutions you guys have
posted :)
For Paul:
I would prefer not to use Stirling's approximation


The problem with long integers is that to calculate the hypergeometric
I need to do float division and multiplication because integer division
returns 0. A solution could be to calculate log(Long_Factorial_Integer)
and finally calculate the hypergeometric with the logarithmic values.
I've done a test: iterated 1000 times two different functions for the
hypergeometric, the first one based on scipy.special.gammaln:

from scipy.special import gammaln

def lnchoose(n, m):
nf = gammaln(n + 1)
mf = gammaln(m + 1)
nmmnf = gammaln(n - m + 1)
return nf - (mf + nmmnf)

def hypergeometric_gamma(k, n1, n2, t):
if t > n1 + n2:
t = n1 + n2
if k > n1 or k > t:
return 0
elif t > n2 and ((k + n2) < t):
return 0
else:
c1 = lnchoose(n1,k)
c2 = lnchoose(n2, t - k)
c3 = lnchoose(n1 + n2 ,t)

return exp(c1 + c2 - c3)

and the second one based on the code by Steven and Scott:


import time
from math import log, exp

def bincoeff1(n, r):
if r < n - r:
r = n - r
x = 1
for i in range(n, r, -1):
x *= i
for i in range(n - r, 1, -1):
x /= i
return x

def hypergeometric(k, n1, n2, t):
if t > n1 + n2:
t = n1 + n2
if k > n1 or k > t:
return 0
elif t > n2 and ((k + n2) < t):
return 0
else:
c1 = log(raw_bincoeff1(n1,k))
c2 = log(raw_bincoeff1(n2, t - k))
c3 = log(raw_bincoeff1(n1 + n2 ,t))

return exp(c1 + c2 - c3)

def main():
t = time.time()
for i in range(1000):
r = hypergeometric(6,6,30,6)
print time.time() - t

t = time.time()
for i in range(1000):
r = hypergeometric_gamma(6,6,30,6)
print time.time() - t


if __name__ == "__main__":
main()


and the result is:

0.0386447906494
0.192448139191

The first approach is faster so I think I will adopt it.
 
S

Scott David Daniels

Raven said:
...
def main():
t = time.time()
for i in range(1000):
r = hypergeometric(6,6,30,6)
print time.time() - t

t = time.time()
for i in range(1000):
r = hypergeometric_gamma(6,6,30,6)
print time.time() - t

and the result is:

0.0386447906494
0.192448139191

The first approach is faster so I think I will adopt it.

You should really look into the timeit module -- you'll get nice
solid timings slightly easier to tweak.
Imagine something like:

import timeit
...
t0 = timeit.Timer(stmt='f(6, 6, 30, 6)',
setup='from __main__ import hypergeometric as f')
t1 = timeit.Timer(stmt='f(6, 6, 30, 6)',
setup='from __main__ import hypergeometric_gamma as f')

repetitions = 1 # Gross under-estimate of needed repetitions
while t0.timeit(repetitions) < .25: # .25 = minimum Secs per round
repetitions *= 10
print 'Going for %s repetitions' % repetitions
print 'hypergeometric:', t0.repeat(3, repetitions)
print 'hypergeometric_gamma:', t1.repeat(3, repetitions)

--Scott David Daniels
(e-mail address removed)
 
B

Bengt Richter

On 2 Jan 2006 03:35:33 -0800 said:
The problem with long integers is that to calculate the hypergeometric
I need to do float division and multiplication because integer division
returns 0. A solution could be to calculate log(Long_Factorial_Integer)

ISTM you wouldn't get zero if you scaled by 10**significant_digits (however many
you require) before dividing. E.g., expected hits per trillion (or septillion or whatever)
expresses probability too. Perhaps that could work in your calculation?

Regards,
Bengt Richter
 
R

Raven

Scott David Daniels ha scritto:
You should really look into the timeit module -- you'll get nice
solid timings slightly easier to tweak.

This seems a very interesting module, I will give it a try as soon as
possible. Thanks Scott.
Ale
 
R

Raven

Bengt said:
ISTM you wouldn't get zero if you scaled by 10**significant_digits (however many
you require) before dividing. E.g., expected hits per trillion (or septillion or whatever)
expresses probability too. Perhaps that could work in your calculation?

Regards,
Bengt Richter

Sorry Bengt but I can't figure out how to do it, can you give me an
example please? Thanks in advance.
Ale
 
R

Raven

Bengt Richter wrote:

ISTM you wouldn't get zero if you scaled by 10**significant_digits (however many
you require) before dividing. E.g., expected hits per trillion (or septillion or whatever)
expresses probability too. Perhaps that could work in your calculation?

Regards,
Bengt Richter

Sorry but I can't figure out how to do it, can you give me an example
please?
Thnx

Ale
 
T

Travis E. Oliphant

Raven said:
Thanks Steven for your very interesting post.

This was a critical instance from my problem:



inf

The scipy.stats.distributions.hypergeom function uses the scipy.comb
function, so it returned nan since it tries to divide an infinite. I
did not tried to write a self-made function using standard python as I
supposed that the scipy functions reached python's limits but I was
wrong, what a fool :)

Notice the keyword for the comb function (in scipy) lets you use it to
compute exact values. SciPy does not just automatically use the long
integer because this will always slow you down.

comb(N, k, exact=0)

Combinations of N things taken k at a time.

If exact==0, then floating point precision is used, otherwise
exact long integer is computed.

Notes:
- Array arguments accepted only for exact=0 case.
- If k > N, N < 0, or k < 0, then a 0 is returned.

-Travis Oliphant
 
R

Raven

Travis said:
Notice the keyword for the comb function (in scipy) lets you use it to
compute exact values. SciPy does not just automatically use the long
integer because this will always slow you down.

comb(N, k, exact=0)

Combinations of N things taken k at a time.

If exact==0, then floating point precision is used, otherwise
exact long integer is computed.

Notes:
- Array arguments accepted only for exact=0 case.
- If k > N, N < 0, or k < 0, then a 0 is returned.

-Travis Oliphant

Great, thanks Travis.
Ale
 
C

Cameron Laird

Well, what to say? I am very happy for all the solutions you guys have
posted :)
For Paul:
I would prefer not to use Stirling's approximation


The problem with long integers is that to calculate the hypergeometric
I need to do float division and multiplication because integer division
returns 0. A solution could be to calculate log(Long_Factorial_Integer)
.
.
.
This thread confuses me.

I've lost track of the real goal. If it's an exact calculation of
binomial coefficients--or even one of several other potential
targets mentioned--I echo Steven D'Aprano, and ask, are you *sure*
the suggestions already offered aren't adequate? Also, I think you
might not realize how accurate Stirling's approximation (perhaps to
second order) is in the range of interest.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,481
Members
44,900
Latest member
Nell636132

Latest Threads

Top