# [QUIZ] Sampling (#39)

Discussion in 'Ruby' started by Graeme Defty, Jul 17, 2005.

1. ### Graeme DeftyGuest

Sorry, all. New to the list. I accidentally posted my
previous under a nonsense user id

> > Unless I'm mistaken, in the 5e6-from-1e9 sampling,

> the probability
> > that a sampling contains exactly one number from a

> given 200-numbers
> > interval is 200.0*(1/200)*(199.0/200)**199 =3D

> 0.3688. The probability
> > that this happens for 20 such 200-numbers

> intervals is
> > 0.3688**20 =3D 2.1e-09.

>=20
> I think you are wrong. A rough estimates of this
> probability gives me
> 1e-2171438.
> Actually, it should be a bit smaller, but at least
> the order of
> magnitude of the order of magnitude is right
> Quite impressive...
>=20
> Paolo
>=20
>=20

I think he is wrong too (sorry. Missed the previous
post, so I don't know who it was) but I think I think
it's wrong for a different reason.

I got the probability of getting exactly one number in
a specific interval to be as follows:
Probability of one *specific* number in the interval
1/20
Probability of all the other 19 OUT of that interval
(19/20)**19
But we have 20 specific numbers to consider, so:
20 * 1/20 * (19/20)**19
or
(19.0/20)**19
which gives a remarkably similar 0.3773536 (don't know
if this is a co-incidence. My brain hurts

Where I think this goes wrong is in assuming that
expanding this to 20 intervals just needs a **20.=20

After we get exactly one number in the first interval,
we then are looking for exactly one number in the
second interval, which is only one interval out of 19.
In other words, the probability should
be(18.0/19)**18. and so forth. The overall probability
therefore is
(19.0/20)**19 * (18.0/19)**18 * ... * (2.0/1)**1
which my RubyShell informs me is 0.377353602535307e-8,

Graeme

Send instant messages to your online friends http://uk.messenger.yahoo.co=
m=20

Graeme Defty, Jul 17, 2005

2. ### Olaf KlischatGuest

Graeme Defty <> writes:

> Sorry, all. New to the list. I accidentally posted my
> previous under a nonsense user id
>
>> > Unless I'm mistaken, in the 5e6-from-1e9 sampling,

>> the probability
>> > that a sampling contains exactly one number from a

>> given 200-numbers
>> > interval is 200.0*(1/200)*(199.0/200)**199 =

>> 0.3688. The probability
>> > that this happens for 20 such 200-numbers

>> intervals is
>> > 0.3688**20 = 2.1e-09.

>>
>> I think you are wrong. A rough estimates of this
>> probability gives me
>> 1e-2171438.
>> Actually, it should be a bit smaller, but at least
>> the order of
>> magnitude of the order of magnitude is right
>> Quite impressive...
>>
>> Paolo
>>
>>

> I think he is wrong too (sorry. Missed the previous
> post, so I don't know who it was)

Me.

> but I think I think
> it's wrong for a different reason.
>
> I got the probability of getting exactly one number in
> a specific interval to be as follows:
> Probability of one *specific* number in the interval
> 1/20
> Probability of all the other 19 OUT of that interval
> (19/20)**19
> But we have 20 specific numbers to consider, so:
> 20 * 1/20 * (19/20)**19
> or
> (19.0/20)**19

Right, that's what I did too, but I was talking about a 200-numbers
interval, not a 20-numbers one

> which gives a remarkably similar 0.3773536 (don't know
> if this is a co-incidence.

It isn't. ((x-1)/x)**(x-1) converges for x->Infinity.

= lim_{x->\inf} ((x-1)/x)**(x-1)
= lim_{x->\inf} (1-1/x)**(x-1)
= lim_{x->\inf} (1-1/x)**x * lim_{x->\inf} (1-1/x)**-1
= lim_{x->\inf} (1-1/x)**x * 1
= lim_{x->\inf} (1-1/x)**x
= lim_{x->\inf} (1/(1+1/x))**x
= lim_{x->\inf} 1/((1+1/x)**x)
= 1 / lim_{x->\inf} ((1+1/x)**x)
= 1/E
= 0.3678794...

>
> Where I think this goes wrong is in assuming that
> expanding this to 20 intervals just needs a **20.
>
> After we get exactly one number in the first interval,
> we then are looking for exactly one number in the
> second interval, which is only one interval out of 19.
> In other words, the probability should
> be(18.0/19)**18. and so forth. The overall probability
> therefore is
> (19.0/20)**19 * (18.0/19)**18 * ... * (2.0/1)**1
> which my RubyShell informs me is 0.377353602535307e-8,

I didn't get that. You have 20 given (previously selected) intervals,
and you want the probability that a sampling contains exactly one
number from each of them. That gives (with your interval size)
p=(19/20)**19 for the first given interval, (19/20)**19 for the second
one and so on. ((19/20)**19)**20 in total.

Olaf

Olaf Klischat, Jul 17, 2005

3. ### Paolo CapriottiGuest

On 7/17/05, Olaf Klischat <-berlin.de> wrote:
> Graeme Defty <> writes:
>=20
> > Sorry, all. New to the list. I accidentally posted my
> > previous under a nonsense user id
> >
> >> > Unless I'm mistaken, in the 5e6-from-1e9 sampling,
> >> the probability
> >> > that a sampling contains exactly one number from a
> >> given 200-numbers
> >> > interval is 200.0*(1/200)*(199.0/200)**199 =3D
> >> 0.3688. The probability
> >> > that this happens for 20 such 200-numbers
> >> intervals is
> >> > 0.3688**20 =3D 2.1e-09.
> >>
> >> I think you are wrong. A rough estimates of this
> >> probability gives me
> >> 1e-2171438.
> >> Actually, it should be a bit smaller, but at least
> >> the order of
> >> magnitude of the order of magnitude is right
> >> Quite impressive...
> >>
> >> Paolo
> >>
> >>

> > I think he is wrong too (sorry. Missed the previous
> > post, so I don't know who it was)

>=20
> Me.
>=20
> > but I think I think
> > it's wrong for a different reason.
> >
> > I got the probability of getting exactly one number in
> > a specific interval to be as follows:
> > Probability of one *specific* number in the interval
> > 1/20
> > Probability of all the other 19 OUT of that interval
> > (19/20)**19
> > But we have 20 specific numbers to consider, so:
> > 20 * 1/20 * (19/20)**19
> > or
> > (19.0/20)**19

>=20
> Right, that's what I did too, but I was talking about a 200-numbers
> interval, not a 20-numbers one

It's wrong anyways. You both are assuming that all these events are
independent, and multiply their probability, but they are not. One
neat way to see this is remarking that you didn't use the fact that
the intervals are 5000000. If there were only one such interval (that
is, if you were doing an 1 out of 200 sampling), the probability would
have been obviously 1. If your formula were correct, I could apply it
to this case, obtaining a contradiction.

Paolo

Paolo Capriotti, Jul 17, 2005
4. ### Olaf KlischatGuest

Paolo Capriotti <> writes:

> It's wrong anyways. You both are assuming that all these events are
> independent, and multiply their probability, but they are not. One
> neat way to see this is remarking that you didn't use the fact that
> the intervals are 5000000. If there were only one such interval (that
> is, if you were doing an 1 out of 200 sampling), the probability would
> have been obviously 1. If your formula were correct, I could apply it
> to this case, obtaining a contradiction.

Yeah. You have a point there .

I still think that 0.3688 and 0.3688**20 are not far off because the
interval size (200) and number (20) is much smaller than MEMBERS and
LIMIT, so the occurance of an event like "the sampling contains one
number from a given (preselected) interval" does not influence the
probability of another such event (for another, disjunct interval)
very much. Right?

Olaf Klischat, Jul 17, 2005
5. ### Olaf KlischatGuest

Olaf Klischat <-berlin.de> writes:

> Paolo Capriotti <> writes:
>
>> It's wrong anyways. You both are assuming that all these events are
>> independent, and multiply their probability, but they are not. One
>> neat way to see this is remarking that you didn't use the fact that
>> the intervals are 5000000. If there were only one such interval (that
>> is, if you were doing an 1 out of 200 sampling), the probability would
>> have been obviously 1. If your formula were correct, I could apply it
>> to this case, obtaining a contradiction.

>
> Yeah. You have a point there .
>
> I still think that 0.3688 and 0.3688**20 are not far off because the
> interval size (200) and number (20) is much smaller than MEMBERS and
> LIMIT, so the occurance of an event like "the sampling contains one
> number from a given (preselected) interval" does not influence the
> probability of another such event (for another, disjunct interval)
> very much. Right?

I've pondered some more and I think, considering "event dependency"
(i.e. changed probabilities once you've selected some numbers), the
actual probability that exactly one number occurs in a given
200-numbers interval isn't (199/200)**199=0.36880183, but
(1-4_999_999/999_999_999)*
(1-4_999_999/999_999_998)*
(1-4_999_999/999_999_997)*
..
..
.. *
(1-4_999_999/999_999_801)
= 0.368801868

(so the approximation has an error of only 1e-7), or in the general
case:

# probability that exactly one number occurs in a given interval of
# size intervalsize in a sampling with parameters members and limit
def p(members,limit,intervalsize)
(1...intervalsize).inject(intervalsize * members/limit){|x,i|
x * (1-(members-1)/(limit-i))
}
end

I've now written a "check" script that gets a sample on stdin and
determines this value (all consecutive intervals of size limit/members
are checked), and compares it to the theoretically expected one:

--------------------------------
#!/usr/bin/ruby

# probability that exactly one number occurs in a given interval of
# size intervalsize in a sampling with parameters members and limit
def p(members,limit,intervalsize)
(1...intervalsize).inject(intervalsize * members/limit){|x,i|
x * (1-(members-1)/(limit-i))
}
end

members,limit = ARGV.map{|x|x.to_i}
warn "#{limit} mod #{members} != 0 -- inaccurate results" if limit%members!=0

interval = limit/members

puts "expected: #{p(members.to_f,limit.to_f,interval)}"

found=0
begin
last=0; count=0;
loop do
new=num/interval
if new==last then
count+=1
else
last=new
found+=1 if count==1
count=1
end
end
rescue EOFError
end
found+=1 if count==1

puts "actual: #{found.to_f/(limit/interval)}"
--------------------------------

For the big_sample.txt here, this gives:

olaf@tack:~/doc/mydocs/ruby/quiz\$ ./check 5_000_000 1_000_000_000 <big_sample.txt
expected: 0.368801867760757
actual: 0.368865
olaf@tack:~/doc/mydocs/ruby/quiz\$

Other runs:

olaf@tack:~/doc/mydocs/ruby/quiz\$ ./sample 5000 70000 | ./check 5000 70000
expected: 0.381630036177205
actual: 0.386
olaf@tack:~/doc/mydocs/ruby/quiz\$

olaf@tack:~/doc/mydocs/ruby/quiz\$ ./sample 1 100 | ./check 1 100
expected: 1.0
actual: 1.0
olaf@tack:~/doc/mydocs/ruby/quiz\$

The "cheated" samples that have one value in each interval will always
yield actual: 1.0.

All things considered, this seems to be quite a good indicator to tell
the really random solutions from the not-so-random ones

Olaf

Olaf Klischat, Jul 17, 2005