A new classification method for RNGs: Significance Level

joe

My experiments show that the random number generator
in Microsoft's VC++6 compiler is a statistical RNG with a
significance level > 1.0%.
Statistical testing at an SL above 1.0% (for example 1.001%) passes the test,
but testing at exactly 1.0% does not pass...

Can anybody confirm this finding?

The RNG functions of the various SW products can be
analyzed and classified better using their significance level
as shown above.
I think this important finding deserves deeper research... :)

For the testing method see:
http://en.wikipedia.org/wiki/Binomial_test
http://en.wikipedia.org/wiki/Binomial_distribution
 
joe

David Kerber said:
(e-mail address removed) says:

What are you using for your sample size and null hypothesis?

Here the details:
Sample size is 500 (i.e. calling rand() 500 times);
the rand range is 37 (i.e. 0 to 36; yes, a roulette simulation).
The above-mentioned statistical test is performed after each rand() call.
This whole procedure is run more than 30 times in a loop,
each time initializing the frequency stats anew.
srand(time(0)) is done only once at program start.
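
Below is a minimal sketch of that setup as I read it (my reconstruction, not the poster's actual program): 30 "days" of 500 spins each, frequency stats cleared at the start of every day, srand() called once. For simplicity it checks each number once at the end of a day rather than after every call, using the normal approximation to the binomial (the n*p >= 5 rule discussed below) with 2.5758 as the two-sided critical z for SL = 1%.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RANGE 37   /* roulette numbers 0..36 */
#define SPINS 500  /* rand() calls per day */
#define DAYS  30

int main(void)
{
    srand((unsigned)time(0));               /* seeded once at program start */
    const double p = 1.0 / RANGE;
    const double zcrit = 2.5758;            /* two-sided critical z at SL = 1% */

    for (int day = 0; day < DAYS; ++day) {
        int freq[RANGE] = {0};              /* frequency stats cleared each day */
        for (int spin = 0; spin < SPINS; ++spin)
            ++freq[rand() % RANGE];

        int outliers = 0;                   /* numbers whose count fails the test */
        for (int k = 0; k < RANGE; ++k) {
            double mean = SPINS * p;
            double sd = sqrt(SPINS * p * (1.0 - p));
            if (fabs((freq[k] - mean) / sd) > zcrit)
                ++outliers;
        }
        printf("day %2d: %d of %d numbers outside the 1%% bounds\n",
               day + 1, outliers, RANGE);
    }
    return 0;
}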
 
joe

H0 = the RNG passes the randomness test only at a significance level > 1%;
i.e. try out 1% and 1.001%, for example, and you will see the difference.
 
joe

David Kerber said:
(e-mail address removed) says:

I think your sample size is too small for such a conclusion to be
statistically valid.

The minimum necessary is given by these relations
(cf. the above wiki pages and also statistics books):

n*p >= 5 AND n*(1-p) >= 5

i.e. for the above example:
n >= 5 / (1/37) = 185
Above we have 500; i.e. with anywhere from 185 to 500 samples the test can be applied.
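
A quick numeric check of that rule for the numbers above (my two-liner, nothing more):

#include <stdio.h>

int main(void)
{
    double n = 500, p = 1.0 / 37;
    printf("n*p = %.2f, n*(1-p) = %.2f\n", n * p, n * (1 - p));
    /* prints n*p = 13.51, n*(1-p) = 486.49 -- both >= 5, so the
       normal approximation behind the test is justified */
    return 0;
}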
 
user923005

My experiments show that the random number generator
in Microsoft's VC++6 compiler is a statistical RNG with a
significance level > 1.0%.
<snip>

I guess that what you really want is the statistics group.

Follow-up added.
 
joe

David Kerber said:
(e-mail address removed) says:

But you said you're initializing the frequency stats before each loop
iteration? To me, that would imply that your sample size is really only
36 for each of 30 or so runs. Or am I misunderstanding how you are
accumulating the results?

The two loops I used are not that important.
I just simulate 30 days, with 500 rand's per day (i.e. spins, draws, etc.).
At the beginning of each day the frequency stats are cleared;
i.e. the tests are done on intraday stats only.
I hope this makes it clear.

BTW, there is also a parallel discussion on this in sci.math
under the subject "Detecting biased random number generators".
 
Jerry Coffin

[ ... ]

Out of curiosity, what exact code are you using to get from the original
range of the generator to the 0-35 that you're using?
 
joe

Jerry Coffin said:
(e-mail address removed) says...

[ ... ]

Out of curiosity, what exact code are you using to get from the original
range of the generator to the 0-35 that you're using?

Jerry, it is 0 to 36.
As recommended in many books (for example Phillip Good,
"Permutation, Parametric and Bootstrap Tests of Hypotheses"),
I use the following:

int genrand(int lo, int hi)
{
    /* map rand() onto lo..hi via the remainder */
    int z = rand() % (hi - lo + 1) + lo;
    return z;
}
...
int r = genrand(0, 36);
 
joe

Richard Heathfield said:
joe said:



Better:

#include <stdlib.h>

int genrand(int low, int high)
{
    /* scale rand() into [low, high] via floating point */
    return (high - low + 1) * (rand() / (RAND_MAX + 1.0)) + low;
}

This avoids over-dependence on the low-order bits, which are a little too
predictable in some implementations.

Yes, this makes much sense, although it uses floating point.
But the phenomenon described in the OP of this thread,
i.e. that the MS rand() is:
a) a statistical RNG, and that
b) it operates at a 1% SL (or equivalently a 99% CL),
still holds with the above as well.
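
A concrete illustration of the low-order-bit weakness Richard mentions (my toy example, using the C standard's sample rand() constants, not any particular vendor's generator): with a power-of-two modulus LCG, the lowest bit of the state simply alternates.

#include <stdio.h>

int main(void)
{
    unsigned seed = 12345;
    for (int i = 0; i < 16; ++i) {
        seed = seed * 1103515245u + 12345u;  /* LCG step, modulus 2^32 */
        printf("%u", seed & 1u);             /* low bit: prints 0101010101010101 */
    }
    putchar('\n');
    return 0;
}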
 
joe

Pete Becker said:
You're both wrong. <g>

Yes, there may be a dependence on low bits, but even if there isn't,
both of these approaches introduce a bias.

Suppose, for simplicity, that you're trying to generate 37 values with
a generator for which RAND_MAX is 37. If you reduce the generated value
modulo 37 you'll get values from 0-36. In fact, the values 0-36
produced by rand() will map to 0-36, and 37 will map to 0. So 0 will
come up twice as often as any other value.

Changing the approach to use floating point doesn't change the fact
that you're trying to produce 37 values from a source that gives you
38. No matter how you do it, you're not going to get an even
distribution.

Granted, the bias isn't large when RAND_MAX is much greater than the
number of values you're interested in, but it's there, nonetheless. And
when RAND_MAX is around 32767, which is quite common, it wouldn't
particularly surprise me if that alone accounted for the 1% deviation
from expectations.

The way to avoid this is to throw out some of the values from rand() so
that the number of values that you put into the transformation is a
multiple of the number of values you want:

int max = RAND_MAX + 1 - (RAND_MAX + 1) % range; /* usable bound: a multiple of range */
int value = rand();
while (max <= value)     /* reject values in the incomplete tail */
    value = rand();
return value % range;

There's probably an off-by-one error in that code, but it should give
you the general idea.

Pete, can you confirm that the max possible value
one can then get is RAND_MAX - 1, and not RAND_MAX?
For me it doesn't matter much whether the max possible
value is RAND_MAX or one less, but I want to be sure.

But I think the above code for max is not correct.
IMHO it should be something like this:

int max = (RAND_MAX / range) * range;

where range can be at most RAND_MAX.
Then the maximum possible output lies between 0 and RAND_MAX - 1,
i.e. excluding RAND_MAX, right?

I.e. standard rand() gives values between 0 and RAND_MAX,
but using the above method gives values between 0 and RAND_MAX - 1 only.
This seems to be the correct way to avoid the problem you described above
regarding the 0.
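
A tiny enumeration (my addition, not from the thread) makes Pete's RAND_MAX = 37 toy example concrete:

#include <stdio.h>

int main(void)
{
    enum { TOY_RAND_MAX = 37, RANGE = 37 };  /* 38 equally likely source values */
    int count[RANGE] = {0};
    for (int r = 0; r <= TOY_RAND_MAX; ++r)
        ++count[r % RANGE];
    for (int v = 0; v < RANGE; ++v)
        if (count[v] != 1)
            printf("value %d occurs %d times\n", v, count[v]);
    /* prints: value 0 occurs 2 times -- exactly the bias described above */
    return 0;
}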
 
joe

Richard Heathfield said:
The Standard says that "RAND_MAX [...] expands to an integral constant
expression, the value of which is the maximum value returned by the rand
function".

My final version:

#include <stdlib.h>

int genrand(unsigned range)
{   // generates a random number between 0 and range-1
    // range can be maximally RAND_MAX + 1
    // THIS IS THE RECOMMENDED CORRECT METHOD
    // (recommended by me and others :)

    if (range < 2)
        return 0;
    if (range > RAND_MAX + 1u)      // 1u keeps the arithmetic unsigned:
        range = RAND_MAX + 1u;      // RAND_MAX + 1 could overflow as int
    const unsigned maxr = (RAND_MAX + 1u) - ((RAND_MAX + 1u) % range);
    unsigned value = maxr;
    while (value >= maxr)           // rejection loop: discard the biased tail
        value = rand();
    return value % range;
}
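
A quick self-check harness for the function above (my sketch; the assert only verifies the range, not the distribution):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int genrand(unsigned range);   /* the function above */

int main(void)
{
    srand((unsigned)time(0));
    for (long i = 0; i < 1000000L; ++i) {
        int r = genrand(37);              /* roulette: 0..36 */
        assert(0 <= r && r <= 36);
    }
    puts("all draws in range");
    return 0;
}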
 
Jerry Coffin

joe said:
Jerry, it is 0 to 36.

Okay -- not that it makes any real difference in terms of the algorithm.
[ ... ]

I had a hunch you might be doing it in a manner that was biased, and I
was right. Here's how I do it:

#include <stdlib.h>

int rand_lim(int limit) {
    /* return a random number between 0 and limit inclusive */
    int divisor = RAND_MAX / (limit + 1);
    int retval;

    do {
        retval = rand() / divisor;   /* division keeps the high-order bits */
    } while (retval > limit);        /* reject the few overweight top values */

    return retval;
}
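
A usage sketch for Jerry's version (mine, not his). Note the design choice: dividing instead of taking a remainder keeps the high-order bits of rand(), which tend to be the better ones in old linear congruential implementations.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int rand_lim(int limit);   /* as defined above */

int main(void)
{
    srand((unsigned)time(0));
    printf("spin: %d\n", rand_lim(36));   /* a roulette number 0..36 */
    return 0;
}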
 
vippstar

My final version:
<snip>
    if (range > RAND_MAX + 1u)
        range = RAND_MAX + 1u;

This last step isn't right.
<snip>
It's redundant, but not wrong.
 
vippstar

On 2008-07-12 22:59:09 -0400, "joe" <[email protected]> said:
<snip>
    if (range > RAND_MAX + 1u)
        range = RAND_MAX + 1u;
This last step isn't right.
<snip>
It's redundant, but not wrong.

How is it redundant? If the caller asks for a range that's larger than
this distribution can handle and this check isn't present, the
algorithm will not return values that cover the requested range.
Whether that second if is present or not, the behavior of the
algorithm is the same.
 
vippstar

On 2008-07-13 09:17:51 -0400, (e-mail address removed) said:
On 2008-07-12 22:59:09 -0400, "joe" <[email protected]> said:
<snip>
    if (range > RAND_MAX + 1u)
        range = RAND_MAX + 1u;
This last step isn't right.
<snip>
It's redundant, but not wrong.
How is it redundant? If the caller asks for a range that's larger than
this distribution can handle and this check isn't present, the
algorithm will not return values that cover the requested range.
Whether that second if is present or not, the behavior of the
algorithm is the same.

It should produce an error. The fact that it doesn't means it's wrong,
not that it's redundant.
You're wrong.
 
Walter Roberson

You're wrong.

"right" or "wrong" in situations like this depend upon the
documented requirements upon the function. I've looked back through the
thread, but I can't seem to find any statement of what *exactly* the
function is supposed to do under which circumstances, so we cannot
say for sure what is "right" or "wrong" code to implement the function.
*Any* code that was standard conforming could be given as an
implementation of the function, and we have no firm grounds upon which to
say that the code is "wrong".

When there are no documented requirements for a function,
perceptions of "right" or "wrong" depend strongly upon
"reasonable expectations" of what the function would do. And it
seems to me that for -most- people, the "reasonable expectation"
of behaviour if the function cannot deliver the requested range of
values is some kind of indication that there was a parameter problem.
It is my belief that relatively few people would have the
"reasonable expectation" that the function would instead quietly
deliver only a subset of the requested values. Yes, it is true that
*by chance* a random number generator run for several runs might not
-happen- to generate values greater than a certain value, but as
the number of runs increases, the probability of that happening
becomes statistically small enough that we can reasonably make
judgements about whether the function really is producing a
uniform random distribution.
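
To put a number on "statistically small" (my arithmetic, not Walter's): the chance that one particular roulette value never appears in 500 fair draws is (1 - 1/37)^500, i.e. about one in a million.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double p_miss = pow(36.0 / 37.0, 500);   /* (1 - 1/37)^500 */
    printf("P(a given value absent in 500 draws) = %g\n", p_miss);
    /* prints roughly 1.1e-06 */
    return 0;
}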
 
James Kanze

On 2008-07-13 10:11:09 -0400, (e-mail address removed) said:

[...]
It should produce an error. The fact that it doesn't means
it's wrong, not that it's redundant.

From a quality of implementation point of view, yes. Formally, the
contract is to return a number in the given range. The single
statement "return 0;" meets that requirement, and you don't need
all of this extra testing. :)

Of course, the specification did state that the "range can be
maximally RAND_MAX + 1", so normally, I would expect an
assertion failure if my argument wasn't in this range. Or
totally undefined behavior, but not the function arbitrarily
changing my argument.
 
Keith Thompson

James Kanze said:
Of course, the specification did state that the "range can be
maximally RAND_MAX + 1", so normally, I would expect an
assertion failure if my argument wasn't in this range. Or
totally undefined behavior, but not the function arbitrarily
changing my argument.

Surely arbitrarily changing the argument *is* an example of "totally
undefined behavior".

semi-8-)
 
James Kanze

[...]
Of course, the specification did state that the "range can be
maximally RAND_MAX + 1", so normally, I would expect an
assertion failure if my argument wasn't in this range. Or
totally undefined behavior, but not the function arbitrarily
changing my argument.
Surely arbitrarily changing the argument *is* an example of "totally
undefined behavior".

When undefined behavior occurs, anything the program does is
correct. But from a quality of implementation point of view, we
don't normally expect it to go out of the way to make things
worse; *IF* the code detects the error, it should treat it as an
error.
 
Keith Thompson

James Kanze said:
[...]

When undefined behavior occurs, anything the program does is
correct. But from a quality of implementation point of view, we
don't normally expect it to go out of the way to make things
worse; *IF* the code detects the error, it should treat it as an
error.

Certainly.

By snipping the smiley, you've made it appear that I was making a
serious point. I wasn't.
 
