A new classification method for RNGs: Significance Level

joe

My experiments show that the random number generator
in Microsoft's VC++6 compiler is a statistical RNG with a
significance level > 1.0%.
Statistical testing at an SL above 1.0% (for example 1.001%) passes the test,
but testing at exactly 1.0% does not pass...

Can anybody confirm this finding?

The RNG functions of the various SW products can be
analyzed and classified better using their significance level
as shown above.
I think this important finding deserves deeper research... :)

For the testing method see:
http://en.wikipedia.org/wiki/Binomial_test
http://en.wikipedia.org/wiki/Binomial_distribution
 
joe

David Kerber said:
(e-mail address removed) says:

What are you using for your sample size and null hypothesis?

Here the details:
Sample size is 500 (i.e. calling rand() 500 times);
the rand range is 37 (i.e. 0 to 36; yes, a roulette simulation).
The above-mentioned statistical test is performed after each rand() call.
This whole procedure is run more than 30 times in a loop,
each time initializing the frequency stats anew.
srand(time(0)) is done only once at program start.
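
Below is a minimal sketch of that setup as I read it (my reconstruction, not the poster's actual program): 30 "days" of 500 spins each, frequency stats cleared at the start of every day, srand() called once. For simplicity it checks each number once at the end of a day rather than after every call, using the normal approximation to the binomial (the n*p >= 5 rule discussed below) with 2.5758 as the two-sided critical z for SL = 1%.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RANGE 37   /* roulette numbers 0..36 */
#define SPINS 500  /* rand() calls per day */
#define DAYS  30

int main(void)
{
    srand((unsigned)time(0));               /* seeded once at program start */
    const double p = 1.0 / RANGE;
    const double zcrit = 2.5758;            /* two-sided critical z at SL = 1% */

    for (int day = 0; day < DAYS; ++day) {
        int freq[RANGE] = {0};              /* frequency stats cleared each day */
        for (int spin = 0; spin < SPINS; ++spin)
            ++freq[rand() % RANGE];

        int outliers = 0;                   /* numbers whose count fails the test */
        for (int k = 0; k < RANGE; ++k) {
            double mean = SPINS * p;
            double sd = sqrt(SPINS * p * (1.0 - p));
            if (fabs((freq[k] - mean) / sd) > zcrit)
                ++outliers;
        }
        printf("day %2d: %d of %d numbers outside the 1%% bounds\n",
               day + 1, outliers, RANGE);
    }
    return 0;
}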
 
joe

H0 = the RNG passes the randomness test only at a significance level > 1%;
i.e. try out 1% and 1.001%, for example, and you will see the difference.
 
joe

David Kerber said:
(e-mail address removed) says:

I think your sample size is too small for such a conclusion to be
statistically valid.

The minimum necessary is given by these relations
(cf. the above wiki pages and also statistics books):

n*p >= 5 AND n*(1-p) >= 5

i.e. for the above example:
n >= 5 / (1/37) = 185
Above we have 500; i.e. with anywhere from 185 to 500 samples the test can be applied.
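
A quick numeric check of that rule for the numbers above (my two-liner, nothing more):

#include <stdio.h>

int main(void)
{
    double n = 500, p = 1.0 / 37;
    printf("n*p = %.2f, n*(1-p) = %.2f\n", n * p, n * (1 - p));
    /* prints n*p = 13.51, n*(1-p) = 486.49 -- both >= 5, so the
       normal approximation behind the test is justified */
    return 0;
}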
 
user923005

My experiments show that the random number generator
in Microsoft's VC++6 compiler is a statistical RNG with a
significance level > 1.0%.
<snip>

I guess that what you really want is the statistics group.

Follow-up added.
 
joe

David Kerber said:
(e-mail address removed) says:

But you said you're initializing the frequency stats before each loop
iteration? To me, that would imply that your sample size is really only
36 for each of 30 or so runs. Or am I misunderstanding how you are
accumulating the results?

The two loops I used are not that important.
I just simulate 30 days, with 500 rand's per day (i.e. spins, draws, etc.).
At the beginning of each day the frequency stats are cleared;
i.e. the tests are done on intraday stats only.
I hope this makes it clear.

BTW, there is also a parallel discussion on this in sci.math
under the subject "Detecting biased random number generators".
 
Jerry Coffin

[ ... ]

Out of curiosity, what exact code are you using to get from the original
range of the generator to the 0-35 that you're using?
 
joe

Jerry Coffin said:
(e-mail address removed) says...

[ ... ]

Out of curiosity, what exact code are you using to get from the original
range of the generator to the 0-35 that you're using?

Jerry, it is 0 to 36.
As recommended in many books (for example Phillip Good,
"Permutation, Parametric and Bootstrap Tests of Hypotheses"),
I use the following:

int genrand(int lo, int hi)
{
    /* map rand() onto lo..hi via the remainder */
    int z = rand() % (hi - lo + 1) + lo;
    return z;
}
...
int r = genrand(0, 36);
 
joe

Richard Heathfield said:
joe said:



Better:

#include <stdlib.h>

int genrand(int low, int high)
{
    /* scale rand() into [low, high] via floating point */
    return (high - low + 1) * (rand() / (RAND_MAX + 1.0)) + low;
}

This avoids over-dependence on the low-order bits, which are a little too
predictable in some implementations.

Yes, this makes much sense, although it uses floating point.
But the phenomenon described in the OP of this thread,
i.e. that the MS rand() is:
a) a statistical RNG, and that
b) it operates at a 1% SL (or equivalently a 99% CL),
still holds with the above as well.
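
A concrete illustration of the low-order-bit weakness Richard mentions (my toy example, using the C standard's sample rand() constants, not any particular vendor's generator): with a power-of-two modulus LCG, the lowest bit of the state simply alternates.

#include <stdio.h>

int main(void)
{
    unsigned seed = 12345;
    for (int i = 0; i < 16; ++i) {
        seed = seed * 1103515245u + 12345u;  /* LCG step, modulus 2^32 */
        printf("%u", seed & 1u);             /* low bit: prints 0101010101010101 */
    }
    putchar('\n');
    return 0;
}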
 
joe

Pete Becker said:
You're both wrong. <g>

Yes, there may be a dependence on low bits, but even if there isn't,
both of these approaches introduce a bias.

Suppose, for simplicity, that you're trying to generate 37 values with
a generator for which RAND_MAX is 37. If you reduce the generated value
modulo 37 you'll get values from 0-36. In fact, the values 0-36
produced by rand() will map to 0-36, and 37 will map to 0. So 0 will
come up twice as often as any other value.

Changing the approach to use floating point doesn't change the fact
that you're trying to produce 37 values from a source that gives you
38. No matter how you do it, you're not going to get an even
distribution.

Granted, the bias isn't large when RAND_MAX is much greater than the
number of values you're interested in, but it's there, nonetheless. And
when RAND_MAX is around 32767, which is quite common, it wouldn't
particularly surprise me if that alone accounted for the 1% deviation
from expectations.

The way to avoid this is to throw out some of the values from rand() so
that the number of values that you put into the transformation is a
multiple of the number of values you want:

int max = RAND_MAX + 1 - (RAND_MAX + 1) % range; /* usable bound: a multiple of range */
int value = rand();
while (max <= value)     /* reject values in the incomplete tail */
    value = rand();
return value % range;

There's probably an off-by-one error in that code, but it should give
you the general idea.

Pete, can you confirm that the max possible value
one can then get is RAND_MAX - 1, and not RAND_MAX?
For me it doesn't matter much whether the max possible
value is RAND_MAX or one less, but I want to be sure.

But I think the above code for max is not correct.
IMHO it should be something like this:

int max = (RAND_MAX / range) * range;

where range can be at most RAND_MAX.
Then the maximum possible output lies between 0 and RAND_MAX - 1,
i.e. excluding RAND_MAX, right?

I.e. standard rand() gives values between 0 and RAND_MAX,
but using the above method gives values between 0 and RAND_MAX - 1 only.
This seems to be the correct way to avoid the problem you described above
regarding the 0.
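
A tiny enumeration (my addition, not from the thread) makes Pete's RAND_MAX = 37 toy example concrete:

#include <stdio.h>

int main(void)
{
    enum { TOY_RAND_MAX = 37, RANGE = 37 };  /* 38 equally likely source values */
    int count[RANGE] = {0};
    for (int r = 0; r <= TOY_RAND_MAX; ++r)
        ++count[r % RANGE];
    for (int v = 0; v < RANGE; ++v)
        if (count[v] != 1)
            printf("value %d occurs %d times\n", v, count[v]);
    /* prints: value 0 occurs 2 times -- exactly the bias described above */
    return 0;
}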
 
joe

Richard Heathfield said:
The Standard says that "RAND_MAX [...] expands to an integral constant
expression, the value of which is the maximum value returned by the rand
function".

My final version:

#include <stdlib.h>

int genrand(unsigned range)
{   // generates a random number between 0 and range-1
    // range can be maximally RAND_MAX + 1
    // THIS IS THE RECOMMENDED CORRECT METHOD
    // (recommended by me and others :)

    if (range < 2)
        return 0;
    if (range > RAND_MAX + 1u)      // 1u keeps the arithmetic unsigned:
        range = RAND_MAX + 1u;      // RAND_MAX + 1 could overflow as int
    const unsigned maxr = (RAND_MAX + 1u) - ((RAND_MAX + 1u) % range);
    unsigned value = maxr;
    while (value >= maxr)           // rejection loop: discard the biased tail
        value = rand();
    return value % range;
}
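
A quick self-check harness for the function above (my sketch; the assert only verifies the range, not the distribution):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int genrand(unsigned range);   /* the function above */

int main(void)
{
    srand((unsigned)time(0));
    for (long i = 0; i < 1000000L; ++i) {
        int r = genrand(37);              /* roulette: 0..36 */
        assert(0 <= r && r <= 36);
    }
    puts("all draws in range");
    return 0;
}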
 
Jerry Coffin

joe said:
Jerry, it is 0 to 36.

Okay -- not that it makes any real difference in terms of the algorithm.
[ ... ]

I had a hunch you might be doing it in a manner that was biased, and I
was right. Here's how I do it:

#include <stdlib.h>

int rand_lim(int limit) {
    /* return a random number between 0 and limit inclusive */
    int divisor = RAND_MAX / (limit + 1);
    int retval;

    do {
        retval = rand() / divisor;   /* division keeps the high-order bits */
    } while (retval > limit);        /* reject the few overweight top values */

    return retval;
}
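
A usage sketch for Jerry's version (mine, not his). Note the design choice: dividing instead of taking a remainder keeps the high-order bits of rand(), which tend to be the better ones in old linear congruential implementations.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int rand_lim(int limit);   /* as defined above */

int main(void)
{
    srand((unsigned)time(0));
    printf("spin: %d\n", rand_lim(36));   /* a roulette number 0..36 */
    return 0;
}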
 
vippstar

My final version:
<snip>
    if (range > RAND_MAX + 1u)
        range = RAND_MAX + 1u;

This last step isn't right.
<snip>
It's redundant, but not wrong.
 
vippstar

On 2008-07-12 22:59:09 -0400, "joe" <[email protected]> said:
<snip>
    if (range > RAND_MAX + 1u)
        range = RAND_MAX + 1u;
This last step isn't right.
<snip>
It's redundant, but not wrong.

How is it redundant? If the caller asks for a range that's larger than
this distribution can handle and this check isn't present, the
algorithm will not return values that cover the requested range.
Whether that second if is present or not, the behavior of the
algorithm is the same.
 
vippstar

On 2008-07-13 09:17:51 -0400, (e-mail address removed) said:
On 2008-07-12 22:59:09 -0400, "joe" <[email protected]> said:
<snip>
    if (range > RAND_MAX + 1u)
        range = RAND_MAX + 1u;
This last step isn't right.
<snip>
It's redundant, but not wrong.
How is it redundant? If the caller asks for a range that's larger than
this distribution can handle and this check isn't present, the
algorithm will not return values that cover the requested range.
Whether that second if is present or not, the behavior of the
algorithm is the same.

It should produce an error. The fact that it doesn't means it's wrong,
not that it's redundant.
You're wrong.
 
Walter Roberson

You're wrong.

"right" or "wrong" in situations like this depend upon the
documented requirements upon the function. I've looked back through the
thread, but I can't seem to find any statement of what *exactly* the
function is supposed to do under which circumstances, so we cannot
say for sure what is "right" or "wrong" code to implement the function.
*Any* code that was standard conforming could be given as an
implementation of the function, and we have no firm grounds upon which to
say that the code is "wrong".

When there are no documented requirements for a function,
perceptions of "right" or "wrong" depend strongly upon
"reasonable expectations" of what the function would do. And it
seems to me that for -most- people, the "reasonable expectation"
of behaviour if the function cannot deliver the requested range of
values is some kind of indication that there was a parameter problem.
It is my belief that relatively few people would have the
"reasonable expectation" that the function would instead quietly
deliver only a subset of the requested values. Yes, it is true that
*by chance* a random number generator run for several runs might not
-happen- to generate values greater than a certain value, but as
the number of runs increases, the probability of that happening
becomes statistically small enough that we can reasonably make
judgements about whether the function really is producing a
uniform random distribution.
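
To put a number on "statistically small" (my arithmetic, not Walter's): the chance that one particular roulette value never appears in 500 fair draws is (1 - 1/37)^500, i.e. about one in a million.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double p_miss = pow(36.0 / 37.0, 500);   /* (1 - 1/37)^500 */
    printf("P(a given value absent in 500 draws) = %g\n", p_miss);
    /* prints roughly 1.1e-06 */
    return 0;
}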
 
James Kanze

On 2008-07-13 10:11:09 -0400, (e-mail address removed) said:

[...]
It should produce an error. The fact that it doesn't means
it's wrong, not that it's redundant.

From a quality of implementation point of view, yes. Formally, the
contract is to return a number in the given range. The single
statement "return 0;" meets that requirement, and you don't need
all of this extra testing. :)

Of course, the specification did state that the "range can be
maximally RAND_MAX + 1", so normally, I would expect an
assertion failure if my argument wasn't in this range. Or
totally undefined behavior, but not the function arbitrarily
changing my argument.
 
Keith Thompson

James Kanze said:
Of course, the specification did state that the "range can be
maximally RAND_MAX + 1", so normally, I would expect an
assertion failure if my argument wasn't in this range. Or
totally undefined behavior, but not the function arbitrarily
changing my argument.

Surely arbitrarily changing the argument *is* an example of "totally
undefined behavior".

semi-8-)
 
James Kanze

[...]
Of course, the specification did state that the "range can be
maximally RAND_MAX + 1", so normally, I would expect an
assertion failure if my argument wasn't in this range. Or
totally undefined behavior, but not the function arbitrarily
changing my argument.
Surely arbitrarily changing the argument *is* an example of "totally
undefined behavior".

When undefined behavior occurs, anything the program does is
correct. But from a quality of implementation point of view, we
don't normally expect it to go out of the way to make things
worse; *IF* the code detects the error, it should treat it as an
error.
 
Keith Thompson

James Kanze said:
[...]

When undefined behavior occurs, anything the program does is
correct. But from a quality of implementation point of view, we
don't normally expect it to go out of the way to make things
worse; *IF* the code detects the error, it should treat it as an
error.

Certainly.

By snipping the smiley, you've made it appear that I was making a
serious point. I wasn't.
 
