Does shuffle() produce uniform results?


tooru honda

Hi,

I have read the source code of the built-in random module, random.py.
After also reading the Wikipedia article on the Knuth shuffle algorithm,
I wonder whether the shuffle method implemented in random.py produces
results with modulo bias.

The reasoning is as follows: Because the method random() only produces
finitely many possible results, we get modulo bias when the number of
possible results is not divisible by the size of the shuffled list.
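
A toy sketch (my own example, not part of the original post) of the
concern: a generator with only 8 equally likely outputs, folded onto 3
buckets with %.

from collections import Counter

outputs = range(8)              # pretend the RNG has only 8 possible results
print(Counter(n % 3 for n in outputs))
# Counter({0: 3, 1: 3, 2: 2}): 8 is not divisible by 3, so 2 comes up less often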

1. Does shuffle() produce uniform results?

2. If not, is there a fast and uniform shuffle() available somewhere?

Thanks!

-tooru honda
 

Steve Holden

tooru said:
Hi,

I have read the source code of the built-in random module, random.py.
After also reading the Wikipedia article on the Knuth shuffle algorithm,
I wonder whether the shuffle method implemented in random.py produces
results with modulo bias.

The reasoning is as follows: Because the method random() only produces
finitely many possible results, we get modulo bias when the number of
possible results is not divisible by the size of the shuffled list.

1. Does shuffle() produce uniform results?
Given the cycle length of the Mersenne twister algorithm that generates
the underlying random numbers, it would have to be a pretty long list to
see a significant bias, don't you think? Have you done any calculations
on the length of list you propose to use?
2. If not, is there a fast and uniform shuffle() available somewhere?
Frankly I don't think you need to worry.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
--------------- Asciimercial ------------------
Get on the web: Blog, lens and tag the Internet
Many services currently offer free registration
----------- Thank You for Reading -------------
 

Hrvoje Niksic

tooru honda said:
I have read the source code of the built-in random module, random.py.
After also reading the Wikipedia article on the Knuth shuffle algorithm,
I wonder whether the shuffle method implemented in random.py produces
results with modulo bias.

It doesn't have modulo bias because it doesn't use modulo to produce a
random index; it multiplies the floating-point value by the desired
range. I'm not sure whether that method produces any measurable bias.
 

Mark Dickinson

Hrvoje Niksic said:
It doesn't have modulo bias because it doesn't use modulo to produce a
random index; it multiplies the floating-point value by the desired
range. I'm not sure whether that method produces any measurable bias.

It produces exactly the same level of bias as the 'modulo bias'
obtained by reducing a random integer in [0, 2**53). For example,
suppose we're trying to produce an integer x in the range 0 through 6
inclusive. If n is a random variable whose values are uniformly
distributed across range(0, 2**53) then:

x = n % 7

will produce 0, 1, 2 and 3 with probability (2**53//7+1)/2**53, and 4,
5 and 6 with probability (2**53//7)/2**53, while

x = floor((n/2**53)*7)

will produce 0, 1, 3 and 5 with probability (2**53//7+1)/2**53, and 2,
4 and 6 with probability (2**53//7)/2**53.

Either way, you'd have a very hard time detecting such a tiny bias.
At the other end of the scale, if you're trying to produce a value in
[0, 2**53-2] (for example) then it looks worse: with either method,
one of the values occurs exactly twice as often as all of the others.
But since there are now so many values, you'd again have problems
detecting any bias.
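
A quick exact check of those modulo counts (my own sketch, not from the
original post):

N, m = 2**53, 7
counts = [N // m + (1 if r < N % m else 0) for r in range(m)]
# N % m == 4, so residues 0, 1, 2 and 3 each occur N//m + 1 times and
# 4, 5 and 6 occur N//m times, matching the probabilities quoted above.
assert sum(counts) == N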

Steve Holden said:
Frankly I don't think you need to worry.

What he said.

Mark
 

Paul Rubin

tooru honda said:
The reasoning is as follows: Because the method random() only produces
finitely many possible results, we get modulo bias when the number of
possible results is not divisible by the size of the shuffled list.

1. Does shuffle() produce uniform results?

The nonuniformity is too small to matter. But what is the
application? If you are doing something like implementing online
poker for real money, you shouldn't use the built-in RNG. It is not
designed for what we call adversarial indistinguishability from true
randomness. Instead, use the random byte stream available from
os.urandom() and derive your random numbers from that.
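
A minimal sketch of Paul's suggestion (the deck size is my assumption): the
standard library already wraps os.urandom() in random.SystemRandom, so an
urandom-backed shuffle is available without writing any new code.

import random

rng = random.SystemRandom()     # draws its randomness from os.urandom()
deck = list(range(52))
rng.shuffle(deck)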
 

Mark Dickinson

x = floor((n/2**53)*7)

will produce 0, 1, 3 and 5 with probability (2**53//7+1)/2**53, and 2,
4 and 6 with probability (2**53//7)/2**53.

Oops---I lied; I forgot to take into account the rounding implicit in
the (n/2**53)*7 multiplication. A bit of experimentation shows that
it's 0, 2, 4 and 6 that occur more often, with 1, 3 and 5 less likely
by a minuscule amount (at least on an IEEE-754 system).
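
My own sketch (not from the thread) of how to reproduce that experiment
exactly on Python 3: int((n/2**53)*7) is monotone nondecreasing in n, so
each target value owns a contiguous block of n whose edges a binary search
can find.

TWO53 = 2**53

def float_index(n, m=7):
    return int((n / TWO53) * m)          # the floor((n/2**53)*7) method

def first_n_reaching(k, m=7):
    # smallest n with float_index(n, m) >= k
    lo, hi = 0, TWO53
    while lo < hi:
        mid = (lo + hi) // 2
        if float_index(mid, m) >= k:
            hi = mid
        else:
            lo = mid + 1
    return lo

bounds = [first_n_reaching(k) for k in range(8)]
counts = [bounds[k + 1] - bounds[k] for k in range(7)]
print([k for k in range(7) if counts[k] > TWO53 // 7])
# Should print [0, 2, 4, 6] on an IEEE-754 system, matching the observation above.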

Mark
 

tooru honda

Hi. First of all, my thanks to all of you who replied.

I am writing a gambling simulation to convince my friend that his "winning
strategy" doesn't work. I use the shuffle method of a random.SystemRandom
instance to shuffle 8 decks of cards.

As the number of cards is quite small (416), the nonuniformity doesn't
matter, as most of you have already said. Just to avoid arguments from my
friend, though, I am considering writing my own randint and shuffle
methods based on os.urandom().

-tooru honda
 

tooru honda

In the end, I think it is worthwhile to implement my own shuffle and
random methods based on os.urandom. Not only does the resulting code
get rid of the minuscule bias, but the program also runs much faster.

When using random.SystemRandom.shuffle, posix.open and posix.close from
calling os.urandom account for almost half of the total execution time
for my program. By implementing my own random and getting a much larger
chunk of random bytes from os.urandom each time, I am able to reduce the
total execution time by half.

-tooru honda

P.S. I use Python 2.5.1 on Mac OS X 10.4.10 (PowerPC).
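
How numbers like the cProfile figures later in this thread can be gathered
(my own sketch; the 'main' entry point and output file name are assumptions):

import cProfile
import pstats

cProfile.run('main()', 'shuffle.prof')
pstats.Stats('shuffle.prof').sort_stats('cumulative').print_stats(10)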
 

Dan Bishop

tooru honda said:
Hi,

I have read the source code of the built-in random module, random.py.
After also reading the Wikipedia article on the Knuth shuffle algorithm,
I wonder whether the shuffle method implemented in random.py produces
results with modulo bias.

The reasoning is as follows: Because the method random() only produces
finitely many possible results, we get modulo bias when the number of
possible results is not divisible by the size of the shuffled list.

1. Does shuffle() produce uniform results?

Experimental evidence with a 5-card deck and 1,200,000 simulated shuffles
gives me the following frequencies for each of the 120 possible orderings:

[9782, 9784, 9784, 9786, 9817, 9822, 9825, 9841, 9859, 9864, 9866,
9872, 9886, 9888, 9893, 9893, 9893, 9893, 9896, 9896, 9901, 9905,
9909, 9909, 9912, 9917, 9919, 9920, 9920, 9925, 9932, 9933, 9934,
9938, 9938, 9943, 9944, 9946, 9947, 9951, 9952, 9955, 9956, 9959,
9960, 9961, 9962, 9963, 9964, 9966, 9968, 9968, 9968, 9969, 9969,
9975, 9977, 9978, 9978, 9979, 9985, 9987, 9990, 9996, 9998, 9998,
10002, 10009, 10012, 10015, 10017, 10018, 10020, 10026, 10027, 10028,
10032, 10035, 10042, 10047, 10048, 10049, 10049, 10055, 10056, 10060,
10062, 10064, 10066, 10069, 10070, 10076, 10087, 10089, 10092, 10095,
10097, 10098, 10100, 10101, 10116, 10117, 10128, 10129, 10131, 10137,
10138, 10140, 10146, 10155, 10158, 10162, 10162, 10163, 10177, 10190,
10190, 10232, 10260, 10292]

The chi-squared
test statistic (for H0: all 120 orderings are equally likely) is
132.9472 with 119 degrees of freedom. The p-value is 0.1804, which is
greater than typically-used significance levels, so there is
insufficient evidence to reject H0.
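
A sketch of the kind of experiment described above (my own reconstruction;
the trial count matches the post, everything else is assumed):

import random
from collections import Counter
from itertools import permutations

trials = 1200000
counts = Counter()
for _ in range(trials):
    deck = [0, 1, 2, 3, 4]
    random.shuffle(deck)
    counts[tuple(deck)] += 1

expected = trials / 120.0
chi2 = sum((counts[p] - expected) ** 2 / expected
           for p in permutations(range(5)))
print(sorted(counts.values()))
print("chi-squared:", chi2, "on 119 degrees of freedom")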
 

Alex Martelli

tooru honda said:
In the end, I think it is worthwhile to implement my own shuffle and
random methods based on os.urandom. Not only does the resulting code
get rid of the minuscule bias, but the program also runs much faster.

When using random.SystemRandom.shuffle, posix.open and posix.close from
calling os.urandom account for almost half of the total execution time
for my program. By implementing my own random and getting a much larger
chunk of random bytes from os.urandom each time, I am able to reduce the
total execution time by half.

If I were in your shoes, I would optimize by subclassing
random.SystemRandom and overriding the random method to use os.urandom
with some large block size and then parcel it out, instead of the
_urandom(7) that it now uses. E.g., something like:

import binascii
import os
import random

class SystemBlockRandom(random.SystemRandom):

    def __init__(self):
        random.SystemRandom.__init__(self)
        def rand7():
            while True:
                randata = os.urandom(7*1024)
                for i in xrange(0, 7*1024, 7):
                    yield long(binascii.hexlify(randata[i:i+7]), 16)
        self.rand7 = rand7().next

    def random(self):
        """Get the next random number in the range [0.0, 1.0)."""
        return (self.rand7() >> 3) * random.RECIP_BPF

(untested code). No need to reimplement anything else, it seems to me.
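
Hypothetical usage of the class above (my addition, not Alex's): shuffle()
is inherited from random.Random, which calls the overridden random() method.

rng = SystemBlockRandom()
deck = list(range(416))          # 8 decks of 52 cards, as in tooru's program
rng.shuffle(deck)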


Alex
 

tooru honda

By incorporating Alex's code, I got another performance boost of 20%.
This is mostly because Alex's implementation of block random is more
efficient than my own version.

-tooru honda

Below is the code I have now:


import random
from binascii import hexlify
from os import urandom

class rcRandomC(random.SystemRandom):

    def __init__(self):
        random.SystemRandom.__init__(self)

        def rand2():
            while True:
                randata = urandom(2*1024)
                for i in xrange(0, 2*1024, 2):
                    yield int(hexlify(randata[i:i+2]), 16)  # integer in [0, 65535]

        self.rand2_M = rand2().next

    # modified from random._randbelow
    def randrange(self, startN, stopN):
        """Choose a random integer from range(startN, stopN); widthN <= 65536."""
        widthN = stopN - startN

        left_over_N = 65536 % widthN
        upper_bound_N = 65535 - left_over_N

        random_number = self.rand2_M()
        while random_number > upper_bound_N:
            random_number = self.rand2_M()

        r = random_number % widthN
        return startN + r

    def shuffle(self, x):
        """Shuffle list x in place; return None."""
        randrange = self.randrange
        for i in reversed(xrange(1, len(x))):
            # pick an element in x[:i+1] with which to exchange x[i]
            j = randrange(0, i+1)
            x[i], x[j] = x[j], x[i]
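
A note on the randrange above (my own check, not from the post): it avoids
modulo bias by rejection, discarding the top 65536 % widthN raw values so
that the accepted range is an exact multiple of widthN.

widthN = 416                         # 8 decks of cards
left_over_N = 65536 % widthN         # 224
upper_bound_N = 65535 - left_over_N  # 65311
assert (upper_bound_N + 1) % widthN == 0   # 65312 == 157 * 416, no bias left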
 

Alex Martelli

tooru honda said:
def rand2():
    while True:
        randata = urandom(2*1024)
        for i in xrange(0, 2*1024, 2):
            yield int(hexlify(randata[i:i+2]), 16)  # integer in [0, 65535]

another equivalent possibility, which would probably be faster:

import array
...
def rand2():
    while True:
        x = array.array("H")
        x.fromstring(urandom(2*4000))
        for i in x:
            yield i


Alex
 

tooru honda

Thanks to everyone who replied (and special thanks to Alex Martelli), I
was able to accomplish what I originally asked for: a shuffle() which is
both fast and without bias. This is the first time I have tried to
optimize Python code, and it has been a very rewarding experience.


To bring closure to this thread, I will provide below some statistics
obtained from cProfile:


1. using random.SystemRandom

ncalls tottime percall cumtime percall filename:lineno(function)

4182879 35.154 0.000 211.006 0.000 os.py:724(urandom)
4182879 31.981 0.000 248.291 0.000 random.py:767(random)
40578 17.977 0.000 266.300 0.007 random.py:250(shuffle)
4182879 29.487 0.000 29.487 0.000 {posix.close}
4182879 99.460 0.000 99.460 0.000 {posix.open}
4182879 41.794 0.000 41.794 0.000 {posix.read}

2. my original optimization

8268 0.133 0.000 2.577 0.000 os.py:724(urandom)
4134322 15.094 0.000 21.606 0.000 baccarat.py:70(get2bytes)
4131441 17.221 0.000 41.722 0.000 baccarat.py:85(randrange)
40590 7.059 0.000 48.795 0.001 baccarat.py:112(shuffle)

3. using Alex's generator with string

4117 0.058 0.000 2.048 0.000 os.py:724(urandom)
4214795 10.186 0.000 14.880 0.000 baccarat.py:93(rand2)
4211790 8.883 0.000 23.763 0.000 baccarat.py:106(randrange)
40739 6.101 0.000 29.878 0.001 baccarat.py:131(shuffle)


4. using Alex's generator with array

2081 0.029 0.000 1.736 0.001 os.py:724(urandom)
4161569 1.724 0.000 3.473 0.000 baccarat.py:100(rand2new)
4158474 8.315 0.000 11.788 0.000 baccarat.py:113(randrange)
40514 5.726 0.000 17.528 0.000 baccarat.py:138(shuffle)
 

Antoon Pardon

tooru honda said:
By incorporating Alex's code, I got another performance boost of 20%.
This is mostly because Alex's implementation of block random is more
efficient than my own version.

If I understand correctly that you are using urandom as a random
generator, I wouldn't put too much trust in this performance. urandom
uses the system-wide entropy pool. If other programs need this pool
too, your performance can drop spectacularly. If you are using a Linux
machine, just try executing "od -x /dev/random" at the same time as
your program.
 

Paul Rubin

Antoon Pardon said:
If I understand correctly that you are using urandom as a random
generator, I wouldn't put too much trust in this performance. urandom
uses the system-wide entropy pool. If other programs need this pool
too, your performance can drop spectacularly.

No, the idea is that once there's enough entropy in the pool to make
one encryption key (say 128 bits), the output of /dev/urandom is
computationally indistinguishable from random output no matter how
much data you read from it.
 

Antoon Pardon

Paul Rubin said:
No, the idea is that once there's enough entropy in the pool to make
one encryption key (say 128 bits), the output of /dev/urandom is
computationally indistinguishable from random output no matter how
much data you read from it.

If you were talking about /dev/random I would agree. But this is what
the man page on my system says about /dev/urandom:

A read from the /dev/urandom device will not block waiting for more
entropy. As a result, if there is not sufficient entropy in the entropy
pool, the returned values are theoretically vulnerable to a
cryptographic attack on the algorithms used by the driver. Knowledge of
how to do this is not available in the current non-classified
literature, but it is theoretically possible that such an attack may
exist. If this is a concern in your application, use /dev/random
instead.

And reading from /dev/random can block if there is not enough entropy.
 

Paul Rubin

Antoon Pardon said:
If you were talking about /dev/random I would agree. But this is what
the man page on my system says about /dev/urandom. ...
the returned values are theoretically vulnerable to a
cryptographic attack on the algorithms used by the driver.

Right. The idea is that those attacks don't exist and therefore the
output is computationally indistinguishable from random. Of course,
whether the idea is correct is an unproven conjecture, but it looks
pretty good; certainly finding any problem with the specific
algorithms in urandom would be a significant research discovery and
not likely to affect the application being discussed. Finding a
general attack that couldn't be fixed with some simple tweak would be
a major crypto breakthrough that would probably reshape a lot of
complexity theory. This is unlike the situation with Mersenne
Twister, which was not designed for indistinguishability against an
opponent who knew what to look for.

In short, using /dev/random is fairly silly once you know there's
enough entropy in the randomness pool to make a good key. If
/dev/urandom's algorithms are broken then whatever you're doing with
the /dev/random output is probably also broken.
 

Steven D'Aprano

Paul Rubin said:
Right. The idea is that those attacks don't exist and therefore the
output is computationally indistinguishable from random.

It is a huge leap from what the man page says, that they don't exist in
the unclassified literature at the time the docs were written, to what
you're saying, that they don't exist.

The man page is clear: there is a possible vulnerability in /dev/urandom.
Any cryptographer worth his salt (pun intended) would be looking to close
that vulnerability BEFORE an attack is made public, and not just wait for
the attack to trickle down from the NSA to the script kiddies. The time
to close the stable door is _before_ the horse gets away.

Of course,
whether the idea is correct is an unproven conjecture, but it looks pretty
good; certainly finding any problem with the specific algorithms in
urandom would be a significant research discovery and not likely to
affect the application being discussed.

I agree that this flaw doesn't sound like it will affect the application
being discussed, but a problem has already been found and a solution is
already known: block until there's enough entropy. That's what /dev/
random does.


[snip]
In short, using /dev/random is fairly silly once you know there's enough
entropy in the randomness pool to make a good key. If /dev/urandom's
algorithms are broken then whatever you're doing with the /dev/random
output is probably also broken.

That doesn't follow. Antoon is specifically warning that /dev/urandom is
non-blocking. If you knew there was enough entropy available, you
wouldn't need /dev/random -- but how do you know there's enough entropy?

(I suppose you could look in /proc/sys/kernel/random/entropy_avail.)
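
A minimal sketch (mine, not Steven's) of reading that estimate on Linux:

with open('/proc/sys/kernel/random/entropy_avail') as f:
    print(int(f.read()))             # kernel's current entropy estimate, in bits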

For this specific application, it probably doesn't matter -- using /dev/
urandom is surely overkill, and on a single-user Linux desktop you're
unlikely to have vast numbers of applications reading /dev/urandom
without your knowledge. But why not use /dev/random? What's the downside?
 

Raymond Hettinger

tooru honda said:
1. Does shuffle() produce uniform results?

If you're worried about this microscopic bias (on the order of
2**-53), then shuffle more than once. That will distribute the bias
more evenly:

def super_shuffle(sequence):
    for i in range(10):
        shuffle(sequence)


Raymond Hettinger
 

Paul Rubin

Steven D'Aprano said:
It is a huge leap from what the man page says, that they don't exist in
the unclassified literature at the time the docs were written, to what
you're saying, that they don't exist.

OK. /dev/random vs /dev/urandom is a perennial topic in sci.crypt and
there are endless long threads about it there, so I tried to give you
the short version, but will give a somewhat longer version here.

There can't be an actual security proof without also proving P!=NP;
however, the cryptographic part of /dev/urandom is designed pretty
conservatively and seems to be in good shape even after all these years.

Steven D'Aprano said:
The man page is clear: there is a possible vulnerability in /dev/urandom.

There is the theoretical possibility of such a vulnerability; however,
it also applies to /dev/random. Note that the man page is neither a
kernel internals document nor a cryptography textbook, so it doesn't
describe the algorithm or the theory behind it. For that you have to
look at the source code and the cryptographic literature, respectively.

For the random.c source code, see

http://www.cs.berkeley.edu/~daw/rnd/linux-rand

It's an old version but hasn't changed all that much since then AFAIK.
There is a long comment at the top explaining how it works. Short
version: /dev/random reads data from a bunch of stochastic sources
around the system (e.g. things like mouse motion) and makes a
seat-of-the-pants estimate of the amount of entropy coming from the
sources. The sources are not considered to be uniform or
uncorrelated; they're just supposed to contain entropy. The entropy
is distilled through a cryptographic hash function before being sent
out the API. /dev/random throttles its output so that the number of
bits emitted is always lower than the estimated input entropy.
/dev/urandom is the same but doesn't do this throttling.

For the theory of what we hope (but have not proved) that these hash
functions do, see for example the HMAC papers:

http://www-cse.ucsd.edu/~mihir/papers/hmac.html

Steven D'Aprano said:
Any cryptographer worth his salt (pun intended) would be looking to close
that vulnerability BEFORE an attack is made public, and not just wait for
the attack to trickle down from the NSA to the script kiddies. The time
to close the stable door is _before_ the horse gets away.

No really, these functions are very well studied; in fact, there are
known ways to construct collisions in the MD5 compression function, but
that doesn't appear to be of any use against how random.c uses it.

Steven D'Aprano said:
I agree that this flaw doesn't sound like it will affect the application
being discussed, but a problem has already been found and a solution is
already known: block until there's enough entropy. That's what /dev/
random does.

No, having enough entropy does NOT guarantee indistinguishability from
randomness. Think of a 60-40 biased coin. Each flip gives you 0.97
bits of entropy, so if you want 64 bits of entropy, 66 flips gives it
to you. But it's very easy to spot the 60-40 bias and see that the
flips are not drawn uniformly from {heads,tails}. Running the flips
through MD5 experimentally appears to get rid of any correlation
(except for some very tricky adversarial constructions based on
controlling the input) so that is what /dev/random does. But that is
the exact same thing that urandom does. It's possible that some
as-yet-unknown attack can find correlations in the md5 output and also
in SHA256 or whatever.
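
Checking the coin arithmetic above (my own sketch, on a recent Python):

from math import log2

H = -(0.6 * log2(0.6) + 0.4 * log2(0.4))   # entropy per flip of a 60-40 coin
print(H)                                   # about 0.971 bits
print(64 / H)                              # about 65.9, so 66 flips give 64 bits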

More recently there have been some theoretical advances; see these:

http://en.wikipedia.org/wiki/Extractor
http://citeseer.ist.psu.edu/barak03true.html

so maybe that stuff can be used in implementations. I haven't looked
into it enough to know if it makes sense. Chapter 10 of Schneier and
Ferguson's "Practical Cryptography" also discusses how to write these
system-level PRNG's.

Steven D'Aprano said:
That doesn't follow. Antoon is specifically warning that /dev/urandom is
non-blocking. If you knew there was enough entropy available, you
wouldn't need /dev/random -- but how do you know there's enough entropy?

/dev/urandom can in fact screw you if you try to use it before the
system has gathered enough entropy to give good output, such as while
the system is still booting up. That is not a cryptographic algorithm
vulnerability. It could also screw you if your system is somehow
leaking information about its entropy pool (say through
electromagnetic emissions from the memory bus). That is also not a
cryptographic algorithm vulnerability. /dev/random does protect you
somewhat from both of these screws. /dev/random does NOT protect you
against sufficiently strong hypothetical attacks against the crypto
algorithm.

Overall, the situation is:

1) /dev/urandom is slightly bogus because it can produce output when
there's almost no entropy in the system, i.e. during boot or
immediately after boot. But once the system has been running
for a while, the urandom output (in a real PC) is pretty good.

2) /dev/random is somewhat more bogus because it blocks when it's
capable of producing good (computationally secure) output. Note
that /dev/random's entropy estimates could in fact be wrong--for
example, in a virtual PC (say a simulated one), there might be ZERO
actual entropy (restarting the simulation might get you back to
exactly the same state).

3) /dev/random and /dev/urandom are in principle BOTH vulnerable to
cryptographic attacks. Some people incorrectly assume that this
doesn't apply to /dev/random. So from an information-theoretic
point of view, /dev/random and /dev/urandom are both bogus, but
from a computational security point of view, the algorithms look
ok (as best as we can tell), and if anything is irreparably wrong
with them then all of cryptography is in trouble, so really neither
is bogus in this regard.

4) The man page is fairly seriously bogus because it doesn't explain
the real situation with either /dev/urandom or /dev/random.

5) Really, these modern algorithms are much harder to attack than
fans of pre-computer (e.g. WW2-era) cryptography seem to imagine.
These algorithms are very fast on modern computers but wouldn't
have been practical without computers. Computers have extended
the capabilities of both attackers and defenders, but it looks
like they've given a lot more to the defenders (see Kahn "The
Codebreakers" 2nd ed. note at the end: it says something like
"in the long-running battle between cryptographers and cryptanalysts,
it looks like the cryptographers have won").

6) For a sort of philosophical note on how cryptography fits in with
the P vs. NP problem, this article is pretty cool:

http://www.cs.ucsd.edu/users/russell/average.ps

A brief summary is at:

http://weblog.fortnow.com/2004/06/impagliazzos-five-worlds.html

Steven D'Aprano said:
For this specific application, it probably doesn't matter -- using /dev/
urandom is surely overkill, and on a single-user Linux desktop you're
unlikely to have vast numbers of applications reading /dev/urandom
without your knowledge. But why not use /dev/random? What's the downside?

/dev/random is in fact quite slow, like a few bits per second, not
practical for dealing zillions of poker hands or whatever this
application was. Once you exhaust the stored pool (a few hundred
bytes or whatever) it takes quite a long time to get any more output.
Also, it doesn't give the theoretical guarantees that some people
imagine that it does. Aside from that, it's indistinguishable from
/dev/urandom once /dev/urandom has enough entropy to get started.

The right way for /dev/*random to work is block after system boot
until there is some reasonable fixed amount of entropy (say 256 bits)
gathered, then produce unlimited output without ever blocking again.
Some ongoing folding of new entropy into the entropy pool (see the
Fortuna algorithm in Schneier and Ferguson) is a good practical
precaution, but really high security applications should consider the
whole PC memory and CPU buses to be an insecure broadcast medium
anyway, so such applications generally do their most sensitive
operations in dedicated secure hardware.

Sorry to go on for so long, but it connects to a never-ending source of
bogosity on sci.crypt (the other newsgroup I mostly hang out in),
where several times a year someone shows up selling some bogus
one-time-pad product based on supposed information-theoretic security;
they're misguided 100% of the time and get shot down, but it always
takes a while. My guess is that one of them is responsible for the
ongoing sci.crypt sporge attack.
 
