numpy performance and random numbers


Carl Johan Rehn

Dear friends,

I plan to port a Monte Carlo engine from Matlab to Python. However,
when I timed randn(N1, N2) in Python and compared it with Matlab's
randn, Matlab came out as a clear winner with a speedup of 3-4 times.
This was truly disappointing. I ran this test on a Win32 machine and
without the Atlas library.

Why is there such a large difference in speed and how can I improve
the speed of randn in Python? Any help with this matter is truly
appreciated since I really would like to get away from Matlab and move
over to Python instead.

Yours Carl
 

Steven D'Aprano

Dear friends,

I plan to port a Monte Carlo engine from Matlab to Python. However, when
I timed randn(N1, N2) in Python and compared it with Matlab's randn,

What's randn? I don't know that function. I know the randint, random, and
randrange functions, but not randn. Where does it come from?

Matlab came out as a clear winner with a speedup of 3-4 times. This was
truly disappointing. I ran this test on a Win32 machine and without the
Atlas library.

Why is there such a large difference in speed and how can I improve the
speed of randn in Python? Any help with this matter is truly appreciated
since I really would like to get away from Matlab and move over to
Python instead.

Could be many reasons. Python could be generally slower than Matlab. Your
timing code might have been faulty and you weren't comparing equal
amounts of work (if your Python code was doing four times as much work as
the Matlab code, then naturally it will be four times slower). Perhaps
the Matlab random number generator is a low-quality generator which is
fast but not very random. Python uses a very high quality RNG which is
not cheap.

But does it really matter if the RNG is slower? Your Monte Carlo engine
is a lot more than just a RNG. What matters is whether the engine as a
whole is faster or slower, not whether one small component is slower.
 

Carl Johan Rehn

What's randn? I don't know that function. I know the randint, random, and
randrange functions, but not randn. Where does it come from?



Could be many reasons. Python could be generally slower than Matlab. Your
timing code might have been faulty and you weren't comparing equal
amounts of work (if your Python code was doing four times as much work as
the Matlab code, then naturally it will be four times slower). Perhaps
the Matlab random number generator is a low-quality generator which is
fast but not very random. Python uses a very high quality RNG which is
not cheap.

But does it really matter if the RNG is slower? Your Monte Carlo engine
is a lot more than just a RNG. What matters is whether the engine as a
whole is faster or slower, not whether one small component is slower.

randn is numpy's normal (Gaussian) random number generator; for example,
randn(2, 3) returns something like
array([[-2.66435181, -0.32486419,  0.12742156],
       [-0.2387061 , -0.55894044,  1.20750493]])

Generally, at least in my MC application, I need a large number of
random numbers. Usually I execute, for example, r = randn(100, 10000)
sequentially a relatively large number of times until sufficient
accuracy has been reached. Thus, randn is in my case a mission-critical
component for obtaining an acceptable overall run time.
Matlab and numpy have (by chance?) the exact same name for the same
functionality, so I was very happy with numpy's implementation until I
timed it. So the basic question is, how can I speed up random number
generation?
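
To make the pattern concrete, here is a minimal sketch of the kind of loop
I have in mind (the payoff, which is just the block mean here, and the
convergence tolerance are placeholders):

import numpy as np

estimate = 0.0
previous = None
n_blocks = 0
while previous is None or abs(estimate - previous) > 1e-4:
    r = np.random.randn(100, 10000)                # one block of deviates
    previous = estimate
    n_blocks += 1
    estimate += (r.mean() - estimate) / n_blocks   # running mean over blocks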

Carl
 

sturlamolden

I plan to port a Monte Carlo engine from Matlab to Python. However,
when I timed randn(N1, N2) in Python and compared it with Matlab's
randn, Matlab came out as a clear winner with a speedup of 3-4 times.
This was truly disappointing. I ran this test on a Win32 machine and
without the Atlas library.

This is due to the algorithm. Matlab is using Marsaglia's ziggurat
method. It is the fastest there is for normal and gamma random
variates. NumPy uses the Mersenne Twister to produce uniform random
deviates, and then applies transcendental functions to transform them to
the normal distribution. Marsaglia's C code for the ziggurat is freely
available, so you can compile it yourself and call it from ctypes, Cython
or f2py.
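
A minimal ctypes sketch, assuming you have compiled Marsaglia's code into
a shared library (called ziggurat.so here) exposing two hypothetical
wrapper functions, zigset(seed) and rnor_fill(buf, n), where rnor_fill
writes n standard normal deviates into a C array of doubles:

import ctypes
import numpy as np

lib = ctypes.CDLL("./ziggurat.so")          # hypothetical compiled wrapper
lib.zigset.argtypes = [ctypes.c_uint]
lib.zigset.restype = None
lib.rnor_fill.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int]
lib.rnor_fill.restype = None

lib.zigset(12345)                           # seed the generator

out = np.empty(100 * 10000)                 # one randn(100, 10000) batch
lib.rnor_fill(out.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), int(out.size))
out = out.reshape(100, 10000)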

The PRNG does not use BLAS/ATLAS.
 

sturlamolden

Perhaps
the Matlab random number generator is a low-quality generator which is
fast but not very random. Python uses a very high quality RNG which is
not cheap.

Marsaglia's and Matlab's implementations of the ziggurat use a slightly
lower-quality RNG for uniform deviates than NumPy's Mersenne Twister.
But the real speed advantage comes from avoiding transcendental
functions. I have for some time thought of contributing a ziggurat
generator to NumPy, while retaining the Mersenne Twister internally,
but I have not gotten around to doing it.
 

sturlamolden

Matlab and numpy have (by chance?) the exact same name for the same
functionality,

Common ancestry: NumPy and Matlab borrowed the name from IDL.


LabVIEW, Octave and SciLab use the name randn as well.

So the basic question is, how can I speed up random number
generation?

The obvious thing would be to compile ziggurat yourself, and turn on
optimization flags for your hardware.
http://www.jstatsoft.org/v05/i08/


P.S. Be careful if you consider using more than one processor.
Multithreading is a very difficult issue with PRNGs, because it is
difficult to guarantee they are truly independent. You can use a
producer-consumer pattern, though: one thread constantly producing
random numbers (writing into a buffer or pipe) and one or more other
threads consuming them.
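
A rough sketch of that pattern (block size, queue depth and thread count
are arbitrary; the Queue module is named queue in Python 3):

import threading
import Queue                               # named queue in Python 3
import numpy as np

block_queue = Queue.Queue(maxsize=8)       # bounded, so the producer cannot run far ahead

def producer():
    while True:
        # put() blocks when the queue is full and resumes when the consumer catches up
        block_queue.put(np.random.randn(100, 10000))

t = threading.Thread(target=producer)
t.setDaemon(True)                          # do not keep the process alive for the producer
t.start()

for i in xrange(1000):                     # consumer: one block per Monte Carlo sweep
    r = block_queue.get()
    # ... use r in the simulation ...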
 

Carl Johan Rehn

This is due to the algorithm. Matlab is using Marsaglia's ziggurat
method. It is the fastest there is for normal and gamma random
variates. NumPy uses the Mersenne Twister to produce uniform random
deviates, and then applies transcendental functions to transform them to
the normal distribution. Marsaglia's C code for the ziggurat is freely
available, so you can compile it yourself and call it from ctypes, Cython
or f2py.

The PRNG does not use BLAS/ATLAS.

Thank you, this was very informative. I know about the Mersenne
Twister, but had no idea about Marsaglia's ziggurat method. I will
definitely try f2py or Cython.

Well, I guess I knew that random numbers were not handled by BLAS/ATLAS,
but wasn't 100% sure.

Carl
 

Carl Johan Rehn

Common ancestry: NumPy and Matlab borrowed the name from IDL.

LabVIEW, Octave and SciLab use the name randn as well.


The obvious thing would be to compile ziggurat yourself, and turn on
optimization flags for your hardware. http://www.jstatsoft.org/v05/i08/

P.S. Be careful if you consider using more than one processor.
Multithreading is a very difficult issue with PRNGs, because it is
difficult to guarantee they are truly independent. You can use a
producer-consumer pattern, though: one thread constantly producing
random numbers (writing into a buffer or pipe) and one or more other
threads consuming them.

How about multi-core or (perhaps more exciting) GPU and CUDA? I must
admit that I am extremely interested in trying the CUDA alternative.

Obviously, cuBLAS is not an option here, so what is the safest route
for a novice parallel programmer?

Carl
 

sturlamolden

How about multi-core or (perhaps more exciting) GPU and CUDA? I must
admit that I am extremely interested in trying the CUDA alternative.

Obviously, cuBLAS is not an option here, so what is the safest route
for a novice parallel programmer?

The problem with PRNGs is that they are iterative in nature, and
maintain global state. They are therefore very hard to vectorize. A
GPU will not help. The GPU has hundreds of computational cores that
can run kernels, but you only get to utilize one.

Parallel PRNGs are an unsolved problem in computer science.
 

Carl Johan Rehn

The problem with PRNGs is that they are iterative in nature, and
maintain global state. They are therefore very hard to vectorize. A
GPU will not help. The GPU has hundreds of computational cores that
can run kernels, but you only get to utilize one.

Parallel PRNGs are an unsolved problem in computer science.

Well, in Matlab I used "tic; for i = 1:1000, randn(100, 10000), end;
toc" and in IPython I used a similar construct but with "time" instead
of tic/toc.


Thanks again for sharing your knowledge. I had no idea. This means
that if I want to speed up my application I have to go for the fastest
random generator and focus on other parts of my code that can be
vectorized.

Carl
 

Lie Ryan

Well, in Matlab I used "tic; for i = 1:1000, randn(100, 10000), end;
toc" and in IPython I used a similar construct but with "time" instead
of tic/toc.
Code?


Thanks again for sharing your knowledge. I had no idea. This means
that if I want to speed up my application I have to go for the fastest
random generator and focus on other parts of my code that can be
vectorized.

If you don't care about "repeatability" (which is already extremely
difficult in parallel processing even without random number generators),
you can just start two PRNGs at two distinct states (and probably from
two different algorithms) and they will each spew out an independent
stream of random numbers. What was "unsolved" was the "pseudo-" part of
the random number generation, which guarantees perfect replayability in
all conditions.
 

sturlamolden

you can just start two PRNGs at two distinct states

No. You have to know for certain that the outputs don't overlap.

If you pick two random states (using any PRNG), you need error-
checking that states are always unique, i.e. that each PRNG never
reaches the starting state of the other(s). If I had to do this, I
would e.g. hash the state to a 32 bit integer. You can deal with
errors by jumping to a different state or simply aborting the
simulation. The problem is that the book-keeping will be costly
compared to the work involved in generating a single pseudorandom
deviate. So while we can construct a parallel PRNG this way, it will
likely run slower than one unique PRNG.

It has been suggested to use short-period MTs with different
characteristic polynomials. This is also available for the nVidia GPU
using CUDA. This is probably the generator I would pick if I wanted to
produce random numbers in parallel. However, here we should be aware
that the math has not been thoroughly proven.

http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dgene.pdf
http://developer.download.nvidia.co...jects/MersenneTwister/doc/MersenneTwister.pdf

There is also a version of MT that uses SIMD instructions internally,
which gives a factor-of-two speedup.

I would never use different algorithms. It will introduce a systematic
difference in the simulation.
 

sturlamolden

If you pick two random states (using any PRNG), you need error-
checking that states are always unique, i.e. that each PRNG never
reaches the starting state of the other(s).

Another note on this:

Ideally, we would e.g. know how to find (analytically) MT states that
are very far apart. But to my knowledge no such equation has been
derived.

But often in Monte Carlo simulations, the PRNG is not the dominant
computational bottleneck. So we can simply start N PRNGs from N
consecutive states, and for each PRNG only use every N-th pseudorandom
deviate.
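
A toy numpy illustration of the idea (a single underlying stream dealt out
round-robin, so stream k effectively sees every N-th deviate; the seed and
sizes are arbitrary):

import numpy as np

N = 4                                     # number of logical parallel streams
rng = np.random.RandomState(12345)        # one underlying PRNG
stream = rng.randn(N * 1000)              # one long sequence of deviates

# stream k gets deviates k, k+N, k+2N, ...
substreams = [stream[k::N] for k in range(N)]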
 

Carl Johan Rehn

Another note on this:

Ideally, we would e.g. know how to find (analytically) MT states that
are very far apart. But to my knowledge no such equation has been
derived.

But often in Monte Carlo simulations, the PRNG is not the dominant
computational bottleneck. So we can simply start N PRNGs from N
consecutive states, and for each PRNG only use every N-th pseudorandom
deviate.

Thank you for pointing me to the short-period MT reference and
especially the reference on the CUDA version of parallel MT (even
though I wish the author had included a benchmark comparison in the
report). This is a very interesting topic. I agree that it may work to
start PRNGs at distinct states, but the bookkeeping may slow down the
algorithm so that it is not worth the effort. However, the CUDA version
sounds interesting and should be easy enough to use in a practical
application.

Carl
 

Gregory Ewing

Lie said:
If you don't care about "repeatability" (which is already extremely
difficult in parallel processing even without random number generators),
you can just start two PRNGs at two distinct states (and probably from
two different algorithms)

There's no need for different algorithms, and no problem
with repeatability. There exist algorithms with a very
long period (on the order of 2^100 or more) for which it's
easy to calculate seeds for a set of guaranteed non-
overlapping subsequences.

http://or.journal.informs.org/cgi/content/abstract/44/5/816
http://or.journal.informs.org/cgi/content/abstract/47/1/159

In these kinds of generators, moving from one state to
the next involves multiplying a state vector by a matrix
using modulo arithmetic. So you can jump ahead N steps
by raising the matrix to the power of N.
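
A toy sketch of that jump-ahead computation (the 2x2 transition matrix and
modulus below are invented for illustration, not taken from any published
generator):

import numpy as np

m = 2147483647                                    # modulus, 2**31 - 1
A = np.array([[0, 1],
              [64886, 177786]], dtype=object)     # made-up transition matrix
                                                  # (object dtype keeps exact integers)

def mat_pow_mod(M, n, m):
    # binary exponentiation: M**n mod m in O(log n) matrix multiplications
    result = np.eye(M.shape[0], dtype=object)
    while n:
        if n & 1:
            result = np.dot(result, M) % m
        M = np.dot(M, M) % m
        n >>= 1
    return result

s0 = np.array([12345, 67890], dtype=object)            # current state vector
s_jumped = np.dot(mat_pow_mod(A, 10**6, m), s0) % m    # state 10**6 steps ahead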
 

Steven D'Aprano

Well, in Matlab I used "tic; for i = 1:1000, randn(100, 10000), end;
toc" and in IPython I used a similar construct but with "time" instead
of tic/toc.


I don't know if this will make any significant difference, but for the
record that is not the optimal way to time a function in Python due to
the overhead of creating the loop sequence.

In Python, the typical for-loop construct is something like this:

for i in range(1, 1000):

but range isn't a funny way of writing "loop over these values", it
actually creates a list of integers, which has a measurable cost. In
Python 2.x you can reduce that cost by using xrange instead of range, but
it's even better to pre-compute the list outside of the timing code. Even
better still is to use a loop sequence like [None]*1000 (precomputed
outside of the timing code naturally!) to minimize the cost of memory
accesses.

The best way to perform timings of small code snippets in Python is using
the timeit module, which does all these things for you. Something like
this:

from timeit import Timer
t = Timer('randn(100, 10000)', 'from numpy.random import randn')
print min(t.repeat())

This will return the time taken by one million calls to randn (for a call
this expensive you would pass a smaller number argument to repeat()).
Because of
the nature of modern operating systems, any one individual timing can be
seriously impacted by other processes, so it does three independent
timings and returns the lowest. And it will automatically pick the best
timer to use according to your operating system (either clock, which is
best on Windows, or time, which is better on Posix systems).
 

Lie Ryan

No. You have to know for certain that the outputs don't overlap.

Not necessarily, you only need to be certain that the two streams don't
overlap in any reasonable amount of time. For that purpose, you can use
a PRNG that has an extremely long period, like the Mersenne Twister, and
put the generators in very distant states.
If you pick two random states (using any PRNG), you need error-
checking that states are always unique, i.e. that each PRNG never
reaches the starting state of the other(s). If I had to do this, I
would e.g. hash the state to a 32 bit integer. You can deal with
errors by jumping to a different state or simply aborting the
simulation. The problem is that the book-keeping will be costly
compared to the work involved in generating a single pseudorandom
deviate. So while we can construct a parallel PRNG this way, it will
likely run slower than one unique PRNG.

You will need some logic to ensure that your generators' states are
distant enough, but that won't incur any generation overhead. Of course
this relies on the very large period of the PRNG algorithm.
 

Steven D'Aprano

No. You have to know for certain that the outputs don't overlap.

"For certain"? Why?

Presumably you never do a Monte Carlo simulation once, you always repeat
the simulation. Does it matter if (say) one trial in fifty is lower-
quality than the rest because the outputs happened to overlap? What if
it's one trial in fifty thousand? So long as the probability of overlap
is sufficiently small, you might not even care about doing repeated
simulations.

If you pick two random states (using any PRNG), you need error- checking
that states are always unique, i.e. that each PRNG never reaches the
starting state of the other(s). If I had to do this, I would e.g. hash
the state to a 32 bit integer. You can deal with errors by jumping to a
different state or simply aborting the simulation. The problem is that
the book-keeping will be costly compared to the work involved in
generating a single pseudorandom deviate. So while we can construct a
parallel PRNG this way, it will likely run slower than one unique PRNG.

Since truly random sequences sometimes produce repeating sequences, you
should be able to tolerate short periods of overlap. I'd consider the
following strategy: for each of your parallel PRNGs other than the first,
periodically jump to another state unconditionally. The periods should be
relatively prime to each other, e.g. the second PRNG jumps ahead after
(say) 51 calls, the third after 37 calls, etc. (the exact periods
shouldn't be important). Use a second, cheap PRNG to specify how far to
jump ahead. The overhead should be quite small: a simple counter and test
for each PRNG, plus a small number of calls to a cheap PRNG, and
statistically you should expect no significant overlap.

Is this workable?
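
A rough sketch of the book-keeping, using numpy's RandomState and faking
the jump by drawing and discarding values (a real implementation would
jump analytically); seeds, periods and jump sizes are arbitrary:

import numpy as np

class JumpingStream(object):
    def __init__(self, seed, period):
        self.rng = np.random.RandomState(seed)        # main PRNG for this stream
        self.cheap = np.random.RandomState(seed + 1)  # cheap PRNG choosing jump sizes
        self.period = period                          # calls between unconditional jumps
        self.calls = 0

    def randn(self, *shape):
        self.calls += 1
        if self.period and self.calls % self.period == 0:
            skip = int(self.cheap.randint(1, 10000))  # how far to jump ahead
            self.rng.randn(skip)                      # discard values to fake the jump
        return self.rng.randn(*shape)

# the first stream never jumps; the others jump after relatively prime call counts
streams = [JumpingStream(seed=s, period=p) for s, p in [(1, 0), (2, 51), (3, 37)]]
blocks = [s.randn(100, 10000) for s in streams]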
 

sturlamolden

Not necessarily, you only need to be certain that the two streams don't
overlap in any reasonable amount of time. For that purpose, you can use
a PRNG that has an extremely long period, like the Mersenne Twister, and
put the generators in very distant states.

Except there is no way to find two very distant states and prove they
are distant enough.
 
