performance of script to write very long lines of random chars

gry

Dear pythonistas,
I am writing a tiny utility to produce a file consisting of a
specified number of lines of a given length of random ascii
characters. I am hoping to find a more time and memory efficient way,
that is still fairly simple, clear, and _pythonic_.

I would like to have something that I can use at both extremes of
data:

32M chars per line * 100 lines
or
5 chars per line * 1e8 lines.

E.g., the output of bigrand.py for 10 characters, 2 lines might be:

gw2+M/5t&.
S[[db/l?Vx

I'm using Python 2.7.0 on Linux. I need to use only out-of-the-box
modules, since this has to work on a bunch of different computers.
At this point I'm especially concerned with the case of a few very
long lines, since that seems to use a lot of memory, and take a long
time.
Characters are a slight subset of the printable ASCII characters,
specified in the examples below. My first naive try was:

from sys import stdout
import random

nchars = 32000000
rows = 10
avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'

def make_varchar(nchars):
    return ''.join([random.choice(avail_chrs) for i in range(nchars)])

for l in range(rows):
    stdout.write(make_varchar(nchars))
    stdout.write('\n')

This version used around 1.2GB resident/1.2GB virtual of memory for
3min 38sec.


My second try uses much less RAM, but more CPU time, and seems rather,
umm, un-pythonic (the array module always seems a little
un-pythonic...)

from sys import stdout
from array import array
import random

nchars = 32000000
rows = 10
avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'
a = array('c', 'X' * nchars)

for l in range(rows):
    for i in xrange(nchars):
        a[i] = random.choice(avail_chrs)
    a.tofile(stdout)
    stdout.write('\n')

This version using array took 4 min, 29 sec, using 34MB resident/110MB
virtual. So, much smaller than the first attempt, but a bit slower.
Can someone suggest better code? And help me understand the
performance issues here?

-- George
 
Chris Angelico

avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'

Is this exact set of characters a requirement? For instance, would it
be acceptable to instead use this set of characters?

avail_chrs = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

Your alphabet has 92 characters, this one only 64... the advantage is
that it's really easy to work with a 64-character set; in fact, for
this specific set, it's the standard called Base 64, and Python
already has a module for working with it. All you need is a random
stream of eight-bit characters, which can be provided by os.urandom().

So here's a much simpler version of your program, following the
cut-down character set I offer:

import os
import base64

nchars = 32000000
rows = 10
# Note: If nchars is one higher than a multiple of 4 (eg 5, 9, 101),
# the lines will be one character short (4, 8, 100).
nchars = nchars * 3 // 4
for l in range(rows):
    print(base64.b64encode(os.urandom(nchars)).strip(b'='))


If you can guarantee that your nchars will always be a multiple of 4,
you can drop the .strip() call.

This is going to be *immensely* faster than calling random.choice()
for every character, but it depends on a working os.urandom (it'll
raise NotImplementedError if there's no suitable source). I know it's
available on OS/2, Windows, and Linux, but don't have others handy to
test. If by "a bunch of different computers" you mean exclusively
Linux computers, this should be fine.
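If the line length can be anything at all, one way to sidestep the
off-by-one arithmetic (my own sketch, not from the thread; `random_line`
is a hypothetical helper name) is to draw slightly more bytes than
needed and slice the encoded output down to the exact length:

```python
import base64
import os

def random_line(nchars):
    # Each 3 urandom bytes become 4 Base64 characters, so draw
    # ceil(nchars * 3 / 4) bytes and slice the encoding down to
    # exactly nchars characters. The slice also discards any '='
    # padding, which only ever appears at the end.
    nbytes = (nchars * 3 + 3) // 4
    return base64.b64encode(os.urandom(nbytes))[:nchars]
```

This trades a byte or two of extra urandom per line for never having to
care whether nchars is a multiple of 4.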

ChrisA
 
Michael Torrie

from sys import stdout
from array import array
import random

nchars = 32000000
rows = 10
avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'
a = array('c', 'X' * nchars)

for l in range(rows):
    for i in xrange(nchars):
        a[i] = random.choice(avail_chrs)
    a.tofile(stdout)
    stdout.write('\n')

This version using array took 4 min, 29 sec, using 34MB resident/110MB
virtual. So, much smaller than the first attempt, but a bit slower.
Can someone suggest better code? And help me understand the
performance issues here?


Why are you using an array? Why not just rely on the OS to buffer the
output? Just write your characters straight to stdout instead of
placing them in an array.

At that point I believe this program will be as fast as is possible in
Python.
 
gry

from sys import stdout
from array import array
import random
nchars = 32000000
rows = 10
avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'
a = array('c', 'X' * nchars)
for l in range(rows):
    for i in xrange(nchars):
        a[i] = random.choice(avail_chrs)
    a.tofile(stdout)
    stdout.write('\n')

This version using array took 4 min, 29 sec, using 34MB resident/110MB
virtual. So, much smaller than the first attempt, but a bit slower.
Can someone suggest better code? And help me understand the
performance issues here?

Why are you using an array? Why not just rely on the OS to buffer the
output? Just write your characters straight to stdout instead of
placing them in an array.

At that point I believe this program will be as fast as is possible in
Python.


Appealing idea, but it's slower than the array solution: 5 min 13 sec
vs. 4 min 30 sec for the array:

for l in range(rows):
    for i in xrange(nchars):
        stdout.write(random.choice(avail_chrs))
    stdout.write('\n')


os.urandom does look promising -- I have to have full control over the
charset, but urandom is very fast at generating big random strings...
stay tuned...
 
MRAB

Dear pythonistas,
I am writing a tiny utility to produce a file consisting of a
specified number of lines of a given length of random ascii
characters. I am hoping to find a more time and memory efficient way,
that is still fairly simple, clear, and _pythonic_.

I would like to have something that I can use at both extremes of
data:

32M chars per line * 100 lines
or
5 chars per line * 1e8 lines.

E.g., the output of bigrand.py for 10 characters, 2 lines might be:

gw2+M/5t&.
S[[db/l?Vx

I'm using Python 2.7.0 on Linux. I need to use only out-of-the-box
modules, since this has to work on a bunch of different computers.
At this point I'm especially concerned with the case of a few very
long lines, since that seems to use a lot of memory, and take a long
time.
Characters are a slight subset of the printable ASCII characters,
specified in the examples below. My first naive try was:

from sys import stdout
import random

nchars = 32000000
rows = 10
avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'

def make_varchar(nchars):
    return ''.join([random.choice(avail_chrs) for i in range(nchars)])

for l in range(rows):
    stdout.write(make_varchar(nchars))
    stdout.write('\n')

This version used around 1.2GB resident/1.2GB virtual of memory for
3min 38sec.


My second try uses much less RAM, but more CPU time, and seems rather,
umm, un-pythonic (the array module always seems a little
un-pythonic...)

from sys import stdout
from array import array
import random

nchars = 32000000
rows = 10
avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'
a = array('c', 'X' * nchars)

for l in range(rows):
    for i in xrange(nchars):
        a[i] = random.choice(avail_chrs)
    a.tofile(stdout)
    stdout.write('\n')

This version using array took 4 min, 29 sec, using 34MB resident/110MB
virtual. So, much smaller than the first attempt, but a bit slower.
Can someone suggest better code? And help me understand the
performance issues here?

Names in the global scope are stored in a dict, but local to a function
are stored in slots and can be accessed more quickly.

'avail_chrs' and 'random.choice' are referred to many times, so making
'avail_chrs' local and making a local reference to 'random.choice' will
help.


from sys import stdout
from array import array
import random

def generate():
    avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'
    rnd = random.choice

    for l in range(rows):
        stdout.write(''.join([rnd(avail_chrs) for i in xrange(nchars)]))
        stdout.write('\n')

nchars = 32000000
rows = 10
generate()
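The attribute-lookup half of that effect is easy to see with `timeit`
(my own sketch, not code from the thread): binding `random.choice` to a
plain name once, up front, skips the attribute lookup on every call.

```python
import timeit

setup = "import random; avail_chrs = 'abcdefghij'"
# Look up random.choice (a dict lookup on the module) on every call...
g = timeit.timeit("random.choice(avail_chrs)", setup=setup, number=200000)
# ...versus binding it to a name once and calling through that name.
l = timeit.timeit("rc(avail_chrs)", setup=setup + "; rc = random.choice",
                  number=200000)
print("attribute lookup: %.3fs  bound name: %.3fs" % (g, l))
```

The absolute numbers depend on the machine, but the bound-name version
is consistently the cheaper of the two.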
 
Chris Angelico

Appealing idea, but it's slower than the array solution: 5 min 13 sec
vs. 4 min 30 sec for the array:

for l in range(rows):
    for i in xrange(nchars):
        stdout.write(random.choice(avail_chrs))
    stdout.write('\n')


os.urandom does look promising -- I have to have full control over the
charset, but urandom is very fast at generating big random strings...
stay tuned...

Without actually profiling it, my first guess would be that calling
random.choice() for every character is an optimization target. (NOTE:
Do profile it, if the urandom method isn't sufficient straight-off.)
You may want to consider, for instance, generating larger random
numbers and doing some kind of translation on them - which is
fundamentally what the urandom/b64encode method is doing.
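As a sketch of that idea (my own illustration, not code from the
thread): draw one big integer with `random.getrandbits` and peel it
apart into base-92 digits, so the full 92-character alphabet survives
and `random.choice` is never called per character. The digits are not
perfectly uniform, because 2**(8*n) is not an exact power of 92, but
the bias shrinks exponentially with the chunk size and is negligible
here.

```python
import random

AVAIL_CHRS = ('0123456789abcdefghijklmnopqrstuvwxyz'
              'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
              '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}')

def random_chunk(n, chars=AVAIL_CHRS):
    # One call for 8*n random bits, then repeated divmod to read the
    # number back out as base-len(chars) digits; each digit indexes
    # one output character.
    base = len(chars)
    big = random.getrandbits(8 * n)
    out = []
    for _ in range(n):
        big, idx = divmod(big, base)
        out.append(chars[idx])
    return ''.join(out)
```

This replaces n calls to random.choice() with one call to getrandbits()
plus cheap integer arithmetic.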

ChrisA
 
Steven D'Aprano

avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'

Is this exact set of characters a requirement? For instance, would it be
acceptable to instead use this set of characters?

avail_chrs = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

Your alphabet has 92 characters, this one only 64... the advantage is
that it's really easy to work with a 64-character set; in fact, for this
specific set, it's the standard called Base 64, and Python already has a
module for working with it. All you need is a random stream of eight-bit
characters, which can be provided by os.urandom().

I was originally going to write that using the base64 module would
introduce bias into the random strings, but after a little investigation,
I don't think it does.

Or at least, if it does, it's a fairly subtle bias, and not detectable by
the simple technique I used: inspect the mean, and the mean deviation
from the mean.


from os import urandom
from base64 import b64encode

data = urandom(1000000)
m = sum(data)/len(data)
md = sum(abs(v - m) for v in data)/len(data)
print("Mean and mean deviation of urandom:", m, md)

encoded = b64encode(data).strip(b'=')
chars = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef'
         b'ghijklmnopqrstuvwxyz0123456789+/')
values = [chars.index(v) for v in encoded]
m = sum(values)/len(values)
md = sum(abs(v - m) for v in values)/len(values)
print("Mean and mean deviation of encoded data:", m, md)



When I run this, it prints:


Mean and mean deviation of urandom: 127.451652 63.95331188965717
Mean and mean deviation of encoded data: 31.477027511486245 15.991177272527072

I would expect 127/64 and 32/16, so we're pretty close. That's not to say
that there aren't any other biases or correlations in the encoded data,
but after a simplistic test, it looks okay to me.
 
Chris Angelico

I was originally going to write that using the base64 module would
introduce bias into the random strings, but after a little investigation,
I don't think it does.

Assuming that os.urandom() returns bytes with perfectly fair
distribution (exactly equal chance of any value 00-FF - it probably
does, or close to it), and assuming that you work with exact multiples
of 3 bytes and 4 output characters, base64 will give you perfectly
fair distribution of result characters. You take three bytes (24 bits)
and turn them into four characters (6 bits per character, = 24 bits).
You might see some bias if you use less than a full set of four output
characters, though; I haven't dug into the details to check that.
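A quick empirical check of that argument (my own sketch, not from the
thread): encode a random blob whose length is a multiple of 3, so there
is no padding at all, and count how often each Base64 character
appears.

```python
import base64
import os
from collections import Counter

# 300000 random bytes -> exactly 400000 Base64 characters, with no
# '=' padding because the input length is a multiple of 3.
data = os.urandom(3 * 100000)
counts = Counter(base64.b64encode(data).decode('ascii'))

# All 64 alphabet characters should appear, each about 6250 times;
# the spread should stay well inside a few standard deviations.
assert len(counts) == 64
assert all(abs(c - 6250) < 1000 for c in counts.values())
print("min %d / max %d occurrences" % (min(counts.values()), max(counts.values())))
```

With 24 bits in and 24 bits out per 3-byte group, the counts come out
flat, which is exactly what the bit-counting argument predicts.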

ChrisA
 
Steven D'Aprano

Dear pythonistas,
I am writing a tiny utility to produce a file consisting of a
specified number of lines of a given length of random ascii characters.
I am hoping to find a more time and memory efficient way, that is still
fairly simple, clear, and _pythonic_.

Here's another option: use string.translate to map random bytes to the
printable ASCII bytes.


import string

all_bytes = ''.join(map(chr, range(256)))
printable = all_bytes[33:127]
n = 127 + len(printable)
extras = all_bytes[127:n]
_deletions = all_bytes[:33] + all_bytes[n:]
_table = string.maketrans(extras, printable)


def random_chars(length, size=1000):
    # Make local bindings for speed.
    import string, os
    table, deletions = _table, _deletions
    # Generate random strings.
    buffer = []
    while True:
        while len(buffer) < length:
            bytes = string.translate(os.urandom(size), table, deletions)
            buffer.extend(bytes)
        yield ''.join(buffer[:length])
        buffer = buffer[length:]



Then use it like this:

# I want seven lines of twelve char strings.
make = random_chars(12)
for i in range(7):
    print next(make)



One thing to be aware of: urandom may run out of entropy, and then it
will slow down a lot. If you don't care about cryptographic randomness,
you could use this instead:

import random
def myrandom(n):
    return ''.join(chr(random.randint(0, 255)) for i in xrange(n))


although that will actually be slower unless urandom has run out of
entropy.
 
Oscar Benjamin

Dear pythonistas,
I am writing a tiny utility to produce a file consisting of a
specified number of lines of a given length of random ascii
characters. I am hoping to find a more time and memory efficient way,
that is still fairly simple, clear, and _pythonic_.

I would like to have something that I can use at both extremes of
data:

32M chars per line * 100 lines
or
5 chars per line * 1e8 lines.

I would definitely use numpy for this. The script below seems to be
io-bound on my machine:

#!/usr/bin/env python

from sys import stdout
import numpy as np
from numpy import random

CHARS = (
    '0123456789'
    'abcdefghijklmnopqrstuvwxyz'
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    '!#$%& \'()*+,-./:;<=>?@[\\]^_`{}'
)

ARRAY_CHARS = np.frombuffer(CHARS, np.uint8)
NCHARS = len(CHARS)

CHUNK_SIZE = 4096
NCOLS = 32000000
NROWS = 10

def chunk_sizes(total, chunk_size):
    numchunks, remainder = divmod(total, chunk_size)
    for n in range(numchunks):
        yield chunk_size
    if remainder:
        yield remainder

def chunks():
    bytes_per_line = NCOLS + 1
    total_bytes = bytes_per_line * NROWS
    newline_index = NCOLS
    newline = ord('\n')
    for size in chunk_sizes(total_bytes, CHUNK_SIZE):
        chars = ARRAY_CHARS[random.randint(0, NCHARS, size)]
        chars[newline_index::bytes_per_line] = newline
        newline_index = (newline_index - CHUNK_SIZE) % bytes_per_line
        yield chars

for chunk in chunks():
    chunk.tofile(stdout)
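The subtle part of the script above is the newline bookkeeping: the
output is treated as one flat stream of (NCOLS + 1)-byte records, and
newline_index tracks where the next newline falls inside the current
chunk. A pure-Python check of that arithmetic with toy sizes (my own
sketch, numpy-free):

```python
# Toy sizes: 5 lines of 7 chars, emitted in 4-byte chunks.
NCOLS, NROWS, CHUNK_SIZE = 7, 5, 4
bytes_per_line = NCOLS + 1

stream = []
newline_index = NCOLS
remaining = bytes_per_line * NROWS
while remaining:
    size = min(CHUNK_SIZE, remaining)
    chunk = ['x'] * size
    # Place a newline at every in-chunk position that is congruent to
    # newline_index modulo bytes_per_line.
    for i in range(newline_index, size, bytes_per_line):
        chunk[i] = '\n'
    # The next chunk starts `size` bytes later in the stream, so shift
    # the newline offset back by that much (mod the record length).
    newline_index = (newline_index - size) % bytes_per_line
    stream.extend(chunk)
    remaining -= size

lines = ''.join(stream).split('\n')
assert lines == ['x' * NCOLS] * NROWS + ['']
```

Every record comes out exactly NCOLS characters wide with its newline
in the right place, regardless of where the chunk boundaries fall.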
 
Oscar Benjamin

One thing to be aware of: urandom may run out of entropy, and then it
will slow down a lot. If you don't care about cryptographic randomness,
you could use this instead:

Reading this I'm realising that I don't really know what os.urandom
is. How exactly is it generating random numbers and what do you mean
by it running out of entropy?


Oscar
 
Steven D'Aprano

Reading this I'm realising that I don't really know what os.urandom is.
How exactly is it generating random numbers and what do you mean by it
running out of entropy?

I am not an expert, but here goes...

Some (most?) modern operating systems provide a cryptographically strong
source of non-deterministic randomness. The non-deterministic part comes
from external "stuff", which is called "entropy". Typical sources of
entropy include network events, user key-presses, moving the mouse, and
(presumably in machines with special hardware), even thermal noise in
electrical components.

If the OS hasn't collected enough entropy, urandom can block until it
has. This can be a problem, e.g. I've experienced issues where scripts
relying indirectly on urandom that run at system reboot can block for
minutes at a time, waiting for entropy. If those scripts run before the
networking software, and before any users log in and start running apps,
the script can block forever waiting for entropy that never arrives.

(Where "forever" == "70 minute boot times".)

Entropy is used and discarded, so urandom needs the OS to continually
replenish the amount of entropy. Under normal circumstances it does,
but if you grab lots of urandom output on a system which is otherwise
quiet and not doing anything, it could run out.

If I've got any of this wrong, corrections will be gratefully acceptable.
 
Robert Kern

I am not an expert, but here goes...

Some (most?) modern operating systems provide a cryptographically strong
source of non-deterministic randomness. The non-deterministic part comes
from external "stuff", which is called "entropy". Typical sources of
entropy include network events, user key-presses, moving the mouse, and
(presumably in machines with special hardware), even thermal noise in
electrical components.

If the OS hasn't collected enough entropy, urandom can block until it
has. This can be a problem, e.g. I've experienced issues where scripts
relying indirectly on urandom that run at system reboot can block for
minutes at a time, waiting for entropy. If those scripts run before the
networking software, and before any users log in and start running apps,
the script can block forever waiting for entropy that never arrives.

(Where "forever" == "70 minute boot times".)

Entropy is used and discarded, so urandom needs the OS to continually
replenish the amount of entropy. Under normal circumstances it does,
but if you grab lots of urandom output on a system which is otherwise
quiet and not doing anything, it could run out.

If I've got any of this wrong, corrections will be gratefully acceptable.

Just one important thing: os.urandom() does not block to wait for more entropy.
Only reading /dev/random does.

http://en.wikipedia.org/wiki//dev/random

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Oscar Benjamin

Some (most?) modern operating systems provide a cryptographically strong
source of non-deterministic randomness. The non-deterministic part comes
from external "stuff", which is called "entropy". Typical sources of
entropy include network events, user key-presses, moving the mouse, and
(presumably in machines with special hardware), even thermal noise in
electrical components.
Entropy is used and discarded, so urandom needs the OS to continually
replenish the amount of entropy. Under normal circumstances it does,
but if you grab lots of urandom output on a system which is otherwise
quiet and not doing anything, it could run out.

Okay, so I understand what entropy is in the thermodynamic sense and
also in the mathematical (Shannon) sense but I'm still confused about
what it means that the OS is somehow storing entropy. Do you mean that
it is always maintaining a buffer of what it considers to be random
bytes that it slowly builds up from noise that is made accessible to
the OS from the hardware?


Oscar
 
Robert Kern

Okay, so I understand what entropy is in the thermodynamic sense and
also in the mathematical (Shannon) sense but I'm still confused about
what it means that the OS is somehow storing entropy. Do you mean that
it is always maintaining a buffer of what it considers to be random
bytes that it slowly builds up from noise that is made accessible to
the OS from the hardware?

Yes.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Chris Angelico

Okay, so I understand what entropy is in the thermodynamic sense and
also in the mathematical (Shannon) sense but I'm still confused about
what it means that the OS is somehow storing entropy. Do you mean that
it is always maintaining a buffer of what it considers to be random
bytes that it slowly builds up from noise that is made accessible to
the OS from the hardware?

Correct. And Steven's right about most of what he says (modulo the
urandom vs random distinction, as Robert Kern pointed out - urandom
won't block, but it's not guaranteed to be cryptographically secure);
I'll just add that one of the best sources of entropy is a solid
cylinder, rotated at high velocity in a sealed container filled with a
fluid, and entropy is found in the eddies. Many computers have a
device of this nature - the solid cylinder is thin and flat and
referred to as a "disk", the fluid it's in is air, and the sealed
container is your hard disk drive.

The details will vary, but broadly speaking, the /dev/random driver
(or its equivalent) maintains an ever-increasing buffer of entropic
bits, accumulated as they arrive from the various sources, and often
saved to disk on shutdown to permit faster boot (which helps to avoid
the problem Steven described of 70-minute boot times - on an all-SSD
computer with no human being attached, entropy really can be very hard
to obtain); whenever a program asks for bytes from it, it delivers
them and removes that much "recorded entropy" from its buffer. For
many purposes, it's sufficient to take 4 or 8 bytes of /dev/random
entropy and use that to seed a PRNG, but if you're using /dev/urandom
and it's not a critical server, I wouldn't worry too much about
drawing too much off it. (On a web server that's constantly serving
HTTPS requests, for instance, I'd be cautious about reading too much
from /dev/urandom as it might cause the web server to block waiting
for /dev/random. Might kill your TPS for a while.)

ChrisA
 
