shuffle the lines of a large file


Simon Brunning

There. Factor 10. That's what I call optimization...

The simplest approach is even faster:

C:\>python -m timeit -s "from itertools import repeat" "[None for i in
range(10000)]"
100 loops, best of 3: 2.53 msec per loop

C:\>python -m timeit -s "from itertools import repeat" "[None for i in
xrange(10000)]"
100 loops, best of 3: 2.45 msec per loop

C:\>python -m timeit -s "from itertools import repeat"
"list(repeat(None, 10000))"
1000 loops, best of 3: 245 usec per loop

C:\>python -m timeit -s "from itertools import repeat" "[None] * 10000"
10000 loops, best of 3: 160 usec per loop

;-)

BTW, I've posted a more general version here:
http://www.brunningonline.net/simon/blog/archives/001784.html
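
For anyone re-running these timings on Python 3 (where xrange is gone and range is lazy), the same comparison can be made with the stdlib timeit module instead of the command line. A quick sketch; the variable names are mine:

```python
import timeit
from itertools import repeat

n = 10000

# All four spellings build the same 10000-element list of Nones.
a = [None for i in range(n)]
b = list(repeat(None, n))
c = [None] * n
assert a == b == c

# [None] * n skips the per-item Python-level loop entirely, which is
# why it comes out fastest in Simon's timings above.
t_comp = timeit.timeit("[None for i in range(10000)]", number=200)
t_mult = timeit.timeit("[None] * 10000", number=200)
print("list comp: %.4f s   [None] * n: %.4f s" % (t_comp, t_mult))
```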
 

Heiko Wundram

> Ah, but that's the clever bit; it *doesn't* store the whole list -
> only the selected lines.

But that means it'll only ever read a few lines from the file, never do a
shuffle of the whole file content... If you wanted to shuffle the whole file,
you'd have to set lines=1 and throw away repeated lines on subsequent runs,
or set lines higher and somehow deal with the duplicates in the result
(throw away the repeats... :). However you do it, you'd have to remember
which lines you've already read (selected_lines). And in any case you'd need
a line cache of 10^9 entries for this amount of data...

That's just what I wanted to say...

--
--- Heiko.

 

Simon Brunning

> But that means that it'll only read several lines from the file, never do a
> shuffle of the whole file content...

Err, thing is, it *does* pick a random selection from the whole file,
without holding the whole file in memory. (It does hold all the
selected items in memory - I don't see any way to avoid that.) Why not
try it and see?

> When you'd want to shuffle the file
> content, you'd have to set lines=1 and throw away repeating lines in
> subsequent runs, or you'd have to set lines higher, and deal with the
> resulting lines too in some way (throw away repeating ones... :).

Eliminating duplicates is left as an exercise for the reader. ;-)

> Doesn't
> matter how, you'd have to store which lines you've already read
> (selected_lines). And in any case you'd need a line cache of 10^9 entries for
> this amount of data...

Nope, you don't.
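
What Simon's script is doing is known as reservoir sampling: keep a buffer of N lines, and let each incoming line displace a random buffer slot with a probability that shrinks as the line number grows. Only the N selected lines are ever held in memory, however big the file is. A minimal Python 3 sketch of the textbook version (Algorithm R; the function name is mine, not Simon's):

```python
import random

def reservoir_sample(iterable, n):
    """Return n items chosen uniformly at random from iterable,
    holding at most n items in memory (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < n:
            reservoir.append(item)       # fill the buffer first
        else:
            j = random.randrange(i + 1)  # pick a slot in 0..i
            if j < n:                    # i.e. with probability n/(i + 1)
                reservoir[j] = item
    return reservoir

# Works on any iterable; for a file, pass the open file object and
# lines are streamed, never accumulated.
sample = reservoir_sample(range(10 ** 6), 5)
print(sample)
```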
 

Peter Otten

Simon said:

> I couldn't resist. ;-)

Me neither...

> import random
>
> def randomLines(filename, lines=1):
>     selected_lines = list(None for line_no in xrange(lines))
>
>     for line_index, line in enumerate(open(filename)):
>         for selected_line_index in xrange(lines):
>             if random.uniform(0, line_index) < 1:
>                 selected_lines[selected_line_index] = line
>
>     return selected_lines
>
> This has the advantage that every line has the same chance of being
> picked regardless of its length. There is the chance that it'll pick
> the same line more than once, though.
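
Two caveats about the randomLines snippet above. First, the picks are made with replacement: each of the N slots runs its own independent one-line reservoir, so two slots can land on the same line. Second, there is an off-by-one: at line_index = 1, uniform(0, 1) < 1 is (almost) always true, so the second line always displaces the first and line 0 can never be returned; the test should be uniform(0, line_index + 1) < 1, which keeps each line with probability 1/(line_index + 1). A corrected Python 3 sketch (the function name is mine):

```python
import random

def random_lines(lines, n=1):
    # n independent one-slot reservoirs, as in the quoted code, but
    # replacing with probability 1/(i + 1) so that every line
    # (including the first) is equally likely per slot.
    selected = [None] * n
    for i, line in enumerate(lines):
        for slot in range(n):
            if random.uniform(0, i + 1) < 1:
                selected[slot] = line
    return selected

# Picks are still with replacement: with two lines and two slots,
# roughly half of all runs pick the same line twice.
runs = [random_lines(["a\n", "b\n"], 2) for _ in range(200)]
assert any(r[0] == r[1] for r in runs)
```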


The idea is simple: have a chain of "pickers" that pass all items that are
not picked on to the next picker. The current implementation, however, may
hit the recursion limit for large samples.

import random

class PickOne(object):
    """Pick a random item from an iterable and store it in the 'picked'
    attribute.

    As an iterator, yield all items that are not picked.

    Helper for the pick() function.
    """
    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        it = enumerate(self.iterable)
        index, self.picked = it.next()
        for index, item in it:
            # keep the incoming item with probability 1/(index + 1),
            # so every item is equally likely to end up picked
            if random.uniform(0, index + 1) < 1:
                yield self.picked
                self.picked = item
            else:
                yield item

def pick(iterable, N, accept_less=False):
    """Pick N random items from an iterable."""
    # Set up the PickOne(PickOne( ... PickOne(iterable) ... )) chain
    keepers = []
    for i in xrange(N):
        iterable = PickOne(iterable)
        keepers.append(iterable)

    # We are interested in the side effects, i. e. the PickOne
    # instances picking random items from the initial iterable
    for item in iterable:
        pass

    # The following could be
    #     return [k.picked for k in keepers]
    # but we want to handle the case len(iterable) < N, too
    result = []
    for keeper in keepers:
        try:
            picked = keeper.picked
        except AttributeError:
            if accept_less:
                return result
            raise ValueError("sequence contains fewer items than shall be picked")
        result.append(picked)
    return result
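
For readers on Python 3, the same chain needs next(it) instead of it.next(), range instead of xrange, and a guard for empty input, since PEP 479 turns a StopIteration escaping a generator into a RuntimeError. A sketch of the port, also using the 1/(index + 1) replacement probability so the first item can be picked:

```python
import random

class PickOne:
    """Pick one random item from an iterable and remember it in
    self.picked; iterating yields everything that was not picked."""
    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        it = enumerate(self.iterable)
        try:
            index, self.picked = next(it)
        except StopIteration:
            return  # empty input: nothing to pick (PEP 479 guard)
        for index, item in it:
            if random.uniform(0, index + 1) < 1:  # prob 1/(index + 1)
                yield self.picked
                self.picked = item
            else:
                yield item

def pick(iterable, n, accept_less=False):
    """Pick n random items by chaining n PickOne filters."""
    keepers = []
    for _ in range(n):
        iterable = PickOne(iterable)
        keepers.append(iterable)
    for _ in iterable:  # drain the chain; picking happens as a side effect
        pass
    result = []
    for keeper in keepers:
        try:
            result.append(keeper.picked)
        except AttributeError:
            if accept_less:
                return result
            raise ValueError("sequence contains fewer items than requested")
    return result

sample = pick(range(100), 5)
print(sample)
```

Because each picker removes its pick from the stream it passes on, the n picks are distinct, unlike the per-slot scheme in randomLines.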

def demo():
    """Pick random lines from a set of files."""
    import optparse
    from itertools import chain, izip, repeat, count

    parser = optparse.OptionParser()
    parser.usage = ("%prog [nl] file1 file2 ... fileN\n\n"
        "Choose random lines from files given on the command line.")
    parser.add_option("-n", "--count", type="int", default=10,
        help="number of lines that shall be picked (default=10)")
    parser.add_option("-l", "--less", action="store_true",
        help="accept less picked lines than specified, if there are "
        "less total lines in the file(s)")
    options, args = parser.parse_args()

    index_name_line = chain(*[izip(count(), repeat(fn), file(fn))
                              for fn in args])
    try:
        picked = pick(index_name_line, options.count, options.less)
    except ValueError:
        raise SystemExit("Cannot pick more lines than found in the file(s)")

    print "".join([("%3d <%s> %s" % ifl) for ifl in picked])

if __name__ == "__main__":
    demo()
 
