RE: shuffle the lines of a large file

Discussion in 'Python' started by Alex Stapleton, Mar 7, 2005.

  1. Whoops, typo.

    else:
        buffer.shuffle()
        for line in buffer:
            print line

    should be

    else:
        random.shuffle(buffer)
        for line in buffer:
            print line

    of course
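For reference, a small check of why the first version fails, written as a sketch for a current Python 3 interpreter: a list has no shuffle() method, whereas random.shuffle() reorders a list in place and returns None. The sample lines are made up for illustration.

```python
import random

buffer = ["a\n", "b\n", "c\n", "d\n"]  # made-up sample lines

# A list has no shuffle() method, so buffer.shuffle() raises AttributeError.
try:
    buffer.shuffle()
    failed = False
except AttributeError:
    failed = True

# random.shuffle() reorders the list in place and returns None,
# so its result must not be assigned back to buffer.
result = random.shuffle(buffer)
```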

    -----Original Message-----
    From: python-list-bounces+alexs=
    [mailto:python-list-bounces+alexs=]On Behalf Of Alex
    Stapleton
    Sent: 07 March 2005 14:17
    To: Joerg Schuster;
    Subject: RE: shuffle the lines of a large file


    I have not tested this. Run it (or some derivative of it) over its own
    output to get increasing randomness.
    You will want to keep max_buffered_lines as high as possible, I
    imagine. If shuffle() is too intensive,
    you could instead iterate over the buffer several times, randomly removing
    and printing lines until the buffer is empty or suitably small, which
    removes some of the processing overhead.

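    ### START ###
    import random
    import sys

    f = open('corpus.uniq')

    buffer = []
    max_buffered_lines = 1000

    for line in f:
        if len(buffer) < max_buffered_lines:
            buffer.append(line)
        else:
            # Buffer is full: shuffle it and write it out.
            # (random.shuffle, not buffer.shuffle: the typo corrected above.)
            random.shuffle(buffer)
            for buffered_line in buffer:
                # Lines keep their trailing newline, so write() avoids
                # the extra blank line that print would add.
                sys.stdout.write(buffered_line)
            # Start the next chunk with the line that did not fit.
            buffer = [line]

    # Shuffle and write whatever is left in the last, partial chunk.
    random.shuffle(buffer)
    for buffered_line in buffer:
        sys.stdout.write(buffered_line)

    f.close()

    ### END ###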
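The alternative mentioned in the message (randomly removing lines from the buffer until it is empty) could look roughly like the sketch below. The names drain_randomly and chunked_shuffle are hypothetical, not from the thread, and whether this is actually cheaper than random.shuffle() is untested here; it does about the same amount of work per line.

```python
import random


def drain_randomly(buffer, rng=random):
    """Yield the items of buffer in random order, emptying it as we go."""
    while buffer:
        i = rng.randrange(len(buffer))
        # Swap the chosen item to the end so that pop() is O(1).
        buffer[i], buffer[-1] = buffer[-1], buffer[i]
        yield buffer.pop()


def chunked_shuffle(lines, max_buffered_lines=1000, rng=random):
    """Yield lines in chunk-wise random order.

    At most max_buffered_lines lines are held in memory at once.
    """
    buffer = []
    for line in lines:
        buffer.append(line)
        if len(buffer) >= max_buffered_lines:
            # drain_randomly empties the list in place, so it can be reused.
            for out in drain_randomly(buffer, rng):
                yield out
    # Emit the final, partial chunk.
    for out in drain_randomly(buffer, rng):
        yield out
```

Something like `sys.stdout.writelines(chunked_shuffle(open('corpus.uniq')))` would then stream the result. Note that a line never leaves its own chunk, so on a sorted file the output is only locally shuffled, and a second pass with the same max_buffered_lines sees the same chunk boundaries again.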

    -----Original Message-----
    From: python-list-bounces+alexs=
    [mailto:python-list-bounces+alexs=]On Behalf Of
    Joerg Schuster
    Sent: 07 March 2005 13:37
    To:
    Subject: shuffle the lines of a large file


    Hello,

    I am looking for a method to "shuffle" the lines of a large file.

    I have a corpus of sorted and "uniqed" English sentences that has been
    produced with (1):

    (1) sort corpus | uniq > corpus.uniq

    corpus.uniq is 80G in size. The fact that every sentence appears only
    once in corpus.uniq is important for the processes
    I run on my corpus. Yet the alphabetical order is an
    unwanted side effect of (1): very often, I do not want (or rather, I
    do not have the computational capacity) to apply a program to all of
    corpus.uniq. Yet any series of consecutive lines of corpus.uniq is
    obviously a very lopsided set of English sentences.

    So, it would be very useful to do one of the following things:

    - produce corpus.uniq in such a way that it is not sorted in any way
    - shuffle corpus.uniq > corpus.uniq.shuffled

    Unfortunately, none of the machines that I may use has 80G RAM.
    So, using a dictionary will not help.

    Any ideas?

    Joerg Schuster
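Not proposed in this thread, but one way to approach the memory limit described above is a two-pass "scatter" shuffle: write each line to a randomly chosen temporary bucket file, then shuffle each bucket in memory and concatenate the results. The sketch below assumes Python 3; the function name shuffle_large_file, the num_buckets parameter and its default are hypothetical, and it assumes every bucket fits in RAM and that enough temporary disk space is available.

```python
import os
import random
import tempfile


def shuffle_large_file(src_path, dst_path, num_buckets=64, rng=None):
    """Shuffle the lines of src_path into dst_path via temporary bucket files.

    num_buckets is an illustrative default; choose it so that one bucket
    (roughly file size / num_buckets) fits comfortably in memory.
    """
    rng = rng or random.Random()
    with tempfile.TemporaryDirectory() as tmp_dir:
        bucket_paths = [os.path.join(tmp_dir, "bucket_%d" % i)
                        for i in range(num_buckets)]
        buckets = [open(path, "wb") for path in bucket_paths]
        try:
            # Pass 1: send every line to a randomly chosen bucket.
            with open(src_path, "rb") as src:
                for line in src:
                    if not line.endswith(b"\n"):
                        # Terminate a final unterminated line so it cannot
                        # merge with its neighbour after shuffling.
                        line += b"\n"
                    buckets[rng.randrange(num_buckets)].write(line)
        finally:
            for bucket in buckets:
                bucket.close()

        # Pass 2: shuffle each bucket in memory and append it to the output.
        with open(dst_path, "wb") as dst:
            for path in bucket_paths:
                with open(path, "rb") as bucket:
                    lines = bucket.readlines()
                rng.shuffle(lines)
                dst.writelines(lines)
```

A call such as `shuffle_large_file('corpus.uniq', 'corpus.uniq.shuffled', num_buckets=200)` would keep each bucket of an 80G file at roughly 400M, as long as the process is allowed that many open files. Unlike a chunk-wise shuffle, a line can end up anywhere in the output.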

    --
    http://mail.python.org/mailman/listinfo/python-list

     
    Alex Stapleton, Mar 7, 2005