J
Joerg Schuster
Hello,
I am looking for a method to "shuffle" the lines of a large file.
I have a corpus of sorted and "uniqed" English sentences that has been
produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large. The fact that every sentence appears only
once in corpus.uniq plays an important role for the processes
I use to involve my corpus in. Yet, the alphabetical order is an
unwanted side effect of (1): Very often, I do not want (or rather, I
do not have the computational capacities) to apply a program to all of
corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a
very lopsided set of English sentences.
So, it would be very useful to do one of the following things:
- produce corpus.uniq in a such a way that it is not sorted in any way
- shuffle corpus.uniq > corpus.uniq.shuffled
Unfortunately, none of the machines that I may use has 80G RAM.
So, using a dictionary will not help.
Any ideas?
Joerg Schuster
I am looking for a method to "shuffle" the lines of a large file.
I have a corpus of sorted and "uniqed" English sentences that has been
produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large. The fact that every sentence appears only
once in corpus.uniq plays an important role for the processes
I use to involve my corpus in. Yet, the alphabetical order is an
unwanted side effect of (1): Very often, I do not want (or rather, I
do not have the computational capacities) to apply a program to all of
corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a
very lopsided set of English sentences.
So, it would be very useful to do one of the following things:
- produce corpus.uniq in a such a way that it is not sorted in any way
- shuffle corpus.uniq > corpus.uniq.shuffled
Unfortunately, none of the machines that I may use has 80G RAM.
So, using a dictionary will not help.
Any ideas?
Joerg Schuster