Sorting Large File (Code/Performance)

Ira.Kovac

Hello all,

I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
to sort based on the first two characters.

I'd greatly appreciate it if someone could post sample code that can
help me do this.

Also, any ideas on approximately how long the sort process is going
to take (XP, Dual Core 2.0GHz w/2GB RAM)?

Cheers,

Ira
 
Paul Rubin

I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
to sort based on the first two characters.

I'd greatly appreciate it if someone could post sample code that can
help me do this.

Use the unix sort command:

sort inputfile -o outputfile

I think there is a cygwin port.
Also, any ideas on approximately how long the sort process is going
to take (XP, Dual Core 2.0GHz w/2GB RAM)?

Eh, unix sort would probably take a while, somewhere between 15
minutes and an hour. If you only have to do it once, it's not worth
writing special-purpose code. If you have to do it a lot, get some
more RAM for that box, suck the file into memory and do a radix sort.
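
A minimal sketch of that idea (illustrative only, not Paul's code; the
file names are placeholders): bucket the lines by their first two bytes
and write the buckets back out in key order, assuming the whole file
fits in RAM:

from collections import defaultdict

buckets = defaultdict(list)
for line in open('the_file', 'rb'):
    # group lines by their first two bytes (ASCII, per the OP)
    buckets[line[:2]].append(line)

out = open('sorted_file', 'wb')
for key in sorted(buckets):      # bucket keys in raw byte order
    out.writelines(buckets[key])
out.close()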
 
John Nagle

Hello all,

I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
to sort based on the first two characters.

Given those numbers, the average number of characters per line is
less than 2. Please check.

John Nagle
 
John Machin

Hello all,

I have a Unicode text file with 1.6 billion lines (~2GB) that I'd like
to sort based on the first two characters.

If you mean 1.6 American billion, i.e. 1.6 * 1000 ** 3 lines, and 2 *
1024 ** 3 bytes of data, that's about 1.34 bytes per line. If you mean
other definitions of "billion" and/or "GB", the result is even fewer
bytes per line.

What is a "Unicode text file"? How is it encoded: utf8, utf16,
utf16le, utf16be, ??? If you don't know, do this:

print repr(open('the_file', 'rb').read(100))

and show us the results.

What does "based on [the] first two characters" mean? Do you mean raw
order based on the ordinal of each character, i.e. no fancy
language-specific collating sequence? Do the first two characters
always belong to the ASCII subset?

You'd like to sort a large file? Why? Sorting a file is just a means
to an end, and often another means is more appropriate. What are you
going to do with it after it's sorted?
I'd greatly appreciate it if someone could post sample code that can
help me do this.

I'm sure you would. However it would benefit you even more if instead
of sitting on the beach next to the big arrow pointing to the drop
zone, you were to read the manual and work out how to do it yourself.
Here's a start: http://docs.python.org/lib/typesseq-mutable.html
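
For instance (a bare sketch only, with made-up file names), sorting on
just the first two characters with list.sort() and a key function would
look something like:

lines = open('the_file', 'rb').readlines()
lines.sort(key=lambda line: line[:2])   # compare only the first two bytes
out = open('sorted_file', 'wb')
out.writelines(lines)
out.close()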
Also, any ideas on approximately how long the sort process is going
to take (XP, Dual Core 2.0GHz w/2GB RAM)?

If you really have a 2GB file and only 2GB of RAM, I suggest that you
don't hold your breath.

Instead of writing Python code, you are probably better off doing an
external sort. You might consider looking for a Windows port of a
Unicode-capable Unix sort utility. Google "GnuWin32" and see if their
sort does what you want.
 
Ira.Kovac

Thanks to all who replied. It's very appreciated.

Yes, I had to double-check the line counts, and the number of lines is
~16 million (instead of the stated 1.6B).

Also:
What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:
The file is UTF-8
Do the first two characters always belong to the ASCII subset?
Yes, first two always belong to ASCII subset
What are you going to do with it after it's sorted?
I need to isolate all lines that start with two specific characters
('zz', in particular)
Here's a start: http://docs.python.org/lib/typesseq-mutable.html
Google "GnuWin32" and see if their sort does what you want.
Will do, thanks for the tip.
If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.
I am limited on resources, unfortunately.

Cheers,

Ira
 
Stefan Behnel

I need to isolate all lines that start with two specific characters
('zz', in particular)

"Isolate" as in "extract"? Remove the rest?

Then why don't you extract the lines first, without sorting the file? (or sort
it afterwards if you still need to). That would heavily cut down your memory
footprint.

Stefan
 
Stefan Behnel

Stefan said:
"Isolate" as in "extract"? Remove the rest?

Then why don't you extract the lines first, without sorting the file? (or sort
it afterwards if you still need to). That would heavily cut down your memory
footprint.

Just for fun, this is what I meant:

for utf8_line in open(filename, 'rb'):
    if utf8_line.startswith('zz'):
        print utf8_line

Stefan
 
John Machin

I need to isolate all lines that start with two specific characters
('zz', in particular)

What does "isolate" mean to you? What does this have to do with
sorting? What do you actually want to do with (a) the lines starting
with "zz" (b) the other lines? What percentage of the lines start with
"zz"?

When looking at the GnuWin32 collection:
(a) "Unicode" is not relevant to your sort problem.
(b) grab yourself a wc and a grep while you're there -- they will help
with "how many lines" and "what percentage of lines" questions.
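
If you'd rather stay in Python, a rough equivalent of wc -l plus a
count of the "zz" lines (just a sketch; 'the_file' is a placeholder
name) is:

total = zz = 0
for line in open('the_file', 'rb'):
    total += 1
    if line.startswith('zz'):
        zz += 1
print total, zz, 100.0 * zz / total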
 
Martin Marcher

Given those numbers, the average number of characters per line is
less than 2. Please check.

which would be true if 1,599,999,999 lines had 2 chars each and the rest
just one :)

(but yes, it would be an interesting question how to sort a 1-character
line based on the first 2 characters of that line)

martin

--
http://noneisyours.marcher.name
http://feeds.feedburner.com/NoneIsYours

You are not free to read this message,
by doing so, you have violated my licence
and are required to urinate publicly. Thank you.
 
John Nagle

Thanks to all who replied. It's very appreciated.

Yes, I had to double-check the line counts, and the number of lines is
~16 million (instead of the stated 1.6B).

OK, that's not bad at all.

You have a few options:

- Get enough memory to do the sort with an in-memory sort, like UNIX "sort"
  or Python's "sort" function.
- Thrash; in-memory sorts do very badly with virtual memory, but eventually
  they finish. Might take many hours.
- Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
  There's a free 30-day trial. It can probably sort your data
  in about a minute.)
- Load the data into a database like MySQL and let it do the work.
  This is slow if done wrong, but OK if done right.
- Write a distribution sort yourself. Fan out the incoming file into
  one file for each first letter, sort each subfile, merge the
  results. (A rough sketch follows below.)

With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
a standard in-memory sort.
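
A rough sketch of the last option, the do-it-yourself distribution sort
(illustrative only; the file names are made up and no attempt is made to
clean up the temporary bucket files):

# Pass 1: fan the input out into one temporary file per first byte.
outfiles = {}
for line in open('big_file', 'rb'):
    first = line[0]
    if first not in outfiles:
        outfiles[first] = open('bucket_%02x.tmp' % ord(first), 'wb')
    outfiles[first].write(line)
for f in outfiles.values():
    f.close()

# Pass 2: sort each (much smaller) bucket in memory and concatenate
# the results in byte order of the first character.
result = open('sorted_file', 'wb')
for first in sorted(outfiles):
    lines = open('bucket_%02x.tmp' % ord(first), 'rb').readlines()
    lines.sort()
    result.writelines(lines)
result.close()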

John Nagle
 
Paul Rubin

John Nagle said:
- Get enough memory to do the sort with an in-memory sort, like
UNIX "sort" or Python's "sort" function.

Unix sort does external sorting when needed.
 
John Nagle

Paul said:
Unix sort does external sorting when needed.

Ah, someone finally put that in. Good. I hadn't looked at "sort"'s manual
page in many years.

John Nagle
 
Paul Rubin

John Nagle said:
Ah, someone finally put that in. Good. I hadn't looked at
"sort"'s manual page in many years.

Huh? It has been like that from the beginning. It HAD to be. Unix
was originally written on a PDP-11. The GNU version has also always
been external.
 
Asim

Since the OP has stated that they are running Windows XP, and more
than one poster has suggested installing more RAM in the box, I
thought people should know that WinXP has certain limitations on the
amount of memory that may be used:

http://msdn2.microsoft.com/en-us/library/aa366778.aspx

Firstly, the maximum amount of physical memory that may be installed
is 4GB. Secondly, without the "4 gigabyte tuning" and
"IMAGE_FILE_LARGE_ADDRESS_AWARE" patches, the maximum amount of
virtual memory (physical memory + swapfile size) that may be assigned
to a user process is 2GB.

Hence, even if you made a 100GB swap file with 4GB RAM installed, by
default only a maximum of 2GB would ever be assigned to a user
process. With the two flags enabled, the maximum becomes 3GB.

If the OP finds performance to be limited and thinks more RAM would
help, trying a later version of Windows would be a start, but it would
be better to try Linux or Mac OS X.

Cheers,
Asim
 
Asim

Sorry, just to clarify my response. Any 32-bit OS will only be able
to assign 4GB of virtual address space to a single process, the
argument being that since a process can only use 32-bit addresses, it
can only address a maximum of 2^32 bytes (assuming the architecture
uses byte-addressed memory).

Another link that's easier to grok:

http://www.codinghorror.com/blog/archives/000811.html

However, a 32-bit OS may support more than 4GB of physical memory
(using "Physical Address Extension", or PAE) and split it more
intelligently between processes than Windows XP or Vista does:

http://www.ibm.com/developerworks/linux/library/l-memmod/

So giving your sort application more than 4GB of memory in total could
be achieved by splitting the task into more than one process on an
appropriate OS. AFAIK, such memory limitations depend on the particular
Linux distro you're using, and I'm not sure about Mac OS X's
limitations. This applies doubly for 64-bit architectures and OSes.

Please correct me, with references, if my conclusions are wrong.

Cheers,
Asim
 
Nicko

I am limited on resources, unfortunately.

As long as you have at least as much spare disc space as you need to
hold a copy of the file, this is not too hard. Split the file into
chunks that are small enough to fit in memory, sort each chunk, write
it to a file, and then interleave the chunks. Below is a cheap and
cheesy outline of code to do this, from which you can start.

For files which are hugely larger than your available memory you can
do this recursively, but for files which are 10 to 100 times too big
the single-pass code below will probably work just fine. The
complexity is technically O(n * (log(c) + n/c)), where n is the size
of the input and c is the chunk size; once n/c (the number of chunks)
exceeds log(c), the cost of merging the chunks will start to dominate,
though a recursive version would be slowed by needing a lot more disc
access.

#!/usr/bin/env python
from itertools import islice
from tempfile import TemporaryFile
import sys

# Tweak this number to fill your memory
lines_per_chunk = 100000

chunkfiles = []
mergechunks = []

# Pass 1: read the input in chunks, sort each chunk in memory and
# spill it to its own temporary file.
while True:
    chunk = list(islice(sys.stdin, lines_per_chunk))
    if not chunk:
        break
    chunk.sort()
    f = TemporaryFile()
    f.writelines(chunk)
    f.seek(0)
    # Seed the merge with the first line read back from the chunk file;
    # readline() also advances past it, so it is not emitted twice.
    mergechunks.append((f.readline(), len(chunkfiles)))
    chunkfiles.append(f)

# Pass 2: repeatedly emit the smallest pending line and refill from
# the chunk file it came from.
while mergechunks:
    # The next line is order O(n) in the number of chunks
    (line, fileindex) = min(mergechunks)
    mergechunks.remove((line, fileindex))
    sys.stdout.write(line)
    nextline = chunkfiles[fileindex].readline()
    if nextline == "":
        chunkfiles[fileindex].close()
    else:
        mergechunks.append((nextline, fileindex))
 
Paul Rubin

Nicko said:
# The next line is order O(n) in the number of chunks
(line, fileindex) = min(mergechunks)

You should use the heapq module to make this operation O(log n) instead.
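
For instance (a sketch only, reusing the mergechunks, chunkfiles and sys
names from Nicko's script above), the second loop could become:

import heapq

heapq.heapify(mergechunks)   # turn the (line, fileindex) pairs into a min-heap
while mergechunks:
    # popping the smallest pending line is O(log n) in the number of chunks
    line, fileindex = heapq.heappop(mergechunks)
    sys.stdout.write(line)
    nextline = chunkfiles[fileindex].readline()
    if nextline == "":
        chunkfiles[fileindex].close()
    else:
        heapq.heappush(mergechunks, (nextline, fileindex))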
 
Gabriel Genellina

On Fri, 25 Jan 2008 17:50:17 -0200, Paul Rubin wrote:
You should use the heapq module to make this operation O(log n) instead.

Or forget about Python and use the Windows sort command. It has been there
since the MS-DOS days, there is no need to download and install other
packages, and the documentation at
http://technet.microsoft.com/en-us/library/bb491004.aspx says:

Limits on file size:
The sort command has no limit on file size.

Better, since the OP only intends to extract lines starting with "zz", use
the findstr command:
findstr /l /b "zz" filename.exe
would do the job.

Why make things more complicated than that?
 
