Sorting Large File (Code/Performance)


Grant Edwards

Sure, since no-one can ever try it with more than 640k, it's
easy to state that there is no limit. :)

Huh? I used DOS sort to sort files much bigger than 640K. And
I used the Unix sort command on a machine with 64K bytes of
data space to sort files even larger than the ones under DOS.
 

Albert van der Horst

Thanks to all who replied. It's very appreciated.

Yes, I had to double-check the line count, and the number of lines is ~16
million (instead of the stated 1.6 billion).

Also:

The file is UTF-8

Yes, the first two characters always belong to the ASCII subset

I need to isolate all lines that start with two particular characters (zz, to
be specific)

Like this?
grep '^zz' longfile > aapje

You will have a hard time beating that with Python, in every respect.

Cheers,

Ira

Groetjes Albert
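
For completeness, a minimal Python sketch of the same filtering step (the file
names longfile and aapje are taken from the grep example above; everything else
is an illustrative assumption):

# Minimal Python equivalent of: grep '^zz' longfile > aapje
# (binary mode is fine here, since the first two characters are plain ASCII)
with open("longfile", "rb") as src, open("aapje", "wb") as dst:
    for line in src:
        if line.startswith(b"zz"):
            dst.write(line)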
 

Albert van der Horst

John Nagle wrote:

OK, that's not bad at all.

You have a few options:

- Get enough memory to do the sort with an in-memory sort, like UNIX "sort"
or Python's "sort" function.

There is no Unix "sort" besides the executable binary /usr/bin/sort.
There is qsort() in the standard C library,
but you would have to write a C program around it.
- Thrash; in-memory sorts do very badly with virtual memory, but eventually
they finish. Might take many hours.

For a long time I thought that too, but of course it doesn't apply to merge sorts.
Then I heard reputable sources say that quicksort has good "locality".
Thinking about it, I find it more than plausible.
Now I wonder whether even thrashing to disk (page faults) would really tank a
qsort-based application. Why would it be worse than a batch sort?
- Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
There's a free 30-day trial. It can probably sort your data
in about a minute.)

This is the way, but you don't need/want to go with "trial" software.
- Load the data into a database like MySQL and let it do the work.
This is slow if done wrong, but OK if done right.
- Write a distribution sort yourself. Fan out the incoming file into
one file for each first letter, sort each subfile, merge the
results.
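
To make that last option concrete, here is a rough Python sketch of the
fan-out/sort/merge idea. The file and directory names are illustrative
assumptions; it buckets on the first byte (safe here, since the first
characters are ASCII) and assumes each bucket fits in memory. Because the
buckets cover disjoint key ranges, the final "merge" is just concatenation
in key order.

# Rough sketch of a distribution sort: fan the input out into one file per
# leading byte, sort each bucket in memory, then write the buckets back out
# in key order. Names and paths are made up for illustration.
import os

def distribution_sort(src_path, dst_path, tmp_dir="buckets"):
    os.makedirs(tmp_dir, exist_ok=True)
    buckets = {}                                  # leading byte -> open bucket file
    with open(src_path, "rb") as src:
        for line in src:
            key = line[:1]                        # first character is ASCII per the OP
            f = buckets.get(key)
            if f is None:
                name = os.path.join(tmp_dir, "%02x.part" % key[0])
                f = buckets[key] = open(name, "wb")
            f.write(line)
    for f in buckets.values():
        f.close()
    with open(dst_path, "wb") as dst:
        for key in sorted(buckets):               # disjoint ranges, so this *is* the merge
            name = os.path.join(tmp_dir, "%02x.part" % key[0])
            with open(name, "rb") as f:
                dst.writelines(sorted(f))         # each bucket must fit in memory
            os.remove(name)

distribution_sort("longfile", "longfile.sorted")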


With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
a standard in-memory sort.
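
For concreteness, the in-memory route in Python is short (a sketch, assuming
the whole file really does fit in RAM; the output name is made up):

# Plain in-memory sort; a byte-wise sort of UTF-8 lines matches code-point order.
with open("longfile", "rb") as f:
    lines = f.readlines()
lines.sort()
with open("longfile.sorted", "wb") as f:
    f.writelines(lines)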

You give the mistaken impression that normal
sort programs can only sort what can be held in memory. This is
only true for MS-DOS's sort.exe, which was an abomination even in 1983.

All sort programs on Unixes, way back in the 1980s, could handle
files limited only by disk size.
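
For the curious, a minimal Python sketch of the external merge sort such
programs are built on: sort manageable runs in memory, spill them to temporary
files, then stream-merge the runs. The run size and file names are illustrative
assumptions, and it assumes every line ends with a newline.

# External (disk-backed) merge sort: memory use is bounded by the run size,
# so the file only has to fit on disk, not in RAM.
import heapq
import itertools
import os
import tempfile

def external_sort(src_path, dst_path, lines_per_run=1_000_000):
    run_paths = []
    with open(src_path, "rb") as src:
        while True:
            run = sorted(itertools.islice(src, lines_per_run))   # one in-memory run
            if not run:
                break
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "wb") as f:
                f.writelines(run)
            run_paths.append(path)
    runs = [open(p, "rb") for p in run_paths]
    try:
        with open(dst_path, "wb") as dst:
            dst.writelines(heapq.merge(*runs))                   # streaming k-way merge
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)

external_sort("longfile", "longfile.sorted")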

With respect to Microsoft OSes:
Despite its age and using only 1 Mbyte of internal memory,
my ssort.exe can handle the OP's problem with ease.

It sorted a 12 Mbyte file in minutes even in 1993.

http://home.hccnet.nl/a.w.m.van.der.horst/ssort.html

This is a plain executable, so no hassles and no trial.
It is GPL, so you can even hack the source, if you fancy C++.

There are a lot of Unix-compatible toolkits available for
Windows nowadays. Install one; you can spare a dozen megabytes,
can't you?

Groetjes Albert
 
