Sorting in huge files

Paul

Hi all

I have a sorting problem, but my experience with Python is rather
limited (3 days), so I am running this by the list first.

I have a large database of 15GB, consisting of 10^8 entries of
approximately 100 bytes each. I devised a relatively simple key map on
my database, and I would like to order the database with respect to the
key.

I expect a few repeats for most of the keys, and that's actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I need to sort the
keys first (grouping data entries having the _same_ key), and then figure
out which keys are "similar".)

A few thoughts on this:
- Space is not going to be an issue. I have a Tb available.
- The Python sort() on list should be good enough, if I can load the
whole database into a list/dict
- Each data entry is relatively small, so I shouldn't use pointers
- Keys could be strings, integers with the usual order, whatever is
handy, it doesn't matter to me. The choice will probably have to do
with what sort() prefers.
- Also I will be happy with any key space size. So I guess 100*size of
the database will do.

Any comments?
How long should I hope this sort will take? It will sound weird, but I
actually have 12 different key maps and I want to sort this with
respect to each map, so I will have to sort 12 times.

Paul
 
Steven Bethard

Paul said:
I expect a few repeats for most of the keys, and that's actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I need to sort the
keys first (grouping data entries having the _same_ key), and then figure
out which keys are "similar".)

If this is really your final goal, you may not want to sort. Consider
code like the following:
>>> entries = [('a', '4'),
...            ('x', '7'),
...            ('a', '2'),
...            ('b', '7'),
...            ('x', '4')]
>>> counts = {}
>>> for entry in entries:
...     key = entry[0]
...     counts.setdefault(key, []).append(entry)
...
>>> for key in counts:
...     print key, counts[key]
...
a [('a', '4'), ('a', '2')]
x [('x', '7'), ('x', '4')]
b [('b', '7')]

I've grouped all entries with the same key together using a dict object
and without the need for any sorting. If you had a good definition of
'similar', you could perhaps map all 'similar' keys to the same value in
the dict.
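
For instance, with a hypothetical canonical() function standing in for
whatever makes two keys "similar" (just a sketch of the idea, not your
actual similarity rule):

def canonical(key):
    # hypothetical stand-in for the rule that makes two keys "similar"
    return key.lower()

groups = {}
for entry in entries:
    groups.setdefault(canonical(entry[0]), []).append(entry)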

If you really do need to sort, Python 2.4 provides a very nice way to
sort by a particular key:
>>> import operator
>>> entries = [('a', '4'),
...            ('x', '7'),
...            ('a', '2'),
...            ('b', '7'),
...            ('x', '4')]
>>> sorted(entries, key=operator.itemgetter(1))
[('a', '2'), ('a', '4'), ('x', '4'), ('x', '7'), ('b', '7')]

Here, I've sorted the entries by the second item in each tuple. If you
go this route, you should also look at itertools.groupby:
>>> import itertools
>>> entries = [('a', '4'),
...            ('x', '7'),
...            ('a', '2'),
...            ('b', '7'),
...            ('x', '4')]
>>> entries.sort(key=operator.itemgetter(1))
>>> for key, values in itertools.groupby(entries, key=operator.itemgetter(1)):
...     print key, list(values)
...
2 [('a', '2')]
4 [('a', '4'), ('x', '4')]
7 [('x', '7'), ('b', '7')]

The groupby basically does the sort of grouping of a sorted list that I
think you had in mind...

Steve
 
P

Paul

I really do need to sort. It is complicated and I haven't said why, but
it will help in finding similar keys later on. Sorry I can't be more
precise; this has to do with my research.

Your two other suggestions with itertools and operator are more useful,
but I was mostly wondering about performance issues.

Is this reasonable to do on 10^8 elements with repeats in the keys? I
guess I should just try and see for myself.
 
Steven Bethard

Paul said:
Is this reasonable to do on 10^8 elements with repeats in the keys? I
guess I should just try and see for myself.

Yeah, that's usually the right solution. I didn't comment on
space/speed issues because they're so data dependent in a situation like
this, and without actually looking at your data, I doubt anyone here can
even really ballpark an answer for you. And if we had your data, we'd
probably just try to load it and see what happens anyway. ;)

Steve
 
Scott David Daniels

Paul said:
I have a large database of 15GB, consisting of 10^8 entries of
approximately 100 bytes each.

Or 10 gigabytes of data.

A few thoughts on this:
- Space is not going to be an issue. I have a Tb available.

I presume this is disk space, not memory. If you do have a Tb of RAM
and you are using CrayPython, just do it. From the above, I'd figure
(very hand-wavishly) that you want about 20G of RAM (maybe as little as
15G). That's a lot to me.
Any comments?
I'd draw a random sample of data to work with. The Python Cookbook has
some good recipes for drawing an unbiased sample (a reservoir-sampling
sketch follows the code below). If you work with about 10,000 elements,
you can probably still see patterns, but can try all kinds of weird
stuff experimentally. Once you have a feel for key distributions, cut
the distribution into key ranges that "guarantee" (not really, but in a
statistical sense) the data will fit comfortably in memory (say a third
of your available RAM), and process each range separately. The sort
should then be duck soup:

ordered_dest = open('ordered_by_key1', 'w')
for interval in ranges_in_order():
    data = extract_data(interval)
    data.sort(key=key1)
    for entry in data:
        ordered_dest.write(entry)
    data = []  # avoid holding the older interval's data during the next extract
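
And here is a minimal reservoir-sampling sketch for the sampling step
(one pass, uniform, without knowing the record count in advance); the
file name and sample size are only placeholders:

import random

def reservoir_sample(filename, k=10000):
    # Keep a uniform random sample of k lines in a single pass.
    sample = []
    f = open(filename)
    for n, line in enumerate(f):
        if n < k:
            sample.append(line)          # fill the reservoir first
        else:
            j = random.randint(0, n)     # keep this line with probability k/(n+1)
            if j < k:
                sample[j] = line
    f.close()
    return sample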
How long should I hope this sort will take? It will sound weird, but I
actually have 12 different key maps and I want to sort this with
respect to each map, so I will have to sort 12 times.

Well, good sorting is O(N * log(N)), so you should be able to calculate
from a sample timing (once you get out of cache effects). If you use a
variant of the above, simply time a few of the intervals you've chosen
and do a nice linear extrapolation.
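
As a rough illustration of that kind of extrapolation (the sample timing
here is made up):

import math

def scaled_estimate(t_sample, n_sample, n_full):
    # crude N*log(N) scaling from one measured sample timing
    return t_sample * (n_full * math.log(n_full)) / (n_sample * math.log(n_sample))

# e.g. if sorting 10**7 records took 60 seconds:
print scaled_estimate(60.0, 10**7, 10**8)    # about 686 seconds, i.e. ~11.4x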

--Scott David Daniels
(e-mail address removed)
 
L

Larry Bates

Paul,

I can pretty much promise you that if you really have 10^8
records, they should be put into a database; let the database
do the sorting by creating indexes on the fields that you want.
Something like MySQL should do nicely and is free.

http://www.mysql.org

Python has a good interface to the MySQL database if you want to do
other processing on the records.
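
A minimal sketch along those lines, assuming the MySQLdb package, a
hypothetical entries table, and a key1 column (connection details are
placeholders):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="curves")
cur = conn.cursor()

# One index per key map; the database can then return rows in key order.
cur.execute("CREATE INDEX idx_key1 ON entries (key1)")

# Stream the records back sorted, without loading them all at once.
cur.execute("SELECT * FROM entries ORDER BY key1")
while 1:
    rows = cur.fetchmany(10000)
    if not rows:
        break
    for row in rows:
        print row

conn.close()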

The alternative is a good high-speed sort like Syncsort, etc.

Good Luck,
Larry Bates
 
Jeremy Sanders

I have a large database of 15GB, consisting of 10^8 entries of
approximately 100 bytes each. I devised a relatively simple key map on
my database, and I would like to order the database with respect to the
key.

You won't be able to load this into memory on a 32-bit machine, even with
loads of swap. Maybe you could do this on x86-64 with lots of swap (or
loadsa memory), or other 64-bit hardware. It will be _really_ slow,
however.

Otherwise you could do an on-disk sort (not too hard with fixed-length
records), but this will require some coding. You'll probably need to do
some reading to work out which sorting algorithm accesses the data less
randomly. I think the key phrase is an "external sort" rather than an
"interal sort".

It's probably easiest to load the whole thing into a database (like
PostgreSQL) and let it do the work for you.

Jeremy
 
Adam DePrince

I really do need to sort. It is complicated and I haven't said why, but
it will help in finding similar keys later on. Sorry I can't be more
precise, this has to do with my research.

Precision is precisely what we require to give you an answer more
meaningful than "write a script to load it into your favorite database
and type 'select * from table order by column;' "

Now, unless you have an NDA with an employer or are working on something
classified (in which case you have already given us too much information
and should start looking for another job and a lawyer), I would venture
a guess that you have more to gain than lose from giving us more
information. Decisions are hard sometimes ... is the help worth the risk
that somebody in this forum will look at your question, say "hey, that
is a neat idea," duplicate all of your research, and publish before you,
shaming you into a life of asking "do you want fries with that" and
pumping gas?
Your two other suggestions with itertools and operator are more useful,
but I was mostly wondering about performance issues.

What performance issue? Nowadays any decent laptop should be able to
handle this dataset (from disk) without too much trouble.

c = make_a_cursor_for_my_favorite_database()
f = open("mydata")
for line in f.xreadlines():
    c.execute("insert into table (fields) values (%s, %s, ...)",
              line.split())
c.commit()
print "I'm done loading, feel free to hit control+C if you get tired"
c.execute("select * from table order by field")
while 1:
    print c.fetchone()

Then, from your shell:

myloadscript.py | gzip -9 > results.txt

Start it up Friday night and take the weekend off. Just make sure you
plug your laptop into the wall before you go home.
Is this reasonable to do on 10^8 elements with repeats in the keys? I
guess I should just try and see for myself.

Repeats in the keys don't matter.


Adam DePrince
 
Paul

The reason I am not telling you much about the data is not that I am
afraid anyone would steal my ideas, or that I have a non-disclosure
agreement, or that I don't want to end up pumping gas.
It is just that it is pretty freaking damn hard to even explain what is
going on. Probably a bit harder than your usual database.

If you really want to know, my entries are elliptic curves and my
hashing function is an attempt at mapping them to their Serre residual
representation modulo a given prime p.

Now, for you to tell me something relevant about the data that I don't
already know from some big-shot conjecture/theorem would probably
require that you read a few grad school books on number theory. Happy?
 
Paul Rubin

Paul said:
If you really want to know, my entries are elliptic curves and my
hashing function is an attempt at mapping them to their Serre residual
representation modulo a given prime p.

Now, for you to tell me something relevant about the data that I don't
already know from some big-shot conjecture/theorem would probably
require that you read a few grad school books on number theory. Happy?

If you know that much math, you shouldn't have any trouble reading
Knuth's "The Art of Computer Programming", Vol. 3: "Sorting and
Searching", for info on how to do external sorts. External sorting means sorting
files that are too large to fit in memory. Basically you read in
chunks of the file that will fit in memory, sort the chunks in memory
and write the sorted chunks to disk. Then you merge the sorted
chunks. Since in the old days a huge amount of computer resources
went into doing these operations, whole books (like Knuth's) have been
written about how to handle all the details efficiently.
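
A minimal sketch of that chunk-and-merge approach in Python, assuming
one record per line and that lines compare correctly as strings (the
chunk size and temp-file handling are just for illustration):

import heapq
import os
import tempfile

def external_sort(infile, outfile):
    # Pass 1: read roughly 50 MB of lines at a time, sort each chunk in
    # memory, and write the sorted chunk to its own temporary file.
    tempnames = []
    f = open(infile)
    while True:
        lines = f.readlines(50 * 1024 * 1024)
        if not lines:
            break
        lines.sort()
        fd, name = tempfile.mkstemp()
        tmp = os.fdopen(fd, 'w')
        tmp.writelines(lines)
        tmp.close()
        tempnames.append(name)
    f.close()

    # Pass 2: merge the sorted chunks, keeping one "next line" per chunk
    # in a heap and always writing out the smallest.
    chunks = [open(name) for name in tempnames]
    heap = []
    for i, chunk in enumerate(chunks):
        line = chunk.readline()
        if line:
            heapq.heappush(heap, (line, i))
    out = open(outfile, 'w')
    while heap:
        line, i = heapq.heappop(heap)
        out.write(line)
        nxt = chunks[i].readline()
        if nxt:
            heapq.heappush(heap, (nxt, i))
    out.close()
    for i, name in enumerate(tempnames):
        chunks[i].close()
        os.remove(name)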

If you only want to do this sort once, your simplest approach (if
you're using Unix/Linux) is to use the Unix/Linux "sort" command,
which does an external sort. You'll have to first crunch your data
into a format that the "sort" command can handle, then run "sort",
then uncrunch if appropriate.
 
Paul

Thanks! I definitely didn't want to go into any elaborate programming
for this, and the Unix sort is perfect for it.
It sorted a tenth of my data in about 8 min, which is entirely
satisfactory to me (assuming it will take ~20 times longer to do the
whole thing).
Your answer greatly helped!
Paul
 
François Pinard

[Paul]
Thanks! I definitely didn't want to go into any elaborate programming
for this, and the Unix sort is perfect for it. It sorted a tenth of
my data in about 8 min, which is entirely satisfactory to me (assuming
it will take ~20 times longer to do the whole thing). Your answer
greatly helped! Paul

I was going to reply a bit more elaborately, but if you are happy with
`sort' that's quite nice, you do have a solution. :)

One of my old cases was a bit more difficult in that the comparison
algorithm was not _simple_ enough to easily translate into computing a
key once that could then be saved into the record. The comparison had
to stay live during the sort, and the whole sort was somehow co-routined
with the application. I wrote a dual-tournament sort aiming at a
polyphased merge, and transported the algorithm through machines and
languages. Not so long ago, I finally saved the whole algorithm in
Python for the record, but found it to be too slow to be practical for
huge tasks.

Nowadays, given the same kind of problem, the size and speed of machines
is such that I would dare to try Timsort (the standard Python sort) over
a mmap-ed file, right from within Python. My guess is that it could
very reasonably work, despite requiring almost no development time.
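
A minimal sketch of that idea, assuming fixed-length 100-byte records
and a key function of your own; it sorts a list of record offsets
against the mmap-ed data rather than shuffling the records in place, so
only the offset list, not the data, has to fit in RAM:

import mmap
import os

RECORD_SIZE = 100    # fixed record length assumed, as in the original post

def sort_mmapped(infile, outfile, keyfunc):
    f = open(infile, 'rb')
    size = os.path.getsize(infile)
    buf = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
    nrecords = size // RECORD_SIZE

    def key_of(i):
        # Extract the sort key from the i-th record without copying the file.
        return keyfunc(buf[i * RECORD_SIZE:(i + 1) * RECORD_SIZE])

    indices = range(nrecords)        # one small index entry per record
    indices.sort(key=key_of)

    out = open(outfile, 'wb')
    for i in indices:
        out.write(buf[i * RECORD_SIZE:(i + 1) * RECORD_SIZE])
    out.close()
    buf.close()
    f.close()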
 
sjmachin

FWIW, the algorithms in early editions [I haven't looked at recent ones]
are designed for magnetic tapes, not disk. They still work on disk
(treat each tape drive as a file on disk). I had to implement a robust
production-quality sort on MS-DOS about 20 years ago, and did it
straight out of Knuth's book. It worked robustly and was fast enough.
However, there probably were then, and are sure to be now, algorithms
specifically designed for disk that work faster. Whether they are
documented and analysed as thoroughly as in Knuth's book is another
question.
 
