Sorting in huge files

Discussion in 'Python' started by Paul, Dec 7, 2004.

  1. Paul

    Paul Guest

    Hi all

    I have a sorting problem, but my experience with Python is rather
    limited (3 days), so I am running this by the list first.

    I have a large database of 15GB, consisting of 10^8 entries of
    approximately 100 bytes each. I devised a relatively simple key map on
    my database, and I would like to order the database with respect to the
    key.

    I expect a few repeats for most of the keys, and that's actually part
    of what I want to figure out in the end. (Said loosely, I want to group
    all the data entries having "similar" keys. For this I need to sort the
    keys first (data entries having _same_ key), and then figure out which
    keys are "similar").

    A few thoughts on this:
    - Space is not going to be an issue. I have a TB available.
    - The Python sort() on list should be good enough, if I can load the
    whole database into a list/dict
    - each data entry is relatively small, so I shouldn't use pointers
    - Keys could be strings, integers with the usual order, whatever is
    handy, it doesn't matter to me. The choice will probably have to do
    with what sort() prefers.
    - Also I will be happy with any key space size. So I guess 100*size of
    the database will do.

    Any comments?
    How long should I hope this sort will take? It will sound weird, but I
    actually have 12 different key maps and I want to sort this with
    respect to each map, so I will have to sort 12 times.

    Paul
    Paul, Dec 7, 2004
    #1

  2. Paul wrote:
    > I expect a few repeats for most of the keys, and that's actually part
    > of what I want to figure out in the end. (Said loosely, I want to group
    > all the data entries having "similar" keys. For this I need to sort the
    > keys first (data entries having _same_ key), and then figure out which
    > keys are "similar").


    If this is really your final goal, you may not want to sort. Consider
    code like the following:

    >>> entries = [('a', '4'),
    ...            ('x', '7'),
    ...            ('a', '2'),
    ...            ('b', '7'),
    ...            ('x', '4')]
    >>> counts = {}
    >>> for entry in entries:
    ...     key = entry[0]
    ...     counts.setdefault(key, []).append(entry)
    ...
    >>> for key in counts:
    ...     print key, counts[key]
    ...
    a [('a', '4'), ('a', '2')]
    x [('x', '7'), ('x', '4')]
    b [('b', '7')]

    I've grouped all entries with the same key together using a dict object
    and without the need for any sorting. If you had a good definition of
    'similar', you could perhaps map all 'similar' keys to the same value in
    the dict.
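
    A minimal sketch of that idea, assuming a hypothetical canonical()
    function that maps each key to a representative of its similarity
    class:

    groups = {}
    for entry in entries:
        # "similar" keys land in the same group because their
        # canonical forms collide
        groups.setdefault(canonical(entry[0]), []).append(entry)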

    If you really do need to sort, Python 2.4 provides a very nice way to
    sort by a particular key:

    >>> import operator
    >>> entries = [('a', '4'),
    ...            ('x', '7'),
    ...            ('a', '2'),
    ...            ('b', '7'),
    ...            ('x', '4')]
    >>> entries.sort(key=operator.itemgetter(1))
    >>> entries
    [('a', '2'), ('a', '4'), ('x', '4'), ('x', '7'), ('b', '7')]

    Here, I've sorted the entries by the second item in each tuple. If you
    go this route, you should also look at itertools.groupby:

    >>> import itertools
    >>> entries = [('a', '4'),
    ...            ('x', '7'),
    ...            ('a', '2'),
    ...            ('b', '7'),
    ...            ('x', '4')]
    >>> entries.sort(key=operator.itemgetter(1))
    >>> for key, values in itertools.groupby(entries, operator.itemgetter(1)):
    ...     print key, list(values)
    ...
    2 [('a', '2')]
    4 [('a', '4'), ('x', '4')]
    7 [('x', '7'), ('b', '7')]

    The groupby basically does the sort of grouping of a sorted list that I
    think you had in mind...

    Steve
    Steven Bethard, Dec 7, 2004
    #2

  3. Paul

    Paul Guest

    I really do need to sort. It is complicated and I haven't said why, but
    it will help in finding similar keys later on. Sorry I can't be more
    precise, this has to do with my research.

    Your two other suggestions with itertools and operator are more useful,
    but I was mostly wondering about performance issues.

    Is this reasonable to do on 10^8 elements with repeats in the keys? I
    guess I should just try and see for myself.
    Paul, Dec 7, 2004
    #3
  4. Paul wrote:
    > Is this reasonable to do on 10^8 elements with repeats in the keys? I
    > guess I should just try and see for myself.


    Yeah, that's usually the right solution. I didn't comment on
    space/speed issues because they're so data dependent in a situation like
    this, and without actually looking at your data, I doubt anyone here can
    even really ballpark an answer for you. And if we had your data, we'd
    probably just try to load it and see what happens anyway. ;)

    Steve
    Steven Bethard, Dec 7, 2004
    #4
  5. Paul wrote:
    > I have a large database of 15GB, consisting of 10^8 entries of
    > approximately 100 bytes each.

    That works out to 10 gigabytes of data, incidentally, not 15.
    > A few thoughts on this:
    > - Space is not going to be an issue. I have a Tb available.

    I presume this is disk space, not memory. If you do have a TB of RAM
    and you are using CrayPython, just do it. From the above, I'd figure
    (very hand-wavishly) that you want about 20G of RAM (maybe as little
    as 15G). That's a lot to me.

    > Any comments?

    I'd draw a random sample of the data to work with. The Python cookbook
    has some good recipes for drawing an unbiased sample (see the sampling
    sketch just after the code below). If you work with about 10,000
    elements, you can probably still see patterns, but can try all kinds
    of weird stuff experimentally. Once you have a feel for the key
    distributions, cut the distribution into key ranges that "guarantee"
    (not really, but in a statistical sense) the data will fit comfortably
    in memory (say a third of your available RAM), and process each range
    separately. The sort should then be duck soup:

    ordered_dest = open('ordered_by_key1', 'w')
    for interval in ranges_in_order():
        data = extract_data(interval)
        data.sort(key=key1)
        for entry in data:
            ordered_dest.write(entry)
        data = []  # avoid holding the older interval's data during extract
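
    For the sampling step above, here is a minimal sketch of one
    cookbook-style recipe (Vitter's "algorithm R"); reading the records
    line by line is an assumption:

    import random

    def reservoir_sample(stream, k):
        # Keep a uniform random sample of k items from a stream of
        # unknown length, in a single pass.
        sample = []
        for n, item in enumerate(stream):
            if n < k:
                sample.append(item)
            else:
                j = random.randint(0, n)  # inclusive on both ends
                if j < k:
                    sample[j] = item
        return sample

    # e.g.: sample = reservoir_sample(open('mydata'), 10000)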

    > How long should I hope this sort will take? It will sound weird, but I
    > actually have 12 different key maps and I want to sort this with
    > respect to each map, so I will have to sort 12 times.


    Well, good sorting is O(N * log(N)), so you should be able to calculate
    from a sample timing (once you get out of cache effects). If you use a
    variant of the above, simply time a few of the intervals you've chosen
    and do a nice linear extrapolation.
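
    For instance, a back-of-the-envelope extrapolation from a sample
    timing, assuming the N*log(N) model (the numbers are illustrative):

    import math

    def estimate_seconds(sample_seconds, sample_n, total_n):
        # Fit the constant c in T(N) = c * N * log(N) from the sample
        # timing, then evaluate the model at the full size.
        c = sample_seconds / (sample_n * math.log(sample_n))
        return c * total_n * math.log(total_n)

    # e.g.: if a 10**6-record sample sorts in 30 seconds,
    # estimate_seconds(30.0, 10**6, 10**8) predicts the full run.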

    --Scott David Daniels
    Scott David Daniels, Dec 7, 2004
    #5
  6. Larry Bates

    Larry Bates Guest

    Paul,

    I can pretty much promise you that if you really have 10^8
    records, they should be put into a database; let the database
    do the sorting by creating indexes on the fields that you want.
    Something like MySQL should do nicely, and it's free.

    http://www.mysql.org

    Python has a good interface to the MySQL database if you want to do
    other processing on the records.
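
    For example, a minimal sketch of the load-and-index route using the
    MySQLdb package (the table layout, column names, and tab-separated
    input format are all assumptions):

    import MySQLdb

    conn = MySQLdb.connect(host='localhost', user='me',
                           passwd='secret', db='mydb')
    cur = conn.cursor()
    for line in open('mydata'):
        # assumes each line is "key<TAB>payload"
        key, payload = line.rstrip('\n').split('\t', 1)
        cur.execute("INSERT INTO entries (k, payload) VALUES (%s, %s)",
                    (key, payload))
    conn.commit()
    # One index per key map makes the ordered scan cheap.
    cur.execute("CREATE INDEX idx_k ON entries (k)")
    cur.execute("SELECT * FROM entries ORDER BY k")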

    The alternative is a good high-speed sort like Syncsort, etc.

    Good Luck,
    Larry Bates



    Paul wrote:
    > Hi all
    >
    > I have a sorting problem, but my experience with Python is rather
    > limited (3 days), so I am running this by the list first.
    >
    > I have a large database of 15GB, consisting of 10^8 entries of
    > approximately 100 bytes each. I devised a relatively simple key map on
    > my database, and I would like to order the database with respect to the
    > key.
    >
    > I expect a few repeats for most of the keys, and that's actually part
    > of what I want to figure out in the end. (Said loosely, I want to group
    > all the data entries having "similar" keys. For this I need to sort the
    > keys first (data entries having _same_ key), and then figure out which
    > keys are "similar").
    >
    > A few thoughts on this:
    > - Space is not going to be an issue. I have a TB available.
    > - The Python sort() on list should be good enough, if I can load the
    > whole database into a list/dict
    > - each data entry is relatively small, so I shouldn't use pointers
    > - Keys could be strings, integers with the usual order, whatever is
    > handy, it doesn't matter to me. The choice will probably have to do
    > with what sort() prefers.
    > - Also I will be happy with any key space size. So I guess 100*size of
    > the database will do.
    >
    > Any comments?
    > How long should I hope this sort will take? It will sound weird, but I
    > actually have 12 different key maps and I want to sort this with
    > respect to each map, so I will have to sort 12 times.
    >
    > Paul
    >
    Larry Bates, Dec 7, 2004
    #6
  7. On Tue, 07 Dec 2004 12:27:33 -0800, Paul wrote:

    > I have a large database of 15GB, consisting of 10^8 entries of
    > approximately 100 bytes each. I devised a relatively simple key map on
    > my database, and I would like to order the database with respect to the
    > key.


    You won't be able to load this into memory on a 32-bit machine, even with
    loads of swap. Maybe you could do this on x86-64 with lots of swap (or
    loadsa memory), or other 64-bit hardware. It will be _really_ slow,
    however.

    Otherwise you could do an on-disk sort (not too hard with fixed-length
    records), but this will require some coding. You'll probably need to do
    some reading to work out which sorting algorithm accesses the data least
    randomly. I think the key phrase is an "external sort" rather than an
    "internal sort".

    It's probably easiest to load the thing into a database (like
    PostgreSQL) and let it do the work for you.

    Jeremy
    Jeremy Sanders, Dec 8, 2004
    #7
  8. On Tue, 2004-12-07 at 16:47, Paul wrote:
    > I really do need to sort. It is complicated and I haven't said why, but
    > it will help in finding similar keys later on. Sorry I can't be more
    > precise, this has to do with my research.


    Precision is precisely what we require to give you an answer more
    meaningful than "write a script to load it into your favorite database
    and type 'select * from table order by column;'".

    Now, unless you have an NDA with an employer or are working on something
    classified (in which case you have already given us too much
    information and should start looking for another job and a lawyer), I
    would venture a guess that you have more to gain than lose from giving
    us more information. Decisions are hard sometimes ... is the help worth
    the risk that somebody in this forum will look at your question, say
    "hey, that is a neat idea," duplicate all of your research, and publish
    before you, shaming you into a life of asking "do you want fries with
    that" and pumping gas?

    >
    > Your two other suggestions with itertools and operator are more useful,
    > but I was mostly wondering about performance issues.


    What performance issues? Nowadays any decent laptop should be able to
    handle this dataset (from disk) without too much trouble.

    conn = connect_to_my_favorite_database()  # hypothetical helper
    c = conn.cursor()
    f = open("mydata")
    for line in f.xreadlines():
        c.execute("insert into table( fields ) values (%s,%s ... )",
                  line.split())
    conn.commit()  # commit() lives on the connection, not the cursor
    print "I'm done loading, feel free to hit control+C if you get tired"
    c.execute("select * from table order by field")
    while 1:
        print c.fetchone()

    Then, from your shell:

    myloadscript.py | gzip -9 > results.txt

    Start it up Friday night and take the weekend off. Just make sure you
    plug your laptop into the wall before you go home.

    >
    > Is this reasonable to do on 10^8 elements with repeats in the keys? I
    > guess I should just try and see for myself.


    Repeats in the keys don't matter.


    Adam DePrince
    Adam DePrince, Dec 8, 2004
    #8
  9. Paul

    Paul Guest

    The reason I am not telling you much about the data is not because I am
    afraid anyone would steal my ideas, or because I have a non-disclosure
    agreement or that I don't want to end up pumping gas.
    It is just that it is pretty freaking damn hard to even explain what is
    going on. Probably a bit harder than your usual database.

    If you really want to know, my entries are elliptic curves and my
    hashing function is an attempt at mapping them to their Serre residual
    representation modulo a given prime p.

    Now, for you to tell me something relevant about the data that I don't
    already know from some big-shot conjecture/theorem would probably
    require that you read a few grad school books on number theory. Happy?
    Paul, Dec 9, 2004
    #9
  10. Paul Rubin

    Paul Rubin Guest

    "Paul" <> writes:
    > If you really want to know, my entries are elliptic curves and my
    > hashing function is an attempt at mapping them to their Serre resdual
    > representation modulo a given prime p.
    >
    > Now, for you to tell me something relevant about the data that I don't
    > already know from some big-shot conjecture/theorem would probably
    > require that you read a few grad school books on number theory. Happy?


    If you know that much math, you shouldn't have any trouble reading
    Knuth's "The Art of Computer Programming", vol. 3, "Sorting and
    Searching", for info on how to do external sorts. External sorting
    means sorting files that are too large to fit in memory. Basically you read in
    chunks of the file that will fit in memory, sort the chunks in memory
    and write the sorted chunks to disk. Then you merge the sorted
    chunks. Since in the old days a huge amount of computer resources
    went into doing these operations, whole books (like Knuth's) have been
    written about how to handle all the details efficiently.
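
    A minimal sketch of that chunk-and-merge idea (file names and the
    chunk size are illustrative, and it sorts plain lines; a real run
    would sort on your computed key instead):

    import heapq, os, tempfile

    def write_chunk(lines):
        # Sort one in-memory chunk and park it in a temporary file.
        lines.sort()
        fd, name = tempfile.mkstemp()
        f = os.fdopen(fd, 'w')
        f.writelines(lines)
        f.close()
        return name

    def external_sort(infile, outfile, max_lines=1000000):
        # Phase 1: cut the input into sorted runs that fit in memory.
        chunks, lines = [], []
        for line in open(infile):
            lines.append(line)
            if len(lines) >= max_lines:
                chunks.append(write_chunk(lines))
                lines = []
        if lines:
            chunks.append(write_chunk(lines))
        # Phase 2: k-way merge of the runs via a heap of current heads.
        files = [open(name) for name in chunks]
        heads = [(f.readline(), i) for i, f in enumerate(files)]
        heap = [(line, i) for (line, i) in heads if line]
        heapq.heapify(heap)
        out = open(outfile, 'w')
        while heap:
            line, i = heapq.heappop(heap)
            out.write(line)
            nxt = files[i].readline()
            if nxt:
                heapq.heappush(heap, (nxt, i))
        out.close()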

    If you only want to do this sort once, your simplest approach (if
    you're using Unix/Linux) is to use the Unix/Linux "sort" command,
    which does an external sort. You'll have to first crunch your data
    into a format that the "sort" command can handle, then run "sort",
    then uncrunch if appropriate.
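
    The "crunch" step can be a few lines of Python; a sketch, assuming a
    keymap() function you have defined and illustrative file names:

    # Prefix each record with its key so sort(1) can order on column 1.
    out = open('keyed.txt', 'w')
    for record in open('mydata'):
        out.write('%s\t%s' % (keymap(record), record))
    out.close()

    Then, with GNU sort, something like
    "sort -S 1G -T /scratch keyed.txt > sorted_by_key.txt" does the
    external merge for you (-S sets the in-memory buffer, -T the
    scratch directory).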
    Paul Rubin, Dec 9, 2004
    #10
  11. Paul

    Paul Guest

    Thanks! I definitely didn't want to get into any elaborate programming,
    and the Unix sort is perfect for this.
    It sorted a tenth of my data in about 8 minutes, which is entirely
    satisfactory to me (assuming it will take ~20 times as long to do the
    whole thing).
    Your answer greatly helped!
    Paul
    Paul, Dec 9, 2004
    #11
  12. [Paul]

    > Thanks! I definitely didn't want to go into any elaborate programming
    > for this, and the Unix sort is perfect for this. It sorted a tenth of
    > my data in about 8 min, which is entirely satisfactory to me (assuming
    > it will take ~ 20 times more to do the whole thing). Your answer
    > greatly helped! Paul


    I was going to reply a bit more elaborately, but if you are happy with
    `sort', that's quite nice; you do have a solution. :)

    One of my old cases was a bit more difficult, in that the comparison
    algorithm was not _simple_ enough to easily translate into computing a
    key once that could then be saved into the record. The comparison had
    to stay live within the sort, and the whole sort was somehow
    co-routined with the application. I wrote a dual-tournament sort
    feeding a polyphase merge, and transported the algorithm through
    machines and languages. Not so long ago, I finally saved the whole
    algorithm in Python for the record, but found it to be too slow to be
    practical for huge tasks.

    Nowadays, given the same kind of problem, the size and speed of machines
    is such that I would dare trying Timsort (the standard Python sort) over
    a mmap-ed file, right from within Python. My guess is that it could
    very reasonably work, despite requiring almost no development time.
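
    A minimal sketch of that mmap idea, under stated assumptions (fixed
    100-byte records with the key in the first 16 bytes; all names are
    illustrative). Note that the index list alone is still sizable for
    10^8 records:

    import mmap, os

    RECLEN = 100   # assumed fixed record length
    KEYLEN = 16    # assumed: key bytes at the start of each record

    f = open('data.bin', 'r+b')
    size = os.fstat(f.fileno()).st_size
    m = mmap.mmap(f.fileno(), size)
    nrec = size // RECLEN

    # Timsort over record indices; only the index list lives on the
    # heap, the records themselves stay in the mapped file.
    order = sorted(range(nrec),
                   key=lambda i: m[i * RECLEN : i * RECLEN + KEYLEN])

    out = open('data.sorted', 'wb')
    for i in order:
        out.write(m[i * RECLEN : (i + 1) * RECLEN])
    out.close()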

    --
    François Pinard http://pinard.progiciels-bpi.ca
    François Pinard, Dec 9, 2004
    #12
  13. Guest

    Guest

    FWIW, the algorithms in early editions [haven't looked at recent ones]
    are designed for magnetic tapes, not disk. They do still work on disk
    (treat each tape drive as a file on disk). I had to implement a robust
    production-quality sort on MS-DOS about 20 years ago, and did it
    straight out of Knuth's book. It worked robustly, and was fast enough.
    However, there probably were then, and are sure to be now, algorithms
    specifically designed for disk that work faster. Whether they are
    documented and analysed as thoroughly as in Knuth's book is another
    question.
    Guest, Dec 10, 2004
    #13