Toward a generic Disk Sort for Java

Discussion in 'Java' started by Roedy Green, Jun 26, 2008.

  1. Roedy Green

    Roedy Green Guest

    I was thinking how you might go about writing a sort that could handle
    more data than could fit in RAM. It handled the problem is Abundance
    by checkpointing the app to disk to free up maximum RAM, then spawning
    a copy of Opt-Tech sort. My records were roughly like DataOutputStream
    would produce, so I could automatically generate the command script
    sort the fields in any way I wanted.

    I thought you might pull it off in Java this way.

    1. You write a comparator as if you were going to sort Objects in an
    ArrayList.

    2. the external sort has an add method that also takes collections.

    It accepts a "chunk" of records, and sorts them using Sun's sort.

    Then it writes them out as SERIALISED objects in heavily buffered
    stream. There may be some way to do a partial reset after each object
    to speed it up.

    Then you repeat collecting, sorting and writing another batch to
    another file.

    When you have created N files, you recycle, appending. (Optimal N to
    be determined by experiment). Ideally each file would be on a
    different physical drive.

    Then when all the records have been added, you start merging chunks
    into longer chunks, and writing out the longer chunks. Each N-way
    merge cuts the number of chunks by 1/N and increases the length of the
    chunks N times.

    on the final merge pass does not happen until the user invokes the
    Iterator to hand over the resulting records.

    Another way it might be done is the records to be sorted must by byte
    arrays, chunks effectively produced by DataOutputStream. You specify
    offset, length and key type e.g.
    int, byte, short, float, double, String.

    This would require a detailed knowledge of the bit structure of the
    records, the way you did in the olden days of assembler and C.

    This would be clumsier to use, but would avoid the overhead of
    pickling and reconstituting records on every pass.

    Then of course, there is the possibility someone has already solved
    this and done it well.

    The universe has a sneaky habit. Problems start out small, and it
    looks like a purely in RAM solution is perfectly adequate. Then they
    bit by bit grow and grow and start pushing the limits of the RAM
    solution. Suddenly you are faced with a major redesign.
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jun 26, 2008
    #1
    1. Advertising

  2. Roedy Green wrote:
    > I was thinking how you might go about writing a sort that could handle
    > more data than could fit in RAM. It handled the problem is Abundance
    > by checkpointing the app to disk to free up maximum RAM, then spawning
    > a copy of Opt-Tech sort. My records were roughly like DataOutputStream
    > would produce, so I could automatically generate the command script
    > sort the fields in any way I wanted.


    Knuth's The Art of Computer Programming, Volume 3 (Searching and
    Sorting) has long discussions on optimizing searching and sorting for
    tape drives that could be very relevant here.


    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Jun 26, 2008
    #2
    1. Advertising

  3. Roedy Green

    Arne Vajhøj Guest

    Roedy Green wrote:
    > I was thinking how you might go about writing a sort that could handle
    > more data than could fit in RAM. It handled the problem is Abundance
    > by checkpointing the app to disk to free up maximum RAM, then spawning
    > a copy of Opt-Tech sort. My records were roughly like DataOutputStream
    > would produce, so I could automatically generate the command script
    > sort the fields in any way I wanted.
    >
    > I thought you might pull it off in Java this way.
    >
    > 1. You write a comparator as if you were going to sort Objects in an
    > ArrayList.
    >
    > 2. the external sort has an add method that also takes collections.
    >
    > It accepts a "chunk" of records, and sorts them using Sun's sort.
    >
    > Then it writes them out as SERIALISED objects in heavily buffered
    > stream. There may be some way to do a partial reset after each object
    > to speed it up.
    >
    > Then you repeat collecting, sorting and writing another batch to
    > another file.
    >
    > When you have created N files, you recycle, appending. (Optimal N to
    > be determined by experiment). Ideally each file would be on a
    > different physical drive.
    >
    > Then when all the records have been added, you start merging chunks
    > into longer chunks, and writing out the longer chunks. Each N-way
    > merge cuts the number of chunks by 1/N and increases the length of the
    > chunks N times.
    >
    > on the final merge pass does not happen until the user invokes the
    > Iterator to hand over the resulting records.
    >
    > Another way it might be done is the records to be sorted must by byte
    > arrays, chunks effectively produced by DataOutputStream. You specify
    > offset, length and key type e.g.
    > int, byte, short, float, double, String.
    >
    > This would require a detailed knowledge of the bit structure of the
    > records, the way you did in the olden days of assembler and C.
    >
    > This would be clumsier to use, but would avoid the overhead of
    > pickling and reconstituting records on every pass.
    >
    > Then of course, there is the possibility someone has already solved
    > this and done it well.
    >
    > The universe has a sneaky habit. Problems start out small, and it
    > looks like a purely in RAM solution is perfectly adequate. Then they
    > bit by bit grow and grow and start pushing the limits of the RAM
    > solution. Suddenly you are faced with a major redesign.


    There are written plenty about that out of memory sorts.

    Start at:

    http://en.wikipedia.org/wiki/External_sorting

    Arne
    Arne Vajhøj, Jun 26, 2008
    #3
  4. Roedy Green

    Arne Vajhøj Guest

    Joshua Cranmer wrote:
    > Roedy Green wrote:
    >> I was thinking how you might go about writing a sort that could handle
    >> more data than could fit in RAM. It handled the problem is Abundance
    >> by checkpointing the app to disk to free up maximum RAM, then spawning
    >> a copy of Opt-Tech sort. My records were roughly like DataOutputStream
    >> would produce, so I could automatically generate the command script
    >> sort the fields in any way I wanted.

    >
    > Knuth's The Art of Computer Programming, Volume 3 (Searching and
    > Sorting) has long discussions on optimizing searching and sorting for
    > tape drives that could be very relevant here.


    Disks are random access where tapes are sequential access and that
    could change things.

    But I don't know if it actually does.

    Arne
    Arne Vajhøj, Jun 26, 2008
    #4
  5. Roedy Green

    Roedy Green Guest

    On Thu, 26 Jun 2008 13:18:41 -0400, Arne Vajhøj <>
    wrote, quoted or indirectly quoted someone who said :

    >Disks are random access where tapes are sequential access and that
    >could change things.
    >
    >But I don't know if it actually does.


    If your intermediate files are on the same physical disk drive, the
    arms will hop back and forth between them as you merge. However, If
    you have a big enough buffer, (better still if you had double
    buffering) then it would not be such a problem.

    One of the tweaking parameters is how big should be make your buffers
    and how many records should you sort at a time for each chunk. It
    gets more complicated with today's virtual RAM. I suppose you have to
    just benchmark it various ways.

    I helped write a tape sort back in the early 60s in Fortran for the
    7044.. Your buffering then was quite limited. The tricky part was
    making the tape sort come out so that be final pass went to the drive
    with the output tape mounted on it. You might have 4 to 6 intermediate
    drives. Tapes were typically on the same channel so you could not do
    simultaneous I/O.

    I don't have a feel for how sorts would behave now compared with then.
    Everything is much faster, and bigger but I don't know in what
    proportion.
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jun 26, 2008
    #5
  6. Roedy Green

    Roedy Green Guest

    On Thu, 26 Jun 2008 13:16:56 -0400, Arne Vajhøj <>
    wrote, quoted or indirectly quoted someone who said :

    >http://en.wikipedia.org/wiki/External_sorting


    the surprise was " As a rule of thumb, it is inadvisable to perform a
    more-than-20-to-30-way merge."

    I would not have guessed the optimum was now so high. I would have
    guessed something like 5.
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jun 26, 2008
    #6
  7. Roedy Green

    Mark Space Guest

    Roedy Green wrote:

    > The universe has a sneaky habit. Problems start out small, and it
    > looks like a purely in RAM solution is perfectly adequate. Then they
    > bit by bit grow and grow and start pushing the limits of the RAM
    > solution. Suddenly you are faced with a major redesign.


    With out doing any research, my gut instinct would be to start big with
    an actual general purpose merge sort, writing out temp files if needed.
    And I'd favor DataInput/OutputStream heavily over Serialization.

    The merge sort would degenerate to an in-memory shell sort below some
    limen, configure externally to the program (read from a config file, for
    example).
    Mark Space, Jun 26, 2008
    #7
  8. On Thu, 26 Jun 2008 12:28:42 -0700, Mark Space wrote:

    > Roedy Green wrote:
    >
    >> The universe has a sneaky habit. Problems start out small, and it
    >> looks like a purely in RAM solution is perfectly adequate. Then they
    >> bit by bit grow and grow and start pushing the limits of the RAM
    >> solution. Suddenly you are faced with a major redesign.

    >
    > With out doing any research, my gut instinct would be to start big with
    > an actual general purpose merge sort, writing out temp files if needed.
    > And I'd favor DataInput/OutputStream heavily over Serialization.
    >
    > The merge sort would degenerate to an in-memory shell sort below some
    > limen, configure externally to the program (read from a config file, for
    > example).
    >

    I'd go with that. As I'm sure you know, that is how traditional mainframe
    tape sorts used to work. Years back I implemented exactly that on an 8/16
    bit CPU (MC 6809) with 48K RAM and floppy disks. It was quite easy to do
    in PL/9 (a clone of PL/M) and worked surprisingly fast.

    Its described and the code provided in Kernighan & Plauger's "Software
    Tools in Pascal", though if you don't already have a copy you'll probably
    have to try used bookstores.


    --
    martin@ | Martin Gregorie
    gregorie. |
    org | Zappa fan & glider pilot
    Martin Gregorie, Jun 27, 2008
    #8
  9. Roedy Green

    Roedy Green Guest

    On Fri, 27 Jun 2008 00:15:51 +0100, Martin Gregorie
    <martin@see_sig_for_address.invalid> wrote, quoted or indirectly
    quoted someone who said :

    >Its described and the code provided in Kernighan & Plauger's "Software
    >Tools in Pascal", though if you don't already have a copy you'll probably
    >have to try used bookstores.


    In the old days, you created a string of bytes to sort, and you told
    the sort the offsets lengths and formats of the keys embedded in the
    record.

    We have been spoiled by in-RAM sorts where we can write a comparator
    that uses field names, and where the object need not be pulled
    together into a string of bytes.

    Part of the challenge is to figure out how to provide the convenience
    of an in-RAM comparator as the only thing you need to specify to use
    the external sort. Somehow the externalness should be transparent,
    just kicks in when RAM is tight.
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jun 27, 2008
    #9
  10. In article
    <pan.2008.06.26.23.15.50.357565@see_sig_for_address.invalid>,
    Martin Gregorie <martin@see_sig_for_address.invalid> wrote:

    > On Thu, 26 Jun 2008 12:28:42 -0700, Mark Space wrote:
    >
    > > Roedy Green wrote:
    > >
    > >> The universe has a sneaky habit. Problems start out small, and it
    > >> looks like a purely in RAM solution is perfectly adequate. Then they
    > >> bit by bit grow and grow and start pushing the limits of the RAM
    > >> solution. Suddenly you are faced with a major redesign.

    > >
    > > With out doing any research, my gut instinct would be to start big with
    > > an actual general purpose merge sort, writing out temp files if needed.
    > > And I'd favor DataInput/OutputStream heavily over Serialization.
    > >
    > > The merge sort would degenerate to an in-memory shell sort below some
    > > limen, configure externally to the program (read from a config file, for
    > > example).
    > >

    > I'd go with that. As I'm sure you know, that is how traditional mainframe
    > tape sorts used to work. Years back I implemented exactly that on an 8/16
    > bit CPU (MC 6809) with 48K RAM and floppy disks. It was quite easy to do
    > in PL/9 (a clone of PL/M) and worked surprisingly fast.
    >
    > Its described and the code provided in Kernighan & Plauger's "Software
    > Tools in Pascal", though if you don't already have a copy you'll probably
    > have to try used bookstores.


    Excellent! Or Wirth's "Algorithms + Data Structures = Programs"; seen
    here in Oberon, p 64:

    <http://www-old.oberon.ethz.ch/WirthPubl/AD.pdf>

    Or, as the third shift operator once told me: throw it into a database
    and let Codd sort it out;-)

    --
    John B. Matthews
    trashgod at gmail dot com
    home dot woh dot rr dot com slash jbmatthews
    John B. Matthews, Jun 27, 2008
    #10
  11. Roedy Green

    Roedy Green Guest

    On Thu, 26 Jun 2008 22:28:24 -0400, "John B. Matthews"
    <> wrote, quoted or indirectly quoted someone who
    said :

    >Or, as the third shift operator once told me: throw it into a database
    >and let Codd sort it out;-)


    It sounds like a joke, but I suspect that is how most big sorts are
    done. Back in the olden days external sorts were pretty well the
    first software to go commercial. Nearly everything else you wrote
    yourself. You just don't see ads for them like you used to.
    Opt-Tech is still around, but very quiet.


    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jun 27, 2008
    #11
  12. In article <>,
    Roedy Green <> wrote:

    > On Thu, 26 Jun 2008 22:28:24 -0400, "John B. Matthews"
    > <> wrote, quoted or indirectly quoted someone who
    > said :
    >
    > >Or, as the third shift operator once told me: throw it into a database
    > >and let Codd sort it out;-)

    >
    > It sounds like a joke, but I suspect that is how most big sorts are
    > done. Back in the olden days external sorts were pretty well the
    > first software to go commercial. Nearly everything else you wrote
    > yourself. You just don't see ads for them like you used to.
    > Opt-Tech is still around, but very quiet.


    Yes, a terrible pun on Codd and God, but you're right about the olden
    days. At the same time, I wonder if a JDBC-enabled database may be a
    useful abstraction for a generic, flat-file disk sort.

    Yes, I know the project is already out of hand:)

    --
    John B. Matthews
    trashgod at gmail dot com
    home dot woh dot rr dot com slash jbmatthews
    John B. Matthews, Jun 27, 2008
    #12
  13. Roedy Green

    Arne Vajhøj Guest

    Roedy Green wrote:
    > On Thu, 26 Jun 2008 22:28:24 -0400, "John B. Matthews"
    > <> wrote, quoted or indirectly quoted someone who
    > said :
    >> Or, as the third shift operator once told me: throw it into a database
    >> and let Codd sort it out;-)

    >
    > It sounds like a joke, but I suspect that is how most big sorts are
    > done. Back in the olden days external sorts were pretty well the
    > first software to go commercial. Nearly everything else you wrote
    > yourself. You just don't see ads for them like you used to.
    > Opt-Tech is still around, but very quiet.


    Data manipulation are done in databases or in memory today.

    The use of large flat files for data is rare today, so little
    demand for sort utilities.

    Arne
    Arne Vajhøj, Jun 28, 2008
    #13
  14. Roedy Green

    Roedy Green Guest

    On Thu, 26 Jun 2008 16:44:26 GMT, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >I was thinking how you might go about writing a sort that could handle
    >more data than could fit in RAM.


    I have written this up a student project

    http://mindprod.com/jgloss/project/externalsort.html
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jun 28, 2008
    #14
  15. On Fri, 27 Jun 2008 14:47:06 -0400, John B. Matthews wrote:

    > At the same time, I wonder if a JDBC-enabled database may be a
    > useful abstraction for a generic, flat-file disk sort.
    >

    It would be interesting to see how tipping records into an indexed
    table and then reading them out again compared for speed with the
    traditional in-memory sorted pre-string phase followed by n-way merge
    passes. I don't think I'd lay bets either way.

    Of course, if your system has enough swap space you could just substitute
    a TreeMap for JDBC+database.


    --
    martin@ | Martin Gregorie
    gregorie. |
    org | Zappa fan & glider pilot
    Martin Gregorie, Jun 28, 2008
    #15
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Jas Shultz
    Replies:
    0
    Views:
    936
    Jas Shultz
    Dec 3, 2003
  2. Roedy Green

    Toward Terser Java

    Roedy Green, Jul 5, 2005, in forum: Java
    Replies:
    24
    Views:
    860
    Dale King
    Jul 15, 2005
  3. Steven Bethard

    generic object - moving toward PEP

    Steven Bethard, Nov 19, 2004, in forum: Python
    Replies:
    6
    Views:
    281
    Steven Bethard
    Nov 21, 2004
  4. Replies:
    12
    Views:
    507
    santosh
    Nov 15, 2006
  5. Navin
    Replies:
    1
    Views:
    672
    Ken Schaefer
    Sep 9, 2003
Loading...

Share This Page