sorting large file

Discussion in 'C++' started by bisuvious, Mar 25, 2007.

  1. bisuvious

    bisuvious Guest

    hi all,

    I am looking for a technique to sort data of a large file, say 2GB/
    4GB.
    Please suggest any algorithm, if you have any idea.


    thanks

    bisuvious
     
    bisuvious, Mar 25, 2007
    #1

  2. bisuvious wrote:
    > hi all,
    >
    > I am looking for a technique to sort data of a large file, say 2GB/
    > 4GB.
    > Please suggest any algorithm, if you have any idea.


    Provide more details. How much memory do you have? Do you
    need a stable sort or not? Is the file a text file or a
    binary file?

    If you have a lot of memory available, perhaps you could read
    the entire file into memory and then use std::sort. If not,
    it will get complicated.
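
    Something like this, for instance -- a minimal sketch that assumes the
    data is a flat binary file of ints, that it really does fit in RAM, and
    that uses made-up file names ("data.bin", "sorted.bin"):

    #include <algorithm>
    #include <fstream>
    #include <vector>

    int main()
    {
        // Slurp the whole file into memory (only works if it fits in RAM).
        std::ifstream in("data.bin", std::ios::binary);
        std::vector<int> v;
        int x;
        while (in.read(reinterpret_cast<char*>(&x), sizeof x))
            v.push_back(x);

        // Sort in memory.
        std::sort(v.begin(), v.end());

        // Write the sorted data back out.
        std::ofstream out("sorted.bin", std::ios::binary);
        if (!v.empty())
            out.write(reinterpret_cast<const char*>(&v[0]),
                      v.size() * sizeof(int));
    }

    Reading one int at a time is slow; in practice you would read larger
    blocks, but it keeps the sketch short.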

    HTH,
    - J.
     
    Jacek Dziedzic, Mar 25, 2007
    #2

  3. bisuvious

    osmium Guest

    "bisuvious" writes:

    > I am looking for a technique to sort data of a large file, say 2GB/
    > 4GB.
    > Please suggest any algorithm, if you have any idea.


    Break the file into big "chunks". Sort each chunk with qsort and store the
    results in a file. Now open all the files and merge the entries into a
    single file. Find out how many files you can open on your OS before starting
    the endeavor.
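
    A rough sketch of the first phase, assuming the data is a flat binary
    file of ints; the chunk size, the "run%d.bin" file names and the use of
    std::sort rather than qsort are my own choices here:

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <fstream>
    #include <vector>

    int main()
    {
        // Number of ints per chunk -- pick it so a chunk fits in RAM.
        const std::size_t CHUNK = 32 * 1024 * 1024;   // 32M ints = 128MB

        std::ifstream in("data.bin", std::ios::binary);
        std::vector<int> buf(CHUNK);
        int run = 0;

        while (in) {
            // Read up to CHUNK ints; the last chunk may be shorter.
            in.read(reinterpret_cast<char*>(&buf[0]), CHUNK * sizeof(int));
            const std::size_t got = in.gcount() / sizeof(int);
            if (got == 0)
                break;

            // Sort this chunk in memory.
            std::sort(buf.begin(), buf.begin() + got);

            // Write it out as one sorted "run".
            char name[32];
            std::sprintf(name, "run%d.bin", run++);
            std::ofstream out(name, std::ios::binary);
            out.write(reinterpret_cast<const char*>(&buf[0]),
                      got * sizeof(int));
        }
    }

    The second phase -- merging the run files -- is discussed further down
    the thread.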
     
    osmium, Mar 25, 2007
    #3
  4. On 25 Mar 2007 08:21:35 -0700, "bisuvious" wrote:
    >I am looking for a technique to sort data of a large file, say 2GB/
    >4GB.


    If you like old-fashioned 'modern' C++:
    http://www.informatik.hs-bremen.de/~brey/stlbe.html
    Chapter 10: External sorting

    Good luck,
    Roland Pibinger
     
    Roland Pibinger, Mar 25, 2007
    #4
  5. bisuvious

    terminator Guest

    On Mar 25, 7:21 pm, "bisuvious" <> wrote:
    > hi all,
    >
    > I am looking for a technique to sort data of a large file, say 2GB/
    > 4GB.
    > Please suggest any algorithm, if you have any idea.
    >
    > thanks
    >
    > bisuvious


    Well, what size are the records stored in the file?
     
    terminator, Mar 25, 2007
    #5
  6. bisuvious

    bisuvious Guest

    On Mar 25, 11:21 pm, "bisuvious" <> wrote:
    > hi all,
    >
    > I am looking for a technique to sort data of a large file, say 2GB/
    > 4GB.
    > Please suggest any algorithm, if you have any idea.
    >
    > thanks
    >
    > bisuvious


    Hi, here are more details of my problem...

    The file size will be on the order of GB. The data to be sorted are
    integers. Main memory (256MB-512MB) is assumed to be smaller than the
    file size.

    Theoretically we can use the external-memory merge sort procedure, but
    phase II (merging) of this procedure seems to be problematic.

    Please suggest in this regard.

    thanks
    bisu
     
    bisuvious, Mar 26, 2007
    #6
  7. bisuvious

    terminator Guest

    On Mar 26, 10:16 am, "bisuvious" <> wrote:
    >
    > Theoretically we can use the external-memory merge sort procedure, but
    > phase II (merging) of this procedure seems to be problematic.
    >


    I agree.
    If you are short on memory you can define some container type whose
    elements are stored on disk (in a file) rather than in RAM, then use
    the standard algorithms (e.g. std::sort), but this looks too slow and
    I am worried about how many times it would write to disk.

    osmium wrote:
    >
    >Break the file into big "chunks"
    >


    This is good: if you're using 32-bit ints then we are dealing with the
    range [0, 4G-1] of integers. You can divide this range into a few (say
    8) files, each one catching the numbers in its own sub-range (say
    [0, 512M-1], [512M, 1G-1], ... [3.5G, 4G-1] respectively), then
    delete/archive the original file and merge the above files into one.
    I do not see much of a problem in merging any more.
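
    A rough sketch of that idea, assuming unsigned 32-bit values in a flat
    binary file, made-up file names, and that each bucket (roughly 1/8 of
    the data if the values are spread out evenly) fits in RAM:

    #include <algorithm>
    #include <cstdio>
    #include <fstream>
    #include <vector>

    int main()
    {
        const int      NBUCKETS = 8;
        const unsigned RANGE    = 1u << 29;        // 512M values per bucket

        // Pass 1: scatter each value into the bucket file for its range.
        {
            std::ifstream in("data.bin", std::ios::binary);
            std::ofstream bucket[NBUCKETS];
            for (int i = 0; i < NBUCKETS; ++i) {
                char name[32];
                std::sprintf(name, "bucket%d.bin", i);
                bucket[i].open(name, std::ios::binary);
            }
            unsigned x;
            while (in.read(reinterpret_cast<char*>(&x), sizeof x))
                bucket[x / RANGE].write(reinterpret_cast<const char*>(&x),
                                        sizeof x);
        }

        // Pass 2: sort each bucket in memory and append it to the output.
        // Because the buckets cover disjoint, increasing ranges, simply
        // concatenating them in order yields a fully sorted file.
        std::ofstream out("sorted.bin", std::ios::binary);
        for (int i = 0; i < NBUCKETS; ++i) {
            char name[32];
            std::sprintf(name, "bucket%d.bin", i);
            std::ifstream in(name, std::ios::binary);
            std::vector<unsigned> v;
            unsigned x;
            while (in.read(reinterpret_cast<char*>(&x), sizeof x))
                v.push_back(x);
            std::sort(v.begin(), v.end());
            if (!v.empty())
                out.write(reinterpret_cast<const char*>(&v[0]),
                          v.size() * sizeof(unsigned));
        }
    }

    With 256MB of RAM and a 4GB file you would want more than 8 buckets,
    but the idea is the same.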
     
    terminator, Mar 26, 2007
    #7
  8. bisuvious

    ManicQin Guest

    On Mar 25, 6:04 pm, "osmium" <> wrote:
    > "bisuvious" writes:
    > > I am looking for a technique to sort data of a large file, say 2GB/
    > > 4GB.
    > > Please suggest any algorithm, if you have any idea.

    >
    > Break the file into big "chunks". Sort each chunk with qsort and store the
    > results in a file. Now open all the files and merge the entries into a
    > single file. Find out how many files you can open on your OS before starting
    > the endeavor.


    I've got a question about the way you proposed.
    If he divides the data into big chunks, then the size of two chunks
    should be less than the size of RAM so the merge step won't crash,
    am I right? (Let's call that size N.)
    But... after merging two files you have to merge the big chunk (2*N)
    with a small chunk (N)... and by that exceed the RAM size... At the
    end you will merge a file of size (ORIGINAL_SIZE - N) with a file of
    size (N)...
    Isn't that equivalent to sorting the original file?
     
    ManicQin, Mar 27, 2007
    #8
  9. bisuvious

    Guest

    On Mar 26, 11:01 pm, "ManicQin" <> wrote:
    > On Mar 25, 6:04 pm, "osmium" <> wrote:
    > > Break the file into big "chunks". Sort each chunk with qsort and store the
    > > results in a file. Now open all the files and merge the entries into a
    > > single file. Find out how many files you can open on your OS before starting
    > > the endeavor.

    >
    > I've got a question about the way you proposed.
    > If he divides the data into big chunks, then the size of two chunks
    > should be less than the size of RAM so the merge step won't crash,
    > am I right? (Let's call that size N.)
    > But... after merging two files you have to merge the big chunk (2*N)
    > with a small chunk (N)... and by that exceed the RAM size... At the
    > end you will merge a file of size (ORIGINAL_SIZE - N) with a file of
    > size (N)...
    > Isn't that equivalent to sorting the original file?


    If the individual chunks are sorted internally, then you don't have to
    load the entire chunk into memory -- just the next element out of that
    chunk. Then you take the loaded element with the least value, append
    it to output, and replace it with the next value from the chunk it
    came from. If you reach the end of one chunk before the others, stop
    considering values from it.
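
    A minimal sketch of that merge, assuming the runs are sorted binary
    files of ints with made-up names run0.bin, run1.bin, ...; a
    std::priority_queue keeps track of the smallest current value across
    the runs:

    #include <cstdio>
    #include <fstream>
    #include <queue>
    #include <vector>

    struct Head {
        int value;   // the current (smallest unread) value of a run
        int run;     // which run file it came from
    };

    struct HeadGreater {
        bool operator()(const Head& a, const Head& b) const
        { return a.value > b.value; }            // makes a min-heap on value
    };

    int main()
    {
        const int NRUNS = 8;                     // however many runs you made
        std::ifstream run[NRUNS];
        std::priority_queue<Head, std::vector<Head>, HeadGreater> heap;

        // Prime the heap with the first value of each run.
        for (int i = 0; i < NRUNS; ++i) {
            char name[32];
            std::sprintf(name, "run%d.bin", i);
            run[i].open(name, std::ios::binary);
            Head h;
            h.run = i;
            if (run[i].read(reinterpret_cast<char*>(&h.value), sizeof h.value))
                heap.push(h);
        }

        std::ofstream out("sorted.bin", std::ios::binary);
        while (!heap.empty()) {
            // Emit the smallest value currently loaded ...
            Head h = heap.top();
            heap.pop();
            out.write(reinterpret_cast<const char*>(&h.value), sizeof h.value);

            // ... and replace it with the next value from the same run.
            Head next;
            next.run = h.run;
            if (run[h.run].read(reinterpret_cast<char*>(&next.value),
                                sizeof next.value))
                heap.push(next);
        }
    }

    Only one value per run is held in memory at a time (plus whatever the
    streams buffer), so the merge never needs anything close to the size
    of the file.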
     
    , Mar 27, 2007
    #9
  10. bisuvious

    osmium Guest

    "ManicQin" writes:

    >> Break the file into big "chunks". Sort each chunk with qsort and store the
    >> results in a file. Now open all the files and merge the entries into a
    >> single file. Find out how many files you can open on your OS before
    >> starting
    >> the endeavor.


    > I've got a question about the way you proposed.
    > If he divides the data into big chunks, then the size of two chunks
    > should be less than the size of RAM so the merge step won't crash,
    > am I right? (Let's call that size N.)
    > But... after merging two files you have to merge the big chunk (2*N)
    > with a small chunk (N)... and by that exceed the RAM size... At the
    > end you will merge a file of size (ORIGINAL_SIZE - N) with a file of
    > size (N)...
    > Isn't that equivalent to sorting the original file?


    Let's say you have 8 chunks and you have reached the point where they have
    all been sorted into 8 files. You open these 8 files for reading plus
    another one, for writing, which will be a final, sorted, file. The OS lets
    you see a little "window" in RAM into each file, not the entire file. One
    of those 8 files has the smallest record of the original, unsorted, file.
    Write it to the output file. Now find the next one. Repeat until done.

    Note that when making the "chunks", you have to observe record boundaries;
    the "seam" can't be just any old place.
     
    osmium, Mar 27, 2007
    #10
  11. In message <>,
    bisuvious <> writes
    >hi all,
    >
    >I am looking for a technique to sort data of a large file, say 2GB/
    >4GB.
    >Please suggest any algorithm, if you have any idea.


    Don't reinvent the wheel: read chapter 5 of Knuth's "Art of Computer
    Programming" before you write a line of code.

    --
    Richard Herring
     
    Richard Herring, Apr 4, 2007
    #11
  12. bisuvious

    4N Guest

    did you try b-trees?
     
    4N, Apr 6, 2007
    #12
  13. 4N wrote:
    > did you try b-trees?


    Not an optimal solution for "sorting" a file that does not fit into memory.

    Probably the best solution for this is the "merge sort" or some variant.

    http://en.wikipedia.org/wiki/Merge_sort

    If, however, the file does fit in memory, then reading it fully into
    memory and sorting it using std::sort or similar would be just about as
    fast as you will get.
     
    Gianni Mariani, Apr 6, 2007
    #13