sorting large file

bisuvious

hi all,

I am looking for a technique to sort the data of a large file, say 2 GB/4 GB.
Please suggest an algorithm if you have any ideas.


thanks

bisuvious
 
Jacek Dziedzic

bisuvious said:
hi all,

I am looking for a technique to sort the data of a large file, say 2 GB/4 GB.
Please suggest an algorithm if you have any ideas.

Provide more details. How much memory do you have? Do you
need a stable sort or not? Is the file a text file or a
binary file?

If you have a lot of memory available, perhaps you could read
the entire file into memory and then use std::sort. If not,
it will get complicated.

HTH,
- J.
 
osmium

bisuvious said:
I am looking for a technique to sort the data of a large file, say 2 GB/4 GB.
Please suggest an algorithm if you have any ideas.

Break the file into big "chunks". Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before starting
the endeavor.
 
terminator

bisuvious said:
hi all,

I am looking for a technique to sort the data of a large file, say 2 GB/4 GB.
Please suggest an algorithm if you have any ideas.

thanks

bisuvious

Well, what size are the records stored in the file?
 
bisuvious

bisuvious said:
hi all,

I am looking for a technique to sort the data of a large file, say 2 GB/4 GB.
Please suggest an algorithm if you have any ideas.

thanks

bisuvious

Hi, here are more details of my problem...

The file size will be on the order of gigabytes. The data to be sorted
are integers. Main memory (256 MB-512 MB) is smaller than the file
size.

Theoretically, we can use the external-memory merge sort procedure, but
phase II (merging) of this procedure seems to be problematic.

Please suggest something in this regard.

thanks
bisu
 
terminator

bisuvious said:
Theoretically, we can use the external-memory merge sort procedure, but
phase II (merging) of this procedure seems to be problematic.

I agree.
If you are facing low disk space, you can define a container type
whose elements are stored on disk (in a file) rather than in RAM, then
use the standard algorithms (e.g. std::sort), but this looks too slow,
and I am worried about the number of writes to disk.

osmium said:
Break the file into big "chunks".

This is good: if you're using 32-bit ints, then we are dealing with the
range [0, 4G-1] of integers. You can divide this range into a few (say
8) files, each one catching the numbers in its own subrange (say
[0, 512M-1], [512M, 1G-1], ..., [3.5G, 4G-1] respectively), then
delete/archive the original file and merge the above files into one.
I do not see much of a problem in merging any more.
 
ManicQin

osmium said:
Break the file into big "chunks". Sort each chunk with qsort and store the
results in a file. Now open all the files and merge the entries into a
single file. Find out how many files you can open on your OS before starting
the endeavor.

I've got a question about the way you proposed this.
If he divides the data into big chunks, then the size of two chunks
should be less than the size of RAM so the merge method won't crash,
am I right? (Let's call that size N.)
But... after merging two files you have to merge the big chunk (2*N)
with a small chunk (N), and by that exceed the RAM size... at the
end you will merge a file of size (ORIGINAL_SIZE - N) with a file
of size N...
Isn't that equivalent to sorting the original file?
 
angrybaldguy

ManicQin said:
I've got a question about the way you proposed this.
If he divides the data into big chunks, then the size of two chunks
should be less than the size of RAM so the merge method won't crash,
am I right? (Let's call that size N.)
But... after merging two files you have to merge the big chunk (2*N)
with a small chunk (N), and by that exceed the RAM size... at the
end you will merge a file of size (ORIGINAL_SIZE - N) with a file
of size N...
Isn't that equivalent to sorting the original file?

If the individual chunks are sorted internally, then you don't have to
load the entire chunk into memory -- just the next element out of that
chunk. Then you take the loaded element with the least value, append
it to output, and replace it with the next value from the chunk it
came from. If you reach the end of one chunk before the others, stop
considering values from it.
 
osmium

ManicQin said:
I've got a question about the way you proposed this.
If he divides the data into big chunks, then the size of two chunks
should be less than the size of RAM so the merge method won't crash,
am I right? (Let's call that size N.)
But... after merging two files you have to merge the big chunk (2*N)
with a small chunk (N), and by that exceed the RAM size... at the
end you will merge a file of size (ORIGINAL_SIZE - N) with a file
of size N...
Isn't that equivalent to sorting the original file?

Let's say you have 8 chunks and you have reached the point where they have
all been sorted into 8 files. You open these 8 files for reading plus
another one, for writing, which will be the final, sorted file. The OS lets
you see a little "window" in RAM into each file, not the entire file. One
of those 8 files has the smallest record of the original, unsorted file.
Write it to the output file. Now find the next one. Repeat until done.

Note that when making the "chunks", you have to observe record boundaries;
the "seam" can't be just any old place.
 
Richard Herring

bisuvious said:
hi all,

I am looking for a technique to sort the data of a large file, say 2 GB/4 GB.
Please suggest an algorithm if you have any ideas.

Don't reinvent the wheel: read chapter 5 of Knuth's "The Art of Computer
Programming" (Volume 3: Sorting and Searching) before you write a line of code.
 
Gianni Mariani

4N said:
did you try b-trees?

Not an optimal solution for "sorting" a file that does not fit into memory.

Probably the best solution for this is the "merge sort" or some variant.

http://en.wikipedia.org/wiki/Merge_sort

If, however, the file does fit in memory, then reading it fully into
memory and sorting it using std::sort or similar would be just about as
fast as you will get.
 
