Merge strategy

C

Chris

This isn't Java-specific, but I this would be a good group to ask...

I have directory containing many files of varying sizes. An external
process adds small files periodically. Small files need to merged into
large files so as to reduce the total number of files in the directory.
We want to reduce the total number of files as quickly as possible.

What's the best merge strategy, assuming that the time require to merge
two or more files is roughly proportional to their size?

Clearly it makes sense to merge small files together first before going
after large files. One strategy would be to just merge the smallest X
files on each pass. This is not optimal, though, if the smallest X files
contains one or more very large files.

Assume that 1) there is a maximum number of files we want to involve in
any given merge operation, and 2) there is a maximum file size.

Thoughts?
 
E

Eric Sosman

Chris wrote On 06/29/06 15:26,:
This isn't Java-specific, but I this would be a good group to ask...

I have directory containing many files of varying sizes. An external
process adds small files periodically. Small files need to merged into
large files so as to reduce the total number of files in the directory.
We want to reduce the total number of files as quickly as possible.

What's the best merge strategy, assuming that the time require to merge
two or more files is roughly proportional to their size?

Under that assumption, an N-way merge of all N files
in one pass is the best strategy. Yes, an N-way merge takes
more computation time than a merge of lower order, but the
CPU is manymanymany orders of magnitude faster than the I/O.
(In the time for one ten-millisecond seak-and-read, the clock
on your 3GHz CPU ticks thirty million times.)

If N is extremely large, the proportionality assumption
may not hold. You may get non-linear slowdowns from effects
like the file system reducing the amount of read-ahead to combat
overcrowding in its caches and things. It's a good deal slower
to perform a hundred 10kB reads than one 1MB read, even though
the same volume of data is transferred.
Clearly it makes sense to merge small files together first before going
after large files.

Not clear at all. Merging in multiple stages requires
making multiple passes over at least some of the data (reading
it, writing it, and reading it back in agagin), and this is
more work than if you only read and wrote each item once. If
proportionality holds, doing everything in one stage wins. You
will need to make some measurements on your system to see how
well or poorly the proportionality assumption matches reality.
One strategy would be to just merge the smallest X
files on each pass. This is not optimal, though, if the smallest X files
contains one or more very large files.

Assume that 1) there is a maximum number of files we want to involve in
any given merge operation, and 2) there is a maximum file size.

Suppose you can merge K files simultaneously without breaking
the proportionality assumption. Then a reasonable strategy for
merging your N files is

while (N > 1) {
M = min(N, K, N+1-K)
merge the M smallest files into one
delete (or forget) the merge inputs
N = N - M + 1
}

This is probably not optimal because it has only limited
"foresight." For some ideas about better patterns, take a
look at Knuth TAOCP section 5.4.9 -- he's discussing merge
patterns for external sorting and the circumstances don't seem
quite like yours, but I'm only suggesting that you seek "ideas,"
not "recipes."
 
E

Eric Sosman

Eric Sosman wrote On 06/29/06 16:06,:
[...] Then a reasonable strategy for
merging your N files is

while (N > 1) {
M = min(N, K, N+1-K)

Oh, drat: that's not right. Try

M = (N <= K) ? N : min(K, N+1-K)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top