Efficient techniques to handle large binary files

pedagani

Dear comp.lang.c++,
I'm interested in knowing the general techniques used to handle large
binary files (>10GB) efficiently, such as tweaking the filebuf, etc.
Reading chunk by chunk seems to be a popular choice even though it
complicates the algorithm implementation. I am interested in knowing
what to apply, and when, in frequently encountered scenarios. For
example, suppose I have to remove certain data patterns from a huge
file: is running the pattern-search algorithm chunk by chunk the only
option?
Thank you.
KK
 
Dietmar Kuehl

Reading chunk by chunk seems to be a popular choice even though it
complicates the algorithm implementation.

If you can read 10GB into memory, just do so and do the processing
from there... Personally, though, I wouldn't even try it this way,
even on a machine with that much physical (or at least virtual)
memory.
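
For reference, the chunk-by-chunk loop under discussion might look
like this minimal sketch (the file name and the 1 MiB chunk size are
arbitrary choices for the example):

    #include <fstream>
    #include <vector>

    int main()
    {
        std::ifstream in("huge.bin", std::ios::binary);
        std::vector<char> chunk(1 << 20);
        while (in) {
            in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
            std::streamsize got = in.gcount();  // bytes actually read
            if (got == 0)
                break;
            // ... process chunk[0] .. chunk[got - 1] here ...
        }
    }
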
I am interested in knowing what to apply, and when, in frequently
encountered scenarios.

In general, I find that the approach taken for formatted I/O works
quite well, and a similar approach can be taken for binary I/O, too:
you start off with some elementary input operations for basic types,
encapsulating them in appropriate operators, i.e. for input using
'operator>>()'. For binary I/O you should use a new class similar to
'std::istream' but not 'std::istream' itself, because that class is
meant for formatted I/O. The important abstraction is that it
internally uses a stream buffer ('std::streambuf') and obtains
individual characters from there. Once you have input operations for
the basic building blocks (e.g. integers, doubles, blobs, etc.) you
can layer input for data structures on top of them.
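
As an illustration of this layering, here is a minimal sketch; all
names are invented for the example, and native byte order plus an
IEEE-754 'double' representation are assumptions of the sketch:

    #include <cstddef>
    #include <cstdint>
    #include <streambuf>

    // A binary input stream layered on std::streambuf: it only pulls
    // raw bytes from the buffer and leaves all buffering to it.
    class binary_istream {
    public:
        explicit binary_istream(std::streambuf& sb) : sb_(&sb), good_(true) {}
        explicit operator bool() const { return good_; }

        binary_istream& operator>>(std::uint32_t& value) {
            read_raw(&value, sizeof value);  // assumes native byte order
            return *this;
        }
        binary_istream& operator>>(double& value) {
            read_raw(&value, sizeof value);  // assumes IEEE-754 doubles
            return *this;
        }

    private:
        void read_raw(void* dest, std::size_t n) {
            if (good_
                && sb_->sgetn(static_cast<char*>(dest),
                              static_cast<std::streamsize>(n))
                       != static_cast<std::streamsize>(n))
                good_ = false;  // short read: the stream went bad
        }
        std::streambuf* sb_;
        bool good_;
    };

    // Input for data structures is then layered on the basic readers:
    struct record { std::uint32_t id; double value; };
    inline binary_istream& operator>>(binary_istream& in, record& r) {
        return in >> r.id >> r.value;
    }

A 'std::filebuf' (e.g. the buffer returned by 'std::ifstream::rdbuf()')
plugs in unchanged, so the block-by-block reads stay hidden inside it.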

Even though the basic input operations might operate on individual
characters or on relatively small entities, the added processing for
these elements typically does not matter at all because the actual
I/O waits are much bigger. This way, the block structure of the
file is nicely abstracted in the file buffer (or whatever stream
buffer you are using).
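
On the filebuf-tweaking point from the question: one common tweak is
to install a larger buffer before opening the file. Whether
'pubsetbuf()' honours the request is implementation-defined, so treat
this sketch as a hint, not a guarantee (the file name and buffer size
are arbitrary choices for the example):

    #include <fstream>
    #include <vector>

    int main()
    {
        std::vector<char> buf(4 << 20);  // ask for a 4 MiB buffer
        std::ifstream in;
        in.rdbuf()->pubsetbuf(buf.data(),
                              static_cast<std::streamsize>(buf.size()));
        in.open("huge.bin", std::ios::binary);
        // ... reads now go through the enlarged buffer, if honoured ...
    }
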
For example, suppose I have to remove certain data patterns from a
huge file: is running the pattern-search algorithm chunk by chunk the
only option?

Since the patterns typically won't respect the block structure of the
file, this does not necessarily work. If your pattern search is
relatively simple, however, you might indeed just fill a buffer and
grow it when necessary to detect patterns. One thing you might want
to try is using 'std::istreambuf_iterator<char>' to read your file,
buffering the matched portion of a pattern before sending it on.
However, many implementations of 'std::istreambuf_iterator<char>'
are not really that good. Alternatively, you might process individual
characters obtained directly from the stream buffer instead of going
through a stream buffer iterator.
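
As a sketch of the direct stream-buffer approach: the following copies
its input to its output while deleting every occurrence of a byte
pattern, keeping only the pending bytes that still match a prefix of
the pattern, so matches that straddle the underlying block boundaries
are still found. It uses naive prefix matching rather than KMP, so the
worst case is O(n*m); the function name is invented for the example.

    #include <streambuf>
    #include <string>

    void remove_pattern(std::streambuf& in, std::streambuf& out,
                        const std::string& pattern)
    {
        using traits = std::char_traits<char>;
        std::string window;  // bytes still matching a prefix of 'pattern'
        for (int c = in.sbumpc(); c != traits::eof(); c = in.sbumpc()) {
            window.push_back(traits::to_char_type(c));
            // shrink the window until it is again a prefix of 'pattern'
            while (!window.empty()
                   && pattern.compare(0, window.size(), window) != 0) {
                out.sputc(window[0]);
                window.erase(0, 1);
            }
            if (window.size() == pattern.size())
                window.clear();  // full match: emit nothing
        }
        out.sputn(window.data(),
                  static_cast<std::streamsize>(window.size()));  // flush leftover prefix
    }

Called with, e.g., 'remove_pattern(*in.rdbuf(), *out.rdbuf(), pattern)'
for an open 'std::ifstream in' and 'std::ofstream out'. Note that this
removes occurrences in a single left-to-right pass; it does not re-scan
text that a deletion joins together.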

The best approach depends on what you really want to do and whether
you want to retain most, or at least some, of the data after reading
it.
 
