Efficient techniques to handle large binary files

pedagani

Dear comp.lang.c++,
I'm interested in knowing the general techniques used to handle large
binary files (>10GB) efficiently, such as tweaking the filebuf, etc.
Reading chunk by chunk seems to be a popular choice even though it
complicates the algorithm implementation. I am interested in knowing
what to apply, and when, in frequently encountered scenarios. For
example, suppose I have to remove certain data patterns from a huge
file: is running the pattern-search algorithm chunk by chunk the only
option?
Thank you.
KK
 
Dietmar Kuehl

Reading chunk by chunk seems to be a popular choice even though it
complicates the algorithm implementation.

If you can read 10GB into memory, just do so and do the processing
from there... Personally, though, I wouldn't even try it this way,
even on a machine with that much physical (or at least virtual)
memory.
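
For reference, the chunk-by-chunk loop under discussion might look
like this minimal sketch (the file name and the 1 MiB chunk size are
arbitrary choices for the example):

    #include <fstream>
    #include <vector>

    int main()
    {
        std::ifstream in("huge.bin", std::ios::binary);
        std::vector<char> chunk(1 << 20);
        while (in) {
            in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
            std::streamsize got = in.gcount();  // bytes actually read
            if (got == 0)
                break;
            // ... process chunk[0] .. chunk[got - 1] here ...
        }
    }
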
I am interested in knowing what to apply, and when, in frequently
encountered scenarios.

In general, I find that the approach taken for formatted I/O works
quite well, and a similar approach can be taken for binary I/O, too:
you start off with some elementary input operations for basic types,
encapsulating them in appropriate operators, i.e. for input using
'operator>>()'. For binary I/O you should use a new class similar to
'std::istream' but not 'std::istream' itself, because that class is
meant for formatted I/O. The important abstraction is that it
internally uses a stream buffer ('std::streambuf') and obtains
individual characters from there. Once you have input operations for
the basic building blocks (e.g. integers, doubles, blobs, etc.) you
can layer input for data structures on top of them.
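
As an illustration of this layering, here is a minimal sketch; all
names are invented for the example, and native byte order plus an
IEEE-754 'double' representation are assumptions of the sketch:

    #include <cstddef>
    #include <cstdint>
    #include <streambuf>

    // A binary input stream layered on std::streambuf: it only pulls
    // raw bytes from the buffer and leaves all buffering to it.
    class binary_istream {
    public:
        explicit binary_istream(std::streambuf& sb) : sb_(&sb), good_(true) {}
        explicit operator bool() const { return good_; }

        binary_istream& operator>>(std::uint32_t& value) {
            read_raw(&value, sizeof value);  // assumes native byte order
            return *this;
        }
        binary_istream& operator>>(double& value) {
            read_raw(&value, sizeof value);  // assumes IEEE-754 doubles
            return *this;
        }

    private:
        void read_raw(void* dest, std::size_t n) {
            if (good_
                && sb_->sgetn(static_cast<char*>(dest),
                              static_cast<std::streamsize>(n))
                       != static_cast<std::streamsize>(n))
                good_ = false;  // short read: the stream went bad
        }
        std::streambuf* sb_;
        bool good_;
    };

    // Input for data structures is then layered on the basic readers:
    struct record { std::uint32_t id; double value; };
    inline binary_istream& operator>>(binary_istream& in, record& r) {
        return in >> r.id >> r.value;
    }

A 'std::filebuf' (e.g. the buffer returned by 'std::ifstream::rdbuf()')
plugs in unchanged, so the block-by-block reads stay hidden inside it.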

Even though the basic input operations might operate on individual
characters or on relatively small entities, the added processing for
these elements typically does not matter at all because the actual
I/O waits are much bigger. This way, the block structure of the
file is nicely abstracted in the file buffer (or whatever stream
buffer you are using).
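
On the filebuf-tweaking point from the question: one common tweak is
to install a larger buffer before opening the file. Whether
'pubsetbuf()' honours the request is implementation-defined, so treat
this sketch as a hint, not a guarantee (the file name and buffer size
are arbitrary choices for the example):

    #include <fstream>
    #include <vector>

    int main()
    {
        std::vector<char> buf(4 << 20);  // ask for a 4 MiB buffer
        std::ifstream in;
        in.rdbuf()->pubsetbuf(buf.data(),
                              static_cast<std::streamsize>(buf.size()));
        in.open("huge.bin", std::ios::binary);
        // ... reads now go through the enlarged buffer, if honoured ...
    }
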
For example, suppose I have to remove certain data patterns from a
huge file: is running the pattern-search algorithm chunk by chunk the
only option?

Since the patterns typically won't respect the block structure of the
file, this does not necessarily work. If your pattern search is
relatively simple, however, you might indeed just fill a buffer and
grow it when necessary to detect patterns. One thing you might want
to try is using 'std::istreambuf_iterator<char>' to read your file,
buffering the matched portion of a pattern before sending it on.
However, many implementations of 'std::istreambuf_iterator<char>'
are not really that good. Alternatively, you might process individual
characters obtained directly from the stream buffer instead of going
through a stream buffer iterator.
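
As a sketch of the direct stream-buffer approach: the following copies
its input to its output while deleting every occurrence of a byte
pattern, keeping only the pending bytes that still match a prefix of
the pattern, so matches that straddle the underlying block boundaries
are still found. It uses naive prefix matching rather than KMP, so the
worst case is O(n*m); the function name is invented for the example.

    #include <streambuf>
    #include <string>

    void remove_pattern(std::streambuf& in, std::streambuf& out,
                        const std::string& pattern)
    {
        using traits = std::char_traits<char>;
        std::string window;  // bytes still matching a prefix of 'pattern'
        for (int c = in.sbumpc(); c != traits::eof(); c = in.sbumpc()) {
            window.push_back(traits::to_char_type(c));
            // shrink the window until it is again a prefix of 'pattern'
            while (!window.empty()
                   && pattern.compare(0, window.size(), window) != 0) {
                out.sputc(window[0]);
                window.erase(0, 1);
            }
            if (window.size() == pattern.size())
                window.clear();  // full match: emit nothing
        }
        out.sputn(window.data(),
                  static_cast<std::streamsize>(window.size()));  // flush leftover prefix
    }

Called with, e.g., 'remove_pattern(*in.rdbuf(), *out.rdbuf(), pattern)'
for an open 'std::ifstream in' and 'std::ofstream out'. Note that this
removes occurrences in a single left-to-right pass; it does not re-scan
text that a deletion joins together.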

The best approach depends on what you really want to do and whether
you want to retain most, or at least some, of the data after reading
it.
 
