File Processing

Discussion in 'C++' started by Jeff, Sep 30, 2008.

  1. Jeff

    Jeff Guest

    Hello

    I want to read and process and rewrite a very large disk based file
    (>3Gbytes) as quickly as possible.
    The processing effectively involves finding certain strings and replacing
    them with other strings of equal length such that the file size is unaltered
    (the file is uncompressed btw). I wondered if anyone could advise me of the
    best way to do this and also of things to avoid. More specifically I was
    wondering :-

    -Is it best to open a single file for read-write access and overwrite the
    changed bytes or would it be better to create a new file?
    -Is there any point in buffering bytes in rather than reading one byte at a
    time or does this just defeat the buffering that's done by the OS anyway?
    -Would this benefit from multi-threading - read, process, write?

    And finally could anyone point me to any sample code which already does this
    sort of thing in the fastest possible way?

    Many Thanks
    Jeff
    Jeff, Sep 30, 2008
    #1

  2. James Kanze

    James Kanze Guest

    On Sep 30, 9:35 pm, Victor Bazarov <> wrote:
    > Jeff wrote:
    > > I want to read and process and rewrite a very large disk based file
    > > (>3Gbytes) as quickly as possible.
    > > The processing effectively involves finding certain strings and replacing
    > > them with other strings of equal length such that the file size is unaltered
    > > (the file is uncompressed btw). I wondered if anyone could advise me of the
    > > best way to do this and also of things to avoid. More specifically I was
    > > wondering :-


    > > -Is it best to open a single file for read-write access and overwrite the
    > > changed bytes or would it be better to create a new file?


    > It is always a good idea to leave the old file intact, unless you
    > somehow can ensure that a single write operation will never fail and
    > that an incomplete set of find/replace operations is still OK. Ask in
    > any database development newsgroup.


    This is generally true, but he said a "very large" file. I'd
    have some hesitations about making a copy if the file size were,
    say, 100 Gigabytes.

    As always, you have to weigh the trade offs. Making a copy is
    certainly a safer solution, if you can afford it.

    > > -Is there any point in buffering bytes in rather than
    > > reading one byte at a time or does this just defeat the
    > > buffering that's done by the OS anyway?


    > You'd have to experiment. The C++ language does not define any
    > buffering as far as the OS is concerned.


    C++ does define buffering in iostreams. But the fastest
    solution will almost certainly involve platform specific
    requests. I'd probably start by using mmap on a Unix system, or
    CreateFileMapping/MapViewOfFile under Windows. If performance
    is really an issue, he'll probably have to experiment with
    different solutions, but I'd be surprised if anything was
    significantly faster than using a memory mapped file, modified
    in place.

    But of course, as you pointed out above, this solution doesn't
    provide transactional integrity. And it only works if the
    process has enough available address space to map the file.
    (Probably no problem on a 64 bit processor, but likely not the
    case on a 32 bit one.)
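
    A minimal sketch of that modify-in-place approach via mmap, assuming
    POSIX; the file name and the search/replacement strings here are
    placeholders, and error handling is kept to a bare minimum:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        // Placeholder strings; they must have the same length so the
        // file size is unchanged.
        char const old_str[] = "OLDTEXT";
        char const new_str[] = "NEWTEXT";
        size_t const len = sizeof(old_str) - 1;

        int fd = open("big.dat", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        // Map the whole file read/write; with MAP_SHARED the stores
        // below are written back to the file.  (Mapping a >3GB file
        // needs a 64 bit address space, as noted above.)
        char* p = static_cast<char*>(mmap(0, st.st_size,
            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        // Scan and overwrite matches in place.
        for (size_t i = 0; i + len <= static_cast<size_t>(st.st_size); ++i) {
            if (memcmp(p + i, old_str, len) == 0)
                memcpy(p + i, new_str, len);
        }

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }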

    > > -Would this benefit from multi-threading - read, process, write?


    > Unlikely. Processing will take so little time compared to the
    > I/O, and I/O is going to be the bottleneck anyway, so...


    If he uses memory mapping, the system will take care of all of
    the IO behind his back anyway. Otherwise, some sort of
    asynchronous I/O can sometimes improve performance.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Oct 1, 2008
    #2

  3. Guest

    Guest

    On Sep 30, 8:44 pm, "Jeff" <> wrote:
    > Hello
    >
    > I want to read and process and rewrite a very large disk based file
    > (>3Gbytes) as quickly as possible.
    > The processing effectively involves finding certain strings and replacing
    > them with other strings of equal length such that the file size is unaltered
    > (the file is uncompressed btw).  I wondered if anyone could advise me of the
    > best way to do this and also of things to avoid. More specifically I was
    > wondering :-
    >
    > -Is it best to open a single file for read-write access and overwrite the
    > changed bytes or would it be better to create a new file?


    Are you asking about performance or safety? As Victor pointed out
    already, it's always safer to work on a copy. Performance-wise,
    overwriting the bytes in the one file you have will be way faster
    than copying the file.

    > -Is there any point in buffering bytes in rather than reading one byte at a
    > time or does this just defeat the buffering that's done by the OS anyway?


    There is. If you intend to issue 3000000000 read() calls to read a
    3GB file, one byte at a time, you're wasting quite a lot of time
    doing the calls. Reading in, say, 1MB chunks would make it faster,
    although it complicates looking for the strings (chunk boundaries).
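
    A rough sketch of that chunked approach, with the same placeholder
    assumptions as elsewhere in this thread (file names, strings, chunk
    size): holding back the last pattern-length-minus-one bytes and
    carrying them into the next chunk is one way to catch matches that
    straddle a chunk boundary.

    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main()
    {
        char const old_str[] = "OLDTEXT";
        char const new_str[] = "NEWTEXT";   // same length as old_str
        size_t const len = sizeof(old_str) - 1;
        size_t const chunk = 1024 * 1024;   // 1MB reads

        std::FILE* in  = std::fopen("big.dat", "rb");
        std::FILE* out = std::fopen("big.new", "wb");
        if (in == 0 || out == 0) { std::perror("fopen"); return 1; }

        std::vector<char> buf(chunk + len - 1);
        size_t carry = 0;   // unflushed tail from the previous chunk

        for (;;) {
            size_t got = std::fread(&buf[carry], 1, chunk, in);
            size_t have = carry + got;
            if (have == 0) break;

            // Replace every match fully contained in the buffer.
            for (size_t i = 0; i + len <= have; ++i)
                if (std::memcmp(&buf[i], old_str, len) == 0)
                    std::memcpy(&buf[i], new_str, len);

            // At end of file flush everything; otherwise hold back
            // len-1 bytes so a match spanning the boundary between
            // this chunk and the next isn't missed.
            size_t keep = got < chunk ? 0 : len - 1;
            std::fwrite(&buf[0], 1, have - keep, out);
            std::memmove(&buf[0], &buf[have - keep], keep);
            carry = keep;
            if (got < chunk) break;
        }

        std::fclose(out);
        std::fclose(in);
        return 0;
    }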

    > -Would this benefit from multi-threading - read, process, write?


    Not to any significant degree, unless you're doing a *lot* of
    processing to find the strings you need (like complex regexen or
    such). Very likely you're way I/O-bound here.

    > And finally could anyone point me to any sample code which already does this
    > sort of thing in the fastest possible way?


    No, but I would strongly advise you to look into memory-mapped I/O,
    if your system supports it. This is not portable in the C++ sense,
    and hence OT for this newsgroup, but it is most likely the fastest
    you can get, and -- as a bonus -- you avoid all read() and write()
    calls and need no buffering. Google for the mmap() call.
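
    And, for what it's worth, a rough sketch of the Windows counterpart
    James mentioned upthread (CreateFileMapping/MapViewOfFile), with the
    same placeholder file name and strings, and only token error
    handling:

    #include <windows.h>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        char const old_str[] = "OLDTEXT";
        char const new_str[] = "NEWTEXT";   // same length as old_str
        size_t const len = sizeof(old_str) - 1;

        HANDLE file = CreateFileA("big.dat",
            GENERIC_READ | GENERIC_WRITE, 0, 0,
            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
        if (file == INVALID_HANDLE_VALUE) {
            std::fputs("open failed\n", stderr); return 1;
        }

        LARGE_INTEGER size;
        GetFileSizeEx(file, &size);

        // PAGE_READWRITE / FILE_MAP_WRITE: the stores below are
        // written back to the file.  Mapping the whole file needs
        // enough address space, so >3GB really wants a 64 bit process.
        HANDLE mapping = CreateFileMappingA(file, 0, PAGE_READWRITE, 0, 0, 0);
        char* p = static_cast<char*>(
            MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0));
        if (p == 0) { std::fputs("map failed\n", stderr); return 1; }

        size_t n = static_cast<size_t>(size.QuadPart);
        for (size_t i = 0; i + len <= n; ++i)
            if (std::memcmp(p + i, old_str, len) == 0)
                std::memcpy(p + i, new_str, len);

        UnmapViewOfFile(p);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }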

    HTH,
    - J.
    Guest, Oct 1, 2008
    #3
  4. James Kanze

    James Kanze Guest

    On Oct 1, 2:24 pm, wrote:
    > On Sep 30, 8:44 pm, "Jeff" <> wrote:


    > No, but I would strongly advise you to look into memory-mapped
    > I/O, if your system supports it. This is not portable in C++
    > sense, and hence OT for this newsgroup, but it is most likely
    > the fastest you can get, and -- as a bonus -- you avoid all
    > read() and write() calls, and need no buffering. Google for
    > the mmap() call.


    While it's true that mmap is usually faster than naïve file
    handling, the buffering, reading and writing are still there.
    The only difference is that it's the OS which takes care of them
    (with a bit of help from the hardware), and not you. Typically,
    *IF* you're a real expert, and you're willing to invest a lot of
    time and effort, you can do better for any specific use.
    Typically, not much better, however, and typically, you're not a
    real expert (the real experts are busy implementing the code in
    the OS), and the slight gains you get aren't worth the cost.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Oct 1, 2008
    #4
  5. Guest

    Guest

    On Sep 30, 2:44 pm, "Jeff" <> wrote:
    > Hello
    >
    > I want to read and process and rewrite a very large disk based file
    > (>3Gbytes) as quickly as possible.
    > The processing effectively involves finding certain strings and replacing
    > them with other strings of equal length such that the file size is unaltered
    > (the file is uncompressed btw).  I wondered if anyone could advise me of the
    > best way to do this and also of things to avoid. More specifically I was
    > wondering :-
    >
    > -Is it best to open a single file for read-write access and overwrite the
    > changed bytes or would it be better to create a new file?
    > -Is there any point in buffering bytes in rather than reading one byte at a
    > time or does this just defeat the buffering that's done by the OS anyway?
    > -Would this benefit from multi-threading - read, process, write?
    >
    > And finally could anyone point me to any sample code which already does this
    > sort of thing in the fastest possible way?
    >
    > Many Thanks
    > Jeff


    First cut, I would look into Unix text processing tools like grep
    and sed. Why reinvent the wheel? Also, these tools are available
    for use in non-Unix environments like the PC.

    HTH
    Guest, Oct 2, 2008
    #5
  6. Jeff

    Jeff Guest

    Thanks a million for the very helpful replies.

    I'm still experimenting, but I already found that I can make
    significant (>10x) improvements in speed by reading the file in
    buffered chunks rather than byte by byte.

    Jeff
    Jeff, Oct 2, 2008
    #6
