File Processing

Discussion in 'C++' started by Jeff, Sep 30, 2008.

  1. Jeff

    Jeff Guest

    Hello

    I want to read and process and rewrite a very large disk based file
    (>3Gbytes) as quickly as possible.
    The processing effectively involves finding certain strings and replacing
    them with other strings of equal length such that the file size is unaltered
    (the file is uncompressed btw). I wondered if anyone could advise me of the
    best way to do this and also of things to avoid. More specifically I was
    wondering :-

    -Is it best to open a single file for read-write access and overwrite the
    changed bytes or would it be better to create a new file?
    -Is there any point in buffering bytes in rather than reading one byte at a
    time or does this just defeat the buffering that's done by the OS anyway?
    -Would this benefit from multi-threading - read, process, write?

    And finally could anyone point me to any sample code which already does this
    sort of thing in the fastest possible way?

    Many Thanks
    Jeff
    Jeff, Sep 30, 2008
    #1

  2. James Kanze

    James Kanze Guest

    On Sep 30, 9:35 pm, Victor Bazarov <> wrote:
    > Jeff wrote:
    > > I want to read and process and rewrite a very large disk based file
    > > (>3Gbytes) as quickly as possible.
    > > The processing effectively involves finding certain strings and replacing
    > > them with other strings of equal length such that the file size is unaltered
    > > (the file is uncompressed btw). I wondered if anyone could advise me of the
    > > best way to do this and also of things to avoid. More specifically I was
    > > wondering :-


    > > -Is it best to open a single file for read-write access and overwrite the
    > > changed bytes or would it be better to create a new file?


    > It is always a good idea to leave the old file intact, unless you
    > somehow can ensure that a single write operation will never fail and
    > that an incomplete set of find/replace operations is still OK. Ask in
    > any database development newsgroup.


    This is generally true, but he said a "very large" file. I'd
    have some hesitations about making a copy if the file size were,
    say, 100 Gigabytes.

    As always, you have to weigh the trade offs. Making a copy is
    certainly a safer solution, if you can afford it.

    > > -Is there any point in buffering bytes in rather than
    > > reading one byte at a time or does this just defeat the
    > > buffering that's done by the OS anyway?


    > You'd have to experiment. The C++ language does not define any
    > buffering as far as the OS is concerned.


    C++ does define buffering in iostreams. But the fastest
    solution will almost certainly involve platform specific
    requests. I'd probably start by using mmap on a Unix system, or
    CreateFileMapping/MapViewOfFile under Windows. If performance
    is really an issue, he'll probably have to experiment with
    different solutions, but I'd be surprised if anything was
    significantly faster than using a memory mapped file, modified
    in place.

    But of course, as you pointed out above, this solution doesn't
    provide transactional integrity. And it only works if the
    process has enough available address space to map the file.
    (Probably no problem on a 64 bit processor, but likely not the
    case on a 32 bit one.)
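
    A minimal sketch of that modify-in-place approach via mmap, assuming
    POSIX; the file name and the search/replacement strings here are
    placeholders, and error handling is kept to a bare minimum:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        // Placeholder strings; they must have the same length so the
        // file size is unchanged.
        char const old_str[] = "OLDTEXT";
        char const new_str[] = "NEWTEXT";
        size_t const len = sizeof(old_str) - 1;

        int fd = open("big.dat", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        // Map the whole file read/write; with MAP_SHARED the stores
        // below are written back to the file.  (Mapping a >3GB file
        // needs a 64 bit address space, as noted above.)
        char* p = static_cast<char*>(mmap(0, st.st_size,
            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        // Scan and overwrite matches in place.
        for (size_t i = 0; i + len <= static_cast<size_t>(st.st_size); ++i) {
            if (memcmp(p + i, old_str, len) == 0)
                memcpy(p + i, new_str, len);
        }

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }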

    > > -Would this benefit from multi-threading - read, process, write?


    > Unlikely. Processing will take so little time compared to the
    > I/O, and I/O is going to be the bottleneck anyway, so...


    If he uses memory mapping, the system will take care of all of
    the IO behind his back anyway. Otherwise, some sort of
    asynchronous I/O can sometimes improve performance.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Oct 1, 2008
    #2

  3. Guest

    Guest

    On Sep 30, 8:44 pm, "Jeff" <> wrote:
    > Hello
    >
    > I want to read and process and rewrite a very large disk based file
    > (>3Gbytes) as quickly as possible.
    > The processing effectively involves finding certain strings and replacing
    > them with other strings of equal length such that the file size is unaltered
    > (the file is uncompressed btw).  I wondered if anyone could advise me of the
    > best way to do this and also of things to avoid. More specifically I was
    > wondering :-
    >
    > -Is it best to open a single file for read-write access and overwrite the
    > changed bytes or would it be better to create a new file?


    Are you asking about performance or safety? As Victor pointed out
    already, it's always safer to work on a copy. Performance-wise,
    overwriting the bytes in the one file you have will be way faster
    than copying the file.

    > -Is there any point in buffering bytes in rather than reading one byte at a
    > time or does this just defeat the buffering that's done by the OS anyway?


    There is. If you intend to issue 3000000000 read() calls to read a
    3GB file, one byte at a time, you're wasting quite a lot of time
    doing the calls. Reading in, say, 1MB chunks would make it faster,
    although it complicates looking for the strings (chunk boundaries).
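
    A rough sketch of that chunked approach, with the same placeholder
    assumptions as elsewhere in this thread (file names, strings, chunk
    size): holding back the last pattern-length-minus-one bytes and
    carrying them into the next chunk is one way to catch matches that
    straddle a chunk boundary.

    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main()
    {
        char const old_str[] = "OLDTEXT";
        char const new_str[] = "NEWTEXT";   // same length as old_str
        size_t const len = sizeof(old_str) - 1;
        size_t const chunk = 1024 * 1024;   // 1MB reads

        std::FILE* in  = std::fopen("big.dat", "rb");
        std::FILE* out = std::fopen("big.new", "wb");
        if (in == 0 || out == 0) { std::perror("fopen"); return 1; }

        std::vector<char> buf(chunk + len - 1);
        size_t carry = 0;   // unflushed tail from the previous chunk

        for (;;) {
            size_t got = std::fread(&buf[carry], 1, chunk, in);
            size_t have = carry + got;
            if (have == 0) break;

            // Replace every match fully contained in the buffer.
            for (size_t i = 0; i + len <= have; ++i)
                if (std::memcmp(&buf[i], old_str, len) == 0)
                    std::memcpy(&buf[i], new_str, len);

            // At end of file flush everything; otherwise hold back
            // len-1 bytes so a match spanning the boundary between
            // this chunk and the next isn't missed.
            size_t keep = got < chunk ? 0 : len - 1;
            std::fwrite(&buf[0], 1, have - keep, out);
            std::memmove(&buf[0], &buf[have - keep], keep);
            carry = keep;
            if (got < chunk) break;
        }

        std::fclose(out);
        std::fclose(in);
        return 0;
    }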

    > -Would this benefit from multi-threading - read, process, write?


    Not to any significant degree, unless you're doing a *lot* of
    processing to find the strings you need (like complex regexen or
    such). Very likely you're way I/O-bound here.

    > And finally could anyone point me to any sample code which already does this
    > sort of thing in the fastest possible way?


    No, but I would strongly advise you to look into memory-mapped I/O,
    if your system supports it. This is not portable in the C++ sense,
    and hence OT for this newsgroup, but it is most likely the fastest
    you can get, and -- as a bonus -- you avoid all read() and write()
    calls and need no buffering. Google for the mmap() call.
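
    And, for what it's worth, a rough sketch of the Windows counterpart
    James mentioned upthread (CreateFileMapping/MapViewOfFile), with the
    same placeholder file name and strings, and only token error
    handling:

    #include <windows.h>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        char const old_str[] = "OLDTEXT";
        char const new_str[] = "NEWTEXT";   // same length as old_str
        size_t const len = sizeof(old_str) - 1;

        HANDLE file = CreateFileA("big.dat",
            GENERIC_READ | GENERIC_WRITE, 0, 0,
            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
        if (file == INVALID_HANDLE_VALUE) {
            std::fputs("open failed\n", stderr); return 1;
        }

        LARGE_INTEGER size;
        GetFileSizeEx(file, &size);

        // PAGE_READWRITE / FILE_MAP_WRITE: the stores below are
        // written back to the file.  Mapping the whole file needs
        // enough address space, so >3GB really wants a 64 bit process.
        HANDLE mapping = CreateFileMappingA(file, 0, PAGE_READWRITE, 0, 0, 0);
        char* p = static_cast<char*>(
            MapViewOfFile(mapping, FILE_MAP_WRITE, 0, 0, 0));
        if (p == 0) { std::fputs("map failed\n", stderr); return 1; }

        size_t n = static_cast<size_t>(size.QuadPart);
        for (size_t i = 0; i + len <= n; ++i)
            if (std::memcmp(p + i, old_str, len) == 0)
                std::memcpy(p + i, new_str, len);

        UnmapViewOfFile(p);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }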

    HTH,
    - J.
    Guest, Oct 1, 2008
    #3
  4. James Kanze

    James Kanze Guest

    On Oct 1, 2:24 pm, wrote:
    > On Sep 30, 8:44 pm, "Jeff" <> wrote:


    > No, but I would strongly advise you to look into memory-mapped
    > I/O, if your system supports it. This is not portable in C++
    > sense, and hence OT for this newsgroup, but it is most likely
    > the fastest you can get, and -- as a bonus -- you avoid all
    > read() and write() calls, and need no buffering. Google for
    > the mmap() call.


    While it's true that mmap is usually faster than naïve file
    handling, the buffering, reading and writing are still there.
    The only difference is that it's the OS which takes care of them
    (with a bit of help from the hardware), and not you. Typically,
    *IF* you're a real expert, and you're willing to invest a lot of
    time and effort, you can do better for any specific use.
    Typically, not much better, however, and typically, you're not a
    real expert (the real experts are busy implementing the code in
    the OS), and the slight gains you get aren't worth the cost.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Oct 1, 2008
    #4
  5. Guest

    Guest

    On Sep 30, 2:44 pm, "Jeff" <> wrote:
    > Hello
    >
    > I want to read and process and rewrite a very large disk based file
    > (>3Gbytes) as quickly as possible.
    > The processing effectively involves finding certain strings and replacing
    > them with other strings of equal length such that the file size is unaltered
    > (the file is uncompressed btw).  I wondered if anyone could advise me of the
    > best way to do this and also of things to avoid. More specifically I was
    > wondering :-
    >
    > -Is it best to open a single file for read-write access and overwrite the
    > changed bytes or would it be better to create a new file?
    > -Is there any point in buffering bytes in rather than reading one byte at a
    > time or does this just defeat the buffering that's done by the OS anyway?
    > -Would this benefit from multi-threading - read, process, write?
    >
    > And finally could anyone point me to any sample code which already does this
    > sort of thing in the fastest possible way?
    >
    > Many Thanks
    > Jeff


    First cut, I would look into Unix text processing tools like grep
    and sed. Why reinvent the wheel? Also, these tools are available
    for use in non-Unix environments like the PC.

    HTH
    Guest, Oct 2, 2008
    #5
  6. Jeff

    Jeff Guest

    Thanks a million for the very helpful replies.

    I'm still experimenting, but I already found that I can make
    significant (>10x) improvements in speed by reading the file in
    buffered chunks rather than byte by byte.

    Jeff
    Jeff, Oct 2, 2008
    #6
