reading large text files in reverse - optimization doubts

Discussion in 'C Programming' started by Rajorshi Biswas, Sep 24, 2005.

  1. Hi folks,
    Suppose I have a large (1 GB) text file which I want to read in
    reverse. The number of characters I want to read at a time is
    insignificant. I'm confused as to how best to do it. Upon browsing
    through this group and other sources on the web, it seems that there
    are many ways to do it.
    Some suggest that simply fseek'ing to 8K bytes before the end of file,
    and going backwards is the way. In this case, am I guaranteed best
    results if I fseek 8192 bytes behind my last position simply because
    BUFSIZ = 8192 ? In other words is 8192 an optimal number ?
    Others suggest that mmap'ing some 20-30K at a time into RAM is the way
    to go... (but why 20-30K) ?

    Another twist in the tale is that my program must be as portable as
    possible. So I'm wondering whether mmap'ing would be viable at all -
    but if it works fine on *X and Win (does Win even have mmap or an
    equivalent ?), then it would be okay.

    Any ideas on how to optimize for speed? I'll head off to do some
    profiling on Linux and WinXP.. but it needs to be as portable+fast on
    as many OS's as possible - so I need you guys' expertise :)

    Thanks in advance,
    Raj
     
    Rajorshi Biswas, Sep 24, 2005
    #1

  2. Skarmander

    Re: [OT] reading large text files in reverse - optimization doubts

    Rajorshi Biswas wrote:
    > Hi folks,
    > Suppose I have a large (1 GB) text file which I want to read in
    > reverse. The number of characters I want to read at a time is
    > insignificant. I'm confused as to how best to do it. Upon browsing
    > through this group and other sources on the web, it seems that there
    > are many ways to do it.


    This is not, strictly speaking, a C question. There's nothing in the
    language to answer a question like this.

    > Some suggest that simply fseek'ing to 8K bytes before the end of file,
    > and going backwards is the way. In this case, am I guaranteed best
    > results if I fseek 8192 bytes behind my last position simply because
    > BUFSIZ = 8192 ? In other words is 8192 an optimal number ?


    It might be. Or not. If your system page size happens to be 8K, it might
    be a good size to read.

    > Others suggest that mmap'ing some 20-30K at a time into RAM is the way
    > to go... (but why 20-30K) ?
    >

    Just because. I sincerely doubt anyone has calculated that these numbers
    are optimal. Clearly you'll want a buffer "as large as possible", but
    not so large that you're wasting I/O time reading data you won't be
    touching for a long time. Hence a few system pages' worth of data seems
    reasonable.

    > Another twist in the tale is that my program must be as portable as
    > possible. So I'm wondering whether mmap'ing would be viable at all -
    > but if it works fine on *X and Win (does Win even have mmap or an
    > equivalent ?), then it would be okay.


    Yes, Win32 has memory-mapped files. To make matters easier, however,
    there is of course no function called "mmap". I recommend using MinGW
    (Cygwin's too slow), which will allow you to pretend you're using a
    POSIX system, more or less. Otherwise you'll have to write your own
    wrappers around CreateFileMapping() and MapViewOfFile[Ex]().
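
    Such a wrapper might look roughly like this (an untested sketch that
    maps the whole file in one piece and omits most error handling; the
    function name is my own, not part of the Win32 API):

    #include <windows.h>

    static void *map_readonly(const char *path, size_t *lenp)
    {
        HANDLE file, mapping;
        LARGE_INTEGER size;
        void *view;

        file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (file == INVALID_HANDLE_VALUE)
            return NULL;
        if (!GetFileSizeEx(file, &size)) {
            CloseHandle(file);
            return NULL;
        }
        mapping = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
        CloseHandle(file);            /* the mapping holds its own reference */
        if (mapping == NULL)
            return NULL;
        view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
        CloseHandle(mapping);         /* the view holds its own reference */
        *lenp = (size_t)size.QuadPart;
        return view;                  /* release with UnmapViewOfFile() */
    }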

    > Any ideas on how to optimize for speed? I'll head off to do some
    > profiling on Linux and WinXP.. but it needs to be as portable+fast on
    > as many OS's as possible - so I need you guys' expertise :)
    >

    Reading a file in reverse will be one of the most inefficient operations
    on any OS, I'd wager. If you use the standard I/O library, you're at the
    mercy of both the platform and the C library used -- they can make a lot
    of difference individually and in tandem. The "best" buffer size,
    whether fread()ing or mmap()ing, will depend on the OS, the library, the
    file, the time needed to process the data read, the system load...
    Profiling is indeed the way to go here. It's too easy to predict
    "reasonable" outcomes that will turn out to be entirely wrong in practice.

    Clean code that uses either fseek()/fread() or mmap() should run
    "reasonably well" on most platforms. The only thing that might trip you
    up is an implementation that hits pathological performance when reading
    in reverse, but I'd expect those to be rare.

    S.
     
    Skarmander, Sep 24, 2005
    #2

  3. SM Ryan

    "Rajorshi Biswas" <> wrote:
    # Hi folks,
    # Suppose I have a large (1 GB) text file which I want to read in
    # reverse. The number of characters I want to read at a time is
    # insignificant. I'm confused as to how best to do it. Upon browsing
    # through this group and other sources on the web, it seems that there
    # are many ways to do it.
    # Some suggest that simply fseek'ing to 8K bytes before the end of file,
    # and going backwards is the way. In this case, am I guaranteed best
    # results if I fseek 8192 bytes behind my last position simply because
    # BUFSIZ = 8192 ? In other words is 8192 an optimal number ?

    stdio implementations often don't like fseek and tend to discard
    the entire buffer and refill it after a fseek. Theoretically you
    can seek to the end, fseek back one character and read it, fseek
    back two characters and read the second-to-last, and so on, all
    the way to the beginning. However, stdio will likely be reading
    in as much as BUFSIZ bytes with each getc, so you end up working
    against the buffering instead of with it.
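
    Reading fixed-size chunks with fseek/fread and walking each chunk
    backwards in memory works with the buffering instead. A rough sketch
    (untested; the chunk size is arbitrary and error handling is minimal):

    #include <stdio.h>

    #define CHUNK 8192

    /* Pass each byte of fp to handle(), last byte first.
       Assumes fp is seekable and its size fits in a long,
       which fseek/ftell require anyway. */
    void reverse_read(FILE *fp, void (*handle)(int c))
    {
        char buf[CHUNK];
        long pos, n;
        size_t i;

        fseek(fp, 0L, SEEK_END);
        pos = ftell(fp);
        while (pos > 0) {
            n = pos < CHUNK ? pos : CHUNK;
            pos -= n;
            fseek(fp, pos, SEEK_SET);
            if (fread(buf, 1, (size_t)n, fp) != (size_t)n)
                break;                /* read error; give up */
            for (i = (size_t)n; i > 0; i--)
                handle((unsigned char)buf[i - 1]);
        }
    }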

    # Others suggest that mmap'ing some 20-30K at a time into RAM is the way
    # to go... (but why 20-30K) ?

    Not all virtual memory systems can map in a gigabyte at once.
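
    If you do use mmap, you'd map a window at a time instead; note that
    the offset passed to mmap must be a multiple of the page size. A
    sketch of the windowing, assuming POSIX (the function name is mine):

    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a read-only window ending at offset 'end', at most 'want'
       bytes long, rounding the start down to a page boundary as
       mmap requires. Returns MAP_FAILED on error; the caller
       munmap()s *lenp bytes when done. */
    void *map_window(int fd, off_t end, size_t want,
                     off_t *startp, size_t *lenp)
    {
        long page = sysconf(_SC_PAGESIZE);
        off_t start = end > (off_t)want ? end - (off_t)want : 0;

        start -= start % page;        /* align the offset */
        *startp = start;
        *lenp = (size_t)(end - start);
        return mmap(NULL, *lenp, PROT_READ, MAP_PRIVATE, fd, start);
    }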

    # Any ideas on how to optimize for speed? I'll head off to do some
    # profiling on Linux and WinXP.. but it needs to be as portable+fast on
    # as many OS's as possible - so I need you guys' expertise :)

    You can turn off buffering with stdio, and then do your own
    page frame management. Same thing as mmap, but slower and a
    greater nuisance.

    You can read the file in blocks from front to back, reverse each
    block in memory, and then write the blocks to a secondary file
    back to front. You then have a reversed copy of the original file
    that you can read in the forward direction, which keeps the
    kernel and stdio happier.
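
    In outline (an untested sketch; error checking omitted, and the
    output file must be opened for writing in binary mode):

    #include <stdio.h>

    #define CHUNK 8192

    /* Write a byte-reversed copy of 'in' to 'out', which can
       then be read in the forward direction. */
    void reverse_copy(FILE *in, FILE *out)
    {
        char buf[CHUNK];
        size_t n, i;
        long total;

        fseek(in, 0L, SEEK_END);
        total = ftell(in);            /* bytes left to place */
        rewind(in);
        while ((n = fread(buf, 1, CHUNK, in)) > 0) {
            for (i = 0; i < n / 2; i++) {     /* reverse the block */
                char t = buf[i];
                buf[i] = buf[n - 1 - i];
                buf[n - 1 - i] = t;
            }
            total -= (long)n;
            fseek(out, total, SEEK_SET);      /* back to front */
            fwrite(buf, 1, n, out);
        }
    }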

    --
    SM Ryan http://www.rawbw.com/~wyrmwif/
    Wow. A sailboat.
     
    SM Ryan, Sep 24, 2005
    #3
  4. Chris Torek

    Re: [OT] reading large text files in reverse - optimization doubts

    In article <43357cad$0$11066$4all.nl>
    Skarmander <> wrote:
    >Reading a file in reverse will be one of the most inefficient operations
    >on any OS, I'd wager.


    That would depend on the OS and file system. A file system
    implemented via a B-tree with sideways links (B+ or "B-link" or
    any of those variations) could be read backwards quite quickly.
    A file system designed for use on a DVR/PVR might also use a doubly
    linked list, something like the old DOS FAT file system (i.e.
    cluster or extent based) except with links going both directions.

    The stdio I wrote for BSD (found in 4.4BSD derivatives) also
    attempts to optimize fseek() calls on files opened in read-only
    mode, so that the sequence:

    /* ensure that file is open and contains at least 1 byte */

    fseek(fp, -1L, SEEK_END);
    for (;;) {
        c = getc(fp);
        /* do something with c, e.g., putc(c) */
        if (ftell(fp) == 1)   /* the byte at offset 0 was just read */
            break;
        fseek(fp, -2L, SEEK_CUR);
    }

    behaves in a reasonably fast manner, reading one underlying file
    system block at a time and otherwise working entirely within the
    stdio buffer.

    Reading the blocks backwards is not particularly hard on a Unix-like
    file system (with inodes containing direct and indirect blocks),
    although it does not play well with the OS's read-ahead mechanism.
    It *is* particularly hard on the DOS FAT file system, which --
    through a design aimed at saving space on 360 kilobyte floppies --
    makes it impossible to find block N of any given file without first
    finding blocks zero through N-1 inclusive. Modern file systems
    (NTFS, ext2, ext3, Reiser, etc) are as good as Version 7 Unix
    though, having finally caught up to the 1970s. :) Some are even
    worthy of the 1980s, handling files larger than four gigabytes! :)

    (To explain the amusement here, I should add that UFS -- the 4.2BSD
    file system that came out in 1983 or 1984 or so -- did all this,
    way back then. The newer Linux file systems do have other merits,
    but it does seem odd that only recently have other desktop OSes
    acquired capabilities we had back in the mid-1980s.)
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
    email: forget about it http://web.torek.net/torek/index.html
    Reading email is like searching for food in the garbage, thanks to spammers.
     
    Chris Torek, Sep 25, 2005
    #4
  5. On 24 Sep 2005 06:07:46 -0700, in comp.lang.c , "Rajorshi Biswas"
    <> wrote:

    >Hi folks,
    > Suppose I have a large (1 GB) text file which I want to read in
    >reverse.


    Actually, I'd recommend you review why on earth you have text files
    that large. Logfiles perhaps? Some redesign of the creating app is in
    order, to break its logs into sensible sizes.

    >BUFSIZ = 8192 ? In other words is 8192 an optimal number ?


    Yes and no. It depends completely on your h/w. 8K may be a memory
    page or a disk sector size or something, in which case maybe it's
    optimal.

    > Others suggest that mmap'ing some 20-30K at a time into RAM is the way
    >to go... (but why 20-30K) ?


    Same answer.

    > Another twist in the tale is that my program must be as portable as
    >possible. So I'm wondering whether mmap'ing would be viable at all -
    >but if it works fine on *X and Win (does Win even have mmap or an
    >equivalent ?), then it would be okay


    mmap is a nonstandard function, so there's no knowing. If you can
    find a compiler portable across your required h/w and OSes, you may
    also have a portable version of this function.

    > Any ideas on how to optimize for speed?


    Read up on the three rules of optimization.
    --
    Mark McIntyre
    CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
    CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>

     
    Mark McIntyre, Sep 25, 2005
    #5
  6. Hmm...
    Thanks for the inputs, guys. Well, I tried an implementation using
    fseek/fread with a read chunk size of 8K, and also with 16K; both are
    reasonably fast, at least on Linux. One related doubt I had was
    whether lseek and mmap have any advantage over fseek and fread. For the
    sake of portability I discarded the lseek/mmap route, but was just
    curious whether mmap would provide any better results.

    <ducking for cover>
    .... so - profiling is the only way to find out, eh ?
    </ducking for cover>
     
    Rajorshi Biswas, Sep 26, 2005
    #6
  7. Re: [OT] reading large text files in reverse - optimization doubts

    Chris Torek wrote:

    > In article <43357cad$0$11066$4all.nl>
    > Skarmander <> wrote:


    >>Reading a file in reverse will be one of the most inefficient operations
    >>on any OS, I'd wager.


    > That would depend on the OS and file system. A file system
    > implemented via a B-tree with sideways links (B+ or "B-link" or
    > any of those variations) could be read backwards quite quickly.
    > A file system designed for use on a DVR/PVR might also use a doubly
    > linked list, something like the old DOS FAT file system (i.e.
    > cluster or extent based) except with links going both directions.


    In the days of 9 track tape, maybe even 7 track tape, it was not
    uncommon for tape drives to have the ability to read a tape
    backwards. The data was loaded into memory from high address down
    to low address, such that the block came out normally in memory.

    This ability has been lost in many of the currently popular
    drives, especially in helical scan drives.

    It was mostly useful in the case of external sorts on tape,
    where it saved a rewind between writing the tape and reading
    the data back for the next pass.

    -- glen
     
    glen herrmannsfeldt, Sep 28, 2005
    #7
