reading large text files in reverse - optimization doubts

Rajorshi Biswas

Hi folks,
Suppose I have a large (1 GB) text file which I want to read in
reverse. The number of characters I want to read at a time is
insignificant. I'm confused as to how best to do it. Upon browsing
through this group and other sources on the web, it seems that there
are many ways to do it.
Some suggest that simply fseek'ing to 8K bytes before the end of file,
and going backwards is the way. In this case, am I guaranteed best
results if I fseek 8192 bytes behind my last position simply because
BUFSIZ = 8192 ? In other words is 8192 an optimal number ?
Others suggest that mmap'ing some 20-30K at a time into RAM is the way
to go... (but why 20-30K) ?

Another twist in the tale is that my program must be as portable as
possible. So I'm wondering whether mmap'ing would be viable at all -
but if it works fine on *X and Win (does Win even have mmap or an
equivalent ?), then it would be okay.

Any ideas on how to optimize for speed? I'll head off to do some
profiling on Linux and WinXP.. but it needs to be as portable+fast on
as many OS's as possible - so I need you guys' expertise :)

Thanks in advance,
Raj
 
Skarmander

Rajorshi said:
Hi folks,
Suppose I have a large (1 GB) text file which I want to read in
reverse. The number of characters I want to read at a time is
insignificant. I'm confused as to how best to do it. Upon browsing
through this group and other sources on the web, it seems that there
are many ways to do it.

This is not, strictly speaking, a C question. There's nothing in the
language to answer a question like this.
Some suggest that simply fseek'ing to 8K bytes before the end of file,
and going backwards is the way. In this case, am I guaranteed best
results if I fseek 8192 bytes behind my last position simply because
BUFSIZ = 8192 ? In other words is 8192 an optimal number ?

It might be. Or not. If your system page size happens to be 8K, it might
be a good size to read.
Others suggest that mmap'ing some 20-30K at a time into RAM is the way
to go... (but why 20-30K) ?
Just because. I sincerely doubt anyone has calculated that these numbers
are optimal. Clearly you'll want a buffer "as large as possible", but
not so large that you're wasting I/O time reading data you're not
touching for a long time. Hence a few system pages worth of data seems
reasonable.
Another twist in the tale is that my program must be as portable as
possible. So I'm wondering whether mmap'ing would be viable at all -
but if it works fine on *X and Win (does Win even have mmap or an
equivalent ?), then it would be okay.

Yes, Win32 has memory-mapped files. To make matters easier, however,
there is of course no function called "mmap". I recommend using MinGW
(Cygwin's too slow), which will allow you to pretend you're using a
POSIX system, more or less. Otherwise you'll have to write your own
wrappers around CreateFileMapping() and MapViewOfFile[Ex]().
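On the POSIX side, the windowed-mmap() approach might look something like the sketch below. The function name mmap_reverse and the one-page window size are my own arbitrary choices, not anything canonical; a Win32 port would substitute CreateFileMapping()/MapViewOfFile() for open()/mmap().

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map one page-aligned window at a time, starting at the end of the
   file, and walk each window backwards, calling visit(c) per byte.
   One page per window is arbitrary; profile to tune it.
   Returns 0 on success, -1 on error or empty file. */
int mmap_reverse(const char *path, void (*visit)(int c))
{
    int fd = open(path, O_RDONLY);
    long pagesize = sysconf(_SC_PAGESIZE);
    struct stat st;
    off_t pos, start;
    size_t len, i;
    unsigned char *p;

    if (fd < 0 || fstat(fd, &st) != 0 || st.st_size == 0) {
        if (fd >= 0)
            close(fd);
        return -1;
    }
    for (pos = st.st_size; pos > 0; pos = start) {
        start = (pos - 1) / pagesize * pagesize;  /* mmap needs page alignment */
        len = (size_t)(pos - start);
        p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, start);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        for (i = len; i > 0; i--)     /* walk the window backwards */
            visit(p[i - 1]);
        munmap(p, len);
    }
    close(fd);
    return 0;
}
```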
Any ideas on how to optimize for speed? I'll head off to do some
profiling on Linux and WinXP.. but it needs to be as portable+fast on
as many OS's as possible - so I need you guys' expertise :)
Reading a file in reverse will be one of the most inefficient operations
on any OS, I'd wager. If you use the standard I/O library, you're at the
mercy of both the platform and the C library used -- they can make a lot
of difference individually and in tandem. The "best" buffer size,
whether fread()ing or mmap()ing, will depend on the OS, the library, the
file, the time needed to process the data read, the system load...
Profiling is indeed the way to go here. It's too easy to predict
"reasonable" outcomes that will turn out to be entirely wrong in practice.

Clean code that uses either fseek()/fread() or mmap() should run
"reasonably well" on most platforms. The only thing that might trip you
up is an implementation that hits pathological performance when reading
in reverse, but I'd expect those to be rare.
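For what it's worth, the portable fseek()/fread() variant might be sketched like this. The function name read_reverse and the 8K CHUNK are my own guesses, not recommendations; as said above, only profiling will tell you the right chunk size.

```c
#include <stdio.h>

#define CHUNK 8192   /* a guess; only profiling can pick the right size */

/* Call visit(c) for every byte of the file, last byte first.
   Returns 0 on success, -1 on error. */
int read_reverse(const char *path, void (*visit)(int c))
{
    FILE *fp = fopen(path, "rb");   /* binary mode: offsets are byte counts */
    unsigned char buf[CHUNK];
    long size, pos, n, i;

    if (fp == NULL)
        return -1;
    if (fseek(fp, 0L, SEEK_END) != 0 || (size = ftell(fp)) < 0) {
        fclose(fp);
        return -1;
    }
    for (pos = size; pos > 0; ) {
        n = pos < CHUNK ? pos : CHUNK;   /* size of this chunk */
        pos -= n;
        if (fseek(fp, pos, SEEK_SET) != 0 ||
            fread(buf, 1, (size_t)n, fp) != (size_t)n) {
            fclose(fp);
            return -1;
        }
        for (i = n - 1; i >= 0; i--)     /* walk the chunk backwards */
            visit(buf[i]);
    }
    fclose(fp);
    return 0;
}
```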

S.
 
SM Ryan

# Hi folks,
# Suppose I have a large (1 GB) text file which I want to read in
# reverse. The number of characters I want to read at a time is
# insignificant. I'm confused as to how best to do it. Upon browsing
# through this group and other sources on the web, it seems that there
# are many ways to do it.
# Some suggest that simply fseek'ing to 8K bytes before the end of file,
# and going backwards is the way. In this case, am I guaranteed best
# results if I fseek 8192 bytes behind my last position simply because
# BUFSIZ = 8192 ? In other words is 8192 an optimal number ?

stdio implementations often don't like fseek and tend to discard
the entire buffer and refill it after a fseek. Theoretically you
can seek to the end, fseek back one character and read it, fseek
back two characters and read the second-to-last, etc., all the way
to the beginning. However, it's likely stdio will read in as much
as BUFSIZ bytes with each character you get, so you'd be working
against the buffering instead of with it.

# Others suggest that mmap'ing some 20-30K at a time into RAM is the way
# to go... (but why 20-30K) ?

Not all virtual memories can map in a gigabyte.

# Any ideas on how to optimize for speed? I'll head off to do some
# profiling on Linux and WinXP.. but it needs to be as portable+fast on
# as many OS's as possible - so I need you guys' expertise :)

You can turn off buffering with stdio, and then do your own
page frame management. Same thing as mmap, but slower and a
greater nuisance.
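For reference, turning stdio buffering off is a single setvbuf() call made right after fopen(); here's a tiny sketch (the wrapper name is mine):

```c
#include <stdio.h>

/* Open a file with stdio buffering switched off; every getc() then
   goes straight to the underlying read, and the caller takes over
   all block management. Returns NULL on failure. */
FILE *open_unbuffered(const char *path)
{
    FILE *fp = fopen(path, "rb");

    if (fp != NULL && setvbuf(fp, NULL, _IONBF, 0) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```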

You can read in blocks front to back, reverse each block in
memory, and then write the blocks to a secondary file back
to front. You then have a reversed copy of the original file
that you can read in the forward direction, which makes the
kernel and stdio happier.
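A sketch of that reversed-copy pass, assuming a scratch file and an arbitrary 8K block size (reverse_copy is my name for it, nothing standard):

```c
#include <stdio.h>

#define BLK 8192

/* Create dst as a byte-reversed copy of src: read src front to back,
   reverse each block in memory, write the blocks to dst back to front.
   Returns 0 on success, -1 on error. */
int reverse_copy(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    unsigned char buf[BLK];
    long size, pos;
    size_t n;

    if (in == NULL || out == NULL)
        goto fail;
    if (fseek(in, 0L, SEEK_END) != 0 || (size = ftell(in)) < 0)
        goto fail;
    rewind(in);
    for (pos = 0; (n = fread(buf, 1, BLK, in)) > 0; pos += (long)n) {
        size_t i;
        unsigned char t;
        for (i = 0; i < n / 2; i++) {    /* reverse the block in place */
            t = buf[i];
            buf[i] = buf[n - 1 - i];
            buf[n - 1 - i] = t;
        }
        /* the block that started at pos ends at size - pos in the copy */
        if (fseek(out, size - pos - (long)n, SEEK_SET) != 0)
            goto fail;
        if (fwrite(buf, 1, n, out) != n)
            goto fail;
    }
    fclose(in);
    return fclose(out) == 0 ? 0 : -1;
fail:
    if (in != NULL)
        fclose(in);
    if (out != NULL)
        fclose(out);
    return -1;
}
```

The cost, of course, is a second gigabyte of disk and a full extra pass over the data, which only pays off if the reversed file is read more than once.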
 
Chris Torek

Reading a file in reverse will be one of the most inefficient operations
on any OS, I'd wager.

That would depend on the OS and file system. A file system
implemented via a B-tree with sideways links (B+ or "B-link" or
any of those variations) could be read backwards quite quickly.
A file system designed for use on a DVR/PVR might also use a doubly
linked list, something like the old DOS FAT file system (i.e.
cluster or extent based) except with links going both directions.

The stdio I wrote for BSD (found in 4.4BSD derivatives) also
attempts to optimize fseek() calls on files opened in read-only
mode, so that the sequence:

/* ensure that file is open and contains at least 1 byte */

fseek(fp, -1L, SEEK_END);
for (;;) {
    c = getc(fp);
    /* do something with c, e.g., putc(c) */
    if (ftell(fp) == 1)     /* just read the first byte */
        break;
    fseek(fp, -2L, SEEK_CUR);
}

behaves in a reasonably fast manner, reading one underlying file
system block at a time and otherwise working entirely within the
stdio buffer.
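Fleshed out into a self-contained function (the name reverse_getc and the visit() callback are my own framing), that loop might look like this. The caveat from earlier in the thread applies: on a stdio that discards its buffer on every fseek(), this character-at-a-time pattern will crawl.

```c
#include <stdio.h>

/* Call visit(c) for each byte, last byte first, one getc() at a time.
   Relies on stdio optimizing small backward fseek()s within its
   buffer. Returns 0 on success, -1 on error or empty file. */
int reverse_getc(const char *path, void (*visit)(int c))
{
    FILE *fp = fopen(path, "rb");
    int c;

    if (fp == NULL)
        return -1;
    if (fseek(fp, -1L, SEEK_END) != 0) {   /* fails for an empty file */
        fclose(fp);
        return -1;
    }
    for (;;) {
        c = getc(fp);
        visit(c);
        if (ftell(fp) == 1)                /* just read the first byte */
            break;
        fseek(fp, -2L, SEEK_CUR);
    }
    fclose(fp);
    return 0;
}
```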

Reading the blocks backwards is not particularly hard on a Unix-like
file system (with inodes containing direct and indirect blocks),
although it does not play well with the OS's read-ahead mechanism.
It *is* particularly hard on the DOS FAT file system, which --
through a design aimed at saving space on 360 kilobyte floppies --
makes it impossible to find block N of any given file without first
finding blocks zero through N-1 inclusive. Modern file systems
(NTFS, ext2, ext3, Reiser, etc) are as good as Version 7 Unix
though, having finally caught up to the 1970s. :) Some are even
worthy of the 1980s, handling files larger than four gigabytes! :)

(To explain the amusement here, I should add that UFS -- the 4.2BSD
file system that came out in 1983 or 1984 or so -- did all this,
way back then. The newer Linux file systems do have other merits,
but it does seem odd that only recently have other desktop OSes
acquired capabilities we had back in the mid-1980s.)
 
Mark McIntyre

Hi folks,
Suppose I have a large (1 GB) text file which I want to read in
reverse.

Actually, I'd recommend you review why on earth you have text files
that large. Logfiles perhaps? Some redesign of the creating app is in
order, to break its logs into sensible sizes.
BUFSIZ = 8192 ? In other words is 8192 an optimal number ?

Yes and no. It depends completely on your h/w. 8K may be a memory
page, or a disk sector size or something, in which case maybe it's
optimal.
Others suggest that mmap'ing some 20-30K at a time into RAM is the way
to go... (but why 20-30K) ?

Same answer.
Another twist in the tale is that my program must be as portable as
possible. So I'm wondering whether mmap'ing would be viable at all -
but if it works fine on *X and Win (does Win even have mmap or an
equivalent ?), then it would be okay

mmap is a nonstandard function, so there's no knowing. If you can
find a compiler portable across your required h/w and OSes, you may
also have a portable version of this function.
Any ideas on how to optimize for speed?

Read up on the three rules of optimization.
 
Rajorshi Biswas

Hmm...
Thanks for the inputs, guys. Well, I tried an implementation with
fseek/fread with a read chunk size of 8K, also with 16K.. both are
reasonably fast.. at least on Linux. One related doubt I had was
whether lseek and mmap have any advantage over fseek and fread. For the
sake of portability, I discarded the lseek/mmap route, but was just
curious whether mmap would provide any better results.

<ducking for cover>
.... so - profiling is the only way to find out, eh ?
</ducking for cover>
 
glen herrmannsfeldt

Chris said:
That would depend on the OS and file system. A file system
implemented via a B-tree with sideways links (B+ or "B-link" or
any of those variations) could be read backwards quite quickly.
A file system designed for use on a DVR/PVR might also use a doubly
linked list, something like the old DOS FAT file system (i.e.
cluster or extent based) except with links going both directions.

In the days of 9 track tape, maybe even 7 track tape, it was not
uncommon for tape drives to have the ability to read a tape
backwards. The data was loaded into memory from high address down
to low address, such that the block came out normally in memory.

This ability has been lost in many of the currently popular
drives, especially in helical scan drives.

It was mostly useful in the case of external sorts on tape,
where it saved a rewind between writing the tape and reading
the data back for the next pass.

-- glen
 
