fseek speed

TJ Walls · Jul 27, 2004

Hello All,

I am baffled ... I am trying to improve the speed of a program
that I have written that performs random access within a file. It
relies heavily on fseek and is very slow. To test, I wrote the following
test program which just writes the numbers 1-167721 sequentially to
a binary file:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

static long MAXNUM = 16777214;

int main() {
FILE *fp;
long i, tmp;
int j;
unsigned char c;

fp = NULL;
if ((fp = fopen("test.out", "w+b")) == NULL) {
fprintf(stderr, "Failed.\n");
exit(22);
}

fprintf(stderr, "Writing ... ");
for (i = 0; i < MAXNUM/10; i++) {
//fseek(fp, 0, SEEK_CUR);
tmp = i;
for (j = 0; j < 3; j++) {
c = (unsigned char)(tmp % 256);
tmp /= 256;
fwrite(&c, 1, 1, fp);
}
}
fprintf(stderr, "done.\n");

fclose(fp);

return 0;
}

When compiled and run on a linux-2.4.## system and an old DOS system it
is very fast. Now if you uncomment the line that says fseek(fp, 0,
SEEK_CUR), it runs 17x slower!

Is there anyway to improve on the speed hit incurred by the call to
fseek?

Thanks in advance for any thoughts,
TJ Walls
Ph.D. Candidate - Physics Dept. Stony Brook University

TJ Walls · Jul 27, 2004

Ooops ... of course I meant 'writes the numbers 0-167720' ...

Kenneth Brody · Jul 27, 2004

TJ Walls wrote:
[...]

for (i = 0; i < MAXNUM/10; i++) {
//fseek(fp, 0, SEEK_CUR);
tmp = i;
for (j = 0; j < 3; j++) {
c = (unsigned char)(tmp % 256);
tmp /= 256;
fwrite(&c, 1, 1, fp);
}
} [...]
When compiled and run on a linux-2.4.## system and an old DOS system it
is very fast. Now if you uncomment the line that says fseek(fp, 0,
SEEK_CUR), it runs 17x slower!

Is there anyway to improve on the speed hit incurred by the call to
fseek?

The fseek() call you have is, in essence, a no-op. You are fseek'ing to
the current location.

However, a side-effect of the fseek is the flushing of the buffer. Without
the fseek(), your output will (actually, I suppose "can" is correct in the
general sense) be buffered, and only written when the buffer fille. With
the fseek(), you are forcing the buffer to be written for every character.

What is your purpose for using fseek() here?

TJ Walls · Jul 27, 2004

What is your purpose for using fseek() here?

My purpose for putting the fseek() here is to test
what kind of time hit fseek() is giving me (in this case, I thought it
should be a no-op too ... and thus give me a time penalty of 0, but in
fact the cost is HUGE). In my real program I am fseeking to various
places in the file and reading 3 bytes of data, so the fseek has a purpose
there, but it runs very slowly and I am trying to figure out a way to
speed it up.

-TJ Walls

Brian Gough · Jul 27, 2004

TJ Walls said:
Is there anyway to improve on the speed hit incurred by the call to
fseek?

Generally, map the file into memory with mmap() on Unix systems.

=?iso-8859-1?q?Nils_O=2E_Sel=E5sdal?= · Jul 27, 2004

My purpose for putting the fseek() here is to test
what kind of time hit fseek() is giving me (in this case, I thought it
should be a no-op too ... and thus give me a time penalty of 0, but in
fact the cost is HUGE). In my real program I am fseeking to various
places in the file and reading 3 bytes of data, so the fseek has a purpose
there, but it runs very slowly and I am trying to figure out a way to
speed it up.

If you only read few bytes at a time, you could try setting the stream
to be unbuffred (setvbuf(..))

Eric Sosman · Jul 27, 2004

TJ said:
[...] In my real program I am fseeking to various
places in the file and reading 3 bytes of data, so the fseek has a purpose
there, but it runs very slowly and I am trying to figure out a way to
speed it up.

The file in your test program was only about 160Kbytes
long. If your actual file is of similar size, or even a
hundred or so times larger, you might do well to read a
copy of the whole thing into memory (sequentially) and
access the data from there.

(Pedants may advance a number of reasons why you should
not do this. There's no fully portable way to discover the
size of a file prior to reading it, there's no guarantee
that the C implementation can handle objects of that size,
and the C language Standard says nothing about the relative
speeds of sequential and random input. Pay attention to
such arguments only long enough to ascertain that you can
safely ignore them.)

Richard Tobin · Jul 27, 2004

Kenneth Brody said:
However, a side-effect of the fseek is the flushing of the buffer.

Is that necessary? It seems like an obvious optimisation to not flush
the buffer when the seek is to somewhere within the buffer.

-- Richard

Richard Tobin · Jul 27, 2004

TJ Walls said:
In my real program I am fseeking to various
places in the file and reading 3 bytes of data

Your example program writes rather than reads...

For most operating systems, provided the size of the file is small
compared with the amount of real memory, I would expect caching in the
file system interface to avoid real disk accesses once the file has
all been read. There will still be the overhead of copying between
the cache and your program; you may be able to avoid that by using
some (system-specific) method to map the file into your program's
memory.

-- Richard

TJ Walls · Jul 28, 2004

Thanks for all the responses! I should mention that the system
I find myself programming on is quite limited ...

Your example program writes rather than reads...

True ... sorry for the slight ambiguity. My original test program
did both, but I thought for the sake of example size I would delete
the bottom half (and I think my example illustrates my point?).

For most operating systems, provided the size of the file is small
compared with the amount of real memory, I would expect caching in the
file system interface to avoid real disk accesses once the file has
all been read. There will still be the overhead of copying between
the cache and your program; you may be able to avoid that by using
some (system-specific) method to map the file into your program's
memory.

This is the unforuntate part ... The system I am working on is running
DOS 6.22 with 256K of RAM. I was hoping to _not_ have to map the file into
memory, but if that is my only choice, I guess I'll have to make it work.

Any other ideas would be greatly appreciated ...

Sincerely,
TJ Walls
Ph.D. Candidate - Physics Dept. Stony Brook University

TJ Walls · Jul 28, 2004

Is that necessary? It seems like an obvious optimisation to not flush
the buffer when the seek is to somewhere within the buffer.

-- Richard

True ... it still seems a little strange to me that the cost
of doing effectively nothing is so dramatic ...

-TJ

Richard Tobin · Jul 28, 2004

TJ Walls said:
This is the unforuntate part ... The system I am working on is running
DOS 6.22 with 256K of RAM. I was hoping to _not_ have to map the file into
memory, but if that is my only choice, I guess I'll have to make it work.

I have no idea what DOS does for caching file data.

But if you seek around randomly in a file that's bigger than physical
memory, then you're going to have to do a lot of disk reads.

Some possibilities:

- Do you have to do the reads in a random order? If you could list
them all and put them in order, that would be much better. But if
one value points to the next, that won't work.

- Is it really random? Are some parts of the file used more often than
others? If so, it may be worth caching just those parts. Is there
any "locality of reference" (ie, are successive seeks to nearby parts
of the file)? If so, implementing your own cache of recently used
blocks will help.

Incidentally, I doubt that you can map files in DOS: it doesn't just
mean copying it into memory, it means having the operating system
automatically associate an area of memory with the file and keep the
two in sync, reading blocks as they are needed and writing them out
when modified.

-- Richard

SM Ryan · Jul 28, 2004

# Hello All,
#
# I am baffled ... I am trying to improve the speed of a program
# that I have written that performs random access within a file. It
# relies heavily on fseek and is very slow. To test, I wrote the following
# test program which just writes the numbers 1-167721 sequentially to
# a binary file:

Library provided bufferring and random access don't usually go together
well. If you want to do this in ANSI C, you'll probably have better
performance if you turn off library buffering and do your own.

Thomas Matthews · Jul 28, 2004

TJ said:
Hello All,

I am baffled ... I am trying to improve the speed of a program
that I have written that performs random access within a file. It
relies heavily on fseek and is very slow. To test, I wrote the following
test program which just writes the numbers 1-167721 sequentially to
a binary file: [snip]

When compiled and run on a linux-2.4.## system and an old DOS system it
is very fast. Now if you uncomment the line that says fseek(fp, 0,
SEEK_CUR), it runs 17x slower!

Is there anyway to improve on the speed hit incurred by the call to
fseek?

Thanks in advance for any thoughts,
TJ Walls
Ph.D. Candidate - Physics Dept. Stony Brook University

In general, fseek is used to position the file pointer (cursor) to
an existing location in a file. A file is created containing data,
then the fseek function is applied to the file in read mode. The
fseek function can be applied in write mode to as if you were going
to overwrite data in the file.

To validate the fseek performance, you should use a better
benchmarking algorithm. For example, seeking to random positions
or using a butterfly pattern. A test should read the same amount of
data from the position. Buffering should be turned off, or the
algorithm should fseek to positions outside the size of the buffer.
Buffer, by the device, operating system or the C run-time library,
impacts the measurements of the fseek function. Most of the buffering
is to improve file accesses. Some harddrive controllers have large
buffers to reduce the number of seeks (thus wreaking havoc on your
experiment).

Also, remember that you will need to perform a huge amount of
iterations to get a stable average. Something over 1E+06 should
be enough to average out minor fluctuations.

Consult your applied statistics knowledge.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book

Kenneth Brody · Jul 28, 2004

TJ said:
My purpose for putting the fseek() here is to test
what kind of time hit fseek() is giving me (in this case, I thought it
should be a no-op too ... and thus give me a time penalty of 0, but in
fact the cost is HUGE). In my real program I am fseeking to various
places in the file and reading 3 bytes of data, so the fseek has a purpose
there, but it runs very slowly and I am trying to figure out a way to
speed it up.

The fseek(f,0,SEEK_CUR) basically _is_ a no-op, with the single (yet very
important in this case) exception of flushing the buffer.

To see that fseek() otherwise has virtually no impact on the speed of your
program, compare speeds in an unbuffered environment. (Use setbuf or
setvbuf to get rid of the buffer.)

cody · Jul 29, 2004

Normally seeking isn't such a huge impact. But if you seek at *every* byte
you read, it can indeed really slow down, thats not a big surprise.

CBFalconer · Aug 1, 2004

cody said:
Normally seeking isn't such a huge impact. But if you seek at
*every* byte you read, it can indeed really slow down, thats not
a big surprise.

Please do not toppost in this newsgroup. Please do snip non
germane portions of your quotations.

fseek	17	Jul 27, 2012
fseek and open_memstream	23	Jan 31, 2007
How to accept text and put each letter into a 2d matrix?	0	Jun 3, 2022
Adding adressing of IPv6 to program	1	Feb 16, 2023
seg fault	76	Jun 17, 2008
code review: read in ps(1) result	6	Jul 16, 2010
PyImport_ImportModule(..) returns NULL	0	Dec 31, 2013
Can not read VCD file in Linux	8	Feb 1, 2011

fseek speed

TJ Walls

TJ Walls

Kenneth Brody

TJ Walls

Brian Gough

=?iso-8859-1?q?Nils_O=2E_Sel=E5sdal?=

Eric Sosman

Richard Tobin

Richard Tobin

TJ Walls

TJ Walls

Richard Tobin

SM Ryan

Thomas Matthews

Kenneth Brody

cody

CBFalconer

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads