> There are several different entities covered by the terms "EOF" and/or
> "end-of-file", and I think [the OP is] mixing them up.
Indeed.
> 1. EOF is a macro that expands to a constant integer expression. ...
> 2. The end of a file is simply the position in the file at which it
> ends, depending on its current size.
> 3. An "end-of-file indicator" is an internal flag associated with a
> stream. ...
Those are all we have in C, anyway. In actual implementations we
may also have an "EOF character", just to confuse the issue further.
(More on that later.)
> These are three entirely different things.
As is the fourth, alas.
A "stream" is an entity in your executing program that's associated
with an external file. The file exists outside your program,
typically sitting on a disk somewhere (though there are a myriad of
other possibilities).
It is worth mentioning at least two of the other possibilities,
since they normally occur when running any C program (on a hosted
system anyway). An input stream can be connected to the user's
keyboard ("stdin"), and output streams to the user's
screen/window/whatever ("stdout" and "stderr" both, normally).
The OP may (or may not) find some help by stepping outside the C
language for a while. I think he mentioned something about Linux
at one point, although for my purposes here, Linux, Unix, or even
Windows would all be similar enough. (Something like VMS or TSO
would not.)
If we put C aside for a while, we can ignore the whole concept of
a "stream". Here, a "file" is simply an on-disk entity. It is
definitely *not* something interactive like a keyboard or screen,
nor a communications channel to a remote computer like a socket.
(Unix-like systems attempt to make those special things act a lot
like disk files, with varying degrees of success. But we want to
ignore those too.)
Now, given an on-disk file, of whatever size, the OS has some way
of letting you "open" the file, then poke around in the contents.
A Unix-like system does this with the open(), read(), and lseek()
functions (none of which are Standard C, remember; they just happen
to be there, with known behavior, on the Unix-like systems).
When you open() a file, you get a small integer number called a
"file descriptor". There is *no* "end of file indicator" associated
with this descriptor, but there is a "current position within file"
associated with it. The current position is initially 0.
To move the current position (without doing anything else), you
use lseek().
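Here is a minimal sketch of that, in C. The file is faked with
tmpfile() and fileno() -- those are just a convenient way to get a
disposable file descriptor for the demonstration, not part of the
technique itself:

    /* Write a few bytes, jump the current position back to
     * offset 2 with lseek(), and read the byte found there. */
    #include <stdio.h>
    #include <unistd.h>

    int byte_at_offset_2(void)
    {
        FILE *tmp = tmpfile();      /* scratch file, deleted on close */
        int fd = fileno(tmp);       /* its underlying descriptor */
        char c;

        write(fd, "abcde", 5);      /* current position is now 5 */
        lseek(fd, 2, SEEK_SET);     /* move current position to 2 */
        read(fd, &c, 1);            /* reads the byte at offset 2 */
        fclose(tmp);
        return c;                   /* 'c' */
    }
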
To read data starting at the current position, you use read(). You
give read() three numbers: a file descriptor, a pointer to a buffer,
and a number of bytes. The descriptor identifies the file and
carries the current lseek() offset. The buffer pointer tells the
OS where to copy the file data. The number of bytes tells the OS
how many bytes to read from the file, into that buffer:
int fd, result;
char buf[SIZE];
...
result = read(fd, buf, SIZE);
If the read is completely, totally, 100% successful, then:
- "result" is set to SIZE;
- the successfully-read bytes are put in buf[0] through buf[SIZE-1]; and
- the current lseek() position is moved forward SIZE bytes.
If the read fails entirely for some reason -- e.g., if the file is
on a floppy disk and the disk has gone bad -- the call returns -1:
- "result" is set to -1;
- the contents of buf[] may or may not be garbage[%];
- the variable "errno" is set to indicate the underlying error
(usually EIO, but maybe something else depending on the OS); and
- the current lseek() position does not change.
[% The buffer contents tend to depend on whether the OS uses DMA
and how the hardware behaves when the bad floppy reports its
bad-ness.]
There is a third possibility as well, though. Suppose that SIZE
is 100, the current offset is 200, and the file is only 220 bytes
long. In this case, there are only 20 bytes remaining in the file.
This read() call needs to "partly succeed" and "partly fail". How
can read() report "partial success"?
The answer is: if -1 means "error" and 100 means "total success",
then any number *between* those two means "partial success". The
exact *amount* of success is given by the number. In this case,
since there are 20 bytes left, the read() will return 20:
- "result" is set to 20;
- the successfully-read bytes are put in buf[0] through buf[19]; and
- the current lseek() position is moved forward 20 bytes.
In other words, "partial success" looks EXACTLY THE SAME as "total
success", except that the count is less than the number of bytes
you asked for.
Now, what happens if the current position is at, or even *beyond*,
the end of the file? One option -- one that was used in some OSes
before Unix -- is to report this as an "error". Unix could have
done that: it could have returned -1, and set errno to EEOF or some
such. This is not how Ken Thompson decided to do it, though.
Instead, he had the read() report "partial success", with the number
of bytes successfully read being zero. If read() calls this "partial
success" and reports zero bytes read, then:
- "result" is set to 0;
- the successfully-read bytes are not put anywhere (because
there are none), so buf[] is left unchanged; and
- the current lseek() position is moved forward 0 bytes,
which leaves it unchanged.
So, if one uses the low-level calls (open(), optionally lseek(),
read(), and close()) on a Unix-like system, one simply checks the
return value from each read() call: -1 means error, 0 means "end
of file", and any other value means "success", with the successful
count being very important, since it may be less than you asked
for.
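That rule, turned into a sketch of a read loop, looks like this.
The 220-byte file from the example above is manufactured with
tmpfile()/fileno() (my own scaffolding, not anything from the OP's
code), and SIZE is 100 as in the text:

    #include <stdio.h>
    #include <unistd.h>

    #define SIZE 100

    int count_reads(void)   /* how many read() calls succeed */
    {
        FILE *tmp = tmpfile();
        int fd = fileno(tmp);
        char buf[SIZE];
        ssize_t result;
        int nreads = 0;

        for (int i = 0; i < 220; i++)   /* make a 220-byte file */
            write(fd, "x", 1);
        lseek(fd, 0, SEEK_SET);         /* rewind to the start */

        while ((result = read(fd, buf, SIZE)) > 0)
            nreads++;           /* gets 100, 100, then 20 bytes */
        /* here, result is 0 ("end of file") or -1 ("error") */
        fclose(tmp);
        return nreads;          /* 3 */
    }

Note that the loop condition "> 0" treats every positive count --
total or partial success -- the same way, exactly as described above.
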
Now, the reason for all of this "off-topic drift", as it were, is
that C's stdio was originally designed to "wrap around" the Unix
I/O model, but also allow for the more obnoxious I/O models found
on other systems available at the time.
In C, instead of a "file descriptor", you get a "stream". On a
Unix-like system, a stream is a pretty thin wrapper around the
underlying file descriptor. On other systems, it may be much
"thicker", hiding all kinds of system-level obnoxiousness, such as
different I/O routines for "interactive" streams (keyboard and
display) than for "disk" streams (on-disk files). (Yes, other OSes
really do require different OS calls to do I/O on files vs devices.)
In any case, however, a "stream" still has, as an underlying concept,
a "current seek position". On a Unix-like system, this position
-- which you manipulate with fseek() -- is exactly the same thing
as the descriptor's byte offset, which the library manipulates with
lseek(). On more-obnoxious systems, though, the "fseek position"
may have little or even no resemblance to a byte offset. (In fact,
on some VMS systems, the values returned by ftell() are derived
from pointers obtained by malloc(). Each ftell() does a new malloc()
to remember exactly where you are in the file now, so that the full
positioning information can be passed to RMS and/or the SYS$QIO OS
calls.)
Now, fread() could have used a trick similar to Unix's read(): it
could return a short count for end-of-file, and -1 for error. But
there is one problem: fread() returns a size_t, which is an *unsigned*
integer. There is no "-1". So fread() returns a short count --
zero, if nothing at all was read -- for both "encountered end of
file" and "encountered error".
Similarly, fgetc() could have used a trick to report "end of file"
and "error" separately. In this case, fgetc() -- and getc() and
getchar(), which are defined in terms of fgetc() -- returns a value
in [0..UCHAR_MAX] on success. If UCHAR_MAX is 255 (as it usually
is), that means there are 256 "successful" values. Since fgetc()
returns an ordinary "int", and an "int" must be able to represent
negative values from -1 down to at least -32767, there are *plenty*
of extra values to use. But -- for some reason (I have no idea
what reason) -- the guys who wrote the original implementations
chose to report both "end of file" and "error" with a single return
value, just like fread().
As Keith wrote above, the value fgetc() returns, to indicate any
kind of failure -- "end of file" failure, or "error reading disk"
failure, or whatever -- is the one defined in <stdio.h>. On most
implementations, the value fgetc() returns for these failures is
-1. The C Standard requires only that <stdio.h> define EOF as
"a negative int value", but most implementors use -1.
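This is why the return value of fgetc() must be stored in an "int"
and never a "char": EOF has to be distinct from every one of the
256 "successful" values. A quick sketch of that guarantee:

    #include <limits.h>
    #include <stdio.h>

    /* EOF must differ from every successful fgetc() return
     * value, which runs from 0 through UCHAR_MAX. */
    int eof_is_distinct(void)
    {
        for (int v = 0; v <= UCHAR_MAX; v++)
            if (v == EOF)
                return 0;       /* cannot happen */
        return 1;
    }
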
This is where the "end of file indicator" on the stream -- Keith's
item #3 -- comes in. Both fread() and fgetc() fail to distinguish
between the two "failure" cases. (These cases, remember, are:
"(A) You asked me to read, but I failed to read anything because,
while everything is all still working fine, there is nothing left
to read!" and "(B) You asked me to read, but I failed to read
anything ... and by the way, look out, the floppy disk is on fire!")
The guys who wrote the C "standard I/O" library decided to allow
you -- the C programmer -- to be able to distinguish between
these two cases, using the feof() and ferror() macros:
FILE *fp;
...
... attempt to read something from the file ...
if (our attempt to read failed) {
if (feof(fp))
printf(
"this was case (A): read failed, but all is well;\n"
"this was just the normal end of file.\n");
if (ferror(fp))
printf(
"this was case (B): read failed, and something is\n"
"badly wrong. Better check: is the disk on fire?\n");
} else {
... our attempt to read worked; use the data ...
}
The tricky bits with feof() and ferror() are:
1) These are "after-the-fact flags", not "predictions about the
future". You should only use the macros to test *why* a
read operation failed, after one has *actually failed*.
2) Once one or both of these flags are set, they *stay* set
until you, the C programmer, take action to clear them. There
are a number of ways to clear them, including the clearerr()
function. The clearerr() function clears both of them without
doing anything else. It does not correct the underlying
problem (e.g., put out the fire, in case (B)). But if you
have some way of correcting it, you can do that, and then do
clearerr(), and then try your read again.
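Both points can be seen in a few lines of C. Here I force an
ordinary, case-(A) end-of-file by reading from an empty scratch
stream (an empty tmpfile() -- my own scaffolding, chosen because
it cannot produce a case-(B) error):

    #include <stdio.h>

    int demo_flags(void)
    {
        FILE *fp = tmpfile();           /* an empty stream */
        int ok;

        if (fgetc(fp) != EOF)           /* must fail: nothing there */
            return 0;
        ok = feof(fp) && !ferror(fp);   /* case (A), not case (B) */
        clearerr(fp);                   /* clears both flags... */
        ok = ok && !feof(fp) && !ferror(fp);
        fclose(fp);
        return ok;                      /* 1 */
    }
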
One of the many things confusing the OP is that the fseek() function
clears the end-of-file indicator -- the flag that feof() tests --
whenever the fseek() succeeds. It does so even if the fseek() puts
the current seek position at or beyond the actual end of the actual
on-disk file (assuming, as always, that there is in fact an actual
on-disk file involved).
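You can watch this happen. In the sketch below (again using an
empty tmpfile() as a stand-in for a real file), even an fseek()
back to the position we were already at clears the flag:

    #include <stdio.h>

    int fseek_clears_eof(void)
    {
        FILE *fp = tmpfile();
        int ok;

        fgetc(fp);                  /* hits end of file */
        ok = feof(fp);              /* the indicator is set */
        fseek(fp, 0L, SEEK_SET);    /* a successful fseek()... */
        ok = ok && !feof(fp);       /* ...clears the indicator */
        fclose(fp);
        return ok;                  /* 1 */
    }
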
The rule here, then, is the same as always: try to do the I/O,
and pay attention to the return value from your I/O function:
if (fseek(fp, newpos, SEEK_SET)) ... do something about failure ...
result = fread(buf, item_size, number_of_items, fp);
if (result == 0) {
if (feof(fp))
... fread() failed due to normal, ordinary EOF ...
else
... fread() must have failed due to serious problem ...
} else {
... work with the data: fread() got "result" items ...
}
or:
if (fseek(fp, newpos, SEEK_SET)) ... do something about failure ...
result = fgetc(fp);
if (result == EOF) {
if (feof(fp))
... fgetc() failed due to normal, ordinary EOF ...
else
... fgetc() must have failed due to serious problem ...
} else {
... work with the data: fgetc() got the byte in "result" ...
}
Beginners can use a simple rule: NEVER call feof(). NEVER, EVER.
Just look at the return value from fread() or fgetc() or whatever.
Assume that "read failed" means "normal end-of-file", i.e., that
disks never catch on fire (and, more practically, that no one
ever uses a magnet to put the floppy up on the fridge).
More-advanced C programmers can move on to the more-advanced rule:
use feof() (and ferror()) ONLY after a read operation (fread, fgetc,
etc) has failed. (People actually do stick magnets on floppies,
or take the floppy out of its "wrapper", or hammer more than one
into a drive, or any number of other bone-headed stunts.)
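The beginner's rule, in code, is simply to drive the loop with the
return value of getc() itself -- never with feof(). (The classic
bug is "while (!feof(fp))", which runs the loop body one extra
time after the last successful read.) The count_demo() helper here
is just my own scaffolding to feed the loop a known file:

    #include <stdio.h>

    long count_bytes(FILE *fp)  /* count bytes until EOF or error */
    {
        int c;                  /* int, not char: must hold EOF too */
        long n = 0;

        while ((c = getc(fp)) != EOF)
            n++;
        return n;
    }

    long count_demo(void)       /* run count_bytes() on "hello" */
    {
        FILE *fp = tmpfile();
        long n;

        fputs("hello", fp);
        rewind(fp);
        n = count_bytes(fp);
        fclose(fp);
        return n;               /* 5 */
    }
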
Now, about that fourth item, the "EOF character" ... if it even
exists. This is another OS-specific thing.
Consider the Unix-like system, in which keyboard input is "just
like" an ordinary file. You open() the keyboard ("/dev/tty" or
"/dev/console" perhaps), or -- more simply -- get the file descriptor
handed to you in the usual way. You call read() to get input.
What if the user wants to signal "end of input"? He has to get
your read() call to return 0. How can he do that?
On a Unix system, the trick is to input the "EOF character" at
the start of a line (i.e., after pushing the RETURN or ENTER key).
This "EOF character" is usually control-D, but is changeable.
The same technique applies to Microsoft's DOS and Windows systems,
except that the key is different: you use control-Z instead of
control-D. But here things get even weirder.
On very old MS-DOS systems, and systems that predated MS-DOS, *disk*
files had a peculiar problem: they did not have a "size" associated
with them. Instead of a "size", they had a "number of disk sectors".
A disk file was 0 sectors long, or 1 sector long, or 2 sectors, or
3 or 4 or whatever. If a disk sector was 512 bytes (though 128
and 256 were also common), then a disk *file* could be 0 bytes, or
512 bytes, or 1024 bytes, or 1536 bytes, or 2048 bytes, and so on
-- but no disk file could ever be just 20 bytes. It had to be at
least "one sector", if it had any bytes at all.
So, on these systems, how could you mark the end of a text file?
The answer was: pick a character, call it the "end of file" character,
and write that somewhere in the last sector. When reading the
file, if you encounter that character, pretend that there is no
more data in the file, even if there really is more.
The "EOF character" in these disk files was usually control-Z (this,
incidentally, is why MS-DOS and Windows use control-Z as a keyboard
"EOF" character). Some I/O routines might detect ^Z as EOF only
in the *last* sector, while others would detect it in *any* sector
(though the latter took less machine code, and with every byte
being precious, this was certainly more common). The fact that
this sort of "EOF character" exists is part of the reason that you
have to fopen() a binary file with "rb" to read it. (It is not
the whole reason, but it is part of the reason.) If you fopen()
with just plain "r", a control-Z byte in the stream may -- or may
not -- cause your stdio to report "end of file".
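A last sketch, to make the "rb" point concrete. Write a control-Z
byte (0x1A) into a scratch file and read it back through a binary
stream: it comes back as ordinary data. Whether plain "r" (text
mode) would instead report end of file is up to the implementation,
which is exactly the point above. (tmpfile() serves as the binary
stream here; the Standard says it is opened in "wb+" mode.)

    #include <stdio.h>

    int read_ctrl_z(void)
    {
        FILE *fp = tmpfile();   /* opened "wb+" per the C Standard */
        int c;

        fputc(0x1A, fp);        /* the would-be "EOF character" */
        rewind(fp);
        c = fgetc(fp);          /* binary mode: just another byte */
        fclose(fp);
        return c;               /* 0x1A, not EOF */
    }
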