Bart said:
> [...]
> I don't know how to deal with /var/adm/messages or similar. Suppose I read
> byte-by-byte as recommended then someone updates the beginning of the file?
> Maybe it's foolhardy to even attempt making a copy of such a file. I'm not
> allowed to lock it because that's also frowned upon. What exactly can one do
> with such a file?

There are at least two problems to be faced when dealing
with /var/adm/messages. First, the file grows as the system
appends new log messages to it, an activity not coordinated
with the actions of J. Random Program. If JRP queries the
file size, then allocates a buffer of that size, and then
tries to fill the buffer by reading blindly to EOF, the file
may grow larger between the query and the reading. Thus, the
queried size must be considered an estimate, not an absolute:
use Chris Torek's approach, not Jacob Navia's.
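
To make the "estimate, not an absolute" point concrete, here is a
minimal sketch in standard C. It is my own illustration, not a
transcription of anyone's code from this thread; the name slurp()
and the buffer-doubling policy are assumptions made for the example.
The estimate only sizes the first allocation, and the loop keeps
reading until fread() comes up short.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read from fp to EOF. `estimate` is only a hint for the first
 * allocation; the buffer grows if the file turns out to be larger.
 * Returns a malloc'ed buffer (caller frees) and stores the byte
 * count in *len, or returns NULL on read or allocation failure.
 * slurp() is a hypothetical name used for illustration. */
char *slurp(FILE *fp, size_t estimate, size_t *len)
{
    size_t cap = estimate ? estimate : 512;
    size_t used = 0;
    char *buf;

    assert(fp != NULL && len != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        size_t n = fread(buf + used, 1, cap - used, fp);
        char *tmp;

        used += n;
        if (used < cap)             /* short read: EOF or error */
            break;

        /* Buffer is full: the file may be larger than estimated. */
        if (cap > (size_t)-1 / 2) {
            free(buf);
            return NULL;
        }
        tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }

    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    *len = used;
    return buf;
}
```

A deliberately low estimate is harmless here: it costs a few extra
realloc() calls, not correctness. Note that this still yields only
a snapshot of a file that may keep growing after EOF was seen.
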
The second problem is that the system does not allow the
file to grow without limit. Every now and then, it renames
the existing /var/adm/messages and creates a new, empty one
where future messages are deposited. So the semantics of how
the system identifies "a" file become important: Are the query
and the reading directed to the same bunch of bytes, or only
to the same file name? This is the (or a) reason POSIX systems
have both the stat() and fstat() query operations: the first
asks about the file associated with a given name, while the
second asks about a file that's already open and whose name
is no longer relevant.
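
The name-versus-bytes distinction can be checked directly on a
POSIX system. The following is a sketch under assumptions: the
function name still_same_file() is hypothetical, and it relies on
the POSIX convention that the pair (st_dev, st_ino) identifies a
file. It asks fstat() about the file already open and stat() about
whatever the name refers to right now, then compares the answers.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Does `path` still name the file that `fd` has open?
 * Returns 1 if yes, 0 if the name now refers to a different file
 * (or to nothing at all), and -1 if the open file cannot be
 * queried. still_same_file() is a hypothetical helper. */
int still_same_file(int fd, const char *path)
{
    struct stat by_fd, by_name;

    assert(path != NULL);
    if (fstat(fd, &by_fd) != 0)
        return -1;
    if (stat(path, &by_name) != 0)
        return 0;               /* name vanished or is unreachable */

    /* Device plus inode number identify a file on POSIX systems. */
    return by_fd.st_dev == by_name.st_dev
        && by_fd.st_ino == by_name.st_ino;
}
```

A program following a log could call something like this when it
hits EOF: a result of 0 suggests the log was rotated, so the program
should finish with the old descriptor and then reopen the name.
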
So how does one process /var/adm/messages, or more broadly,
how does one process a file that may be subject to change while
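the processing is in progress? The C language has almost no
support for activities outside the program itself; signals and
volatile are about the extent of it (the optional threads and
atomics of C11 govern only what happens inside one program),
and neither signals nor volatile is useful without a slew of
platform-specific semantics that C itself does not legislate.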
(The question of whether the committee chose well or poorly in
declining to so legislate is an issue for some other forum.)
So when you're processing a file that is being modified by other
agencies, you must begin by expanding your view to encompass not
only the language you want to use, but also the environment in
which you want to use it.

That wider programming environment can provide guarantees
of various kinds. One might be "Your program has exclusive
access to the file for the interval of interest," which seems
to be the unstated but crucial assumption in some of the arguments
elsethread. It's a guarantee that can often be provided -- by
voluntary convention, if nothing else -- but it's certainly not
provided for all files under all circumstances in all environments.
Other guarantees might be "This file is only appended to" or "It
is possible to acquire an exclusive lock on all or part of this
file" or "Your program can arrange to be notified whenever something
happens to this file." You make use of this additional platform-
specific and/or application-specific knowledge as best you can for
the task at hand.
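
As one example of leaning on such a guarantee, here is a sketch of
the "exclusive lock on all or part of this file" case using POSIX
fcntl() record locks. The helper name lock_whole_file() is my own
invention for illustration. These locks are advisory: they protect
you only from other programs that also ask for the lock, which is
exactly the "voluntary convention" kind of guarantee, and a system
logger need not be party to any such convention.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Apply (F_WRLCK or F_RDLCK) or release (F_UNLCK) an advisory
 * lock covering the whole file, blocking until it is granted.
 * Returns 0 on success, -1 on failure with errno set.
 * lock_whole_file() is a hypothetical helper. */
int lock_whole_file(int fd, short type)
{
    struct flock fl = {0};

    assert(type == F_WRLCK || type == F_RDLCK || type == F_UNLCK);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */

    return fcntl(fd, F_SETLKW, &fl) == -1 ? -1 : 0;
}
```

A write lock needs a descriptor open for writing, and a read lock
one open for reading. Setting l_start and l_len to a narrower range
would lock only part of the file.
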
Personally, I don't think much of the read-it-all-in approach
to handling data that's in files. That's because I'm from the
Old School, and learned my craft in the days when memory was
scarce and expensive. I'm always worried that I won't be able to
fit the file into the available memory, so I look for ways to
process the data one "window" at a time: Rather than read the whole
thing, process it, and write the results, I try to read-process-
write-lather-rinse-repeat, and thus not place such an obvious and
inflexible upper limit on the size of the file I can handle. It
doesn't always work, of course -- sometimes you really do need the
entire contents of the file -- but even then you seldom need to
have the whole thing sitting in one big contiguous array. You
read the enormous CAD model and build data structures while reading
it; all the data winds up in memory, but you never have or need
an "image" of the entire file as it looks on disk.
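
For what it's worth, the read-process-write loop can be sketched in
a few lines of standard C. This is only an illustration: the name
upcase_stream(), the 4096-byte window, and the choice of upper-casing
as the "process" step are all assumptions made for the example. The
point is that memory use stays fixed no matter how large the input.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Copy `in` to `out`, upper-casing as it goes, one fixed-size
 * window at a time. Returns 0 on success, -1 on I/O error.
 * upcase_stream() is a hypothetical name used for illustration. */
int upcase_stream(FILE *in, FILE *out)
{
    unsigned char window[4096];
    size_t n, i;

    assert(in != NULL && out != NULL);
    while ((n = fread(window, 1, sizeof window, in)) > 0) {
        for (i = 0; i < n; i++)                 /* process */
            window[i] = (unsigned char)toupper(window[i]);
        if (fwrite(window, 1, n, out) != n)     /* write */
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

A byte-at-a-time transformation is the easy case. When a unit of
work (a line, a record) can straddle two windows, the loop has to
carry the unfinished tail over to the next pass.
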