R
Richard Harter
I have taken the original posting and have commented on it at
length. Some people might enjoy the critique. In a separate
post I will give a (hopefully) complete specification. This
discussion and the spec will also go into web pages.
I appreciate that such an extended discussion drifts into the
area of goldplated bullshit; so be it.
The names, getline and fgetline, are already in common use. In
the end I decided to use getfline instead.
is plausible, but it may be overkill. Be that as it may, in a C
implementation this prototype doesn't work too well. First of all
the user has to go through the struct to access the line and the
status field. More importantly the normal usage in C would be a
while loop, e.g.,
struct fgetline_info fg;
...
while (fg = fgetline(FILE *fptr, size_t maxsize)) {...}
This doesn't work - the idiom requires a test on something that
can be "zero". If we convert the prototype and fg to a pointer
the code still doesn'twork. Now the problem is that when the read
is successful we don't need the status field; when the read fails
we don't have the status field.
Whether the cost of calls to malloc and free matters obviously
depends on the application. However (a) the costs are commonly
underestimated, and (b) a general policy of ignoring apparently
minor costs can cumulate and produce unexpected inefficiencies.
As a note, producing dirty copies "works" provided that there is
only one file being read at a time. Don't count on it.
return type with a cumbersome calling sequence.
This distinction between kinds of copies that we might get is a
critical one; without it either a default is chosen without
weighing its merits or, worse yet, concepts are mixed and subtle
bugs can slip in.
This is a good maxim but it is not the whole story. If one choice
commits the user and the other doesn't, provide the noncommital
result. In this case providing clean copies is the commital
choice; if the routine provides a clean copy the user cannot opt
to get a transient copy. The choice commits the user to a malloc
cost and an eventual free cost. If the routine returns a
transient copy the user can duplicate it to get a clean copy.
This is an example of a serious error in design. Let us
accept for the moment that we want to offer a choice; naturally
we ask how to do it. What I did was to envisage an implementation
and treat it as the only alternative. Instead of thinking in
terms of a potential implementation I should have thought in
terms of API alternatives. An obvious alternative is to set a
flag instead.
Beating this piece of ground where a horse once died some
more, using different data values to select modes is a suspect
practice, one that one should be wary about using. In this case
the value of the buffer pointer (null or not) was used as a flag.
A more serious issue is that there are alternatives for
supplying the space. They include:
1. The user declares an array on the stack.
2. The user declares a structure that contains an array.
3. The user allocates space using malloc.
4. The read routine allocates the space.
In alternatives 1 and 2 there need not be calls to malloc at all.
However care must be taken to avoid "freeing" stack storage.In
alternatives 3 and 4 there is no need for a separate buffer.
This schematic code illustrates why it was a bad idea to return a
struct and to put the line pointer in the struct. Also it should
be be easy enough for the implementation to free that pesky
increased line; having the user do it is an unnecessary
complication.
The increase/no_increase distinction is unnecessary. That
leaves three cases, normal EOF, normal line read, and a premature
EOF in the final line.
One issue not covered was what to do about overly long lines.
There are several possible actions, e.g.,
* We could treat it as an error condition.
* We could treat it as an end of data, reserving the option to
pick up later where we left off.
* We could cut the long line into pieces maxlen characters
long.
* We could chop out all characters in the long line after the
first maxlen characters.
* We could skip the long line and proceed to the next line.
Clearly the routine should not decide on its own what action
to take; the user must specify what action to take by passing a
flag. If the user does not specify an action the default should
be to treat a long line as an error condition.
This way of proceeding - listing error tags and short
explanations - has its faults; it comes from focusing too heavily
on potential code. For example, the calling sequence argument
errors only apply to a particular choice of calling sequence. A
better approach might be to try to analyze the kinds of errors
that can occur. Briefly, these include:
* A format fault in the file: This can either be a premature
EOF or an overly large line. This is not necessarily an error
since the user can opt to accept the faulty lines.
* Calling sequence error: The calling sequence is faulty in
some way. The key here is that the constraints on the calling
sequence should be spelled out at some point.
* Memory allocation error: Malloc/realloc fails. Usually, but
not necessarily, this is a fatal error.
* I/O error: Something goes wrong while trying to read the
file.
What action should be taken is quite another matter. Users
may or may not have an error management strategy. If there is
one, the routine shouldn't pre-empt it. At a minimum the routine
should return an error code. The objection to returning an error
code is that the conscientious user is then stuck with adding
error handling code. A useful technique is to pass to the routine
flags telling what to do about errors, e.g., write to a log file,
call exit, etc.
This is a useful technique. However to avoid magic numbers
disease the positions of the various flag bits should be defined
separately in the include file that spells out the file read
package prototypes.
The issue here is whether the delivered line contains \n (the EOL
character) or not. In other do lines look like abc\n\0 or abc\0?
Since we are already committed to using flags we may as well have
a flag to select whether to include the EOF character; the
appropriate default is to not include it.
At this point we have arrived at the end of the original
discussion. It went off track, but it was thought provoking.
Finally, what should the interface look like? I now believe that
the prototype should look like this:
int getfline(FILE * fptr,
char ** line_ptr,
long * status,
size_t * length_ptr,
size_t maxlen);
Normal usage would be to declare line, status, and len as char *,
long, and size_t respectively. Getfline returns 1 if the read was
normal and 0 if it was not. It would normally be used in a while
loop, e.g.,
while (getfline(fptr,&line,&status,&len,maxlen) {
/* Code to process line */
}
/* Error testing (if any) goes here
Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
length. Some people might enjoy the critique. In a separate
post I will give a (hopefully) complete specification. This
discussion and the spec will also go into web pages.
I appreciate that such an extended discussion drifts into the
area of goldplated bullshit; so be it.
What we want is a routine that keeps reading and returning lines
from a file. A simple prototype for this function is
char * fgetline(FILE *);
The names, getline and fgetline, are already in common use. In
the end I decided to use getfline instead.
Here I went wrong. Having a struct (record) to hold informationThere are some gotchas with this prototype - well, not really
gotchas, rather little bits of awkwardness and inefficiency. One
is that we are throwing away information that (usually) has to be
recomputed, namely the length of the line and, as we shall see,
status information. If we want to add the information to our
prototype there are several that to do it - it can be
passed back through the calling sequence or it can be returned by
the function. (We could also pass it to a global - ugh, don't do
it.) My vote is to have it all returned by the function which
means we create a struct (record) that looks like:
struct fgetline_info {
char *line;
size_t len;
int status;
};
There are various choices that could be made in the types -
season to taste. (A pointer to the struct could be passed in as
an argument if need be.)
A second issue is that there is no limit to how large the line
can be. A plausible way to provide a limit that is to pass one
in as an argument. If we do, our prototype looks like this:
struct fgetline_info fgetline(FILE * fptr, size_t maxsize);
is plausible, but it may be overkill. Be that as it may, in a C
implementation this prototype doesn't work too well. First of all
the user has to go through the struct to access the line and the
status field. More importantly the normal usage in C would be a
while loop, e.g.,
struct fgetline_info fg;
...
while (fg = fgetline(FILE *fptr, size_t maxsize)) {...}
This doesn't work - the idiom requires a test on something that
can be "zero". If we convert the prototype and fg to a pointer
the code still doesn'twork. Now the problem is that when the read
is successful we don't need the status field; when the read fails
we don't have the status field.
This version has a little problem - where did the storage for the
line come from? One way to deal with that is for fgetline to
allocate storage for the line and for the calling routine to free
it when it is done with the line.
Okay, so how do we do this in fgetline? Well, there is a very
simple way to do it, one that has been reinvented time and time
again. We start out allocating a standard amount of space
(numbers like 32, 64, and 80 bytes are typical) for a line.
We read from the file until we either hit an EOL (end of line
marker), a failure to read, or we have read as many characters as
were allocated. If we haven't hit the EOL we reallocate the
array, commonly by doubling the size, and doing another read.
This works, but it is inefficient - we have to call malloc and
free for every line.
Whether the cost of calls to malloc and free matters obviously
depends on the application. However (a) the costs are commonly
underestimated, and (b) a general policy of ignoring apparently
minor costs can cumulate and produce unexpected inefficiencies.
One way to get around this is to use a
technique I call highwater buffering. In the file containing the
code for fgetline we have two file-scope variables:
static char * buffer = 0;
static size_t size = 0;
In fgetline we have some initialization code that allocates
buffer space when the buffer size is zero. Thereafter we use our
little doubling trick. The advantage of this scheme is that we
have at most a small number of calls to malloc (the number of
doublings needed to get to the largest line) and no calls to
free.
The very real disadvantage of this scheme is that it produces a
dirty copy of the line. By a dirty copy, I mean one that can be
scribbled on elsewhere before we are done with it. This will
happen whenever fgetline is called anywhere else with any FILE
argument before our next read. This is double plus ungood.
As a note, producing dirty copies "works" provided that there is
only one file being read at a time. Don't count on it.
This is a worst of all worlds prototype - it combines an awkwardIs there a way to get the efficiency of the highwater scheme
without being dirty? Yes; the trick is that the calling program
provides the initial buffer. If we add the buffer information to
the prototype we get:
fgetline_info fgetline(FILE * fptr,
char * buffer,
size_t size,
size_t maxsize);
return type with a cumbersome calling sequence.
There are three kinds of copies of the line that we might get
from fgetline - clean, transient, and dirty. A clean copy is one
that has no encumbrances - it lasts until it is specifically
deleted. A transient copy is one that lasts until the next
call to fgetline to get another line from the current file.
(There may be another file being read elsewhere.)
This distinction between kinds of copies that we might get is a
critical one; without it either a default is chosen without
weighing its merits or, worse yet, concepts are mixed and subtle
bugs can slip in.
What kind of copy should fgetline provide, clean or transient?
There are arguments for each choice. Clean copies are more
expensive but safer. Transient copies are (usually) cheaper. In
questions like this there is much to be said for the design
principle that:
When there is a significant choice as to the kind of
output being delivered the library routine should let
the user make the choice rather than dictating the
choice unless offering the choice unduly complicates
usage.
This is a good maxim but it is not the whole story. If one choice
commits the user and the other doesn't, provide the noncommital
result. In this case providing clean copies is the commital
choice; if the routine provides a clean copy the user cannot opt
to get a transient copy. The choice commits the user to a malloc
cost and an eventual free cost. If the routine returns a
transient copy the user can duplicate it to get a clean copy.
So how do we provide a choice? That is simple; if the buffer
pointer in the calling sequence is NULL, fgetline must return a
clean copy; otherwise fgetline MAY return a transient copy. I
say "may" because it might have to increase the buffer size.
If it does the calling routine will have to check whether there
was a size increase and, if so, will have to free the returned
buffer. (The fgetline implementation can't realloc the passed
in buffer; it will have to do a malloc for the first resizing.)
This is an example of a serious error in design. Let us
accept for the moment that we want to offer a choice; naturally
we ask how to do it. What I did was to envisage an implementation
and treat it as the only alternative. Instead of thinking in
terms of a potential implementation I should have thought in
terms of API alternatives. An obvious alternative is to set a
flag instead.
Beating this piece of ground where a horse once died some
more, using different data values to select modes is a suspect
practice, one that one should be wary about using. In this case
the value of the buffer pointer (null or not) was used as a flag.
A more serious issue is that there are alternatives for
supplying the space. They include:
1. The user declares an array on the stack.
2. The user declares a structure that contains an array.
3. The user allocates space using malloc.
4. The read routine allocates the space.
In alternatives 1 and 2 there need not be calls to malloc at all.
However care must be taken to avoid "freeing" stack storage.In
alternatives 3 and 4 there is no need for a separate buffer.
It's probably best to have a value in the status field that
signifies the size has been increased; call that value
fg_increased and a normal read fg_normal. Then the usage for
transient copies might look like this (first cut):
...
struct fgetline_info fg;
char buffer[80];
int done = 0;
....
for(;!done {
...
fg = fgetline(fptr,buffer,80,FG_MAXSIZE);
if (fg.status == fg_normal || fg.status == fg_increased) {
/* do stuff */
if (fg.status == fg_increased) free(fg.line);
} else done = 1;
}
If we want clean copies the corresponding loop body would be
fg = fgetline(fptr,0,0,FG_MAXSIZE);
if (fg.status == fg_normal) {
/* do stuff */
free(fg.line);
} else done = 1;
This schematic code illustrates why it was a bad idea to return a
struct and to put the line pointer in the struct. Also it should
be be easy enough for the implementation to free that pesky
increased line; having the user do it is an unnecessary
complication.
What sort of return values should be available in the status
field? These are the successful ones that occur to me:
end_of file The last read (if any) found an EOL
marker. The current read finds an
immediate EOF.
no_increase Normal read - either buffer was null or
else it was not increased.
increase Normal read - the buffer size has been
increased.
abn_no_increase Abnormal read - an EOF was found without
an EOL. Buffer was not increased.
abn_increase Abnormal read - an EOF was found without
an EOL. Buffer was increased.
The increase/no_increase distinction is unnecessary. That
leaves three cases, normal EOF, normal line read, and a premature
EOF in the final line.
One issue not covered was what to do about overly long lines.
There are several possible actions, e.g.,
* We could treat it as an error condition.
* We could treat it as an end of data, reserving the option to
pick up later where we left off.
* We could cut the long line into pieces maxlen characters
long.
* We could chop out all characters in the long line after the
first maxlen characters.
* We could skip the long line and proceed to the next line.
Clearly the routine should not decide on its own what action
to take; the user must specify what action to take by passing a
flag. If the user does not specify an action the default should
be to treat a long line as an error condition.
In addition there are numerous kinds of errors that can occur.
Calling sequence arguments include:
no_file The file pointer is null.
bad_buffer One and only one of buffere and size s
zero.
bad_size Size is 0 or is greater than maxsize
bad_maxsize Maxsize is 0
Then there are the memory allocation failures:
bad_allocate Malloc or realloc failure
big_line Line length is greater than maxsize.
When there is a memory allocation failure the line and len fields
hold what has been successfuly read.
This way of proceeding - listing error tags and short
explanations - has its faults; it comes from focusing too heavily
on potential code. For example, the calling sequence argument
errors only apply to a particular choice of calling sequence. A
better approach might be to try to analyze the kinds of errors
that can occur. Briefly, these include:
* A format fault in the file: This can either be a premature
EOF or an overly large line. This is not necessarily an error
since the user can opt to accept the faulty lines.
* Calling sequence error: The calling sequence is faulty in
some way. The key here is that the constraints on the calling
sequence should be spelled out at some point.
* Memory allocation error: Malloc/realloc fails. Usually, but
not necessarily, this is a fatal error.
* I/O error: Something goes wrong while trying to read the
file.
What action should be taken is quite another matter. Users
may or may not have an error management strategy. If there is
one, the routine shouldn't pre-empt it. At a minimum the routine
should return an error code. The objection to returning an error
code is that the conscientious user is then stuck with adding
error handling code. A useful technique is to pass to the routine
flags telling what to do about errors, e.g., write to a log file,
call exit, etc.
There are various conventions one could use for assigning code
values for the possible return values. My view is that there
should be a bit that is set only if there was an error, a bit
that is set only if there was a memory allocation error, a bit
that is set only if there was a buffer increase, and a bit that
is set only if an EOL occurred.
The point of doing this is that we can have simple tests to check
whether an error occurred, whether some space has to be freed,
and whether the file read is complete.
This is a useful technique. However to avoid magic numbers
disease the positions of the various flag bits should be defined
separately in the include file that spells out the file read
package prototypes.
As a final note, there is one minor decision left unsettled,
namely should the returned line include an EOL marker before the
string terminating 0. My take is that this matter is too trivial
to warrant adding a flag to the calling sequence, and that it
will be slightly less confusing if there is one present even if
it has to be manufactured. Perhaps someone has a convincing
argument one way or the other.
The issue here is whether the delivered line contains \n (the EOL
character) or not. In other do lines look like abc\n\0 or abc\0?
Since we are already committed to using flags we may as well have
a flag to select whether to include the EOF character; the
appropriate default is to not include it.
At this point we have arrived at the end of the original
discussion. It went off track, but it was thought provoking.
Finally, what should the interface look like? I now believe that
the prototype should look like this:
int getfline(FILE * fptr,
char ** line_ptr,
long * status,
size_t * length_ptr,
size_t maxlen);
Normal usage would be to declare line, status, and len as char *,
long, and size_t respectively. Getfline returns 1 if the read was
normal and 0 if it was not. It would normally be used in a while
loop, e.g.,
while (getfline(fptr,&line,&status,&len,maxlen) {
/* Code to process line */
}
/* Error testing (if any) goes here
Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die