Designing fgetline - a perspective

R

Richard Harter

I have taken the original posting and have commented on it at
length. Some people might enjoy the critique. In a separate
post I will give a (hopefully) complete specification. This
discussion and the spec will also go into web pages.

I appreciate that such an extended discussion drifts into the
area of goldplated bullshit; so be it.
What we want is a routine that keeps reading and returning lines
from a file. A simple prototype for this function is

char * fgetline(FILE *);

The names, getline and fgetline, are already in common use. In
the end I decided to use getfline instead.
There are some gotchas with this prototype - well, not really
gotchas, rather little bits of awkwardness and inefficiency. One

is that we are throwing away information that (usually) has to be

recomputed, namely the length of the line and, as we shall see,
status information. If we want to add the information to our
prototype there are several that to do it - it can be
passed back through the calling sequence or it can be returned by

the function. (We could also pass it to a global - ugh, don't do

it.) My vote is to have it all returned by the function which
means we create a struct (record) that looks like:

struct fgetline_info {
char *line;
size_t len;
int status;
};

There are various choices that could be made in the types -
season to taste. (A pointer to the struct could be passed in as
an argument if need be.)


A second issue is that there is no limit to how large the line
can be. A plausible way to provide a limit that is to pass one
in as an argument. If we do, our prototype looks like this:

struct fgetline_info fgetline(FILE * fptr, size_t maxsize);
Here I went wrong. Having a struct (record) to hold information
is plausible, but it may be overkill. Be that as it may, in a C
implementation this prototype doesn't work too well. First of all
the user has to go through the struct to access the line and the
status field. More importantly the normal usage in C would be a
while loop, e.g.,

struct fgetline_info fg;
...
while (fg = fgetline(FILE *fptr, size_t maxsize)) {...}

This doesn't work - the idiom requires a test on something that
can be "zero". If we convert the prototype and fg to a pointer
the code still doesn'twork. Now the problem is that when the read
is successful we don't need the status field; when the read fails
we don't have the status field.
This version has a little problem - where did the storage for the

line come from? One way to deal with that is for fgetline to
allocate storage for the line and for the calling routine to free

it when it is done with the line.

Okay, so how do we do this in fgetline? Well, there is a very
simple way to do it, one that has been reinvented time and time
again. We start out allocating a standard amount of space
(numbers like 32, 64, and 80 bytes are typical) for a line.
We read from the file until we either hit an EOL (end of line
marker), a failure to read, or we have read as many characters as

were allocated. If we haven't hit the EOL we reallocate the
array, commonly by doubling the size, and doing another read.

This works, but it is inefficient - we have to call malloc and
free for every line.

Whether the cost of calls to malloc and free matters obviously
depends on the application. However (a) the costs are commonly
underestimated, and (b) a general policy of ignoring apparently
minor costs can cumulate and produce unexpected inefficiencies.


One way to get around this is to use a
technique I call highwater buffering. In the file containing the

code for fgetline we have two file-scope variables:

static char * buffer = 0;
static size_t size = 0;

In fgetline we have some initialization code that allocates
buffer space when the buffer size is zero. Thereafter we use our

little doubling trick. The advantage of this scheme is that we
have at most a small number of calls to malloc (the number of
doublings needed to get to the largest line) and no calls to
free.

The very real disadvantage of this scheme is that it produces a
dirty copy of the line. By a dirty copy, I mean one that can be
scribbled on elsewhere before we are done with it. This will
happen whenever fgetline is called anywhere else with any FILE
argument before our next read. This is double plus ungood.

As a note, producing dirty copies "works" provided that there is
only one file being read at a time. Don't count on it.
Is there a way to get the efficiency of the highwater scheme
without being dirty? Yes; the trick is that the calling program
provides the initial buffer. If we add the buffer information to

the prototype we get:

fgetline_info fgetline(FILE * fptr,
char * buffer,
size_t size,
size_t maxsize);
This is a worst of all worlds prototype - it combines an awkward
return type with a cumbersome calling sequence.
There are three kinds of copies of the line that we might get
from fgetline - clean, transient, and dirty. A clean copy is one

that has no encumbrances - it lasts until it is specifically
deleted. A transient copy is one that lasts until the next
call to fgetline to get another line from the current file.
(There may be another file being read elsewhere.)

This distinction between kinds of copies that we might get is a
critical one; without it either a default is chosen without
weighing its merits or, worse yet, concepts are mixed and subtle
bugs can slip in.
What kind of copy should fgetline provide, clean or transient?
There are arguments for each choice. Clean copies are more
expensive but safer. Transient copies are (usually) cheaper. In

questions like this there is much to be said for the design
principle that:

When there is a significant choice as to the kind of
output being delivered the library routine should let
the user make the choice rather than dictating the
choice unless offering the choice unduly complicates
usage.

This is a good maxim but it is not the whole story. If one choice
commits the user and the other doesn't, provide the noncommital
result. In this case providing clean copies is the commital
choice; if the routine provides a clean copy the user cannot opt
to get a transient copy. The choice commits the user to a malloc
cost and an eventual free cost. If the routine returns a
transient copy the user can duplicate it to get a clean copy.
So how do we provide a choice? That is simple; if the buffer
pointer in the calling sequence is NULL, fgetline must return a
clean copy; otherwise fgetline MAY return a transient copy. I
say "may" because it might have to increase the buffer size.
If it does the calling routine will have to check whether there
was a size increase and, if so, will have to free the returned
buffer. (The fgetline implementation can't realloc the passed
in buffer; it will have to do a malloc for the first resizing.)

This is an example of a serious error in design. Let us
accept for the moment that we want to offer a choice; naturally
we ask how to do it. What I did was to envisage an implementation
and treat it as the only alternative. Instead of thinking in
terms of a potential implementation I should have thought in
terms of API alternatives. An obvious alternative is to set a
flag instead.

Beating this piece of ground where a horse once died some
more, using different data values to select modes is a suspect
practice, one that one should be wary about using. In this case
the value of the buffer pointer (null or not) was used as a flag.

A more serious issue is that there are alternatives for
supplying the space. They include:

1. The user declares an array on the stack.
2. The user declares a structure that contains an array.
3. The user allocates space using malloc.
4. The read routine allocates the space.

In alternatives 1 and 2 there need not be calls to malloc at all.
However care must be taken to avoid "freeing" stack storage.In
alternatives 3 and 4 there is no need for a separate buffer.

It's probably best to have a value in the status field that
signifies the size has been increased; call that value
fg_increased and a normal read fg_normal. Then the usage for
transient copies might look like this (first cut):

...
struct fgetline_info fg;
char buffer[80];
int done = 0;
....
for(;!done;) {
...
fg = fgetline(fptr,buffer,80,FG_MAXSIZE);
if (fg.status == fg_normal || fg.status == fg_increased) {
/* do stuff */
if (fg.status == fg_increased) free(fg.line);
} else done = 1;
}

If we want clean copies the corresponding loop body would be

fg = fgetline(fptr,0,0,FG_MAXSIZE);
if (fg.status == fg_normal) {
/* do stuff */
free(fg.line);
} else done = 1;

This schematic code illustrates why it was a bad idea to return a
struct and to put the line pointer in the struct. Also it should
be be easy enough for the implementation to free that pesky
increased line; having the user do it is an unnecessary
complication.
What sort of return values should be available in the status
field? These are the successful ones that occur to me:

end_of file The last read (if any) found an EOL
marker. The current read finds an
immediate EOF.

no_increase Normal read - either buffer was null or
else it was not increased.

increase Normal read - the buffer size has been
increased.

abn_no_increase Abnormal read - an EOF was found without
an EOL. Buffer was not increased.

abn_increase Abnormal read - an EOF was found without
an EOL. Buffer was increased.

The increase/no_increase distinction is unnecessary. That
leaves three cases, normal EOF, normal line read, and a premature
EOF in the final line.

One issue not covered was what to do about overly long lines.
There are several possible actions, e.g.,

* We could treat it as an error condition.
* We could treat it as an end of data, reserving the option to
pick up later where we left off.
* We could cut the long line into pieces maxlen characters
long.
* We could chop out all characters in the long line after the
first maxlen characters.
* We could skip the long line and proceed to the next line.

Clearly the routine should not decide on its own what action
to take; the user must specify what action to take by passing a
flag. If the user does not specify an action the default should
be to treat a long line as an error condition.
In addition there are numerous kinds of errors that can occur.
Calling sequence arguments include:

no_file The file pointer is null.
bad_buffer One and only one of buffere and size s
zero.
bad_size Size is 0 or is greater than maxsize
bad_maxsize Maxsize is 0

Then there are the memory allocation failures:

bad_allocate Malloc or realloc failure
big_line Line length is greater than maxsize.

When there is a memory allocation failure the line and len fields
hold what has been successfuly read.

This way of proceeding - listing error tags and short
explanations - has its faults; it comes from focusing too heavily
on potential code. For example, the calling sequence argument
errors only apply to a particular choice of calling sequence. A
better approach might be to try to analyze the kinds of errors
that can occur. Briefly, these include:

* A format fault in the file: This can either be a premature
EOF or an overly large line. This is not necessarily an error
since the user can opt to accept the faulty lines.
* Calling sequence error: The calling sequence is faulty in
some way. The key here is that the constraints on the calling
sequence should be spelled out at some point.
* Memory allocation error: Malloc/realloc fails. Usually, but
not necessarily, this is a fatal error.
* I/O error: Something goes wrong while trying to read the
file.

What action should be taken is quite another matter. Users
may or may not have an error management strategy. If there is
one, the routine shouldn't pre-empt it. At a minimum the routine
should return an error code. The objection to returning an error
code is that the conscientious user is then stuck with adding
error handling code. A useful technique is to pass to the routine
flags telling what to do about errors, e.g., write to a log file,
call exit, etc.
There are various conventions one could use for assigning code
values for the possible return values. My view is that there
should be a bit that is set only if there was an error, a bit
that is set only if there was a memory allocation error, a bit
that is set only if there was a buffer increase, and a bit that
is set only if an EOL occurred.

The point of doing this is that we can have simple tests to check

whether an error occurred, whether some space has to be freed,
and whether the file read is complete.

This is a useful technique. However to avoid magic numbers
disease the positions of the various flag bits should be defined
separately in the include file that spells out the file read
package prototypes.
As a final note, there is one minor decision left unsettled,
namely should the returned line include an EOL marker before the
string terminating 0. My take is that this matter is too trivial

to warrant adding a flag to the calling sequence, and that it
will be slightly less confusing if there is one present even if
it has to be manufactured. Perhaps someone has a convincing
argument one way or the other.

The issue here is whether the delivered line contains \n (the EOL
character) or not. In other do lines look like abc\n\0 or abc\0?
Since we are already committed to using flags we may as well have
a flag to select whether to include the EOF character; the
appropriate default is to not include it.

At this point we have arrived at the end of the original
discussion. It went off track, but it was thought provoking.

Finally, what should the interface look like? I now believe that
the prototype should look like this:

int getfline(FILE * fptr,
char ** line_ptr,
long * status,
size_t * length_ptr,
size_t maxlen);

Normal usage would be to declare line, status, and len as char *,
long, and size_t respectively. Getfline returns 1 if the read was
normal and 0 if it was not. It would normally be used in a while
loop, e.g.,

while (getfline(fptr,&line,&status,&len,maxlen) {
/* Code to process line */
}
/* Error testing (if any) goes here
Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
C

Charlie Gordon

Finally, what should the interface look like? I now believe that
the prototype should look like this:

int getfline(FILE * fptr,
char ** line_ptr,
long * status,
size_t * length_ptr,
size_t maxlen);

Normal usage would be to declare line, status, and len as char *,
long, and size_t respectively. Getfline returns 1 if the read was
normal and 0 if it was not. It would normally be used in a while
loop, e.g.,

while (getfline(fptr,&line,&status,&len,maxlen) {
/* Code to process line */
}
/* Error testing (if any) goes here

You did not document what the parameters mean. Let me infer from the
discussion that:

- FILE *fptr is the stream from which to read characters.
- char **line_ptr is the address of a pointer to the buffer for reading the
line. This buffer will be reallocated with realloc or malloc if needed,
under control of the flags. Its initial value can be NULL.
- long *status is an input/output parameter: it points to a long with
input/output flags:
o whether the buffer was allocated.
o whether is should be extended
o whether the newline character should be stored in the buffer.
o whether there was a read error, an EOF without a newline.
o whether the line was truncated
o whether to skip the truncated part or split it into separate lines.
- size_t *length_ptr points to a size_t containing the current size of the
allocated buffer. It will be updated if reallocation occurs.
- size_t maxlen is the maximum size for buffer reallocation.

the int return value is already documented as a boolean value indicating
normal or abnormal conditions.

This API still need work:

- the maxlen and length_ptr names are misleading, they should be maxsize
and bufsize_ptr. They are sizes in bytes, not the length of the line that
can be read into the buffer. This kind of misnomer tends to produce off by
one errors in implementations and calling code.
- the bufsize_ptr should be passed right after the line_ptr as they are
stronly related.
- there is no need to make the flags a long, int should suffice.
- the return value could be more informative: we could make it an ssize_t to
return the length of the line read or EOF upon end of file. It would also
be more intuitive as a replacement for fgets() if it returned a pointer to
the char buffer.

Input/output parameters are a bit confusing, it might be more effective to
declare a structure for all these fields and pass a pointer to it.

struct getfline_args {
char *buffer;
size_t buffer_size;
char *allocated_buffer;
size_t max_alloc_size;
int flags; // or even better: define flags as bitfields
size_t read_length;
};

char *getfline(FILE *fp, struct getfline_args *gp);

/* example of use */

char buffer[BUFSIZ];
struct getfline_args args = { buffer, sizeof(buffer), NULL, SIZE_MAX };

while (getfline(fp, &args)) {
/* do something with args.buffer contents */
}
free(args.allocated_buffer);
 
R

Richard Harter

Below is a "man page" for a file read utility. An HTML version
can be found at http://home.tiac.net/~cri/2007/gflspec.html. It
is not guaranteed to be typo or brain fart clean.

----

SYNOPSIS:

#include "getfline.h"

int getfline(FILE * fptr,
char ** line_ptr,
size_t * length_ptr,
size_t * bufsize_ptr,
size_t maxlen,
long * flags);

Note: getfline.h is guaranteed to include stdio.h.

DESCRIPTION

Function getfline reads lines from a stream. Each time it is
called it fetches the next line until there are none left.
Getfline has fairly elaborate error detection and reporting
facilities. It also permits minimizing storage allocation and
deallocation.

CALLING SEQUENCE ARGUMENTS:

Argument fptr is the stream from which to read characters. It
must have previously been opened with fopen. fptr==NULL is an
argument error; other invalid values for fptr produce an I/O
error.

Argument line_ptr is the address of a pointer to the buffer for
reading the line. In "clean copy" mode the previous value of
line_ptr is overwritten with a pointer to a newly allocated
buffer; the user is responsible for freeing all buffers. In
"transient mode" the value of line_ptr is controlled by getfline;
it should initially be NULL.

Argument length_ptr is a pointer to a size_t variable that holds
the length of the line when there is a successful read.

Argument bufsize_ptr is a pointer to a size_t variable that holds
the allocated size of the buffer holding the line.

Argument maxlen is the maximum line length that getfline will
return.

Finally argument flags is a pointer to a status word. The status
word is divided into two sections, input flags and output flags.
When getfline is called it uses the input flags to determine the
course of action. When it returns the status word will hold
output flags that indicate what happened during the read. It
should be initialized by oring together the desired input flags.

Function return

Getfline returns 1 if a line was returned and 0 if no line was
returned. A return of 0 can either indicate the EOF was reached
or that the read was terminated by an error.

USAGE MODES

Getfline can produce either "clean copies" or "transient copies".
A clean copy is one that has no encumbrances - it lasts until it
is specifically deleted. A transient copy is one that lasts until
the next call to getfline to get another line from the current
file. Transient copy mode will be used if the GFL_FLAG_TRANS flag
is set; clean copy mode is the default.

In clean copy mode getfline allocates a new buffer for each line
that is read. The user is responsible for freeing the storage for
the lines. Clean copy mode is appropriate when we want to keep
part or all of the file in memory. In clean copy mode the values
of items pointed to by line_ptr, length_ptr, and bufsize_ptr are
overwritten; the previous values are ignored.

In transient copy mode getfline reuses the line buffer; the
previous contents, if any, will be overwritten. Getfline handles
buffer storage management; the user does not have to allocate or
free space for lines. Transient copy mode is appropriate when the
contents of a line are immediately processed. Transient copy mode
may be more efficient than clean copy mode because it minimizes
the number of calls to malloc and free.

In transient copy mode the line buffer pointer must initially
either be NULL or point to storage allocated by malloc, and
bufsize must contain the allocated size of the line buffer. The
user should not alter these variables after initializing them.
When a read fails the line buffer storage is freed by getfline
and the line pointer is cleared.

HANDLING ANOMALOUS LINES

There are two kinds of anomalous lines that can occur, a
prematurely terminated last line (i.e., one that lacks an EOL
(\n) marker), and lines that are longer than maxlen. A
prematurely terminated last line will be treated as error unless
the GFL_FLAG_ADDEOL flag is set. Whether or not it is treated as
an error, the GFL_RD_BADEOF flag will be set.

Getfline offers four different ways to handle "long" lines; they
can be treated as errors, they can be omitted, they can be
truncated, or they can be cut into pieces that are maxlen bytes
long. Regardless of the method chosen, if the line is long the
GFL_RD_LONG flag will be set in the status word.

The default is to treat "long" lines as errors. One of the other
three choices can be setting one of the following three flags:
GFL_FLAG_OMIT, GFL_FLAG_TRUNC, or GFL_FLAG_CUT. Only one of these
flags can be set; setting more than one is an arguments error.

There are two flags, GFL_RD_LONG and GFL_RD_BADEOF, that are set
if the current line being read is anomalous. These flags are set
even if the line is acceptable. The GFL_RD_BADEOF flag is set if
the last line was prematurely terminated. The GFL_RD_LONG if the
current line is long. If the GFL_FLAG_CUT is set the
GFL_RD_BADEOF flag will not be set when the final piece of a long
line is read.

RESPONSES TO ERRORS BY GETFLINE

If there is an error getfline will set three flag bits in the
status word. One is the general error flag, GFL_ERR; the second
is a error category flag; and the third is the specific error.
There are two error categories, argument errors, and execution
errors; the corresponding flags are GFL_ARG and GFL_EX. There is
a separate flag for an error (usually the argument being a NULL
pointer) in each of the arguments.

There are three different execution errors, I/O errors, storage
allocation errors, and unacceptable line errors. An I/O error is
reported if there is a read error. A storage allocation error if
malloc or realloc fails. An unacceptable line is either a long
line or a prematurely terminated line for which there is no input
flag.

There are two input flags, GFL_FLAG_LOG and GFL_FLAG_EXIT, that
specify actions that getfline should take if there is an error.
If the GFL_FLAG_LOG flag is set getfline will write an error
message to stderr when there is an error. If the GFL_FLAG_EXIT is
set, getfline will call exit when there is an error.

SAMPLE USAGE

Here are a couple of usage examples. For each we assume the
following includes and declarations:

#include "getfline.h"
....
FILE *fptr;
char *line = 0;
size_t len, bufsize = 0;
long status;

Example one illustrates clean copy mode. In example one we are
reading a file called somefile.txt. Each line is passed to
processing function that takes ownership of the lines and the
responsibility for freeing their storage.

fptr = fopen("somefile.txt","r");
while(getfline(fptr,&line,&bufsize,1024,&status)) {
process_line(line);
}
if (status & GFL_ERR) fprintf(stderr,"Oops\n");
if (fptr) fclose(fptr);

Example two illustrates transient copy mode. In example two we
are reading input from stdin and writing it to stdout. However in
lines longer than 80 characters we insert new line characters
every 80 characters.

status = GFL_FLAG_CUT | GFL_FLAG_LOG;
while(getfline(stdin,&line,&bufsize,80,&status)) {
fprintf(stdout,"%s\n",line);
}

TABLE OF FLAGS

Input flags:
GFL_FLAG_TRANS Provide transient copies. The default is
clean copies.
GFL_FLAG_EOFOK Says it's okay if the last line is
terminated by a premature EOF. The
default is to treat a premature EOF
as an error.
GFL_FLAG_CUT Says to use all of a long line but
break into pieces of length maxlen.
GFL_FLAG_TRUNC Says to use the first maxlen bytes
of a long line and discard the rest.
GFL_FLAG_OMIT Ignore long lines.
GFL_FLAG_EXIT Exit if there is an execution error.
GFL_FLAG_LOG Write an error message to stderr if
there is an execution error.

General error flags:
GFL_ERR This is set if there was any error.
GFL_ARG This is set if there are any
argument errors.
GFL_EX This is set if there are any
execution errors.

Input argument error flags:
GFL_ARG_FILE An invalid file pointer (presumably
a null pointer) was passed.
GFL_ARG_LINE The pointer to the line argument was
a null pointer.
GFL_ARG_LENGTH The pointer to the length_ptr argument
was a null pointer.
GFL_ARG_BUFSIZE The pointer to the bufsize_ptr argument
was a null pointer.
GFL_ARG_STATUS The pointer to the status argument
was a null pointer.
GFL_ARG_FLAGS Multiple long line action flags are set.

Execution error flags:
GFL_EX_IO There was an I/O error in trying to
read the file.
GFL_EX_STORAGE There was an error in the storage
allocation routines.
GFL_EX_READ The file had a format error (long
line or premature EOF) and no flag
was set for handling the error.

Anomalous line flags:
GFL_RD_LONG This is set if the line is long.
GFL_RD_BADEOF This is set if there was a premature
end of file.


Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
R

Richard Harter

You did not document what the parameters mean. Let me infer from the
discussion that:

I got tired of writing; I should have persevered.
- FILE *fptr is the stream from which to read characters.
- char **line_ptr is the address of a pointer to the buffer for reading the
line. This buffer will be reallocated with realloc or malloc if needed,
under control of the flags. Its initial value can be NULL.
- long *status is an input/output parameter: it points to a long with
input/output flags:
o whether the buffer was allocated.
o whether is should be extended
o whether the newline character should be stored in the buffer.
o whether there was a read error, an EOF without a newline.
o whether the line was truncated
o whether to skip the truncated part or split it into separate lines.
- size_t *length_ptr points to a size_t containing the current size of the
allocated buffer. It will be updated if reallocation occurs.
- size_t maxlen is the maximum size for buffer reallocation.

These are all correct except for length_ptr, which returns the
length of the line that strlen would return. The prototype above
has a bug - there must also be an argument that contains the
buffer size. In a separate, rather long winded post, I have
spelled out an entire description.
the int return value is already documented as a boolean value indicating
normal or abnormal conditions.

This API still need work:

- the maxlen and length_ptr names are misleading, they should be maxsize
and bufsize_ptr. They are sizes in bytes, not the length of the line that
can be read into the buffer. This kind of misnomer tends to produce off by
one errors in implementations and calling code.

As noted above there should be a separate bufsize_ptr argument.
My take on this is that the user shouldn't be messing with the
buffer size or the maximum buffer size at all. What the user
gets is a buffer containing the line as a \0 terminated string
(that may not be in the spec) and its length. Ergo, the limit
should be the maximum line length, not the maximum buffer size.
My view is that if the user hits an "off by one" error the user
probably isn't looking for null terminated strings. That said,
I can think of cases where maxbuf would be the "right" thing to
do.
- the bufsize_ptr should be passed right after the line_ptr as they are
stronly related.

I disagree. The related things are the pointer to the line and
the length of the line; they should be together. The bufsize_ptr
is part of the passing state info hack.
- there is no need to make the flags a long, int should suffice.

See the "man page" post - I ended up with 22 flags. A long is
needed for portability.
- the return value could be more informative: we could make it an ssize_t to
return the length of the line read or EOF upon end of file. It would also
be more intuitive as a replacement for fgets() if it returned a pointer to
the char buffer.

I forget whether ssize_t is posix or C99 but AFAIK it isn't C89.
I could be convinced that it should return a pointer to the char
buffer. The thing that I worried about is whether it might be
confusing because the buffer pointer is also going through the
calling sequence.
Input/output parameters are a bit confusing, it might be more effective to
declare a structure for all these fields and pass a pointer to it.

I waffled on this one; perhaps I waffled the wrong way. What I
concluded was that it came to much the same thing in that the
calling sequence and a compact structure initialization look much
the same, and adding a struct just adds more machinery.
struct getfline_args {
char *buffer;
size_t buffer_size;
char *allocated_buffer;
size_t max_alloc_size;
int flags; // or even better: define flags as bitfields
size_t read_length;
};

Buffer and allocated buffer? Yes, bitfields would be much
better.
char *getfline(FILE *fp, struct getfline_args *gp);

/* example of use */

char buffer[BUFSIZ];
struct getfline_args args = { buffer, sizeof(buffer), NULL, SIZE_MAX };

I get the impression that you have a different kind of
implementation in mind than I do. My thinking is that the user
does not declare buffer as an array. Things get messy if we
start mixing up automatic and allocated storage. So: no
char buffer[BUFSIZ];
Also the initialization looks suspect.
while (getfline(fp, &args)) {
/* do something with args.buffer contents */
}
free(args.allocated_buffer);

The final free isn't needed. Getfline can free the allocated
buffer when it hits a terminating condition.

Thanks muchly for the comments. I think I will follow most of
your suggestions. AFAICT bit fields would be the right way
to go. Using a struct to hold most of the stuff might be
cleaner, though I'm still waffling on that one. Returning a
pointer to the line is probably right. Ugh, I hate rewriting.



Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
E

Eric Sosman

Richard Harter wrote On 11/09/07 10:34,:
Below is a "man page" for a file read utility. An HTML version
can be found at http://home.tiac.net/~cri/2007/gflspec.html. It
is not guaranteed to be typo or brain fart clean.

----

SYNOPSIS:

#include "getfline.h"

int getfline(FILE * fptr,
char ** line_ptr,
size_t * length_ptr,
size_t * bufsize_ptr,
size_t maxlen,
long * flags);
[...]

Isn't there some way you can find an excuse to add
a couple more arguments? Six is too many for most people
to keep straight, but you may as well try to confuse the
geniuses, too. ;-)

Kidding aside, an argument list of that length suggests
to me that the function may be trying to be too many things
to too many people at the same time. It might be wise to
give up some functionality to improve ease of use; the net
change in usefulness could in fact be positive.
 
B

Ben Bacarisse

Thanks muchly for the comments. I think I will follow most of
your suggestions. AFAICT bit fields would be the right way
to go. Using a struct to hold most of the stuff might be
cleaner, though I'm still waffling on that one. Returning a
pointer to the line is probably right. Ugh, I hate rewriting.

I would not advocate returning a pointer to a buffer that also
"returned" via a char **. The reason is that you get an alias that
may subsequently become invalid (or change). Of course, it is not
"wrong", but I would be wary of adding that pitfall to your otherwise
very clear and flexible API.
 
R

Richard Harter

[snip]
Thanks muchly for the comments. I think I will follow most of
your suggestions. AFAICT bit fields would be the right way
to go. Using a struct to hold most of the stuff might be
cleaner, though I'm still waffling on that one. Returning a
pointer to the line is probably right. Ugh, I hate rewriting.

I would not advocate returning a pointer to a buffer that also
"returned" via a char **. The reason is that you get an alias that
may subsequently become invalid (or change). Of course, it is not
"wrong", but I would be wary of adding that pitfall to your otherwise
very clear and flexible API.

Point well taken. Upon reflection, I'm not sure that bit fields
work well in this context. Is there a portable way to set
several bits in a bit field struct?


Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
R

Richard Harter

Richard Harter wrote On 11/09/07 10:34,:
Below is a "man page" for a file read utility. An HTML version
can be found at http://home.tiac.net/~cri/2007/gflspec.html. It
is not guaranteed to be typo or brain fart clean.

----

SYNOPSIS:

#include "getfline.h"

int getfline(FILE * fptr,
char ** line_ptr,
size_t * length_ptr,
size_t * bufsize_ptr,
size_t maxlen,
long * flags);
[...]

Isn't there some way you can find an excuse to add
a couple more arguments? Six is too many for most people
to keep straight, but you may as well try to confuse the
geniuses, too. ;-)

I take it you didn't check out the list of flags. I'm
after the geniuses too. :)
Kidding aside, an argument list of that length suggests
to me that the function may be trying to be too many things
to too many people at the same time. It might be wise to
give up some functionality to improve ease of use; the net
change in usefulness could in fact be positive.

Point well taken, though there really is nothing that I would
want to give up. I wouldn't want to give up error information,
nor a bound on maximum line size, nor a reusable buffer (for
which both address and size are needed), and not even the line
length. One thing that could be done is package some of the
arguments in a struct. Frex:

struct getfline_args {
size_t bufsize,
size_t length,
size_t maxlen,
long flags);

Then we can have a prototype that looks like this:

int * getfline(FILE *fptr,
char **line_ptr,
struct getfline_args *args);

If we adopt the convention a maxlen of zero means no upper bound
check or alternately, another input flag needed to activate
bounds checking, then instead of a single long calling sequence
we can do:

struct getfline_args gfl_args = {0,0,0,0};
char *line = 0;
FILE *fptr = 0;
...
while (getfline(fptr,&line,&gfl_args) {
/* Do stuff */
}

This doesn't really change anything but it makes it easier to use
the defaults and it moves some of the definitions into the
include file. What do you think of this?


Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
E

Eric Sosman

Richard Harter wrote On 11/09/07 17:00,:
Richard Harter wrote On 11/09/07 10:34,:
Below is a "man page" for a file read utility. An HTML version
can be found at http://home.tiac.net/~cri/2007/gflspec.html. It
is not guaranteed to be typo or brain fart clean.

----

SYNOPSIS:

#include "getfline.h"

int getfline(FILE * fptr,
char ** line_ptr,
size_t * length_ptr,
size_t * bufsize_ptr,
size_t maxlen,
long * flags);
[...]

Isn't there some way you can find an excuse to add
a couple more arguments? Six is too many for most people
to keep straight, but you may as well try to confuse the
geniuses, too. ;-)


I take it you didn't check out the list of flags. I'm
after the geniuses too. :)

Well, I noticed you needed a `long' to hold them,
not a mere `int' ... Close to two dozen flags, aren't
there?
Point well taken, though there really is nothing that I would
want to give up. I wouldn't want to give up error information,
nor a bound on maximum line size, nor a reusable buffer (for
which both address and size are needed), and not even the line
length. One thing that could be done is package some of the
arguments in a struct. Frex:

struct getfline_args {
size_t bufsize,
size_t length,
size_t maxlen,
long flags);

Then we can have a prototype that looks like this:

int * getfline(FILE *fptr,
char **line_ptr,
struct getfline_args *args);

I'm not sure why the return value changed from an int
to a pointer. Typo?
If we adopt the convention a maxlen of zero means no upper bound
check or alternately, another input flag needed to activate
bounds checking, then instead of a single long calling sequence
we can do:

struct getfline_args gfl_args = {0,0,0,0};
char *line = 0;
FILE *fptr = 0;
...
while (getfline(fptr,&line,&gfl_args) {
/* Do stuff */
}

This doesn't really change anything but it makes it easier to use
the defaults and it moves some of the definitions into the
include file. What do you think of this?

It reawakens memories of the days when interfaces
used "control blocks," often populated by assembly macros.
The C library's struct tm is such a beast, FILE can be
thought of in that light, and POSIX has no shortage of
structured arguments. If you take this route, you'll at
least be on a well-marked trail.

If you like the control block approach, though, why
not go whole hog and put the line_ptr in the struct, too?
You *could* even park the fptr there, but it could ease
things a little if a struct initialized to all zeroes
meant something sensible. With a purpose-built struct
type handy by, that overwhelming mass of flags might
become a bunch of bit-fields. Bit-fields are, IMHO, a
mixed blessing, but in a case like this they'd avoid the
need to pollute the name space with all those GFL_xxx
macros.

Still, I can't shake the feeling that you're trying
to get one function to do too many tasks. Instead of
using a bazillion flags to govern different modes of
operation, might you consider a suite of related functions
to handle the important variations? In effect, you'd be
moving a few flags out of the struct and into the function
name. Observe the exec*() family of POSIX interfaces, for
example: they all do "the same thing" and they probably
all devolve on the same internal implementation, but using
different names for different (although related) operations
is a useful simplification. Try to imagine what it would
be like if all of printf(), fprintf(), sprintf(), vprintf(),
vfprintf(), vsprintf(), and vsnprintf() were packaged
into one function name with a control block to choose
different operation modes ... Ugh!

As you may gather, I am a big fan of small, simple
interfaces. Some people feel my adoration of simplicity
is unhealthily intense; my own line-getter has been
criticized for violating the "... but not simpler" part
of Einstein's dictum. (The critic was someone I think
highly of, so I gave his criticism careful thought before
deciding to ignore it.) I mention all this to make my
biases clear: the KISS principle is very important to me,
but may not hold quite so much sway over others.
 
C

CBFalconer

Eric said:
.... snip ...

As you may gather, I am a big fan of small, simple interfaces.
Some people feel my adoration of simplicity is unhealthily
intense; my own line-getter has been criticized for violating
the "... but not simpler" part of Einstein's dictum. (The
critic was someone I think highly of, so I gave his criticism
careful thought before deciding to ignore it.) I mention all
this to make my biases clear: the KISS principle is very
important to me, but may not hold quite so much sway over others.

Here is the heart of my ggets function, which more or less adheres
to your principles. The entire package, with testing code, demos,
docs, etc. is available at:

<http://cbfalconer.home.att.net/download/>

#include <stdio.h>
#include <stdlib.h>
#include "ggets.h"

#define INITSIZE 112 /* power of 2 minus 16, helps malloc */
#define DELTASIZE (INITSIZE + 16)

enum {OK = 0, NOMEM};

int fggets(char* *ln, FILE *f)
{
int cursize, ch, ix;
char *buffer, *temp;

*ln = NULL; /* default */
if (NULL == (buffer = malloc(INITSIZE))) return NOMEM;
cursize = INITSIZE;

ix = 0;
while ((EOF != (ch = getc(f))) && ('\n' != ch)) {
if (ix >= (cursize - 1)) { /* extend buffer */
cursize += DELTASIZE;
if (NULL == (temp = realloc(buffer, (size_t)cursize))) {
/* ran out of memory, return partial line */
buffer[ix] = '\0';
*ln = buffer;
return NOMEM;
}
buffer = temp;
}
buffer[ix++] = ch;
}
if ((EOF == ch) && (0 == ix)) {
free(buffer);
return EOF;
}

buffer[ix] = '\0';
if (NULL == (temp = realloc(buffer, (size_t)ix + 1))) {
*ln = buffer; /* without reducing it */
}
else *ln = temp;
return OK;
} /* fggets */
 
C

Carl

Still, I can't shake the feeling that you're trying
to get one function to do too many tasks. Instead of
using a bazillion flags to govern different modes of
operation, might you consider a suite of related functions
to handle the important variations? In effect, you'd be
moving a few flags out of the struct and into the function
name.

I think it's better to have one function. Consider that typically an ALU
will be a single circuit and the operation it performs (e.g. add,
multiply, etc.) is determined by control flags. So keeping down the
number of functions is following in good footsteps.
 
E

Eric Sosman

Carl said:
I think it's better to have one function. Consider that typically an ALU
will be a single circuit and the operation it performs (e.g. add,
multiply, etc.) is determined by control flags. So keeping down the
number of functions is following in good footsteps.

Something about the argument strikes me as unconvincing.
Do you really advocate eliminating sqrt() and cos() and
exp() and log() and all the rest, and replacing them with

double math(struct mathctl*, ...);

Or take my earlier tongue-in-cheek suggestion about
combining all the *printf() functions into one, using a
control block to manage the variations. If you think the
"Swiss Army knife" approach is good, let's see (1) your
design for the combined interface, and (2) helloworld.c
written as your interface would require.
 
M

Malcolm McLean

Eric Sosman said:
Isn't there some way you can find an excuse to add
a couple more arguments? Six is too many for most people
to keep straight, but you may as well try to confuse the
geniuses, too. ;-)
The rule of four. Four arguments is as many as can be scanned. Seven as many
as can be read.
 
R

Richard Harter

Richard Harter wrote On 11/09/07 17:00,:[snip]
I take it you didn't check out the list of flags. I'm
after the geniuses too. :)

Well, I noticed you needed a `long' to hold them,
not a mere `int' ... Close to two dozen flags, aren't
there?

22 at the last count. I expect I could trim a few if I had to.
I'm not sure why the return value changed from an int
to a pointer. Typo?

Yep. Read the disclaimer. :)
It reawakens memories of the days when interfaces
used "control blocks," often populated by assembly macros.
The C library's struct tm is such a beast, FILE can be
thought of in that light, and POSIX has no shortage of
structured arguments. If you take this route, you'll at
least be on a well-marked trail.


If you like the control block approach, though, why
not go whole hog and put the line_ptr in the struct, too?
You *could* even park the fptr there, but it could ease
things a little if a struct initialized to all zeroes
meant something sensible.

If we do the "all zeroes means something sensible" then you don't
want line_ptr in there. If you do the user code would need
something like

line = *(funky.line_ptr);

You could put line in there and let the user refer to funky.line.
That's a workable alternative but I didn't think it would be
popular. Perhaps the answer is that there are no alternatives
that would be popular.

With a purpose-built struct
type handy by, that overwhelming mass of flags might
become a bunch of bit-fields. Bit-fields are, IMHO, a
mixed blessing, but in a case like this they'd avoid the
need to pollute the name space with all those GFL_xxx
macros.

That's probably a good idea. I used the GFL_XXX hack to avoid
infringing on other namespaces. Still, "probably not used" is
not the same thing as "not used".
Still, I can't shake the feeling that you're trying
to get one function to do too many tasks. Instead of
using a bazillion flags to govern different modes of
operation, might you consider a suite of related functions
to handle the important variations? In effect, you'd be
moving a few flags out of the struct and into the function
name. Observe the exec*() family of POSIX interfaces, for
example: they all do "the same thing" and they probably
all devolve on the same internal implementation, but using
different names for different (although related) operations
is a useful simplification. Try to imagine what it would
be like if all of printf(), fprintf(), sprintf(), vprintf(),
vfprintf(), vsprintf(), and vsnprintf() were packaged
into one function name with a control block to choose
different operation modes ... Ugh!

It could scarcely be less ugly than the *print* collection.
As you may gather, I am a big fan of small, simple
interfaces. Some people feel my adoration of simplicity
is unhealthily intense; my own line-getter has been
criticized for violating the "... but not simpler" part
of Einstein's dictum. (The critic was someone I think
highly of, so I gave his criticism careful thought before
deciding to ignore it.) I mention all this to make my
biases clear: the KISS principle is very important to me,
but may not hold quite so much sway over others.

Here we may have to disagree a bit. The trouble with a KISS
approach in this case is that error and efficiency go by the
wayside. The simple way is

char * getalife(FILE *);

Now this means that you are automatically committed to separately
allocating space for each line. This is a tradeoff; what is
being traded is calling sequence simplicity for execution time.
A KISS advocate is saying in effect that the execution time is
not worth worrying about. I take the view that one should have a
choice. There is a second small issue with the KISS dictum - one
often ends up having to do extra work anyway - in this case
freeing the allocated space for each line.

The compact prototype also throws away the knowledge of the
length of the line. If users want that information they have to
call strlen, i.e., redo the work that had already be done.

The compact prototype leaves no room for a bound on the line
size. I believe an unnamed party has made that very point. I
happen to agree. I tend to take the view that malloc failures
mean that there is something fundamentally wrong with a program.


Another thing that is wrong with the KISS prototype is that there
is no place for an error return. I/O errors can be detected by
the user, but that again is extra code on the user side. However
there is no good way to pass back the knowledge that malloc has
failed.

Falconer's version isn't much better. IIRC it tooks like:

int nolifehere(FILE *, char **);

where I assume that the return is 0 if a line was read properly
and a code value if it fails in some way. This adds error
returns for a modest price in complexity. It has the same built
in automatic inefficiencies. It also steps on one of C's little
awkwardnesses, namely the convention that 0 is false and non-zero
is true. IMNSHO this is an ancient mistake; generally speaking
there usually is only one way to succeed and many ways to fail.
It would have better to do it the other way around. However it
is a convention cast is three billion year old granite so there
is nothing to be done about it.

The real problem with the KISS dictum in this case is that there
is no way to implement it and still reuse buffer space. The
problem is the itsagoodlife function needs state that is
persistent from call to call, and that the space for that state
must be supplied by the user. What it comes down to is that you
can have a getalife function with a simple interface but not a
itsagoodlife function with a simple interface.

Of course one could have a getalife wrapper around a itsagoodlife
implementation; I have no problem with that.

I expect I will do yet another spec and post it. Ugh. Still
more rewriting.







Richard Harter, (e-mail address removed)
http://home.tiac.net/~cri, http://www.varinoma.com
In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die
 
E

Eric Sosman

Richard said:
If we do the "all zeroes means something sensible" then you don't
want line_ptr in there. If you do the user code would need
something like

line = *(funky.line_ptr);

You could put line in there and let the user refer to funky.line.
That's a workable alternative but I didn't think it would be
popular. Perhaps the answer is that there are no alternatives
that would be popular.

I was thinking along the lines of "If the line_ptr element
is NULL, fgetline() allocates and thereafter manages a suitable
buffer." C.f. setvbuf().
As you may gather, I am a big fan of small, simple
interfaces. [...] I mention all this to make my
biases clear: the KISS principle is very important to me,
but may not hold quite so much sway over others.

Here we may have to disagree a bit. The trouble with a KISS
approach in this case is that error and efficiency go by the
wayside. The simple way is

char * getalife(FILE *);

Oddly enough, that's the signature of my line-getter.
Now this means that you are automatically committed to separately
allocating space for each line.

Well, no. Mine reuses (and perhaps expands) the buffer at
each call. You only get to keep one line at at time "internal"
to the line-getter, which doesn't seem to be a problem: My usual
pattern is not to save the lines verbatim, but to parse them and
extract data. If you *do* want to save the lines verbatim, you
can't re-use the same buffer in any case (although in this case
mine would incur a string copy yours might be able to avoid).
The compact prototype also throws away the knowledge of the
length of the line. If users want that information they have to
call strlen, i.e., redo the work that had already be done.

True. (Shrug.) I have seldom found the line length an
interesting datum.
The compact prototype leaves no room for a bound on the line
size.

True. (Shrug, perhaps with a touch less assuredness.) If
you know an upper limit on line length, fgets() is available.
Another thing that is wrong with the KISS prototype is that there
is no place for an error return. I/O errors can be detected by
the user, but that again is extra code on the user side. However
there is no good way to pass back the knowledge that malloc has
failed.

"No place for an error return?" Mine returns NULL for end-of-
input, I/O error, or malloc() failure, which can be disambiguated
(if desired) by calling feof() and ferror(). Usually the program
only cares about normal-vs.-other, and the code looks like

while ((line = getline(stream)) != NULL) {
...
}
if (! feof(stream)) {
die_horribly();
}

Note that *some* kind of test is necessary in any event,
since the underlying getc() lumps end-of-input and error into
a single EOF return value.
The real problem with the KISS dictum in this case is that there
is no way to implement it and still reuse buffer space.

Not true of mine.

I acknowledge that my "simplicity trumps all" approach is not
The Answer for all situations, nor (obviously) for all programmers.
My misgiving about the design you are working on is that it seems
to be trying to be The Answer, and I doubt that's a reasonable goal.
 
C

CBFalconer

Richard said:
.... snip ...

where I assume that the return is 0 if a line was read properly
and a code value if it fails in some way. This adds error
returns for a modest price in complexity. It has the same built
in automatic inefficiencies. It also steps on one of C's little
awkwardnesses, namely the convention that 0 is false and non-zero
is true. IMNSHO this is an ancient mistake; generally speaking
there usually is only one way to succeed and many ways to fail.
It would have better to do it the other way around. However it
is a convention cast is three billion year old granite so there
is nothing to be done about it.

No, the thing is that there is only one signal needed for 'OK', but
it is quite possible to return various error forms. The 0 == OK
matches this. It is a general practice in the standard library,
and avoids returning a separate error value.
 
D

David Thompson

Here I went wrong. Having a struct (record) to hold information
is plausible, but it may be overkill. Be that as it may, in a C
implementation this prototype doesn't work too well. First of all
the user has to go through the struct to access the line and the
status field. More importantly the normal usage in C would be a
while loop, e.g.,

struct fgetline_info fg;
...
while (fg = fgetline(FILE *fptr, size_t maxsize)) {...}

This doesn't work - the idiom requires a test on something that
can be "zero". If we convert the prototype and fg to a pointer
the code still doesn'twork. Now the problem is that when the read
is successful we don't need the status field; when the read fails
we don't have the status field.
You can do
while( (fg = fgetline (...)) . status == 0 ){ ... }
/* or != 0, or > 0, or whatever your semantics is */

This is not usual practice (in C), and I'd want to get some experience
with it before deciding whether I PREFER it, but it definitely works.

FWIW it could be argued that it's analogous to the much more idiomatic
use in LISP of a multiple-valued return that silently collapses to the
first value if the caller doesn't ask for the rest.

<snip MANY other points>
- formerly david.thompson1 || achar(64) || worldnet.att.net
 
C

Chris Torek

Eric Sosman said:
It reawakens memories of the days when interfaces
used "control blocks," ...
As you may gather, I am a big fan of small, simple
interfaces. Some people feel my adoration of simplicity
is unhealthily intense; ...

There is a middle path here. Consider:

char *get_a_line(FILE *file, struct whatever *control_block);

where the "control_block" parameter is optional, and may be
given as NULL. The "simple interface user" then writes:

while ((line = get_a_line(fp, NULL)) != NULL)
... work with the line ...

and ignores all distinctions between "complete" and "incomplete"
lines, "ordinary EOF" and "read error", and so on. The "complex
interface with control block" user creates and populates the control
block, and does:

while (get_a_line(fp, &cb), cb.status == OK)
... etc ...

or whatever is appropriate. (Note: the above assumes that the
return value used in the "simple interface" case is duplicated
somewhere in the control block, in this case.)
 
C

Charlie Gordon

CBFalconer said:
Here is the heart of my ggets function, which more or less adheres
to your principles. The entire package, with testing code, demos,
docs, etc. is available at:

<http://cbfalconer.home.att.net/download/>

#include <stdio.h>
#include <stdlib.h>
#include "ggets.h"

#define INITSIZE 112 /* power of 2 minus 16, helps malloc */
#define DELTASIZE (INITSIZE + 16)

enum {OK = 0, NOMEM};

The return value is poor: testing for EOF is indirect.
Returning the length of the string could be useful too.
int fggets(char* *ln, FILE *f)
{
int cursize, ch, ix;

Why not size_t for the size and the index ?
char *buffer, *temp;

*ln = NULL; /* default */
if (NULL == (buffer = malloc(INITSIZE))) return NOMEM;
cursize = INITSIZE;

ix = 0;
while ((EOF != (ch = getc(f))) && ('\n' != ch)) {
if (ix >= (cursize - 1)) { /* extend buffer */
cursize += DELTASIZE;
if (NULL == (temp = realloc(buffer, (size_t)cursize))) {
/* ran out of memory, return partial line */
buffer[ix] = '\0';
*ln = buffer;

You should push the character back to the stream with ungetc(ch, f), or it
will be lost.
return NOMEM;
}
buffer = temp;
}
buffer[ix++] = ch;
}
if ((EOF == ch) && (0 == ix)) {
free(buffer);
return EOF;
}

buffer[ix] = '\0';
if (NULL == (temp = realloc(buffer, (size_t)ix + 1))) {
*ln = buffer; /* without reducing it */
}
else *ln = temp;
return OK;
} /* fggets */

This function effectively calls malloc *and* realloc at least once per line
successfully read: that's simple but quite inefficient.
 
R

Richard Heathfield

Charlie Gordon said:
"CBFalconer" <[email protected]> a écrit dans le message de


This function effectively calls malloc *and* realloc at least once per
line successfully read: that's simple but quite inefficient.

Yeah, he's been told over and over again about that.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,900
Latest member
Nell636132

Latest Threads

Top