Taking a stab at getline

N

Noob

Hello group,

Here's my attempt at writing a "get_line" implementation, which
reads an entire line from a file stream, dynamically allocating
the space needed to store said line.

Since this is for my own personal use, I didn't bother handling
out-of-memory conditions elegantly. I just kick the bucket.
("We all kick the bucket in the end, in the end.")

EOF is signaled by returning NULL.

I'd like to hear suggestions/criticism.

#include <stdio.h>
#include <stdlib.h>
#define BUFLEN 4000
char *get_line(FILE *stream)
{
char *s = NULL; size_t len = 0; int c;

do c = fgetc(stream); while (c == '\n'); /* ignore leading empty lines */
if (c == EOF) return NULL;
ungetc(c, stream);

while ( 1 )
{
size_t max = len + BUFLEN;
s = realloc(s, max);
if (s == NULL) exit(EXIT_FAILURE);
while (len < max)
{
c = fgetc(stream);
if (c == EOF || c == '\n')
{
s[len] = '\0';
return realloc(s, len+1);
}
s[len++] = c;
}
}
}

Regards.
 
B

BartC

Noob said:
I'd like to hear suggestions/criticism.

No user docs given. As I understand how it works:

o The function allocates a new dynamic buffer for every line read; the
caller is responsible for freeing this.

o The buffer allocated will always be at least 4000 bytes? (Significant if
the caller wants to store these pointers in an array)

o A return value of NULL means end-of-file (out-of-memory signalled by
aborting)

o *Blank lines are ignored*. (This is different from a normal getline();
blank lines can be significant. However blank lines containing only white
space *are* returned.)

o The return value points to a null-terminated string (which always contain
at least one character), when it's not NULL.
while ( 1 )
{
size_t max = len + BUFLEN;
s = realloc(s, max);
if (s == NULL) exit(EXIT_FAILURE);
while (len < max)
{
c = fgetc(stream);
if (c == EOF || c == '\n')
{
s[len] = '\0';
return realloc(s, len+1);

Why the extra byte allocated here?
 
B

BartC

o The buffer allocated will always be at least 4000 bytes? (Significant if
the caller wants to store these pointers in an array)

Why the extra byte allocated here?

I got it just as I pressed Send! (I rarely use realloc.)
 
E

Eric Sosman

Hello group,

Here's my attempt at writing a "get_line" implementation, which
reads an entire line from a file stream, dynamically allocating
the space needed to store said line.

Since this is for my own personal use, I didn't bother handling
out-of-memory conditions elegantly. I just kick the bucket.
("We all kick the bucket in the end, in the end.")

EOF is signaled by returning NULL.

I'd like to hear suggestions/criticism.

#include <stdio.h>
#include <stdlib.h>
#define BUFLEN 4000
char *get_line(FILE *stream)
{
char *s = NULL; size_t len = 0; int c;

do c = fgetc(stream); while (c == '\n'); /* ignore leading empty lines */

It's a matter of preference, but I wouldn't choose to do this.
Whether an empty line is meaningful or not isn't something for a
low-level utility function to decide.
if (c == EOF) return NULL;
ungetc(c, stream);

while ( 1 )
{
size_t max = len + BUFLEN;

There's a faint chance this sum could wrap around, but you're
probably safe in ignoring it: You're likely to run out of memory
before exceeding what a size_t can handle. Still, an "industrial-
strength" function would be on guard against the possibility.
s = realloc(s, max);
if (s == NULL) exit(EXIT_FAILURE);

Ugh. Ugh. Ugh. Yes, I saw what you wrote about kicking the
bucket -- but here you're kicking the entire program's bucket,
without even knowing what the bucket may hold. Ugh. Ugh. Ugh.
while (len < max)
{
c = fgetc(stream);

Why fgetc() instead of getc()? (Same question above, too.)
if (c == EOF || c == '\n')
{
s[len] = '\0';
return realloc(s, len+1);

... possibly returning NULL after successfully reading a line,
plus leaking the memory where the line resides ... You could
retain the original pointer and return it if realloc() fails, or
you could make BUFLEN smaller and just not worry about the extra
few bytes.
}
s[len++] = c;
}
}
}
 
J

James Kuyper

Hello group,

Here's my attempt at writing a "get_line" implementation, which
reads an entire line from a file stream, dynamically allocating
the space needed to store said line.

Since this is for my own personal use, I didn't bother handling
out-of-memory conditions elegantly. I just kick the bucket.
("We all kick the bucket in the end, in the end.")

EOF is signaled by returning NULL.

I'd like to hear suggestions/criticism.

#include <stdio.h>
#include <stdlib.h>
#define BUFLEN 4000
char *get_line(FILE *stream)
{
char *s = NULL; size_t len = 0; int c;

do c = fgetc(stream); while (c == '\n'); /* ignore leading empty lines */
if (c == EOF) return NULL;
ungetc(c, stream);

while ( 1 )
{
size_t max = len + BUFLEN;
s = realloc(s, max);
if (s == NULL) exit(EXIT_FAILURE);
while (len < max)
{
c = fgetc(stream);
if (c == EOF || c == '\n')
{

You're treating End Of File, End Of Line, and an I/O error in exactly
the same fashion. In some contexts that might be acceptable, but I would
ordinarily want to distinguish them. As long as you're restricted to
reporting errors by returning NULL, you can't distinguish them. That's
why I generally prefer using the return value of my function to report
an error code; I'd take a pointer to a pointer as an argument to the
function, and return the pointer the allocated memory through that argument.
s[len] = '\0';
return realloc(s, len+1);
}
s[len++] = c;
}
}
}

When this routine is done, the only record of the length of the
allocated memory is the null character at the end of the allocation. I
prefer storing the length along with the allocation. To avoid having to
allocate two separate objects, I use a flexible array member:

struct allocation {
size_t length;
allocated_type data[];
};

In this case, allocated_type is char. Allocate the entire structure, and
then store it's length in the structure. Reallocate as needed, storing
the new length in the structure only after the reallocation is successful.

On the rare occasions when I've done anything like this, I've usually
occasionally desired to re-use an existing allocation. Therefore, the
routine should take a pointer to an allocation as input, and malloc() a
new one only if that pointer is null.
 
J

James Kuyper

On 02/07/2013 08:55 AM, BartC wrote:
....
o The buffer allocated will always be at least 4000 bytes? (Significant if
the caller wants to store these pointers in an array)

No, it gets reallocated below 4000 if less space is needed.

....
o *Blank lines are ignored*. (This is different from a normal getline();
blank lines can be significant. However blank lines containing only white
space *are* returned.)

It ignores leading blank lines. That's a specialized usage, A general
purpose function shouldn't work that way. But blank lines after the
first blank line are not ignored; they cause the return of a pointer to
null-terminated string containing no characters, which seems reasonable.
 
E

Eric Sosman

On 02/07/2013 08:55 AM, BartC wrote:
...

No, it gets reallocated below 4000 if less space is needed.

...

It ignores leading blank lines. That's a specialized usage, A general
purpose function shouldn't work that way. But blank lines after the
first blank line are not ignored; they cause the return of a pointer to
null-terminated string containing no characters, which seems reasonable.

It looks like *all* empty lines are skipped, since the
same skippage occurs on each call to the function. Or did
I miss something?
 
J

James Kuyper

On 2/7/2013 10:49 AM, James Kuyper wrote: ....

It looks like *all* empty lines are skipped, since the
same skippage occurs on each call to the function. Or did
I miss something?

No, I just made a stupid mistake while reading the code. It's one of
several such mistakes, but I caught all (I hope!) of the others before
posting my message.
 
N

Noob

Eric said:
It's a matter of preference, but I wouldn't choose to do this.
Whether an empty line is meaningful or not isn't something for a
low-level utility function to decide.

You're probably right. I'll change it in v2.
There's a faint chance this sum could wrap around, but you're
probably safe in ignoring it: You're likely to run out of memory
before exceeding what a size_t can handle. Still, an "industrial-
strength" function would be on guard against the possibility.

I understand the point you are making.
Ugh. Ugh. Ugh. Yes, I saw what you wrote about kicking the
bucket -- but here you're kicking the entire program's bucket,
without even knowing what the bucket may hold. Ugh. Ugh. Ugh.

Thing is, since I chose NULL to mean EOF, there was no "room"
left for me to signal OOM. Unless I change the prototype...
I'll give it a shot in v2.
Why fgetc() instead of getc()? (Same question above, too.)

At some point in the early stages, I was thinking that it might
be "dangerous" to use the macro version. Obviously, there's no
problem when 'stream' is a function argument.
if (c == EOF || c == '\n')
{
s[len] = '\0';
return realloc(s, len+1);

... possibly returning NULL after successfully reading a line,
plus leaking the memory where the line resides ... You could
retain the original pointer and return it if realloc() fails, or
you could make BUFLEN smaller and just not worry about the extra
few bytes.

Wait what?! Reallocating to a smaller size (i.e. truncating
excessive space) could fail?

At this point, len < max, thus len+1 <= max

And I thought realloc frees its argument if it doesn't return
what was passed in?

Regards.
 
N

Noob

BartC said:
No user docs given. As I understand how it works:

I need to add a comment explaining what the caller is supposed
to do with the return value.
o The function allocates a new dynamic buffer for every line read; the
caller is responsible for freeing this.
Correct.

o The buffer allocated will always be at least 4000 bytes? (Significant if
the caller wants to store these pointers in an array)

No, the buffer is just big enough to store the string.
o A return value of NULL means end-of-file (out-of-memory signaled by
aborting)
Right.

o *Blank lines are ignored*. (This is different from a normal getline();
blank lines can be significant. However blank lines containing only white
space *are* returned.)

Right. I'll make the behavior more consistent.
o The return value points to a null-terminated string (which always contain
at least one character), when it's not NULL.
Correct.
while ( 1 )
{
size_t max = len + BUFLEN;
s = realloc(s, max);
if (s == NULL) exit(EXIT_FAILURE);
while (len < max)
{
c = fgetc(stream);
if (c == EOF || c == '\n')
{
s[len] = '\0';
return realloc(s, len+1);

Why the extra byte allocated here?

For the terminator :)

Regards.
 
E

Eric Sosman

Eric said:
[...]
s = realloc(s, max);
if (s == NULL) exit(EXIT_FAILURE);

Ugh. Ugh. Ugh. Yes, I saw what you wrote about kicking the
bucket -- but here you're kicking the entire program's bucket,
without even knowing what the bucket may hold. Ugh. Ugh. Ugh.

Thing is, since I chose NULL to mean EOF, there was no "room"
left for me to signal OOM. Unless I change the prototype...
I'll give it a shot in v2.

One possibility is to keep your "narrow" interface and
let the caller figure it out:

char *line = get_line(stream);
if (line == NULL) {
if (feof(stream))
fputs("End of file\n", stderr);
else if ferror(stream))
fputs("I/O error\n", stderr);
else
fputs("Out of memory\n", stderr);
}

This is not as squeaky-clean as some other approaches (for example,
you could return a struct containing a status code and a pointer),
but it still allows the caller to discover everything needed.

If you decide to return NULL (or other failure indication) on
an I/O error, it might be a good idea to preserve the errno value
in case the failing getc() set it. The issue is to prevent other
functions from clobbering it while cleaning up: copy it to an int
of your own, call free() or whatever, then restore it before
returning.
Wait what?! Reallocating to a smaller size (i.e. truncating
excessive space) could fail?

Sure. Consider a memory manager that uses different "arenas"
for different allocation sizes: Shrinking an allocation might move
it from one arena to another. Or consider a memory manager that
*always* reallocates, maybe filling the old area with 0xDEADBEEF
to aid in debugging.
At this point, len < max, thus len+1 <= max

And I thought realloc frees its argument if it doesn't return
what was passed in?

No: If realloc() fails, it leaves the original allocation
untouched.
 
N

Noob

James said:
You're treating End Of File, End Of Line, and an I/O error in exactly
the same fashion. In some contexts that might be acceptable, but I would
ordinarily want to distinguish them. As long as you're restricted to
reporting errors by returning NULL, you can't distinguish them. That's
why I generally prefer using the return value of my function to report
an error code; I'd take a pointer to a pointer as an argument to the
function, and return the pointer the allocated memory through that argument.

getc returning -1 means either EOF or I/O error, right?
At this point, would the following assertion hold:

!feof(stream) == !!ferror(stream)

If we're dealing with an I/O error, how can we learn more about
the nature of the error? Is errno set at some point?
When this routine is done, the only record of the length of the
allocated memory is the null character at the end of the allocation. I
prefer storing the length along with the allocation. To avoid having to
allocate two separate objects, I use a flexible array member:

struct allocation {
size_t length;
allocated_type data[];
};

In this case, allocated_type is char. Allocate the entire structure, and
then store it's length in the structure. Reallocate as needed, storing
the new length in the structure only after the reallocation is successful.

On the rare occasions when I've done anything like this, I've usually
occasionally desired to re-use an existing allocation. Therefore, the
routine should take a pointer to an allocation as input, and malloc() a
new one only if that pointer is null.

The getline implementation in glibc works like you describe.
Using a flexible array is an interesting suggestion.

Regards.
 
E

Eric Sosman

getc returning -1 means either EOF or I/O error, right?

Almost, but not quite. A return of EOF (often -1, but
possibly some other negative value) means end-of-input or
I/O error. On some "exotic" systems it might even mean a
character was successfully read; see below.
At this point, would the following assertion hold:

!feof(stream) == !!ferror(stream)

On an "exotic" system where UCHAR_MAX > INT_MAX, it is
possible that a perfectly ordinary unsigned char could be equal
to EOF after conversion to int; if both feof() and ferror() are
false you've successfully read that character.

Even on a "vanilla" system I think both feof() and ferror()
could be true, if you've kept on calling getc() after an earlier
failure indication. The error and end-of-file indicators are
"sticky:" once they're set, they remain set until you clear
them (the actions that clear them are different for the two
indicators). So if you breeze past an I/O error and then
encounter end-of-input, I think both indicators could be set.
If we're dealing with an I/O error, how can we learn more about
the nature of the error? Is errno set at some point?

Some implementations may set it; none are obliged to. This
is part of the price one pays for portable I/O: The library
shields you from the ugly details of the underlying system, but
at the same time prevents you from peeking at them. "Shielding"
and "hiding" are closely related ...
 
K

Keith Thompson

Noob said:
getc returning -1 means either EOF or I/O error, right?

Terminology problem.

EOF is a macro defined in <stdio.h>. It expands to an integer
constant expression of type int with a negative value. It's a value
that can be returned by certain I/O functions, usually to indicate
that no more input is available. Its value is typically -1, but
you shouldn't assume that.

If getc() returns the value of EOF, it means that it tried and failed
to read a character from stdin. This can be either because of an
end-of-file condition (which will cause feof(stdin) to return a true
result), or by an error condition (which will cause ferror(stdin)
to return a true result).

EOF stands for "end of file", but don't confuse EOF (either the
macro or the int value it denotes) with an end-of-file condition.

As the standard says (N1570 7.21.7.5):

int getc(FILE *stream);

The getc function returns the next character from the input
stream pointed to by stream. If the stream is at end-of-file,
the end-of-file indicator for the stream is set and getc returns
EOF. If a read error occurs, the error indicator for the stream
is set and getc returns EOF.

(It's a bit easier to read with the highlighting in the PDF.)
At this point, would the following assertion hold:

!feof(stream) == !!ferror(stream)

It should, yes.
If we're dealing with an I/O error, how can we learn more about
the nature of the error? Is errno set at some point?

Maybe. The standard doesn't require anything in <stdio.h> to set errno,
but POSIX does.

[...]
 
B

Ben Bacarisse

Noob said:
Here's my attempt at writing a "get_line" implementation, which
reads an entire line from a file stream, dynamically allocating
the space needed to store said line.

A couple of things that have not yet been said...
Since this is for my own personal use, I didn't bother handling
out-of-memory conditions elegantly. I just kick the bucket.
("We all kick the bucket in the end, in the end.")

EOF is signaled by returning NULL.

I'd like to hear suggestions/criticism.

#include <stdio.h>
#include <stdlib.h>
#define BUFLEN 4000

Tiny point: I don't like the name, since it is not the length of a
buffer -- it's a length increment.

More substantively, it's often better to grow the allocation by a
constant factor rather than by a constant increment. However, since
realloc is probably cheap compared to I/O it may not matter here, but
it's worth considering.
char *get_line(FILE *stream)
{
char *s = NULL; size_t len = 0; int c;

do c = fgetc(stream); while (c == '\n'); /* ignore leading empty lines */
if (c == EOF) return NULL;
ungetc(c, stream);

while ( 1 )
{
size_t max = len + BUFLEN;
s = realloc(s, max);
if (s == NULL) exit(EXIT_FAILURE);

If you take Eric's suggestion of returning NULL (leaving the caller to
figure our the reason for the failure) you should adjust this to free
the allocated buffer before returning. If a huge line has caused an out
of memory condition, the last thing you want to do is return without
freeing what you can.
while (len < max)
{
c = fgetc(stream);
if (c == EOF || c == '\n')
{
s[len] = '\0';
return realloc(s, len+1);
}
s[len++] = c;
}
}
}

Regards.
 
S

Shao Miller

Here's my attempt at writing a "get_line" implementation, which
reads an entire line from a file stream, dynamically allocating
the space needed to store said line.

Based on your other posts in this thread, I gather your next version
won't skip lines, or it'll be documented and the name adjusted.
Since this is for my own personal use, I didn't bother handling
out-of-memory conditions elegantly. I just kick the bucket.
("We all kick the bucket in the end, in the end.")

EOF is signaled by returning NULL.

An alternative is to return an 'enum' value which a caller can 'switch'
with, or opaque values which a header can define macros for.
I'd like to hear suggestions/criticism.

#include <stdio.h>
#include <stdlib.h>
#define BUFLEN 4000

As Ben mentioned, it might be better called 'BUFINC', or some such. Or
you could use 'BUFSIZ' from stdio.h. Or you could allow the caller to
make the choice either through a parameter, encapsulating the choice in
some kind of state, or even with a callback function that allows the
caller to specify the method of growth. (You could have a default
choice or callback, in case the caller isn't interested.)
char *get_line(FILE *stream)
{
char *s = NULL; size_t len = 0; int c;

do c = fgetc(stream); while (c == '\n'); /* ignore leading empty lines */
if (c == EOF) return NULL;
ungetc(c, stream);

while ( 1 )
{
size_t max = len + BUFLEN;
s = realloc(s, max);
if (s == NULL) exit(EXIT_FAILURE);

Although you said above that error-handling wasn't particularly
graceful, this call to 'exit' seems a bit harsh. :)
while (len < max)
{
c = fgetc(stream);
if (c == EOF || c == '\n')
{
s[len] = '\0';
return realloc(s, len+1);
}
s[len++] = c;
}
}
}

Since a stream might be "line buffered", I wonder if you could gain any
sort of advantage by using 'fscanf'... The following quick C89 code
(could have bugs!) doesn't try to grow any buffer, but uses 'fscanf':

#include <stdio.h>

int fetch_a_line(char * buf, unsigned int bufsz, FILE * stream) {
char format_buf[
/* '%' */
1 +
/* 32 decimal digits */
32 +
/* "[^\n]%n" */
6 +
/* null terminator */
1
];
int matched;
int bytes;
int c;

sprintf(format_buf, "%%%u[^\n]%%n", bufsz);
matched = fscanf(stream, format_buf, buf, &bytes);
c = getc(stream);
if (c != '\n')
ungetc(c, stream);
if (matched > 0)
return bytes;
return matched;
}

int main(void) {
char test[512];
int bytes;

while (1) {
bytes = fetch_a_line(test, sizeof test - 1, stdin);
if (bytes == EOF)
break;
printf("Read line of %d characters: %s\n", bytes, test);
}
printf("Done.\n");
return 0;
}
 
N

Noob

pete said:
I deactivated that feature
and modified a get_line_test program which I already had,
so that it would work with your function.

What was your conclusion of your test?
 
K

Keith Thompson

Philip Lantz said:
Yes, but "EOF" also means "end of file" in ordinary everyday
computerese, which is clearly the way he was using the word in the
sentence above.

Yes, but given the existence of the EOF macro, using EOF *in a C
context* to refer to an end-of-file condition is potentially confusing.
 
R

Rosario1903

Hello group,

Here's my attempt at writing a "get_line" implementation, which
reads an entire line from a file stream, dynamically allocating
the space needed to store said line.

Since this is for my own personal use, I didn't bother handling
out-of-memory conditions elegantly. I just kick the bucket.
("We all kick the bucket in the end, in the end.")

EOF is signaled by returning NULL.

I'd like to hear suggestions/criticism.

// ritorna un puntatore da liberarsi tramite free()
// ritorna 0 per errore
char* __export GetLineI(char* testo)
{unsigned s, i;
int c;
char *p,*t;
s=1024; p=malloc(s+4);
if(p==0) return 0;
printf("%s", testo); fflush(stdout);
for(i=0; (c=getchar())!=EOF ; )
{if(i>=s)
{s=s+1024;
t=realloc(p, s+4);
if(t==0||(int)s<0)
{free(p); return 0;}
p=t;
}
p=c; ++i;
if(c=='\n')
break;
}
p=0;
return p;
}
 
S

Shao Miller

Hello group,

Here's my attempt at writing a "get_line" implementation, which
reads an entire line from a file stream, dynamically allocating
the space needed to store said line.

Since this is for my own personal use, I didn't bother handling
out-of-memory conditions elegantly. I just kick the bucket.
("We all kick the bucket in the end, in the end.")

EOF is signaled by returning NULL.

I'd like to hear suggestions/criticism.

[some code]

Here's an 'indent -kr' version of Rosario1903's code:

// ritorna un puntatore da liberarsi tramite free()
// ritorna 0 per errore
char *__export GetLineI(char *testo)
{
unsigned s, i;
int c;
char *p, *t;
s = 1024;
p = malloc(s + 4);
if (p == 0)
return 0;
printf("%s", testo);
fflush(stdout);
for (i = 0; (c = getchar()) != EOF;) {
if (i >= s) {
s = s + 1024;
t = realloc(p, s + 4);
if (t == 0 || (int) s < 0) {
free(p);
return 0;
}
p = t;
}
p = c;
++i;
if (c == '\n')
break;
}
p = 0;
return p;
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top