Reading lines from a text file

Mark Hobley

I want to read a text file a line at a time from within a C program. Are there
some available functions or code already written that does this or do I need
to code from scratch?

If I am doing this from scratch, what is the best practice for allocating
a buffer size for the input line?

I guess open the file, scan once to determine the buffer size, then rewind and
start reading. Has this already been done or do I need to code this from
scratch?

(My project is open source, so I can utilize GPL-licensed code, if necessary.)

C89 compatible code is preferred.

Mark.
 
Andrew Poelstra

I want to read a text file a line at a time from within a C program. Are there
some available functions or code already written that does this or do I need
to code from scratch?

If I am doing this from scratch, what is the best practise for allocating
a buffer size for the input line?

I guess open the file, scan once to determine the buffer size, then rewind and
start reading. Has this already been done or do I need to code this from
scratch?

(My project is open source, so I can utilize GPL licenced code, if necessary.)

Well, if you know how big your lines are, or know a reasonable
maximum, you can just use:

    char buffer[1024];
    fgets(buffer, sizeof buffer, file);

C89 compatible code is preferred.

Otherwise, Chuck Falconer has a function called ggets() on his
website that handles memory allocation and all that. I don't
remember the link, but Google will find it.

Richard Heathfield also has such a beast, according to the
comments in Chuck's code. Given that Richard is still around
and Chuck is not, you may be better off with that.

In either case, they're very easy functions to use.
 
Ben Bacarisse

I want to read a text file a line at a time from within a C program. Are there
some available functions or code already written that does this or do I need
to code from scratch?
(My project is open source, so I can utilize GPL licenced code, if necessary.)

glibc (the GNU C library that gcc normally links against) includes getline().
If you can't use gcc and link against glibc you might be able to use the
source (though extracting parts of the library might be fiddly).
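
A minimal sketch of how getline() is normally used, for reference (this is
POSIX.1-2008 / glibc rather than C89, and the feature-test macro is only
there to expose the declaration):

#define _POSIX_C_SOURCE 200809L    /* expose getline() in glibc */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *line = NULL;   /* getline() allocates and grows this buffer itself */
    size_t cap = 0;      /* current capacity of that buffer */
    ssize_t len;         /* length of the line just read, or -1 at EOF/error */

    while ((len = getline(&line, &cap, stdin)) != -1)
        fputs(line, stdout);       /* the line includes its trailing '\n' */

    free(line);          /* a single buffer is reused for every line */
    return 0;
}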

<snip>
 
Seebs

I want to read a text file a line at a time from within a C program. Are there
some available functions or code already written that does this or do I need
to code from scratch?

There are some.
If I am doing this from scratch, what is the best practise for allocating
a buffer size for the input line?

Good question!
I guess open the file, scan once to determine the buffer size, then rewind and
start reading. Has this already been done or do I need to code this from
scratch?

That's a very expensive way to do it. Reading is usually much more expensive
than, say, copying in memory. If you can make reasonable guesses about buffer
sizes, you should be able to do pretty well.

Have a look at fgets(), which gets a string of definitely no more than a
particular length. If a line is too long for it, you can call fgets()
again to get more of the line.

Do you need to keep multiple lines in memory, or do you just need to look
at each one? A typical strategy I'll use for "look at each item in turn"
is basically this:
size_t line_len = 256;
char *line_data;
line_data = malloc(line_len);
while (fgets(line_data, line_len, stdin)) {
    char *s;
    size_t this_line_len;
    this_line_len = strlen(line_data);
    while (this_line_len > 0 && line_data[this_line_len - 1] != '\n') {
        /* no newline yet: double the buffer and append to the partial line */
        s = malloc(line_len * 2);
        memcpy(s, line_data, line_len);
        free(line_data);
        line_data = s;
        if (!fgets(line_data + this_line_len, line_len + 1, stdin))
            break;    /* EOF (or error) in the middle of a line */
        line_len *= 2;
        this_line_len = strlen(line_data);
    }
}

This omits quite a bit of error checking, but the basic idea is, you
pick a buffer size, and use it, and if it's not big enough, you increase
the buffer size, reallocate, then keep using that larger buffer. In
most cases, you'll probably never even reallocate once.

-s
 
Jens Thoms Toerring

Mark Hobley said:
I want to read a text file a line at a time from within a C program. Are there
some available functions or code already written that does this or do I need
to code from scratch?
If I am doing this from scratch, what is the best practise for allocating
a buffer size for the input line?

The simplest method is to start with a guess for the length of the
longest line and allocate that much. Now you use fgets() to read in
a line and check if it ends in a '\n' - if it does, everything is
ok, but if it doesn't, the line was too long to fit into the buffer
you started off with. In that case you increase the size of the
buffer, e.g. by doubling its size, using realloc(), and try to
read the rest of the line by calling fgets() again (but with the
first argument pointing into the buffer where the last try stopped).
Then repeat the test for the final '\n' and keep increasing the
buffer size as necessary. If you don't run out of memory you end
up with a buffer that contains the complete line.

The only special case you may have to consider is that the last
line of a file may not end with a '\n', in which case, of course,
what fgets() reads in won't contain that character either - but if
you then try to read again at the very end of the file, fgets()
will return NULL, so it's possible to check for that condition.
I guess open the file, scan once to determine the buffer size, then rewind
and start reading.

I guess reading the file twice just to find out the length of the
longest line is too much work.
Has this already been done or do I need to code this from
scratch?

Probably everyone faced with the problem of reading lines of
arbitrary length will have written such a function at least once ;-)
Here's something I found looking through my files (although with
quite a number of changes to the original, so be wary, I may have
broken it!):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LEN_GUESS 128

int
read_line( FILE * fp,
           char ** line )
{
    static char *buf = NULL;
    static size_t buf_len = LEN_GUESS;
    char *p = buf;
    size_t rem_len = buf_len;

    if ( ! fp || ! line )
        return -1;                  /* bad argument(s) */

    if ( ! buf && ! ( buf = p = malloc( buf_len ) ) )
        return -1;                  /* running out of memory */
    *buf = '\0';

    while ( 1 )
    {
        size_t len;
        size_t used;
        char *tmp;

        if ( ! fgets( p, rem_len, fp ) )
        {
            if ( ferror( fp ) )
                return -1;          /* read failure */
            break;
        }

        len = strlen( p );

        if ( len > 0 && p[ len - 1 ] == '\n' )
            break;                  /* got a complete line */

        /* Line didn't fit: remember how much is already in the buffer,
           double the buffer and continue reading just after the part
           already read (realloc() may move the buffer, so p has to be
           recomputed from the new address). */
        used = ( size_t ) ( p - buf ) + len;

        if ( ! ( tmp = realloc( buf, 2 * buf_len ) ) )
            return -1;              /* running out of memory */

        buf = tmp;
        p = buf + used;
        rem_len += buf_len - len;
        buf_len *= 2;
    }

    *line = buf;
    return feof( fp ) ? 1 : 0;      /* indicate if EOF has been reached */
}

Note that it's, of course, not thread-safe. And when you call it
again the last line returned will be overwritten. When you don't
need to call the function anymore you should free() the returned
pointer.
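
For illustration, a hypothetical caller (made up for this example, and
assuming the includes and the read_line() above) could look like this:

int main( void )
{
    char *line = NULL;
    int ret;

    while ( ( ret = read_line( stdin, &line ) ) == 0 )
        fputs( line, stdout );           /* line still contains its '\n' */

    if ( ret < 0 )
        fputs( "read error or out of memory\n", stderr );
    else
    {
        if ( *line )                     /* final line without a '\n' */
            printf( "%s\n", line );
        free( line );                    /* release the buffer when done */
    }

    return ret < 0 ? EXIT_FAILURE : 0;
}
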
(My project is open source, so I can utilize GPL licenced code, if
necessary.) C89 compatible code is preferred.

Use it for whatever you want if it fits your needs (but better
check carefully that it works - it's not my tested version, I
just checked that it compiles!). And, of course, there are quite
a number of ways it could be improved; it's meant more to give
you an idea of how it could be done.

Regards, Jens
 
Keith Thompson

Andrew Poelstra said:
Otherwise, Chuck Falconer has a function called ggets() on his
website that handles memory allocation and all that. I don't
remember the link, but Google will find it.

Richard Heathfield also has such a beast, according to the
comments in Chuck's code. Given that Richard is still around
and Chuck is not, you may be better off with that.

"still around" meaning that Richard still posts here in comp.lang.c;
Chuck used to, but hasn't lately.
 
Ben Bacarisse

bartc said:
I just use a fixed size, big enough for text files that are line-oriented.

I've just checked and I'm using a 2KB buffer, but it could be much higher if
memory allows.

If the lines are longer than that sort of size, the file probably isn't
line-oriented and could do with a different approach. (Or might use a
different newline convention from that expected. Either way, you have a file
that is not in the right format.)

I have two CSV files I'm using at the moment whose longest lines have
2201 and 2306 bytes, and one old one with a 10155-byte line. It's hard
to put an upper limit on what is reasonable. Today's absurd is
tomorrow's "pah!".

<snip>
 
James Harris

I want to read a text file a line at a time from within a C program. Are there
some available functions or code already written that does this or do I need
to code from scratch?

Yes, I wrote a piece of code to do just that and incorporated in it
helpful input from other people on comp.lang.c.

http://codewiki.wikispaces.com/xbuf.c

The section on reading lines shows what you are looking for and also
why the code was needed, i.e. problems with other solutions.

James
 
Nick Keighley

I just use a fixed size, big enough for text files that are line-oriented.

I've just checked and I'm using a 2KB buffer, but it could be much higher if
memory allows.

If the lines are longer than that sort of size, the file probably isn't
line-oriented and could do with a different approach. (Or might use a
different newline convention from that expected. Either way, you have a file
that is not in the right format.)

and what does your program do?
 
Keith Thompson

bartc said:
What seems wrong is to let the input file dictate to you some ridiculous
'line length' of perhaps a billion characters, and to go along with that.

What seems wrong to me is to let limitations in the program impose
some arbitrary limit on line length, when the input format you're
trying to process imposes no such limit.

If a file format specifies a maximum line length, then by all means go
with that (and ideally report an error for any line that exceeds the
limit, unless the format specification says that characters past the
maximum are quietly ignored). If it doesn't, then handling
arbitrarily long lines is better than imposing *any* limit other than
what's imposed by available memory.

And if the file format doesn't impose a maximum length but you're
unwilling to handle very long lines, IMHO you should at least report
an internal error if you see a line longer than you can handle.
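
For example, a minimal C89 sketch of that policy (MAXLINE is an arbitrary
limit made up for this example, not anything imposed by a standard):

#include <stdio.h>
#include <string.h>

#define MAXLINE 4096

int main(void)
{
    char buf[MAXLINE + 2];      /* room for the '\n' and the '\0' */
    long lineno = 0;

    while (fgets(buf, sizeof buf, stdin)) {
        lineno++;
        if (strchr(buf, '\n') == NULL && !feof(stdin)) {
            fprintf(stderr, "line %ld exceeds %d characters\n",
                    lineno, MAXLINE);
            return 1;
        }
        /* ... process the line in buf ... */
    }
    return 0;
}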
 
Ben Bacarisse

bartc said:
The text file format is being abused then. This sounds like an export
from a database or spreadsheet. It's not text, unless you're used to
reading pages 60 feet wide.

The structure is line-oriented. It should be read in text mode and a
line ends when you see '\n'. I call that a text file.
If you already have code for a flexible getline(), then just use
it. Otherwise the next step up from a hard-coded size is a one-time
allocated buffer which remains the same size. Bung 20KB (or 200KB) in
there, and have done with it.

These solutions work, of course. I was just disputing the claim that
there is some maximum line length beyond which something stops being a
text file.

<snip>
 
bartc

Keith Thompson said:
What seems wrong to me is to let limitations in the program impose
some arbitrary limit on line length, when the input format you're
trying to process imposes no such limit.

OK, but then be prepared for your getline() function to actually need to be
a getfile() function with some input, and to potentially grab most of the
memory in your system, or even to bring down the program (if a giant file
uses the wrong newline format for example).
 
Moi

I want to read a text file a line at a time from within a C program. Are
there some available functions or code already written that does this or
do I need to code from scratch?

If I am doing this from scratch, what is the best practise for
allocating a buffer size for the input line?

I guess open the file, scan once to determine the buffer size, then
rewind and start reading. Has this already been done or do I need to
code this from scratch?

(My project is open source, so I can utilize GPL licenced code, if
necessary.)

C89 compatible code is preferred.

Mark.

No need for limits.

1) Read the entire file into one buffer using fread(), realloc()ing when needed.
2) Make a second pass over the buffer: find the line endings (handling \r\n),
replace them by '\0', and save the beginnings of the lines in an array of
pointers, realloc()ing when needed.
3) Make a third pass: process each line, searching for commas, replacing
them by '\0' and saving pointers to the beginnings, realloc()ing when needed.

Steps 2 and 3 need to take care of quoting / escaping.
Steps 1, 2 and 3 _can_ be combined into one state machine; a rough sketch
of step 1 follows below.
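
By way of illustration, a rough C89 sketch of step 1 (read_whole_file() is a
name made up for this example; the caller owns the returned buffer):

#include <stdio.h>
#include <stdlib.h>

char *read_whole_file(FILE *fp, size_t *out_len)
{
    size_t cap = 4096;              /* initial guess, doubled when full */
    size_t len = 0;
    size_t n;
    char *buf = malloc(cap);
    char *tmp;

    if (buf == NULL)
        return NULL;

    while ((n = fread(buf + len, 1, cap - len, fp)) > 0) {
        len += n;
        if (len == cap) {           /* buffer full: double it */
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }

    *out_len = len;
    return buf;     /* caller free()s; not '\0'-terminated, so step 2 should
                       work from the length rather than from strlen() */
}

Steps 2 and 3 then just walk buf[0..len-1] in place.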



HTH,
AvK
 
Keith Thompson

bartc said:
OK, but then be prepared for your getline() function to actually need
to be a getfile() function with some input, and to potentially grab
most of the memory in your system, or even to bring down the program
(if a giant file uses the wrong newline format for example).

Or don't read an entire line into memory at a time. For example,
if you're reading an XML file -- well, you should be using an
XML parser that somebody else has already written. But if you're
writing an XML parser for some reason, it might make more sense to
read and store input until you see a '<' or '>' rather than '\n'.
I've seen XML files with extremely long lines, but not with extremely
long tag names.
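
As a rough sketch of that idea (read_until() is a helper made up for this
example, not anything from a real XML library), you accumulate characters
until a delimiter instead of until '\n':

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read from fp into a growing buffer until EOF or one of the characters in
   delims is seen (the delimiter is consumed but not stored). Returns a
   malloc()ed string the caller must free(), or NULL at EOF or on allocation
   failure. */
char *read_until(FILE *fp, const char *delims)
{
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    int c;

    if (buf == NULL)
        return NULL;

    while ((c = getc(fp)) != EOF && strchr(delims, c) == NULL) {
        if (len + 1 >= cap) {               /* keep room for the '\0' */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        buf[len++] = (char)c;
    }

    if (c == EOF && len == 0) {             /* nothing left to read */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}

A caller would then use read_until(fp, "<>") where it would otherwise have
used fgets(), and never needs to care how long the "lines" are.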

But yes, sometimes it does make sense to read entire lines into memory
at once, even if they might be inordinately long.
 
bartc

Or don't read an entire line into memory at a time. For example,
if you're reading an XML file -- well, you should be using an
XML parser that somebody else has already written. But if you're
writing an XML parser for some reason, it might make more sense to
read and store input until you see a '<' or '>' rather than '\n'.
I've seen XML files with extremely long lines, but not with extremely
long tag names.

I think XML is one of those text formats (like C source files and HTML),
which are not really line-oriented; newline is just another whitespace
character.

In that case, if you don't use a dedicated file reader as you've suggested,
you can't really use simple line-input.
 
Keith Thompson

bartc said:
I think XML is one of those text formats (like C source files and
HTML), which are not really line-oriented; newline is just another
whitespace character.

Quibble: C preprocessor directives are line-oriented. And a C
compiler is allowed to impose a maximum line length on source files.
In that case, if you don't use a dedicated file reader as you've
suggested, you can't really use simple line-input.

Sure you can, as long as your simple line-input can handle arbitrarily
long lines (and you have enough memory to store them). Admittedly
it might not be the ideal solution.
 
