Reading whole text files


Michael Mair

Cheerio,


I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf(fp, "%" XSTR(BUFLEN) "c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.


If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :)


Regards,
Michael
 

infobahn

Michael said:
Cheerio,

I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.

Why inefficient? I'd prefer getc in case you're fortunate enough
to have it implemented as a macro, but it should be efficient
enough.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.

And you have to maintain /two/ buffers (quite apart from the buffer
maintained by your text stream handler) - your expanding buffer,
and the buffer you give to fgets (unless you use the expanding
buffer for that too, which is certainly doable but probably gives
you more headaches).
- Interesting: fscanf(fp, "%" XSTR(BUFLEN) "c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one,

Fine, so use that. But it wouldn't be my choice.

Vive la difference!
 

Michael Mair

infobahn said:
Why inefficient? I'd prefer getc in case you're fortunate enough
to have it implemented as a macro, but it should be efficient
enough.

"Probably" inefficient in that I cannot rely on getc() being
implemented as a macro and that I do not want to make assumptions
about the underlying library. So, essentially, the question is
for me whether having a loop in my code is "better" than just
telling fscanf() to get, say 8K characters in one go.
The main beauty of this approach lies for me in the clarity of the
code. Thanks for reminding me of getc() vs. fgetc().
And you have to maintain /two/ buffers (quite apart from the buffer
maintained by your text stream handler) - your expanding buffer,
and the buffer you give to fgets (unless you use the expanding
buffer for that too, which is certainly doable but probably gives
you more headaches).

Actually, I first implemented it with fgets() and one expanding
buffer but found, looking at the final code, that approach too unwieldy
and error-prone, as you need more code and variables.
Usually, I would have gone for the "Low" approach due to the clarity
of the resulting code but -- as I was at it -- I just asked myself
which options I have.

Fine, so use that. But it wouldn't be my choice.

I _was_ asking for opinions.

Vive la difference!

:)
Thank you for your input!


Cheers
Michael
 

jacob navia

Michael said:
Cheerio,


I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf(fp, "%" XSTR(BUFLEN) "c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.


If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :)


Regards,
Michael

What about this?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *ReadFileIntoRam(char *fname,int *plen)
{
    FILE *infile;
    char *contents;
    int actualBytesRead=0;
    unsigned int len;

    infile = fopen(fname,"rb");
    if (infile == NULL) {
        fprintf(stderr,"impossible to open %s\n",fname);
        return NULL;
    }
    fseek(infile,0,SEEK_END);
    len = ftell(infile);
    fseek(infile,0,SEEK_SET);
    contents = calloc(len+1,1);
    if (contents) {
        actualBytesRead = fread(contents,1,len,infile);
    }
    else {
        fprintf(stderr,"Can't allocate memory to read the file\n");
    }
    fclose(infile);
    *plen = actualBytesRead;
    return contents;
}

int main(int argc,char *argv[])
{
    if (argc < 2) {
        printf("usage: readfile <filename>\n");
        exit(1);
    }
    int len=0;
    char *contents=ReadFileIntoRam(argv[1],&len);
    // work with the contents of the file
}
 

Michael Mair

jacob said:
What about this?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *ReadFileIntoRam(char *fname,int *plen)
{
FILE *infile;
char *contents;
int actualBytesRead=0;
unsigned int len;

infile = fopen(fname,"rb");

Here is the crux: I want/have to work with a _text_ file.
Everything else may give me wrong results.
if (infile == NULL) {
fprintf(stderr,"impossible to open %s\n",fname);
return NULL;
}
fseek(infile,0,SEEK_END);
len = ftell(infile);
fseek(infile,0,SEEK_SET);
contents = calloc(len+1,1);
if (contents) {
actualBytesRead = fread(contents,1,len,infile);

This is what I would do for binary files.
Essentially, I am looking for the text file equivalent of fread().
}
else {
fprintf(stderr,"Can't allocate memory to read the file\n");
}
fclose(infile);
*plen = actualBytesRead;
return contents;
}

int main(int argc,char *argv[])
{
if (argc < 2) {
printf("usage: readfile <filename>\n");
exit(1);
}
int len=0;
char *contents=ReadFileIntoRam(argv[1],&len);
// work with the contents of the file
}

Thank you for trying :)


Cheers
Michael
 

S.Tobias

Why inefficient? I'd prefer getc in case you're fortunate enough
to have it implemented as a macro, but it should be efficient
enough.

In thread-safe libraries getc() family functions can actually
be quite inefficient, because they must lock the stream object,
which takes time. This is the reason why some systems provide
getc_unlocked() (thread-unsafe) family (I remember a noticeable
difference between them in my tests some time ago).

+++

Excuse my ignorance, I have no experience with text files in
the C Std context. Why wouldn't fread() be suitable for
reading text files? In 7.19.8.1p2 it says the fread() call is
performed as if by repeated calls to the fgetc() function
underneath. I haven't spotted any mention that these functions
are constrained to binary streams only.
 

Michael Mair

S.Tobias said:
In thread-safe libraries getc() family functions can actually
be quite inefficient, because they must lock the stream object,
which takes time. This is the reason why some systems provide
getc_unlocked() (thread-unsafe) family (I remember a noticeable
difference between them in my tests some time ago).
Interesting.

+++

Excuse my ignorance, I have no experience with text files in
the C Std context. Why wouldn't fread() be suitable for
reading text files? In 7.19.8.1p2 it says the fread() call is
performed as if by repeated calls to the fgetc() function
underneath. I haven't spotted any mention that these functions
are constrained to binary streams only.

It seems I am plain stupid... Somewhere in my brain, there was
"fread()/fwrite() <-> binary I/O" hardwired :-/
So, if I open the stream as a text stream, everything should be
fine. (If this is wrong, please correct me.)
Moreover, if I read the data into dynamically allocated
storage pointed to by an unsigned char *, I circumvent potential
problems with the is** functions from <ctype.h> (as I asked in
another thread).

Thank you :)


Cheers
Michael
 

Michael Mair

Michael said:
Here is the crux: I want/have to work with a _text_ file.
Everything else may give me wrong results.

Sorry, the "b" brought me back onto the wrong track I already
was on. See the other subthread.


Cheers
Michael
if (infile == NULL) {
fprintf(stderr,"impossible to open %s\n",fname);
return NULL;
}
fseek(infile,0,SEEK_END);
len = ftell(infile);
fseek(infile,0,SEEK_SET);
contents = calloc(len+1,1);
if (contents) {
actualBytesRead = fread(contents,1,len,infile);


This is what I would do for binary files.
Essentially, I am looking for the text file equivalent of fread().
}
else {
fprintf(stderr,"Can't allocate memory to read the file\n");
}
fclose(infile);
*plen = actualBytesRead;
return contents;
}

int main(int argc,char *argv[])
{
if (argc < 2) {
printf("usage: readfile <filename>\n");
exit(1);
}
int len=0;
char *contents=ReadFileIntoRam(argv[1],&len);
// work with the contents of the file
}


Thank you for trying :)


Cheers
Michael
 

SM Ryan

# Cheerio,
#
#
# I would appreciate opinions on the following:
#
# Given the task to read a _complete_ text file into a string:
# What is the "best" way to do it?
# Handling the buffer is not the problem -- the character
# input is a different matter, at least if I want to remain within
# the bounds of the standard library.
#
# Essentially, I can think of three variants:
# - Low: Use fgetc(). Simple, straightforward, probably inefficient.

char *contents=0; int m=0,n=0,ch;
while ((ch=fgetc(file))!=EOF) {
    if (n+2>=m) {m = 2*n+2; contents = realloc(contents,m);}
    contents[n++] = ch; contents[n] = 0;
}
contents = realloc(contents,n+1);

You might also include #ifdef/#endif code to use memory mapping on systems
that support it.
 

Al Bowers

Michael said:
"Probably" inefficient in that I cannot rely on getc() being
implemented as a macro and that I do not want to make assumptions
about the underlying library. So, essentially, the question is
for me whether having a loop in my code is "better" than just
telling fscanf() to get, say 8K characters in one go.
The main beauty of this approach lies for me in the clarity of the
code. Thanks for reminding me of getc() vs. fgetc().

My intuition is that the definition of a "_complete_" text file
would require the "ugly" option. Hence, I would use fgets in
a loop.
Actually, I first implemented it with fgets() and one expanding
buffer but found, looking at the final code, that approach too unwieldy
and error-prone, as you need more code and variables.

Use fgets to copy into a buffer, then append to an
expanding dynamically allocated char array. This is not unwieldy.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    char buffer[128], *fstr, *tmp;
    size_t slen, blen;
    FILE *fp;

    if ((fp = fopen("test.c","r")) == NULL) exit(EXIT_FAILURE);
    for (slen = 0, fstr = NULL;
         fgets(buffer, sizeof buffer, fp); slen += blen)
    {
        blen = strlen(buffer);
        if ((tmp = realloc(fstr, slen+blen+1)) == NULL)
        {
            free(fstr);
            exit(EXIT_FAILURE);
        }
        if (slen == 0) *tmp = '\0';
        fstr = tmp;
        strcat(fstr, buffer);
    }
    fclose(fp);
    puts(fstr);
    free(fstr);
    return 0;
}
 

infobahn

Use fgets to copy into a buffer, then append to an
expanding dynamically allocated char array. This is not unwieldy.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
char buffer[128],*fstr, *tmp;
size_t slen, blen;
FILE *fp;

if((fp = fopen("test.c","r")) == NULL) exit(EXIT_FAILURE);
for(slen = 0, fstr = NULL;
(fgets(buffer,sizeof buffer, fp)) ; slen+=blen)
{
blen = strlen(buffer);

Consider a file 12,800,000 or so bytes in length. This means you'll
call strlen 100,000 times, and just about every call will have to
trawl through 128 (or so) bytes. That is, modulo the last read,
you'll have to touch every character /three/ times - once while
reading, once while strlenning, and once while copying. For large
files, this is a serious overhead.
if((tmp = realloc(fstr,slen+blen+1)) == NULL)

You don't have to go to the well quite this often. You can keep
a max, and only realloc when the max is about to be exceeded.
Whenever you do this, multiply the not-enough-storage value by
some constant (some people double, others use 1.1 or 1.5 or
whatever) to decide how much to allocate next time.

Consider adding a way to stop reading a file larger than the
largest size the user is prepared to allocate RAM for.
{
free(fstr);
exit(EXIT_FAILURE);
}
if(slen == 0) *tmp = '\0';
fstr = tmp;
strcat(fstr,buffer);

It's getting worse. strcat has to find the end of the string, which
is O(n). Put it into a loop, and you get O(n*n). This will seriously
impact performance for large files. It's not hard to keep a
pointer to the next place to write.
 

Eric Sosman

Michael said:
"Probably" inefficient in that I cannot rely on getc() being
implemented as a macro and that I do not want to make assumptions
about the underlying library. So, essentially, the question is
for me whether having a loop in my code is "better" than just
telling fscanf() to get, say 8K characters in one go.
The main beauty of this approach lies for me in the clarity of the
code. Thanks for reminding me of getc() vs. fgetc().

Considerations of the relative efficiency of library
functions already involve matters you cannot "rely" on; the
Standard has nothing to say about it, and you're forced to
empirical methods.

I can, perhaps, offer a data point. My fgets() replacement
(everybody writes one eventually, it seems) originally used
fgets() itself, on the grounds that it might be implemented
more efficiently "under the covers" than repeated getc(). After
each fgets() I'd check whether the line was too long (no '\n'
in the buffer), and if so I'd expand the buffer and do another
fgets(). All well and good.

Just for curiosity's sake, though, I wrote a second version
that made repeated getc() calls -- and guess what? It was a
little bit faster. Whatever speed advantage fgets() might have
had was lost in the need to search for the end of the line
afterwards. strlen(buff) was a hair faster than strchr(buff,'\n'),
but either way the combined fgets()/strxxx() was slower than a
loop calling getc() and testing each character on the fly.

The "getc() is faster" result was reproducible on four
configurations: SPARC with Sun Studio compiler and Solaris' C
library, SPARC with gcc and Solaris' C library, and on two
different Pentium models with gcc and the DJgpp library.

YMMV, and the problem you're trying to solve is slightly
different from the one I attacked. Still, it's suggestive.
 

Randy Howard

wyrmwif@tango-sierra-oscar- said:
# Cheerio,
#
#
# I would appreciate opinions on the following:
#
# Given the task to read a _complete_ text file into a string:
# What is the "best" way to do it?
# Handling the buffer is not the problem -- the character
# input is a different matter, at least if I want to remain within
# the bounds of the standard library.
#
# Essentially, I can think of three variants:
# - Low: Use fgetc(). Simple, straightforward, probably inefficient.

char *contents=0; int m=0,n=0,ch;
while ((ch=fgetc(file))!=EOF) {
if (n+2>=m) {m = 2*n+2; contents = realloc(contents,m);}
contents[n++] = ch; contents[n] = 0;
}
contents = realloc(contents,n+1);

What happens to contents if this realloc() fails?
 

CBFalconer

jacob said:
What about this?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *ReadFileIntoRam(char *fname,int *plen)
{
FILE *infile;
char *contents;
int actualBytesRead=0;
unsigned int len;

infile = fopen(fname,"rb");
if (infile == NULL) {
fprintf(stderr,"impossible to open %s\n",fname);
return NULL;
}
fseek(infile,0,SEEK_END);
len = ftell(infile);
fseek(infile,0,SEEK_SET);
contents = calloc(len+1,1);
if (contents) {
actualBytesRead = fread(contents,1,len,infile);
}
else {
fprintf(stderr,"Can't allocate memory to read the file\n");
}
fclose(infile);
*plen = actualBytesRead;
return contents;
}

No good. Note that ftell is meaningless for text files. It also
returns a long, not an int. You haven't even tested for failure
(which will occur on input from a keyboard). Even if everything
works, the use of calloc is silly: why zero what you are about to
fill? Instead just add a single '\0' after filling. From N869:

7.19.9.4 The ftell function

Synopsis

[#1]

#include <stdio.h>
long int ftell(FILE *stream);

Description

[#2] The ftell function obtains the current value of the
file position indicator for the stream pointed to by stream.
For a binary stream, the value is the number of characters
from the beginning of the file. For a text stream, its file
position indicator contains unspecified information, usable
by the fseek function for returning the file position
indicator for the stream to its position at the time of the
ftell call; the difference between two such return values is
not necessarily a meaningful measure of the number of
characters written or read.

Returns

[#3] If successful, the ftell function returns the current
value of the file position indicator for the stream. On
failure, the ftell function returns -1L and stores an
implementation-defined positive value in errno.

One way to get a whole file into memory in a useful form is to
buffer it in lines and make a linked list of those lines. An
example in my ggets package does just that. See:

<http://cbfalconer.home.att.net/download/ggets.zip>

Any attempt to pre-allocate a buffer for the whole file is doomed,
because you cannot reliably tell how big that buffer should be.
 

jacob navia

For text files it is the same as
above, but add:


char *p1 = contents, *p2 = contents;
int i = 0;
while (i < actualBytesRead) {
    if (*p1 != '\r') {
        *p2++ = *p1;
    }
    p1++;
    i++;
}
*p2++ = 0;

This is a thousand times more efficient than all those
calls to realloc, or all those calls to fread.

True, you will waste some bytes because you will read
many \r characters that you later erase, allocating a slightly
bigger buffer than needed, but this is not very
important in most applications...


Note: You could make this more robust if you want to keep
isolated \r characters (i.e. \r not followed by \n), in which case
you can add the corresponding tests...
 

cpg

Michael said:
This is what I would do for binary files.
Essentially, I am looking for the text file equivalent of fread().

I'm just curious, why would you do anything different for text/binary
data files? The approach is the same, what you can do afterwards on
the resultant buffer is the only thing that differs. Since a "text"
file is a special case of a "raw binary" file, you only have to code the
common functionality once (buffering in this case).

Would you not simply perform raw reads into a temp buffer accumulating
your overall file buffer until the entire file is read? Apply a filter
afterwards for some sort of sanity checking that this file meets your
requirements (ctype.h), then continue on.

Obviously, if the file checks out as "text", then things like lines
make sense. I would create "text" functions to operate on these buffers
to fit your needs. Later on you may find a need to write some binary
equivalents to do other tasks (a raw strstr() equivalent becomes
particularly useful for searching binary data), and the buffering part
is already done.

Also, it's probably more useful to define a structure that abstracts
these "buffers". That way, you can add functionality without breaking
the interface.

Have fun, cpg
 

jacob navia

CBFalconer said:

Please Chuck, it was a program written in a few minutes!

Note that ftell is meaningless for text files.

That's why I opened in binary mode


It also
returns a long, not an int.

OK

You haven't even tested for failure
(which it will on input from a keyboard).
The function receives a file name, Chuck. There is NO
keyboard input...



Even if everything works
use of calloc is silly, why zero what you are about to fill.

No. This dispenses with the zeroing of the last byte,
maybe inefficient, but it is a habit...

Any attempt to pre-allocate a buffer for the whole file is doomed,
because you cannot reliably tell how big that buffer should be.

If you open it in binary mode yes, you can...
 

CBFalconer

infobahn said:
.... snip ...

You don't have to go to the well quite this often. You can keep
a max, and only realloc when the max is about to be exceeded.
Whenever you do this, multiply the not-enough-storage value by
some constant (some people double, others use 1.1 or 1.5 or
whatever) to decide how much to allocate next time.

A certain Richard Heathfield has made available a routine for this
approach, found in fgetline at:

<http://users.powernet.co.uk/eton/c/fgetdata.html>

while I prefer using my own ggets/fggets, which doesn't keep a
history (thus having a much simpler calling sequence), and which
can be found at:

<http://cbfalconer.home.att.net/download/ggets.zip>
 

Michael Mair

cpg said:
I'm just curious, why would you do anything different for text/binary
data files? The approach is the same, what you can do afterwards on
the resultant buffer is the only thing that differs. Since a "text"
file is a special case of a "raw binary" file, you only have to code the
common functionality once (buffering in this case).

Would you not simply perform raw reads into a temp buffer accumulating
your overall file buffer until the entire file is read? Apply a filter
afterwards for some sort of sanity checking that this file meets your
requirements (ctype.h), then continue on.

The thing is that I do not want to make _any_ assumptions, such as
there being a one-to-one correspondence for certain byte ranges -- the
standard does not guarantee that and even mentions that "Characters
may have to be added, altered, or deleted ..."
Moreover, if I want to move on to wide characters/multibyte characters,
then I certainly will stick to the narrow path and not try to find
convenient shortcuts.
So, I will treat reading in a text file in a different manner than
reading in a binary file if necessary. It is quite possible that
fread() on a text stream will do what I want; then I will use it.
I have no interest in sanity checks which work with the C locale
but not every other locale as well.
If there was a standard way to read in a binary file and then convert
the resulting buffer into the "text" equivalent, then I would use this
approach.

Obviously, if the file checks out as "text", then things like lines
make sense. I would create "text" functions to operate on these buffers
to fit your needs. Later on you may find a need to write some binary
equivalents to do other tasks (a raw strstr() equivalent becomes
particularly useful for searching binary data), and the buffering part
is already done.

That is true in general but here I have a special requirement where
I am certain that I will deal only with text files and the only possible
extension is going for multibyte/wide characters. However, this will not
be any problem as I essentially will only have to create wide char
versions of my functions and get a "w" or "wc" into the called library
functions.
The only thing left is a "good" way to get a complete text file into
a buffer. The organisation in lines does not play any role at all, so
the question is using a getc loop vs. using something to obtain large
chunks of characters from text files.

Also, it's probably more useful to define a structure that abstracts
these "buffers". That way, you can add functionality without breaking
the interface.

True but in this case only overhead.

Have fun, cpg

Thanks :)


-Michael
 

Eric Sosman

cpg said:
I'm just curious, why would you do anything different for text/binary
data files? The approach is the same, what you can do afterwards on
the resultant buffer is the only thing that differs. Since a "text"
file is a special case of a "raw binary" file, you only have to code the
common functionality once (buffering in this case).

Would you not simply perform raw reads into a temp buffer accumulating
your overall file buffer until the entire file is read? Apply a filter
afterwards for some sort of sanity checking that this file meets your
requirements (ctype.h), then continue on.

There's the rub: What should the "filter" do? On
one system I've used, for example, if you were to write
the line "Hello, world!\n" to a text file and then read
it back in binary, here are the bytes you would get:

\015 \000 H e l l o , w o r l d ! \000

Notice that the '\n' you wrote has vanished and that three
new bytes have appeared out of thin air. The system in
question knows how to translate this sequence of bytes back
to "Hello, world!\n" -- but do *you* know how?

By the way, the above illustrates the system's "usual"
way of storing text in a file. The system actually provides
six additional text formats, some of which permit variations.
How many "filters" are you prepared to write, simply to avoid
using what the C library already provides?

(A hint for the curious: The company that bought the
company that bought the company that made this system recently
fired its CEO.)
 
