Reading whole text files

Chris Torek

The [file-reading] function receives a file name Chuck. There is NO
keyboard input...

What if the file name is "CON:" or "CON" or "/dev/tty" or "/tyCo/0"
or whatever external file name is used to represent "keyboard input"
on that system?

The way to read the whole text file is to read the whole text file. :)

int read_whole_text_file(const char *fname, char **memp, size_t *sizep) {
    FILE *fp;                   /* the open file */
    char *mem, *new;            /* memory regions (current and new) */
    size_t memsize, newsize;    /* sizes of regions (current & new) */
    size_t tot;                 /* total bytes read so far */
    size_t rdattempt, rdresult; /* argument & result for fread */

    *memp = NULL;               /* optional */
    *sizep = 0;                 /* optional */

    fp = fopen(fname, "r");
    if (fp == NULL)
        return UNABLE_TO_OPEN;
    memsize = INITIAL_BLOCK_SIZE;
    mem = malloc(memsize);
    if (mem == NULL) {
        fclose(fp);
        return UNABLE_TO_GET_MEM;
    }
    tot = 0;

    /* loop, reading what we can, until we get less than we ask for */
    for (;;) {
        rdattempt = memsize - tot;
        rdresult = fread(mem + tot, 1, rdattempt, fp);
        tot += rdresult;        /* count the final, short read too */
        if (rdresult < rdattempt)
            break;
        newsize = memsize * 2;  /* use whatever strategy you like */
        new = realloc(mem, newsize);
        if (new == NULL) {
            /*
             * Here, I choose to discard the data read so far.
             * You have other options, including returning the
             * partial result, or allocating a smaller incremental
             * amount of memory.
             */
            free(mem);
            fclose(fp);
            return UNABLE_TO_GET_MEM;
        }
        mem = new;
        memsize = newsize;
    }

    /* we reach this line only when fread() stopped due to EOF or error */
    /* if (ferror(fp)) ... -- optional, handle read-error */
    fclose(fp);

    /* optional (but required if adding '\0') */
    /* realloc(mem, 0) is not portable, so skip the shrink when tot is 0 */
    new = tot ? realloc(mem, tot) : NULL; /* or tot+1 if you want to add a '\0' */
    if (new == NULL) {
        /* since I'm not adding the '\0', can just use existing mem */
    } else {
        mem = new;
        /* mem[tot] = '\0'; -- to add '\0' */
    }

    /* set return-value parameters */
    *memp = mem;
    *sizep = tot;

    return SUCCEEDED;
}

(The code above is completely untested. Note that if you want to
add a '\0', you can subtract 1 from "rdattempt", and still skip
the final realloc() or allow it to fail, as long as INITIAL_BLOCK_SIZE
and the newsize computation allow forward progress with this
subtraction. Of course, you also have to define the initial block
size and the three return values -- one success, two error codes.)
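
The '\0'-terminated variant described in the note above could look roughly like the following. This is only a sketch: the name slurp_stream, the FILE * interface and the initial size of 64 are assumptions made for illustration, not part of the original post.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything remaining on fp into a
 * malloc'd, '\0'-terminated buffer.  Returns NULL on allocation
 * failure or read error; the caller frees the result. */
char *slurp_stream(FILE *fp, size_t *lenp)
{
    size_t memsize = 64;        /* arbitrary initial size, must be >= 2 */
    size_t tot = 0;             /* bytes read so far */
    char *mem = malloc(memsize);

    if (mem == NULL)
        return NULL;
    for (;;) {
        /* ask for one byte less than the free space, so that
         * mem[tot] always exists for the terminator */
        size_t rdattempt = memsize - tot - 1;
        size_t rdresult = fread(mem + tot, 1, rdattempt, fp);
        char *bigger;

        tot += rdresult;
        if (rdresult < rdattempt)
            break;              /* EOF or error */
        if (memsize > (size_t)-1 / 2) {
            free(mem);          /* doubling would overflow size_t */
            return NULL;
        }
        bigger = realloc(mem, memsize * 2);
        if (bigger == NULL) {
            free(mem);
            return NULL;
        }
        mem = bigger;
        memsize *= 2;
    }
    if (ferror(fp)) {
        free(mem);
        return NULL;
    }
    mem[tot] = '\0';
    if (lenp != NULL)
        *lenp = tot;
    return mem;
}
```

Because the terminator slot is reserved on every pass, the final shrinking realloc() can be skipped entirely, at the cost of some slack at the end of the buffer.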
 
Flash Gordon

jacob said:
CBFalconer wrote:



Please Chuck, it was a program written in a few minutes!


That's why I opened in binary mode

See below
The function receives a file name Chuck. There is NO
keyboard input...

And if on a windows system that file name is COM1: ? Or on any system
that provides a file name for a user input device?
No. This dispenses with the zeroing of the last byte,
maybe inefficient but it is an habit...

It's a terrible habit which in this case leads to incredibly inefficient
code.
If you open it in binary mode yes, you can...

Then you have to do system specific things to convert it to text since
the new line might be represented by *anything*.
 
Keith Thompson

jacob navia said:
For text files it is the same as
above, but add:


char *p1 = contents, *p2 = contents;
int i = 0;
while (i < actualBytesRead) {
if (*p1 != '\r') {
*p2++ = *p1;
}
p1++;
i++;
}
*p2++ = 0;

He said he's reading a text file, not necessarily a DOS-format text
file. There's no reason to assume that there's anything special about
the '\r' character.
 
SM Ryan

# blen = strlen(buffer);
# if((tmp = realloc(fstr,slen+blen+1)) == NULL)

If you increase the buffer size by a constant factor>1, the time and space complexity
are linear in the size of input. If you increase by a constant increment the time
complexity is quadratic.
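
To make the difference concrete, here is a small sketch that counts the bytes copied in the worst case, assuming (pessimistically) that every realloc() has to move the whole buffer. The function names and the starting size are invented for the illustration.

```c
/* Worst-case bytes copied while growing a buffer until it can hold
 * n bytes, assuming every reallocation moves the existing data. */

/* Growth by a constant factor of 2, starting at `start` bytes. */
unsigned long long copied_doubling(unsigned long long n,
                                   unsigned long long start)
{
    unsigned long long cap = start, copied = 0;

    while (cap < n) {
        copied += cap;          /* old contents move to the new block */
        cap *= 2;
    }
    return copied;
}

/* Growth by a constant increment of `step` bytes. */
unsigned long long copied_fixed_step(unsigned long long n,
                                     unsigned long long step)
{
    unsigned long long cap = step, copied = 0;

    while (cap < n) {
        copied += cap;
        cap += step;
    }
    return copied;
}
```

For a 1 MiB input and a 1 KiB starting size, the doubling version copies just under 1 MiB in total, while the fixed-step version copies several hundred MiB.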

If you're concerned about overshooting the file size and exhausting memory, you can
instead calculate the file size (from fseek or system specific calls like stat() or
by reading through the file once without storing and then rewinding and reading again),
and then allocating one block once.
 
SM Ryan

# What happens to contents if this realloc() fails?

Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
the largest files I use, it doesn't happen.

If you want to contract me with pay to adapt the code to your system, I'll happily
include whatever warnings and work arounds you desire that are possible.
 
Randy Howard

wyrmwif@tango-sierra-oscar- said:
# What happens to contents if this realloc() fails?

Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
the largest files I use, it doesn't happen.

If you want to contract me with pay to adapt the code to your system,

With that attitude about error handling, don't hold your breath.
 
Mac

On Fri, 11 Feb 2005 01:36:49 +0000, SM Ryan wrote:
[Randy Howard wrote]
# What happens to contents if this realloc() fails?

Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
the largest files I use, it doesn't happen.

If you want to contract me with pay to adapt the code to your system, I'll happily
include whatever warnings and work arounds you desire that are possible.

This response totally misses the point.

If you want to say that you didn't bother because you were just trying to
sketch out a quick idea, that is fine. But most people in this newsgroup
try to post decent code, and if you review past posts, you will see that
it is considered very bad form to call realloc() without checking to see
if it succeeds. Newbies are consistently admonished not to do that.

If nothing else, you are setting a bad example for the newbies.

I humbly submit that you should readjust your attitude.

--Mac
 
Barry Schwarz

Cheerio,


I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

Why not consider fread?


<<Remove the del for email>>
 
infobahn

SM said:
# What happens to contents if this realloc() fails?

Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
the largest files I use, it doesn't happen.

Then assert that it doesn't happen, using this macro:

#define ASSERT(cond, msg) \
    do { if (!(cond)) { fprintf(stderr, "%s\n", msg); abort(); } } while (0)

like this:

ASSERT(contents != NULL,
"I don't understand it! This CAN'T happen!"
" My home phone number is...");

(insert your home phone number in the appropriate place)

If you want to contract me with pay to adapt the code to your system, I'll happily
include whatever warnings and work arounds you desire that are possible.

I wouldn't pay anyone who used realloc like THAT, except perhaps to
clean the toilets.
 
Jack Klein

Cheerio,


I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.


If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :)


Regards,
Michael

If you want to read a whole text file into a single string, I'd use
fread(), which is not restricted to just binary files, you know. If
you use it on files opened in text mode, the same translations, if
any, are performed the same as if you used fscanf() or fgets().

Start with an fread() of the initial size of your allocated
destination buffer. If the return value is equal to the buffer size,
you have to grow your buffer by a fixed size or some percentage. Do
another fread() and check the result.

Continue until the return value is less than the requested number of
characters. Then you can check to see whether feof() or ferror() was
the cause.

If you really want a string, add a '\0' after the last character read.

In the calls to fread(), use 1 as the second parameter (size of each
element) and the number of bytes to read as the third (number of
elements). That way, the return value is exactly the number of bytes
read.

The only potential problem is if the text file contains '\0'
characters. The C standard does not guarantee much about such files
no matter how you try to read them (see 7.19.2 P2), so if your input
files look like that, you'll have to deal with that yourself no matter
how you read them.
 
CBFalconer

jacob said:
CBFalconer wrote:
.... snip ...


If you open it in binary mode yes, you can...

Even if you forbid keyboard input, what if the file is on tape, or
coming from a serial line, etc. There is no requirement for ftell
to work. That's why it returns an error signal and may store
something in errno.
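
A sketch of what checking for that failure might look like, for anyone who still wants to try the size-first approach. The name stream_size is hypothetical. Note also that the standard does not require a binary stream to support SEEK_END meaningfully, and that for a text stream the value from ftell() need not be a character count, so treat the result as a hint at best.

```c
#include <stdio.h>

/* Hypothetical helper: try to find the end offset of fp, restoring
 * the current position afterwards.  Returns -1L whenever the stream
 * cannot report or change its position (pipe, terminal, tape...). */
long stream_size(FILE *fp)
{
    long here, end;

    here = ftell(fp);
    if (here == -1L)
        return -1L;             /* position unavailable */
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;             /* cannot seek to the end */
    end = ftell(fp);
    if (fseek(fp, here, SEEK_SET) != 0)
        return -1L;             /* could not go back */
    return end;                 /* may itself be -1L */
}
```

A caller that gets -1L back has to fall back to the grow-as-you-read loop anyway, which is one argument for using that loop in the first place.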
 
Michael Mair

Jack said:
If you want to read a whole text file into a single string, I'd use
fread(), which is not restricted to just binary files, you know. If
you use it on files opened in text mode, the same translations, if
any, are performed the same as if you used fscanf() or fgets().

Thank you very much, Jack!
As mentioned in my reply to S.Tobias, I somehow was under the (wrong)
impression that fread() only works for binary files. As I wanted to get
this right, I now was just waiting for someone who knows for sure to
tell me that it really works and I am not just jumping to a wrong
conclusion as the description of fread() does not indicate otherwise.
Start with an fread() of the initial size of your allocated
destination buffer. If the return value is equal to the buffer size,
you have to grow your buffer by a fixed size or some percentage. Do
another fread() and check the result.

Continue until the return value is less than the requested number of
characters. Then you can check to see whether feof() or ferror() was
the cause.

If you really want a string, add a '\0' after the last character read.

In the calls to fread(), use 1 as the second parameter (size of each
element) and the number of bytes to read as the third (number of
elements). That way, the return value is exactly the number of bytes
read.

The only potential problem is if the text file contains '\0'
characters. The C standard does not guarantee much about such files
no matter how you try to read them (see 7.19.2 P2), so if your input
files look like that, you'll have to deal with that yourself no matter
how you read them.

Once again: Thank you very much for your detailed reply!
I was aware of most of it (but for the last) but with this reply I could
have started working safely even if I had not been. I really appreciate
that :)


Cheers
Michael
 
Michael Mair

Michael said:
Cheerio,


I would appreciate opinions on the following:

Given the task to read a _complete_ text file into a string:
What is the "best" way to do it?
Handling the buffer is not the problem -- the character
input is a different matter, at least if I want to remain within
the bounds of the standard library.

Essentially, I can think of three variants:
- Low: Use fgetc(). Simple, straightforward, probably inefficient.
- Default: Use fgets(); ugly, if we are not interested in lines
and have many newline characters to read.
- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
XSTR(BUFLEN) gives me BUFLEN in a string literal.

From the labels, it is pretty obvious that I would favour the
last one, so there is the question about possible pitfalls
(yes, I will use the return value and "read") and whether there
are environmental limits for BUFLEN.


If I missed some obvious source (looking for the wrong sort of
stuff in the FAQ and google archives), then please point me
toward it :)

Essentially I was looking for a "text file replacement" of fread()
because I had the wrong impression that fread() were only for binary
input. Jack Klein's reply (<[email protected]>)
but also S.Tobias's question in this direction made clear that this
was a misconception.
So the solution clearly is using fread(). If you are interested in
details, then read Jack's message -- it describes the complete usage
in a safe way.

Thank you very much to everyone for their input!


Cheers
Michael
 
SM Ryan

# In article <[email protected]>, wyrmwif@tango-sierra-oscar-
# foxtrot-tango.fake.org says...
# > # What happens to contents if this realloc() fails?
# >
# > Subsequent code gets a SIGBUS. Since I work on systems with more virtual memory than
# > the largest files I use, it doesn't happen.
# >
# > If you want to contract me with pay to adapt the code to your system,
#
# With that attitude about error handling, don't hold your breath.

The program forks; if it gets the signal, it does a traceback dump saved to a
database telling what line it failed at and local variables. The parent sees
the child exit, reports it, forks and continues. I've done daemons that run
months at a time, restarting and recovering when needed.

If 2GB of VM isn't enough, the process probably has enough other problems
that pretending you can recover within the process is a fool's errand. Better
to let the process die noisily, save enough information to figure out what went
wrong, and then restart.

Memory exhaustion is rarely a problem on machines with virtual memory; when
it does happen the real problem is almost always a stuck loop or recursion. Dinging
random memory is a more frequent problem for all programmers. And when it
happens for most programmers, they're stuck trying to figure out how to recreate
it under a debugger and usually can't, leave a random error that persists for
months or years and no way to diagnose it.

Don't lecture people about error handling until you can guarantee you capture
the error state of every one of your program failures, even 'production' versions
with all discretionary error checking turned off.
 
SM Ryan

# I humbly submit that you should readjust your attitude.

Probably because most people don't want to have to plow through all
_ and other macros I use to thread the stack for traceback dumps.
 
Eric Sosman

jacob said:
If you open it in binary mode yes, you can...

No, you cannot. There is no necessary connection
between the number of characters you can read from a
file via a binary stream and the number you can read
from it via a text stream. The "binary count" can be
greater than, equal to, or less than the "text count."

Specific example: OpenVMS. One of its file formats
"decorates" each line stored in the file by attaching
counts of the number of empty lines to skip before or
after the line itself (I've always assumed this was
for the benefit of the COBOL implementation). Each
such count byte can thus become as many as 255 newline
characters when read by a C text stream, making the "text
count" larger than the "binary count."
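
One way to see whether the two counts differ on a given system is simply to count both ways. A minimal sketch, with count_chars being a name made up for this illustration:

```c
#include <stdio.h>

/* Hypothetical helper: count the characters that a stream opened with
 * the given fopen() mode delivers.  Returns -1 if the file cannot be
 * opened. */
long count_chars(const char *fname, const char *mode)
{
    FILE *fp = fopen(fname, mode);
    long n = 0;

    if (fp == NULL)
        return -1;
    while (getc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}
```

Comparing count_chars(name, "r") with count_chars(name, "rb") on the same file shows how far apart the text and binary views are; on many Unix-like systems the two will agree, which is exactly why code that relies on it tends to go unnoticed until it is ported.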
 
websnarf

SM said:
# blen = strlen(buffer);
# if((tmp = realloc(fstr,slen+blen+1)) == NULL)

If you increase the buffer size by a constant factor>1, the time
and space complexity are linear in the size of input. If you
increase by a constant increment the time complexity is
quadratic.

Correct! (Glad someone else here has figured this out.) But actually
the issue is not limited to just performance -- one can easily *shred*
your heap by doing this. You can actually lose access to some of your
heap memory by sufficiently leaning on it, as such a scheme is likely
to do (I've seen this happen with a deployed System V-like heap).
If you're concerned about overshooting the file size and exhausting
memory, you can instead calculate the file size (from fseek or
system specific calls like stat() or by reading through the file
once without storing and then rewinding and reading again), and
then allocating one block once.

Yeah, except that these functions are useless for systems with 32bit
ints that allow for file lengths to be > 4GB (i.e., the vast majority
of desktop and workstation systems in existence today.) I mean they
didn't even make them use size_t's -- I mean that's incompetence to the
extreme.

The simplest strategy which works well and is portable is to use some
exponentially growing sequence of reallocs -- then you are guaranteed
to make at most O(ln(sizeof(int))) calls to the heap.
 
websnarf

SM said:
# Cheerio,
#
# I would appreciate opinions on the following:
#
# Given the task to read a _complete_ text file into a string:
# What is the "best" way to do it?
# Handling the buffer is not the problem -- the character
# input is a different matter, at least if I want to remain within
# the bounds of the standard library.
#
# Essentially, I can think of three variants:
# - Low: Use fgetc(). Simple, straightforward, probably inefficient.

char *contents=0; int m=0,n=0,ch;
while ((ch=fgetc(file))!=EOF) {
if (n+2>=m) {m = 2*n+2; contents = realloc(contents,m);}
contents[n++] = ch; contents[n] = 0;
}
contents = realloc(contents,n+1);

That's a little too condensed, and it's not surprising that people
jumped all over you about error handling. The idea, of course, is
perfectly correct, however. Let's make things a little clearer:

struct tagbstring {
    int mlen, slen;
    char * data;
} c = {0, 0, NULL};
int ch;

while ((ch = fgetc (file)) != EOF) {
    if (c.slen + 1 >= c.mlen) { /* keep one byte spare for the '\0' */
        char * data;
        c.mlen = (c.mlen <= 0) ? 2 : 2*c.mlen;
        data = (c.data) ? realloc (c.data, c.mlen) : malloc (c.mlen);
        if (!data) {
            free (c.data);
            c.data = NULL;
            break;
        }
        c.data = data;
    }
    c.data[c.slen] = ch;
    c.slen ++;
}
if (c.data) {
    c.data[c.slen] = '\0'; /* slen does not count the terminator */
}

Now the value c.data has a pointer to a '\0' terminated string with
the desired contents, or else it's NULL (because we ran out of memory
or the file was empty.)
We could do more with ferror(), but I'll leave that as an exercise to
the reader.
 
CBFalconer

Correct! (Glad someone else here has figured this out.) But
actually the issue is not limited to just performance -- one can
easily *shred* your heap by doing this. You can actually lose
access to some of your heap memory by sufficiently leaning on it,
as such a scheme is likely to do (I've seen this happen with a
deployed System V-like heap).

The reason being that the sum of the (possibly) freed chunks is not
enough to allocate a new large chunk. If other calls to malloc
have been interspersed the situation is likely to be even worse.
It is rooted in the fact that (1 + 2 + 4 + 8 .... + N) < 2N.
.... snip ...

The simplest strategy which works well and is portable is to use
some exponentially growing sequence of reallocs -- then you are
guaranteed to make at most O(ln(sizeof(int))) calls to the heap.

Bearing in mind the above gotcha. Wonder if we can beat it by
allocating, in alternation, 1.5x and 2x the previous allocation?
Too lazy to work it out for now.
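
For anyone who wants to experiment, here is a rough simulation sketch. It uses a deliberately idealized model, which is an assumption and not how any particular allocator behaves: all previously freed blocks are treated as one contiguous hole, the live block stays allocated while its successor is requested, and nothing else touches the heap. The function name and parameters are invented for the illustration.

```c
/* Hypothetical model: grow a block by num/den on each step and report
 * the first growth step at which the total of all previously freed
 * blocks is at least as large as the new request.  Returns -1 if that
 * never happens within max_steps. */
int first_reusable_growth(unsigned long long num, unsigned long long den,
                          unsigned long long start, int max_steps)
{
    unsigned long long current = start; /* the live block */
    unsigned long long freed = 0;       /* total size already released */
    int step;

    for (step = 1; step <= max_steps; step++) {
        unsigned long long next = current * num / den;

        if (freed >= next)
            return step;        /* the hole could hold the new block */
        freed += current;       /* old block is released after the copy */
        current = next;
    }
    return -1;
}
```

Under this model, doubling from 16 bytes never finds a fit within 40 steps, in line with the sum given above, whereas a factor of 1.5 finds one at the fifth growth step. An alternating 1.5x/2x schedule would need the growth factor switched inside the loop, which is left to the reader.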
 
