Reading text file contents to a character buffer


Navaneeth

Hello,

I am trying to load a text file into an in-memory character buffer. I
wrote the following code.

static char *readcontent(const char *filename)
{
    char *fcontent = NULL, *tmp = NULL;
    size_t result, elements_read = 0;
    int pagenum = 1;

    FILE *fp;
    fp = fopen(filename, "r");

    if(fp) {
        while(1) {
            tmp = (char*) realloc(fcontent, PAGE_SIZE * pagenum);
            if(tmp) {
                fcontent = tmp;
            }
            else {
                free(fcontent);
                fcontent = NULL;
                printf("failed to reallocate memory\n");
                break;
            }

            result = fread(fcontent + elements_read, 1, PAGE_SIZE, fp);

            if(result != PAGE_SIZE) {
                if(feof(fp)) {
                    break;
                }
                else {
                    printf("error reading file");
                    break;
                }
            }
            elements_read += result;
            ++pagenum;
        }
        fclose(fp);
    }
    return fcontent;
}

This is working fine. Since I am new to C, I am wondering if this is the
correct approach. Any suggestions to improve the code are
welcome. This code is expected to compile at least on GCC and MSVC.

Any thoughts?

Thanks
 

Ben Bacarisse

Navaneeth said:
I am trying to load a text file into an in-memory character buffer. I
wrote the following code.

static char *readcontent(const char *filename)
{

From a purely functional point of view, I'd say that you should have
some way to return how long the file was to the caller. It may not
matter in your case but (a) you have the information so it's 'free' and
(b) it makes the function more useful. A good way to do this is to
supply a size_t * argument which, if non-NULL, gets set to the length of
the data written. Permitting a null pointer means that callers who
don't care about the length don't have to allocate a dummy variable just
for the purpose.

If you don't do this (and quite possibly even if you do) you should at least
null-terminate the result. As it stands, the caller can't use the
returned data without running off the end of the array.
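
Just as a sketch of the interface I mean (reusing the PAGE_SIZE macro and
the usual <stdio.h>/<stdlib.h> from the original post; a real version would
also want to distinguish a read error from plain EOF):

static char *readcontent(const char *filename, size_t *length)
{
    FILE *fp = fopen(filename, "r");
    char *buf = NULL, *tmp;
    size_t used = 0, cap = 0, n;

    if (!fp)
        return NULL;

    do {
        tmp = realloc(buf, cap + PAGE_SIZE + 1);   /* +1 leaves room for '\0' */
        if (!tmp) {
            free(buf);
            fclose(fp);
            return NULL;
        }
        buf = tmp;
        cap += PAGE_SIZE;
        n = fread(buf + used, 1, PAGE_SIZE, fp);
        used += n;
    } while (n == PAGE_SIZE);

    fclose(fp);
    buf[used] = '\0';          /* callers can treat the result as a string */
    if (length)
        *length = used;        /* report the length only if the caller asked */
    return buf;
}
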
char *fcontent = NULL, *tmp = NULL;
size_t result, elements_read = 0;
int pagenum = 1;

FILE *fp;
fp = fopen(filename, "r");

if(fp) {
while(1) {
tmp = (char*) realloc(fcontent, PAGE_SIZE * pagenum);

I would not cast the return from realloc.

A common trick is to increase the storage used by some factor (> 1)
rather than adding a fixed amount. This has the advantage of
reducing the number of realloc calls at the expense of a little more
data.
if(tmp) {
fcontent = tmp;
}
else {
free(fcontent);
fcontent = NULL;
printf("failed to reallocate memory\n");

I would not print an error message to stdout. In fact, I would not
print an error message at all! A utility function like this should be usable
by programs that might want to report errors in all sorts of other
ways. In general, it is best to put error reporting as "high up" in
your program as is practical.
break;
}

result = fread(fcontent + elements_read, 1, PAGE_SIZE,
fp);

if(result != PAGE_SIZE) {
if(feof(fp)) {
break;
}
else {
printf("error reading file");
break;
}

I'd put the break here.
}
elements_read += result;
++pagenum;
}
fclose(fp);
}
return fcontent;
}

<snip>
 

BGB / cr88192

Ben Bacarisse said:
From a purely functional point of view, I'd say that you should have
some way to return how long the file was to the caller. It may not
matter in your case but (a) you have the information so it's 'free' and
(b) it makes the function more useful. A good way to do this is to
supply a size_t * argument which, if non-NULL, gets set to the length of
the data written. Permitting a null pointer means that callers who
don't care about the length don't have to allocate a dummy variable just
for the purpose.

If you don't do this (and quite possibly even if you do) you should at least
null-terminate the result. As it stands, the caller can't use the
returned data without running off the end of the array.

yeah.



I would not cast the return from realloc.

I disagree. There is no harm done by the cast, and (among several other
practices commonly disliked here) it also aids in writing code which works
in both C and C++ compilers.

A common trick is to increase the storage used by some factor (> 1)
rather than adding a fixed amount. This has the advantage of
reducing the number of realloc calls at the expense of a little more
data.

yeah, the 1.5^x curve works well in my case.
every time one needs more memory, scale the prior size by 1.5...
average case overhead is 25%.

scaling by a fixed size leads to more realloc, which means more time, and
greater memory fragmentation.
fragmentation may well be a greater enemy than the raw amount of memory
used.

I would not print an error message to stdout. In fact, I would not
print an error message at all! A utility function like this should be usable
by programs that might want to report errors in all sorts of other
ways. In general, it is best to put error reporting as "high up" in
your program as is practical.

yeah, status code and returning NULL being a common way to do this.

less common ways include callbacks, event queuing, exceptions, longjmp, ...
but each has drawbacks...

I list exceptions here, but these are typically OS/compiler/framework
specific (and not part of standard C), which is their main drawback.

I'd put the break here.

this function being a great example of CLC pedantics...


personally, I would just be like "to hell with it" and just use fseek/ftell,
since there is almost no way that a text file is going to be larger than 2GB
and expected to just be read into memory like this (it wouldn't fit on 32
bit systems anyways...).

typically, on a 32-bit system, even 500MB malloc's are prone to fail, since
often some random DLL will end up mapped right into the middle of an
otherwise clean 1GB span of memory, of the theoretically 2GB or 3GB or so
available to the app. (for those few apps which try to allocate memory in
such large chunks...).

but, memory usage patterns is its own complicated set of issues.


however, given that the average size of most "sane" text files is measurable
in kB, with a few larger ones being a few MB, usually there is little risk
of size overflows just using the fseek/ftell strategy...

(although, yes, if the file *is* larger than 2GB or so, then the size may
come back negative or otherwise completely wrong...).
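
For reference, the fseek/ftell version is roughly the following (only a
sketch, with a made-up name; it opens the file in binary mode and assumes
the size fits in a long):

static char *slurp(const char *filename, size_t *length)
{
    FILE *fp = fopen(filename, "rb");   /* binary mode, so ftell matches what fread delivers */
    char *buf = NULL;
    long size;

    if (!fp)
        return NULL;

    if (fseek(fp, 0, SEEK_END) == 0
        && (size = ftell(fp)) >= 0
        && fseek(fp, 0, SEEK_SET) == 0
        && (buf = malloc((size_t)size + 1)) != NULL)
    {
        size_t got = fread(buf, 1, (size_t)size, fp);
        buf[got] = '\0';
        if (length)
            *length = got;
    }
    fclose(fp);
    return buf;
}
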


but, oh well, whatever...
 

ImpalerCore

this function being a great example of CLC pedantics...

personally, I would just be like "to hell with it" and just use fseek/ftell,
since there is almost no way that a text file is going to be larger than 2GB
and expected to just be read into memory like this (it wouldn't fit on 32
bit systems anyways...).

Unless you have a data file that's over 2GB and you convert it to an
ASCII representation for 'grep-ability' and now it's ballooned to over
9GB. Not saying that it's a good thing, but I have done it before
when I've been in a pinch.
typically, on a 32-bit system, even 500MB malloc's are prone to fail, since
often some random DLL will end up mapped right into the middle of an
otherwise clean 1GB span of memory, of the theoretically 2GB or 3GB or so
available to the app. (for those few apps which try to allocate memory in
such large chunks...).

I experience similar issues with 500MB mallocs failing, and I've not
been able to determine why. I don't allocate memory that large, but I
have some test programs that try to blow up memory and noticed the max
was around 500MB.
but, memory usage patterns is its own complicated set of issues.

however, given that the average size of most "sane" text files is measurable
in kB, with a few larger ones being a few MB, usually there is little risk
of size overflows just using the fseek/ftell strategy...

(although, yes, if the file *is* larger than 2GB or so, then the size may
come back negative or otherwise completely wrong...).

but, oh well, whatever...

As soon as you say getting the size of a file, afaik it's outside the
scope of the standard. I know in my environment I can't use ftell to
get a 2GB+ NTFS file size since sizeof( long int ) is 4, and thus I
have to resort to using a system specific method using uint64_t and
GetFileAttributesExA. Unfortunately there is no completely portable
method to get a file size. The best thing in my opinion is to make a
wrapper function that encapsulates the non-portable behavior so that
the non-portability is centralized in one spot if at all possible.
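
Something along these lines, perhaps (the wrapper name is made up; the
Win32 branch uses GetFileAttributesExA, and the other branch just falls
back on fseek/ftell with its usual limits):

#ifdef _WIN32
#include <windows.h>
#endif
#include <stdint.h>
#include <stdio.h>

/* returns 0 on success, -1 on failure; the non-portable part lives here */
int get_file_size(const char *filename, uint64_t *size)
{
#ifdef _WIN32
    WIN32_FILE_ATTRIBUTE_DATA fad;
    if (!GetFileAttributesExA(filename, GetFileExInfoStandard, &fad))
        return -1;
    *size = ((uint64_t)fad.nFileSizeHigh << 32) | fad.nFileSizeLow;
    return 0;
#else
    FILE *fp = fopen(filename, "rb");
    long pos = -1;
    if (fp) {
        if (fseek(fp, 0, SEEK_END) == 0)
            pos = ftell(fp);
        fclose(fp);
    }
    if (pos < 0)
        return -1;
    *size = (uint64_t)pos;
    return 0;
#endif
}
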

Of course, I wouldn't fault someone for using the ftell/fseek combo if
that is all that is needed to handle his range of input file sizes as
long as that person doesn't claim it's portable.

Best regards,
John D.
 

BGB / cr88192

this function being a great example of CLC pedantics...

personally, I would just be like "to hell with it" and just use
fseek/ftell,
since there is almost no way that a text file is going to be larger than
2GB
and expected to just be read into memory like this (it wouldn't fit on 32
bit systems anyways...).

<--
Unless you have a data file that's over 2GB and you convert it to an
ASCII representation for 'grep-ability' and now it's ballooned to over
9GB. Not saying that it's a good thing, but I have done it before
when I've been in a pinch.
-->

note: by "and" I meant T&T=T, not T|T=T.

but, otherwise, potentially...
this depends on the app then.

I suspect this is likely to be a very rare case though, where most text one
deals with is typically in the kB or low MB range...

typically, on a 32-bit system, even 500MB malloc's are prone to fail,
since
often some random DLL will end up mapped right into the middle of an
otherwise clean 1GB span of memory, of the theoretically 2GB or 3GB or so
available to the app. (for those few apps which try to allocate memory in
such large chunks...).

<--
I experience similar issues with 500MB mallocs failing, and I've not
been able to determine why. I don't allocate memory that large, but I
have some test programs that try to blow up memory and noticed the max
was around 500MB.
-->

this is due mostly to address-space fragmentation...

as the size gets larger, the probability of there being an unbroken run of a
given size becomes increasingly smaller (although, the specifics depend on
the exact memory layout and allocation behaviors).

but, 500MB seems to be a good break-off point, as:
it represents, typically, around 1/4 of the total app-usable address space;
typically, the OS will load libraries, ... in ways which are not
convenient for maintaining an unbroken space.


there is much less of an issue on 64-bit systems, but there are still
issues; for example, trying to allocate a huge chunk of memory (several GB or
more) will often horridly bog down the system IME (or, at least on Windows).

on Windows, a better strategy was to use reserved/uncommitted memory (via
VirtualAlloc), and then to commit it piecewise...

in this particular case, I had allocated a 4GB chunk of memory, mostly as I
needed a region of memory with a "window" where any point in the space could
be reached via a 32-bit offset, whereas with the wider (64-bit) address
space, I couldn't make any guarantees about distance (2GB would be maybe
better though, as 2GB allows any point to be reached from any other via a
signed 32-bit offset).
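
roughly this pattern, for what it's worth (Windows-specific; the function
name and sizes are only illustrative):

#include <windows.h>

/* reserve a big contiguous address range up front, commit only a piece of it */
static char *reserve_window(SIZE_T total, SIZE_T first_chunk)
{
    char *base = VirtualAlloc(NULL, total, MEM_RESERVE, PAGE_NOACCESS);
    if (!base)
        return NULL;

    /* back just the first part with real memory; later
       VirtualAlloc(..., MEM_COMMIT, ...) calls can extend it */
    if (!VirtualAlloc(base, first_chunk, MEM_COMMIT, PAGE_READWRITE)) {
        VirtualFree(base, 0, MEM_RELEASE);
        return NULL;
    }
    return base;
}
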

I haven't tested on Linux to see what happens with large malloc or mmap
calls.

but, memory usage patterns is its own complicated set of issues.

however, given that the average size of most "sane" text files is
measurable
in kB, with a few larger ones being a few MB, usually there is little risk
of size overflows just using the fseek/ftell strategy...

(although, yes, if the file *is* larger than 2GB or so, then the size may
come back negative or otherwise completely wrong...).

but, oh well, whatever...

<--
As soon as you say getting the size of a file, afaik it's outside the
scope of the standard. I know in my environment I can't use ftell to
get a 2GB+ NTFS file size since sizeof( long int ) is 4, and thus I
have to resort to using a system specific method using uint64_t and
GetFileAttributesExA. Unfortunately there is no completely portable
method to get a file size. The best thing in my opinion is to make a
wrapper function that encapsulates the non-portable behavior so that
the non-portability is centralized in one spot if at all possible.
-->

agreed...

IME, one typically ends up with a lot of OS-specific wrappers, usually
dedicated to their own source files (like, having various OS-specific source
files, and then swapping them out for the OS in question).
this is typically much cleaner than the use of fine-grained ifdef's
(although, sometimes I have used coarse-grained ifdef's in these files,
typically so that I don't have to jerk off the Makefile so much, and so the
entire file contents disappear if not on the OS in question).
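
i.e. something like one translation unit per OS, each guarded by a single
coarse ifdef (file and function names are made up here):

/* sys_win32.c -- always handed to the compiler, but its contents
   vanish entirely on other platforms thanks to the coarse guard */
#ifdef _WIN32

#include <windows.h>

void sys_sleep_ms(unsigned ms)
{
    Sleep(ms);
}

#endif /* _WIN32 */

with a matching sys_posix.c (or whatever) providing the same function the
other way around.
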


<--
Of course, I wouldn't fault someone for using the ftell/fseek combo if
that is all that is needed to handle his range of input file sizes as
long as that person doesn't claim it's portable.
-->

one can claim it is portable (since it *is* part of the standard), so long
as they also don't claim it will work with larger files...
 

Ben Bacarisse

BGB / cr88192 said:
I disagree. There is no harm done by the cast, and (among several other
practices commonly disliked here) it also aids in writing code which works
in both C and C++ compilers.

You disagree that I would not cast the return from realloc? :)

I tried to put it as mildly as possible with the sole purpose of
encouraging the OP to think about it. I don't mind either way, but the
argument about compiling with C++ is (or should be) a red herring. It
is trivial to link C with C++ and preferable to compile the C with a C
compiler. In cases like this there is no reason to do anything else.
yeah, the 1.5^x curve works well in my case.
every time one needs more memory, scale the prior size by 1.5...
average case overhead is 25%.

scaling by a fixed size leads to more realloc, which means more time, and
greater memory fragmentation.
fragmentation may well be a greater enemy than the raw amount of memory
used.

Did you really mean to say "scaling" here? It sounds like you are
arguing against it having agreed that it is often preferable.

this function being a great example of CLC pedantics...

personally, I would just be like "to hell with it" and just use fseek/ftell,
since there is almost no way that a text file is going to be larger than 2GB
and expected to just be read into memory like this (it wouldn't fit on 32
bit systems anyways...).

Putting aside the issue of pedantry, it is worth pointing out that ftell
does not report the correct size to read in a whole text file on some
very common platforms. It may usually be a "safe" lie (i.e. you
get told a size larger than required) but it is, none the less, not the
size you need to read.

<snip>
 

Ian Collins

Putting aside the issue of pedantry, it is worth pointing out that ftell
does not report the correct size to read in a whole text file on some
very common platforms. It may usually be a "safe" lie (i.e. you
get told a size larger than required) but it is, none the less, not the
size you need to read.

But on those systems there isn't a direct way of determining the correct
size to read without scanning the file. Provided the code doesn't
expect the result of a read to exactly match the size requested, all
should be well.
 

Morris Keesan

printf("failed to reallocate memory\n"); ....
if(result != PAGE_SIZE) {
if(feof(fp)) {
break;
}
else {
printf("error reading file"); ....

This is working fine. Since I am new to C, I am wondering if this is the
correct approach. Any suggestions to improve the code are
welcome.

If you're going to output an error message, it's more appropriate to
send it to stderr, e.g.
fputs("failed to reallocate memory\n", stderr);

stdout and stderr are very often the same device, but when they're not,
users expect to see error messages on stderr, not stdout.
 

Navaneeth

yeah, the 1.5^x curve works well in my case.
every time one needs more memory, scale the prior size by 1.5...
average case overhead is 25%.

scaling by a fixed size leads to more realloc, which means more time, and
greater memory fragmentation.
fragmentation may well be a greater enemy than the raw amount of memory
used.

Thanks. I am not sure I followed you fully. Can you give me a code
sample of "realloc" which allocates in the way that you are explaining?
 

Malcolm McLean

Thanks. I am not sure I followed you fully. Can you give me code
sample of "realloc" which allocates in a way that you are explaining?
He means this

/*
untested
*/
char *loadtext(FILE *fp)
{
  char *answer;
  void *temp;
  int N = 0;
  int buffsize = 8;
  int ch;

  answer = malloc(buffsize);
  if(!answer)
    return 0;

  while( (ch = fgetc(fp)) != EOF)
  {
    if(N == buffsize - 1)
    {
      buffsize = buffsize + buffsize/2;
      temp = realloc(answer, buffsize);
      if(!temp)
      {
        free(answer);
        return 0;
      }
      answer = temp;   /* keep using the (possibly moved) buffer */
    }
    answer[N++] = ch;
  }
  answer[N] = 0;
  return answer;
}
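
Used roughly like this (the file name is made up, error handling kept
minimal):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  FILE *fp = fopen("example.txt", "r");
  if(fp)
  {
    char *text = loadtext(fp);
    if(text)
    {
      printf("%s", text);
      free(text);
    }
    fclose(fp);
  }
  return 0;
}
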
 

BGB / cr88192

Ben Bacarisse said:
You disagree that I would not cast the return from realloc? :)

I tried to put it as mildly as possible with the sole purpose of
encouraging the OP to think about it. I don't mind either way, but the
argument about compiling with C++ is (or should be) a red herring. It
is trivial to link C with C++ and preferable to compile the C with a C
compiler. In cases like this there is no reason to do anything else.

actually, I have done this in several cases, for different reasons:
a library is written in C, but may be changed to/from C++ later (has
happened a few times);
writing C code to be compiled on .NET, since technically .NET doesn't
support C (only C++/CLI), one has to write their C code in a C++ friendly
way if they hope for it to work on .NET...

....

and, most of the time, it is for really trivial crap, like, casting the
result of malloc/realloc calls, or using "()" for no-argument functions
rather than "(void)", ...

even if many people here dislike them, this is often the most direct route
to keeping them compatible, which in many cases does matter.

Did you really mean to say "scaling" here? It sounds like you are
arguing against it having agreed that it is often preferable.

yeah, stupid wording sometimes...

yeah, scaling by 1.5 works well.
adding a fixed size each time is problematic...

Putting aside the issue of pedantry, it is worth pointing out that ftell
does not report the correct size to read in a whole text file on some
very common platforms. It may usually be a "safe" lie (i.e. you
get told a size larger than required) but it is, none the less, not the
size you need to read.

typically I read text files in binary mode, since this avoids the whole
issue. it is then a matter of dealing with line-endings, ... oneself. the
main use of text-mode then is in writing files.

luckily, on nearly all architectures one is likely to deal with, ASCII (or
UTF-8) is king and only a few different line-ending styles exist (LF, CR LF,
and CR).
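
e.g. a quick in-place pass over the buffer, something like this (a sketch
only; it assumes the buffer is NUL-terminated):

/* fold CR LF and bare CR into LF, in place; returns the new length */
static size_t normalize_newlines(char *s)
{
    char *src = s, *dst = s;

    while (*src) {
        if (*src == '\r') {
            *dst++ = '\n';
            src += (src[1] == '\n') ? 2 : 1;   /* swallow the LF of a CR LF pair */
        }
        else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
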
 

Bartc

Gordon Burditt said:
One of the more serious practical problems of getting the size of
a file is that it can change size between the time that you get the
size and when you read the file. Don't underestimate this problem
- many programs that analyze things end up being applied to growing
log files.
Then there's the odd case where
on some systems you can create a terabyte-sized "sparse" file (which
reads as a terabyte of mostly zero bytes) on a filesystem on a
floppy disk (say, 1.44 megabytes) with space to spare.
You really need to deal with the problem of the file growing after
you get the size, so you don't end up crashing the application
due to overrunning the memory you malloc()ed. One approach is to
use the "file size" as an initial estimate, and be prepared to grow
it if needed. Another approach is to simply not read past the
initially-determined file size.

If you're going to worry about all these possibilities, even for reading a
6-line config file, then you're never going to get anywhere.

And you could do everything right, and the file *still* changes just before,
during, or after you've read it. Or maybe all three.
 

Ben Bacarisse

BGB / cr88192 said:
actually, I have done this in several cases, for different reasons:
a library is written in C, but may be changed to/from C++ later (has
happened a few times);

That sounds circular to me. It does not say why you change C code to
C++ and then possibly back again. Yes, if you switch compilers the code
will need to be both, but you don't say why it makes sense to work this
way in the first place.
writing C code to be compiled on .NET, since technically .NET doesn't
support C (only C++/CLI), one has to write their C code in a C++ friendly
way if they hope for it to work on .NET...

C++/CLI is, of course, yet another language even further from C. Trying
to get source code compatibility with C++/CLI seems to me to be doomed
except for the simplest code fragments. And I am not sure what you mean
by "does not support C". It was my understanding (patchy, I'll admit)
that C++/CLI can link to native C functions just fine. I.e. the generic
"compile as C and link" advice stands for C++/CLI as well.

I think that writing C with an eye to C++ or, worse, to C++/CLI is not
advisable in most cases.

<snip>
 

BGB / cr88192

Ben Bacarisse said:
That sounds circular to me. It does not say why you change C code to
C++ and then possibly back again. Yes, if you switch compilers the code
will need to be both, but you don't say why it makes sense to work this
way in the first place.

usually, it is because code tends to be very fluid.
sometimes one will write it for one purpose, copy it somewhere else, and use
it for another task in a new context. sometimes, one moves across language
barriers as well.

C <-> C++ is a common one, but I have also done copy-paste-edit to move code
between C and Java a few times as well, ...

C++/CLI is, of course, yet another language even further from C. Trying
to get source code compatibility with C++/CLI seems to me to be doomed
except for the simplest code fragments. And I am not sure what you mean
by "does not support C". It was my understanding (patchy, I'll admit)
that C++/CLI can link to native C functions just fine. I.e. the generic
"compile as C and link" advice stands for C++/CLI as well.

well, yes, but only if one is compiling the C or C++/CLI code as native
code.
one has to compile the code as C++/CLI though if one wants to have it be
compiled to MSIL.

I think that writing C with an eye to C++ or, worse, to C++/CLI is not
advisable in most cases.

possibly...

but, I also tend to write my code to be processed via fairly dumb automated
tools as well, most of which have only partial understanding of C syntax
(and typically impose many restrictions, such as limiting valid declaration
syntax, modifier order, ...).

writing code so that it can also be fed through C++/CLI is not asking
much...


hell, some people went and compiled Quake2 using C++/CLI (Quake2 is itself
written in C), so it is probably not asking that much...
 

Ian Collins

actually, I have done this in several cases, for different reasons:
a library is written in C, but may be changed to/from C++ later (has
happened a few times);

It sounds like you need to organise your code to enable you to link C
and C++ libraries, rather than copy and paste code.
writing C code to be compiled on .NET, since technically .NET doesn't
support C (only C++/CLI), one has to write their C code in a C++ friendly
way if they hope for it to work on .NET...

Probably because there isn't a good reason for writing C on that
platform. Why don't you just link C functions from a library?
and, most of the time, it is for really trivial crap, like, casting the
result of malloc/realloc calls, or using "()" for no-argument functions
rather than "(void)", ...

even if many people here dislike them, this is often the most direct route
to keeping them compatible, which in many cases does matter.

I'm one of the few people I know who compiles C as C++ and that's only
for a specific scenario (testing). Even then, I'll keep the code that
has to be C++ compatible separate from the rest of the source.
 

BGB / cr88192

Ian Collins said:
It sounds like you need to organise your code to enable you to link C and
C++ libraries, rather than copy and paste code.

I go both routes...

usually copy/paste happens if one is doing mix and match of code from
different places, or codebases, in order to make a combined result...

Probably because there isn't a good reason for writing C on that platform.
Why don't you just link C functions from a library?

the reason mostly is that of CPU arch:
if one uses 32-bit native DLL's, they are stuck with x86;
if they use MSIL, then the JIT can do its magic and compile as either 32-bit
or 64-bit.

otherwise, they would have to build separate 32 and 64 bit versions of their
DLL's...

but, mind you, I do make a lot of use of libraries as well, just libraries
and copy/paste/edit each have their merits and drawbacks, and which one is
better depends a lot on the situation...

I'm one of the few people I know who compiles C as C++ and that's only for
a specific scenario (testing). Even then, I'll keep the code that has to
be C++ compatible separate from the rest of the source.

possibly, however, as making it convert easily is almost free, I see little
reason not to do it wherever it makes sense...
 

Ian Collins

possibly, however, as making it convert easily is almost free, I see little
reason not to do it wherever it makes sense...

The most obvious reason not to do it is idiomatic C != idiomatic C++!
 

Navaneeth

/*
  untested
*/
char *loadtext(FILE *fp)
{
  char *answer;
  void *temp;
  int N = 0;
  int buffsize = 8;
  int ch;

  answer = malloc(buffsize);
  if(!answer)
    return 0;

  while( (ch = fgetc(fp)) != EOF)
  {
    if(N == buffsize -1)
    {
      buffsize = buffsize + buffsize/2;
      temp = realloc(answer, buffsize);
      if(!temp)
      {
        free(answer);
        return 0;
      }
      answer = temp;
    }
    answer[N++] = ch;
  }
  answer[N] = 0;
  return answer;

}

Well, if I use this, I think I will have more allocations. Consider a
10KB file. I am allocating 4KB (I have PAGE_SIZE set as 4096 bytes)
first. If I use my code, I will allocate 8KB for the second iteration
and 12KB for the last iteration. With your code, the second allocation
will be made for 6KB, the third for 9KB and the last for 13KB. So
my initial code finishes in just 3 allocations while the new code takes
4.

I am not understanding how this is a better strategy for allocations.
 

Top