Reading whole text files

Discussion in 'C Programming' started by Michael Mair, Feb 10, 2005.

  1. Michael Mair

    Michael Mair Guest

    Cheerio,


    I would appreciate opinions on the following:

    Given the task to read a _complete_ text file into a string:
    What is the "best" way to do it?
    Handling the buffer is not the problem -- the character
    input is a different matter, at least if I want to remain within
    the bounds of the standard library.

    Essentially, I can think of three variants:
    - Low: Use fgetc(). Simple, straightforward, probably inefficient.
    - Default: Use fgets(); ugly, if we are not interested in lines
    and have many newline characters to read.
    - Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
    XSTR(BUFLEN) gives me BUFLEN in a string literal.

    From the labels, it is pretty obvious that I would favour the
    last one, so there is the question about possible pitfalls
    (yes, I will use the return value and "read") and whether there
    are environmental limits for BUFLEN.


    If I missed some obvious source (looking for the wrong sort of
    stuff in the FAQ and google archives), then please point me
    toward it :)


    Regards,
    Michael
    --
    E-Mail: Mine is an /at/ gmx /dot/ de address.
    Michael Mair, Feb 10, 2005
    #1

  2. Michael Mair

    infobahn Guest

    Michael Mair wrote:
    >
    > Cheerio,
    >
    > I would appreciate opinions on the following:
    >
    > Given the task to read a _complete_ text file into a string:
    > What is the "best" way to do it?
    > Handling the buffer is not the problem -- the character
    > input is a different matter, at least if I want to remain within
    > the bounds of the standard library.
    >
    > Essentially, I can think of three variants:
    > - Low: Use fgetc(). Simple, straightforward, probably inefficient.


    Why inefficient? I'd prefer getc in case you're fortunate enough
    to have it implemented as a macro, but it should be efficient
    enough.

    > - Default: Use fgets(); ugly, if we are not interested in lines
    > and have many newline characters to read.


    And you have to maintain /two/ buffers (quite apart from the buffer
    maintained by your text stream handler) - your expanding buffer,
    and the buffer you give to fgets (unless you use the expanding
    buffer for that too, which is certainly doable but probably gives
    you more headaches).

    > - Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
    > XSTR(BUFLEN) gives me BUFLEN in a string literal.
    >
    > From the labels, it is pretty obvious that I would favour the
    > last one,


    Fine, so use that. But it wouldn't be my choice.

    Vive la difference!
    infobahn, Feb 10, 2005
    #2

  3. Michael Mair

    Michael Mair Guest

    infobahn wrote:
    > Michael Mair wrote:
    >
    >>Cheerio,
    >>
    >>I would appreciate opinions on the following:
    >>
    >>Given the task to read a _complete_ text file into a string:
    >>What is the "best" way to do it?
    >>Handling the buffer is not the problem -- the character
    >>input is a different matter, at least if I want to remain within
    >>the bounds of the standard library.
    >>
    >>Essentially, I can think of three variants:
    >>- Low: Use fgetc(). Simple, straightforward, probably inefficient.

    >
    > Why inefficient? I'd prefer getc in case you're fortunate enough
    > to have it implemented as a macro, but it should be efficient
    > enough.


    "Probably" inefficient in that I cannot rely on getc() being
    implemented as a macro and that I do not want to make assumptions
    about the underlying library. So, essentially, the question is
    for me whether having a loop in my code is "better" than just
    telling fscanf() to get, say 8K characters in one go.
    The main beauty of this approach lies for me in the clarity of the
    code. Thanks for reminding me of getc() vs. fgetc().

    >>- Default: Use fgets(); ugly, if we are not interested in lines
    >> and have many newline characters to read.

    >
    > And you have to maintain /two/ buffers (quite apart from the buffer
    > maintained by your text stream handler) - your expanding buffer,
    > and the buffer you give to fgets (unless you use the expanding
    > buffer for that too, which is certainly doable but probably gives
    > you more headaches).


    Actually, I have implemented it first with fgets() and one extending
    buffer but found, looking at the final code, that approach too unwieldy
    and error prone, as you need more code and variables.
    Usually, I would have gone for the "Low" approach due to the clarity
    of the resulting code but -- as I was at it -- I just asked myself
    which options I have.


    >>- Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
    >> XSTR(BUFLEN) gives me BUFLEN in a string literal.
    >>
    >> From the labels, it is pretty obvious that I would favour the
    >>last one,

    >
    > Fine, so use that. But it wouldn't be my choice.


    I _was_ asking for opinions.


    > Vive la difference!


    :)
    Thank you for your input!


    Cheers
    Michael
    --
    E-Mail: Mine is a gmx dot de address.
    Michael Mair, Feb 10, 2005
    #3
  4. Michael Mair

    jacob navia Guest

    Michael Mair wrote:
    > Cheerio,
    >
    >
    > I would appreciate opinions on the following:
    >
    > Given the task to read a _complete_ text file into a string:
    > What is the "best" way to do it?
    > Handling the buffer is not the problem -- the character
    > input is a different matter, at least if I want to remain within
    > the bounds of the standard library.
    >
    > Essentially, I can think of three variants:
    > - Low: Use fgetc(). Simple, straightforward, probably inefficient.
    > - Default: Use fgets(); ugly, if we are not interested in lines
    > and have many newline characters to read.
    > - Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
    > XSTR(BUFLEN) gives me BUFLEN in a string literal.
    >
    > From the labels, it is pretty obvious that I would favour the
    > last one, so there is the question about possible pitfalls
    > (yes, I will use the return value and "read") and whether there
    > are environmental limits for BUFLEN.
    >
    >
    > If I missed some obvious source (looking for the wrong sort of
    > stuff in the FAQ and google archives), then please point me
    > toward it :)
    >
    >
    > Regards,
    > Michael


    What about this?

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    char *ReadFileIntoRam(char *fname,int *plen)
    {
        FILE *infile;
        char *contents;
        int actualBytesRead=0;
        unsigned int len;

        infile = fopen(fname,"rb");
        if (infile == NULL) {
            fprintf(stderr,"impossible to open %s\n",fname);
            return NULL;
        }
        fseek(infile,0,SEEK_END);
        len = ftell(infile);
        fseek(infile,0,SEEK_SET);
        contents = calloc(len+1,1);
        if (contents) {
            actualBytesRead = fread(contents,1,len,infile);
        }
        else {
            fprintf(stderr,"Can't allocate memory to read the file\n");
        }
        fclose(infile);
        *plen = actualBytesRead;
        return contents;
    }

    int main(int argc,char *argv[])
    {
        if (argc < 2) {
            printf("usage: readfile <filename>\n");
            exit(1);
        }
        int len=0;
        char *contents=ReadFileIntoRam(argv[1],&len);
        // work with the contents of the file
    }
    jacob navia, Feb 10, 2005
    #4
  5. Michael Mair

    Michael Mair Guest

    jacob navia wrote:
    > Michael Mair wrote:
    >
    >> Cheerio,
    >>
    >>
    >> I would appreciate opinions on the following:
    >>
    >> Given the task to read a _complete_ text file into a string:
    >> What is the "best" way to do it?
    >> Handling the buffer is not the problem -- the character
    >> input is a different matter, at least if I want to remain within
    >> the bounds of the standard library.
    >>
    >> Essentially, I can think of three variants:
    >> - Low: Use fgetc(). Simple, straightforward, probably inefficient.
    >> - Default: Use fgets(); ugly, if we are not interested in lines
    >> and have many newline characters to read.
    >> - Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
    >> XSTR(BUFLEN) gives me BUFLEN in a string literal.
    >>
    >> From the labels, it is pretty obvious that I would favour the
    >> last one, so there is the question about possible pitfalls
    >> (yes, I will use the return value and "read") and whether there
    >> are environmental limits for BUFLEN.
    >>
    >>
    >> If I missed some obvious source (looking for the wrong sort of
    >> stuff in the FAQ and google archives), then please point me
    >> toward it :)
    >>
    >>
    >> Regards,
    >> Michael

    >
    >
    > What about this?
    >
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <string.h>
    > char *ReadFileIntoRam(char *fname,int *plen)
    > {
    > FILE *infile;
    > char *contents;
    > int actualBytesRead=0;
    > unsigned int len;
    >
    > infile = fopen(fname,"rb");


    Here is the crux: I want/have to work with a _text_ file.
    Everything else may give me wrong results.

    > if (infile == NULL) {
    > fprintf(stderr,"impossible to open %s\n",fname);
    > return NULL;
    > }
    > fseek(infile,0,SEEK_END);
    > len = ftell(infile);
    > fseek(infile,0,SEEK_SET);
    > contents = calloc(len+1,1);
    > if (contents) {
    > actualBytesRead = fread(contents,1,len,infile);


    This is what I would do for binary files.
    Essentially, I am looking for the text file equivalent of fread().

    > }
    > else {
    > fprintf(stderr,"Can't allocate memory to read the file\n");
    > }
    > fclose(infile);
    > *plen = actualBytesRead;
    > return contents;
    > }
    >
    > int main(int argc,char *argv[])
    > {
    > if (argc < 2) {
    > printf("usage: readfile <filename>\n");
    > exit(1);
    > }
    > int len=0;
    > char *contents=ReadFileIntoRam(argv[1],&len);
    > // work with the contents of the file
    > }


    Thank you for trying :)


    Cheers
    Michael
    --
    E-Mail: Mine is a gmx dot de address.
    Michael Mair, Feb 10, 2005
    #5
  6. Michael Mair

    S.Tobias Guest

    infobahn wrote:
    > Michael Mair wrote:
    > >
    > > Cheerio,
    > >
    > > I would appreciate opinions on the following:
    > >
    > > Given the task to read a _complete_ text file into a string:
    > > What is the "best" way to do it?
    > > Handling the buffer is not the problem -- the character
    > > input is a different matter, at least if I want to remain within
    > > the bounds of the standard library.
    > >
    > > Essentially, I can think of three variants:
    > > - Low: Use fgetc(). Simple, straightforward, probably inefficient.


    > Why inefficient? I'd prefer getc in case you're fortunate enough
    > to have it implemented as a macro, but it should be efficient
    > enough.


    In thread-safe libraries getc() family functions can actually
    be quite inefficient, because they must lock the stream object,
    which takes time. This is the reason why some systems provide
    getc_unlocked() (thread-unsafe) family (I remember a noticeable
    difference between them in my tests some time ago).
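
    The locking point can be illustrated with a small sketch. It assumes
    a POSIX system, since flockfile(), funlockfile() and getc_unlocked()
    are POSIX rather than ISO C, and the helper name
    count_chars_unlocked is made up for illustration. Whether it is
    actually faster on a given library has to be measured.

```c
#define _POSIX_C_SOURCE 200809L  /* flockfile/getc_unlocked are POSIX, not ISO C */
#include <stdio.h>

/* Hypothetical helper: count the characters left on `fp`, taking the
 * stream lock once instead of once per character.
 * Returns -1L if a read error occurred. */
long count_chars_unlocked(FILE *fp)
{
    long n = 0;
    int ch;

    flockfile(fp);                          /* lock the stream once */
    while ((ch = getc_unlocked(fp)) != EOF) /* no per-call locking here */
        n++;
    funlockfile(fp);                        /* release the lock */

    return ferror(fp) ? -1L : n;
}
```

    The same pattern applies when storing the characters instead of
    counting them; only the loop body changes.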

    +++

    Excuse my ignorance, I have no experience with text files in
    the C Std context. Why wouldn't fread() be suitable for
    reading text files? In 7.19.8p2 it says the fread() call is
    performed as if by repeated calls to the fgetc() function.
    I haven't spotted any mention where these functions would be
    constrained to binary streams only.

    --
    Stan Tobias
    mailx `echo LID | sed s/[[:upper:]]//g`
    S.Tobias, Feb 10, 2005
    #6
  7. Michael Mair

    Michael Mair Guest

    S.Tobias wrote:
    > infobahn <> wrote:
    >
    >>Michael Mair wrote:
    >>
    >>>Cheerio,
    >>>
    >>>I would appreciate opinions on the following:
    >>>
    >>>Given the task to read a _complete_ text file into a string:
    >>>What is the "best" way to do it?
    >>>Handling the buffer is not the problem -- the character
    >>>input is a different matter, at least if I want to remain within
    >>>the bounds of the standard library.
    >>>
    >>>Essentially, I can think of three variants:
    >>>- Low: Use fgetc(). Simple, straightforward, probably inefficient.

    >
    >
    >>Why inefficient? I'd prefer getc in case you're fortunate enough
    >>to have it implemented as a macro, but it should be efficient
    >>enough.

    >
    >
    > In thread-safe libraries getc() family functions can actually
    > be quite inefficient, because they must lock the stream object,
    > which takes time. This is the reason why some systems provide
    > getc_unlocked() (thread-unsafe) family (I remember a noticeable
    > difference between them in my tests some time ago).


    Interesting.

    > +++
    >
    > Excuse my ignorance, I have no experience with text files in
    > the C Std context. Why wouldn't fread() be suitable for
    > reading text files? In 7.19.8p2 it says the fread() call is
    > performed as if by use of fgetc() function in the bottom.
    > I haven't spotted any mention where these functions would be
    > constrained to binary streams only.


    It seems I am plain stupid... Somewhere in my brain, there was
    "fread()/fwrite() <-> binary I/O" hardwired :-/
    So, if I open the stream as text stream, everything should be
    fine. (If this is wrong, please correct me.)
    Moreover, if I read the data into dynamically allocated
    storage pointed to by an unsigned char *, I circumvent potential
    problems with the is** functions from <ctype.h> (as I asked in
    another thread).
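
    A minimal sketch of that idea, assuming the caller has already
    opened the stream in text mode with fopen(name, "r"). The helper
    name slurp_stream and the starting capacity of 4096 are my own
    choices, not anything from the standard. It never asks the stream
    for its size; it just keeps calling fread() until a short read.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything remaining on `fp` into a
 * NUL-terminated, malloc'ed buffer of unsigned char. `fp` may be a
 * text stream; fread() then sees whatever translation the stream does.
 * Stores the character count in *out_len; returns NULL on failure. */
unsigned char *slurp_stream(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, len = 0;          /* starting capacity is arbitrary */
    unsigned char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Always keep one byte free for the terminator. */
        size_t got = fread(buf + len, 1, cap - len - 1, fp);
        len += got;
        if (len < cap - 1)
            break;                        /* short read: EOF or error */

        if (cap > (size_t)-1 / 2) {       /* doubling would overflow */
            free(buf);
            return NULL;
        }
        unsigned char *tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);                    /* old block is still ours */
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }

    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```

    The caller owns the returned buffer and releases it with free().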

    Thank you :)


    Cheers
    Michael
    --
    E-Mail: Mine is a gmx dot de address.
    Michael Mair, Feb 10, 2005
    #7
  8. Michael Mair

    Michael Mair Guest

    Michael Mair wrote:
    >
    >
    > jacob navia wrote:
    >
    >> Michael Mair wrote:
    >>
    >>> Cheerio,
    >>>
    >>>
    >>> I would appreciate opinions on the following:
    >>>
    >>> Given the task to read a _complete_ text file into a string:
    >>> What is the "best" way to do it?
    >>> Handling the buffer is not the problem -- the character
    >>> input is a different matter, at least if I want to remain within
    >>> the bounds of the standard library.
    >>>
    >>> Essentially, I can think of three variants:
    >>> - Low: Use fgetc(). Simple, straightforward, probably inefficient.
    >>> - Default: Use fgets(); ugly, if we are not interested in lines
    >>> and have many newline characters to read.
    >>> - Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
    >>> XSTR(BUFLEN) gives me BUFLEN in a string literal.
    >>>
    >>> From the labels, it is pretty obvious that I would favour the
    >>> last one, so there is the question about possible pitfalls
    >>> (yes, I will use the return value and "read") and whether there
    >>> are environmental limits for BUFLEN.
    >>>
    >>>
    >>> If I missed some obvious source (looking for the wrong sort of
    >>> stuff in the FAQ and google archives), then please point me
    >>> toward it :)
    >>>
    >>>
    >>> Regards,
    >>> Michael

    >>
    >>
    >>
    >> What about this?
    >>
    >> #include <stdio.h>
    >> #include <stdlib.h>
    >> #include <string.h>
    >> char *ReadFileIntoRam(char *fname,int *plen)
    >> {
    >> FILE *infile;
    >> char *contents;
    >> int actualBytesRead=0;
    >> unsigned int len;
    >>
    >> infile = fopen(fname,"rb");

    >
    >
    > Here is the crux: I want/have to work with a _text_ file.
    > Everything else may give me wrong results.


    Sorry, the "b" brought me back onto the wrong track I already
    was on. See the other subthread.


    Cheers
    Michael
    >
    >> if (infile == NULL) {
    >> fprintf(stderr,"impossible to open %s\n",fname);
    >> return NULL;
    >> }
    >> fseek(infile,0,SEEK_END);
    >> len = ftell(infile);
    >> fseek(infile,0,SEEK_SET);
    >> contents = calloc(len+1,1);
    >> if (contents) {
    >> actualBytesRead = fread(contents,1,len,infile);

    >
    >
    > This is what I would do for binary files.
    > Essentially, I am looking for the text file equivalent of fread().
    >
    >> }
    >> else {
    >> fprintf(stderr,"Can't allocate memory to read the file\n");
    >> }
    >> fclose(infile);
    >> *plen = actualBytesRead;
    >> return contents;
    >> }
    >>
    >> int main(int argc,char *argv[])
    >> {
    >> if (argc < 2) {
    >> printf("usage: readfile <filename>\n");
    >> exit(1);
    >> }
    >> int len=0;
    >> char *contents=ReadFileIntoRam(argv[1],&len);
    >> // work with the contents of the file
    >> }

    >
    >
    > Thank you for trying :)
    >
    >
    > Cheers
    > Michael



    --
    E-Mail: Mine is a gmx dot de address.
    Michael Mair, Feb 10, 2005
    #8
  9. Michael Mair

    SM Ryan Guest

    Michael Mair wrote:
    # Cheerio,
    #
    #
    # I would appreciate opinions on the following:
    #
    # Given the task to read a _complete_ text file into a string:
    # What is the "best" way to do it?
    # Handling the buffer is not the problem -- the character
    # input is a different matter, at least if I want to remain within
    # the bounds of the standard library.
    #
    # Essentially, I can think of three variants:
    # - Low: Use fgetc(). Simple, straightforward, probably inefficient.

    char *contents=0; int m=0,n=0,ch;
    while ((ch=fgetc(file))!=EOF) {
        if (n+2>=m) {m = 2*n+2; contents = realloc(contents,m);}
        contents[n++] = ch; contents[n] = 0;
    }
    contents = realloc(contents,n+1);

    You might also include #ifdef/#endif code to use memory mapping on systems
    that support it.

    --
    SM Ryan http://www.rawbw.com/~wyrmwif/
    This is one wacky game show.
    SM Ryan, Feb 10, 2005
    #9
  10. Michael Mair

    Al Bowers Guest

    Michael Mair wrote:

    >>> Given the task to read a _complete_ text file into a string:
    >>> What is the "best" way to do it?
    >>> Handling the buffer is not the problem -- the character
    >>> input is a different matter, at least if I want to remain within
    >>> the bounds of the standard library.
    >>>
    >>> Essentially, I can think of three variants:
    >>> - Low: Use fgetc(). Simple, straightforward, probably inefficient.

    >>
    >>
    >> Why inefficient? I'd prefer getc in case you're fortunate enough
    >> to have it implemented as a macro, but it should be efficient
    >> enough.

    >
    >
    > "Probably" inefficient in that I cannot rely on getc() being
    > implemented as a macro and that I do not want to make assumptions
    > about the underlying library. So, essentially, the question is
    > for me whether having a loop in my code is "better" than just
    > telling fscanf() to get, say 8K characters in one go.
    > The main beauty of this approach lies for me in the clarity of the
    > code. Thanks for reminding me of getc() vs. fgetc().
    >
    >>> - Default: Use fgets(); ugly, if we are not interested in lines
    >>> and have many newline characters to read.

    >>


    My intuition is that the definition of a "_complete_" text file
    would require the "ugly". Hence, I would use function fgets in
    a loop.

    >>
    >> And you have to maintain /two/ buffers (quite apart from the buffer
    >> maintained by your text stream handler) - your expanding buffer,
    >> and the buffer you give to fgets (unless you use the expanding
    >> buffer for that too, which is certainly doable but probably gives
    >> you more headaches).

    >
    >
    > Actually, I have implemented it first with fgets() and one extending
    > buffer but found, looking at the final code, that approach too unwieldy
    > and error prone, as you need more code and variables.


    Use fgets to copy into a buffer, and then append it to an
    expanding dynamically allocated char array. This is not unwieldy.

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    int main(void)
    {
        char buffer[128],*fstr, *tmp;
        size_t slen, blen;
        FILE *fp;

        if((fp = fopen("test.c","r")) == NULL) exit(EXIT_FAILURE);
        for(slen = 0, fstr = NULL;
            (fgets(buffer,sizeof buffer, fp)) ; slen+=blen)
        {
            blen = strlen(buffer);
            if((tmp = realloc(fstr,slen+blen+1)) == NULL)
            {
                free(fstr);
                exit(EXIT_FAILURE);
            }
            if(slen == 0) *tmp = '\0';
            fstr = tmp;
            strcat(fstr,buffer);
        }
        fclose(fp);
        puts(fstr);
        free(fstr);
        return 0;
    }


    --
    Al Bowers
    Tampa, Fl USA
    mailto: (remove the x to send email)
    http://www.geocities.com/abowers822/
    Al Bowers, Feb 10, 2005
    #10
  11. Michael Mair

    infobahn Guest

    Al Bowers wrote:
    >

    <snip>
    >
    > Use fgets to copy into a buffer, and then append it to an
    > expanding dynamically allocated char array. This is not unwieldy.
    >
    > #include <stdio.h>
    > #include <string.h>
    > #include <stdlib.h>
    >
    > int main(void)
    > {
    > char buffer[128],*fstr, *tmp;
    > size_t slen, blen;
    > FILE *fp;
    >
    > if((fp = fopen("test.c","r")) == NULL) exit(EXIT_FAILURE);
    > for(slen = 0, fstr = NULL;
    > (fgets(buffer,sizeof buffer, fp)) ; slen+=blen)
    > {
    > blen = strlen(buffer);


    Consider a file 12,800,000 or so bytes in length. This means you'll
    call strlen at least 100,000 times, and just about every call will
    have to trawl through 128 (or so) bytes. That is, modulo the last
    read, you'll have to touch every character /three/ times - once while
    reading, once while strlenning, and once while copying. For large
    files, this is a serious overhead.

    > if((tmp = realloc(fstr,slen+blen+1)) == NULL)


    You don't have to go to the well quite this often. You can keep
    a max, and only realloc when the max is about to be exceeded.
    Whenever you do this, multiply the not-enough-storage value by
    some constant (some people double, others use 1.1 or 1.5 or
    whatever) to decide how much to allocate next time.

    Consider adding a way to stop the reading of a file larger than
    the largest the user is prepared to allocate RAM for.

    > {
    > free(fstr);
    > exit(EXIT_FAILURE);
    > }
    > if(slen == 0) *tmp = '\0';
    > fstr = tmp;
    > strcat(fstr,buffer);


    It's getting worse. strcat has to find the end of the string, which
    is O(n). Put it into a loop, and you get O(n*n). This will seriously
    hurt performance for large files. It's not hard to keep a
    pointer to the next place to write.
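
    Putting those three suggestions together (write position kept in a
    variable, geometric growth, a cap on total size) might look like the
    sketch below. The name read_all_fgets, the starting size of 256 and
    the max_bytes parameter are assumptions made for illustration.
    fgets() writes straight into the expanding buffer, so there is no
    second buffer and no strcat; strlen() still scans each new piece
    once, and it assumes the file contains no embedded '\0' characters.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read all of `fp` with fgets() into one growing
 * buffer. `max_bytes` is a caller-chosen cap on the allocation.
 * Returns a malloc'ed string, or NULL on error or if the file does
 * not fit under the cap. */
char *read_all_fgets(FILE *fp, size_t max_bytes)
{
    if (max_bytes < 2)
        return NULL;
    size_t cap = max_bytes < 256 ? max_bytes : 256;
    size_t len = 0;                       /* next place to write */
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    buf[0] = '\0';

    for (;;) {
        size_t room = cap - len;          /* always at least 2 here */
        if (room > INT_MAX)
            room = INT_MAX;               /* fgets takes an int */
        if (fgets(buf + len, (int)room, fp) == NULL)
            break;                        /* EOF or read error */
        len += strlen(buf + len);         /* scans only the new piece */
        if (cap - len >= 2)
            continue;                     /* still room for more */

        /* Buffer is full: grow geometrically, but not past the cap. */
        size_t newcap = cap > max_bytes / 2 ? max_bytes : cap * 2;
        if (newcap <= cap) {
            if (getc(fp) == EOF)
                break;                    /* file ended exactly at the cap */
            free(buf);
            return NULL;                  /* file is larger than the cap */
        }
        char *tmp = realloc(buf, newcap);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap = newcap;
    }

    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

    With doubling, the number of realloc calls grows with the logarithm
    of the file size rather than with the number of lines.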
    infobahn, Feb 10, 2005
    #11
  12. Michael Mair

    Eric Sosman Guest

    Michael Mair wrote:

    > infobahn wrote:
    >
    >> Michael Mair wrote:
    >>
    >>> Cheerio,
    >>>
    >>> I would appreciate opinions on the following:
    >>>
    >>> Given the task to read a _complete_ text file into a string:
    >>> What is the "best" way to do it?
    >>> Handling the buffer is not the problem -- the character
    >>> input is a different matter, at least if I want to remain within
    >>> the bounds of the standard library.
    >>>
    >>> Essentially, I can think of three variants:
    >>> - Low: Use fgetc(). Simple, straightforward, probably inefficient.

    >>
    >> Why inefficient? I'd prefer getc in case you're fortunate enough
    >> to have it implemented as a macro, but it should be efficient
    >> enough.

    >
    > "Probably" inefficient in that I cannot rely on getc() being
    > implemented as a macro and that I do not want to make assumptions
    > about the underlying library. So, essentially, the question is
    > for me whether having a loop in my code is "better" than just
    > telling fscanf() to get, say 8K characters in one go.
    > The main beauty of this approach lies for me in the clarity of the
    > code. Thanks for reminding me of getc() vs. fgetc().


    Considerations of the relative efficiency of library
    functions already involve matters you cannot "rely" on; the
    Standard has nothing to say about it, and you're forced to
    empirical methods.

    I can, perhaps, offer a data point. My fgets() replacement
    (everybody writes one eventually, it seems) originally used
    fgets() itself, on the grounds that it might be implemented
    more efficiently "under the covers" than repeated getc(). After
    each fgets() I'd check whether the line was too long (no '\n'
    in the buffer), and if so I'd expand the buffer and do another
    fgets(). All well and good.

    Just for curiosity's sake, though, I wrote a second version
    that made repeated getc() calls -- and guess what? It was a
    little bit faster. Whatever speed advantage fgets() might have
    had was lost in the need to search for the end of the line
    afterwards. strlen(buff) was a hair faster than strchr(buff,'\n'),
    but either way the combined fgets()/strxxx() was slower than a
    loop calling getc() and testing each character on the fly.

    The "getc() is faster" result was reproducible on four
    configurations: SPARC with Sun Studio compiler and Solaris' C
    library, SPARC with gcc and Solaris' C library, and on two
    different Pentium models with gcc and the DJgpp library.

    YMMV, and the problem you're trying to solve is slightly
    different from the one I attacked. Still, it's suggestive.
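
    For readers who want to try the comparison themselves, the getc()
    variant described above might look roughly like this. It is a
    sketch, not the code that was actually timed; the name
    read_line_getc and the starting capacity of 80 are invented here.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical getc()-based line reader: returns one malloc'ed line
 * without its '\n'. Returns NULL at end of file, on allocation
 * failure, or on a read error before any character was read. */
char *read_line_getc(FILE *fp)
{
    size_t cap = 0, len = 0;
    char *line = NULL;
    int ch;

    while ((ch = getc(fp)) != EOF) {
        if (len + 2 > cap) {              /* room for ch and '\0' */
            size_t newcap = cap ? cap * 2 : 80;
            char *tmp = realloc(line, newcap);
            if (tmp == NULL) {
                free(line);
                return NULL;
            }
            line = tmp;
            cap = newcap;
        }
        if (ch == '\n')
            break;                        /* end of line found on the fly */
        line[len++] = (char)ch;
    }

    if (line == NULL)
        return NULL;                      /* nothing was read at all */
    line[len] = '\0';
    return line;
}
```

    There is no separate strlen() or strchr() pass because the newline
    test happens while each character is being stored.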

    --
    Eric Sosman
    lid
    Eric Sosman, Feb 10, 2005
    #12
  13. Michael Mair

    Randy Howard Guest

    In article <>, wyrmwif@tango-sierra-oscar-foxtrot-tango.fake.org says...
    > Michael Mair <> wrote:
    > # Cheerio,
    > #
    > #
    > # I would appreciate opinions on the following:
    > #
    > # Given the task to read a _complete_ text file into a string:
    > # What is the "best" way to do it?
    > # Handling the buffer is not the problem -- the character
    > # input is a different matter, at least if I want to remain within
    > # the bounds of the standard library.
    > #
    > # Essentially, I can think of three variants:
    > # - Low: Use fgetc(). Simple, straightforward, probably inefficient.
    >
    > char *contents=0; int m=0,n=0,ch;
    > while ((ch=fgetc(file))!=EOF) {
    > if (n+2>=m) {m = 2*n+2; contents = realloc(contents,m);}
    > contents[n++] = ch; contents[n] = 0;
    > }
    > contents = realloc(contents,n+1);


    What happens to contents if this realloc() fails?
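
    In the quoted loop the answer is that the old block is leaked and
    the next store goes through a null pointer. One possible repair,
    sketched under my own naming (read_all_getc is hypothetical), is to
    send every realloc() through a temporary pointer:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical rewrite of the getc() loop: realloc() goes through a
 * temporary so the old block can still be freed if growth fails.
 * Returns a NUL-terminated malloc'ed string, or NULL on failure. */
char *read_all_getc(FILE *fp, size_t *out_len)
{
    char *contents = NULL;
    size_t cap = 0, len = 0;
    int ch;

    while ((ch = getc(fp)) != EOF) {
        if (len + 2 > cap) {              /* room for ch and '\0' */
            size_t newcap = cap ? cap * 2 : 64;
            char *tmp = realloc(contents, newcap);
            if (tmp == NULL) {
                free(contents);           /* old block is still valid */
                return NULL;
            }
            contents = tmp;
            cap = newcap;
        }
        contents[len++] = (char)ch;
    }

    if (ferror(fp)) {
        free(contents);
        return NULL;
    }
    if (contents == NULL) {               /* empty file: no block yet */
        contents = malloc(1);
        if (contents == NULL)
            return NULL;
    }
    contents[len] = '\0';
    if (out_len != NULL)
        *out_len = len;
    return contents;
}
```

    The final shrinking realloc() from the original is left out here;
    if it is wanted, it needs the same temporary-pointer treatment.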

    --
    Randy Howard (2reply remove FOOBAR)
    "Making it hard to do stupid things often makes it hard
    to do smart ones too." -- Andrew Koenig
    Randy Howard, Feb 10, 2005
    #13
  14. Michael Mair

    CBFalconer Guest

    jacob navia wrote:
    > Michael Mair wrote:
    >>
    >> Given the task to read a _complete_ text file into a string:
    >> What is the "best" way to do it?
    >> Handling the buffer is not the problem -- the character
    >> input is a different matter, at least if I want to remain within
    >> the bounds of the standard library.
    >>
    >> Essentially, I can think of three variants:
    >> - Low: Use fgetc(). Simple, straightforward, probably inefficient.
    >> - Default: Use fgets(); ugly, if we are not interested in lines
    >> and have many newline characters to read.
    >> - Interesting: fscanf("%"XSTR(BUFLEN)"c%n", curr, &read), where
    >> XSTR(BUFLEN) gives me BUFLEN in a string literal.
    >>
    >> From the labels, it is pretty obvious that I would favour the
    >> last one, so there is the question about possible pitfalls
    >> (yes, I will use the return value and "read") and whether there
    >> are environmental limits for BUFLEN.
    >>
    >> If I missed some obvious source (looking for the wrong sort of
    >> stuff in the FAQ and google archives), then please point me
    >> toward it :)

    >
    > What about this?
    >
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <string.h>
    > char *ReadFileIntoRam(char *fname,int *plen)
    > {
    > FILE *infile;
    > char *contents;
    > int actualBytesRead=0;
    > unsigned int len;
    >
    > infile = fopen(fname,"rb");
    > if (infile == NULL) {
    > fprintf(stderr,"impossible to open %s\n",fname);
    > return NULL;
    > }
    > fseek(infile,0,SEEK_END);
    > len = ftell(infile);
    > fseek(infile,0,SEEK_SET);
    > contents = calloc(len+1,1);
    > if (contents) {
    > actualBytesRead = fread(contents,1,len,infile);
    > }
    > else {
    > fprintf(stderr,"Can't allocate memory to read the file\n");
    > }
    > fclose(infile);
    > *plen = actualBytesRead;
    > return contents;
    > }


    No good. Note that ftell is meaningless for text files. It also
    returns a long, not an int. You haven't even tested it for failure
    (and it will fail on input from a keyboard). Even if everything
    works, use of calloc is silly: why zero what you are about to fill?
    Instead just add a single '\0' after filling. From N869:

    7.19.9.4 The ftell function

    Synopsis

    [#1]

    #include <stdio.h>
    long int ftell(FILE *stream);

    Description

    [#2] The ftell function obtains the current value of the
    file position indicator for the stream pointed to by stream.
    For a binary stream, the value is the number of characters
    from the beginning of the file. For a text stream, its file
    position indicator contains unspecified information, usable
    by the fseek function for returning the file position
    indicator for the stream to its position at the time of the
    ftell call; the difference between two such return values is
    not necessarily a meaningful measure of the number of
    characters written or read.

    Returns

    [#3] If successful, the ftell function returns the current
    value of the file position indicator for the stream. On
    failure, the ftell function returns -1L and stores an
    implementation-defined positive value in errno.

    One way to get a whole file into memory in a useful form is to
    buffer it in lines and make a linked list of those lines. An
    example in my ggets package does just that. See:

    <http://cbfalconer.home.att.net/download/ggets.zip>

    Any attempt to pre-allocate a buffer for the whole file is doomed,
    because you cannot reliably tell how big that buffer should be.
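
    For completeness, here is roughly what the checked version of the
    fseek/ftell sequence would look like on a binary stream. The name
    binary_stream_size is made up. Even with every return value tested,
    the result is only a hint: the standard also allows a binary stream
    not to support SEEK_END meaningfully, and for a text stream the
    value is not a character count at all.

```c
#include <stdio.h>

/* Hypothetical helper: size in bytes of a seekable *binary* stream,
 * or -1L if any call fails (for example on input from a terminal).
 * The file position is restored before returning. */
long binary_stream_size(FILE *fp)
{
    long start, size;

    if ((start = ftell(fp)) == -1L)       /* ftell returns long */
        return -1L;
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;
    if ((size = ftell(fp)) == -1L)
        return -1L;
    if (fseek(fp, start, SEEK_SET) != 0)  /* go back where we were */
        return -1L;
    return size;
}
```

    A reader that grows its buffer as it goes does not need this
    function at all, which is the point being made above.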

    --
    "If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers." - Keith Thompson
    CBFalconer, Feb 10, 2005
    #14
  15. Michael Mair

    jacob navia Guest

    For text files it is the same as
    above, but add:


    char *p1 = contents, *p2 = contents;
    int i = 0;
    while (i < actualBytesRead) {
        if (*p1 != '\r') {
            *p2++ = *p1;
        }
        p1++;
        i++;
    }
    *p2++ = 0;

    This is a thousand times more efficient than all those
    calls to realloc, or all those calls to fread.

    True, you will waste some bytes because you will read
    many \r that you later erase, allocating a slightly
    bigger buffer than needed but this is not very
    important in most applications...


    Note: You could make this more robust if you want to keep
    isolated \r (i.e. \r not followed by \n) in which case
    you can add the corresponding tests...
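
    A sketch of that variant, which drops a \r only when a \n follows
    it. The function name strip_crlf is invented for this example, and
    it assumes the buffer has room for one byte past len, as the
    len+1 allocation above provides.

```c
#include <stddef.h>

/* Hypothetical helper: remove, in place, each '\r' that is immediately
 * followed by '\n'; isolated '\r' characters are kept.
 * `len` is the number of valid bytes in `buf`, which must have room
 * for one more byte. Returns the new length. */
size_t strip_crlf(char *buf, size_t len)
{
    size_t src, dst = 0;

    for (src = 0; src < len; src++) {
        if (buf[src] == '\r' && src + 1 < len && buf[src + 1] == '\n')
            continue;                     /* drop the CR of a CR LF pair */
        buf[dst++] = buf[src];
    }
    buf[dst] = '\0';                      /* terminate the shortened text */
    return dst;
}
```

    Called as strip_crlf(contents, actualBytesRead), it returns the
    length of the cleaned-up text.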
    jacob navia, Feb 10, 2005
    #15
  16. Michael Mair

    cpg Guest

    Michael Mair wrote:

    > This is what I would do for binary files.
    > Essentially, I am looking for the text file equivalent of fread().


    I'm just curious, why would you do anything different for text/binary
    data files? The approach is the same, what you can do afterwards on
    the resultant buffer is the only thing that differs. Since a "text"
    file is a special case of a "raw binary" file, you only have to code the
    common functionality once (buffering in this case).

    Would you not simply perform raw reads into a temp buffer accumulating
    your overall file buffer until the entire file is read? Apply a filter
    afterwards for some sort of sanity checking that this file meets your
    requirements (ctype.h), then continue on.

    Obviously, if the file checks out as "text", then things like lines
    make sense. I would create "text" functions to operate on these buffers
    to fit your needs. Later on you may find a need to write some binary
    equivalents to do other tasks (a raw strstr() equivalent becomes
    particularly useful for searching binary data), and the buffering part
    is already done.

    Also, it's probably more useful to define a structure that abstracts
    these "buffers". That way, you can add functionality without breaking
    the interface.
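
    A minimal sketch of the two ideas in this post, a structure that
    wraps the buffer and a "raw strstr()" that tolerates embedded zero
    bytes. The names struct buffer and buffer_find are hypothetical,
    chosen only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer abstraction: the data plus its length, so that
   zero bytes inside the data are not mistaken for a terminator. */
struct buffer {
    unsigned char *data;
    size_t len;
};

/* Find the first occurrence of needle[0..nlen-1] in b.
   Returns a pointer into b->data, or NULL if there is no match.
   An empty needle matches at the start, as strstr() does. */
const unsigned char *buffer_find(const struct buffer *b,
                                 const void *needle, size_t nlen)
{
    size_t i;

    if (nlen == 0)
        return b->data;
    if (nlen > b->len)
        return NULL;
    for (i = 0; i + nlen <= b->len; i++) {
        if (memcmp(b->data + i, needle, nlen) == 0)
            return b->data + i;
    }
    return NULL;
}
```

    Because callers only see the struct and the function, fields such as
    a capacity or a file name could be added later without changing the
    call sites.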

    Have fun, cpg
    cpg, Feb 10, 2005
    #16
  17. Michael Mair

    jacob navia Guest

    CBFalconer wrote:
    > jacob wrote:
    >
    >>What about this?
    >>
    >>#include <stdio.h>
    >>#include <stdlib.h>
    >>#include <string.h>
    >>char *ReadFileIntoRam(char *fname,int *plen)
    >>{
    >> FILE *infile;
    >> char *contents;
    >> int actualBytesRead=0;
    >> unsigned int len;
    >>
    >> infile = fopen(fname,"rb");
    >> if (infile == NULL) {
    >> fprintf(stderr,"impossible to open %s\n",fname);
    >> return NULL;
    >> }
    >> fseek(infile,0,SEEK_END);
    >> len = ftell(infile);
    >> fseek(infile,0,SEEK_SET);
    >> contents = calloc(len+1,1);
    >> if (contents) {
    >> actualBytesRead = fread(contents,1,len,infile);
    >> }
    >> else {
    >> fprintf(stderr,"Can't allocate memory to read the file\n");
    >> }
    >> fclose(infile);
    >> *plen = actualBytesRead;
    >> return contents;
    >>}

    >
    >
    > No good.


    Please Chuck, it was a program written in a few minutes!

    > Note that ftell is meaningless for text files.

    That's why I opened in binary mode


    > It also returns a long, not an int.


    OK

    > You haven't even tested for failure
    > (which it will on input from a keyboard).

    The function receives a file name, Chuck. There is NO
    keyboard input...



    > Even if everything works
    > use of calloc is silly, why zero what you are about to fill.


    No. This dispenses with the zeroing of the last byte,
    maybe inefficient but it is a habit...


    > Any attempt to pre-allocate a buffer for the whole file is doomed,
    > because you cannot reliably tell how big that buffer should be.


    If you open it in binary mode yes, you can...
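
    For reference, the points raised in this exchange (ftell returns
    long, the positioning calls can fail, calloc is not needed) can be
    folded into the function roughly as below. This is a sketch, not
    either poster's code: read_file_checked is a hypothetical name, and
    it still assumes that fseek to SEEK_END works on a binary stream,
    which the C standard does not require an implementation to support
    meaningfully.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file, opened in binary mode, into a malloc'ed,
   zero-terminated buffer. Returns NULL on any failure; on success
   *plen receives the number of bytes read. */
char *read_file_checked(const char *fname, size_t *plen)
{
    FILE *infile;
    char *contents;
    long len;                       /* ftell returns long, not int */
    size_t got;

    infile = fopen(fname, "rb");
    if (infile == NULL)
        return NULL;

    /* Each positioning call can fail, e.g. on a pipe or a device. */
    if (fseek(infile, 0L, SEEK_END) != 0 ||
        (len = ftell(infile)) < 0 ||
        fseek(infile, 0L, SEEK_SET) != 0) {
        fclose(infile);
        return NULL;
    }

    contents = malloc((size_t)len + 1);     /* +1 for the terminator */
    if (contents != NULL) {
        got = fread(contents, 1, (size_t)len, infile);
        if (ferror(infile)) {
            free(contents);
            contents = NULL;
        } else {
            contents[got] = '\0';   /* replaces the calloc zeroing */
            *plen = got;
        }
    }
    fclose(infile);
    return contents;
}
```

    The caller frees the returned buffer. Whether the size reported by
    ftell is trustworthy remains the point under dispute in this thread.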
    jacob navia, Feb 10, 2005
    #17
  18. Michael Mair

    CBFalconer Guest

    infobahn wrote:
    > Al Bowers wrote:
    >>

    > <snip>
    >>
    >> Use fgets to copy into a buffer. And, then append to a
    >> expanding dynamically allocated char array. This is not unwieldy.
    >>

    .... snip ...
    >
    > You don't have to go to the well quite this often. You can keep
    > a max, and only realloc when the max is about to be exceeded.
    > Whenever you do this, multiply the not-enough-storage value by
    > some constant (some people double, others use 1.1 or 1.5 or
    > whatever) to decide how much to allocate next time.


    A certain Richard Heathfield has made available a routine for this
    approach, found in fgetline at:

    <http://users.powernet.co.uk/eton/c/fgetdata.html>

    while I prefer using my own ggets/fggets, which doesn't keep a
    history (thus having a much simpler calling sequence), and which
    can be found at:

    <http://cbfalconer.home.att.net/download/ggets.zip>
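
    The growth strategy quoted above (keep a maximum, reallocate only
    when it is about to be exceeded, multiply by a constant) might look
    like this when applied to a plain getc loop. It is a sketch only and
    is not taken from fgetline or ggets; the name slurp_stream, the
    starting size of 256 and the factor of 2 are all assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read everything remaining on fp into a malloc'ed, zero-terminated
   buffer, doubling the allocation whenever it is about to overflow.
   Returns NULL on read error or allocation failure. If plen is not
   NULL, *plen receives the number of characters read. */
char *slurp_stream(FILE *fp, size_t *plen)
{
    size_t cap = 256, len = 0;
    char *buf = malloc(cap), *tmp;
    int ch;

    if (buf == NULL)
        return NULL;
    while ((ch = getc(fp)) != EOF) {
        if (len + 1 >= cap) {               /* keep one byte for '\0' */
            if (cap > (size_t)-1 / 2 ||     /* doubling would overflow */
                (tmp = realloc(buf, cap * 2)) == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        buf[len++] = (char)ch;
    }
    if (ferror(fp)) {                       /* EOF was really an error */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (plen != NULL)
        *plen = len;
    return buf;
}
```

    Doubling keeps the number of realloc calls logarithmic in the file
    size; a smaller factor such as 1.5 trades more calls for less slack.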

    CBFalconer, Feb 10, 2005
    #18
  19. Michael Mair

    Michael Mair Guest

    cpg wrote:
    > Michael Mair wrote:
    >
    >
    >>This is what I would do for binary files.
    >>Essentially, I am looking for the text file equivalent of fread().

    >
    >
    > I'm just curious, why would you do anything different for text/binary
    > data files? The approach is the same, what you can do afterwards on
    > the resultant buffer is the only thing that differs. Since a "text"
    > file is a special case of a "raw binary" file, you only have to code the
    > common functionality once (buffering in this case).
    >
    > Would you not simply perform raw reads into a temp buffer accumulating
    > your overall file buffer until the entire file is read? Apply a filter
    > afterwards for some sort of sanity checking that this file meets your
    > requirements (ctype.h), then continue on.


    The thing is that I do not want to make _any_ assumptions, such as
    there being a one-to-one correspondence for certain byte ranges -- the
    standard does not guarantee that and even mentions that "Characters
    may have to be added, altered, or deleted ..."
    Moreover, if I want to move on to wide characters/multibyte characters,
    then I certainly will stick to the narrow path and not try to find
    convenient shortcuts.
    So, I will treat reading in a text file in a different manner than
    reading in a binary file if necessary. It is quite possible that
    fread() on a text stream will do what I want; then I will use it.
    I have no interest in sanity checks which work with the C locale
    but not every other locale as well.
    If there was a standard way to read in a binary file and then convert
    the resulting buffer into the "text" equivalent, then I would use this
    approach.


    > Obviously, if the file checks out as "text", then things like lines
    > make sense. I would create "text" functions to operate on these buffers
    > to fit your needs. Later on you may find a need to write some binary
    > equivalents to do other tasks (a raw strstr() equivalent becomes
    > particularly useful for searching binary data), and the buffering part
    > is already done.


    That is true in general but here I have a special requirement where
    I am certain that I will deal only with text files and the only possible
    extension is going for multibyte/wide characters. However, this will not
    be any problem as I essentially will only have to create wide char
    versions of my functions and get a "w" or "wc" into the called library
    functions.
    The only thing left is a "good" way to get a complete text file into
    a buffer. The organisation in lines does not play any role at all, so
    the question is using a getc loop vs. using something to obtain large
    chunks of characters from text files.
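
    The "large chunks" alternative can be sketched with fread on a
    stream opened in text mode, so that whatever translation the
    implementation performs is still applied. This is an illustration
    under assumptions: read_text_chunks and the CHUNK size of 4096 are
    made up for the example, and the buffer grows by one chunk at a
    time for simplicity.

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4096      /* arbitrary chunk size chosen for this sketch */

/* Read the rest of fp (typically opened with mode "r") in CHUNK-sized
   pieces into a malloc'ed, zero-terminated buffer. Returns NULL on
   read error or allocation failure. */
char *read_text_chunks(FILE *fp, size_t *plen)
{
    size_t len = 0, got;
    char *buf = malloc(CHUNK + 1), *tmp;

    if (buf == NULL)
        return NULL;
    for (;;) {
        /* There is always room for CHUNK bytes plus the terminator. */
        got = fread(buf + len, 1, CHUNK, fp);
        len += got;
        if (got < CHUNK)
            break;                  /* short count: end of file or error */
        tmp = realloc(buf, len + CHUNK + 1);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (plen != NULL)
        *plen = len;
    return buf;
}
```

    Since fread is specified in terms of repeated fgetc calls, it should
    deliver the same characters as a getc loop on the same stream; the
    difference is only in how many library calls are made.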


    > Also, it's probably more useful to define a structure that abstracts
    > these "buffers". That way, you can add functionality without breaking
    > the interface.


    True but in this case only overhead.


    > Have fun, cpg


    Thanks :)


    -Michael
    --
    E-Mail: Mine is a gmx dot de address.
    Michael Mair, Feb 10, 2005
    #19
  20. Michael Mair

    Eric Sosman Guest

    cpg wrote:
    > Michael Mair wrote:
    >
    >
    >>This is what I would do for binary files.
    >>Essentially, I am looking for the text file equivalent of fread().

    >
    >
    > I'm just curious, why would you do anything different for text/binary
    > data files? The approach is the same, what you can do afterwards on
    > the resultant buffer is the only thing that differs. Since a "text"
    > file is a special case of a "raw binary" file, you only have to code the
    > common functionality once (buffering in this case).
    >
    > Would you not simply perform raw reads into a temp buffer accumulating
    > your overall file buffer until the entire file is read? Apply a filter
    > afterwards for some sort of sanity checking that this file meets your
    > requirements (ctype.h), then continue on.


    There's the rub: What should the "filter" do? On
    one system I've used, for example, if you were to write
    the line "Hello, world!\n" to a text file and then read
    it back in binary, here are the bytes you would get:

    \015 \000 H e l l o , w o r l d ! \000

    Notice that the '\n' you wrote has vanished and that three
    new bytes have appeared out of thin air. The system in
    question knows how to translate this sequence of bytes back
    to "Hello, world!\n" -- but do *you* know how?

    By the way, the above illustrates the system's "usual"
    way of storing text in a file. The system actually provides
    six additional text formats, some of which permit variations.
    How many "filters" are you prepared to write, simply to avoid
    using what the C library already provides?

    (A hint for the curious: The company that bought the
    company that bought the company that made this system recently
    fired its CEO.)

    --
    Eric Sosman, Feb 10, 2005
    #20