Reading Words from File

Discussion in 'C Programming' started by dough, Oct 4, 2005.

  1. dough

    dough Guest

    I want to read in lines from a file and then separate the words so I
    can do a process on each of the words. Say the text file "readme.txt"
    contains the following:

    In the face of criticism from the left and right, President Bush
    insisted Tuesday that Harriet Miers is the nation's best-qualified
    candidate for the Supreme Court and assured skeptical conservatives
    that his lawyer...

    I would like to get each word into a char *s, so that s = "In" and I do
    something with s, then s = "the" and I do something with that,
    etc., with no advance knowledge of the length of any string, line, or
    run of whitespace.

    Here's what I have so far.

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    void process(char *s) /* what's here is not really important */
    {
        printf("%s", s);
    }

    int main() {

        char buffer[80];
        FILE *f = fopen("readme.txt", "r");
        char *s;

        while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
        {
            while( sscanf(buffer, "%s", s) ) /* scans for words in line */
            {
                process(s); /* do stuff to the words */
            }
        }

        fclose(f);
        return 0;

    }

    Also, is there any way to adjust the size of the buffer or reallocate
    the memory so it doesn't overflow and cause a segmentation fault?
    dough, Oct 4, 2005
    #1

  2. "dough" <> wrote in message
    news:...
    > I want to read in lines from a file and then seperate the words so i
    > can do a process on each of the words. Say the text file "readme.txt"
    > contains the following:
    >
    > In the face of criticism from the left and right, President Bush
    > insisted Tuesday that Harriet Miers is the nation's best-qualified
    > candidate for the Supreme Court and assured skeptical conservatives
    > that his lawyer...
    >
    > I could get an input to a char *s such that s = "In" and then i do
    > something with s, then s = "the" and then i do something with that,
    > etc. With no idea the length of any string or line or whitespace.


    I don't want to be harsh, but it seems to me the 2nd paragraph is off topic
    and unwise for a poster looking for help...

    Alex
    Alexei A. Frounze, Oct 4, 2005
    #2

  3. In article <>,
    dough <> wrote:
    :I want to read in lines from a file and then seperate the words so i
    :can do a process on each of the words.

    There is often a non-trivial semantic problem in deciding what
    a "word" is in such matters. For example, in

    "Oh!," he yelled (into his Hello-Kitty phone.)

    then if you go by whitespace you get "words" such as

    "Oh!," and (into and phone.) and Hello-Kitty

    which is usually not the breakdown you want.
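
    To make the breakdown above concrete, here is a minimal sketch of
    whitespace-only splitting. The helper names next_ws_word() and
    print_ws_words() are made up for this illustration; they are not part
    of the original poster's code.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copies the next whitespace-delimited token from *pp
   into out (at most outsz-1 characters, any excess is dropped) and advances
   *pp past it. outsz must be at least 1.
   Returns 1 if a token was found, 0 at the end of the string. */
int next_ws_word(const char **pp, char *out, size_t outsz)
{
    const char *p = *pp;
    size_t n = 0;

    while (*p != '\0' && isspace((unsigned char)*p))
        p++;                              /* skip leading whitespace */
    if (*p == '\0') {
        *pp = p;
        return 0;
    }
    while (*p != '\0' && !isspace((unsigned char)*p)) {
        if (n + 1 < outsz)
            out[n++] = *p;                /* keep room for the '\0' */
        p++;
    }
    out[n] = '\0';
    *pp = p;
    return 1;
}

/* Prints each token on its own line and returns how many were found. */
int print_ws_words(const char *text)
{
    char word[64];
    int count = 0;

    while (next_ws_word(&text, word, sizeof word)) {
        printf("[%s]\n", word);
        count++;
    }
    return count;
}
```

    Called on the example sentence, print_ws_words() reports seven tokens,
    and punctuation stays glued to them: ["Oh!,"], [(into] and [phone.)]
    come out exactly as described above.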
    --
    These .signatures are sold by volume, and not by weight.
    Walter Roberson, Oct 4, 2005
    #3
  4. Eric Sosman

    Eric Sosman Guest

    dough wrote On 10/04/05 14:39,:
    > I want to read in lines from a file and then seperate the words so i
    > can do a process on each of the words. Say the text file "readme.txt"
    > contains the following:
    >
    > In the face of criticism from the left and right, President Bush
    > insisted Tuesday that Harriet Miers is the nation's best-qualified
    > candidate for the Supreme Court and assured skeptical conservatives
    > that his lawyer...
    >
    > I could get an input to a char *s such that s = "In" and then i do
    > something with s, then s = "the" and then i do something with that,
    > etc. With no idea the length of any string or line or whitespace.
    >
    > Heres what I have so far.
    >
    > #include <ctype.h>
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <string.h>
    >
    > void process(char *s) /* whats here is not really important *
    > {
    > printf("%s", s);
    > }
    >
    > int main() {
    >
    > char buffer[80];
    > FILE *f = fopen("readme.txt", "r");
    > char *s;


    It would be a good idea to test `f == NULL' before
    proceeding ...
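
    A minimal sketch of that check follows. The helper name
    open_for_reading() is hypothetical, and the error text relies on
    fopen() setting errno, which POSIX guarantees but ISO C does not.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: opens path for reading. On failure it prints the
   reason to stderr and returns NULL so the caller can stop early.
   Assumption: fopen() sets errno on failure (true on POSIX systems). */
FILE *open_for_reading(const char *path)
{
    FILE *f = fopen(path, "r");

    if (f == NULL)
        fprintf(stderr, "Cannot open %s: %s\n", path, strerror(errno));
    return f;
}
```

    In main() the call would look like
    FILE *f = open_for_reading("readme.txt"); followed by
    if (f == NULL) return EXIT_FAILURE; before any fgets() call.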

    > while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
    > {
    > while( sscanf(buffer, "%s", s) ) /* scans for words in line */


    Here's a problem: `s' doesn't point to anything, so
    when sscanf() locates a word and tries to copy it to the
    memory `s' points at, all manner of mischief can ensue.

    > {
    > process(s); /* do stuff to the words */
    > }
    > }
    >
    > fclose(f);
    > return 0;
    >
    > }



    > Also, is there anyway to adjust the size of the buffer or reallocate
    > the memory so it doesn't overflow and get a seg error.


    If you used malloc() to create the space for `buffer', you
    could use realloc() to enlarge it. But the immediate problem
    is not the size of `buffer', but the uninitialized `s'.

    Your overall task sounds like a job for the much-maligned
    strtok() function. However, see Walter Roberson's post for
    some of the pitfalls of using simple string-bashing to separate
    "words" from their surroundings.

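    For reference, a minimal sketch of the strtok() approach. The names
    split_line() and show_word() are invented for this example, and the
    delimiter set " \t\r\n" is an assumption about what counts as
    whitespace in the input.

```c
#include <stdio.h>
#include <string.h>

/* Splits line in place on whitespace and calls handle() on each piece.
   Returns the number of pieces found.
   Note: strtok() overwrites delimiters in line with '\0' and keeps hidden
   static state, so line must be writable and calls must not be nested. */
int split_line(char *line, void (*handle)(const char *))
{
    static const char delims[] = " \t\r\n";
    int count = 0;
    char *word;

    for (word = strtok(line, delims); word != NULL; word = strtok(NULL, delims)) {
        handle(word);
        count++;
    }
    return count;
}

/* Stand-in for the original poster's process(). */
void show_word(const char *s)
{
    printf("[%s]\n", s);
}
```

    Inside the fgets() loop this becomes split_line(buffer, show_word).
    Runs of several spaces are treated as a single separator, and the
    buffer is modified, so copy it first if the original line is still
    needed.
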
    --
    Eric Sosman, Oct 4, 2005
    #4
  5. Walter Roberson <-cnrc.gc.ca> wrote:

    > There is often a non-trivial semantic problem in deciding what
    > a "word" is in such matters. For example, in


    > "Oh!," he yelled (into his Hello-Kitty phone.)


    I must say that that is a truly bizarre example sentence :) That
    aside, it seems to me that assuming a "word" is a sequence of
    consecutive alpha characters would yield better results, at least
    depending on what OP wants to do with the "words" once he has them.
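
    A minimal sketch of that definition, using isalpha() to pick out runs
    of letters. The helper name next_alpha_word() is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: finds the next run of alphabetic characters at or
   after *pp, copies it into out (truncated to outsz-1 characters), and
   advances *pp past the run. outsz must be at least 1.
   Returns 1 if a run was found, 0 otherwise (out is then left untouched). */
int next_alpha_word(const char **pp, char *out, size_t outsz)
{
    const char *p = *pp;
    size_t n = 0;

    while (*p != '\0' && !isalpha((unsigned char)*p))
        p++;                              /* skip everything that is not a letter */
    if (*p == '\0') {
        *pp = p;
        return 0;
    }
    while (isalpha((unsigned char)*p)) {  /* '\0' is not alphabetic, so this stops */
        if (n + 1 < outsz)
            out[n++] = *p;
        p++;
    }
    out[n] = '\0';
    *pp = p;
    return 1;
}
```

    On Walter's example sentence this yields eight words, starting with
    "Oh" and ending with "phone"; the punctuation is gone, but
    "Hello-Kitty" is split into "Hello" and "Kitty".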

    --
    Christopher Benson-Manica | I *should* know what I'm talking about - if I
    ataru(at)cyberspace.org | don't, I need to know. Flames welcome.
    Christopher Benson-Manica, Oct 4, 2005
    #5
  6. Hemanth

    Hemanth Guest

    dough wrote:
    > I want to read in lines from a file and then seperate the words so i
    > can do a process on each of the words.



    Use the strtok() function to split a string into words (with
    whitespace or any other separator you want).


    > char buffer[80];
    > FILE *f = fopen("readme.txt", "r");
    > while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
    >
    > Also, is there anyway to adjust the size of the buffer or reallocate
    > the memory so it doesn't overflow and get a seg error.



    The fgets() call reads until num-1 characters have been read (79 in
    this case) or a newline or EOF is reached, whichever happens first.
    So I don't think you need a realloc in this case.
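
    One consequence of that behavior: a line longer than the buffer is not
    lost, it arrives in several pieces, and a word that straddles two
    pieces gets cut in half. A minimal sketch of how to tell the cases
    apart follows; the helper name read_piece() is made up for this
    example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one piece with fgets() and reports whether it
   ended a line.
   Returns -1 at end of file or on a read error,
            1 if buf ends with '\n' (a whole line, or the tail of a long one),
            0 if the line continues in the next piece
              (or the file ended without a final newline). */
int read_piece(char *buf, int size, FILE *f)
{
    size_t len;

    if (fgets(buf, size, f) == NULL)
        return -1;
    len = strlen(buf);
    return len > 0 && buf[len - 1] == '\n';
}
```

    With an 8-byte buffer and the input line "abcdefghij", the first call
    stores "abcdefg" and returns 0, and the second call stores the rest of
    the line including its newline and returns 1.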


    HTH,
    Hemanth
    Hemanth, Oct 4, 2005
    #6
  7. Michael Mair

    Michael Mair Guest

    dough wrote:
    > I want to read in lines from a file and then seperate the words so i
    > can do a process on each of the words. Say the text file "readme.txt"
    > contains the following:
    >
    > In the face of criticism from the left and right, President Bush
    > insisted Tuesday that Harriet Miers is the nation's best-qualified
    > candidate for the Supreme Court and assured skeptical conservatives
    > that his lawyer...
    >
    > I could get an input to a char *s such that s = "In" and then i do
    > something with s, then s = "the" and then i do something with that,
    > etc. With no idea the length of any string or line or whitespace.


    I am not sure what your problem is.
    When you have a problem, please help us help you:
    State what you want to achieve (this part seems clear) and
    what about your solution did not work.
    Otherwise, everyone tells you about A because you seemed to
    ask for B while meaning C...

    >
    > Heres what I have so far.
    >
    > #include <ctype.h>
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <string.h>
    >
    > void process(char *s) /* whats here is not really important *
    > {
    > printf("%s", s);
    > }
    >
    > int main() {
    >
    > char buffer[80];
    > FILE *f = fopen("readme.txt", "r");
    > char *s;


    Check whether f is != NULL. If you omitted the check for
    brevity, then write a comment.

    > while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
    > {
    > while( sscanf(buffer, "%s", s) ) /* scans for words in line */
    > {
    > process(s); /* do stuff to the words */
    > }
    > }


    Okay, so what is the problem here? About everything:
    1) You may inadvertently split a word if your buffer is not
    long enough (not critical).
    2) You always scan from the same position (buffer is effectively &buffer[0]).
    3) You read your string into memory pointed to by an uninitialized pointer.

    Consider
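    char s[sizeof buffer] = "", *tmp = NULL;
    int n = 0;
    while (....)
    {
        tmp = buffer;
        /* sscanf() returns EOF (which is nonzero) at the end of the string,
           so test for == 1; %n stores the number of characters consumed,
           including any leading whitespace that %s skipped */
        while ( sscanf(tmp, "%s%n", s, &n) == 1 )
        {
            process(s);
            tmp += n;
        }
        /* a */
    }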
    This solves 2) and 3).
    Another solution is the use of strtok() etc.

    If you check at point "a" whether buffer[strlen(buffer)-1]=='\n',
    then you can also detect instances of 1).
    However, this may not be what you are looking for (see below)

    >
    > fclose(f);
    > return 0;
    >
    > }
    >
    > Also, is there anyway to adjust the size of the buffer or reallocate
    > the memory so it doesn't overflow and get a seg error.


    realloc() helps you do that.
    Have a look at the comp.lang.c archives to see how to use it.
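
    As a starting point, here is a minimal sketch of one common pattern:
    keep calling fgets() into the unused tail of a malloc'd buffer and
    double the buffer with realloc() until the newline shows up. The
    helper name read_whole_line() and the starting capacity of 16 are
    arbitrary choices for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one whole line of any length from f into a
   malloc'd buffer that the caller must free(). The trailing '\n', if
   present, is kept. Returns NULL at end of file, if nothing could be
   read, or if memory runs out.
   Assumption: lines are shorter than INT_MAX characters. */
char *read_whole_line(FILE *f)
{
    size_t cap = 16, len = 0;
    char *line = malloc(cap);

    if (line == NULL)
        return NULL;
    while (fgets(line + len, (int)(cap - len), f) != NULL) {
        char *tmp;

        len += strlen(line + len);
        if (len > 0 && line[len - 1] == '\n')
            return line;                  /* reached the end of the line */
        tmp = realloc(line, cap * 2);     /* on failure the old block is still valid */
        if (tmp == NULL) {
            free(line);
            return NULL;
        }
        line = tmp;
        cap *= 2;
    }
    if (len == 0) {                       /* nothing read: end of file or error */
        free(line);
        return NULL;
    }
    return line;                          /* last line had no '\n' */
}
```

    Assigning the result of realloc() to a temporary first matters: if the
    call fails it returns NULL and the original block would otherwise be
    leaked.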

    If you do not need the words in context, you can also use getc(), which
    may be clearer:

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    #define START_BUFSIZE 20


    void process(const char *s);
    int resize_buffer (char **buf, size_t *len);


    int main (void)
    {
        FILE *f;
        char *s = NULL;
        size_t length = 0;
        int input;

        if (NULL == (f = fopen("readme.txt", "r")))
        {
            fprintf(stderr, "Cannot open file\n");
            exit(EXIT_FAILURE);
        }
        if (NULL == (s = malloc((START_BUFSIZE+1) * sizeof *s)))
        {
            fprintf(stderr, "Error on allocating memory for s\n");
            fclose(f);
            exit(EXIT_FAILURE);
        }
        length = START_BUFSIZE;

        do /* ... while (input != EOF) */
        {
            size_t curr = 0;

            /* Read up to the first whitespace */
            while (!isspace(input = getc(f)) && input != EOF)
            {
                s[curr++] = input;
                if (curr == length)
                {
                    if (resize_buffer(&s, &length))
                    {
                        /* perform error handling */
                        break;
                    }
                }
            }
            /* Make s a string */
            s[curr] = '\0';

            if (curr)
                process(s);

            /* Read up to the first non-whitespace */
            while ((input = getc(f)) != EOF)
            {
                putchar('*');
                if (!isspace(input))
                {
                    ungetc(input, f);
                    break;
                }
            }
        } while (input != EOF);

        free(s);
        fclose(f);

        putchar('\n');

        return 0;
    }

    void process(const char *s) /* whats here is not really important */
    {
        printf("%s", s); fflush (stdout);
    }

    int resize_buffer (char **buf, size_t *len)
    {
        /* Using mybuf and mylen for readability */
        char *mybuf = *buf;
        size_t mylen = *len;

        char *tmp;
        size_t destlen = 2*mylen+1;

        /* A */
        if (NULL == (tmp = realloc(mybuf, destlen)))
        {
            return 1;
        }
        mybuf = tmp;
        mylen = destlen - 1;

        /* write back to parameters */
        *buf = mybuf;
        *len = mylen;

        return 0;
    }


    Cheers
    Michael
    --
    E-Mail: Mine is an /at/ gmx /dot/ de address.
    Michael Mair, Oct 4, 2005
    #7
  8. In article <dhumdl$j2o$>,
    Christopher Benson-Manica <> wrote:
    >Walter Roberson <-cnrc.gc.ca> wrote:


    >> There is often a non-trivial semantic problem in deciding what
    >> a "word" is in such matters.


    >aside, it seems to me that assuming a "word" is a sequence of
    >consecutive alpha characters would yield better results, at least
    >depending on what OP wants to do with the "words" once he has them.


    Using "alpha" as the boundary definition runs into difficulties
    with possessives, contractions, joined-words, and words such as
    re-enter in which the dash indicates separation of vowels that
    would otherwise form a diphthong. It would likely also run
    into problems with salutations such as Mr. and abbreviations such as
    etc., in which the period is really part of the word.
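
    One partial workaround, sketched below under the assumption that an
    apostrophe or hyphen counts as part of a word only when a letter
    follows it immediately. The helper names continues_word() and
    next_joined_word() are hypothetical, and the rule still strips the
    period from "Mr." and "etc.", so it does not address that part of the
    problem.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns nonzero if p[0] continues a word that has already started:
   either a letter, or an apostrophe or hyphen followed directly by a
   letter. */
static int continues_word(const char *p)
{
    if (isalpha((unsigned char)p[0]))
        return 1;
    return (p[0] == '\'' || p[0] == '-') && isalpha((unsigned char)p[1]);
}

/* Hypothetical helper: like a plain letters-only scanner, but keeps
   possessives, contractions, and hyphenated words in one piece.
   Copies the next word into out (truncated to outsz-1 characters) and
   advances *pp. Returns 1 if a word was found, 0 otherwise. */
int next_joined_word(const char **pp, char *out, size_t outsz)
{
    const char *p = *pp;
    size_t n = 0;

    while (*p != '\0' && !isalpha((unsigned char)*p))
        p++;                              /* a word must start with a letter */
    if (*p == '\0') {
        *pp = p;
        return 0;
    }
    while (continues_word(p)) {
        if (n + 1 < outsz)
            out[n++] = *p;
        p++;
    }
    out[n] = '\0';
    *pp = p;
    return 1;
}
```

    With this rule "nation's" and "best-qualified" from the original
    poster's sample text each come back as a single word, while a trailing
    comma or closing parenthesis is still dropped.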
    --
    Okay, buzzwords only. Two syllables, tops. -- Laurie Anderson
    Walter Roberson, Oct 4, 2005
    #8
  9. Eric Sosman

    Eric Sosman Guest

    Christopher Benson-Manica wrote On 10/04/05 15:50,:
    > Walter Roberson <-cnrc.gc.ca> wrote:
    >
    >
    >>There is often a non-trivial semantic problem in deciding what
    >>a "word" is in such matters. For example, in

    >
    >
    >> "Oh!," he yelled (into his Hello-Kitty phone.)

    >
    >
    > I must say that that is a truly bizarre example sentence :) That
    > aside, it seems to me that assuming a "word" is a sequence of
    > consecutive alpha characters would yield better results, at least
    > depending on what OP wants to do with the "words" once he has them.


    This is a reasonable 1st approximation, but its tendency
    to generate non-words (e.g., "st") isn't desirable.

    --
    Eric Sosman, Oct 4, 2005
    #9
  10. Barry

    Barry Guest

    "dough" <> wrote in message
    news:...
    > I want to read in lines from a file and then seperate the words so i
    > can do a process on each of the words. Say the text file "readme.txt"
    > contains the following:
    >
    > In the face of criticism from the left and right, President Bush
    > insisted Tuesday that Harriet Miers is the nation's best-qualified
    > candidate for the Supreme Court and assured skeptical conservatives
    > that his lawyer...
    >
    > I could get an input to a char *s such that s = "In" and then i do
    > something with s, then s = "the" and then i do something with that,
    > etc. With no idea the length of any string or line or whitespace.
    >
    > Heres what I have so far.
    >
    > #include <ctype.h>
    > #include <stdio.h>
    > #include <stdlib.h>
    > #include <string.h>
    >
    > void process(char *s) /* whats here is not really important *
    > {
    > printf("%s", s);
    > }
    >
    > int main() {
    >
    > char buffer[80];
    > FILE *f = fopen("readme.txt", "r");
    > char *s;
    >
    > while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
    > {
    > while( sscanf(buffer, "%s", s) ) /* scans for words in line */
    > {
    > process(s); /* do stuff to the words */
    > }
    > }
    >
    > fclose(f);
    > return 0;
    >
    > }
    >
    > Also, is there anyway to adjust the size of the buffer or reallocate
    > the memory so it doesn't overflow and get a seg error.
    >


    "process" is a terrible name for a function in any context.

    Barry
    Barry, Oct 4, 2005
    #10
  11. Mabden

    Mabden Guest

    "Michael Mair" <> wrote in message
    news:...
    > dough wrote:
    > > I want to read in lines from a file and then seperate the words so i
    > > can do a process on each of the words. Say the text file "readme.txt"
    > > contains the following:


    Interesting. No one has ever thought of doing that before. Where did you
    come up with such a great idea for a program? It's unlike anything I've
    ever heard of...

    > > Also, is there anyway to adjust the size of the buffer or reallocate
    > > the memory so it doesn't overflow and get a seg error.

    >
    > realloc() helps you do that.
    > Have a look at the comp.lang.c archives to see how to use it.



    That would be like studying. If he wanted to study he would go to
    school.

    >
    > If you do not need the words in context, you also use getc() which
    > may be clearer:
    >


    <Homework answers snipped>

    Nice job, you got him an A-.

    --
    Mabden
    Mabden, Oct 4, 2005
    #11
  12. On 4 Oct 2005 11:39:39 -0700, "dough" <> wrote:

    >I want to read in lines from a file and then seperate the words so i
    >can do a process on each of the words. Say the text file "readme.txt"
    >contains the following:


    It would be nice if you mentioned what your problem was.

    snip

    >Heres what I have so far.
    >
    >#include <ctype.h>
    >#include <stdio.h>
    >#include <stdlib.h>
    >#include <string.h>
    >
    >void process(char *s) /* whats here is not really important *
    >{
    > printf("%s", s);
    >}
    >
    >int main() {
    >
    >char buffer[80];
    >FILE *f = fopen("readme.txt", "r");
    >char *s;
    >
    >while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
    >{
    > while( sscanf(buffer, "%s", s) ) /* scans for words in line */


    s doesn't point anywhere sscanf can write to. This invokes undefined
    behavior.

    > {
    > process(s); /* do stuff to the words */
    > }
    >}
    >
    >fclose(f);
    >return 0;
    >
    >}
    >
    >Also, is there anyway to adjust the size of the buffer or reallocate
    >the memory so it doesn't overflow and get a seg error.


    The seg error you experience has nothing to do with buffer, since you
    never overflow it. It has everything to do with failing to have s
    point somewhere.
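
    A minimal sketch of the fix: give the word real storage and cap the
    conversion width so sscanf() cannot write past it. The helper name
    first_word() and the size 80 (chosen to match the poster's buffer) are
    assumptions for this example.

```c
#include <stdio.h>

/* Hypothetical helper: copies the first whitespace-delimited word of text
   into word, which must have room for at least 80 characters. The width
   79 in the format leaves one byte for the terminating '\0'.
   Returns 1 if a word was stored, 0 if text contains no word. */
int first_word(const char *text, char word[80])
{
    /* sscanf() returns EOF, not 0, when it runs out of input,
       so compare against 1 rather than testing for nonzero. */
    return sscanf(text, "%79s", word) == 1;
}
```

    In the original program this means declaring char s[80]; instead of
    char *s; and using "%79s" in the format string.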


    <<Remove the del for email>>
    Barry Schwarz, Oct 5, 2005
    #12
  13. Michael Mair

    Michael Mair Guest

    Mabden wrote:
    > "Michael Mair" <> wrote in message
    > news:...
    >

    [snip]

    >>If you do not need the words in context, you also use getc() which
    >>may be clearer:

    >
    > <Homework answers snipped>
    >
    > Nice job you get him an A-.


    The original message was not too obviously a homework question
    to me and contained a first shot at the problem, so I decided
    to give the OP the benefit of the doubt. If "dough" posts something
    like that again or does not respond to the answer he or she got
    in this thread, I won't.


    Cheers
    Michael
    --
    E-Mail: Mine is an /at/ gmx /dot/ de address.
    Michael Mair, Oct 5, 2005
    #13