Reading Words from File

dough

I want to read in lines from a file and then separate the words so I
can do a process on each of the words. Say the text file "readme.txt"
contains the following:

In the face of criticism from the left and right, President Bush
insisted Tuesday that Harriet Miers is the nation's best-qualified
candidate for the Supreme Court and assured skeptical conservatives
that his lawyer...

I could get an input to a char *s such that s = "In" and then I do
something with s, then s = "the" and then I do something with that,
etc., with no idea of the length of any string, line, or whitespace.

Here's what I have so far.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void process(char *s) /* what's here is not really important */
{
    printf("%s", s);
}

int main() {

    char buffer[80];
    FILE *f = fopen("readme.txt", "r");
    char *s;

    while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
    {
        while( sscanf(buffer, "%s", s) ) /* scans for words in line */
        {
            process(s); /* do stuff to the words */
        }
    }

    fclose(f);
    return 0;

}

Also, is there any way to adjust the size of the buffer or reallocate
the memory so it doesn't overflow and get a seg error?
 
Alexei A. Frounze

dough said:
I want to read in lines from a file and then separate the words so I
can do a process on each of the words. Say the text file "readme.txt"
contains the following:

In the face of criticism from the left and right, President Bush
insisted Tuesday that Harriet Miers is the nation's best-qualified
candidate for the Supreme Court and assured skeptical conservatives
that his lawyer...

I could get an input to a char *s such that s = "In" and then I do
something with s, then s = "the" and then I do something with that,
etc., with no idea of the length of any string, line, or whitespace.

I don't want to be harsh, but it seems to me the 2nd paragraph is off topic
and unwise for a poster looking for help...

Alex
 
Walter Roberson

:I want to read in lines from a file and then separate the words so I
:can do a process on each of the words.

There is often a non-trivial semantic problem in deciding what
a "word" is in such matters. For example, in

"Oh!," he yelled (into his Hello-Kitty phone.)

then if you go by whitespace you get "words" such as

"Oh!," and (into and phone.) and Hello-Kitty

which is usually not the breakdown you want.
 
Eric Sosman

dough wrote on 10/04/05 14:39:
I want to read in lines from a file and then separate the words so I
can do a process on each of the words. Say the text file "readme.txt"
contains the following:

In the face of criticism from the left and right, President Bush
insisted Tuesday that Harriet Miers is the nation's best-qualified
candidate for the Supreme Court and assured skeptical conservatives
that his lawyer...

I could get an input to a char *s such that s = "In" and then I do
something with s, then s = "the" and then I do something with that,
etc., with no idea of the length of any string, line, or whitespace.

Here's what I have so far.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void process(char *s) /* what's here is not really important */
{
printf("%s", s);
}

int main() {

char buffer[80];
FILE *f = fopen("readme.txt", "r");
char *s;

It would be a good idea to test `f == NULL' before
proceeding ...

while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
{
while( sscanf(buffer, "%s", s) ) /* scans for words in line */

Here's a problem: `s' doesn't point to anything, so
when sscanf() locates a word and tries to copy it to the
memory `s' points at, all manner of mischief can ensue.

{
process(s); /* do stuff to the words */
}
}

fclose(f);
return 0;

}

Also, is there any way to adjust the size of the buffer or reallocate
the memory so it doesn't overflow and get a seg error?

If you used malloc() to create the space for `buffer', you
could use realloc() to enlarge it. But the immediate problem
is not the size of `buffer', but the uninitialized `s'.

Your overall task sounds like a job for the much-maligned
strtok() function. However, see Walter Roberson's post for
some of the pitfalls of using simple string-bashing to separate
"words" from their surroundings.
 
Christopher Benson-Manica

Walter Roberson said:
There is often a non-trivial semantic problem in deciding what
a "word" is in such matters. For example, in
"Oh!," he yelled (into his Hello-Kitty phone.)

I must say that that is a truly bizarre example sentence :) That
aside, it seems to me that assuming a "word" is a sequence of
consecutive alpha characters would yield better results, at least
depending on what OP wants to do with the "words" once he has them.
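
As a rough sketch of that idea, a "word" can be pulled out as a run of characters for which isalpha() is true. The helper name next_alpha_word and its interface are assumptions made up for this illustration, and isalpha() depends on the current locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: finds the next run of alphabetic characters at or
   after *p. On success it stores the run's length in *len, advances *p
   past the run, and returns a pointer to the run's first character.
   Returns NULL when no alphabetic characters remain. The run is not
   '\0'-terminated; use the length. */
const char *next_alpha_word(const char **p, size_t *len)
{
    const char *s = *p;
    const char *start;

    /* Skip everything that is not a letter. */
    while (*s != '\0' && !isalpha((unsigned char)*s))
        s++;
    if (*s == '\0') {
        *p = s;
        return NULL;
    }

    /* Collect the run of letters. */
    start = s;
    while (isalpha((unsigned char)*s))
        s++;

    *len = (size_t)(s - start);
    *p = s;
    return start;
}
```

Each word can be printed with printf("%.*s\n", (int)len, word). Note that this splits Hello-Kitty into two words.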
 
Hemanth

dough said:
I want to read in lines from a file and then separate the words so I
can do a process on each of the words.


Use the strtok() function to split a string into words (using
whitespace or any other separator you want).
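
A minimal sketch of the strtok() approach, assuming whitespace is the only separator. The helper name split_words and the fixed-size pointer array are assumptions for illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits `line` in place on whitespace and stores up
   to `max` word pointers in `words`. Returns the number of words stored.
   strtok() overwrites the separators in `line` with '\0' and keeps hidden
   static state, so `line` must be writable and calls for two different
   strings must not be interleaved. */
size_t split_words(char *line, char **words, size_t max)
{
    static const char delims[] = " \t\r\n\f\v";
    size_t n = 0;
    char *tok = strtok(line, delims);

    while (tok != NULL && n < max) {
        words[n++] = tok;
        tok = strtok(NULL, delims);
    }
    return n;
}
```

Called on each buffer that fgets() fills, this yields "In", "the", "face", and so on, with punctuation still attached to the words.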

char buffer[80];
FILE *f = fopen("readme.txt", "r");
while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */

Also, is there any way to adjust the size of the buffer or reallocate
the memory so it doesn't overflow and get a seg error?


The fgets() call stops once num-1 characters have been read (79 in
this case), or at a newline or EOF, whichever happens first, so it
cannot overflow the buffer. I don't think you need a realloc in this case.
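
To illustrate, here is a small sketch showing that a line longer than the buffer simply arrives in several pieces. The helper name read_chunk and its `complete` flag are assumptions for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads the next piece of a line with fgets() and
   reports through *complete whether that piece ends with a newline.
   Returns buf, or NULL at end of file (or on a read error). */
char *read_chunk(char *buf, int size, FILE *f, int *complete)
{
    size_t n;

    if (fgets(buf, size, f) == NULL)
        return NULL;

    n = strlen(buf);
    *complete = (n > 0 && buf[n - 1] == '\n');
    return buf;
}
```

With an 8-byte buffer, the line "candidate for" comes back as "candida" (incomplete) followed by "te for\n" (complete), so a word can be cut in two at a chunk boundary.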


HTH,
Hemanth
 
Michael Mair

dough said:
I want to read in lines from a file and then separate the words so I
can do a process on each of the words. Say the text file "readme.txt"
contains the following:

In the face of criticism from the left and right, President Bush
insisted Tuesday that Harriet Miers is the nation's best-qualified
candidate for the Supreme Court and assured skeptical conservatives
that his lawyer...

I could get an input to a char *s such that s = "In" and then I do
something with s, then s = "the" and then I do something with that,
etc., with no idea of the length of any string, line, or whitespace.

I am not sure what your problem is.
When you have a problem, please help us help you:
State what you want to achieve (this part seems clear) and
what about your solution did not work.
Otherwise, everyone tells you about A because you seemed to
ask for B while meaning C...

Here's what I have so far.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void process(char *s) /* what's here is not really important */
{
printf("%s", s);
}

int main() {

char buffer[80];
FILE *f = fopen("readme.txt", "r");
char *s;

Check whether f != NULL. If you omitted the check for
brevity, then say so in a comment.

while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
{
while( sscanf(buffer, "%s", s) ) /* scans for words in line */
{
process(s); /* do stuff to the words */
}
}

Okay, so what is the problem here? About everything:
1) You may inadvertently split a word if your buffer is not
long enough (not critical).
2) You always scan from the same position (buffer is effectively &buffer[0]).
3) You read your string into memory pointed to by an uninitialized pointer.

Consider
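char s[sizeof buffer] = "", *tmp = NULL;
int used = 0;
while (....)
{
    tmp = buffer;
    /* sscanf() returns EOF, which is nonzero, at the end of the line,
       so compare against 1; %n stores the number of characters consumed,
       including any skipped leading whitespace */
    while ( sscanf(tmp, "%s%n", s, &used) == 1 )
    {
        process(s);
        tmp += used;
    }
    /* a */
}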
This solves 2) and 3).
Another solution is the use of strtok() etc.

If you check at point "a" whether buffer[strlen(buffer)-1]=='\n',
then you can also detect instances of 1).
However, this may not be what you are looking for (see below).

fclose(f);
return 0;

}

Also, is there any way to adjust the size of the buffer or reallocate
the memory so it doesn't overflow and get a seg error?

realloc() helps you do that.
Have a look at the comp.lang.c archives to see how to use it.
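
For reference, a minimal sketch of growing a line buffer with realloc() while reading with fgets(). The helper name read_line, the starting size of 80, and the doubling policy are all assumptions for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one whole line of any length from f into a
   malloc'd buffer that the caller must free(). The trailing newline, if
   any, is kept. Returns NULL at end of file or on allocation failure;
   a read error is treated like end of file. */
char *read_line(FILE *f)
{
    size_t cap = 80, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), f) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                 /* complete line */
        if (len + 1 == cap) {           /* buffer full: double it */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {          /* old block is still valid */
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (len == 0) {                     /* nothing read: end of file */
        free(buf);
        return NULL;
    }
    return buf;                         /* last line had no newline */
}
```

Assigning the result of realloc() to a temporary first matters: on failure it returns NULL and leaves the original block allocated, so overwriting buf directly would leak it.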

If you do not need the words in context, you can also use getc(), which
may be clearer:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define START_BUFSIZE 20


void process(const char *s);
int resize_buffer (char **buf, size_t *len);


int main (void)
{
    FILE *f;
    char *s = NULL;
    size_t length = 0;
    int input;

    if (NULL == (f = fopen("readme.txt", "r")))
    {
        fprintf(stderr, "Cannot open file\n");
        exit(EXIT_FAILURE);
    }
    if (NULL == (s = malloc((START_BUFSIZE+1) * sizeof *s)))
    {
        fprintf(stderr, "Error on allocating memory for s\n");
        fclose(f);
        exit(EXIT_FAILURE);
    }
    length = START_BUFSIZE;

    do /* ... while (input != EOF) */
    {
        size_t curr = 0;

        /* Read up to the first whitespace */
        while (!isspace(input = getc(f)) && input != EOF)
        {
            s[curr++] = input;
            if (curr == length)
            {
                if (resize_buffer(&s, &length))
                {
                    /* perform error handling */
                    break;
                }
            }
        }
        /* Make s a string */
        s[curr] = '\0';

        if (curr)
            process(s);

        /* Read up to the first non-whitespace */
        while ((input = getc(f)) != EOF)
        {
            putchar('*');
            if (!isspace(input))
            {
                ungetc(input, f);
                break;
            }
        }
    } while (input != EOF);

    free(s);
    fclose(f);

    putchar('\n');

    return 0;
}

void process(const char *s) /* whats here is not really important */
{
    printf("%s", s); fflush (stdout);
}

int resize_buffer (char **buf, size_t *len)
{
    /* Using mybuf and mylen for readability */
    char *mybuf = *buf;
    size_t mylen = *len;

    char *tmp;
    size_t destlen = 2*mylen+1;

    /* A */
    if (NULL == (tmp = realloc(mybuf, destlen)))
    {
        return 1;
    }
    mybuf = tmp;
    mylen = destlen - 1;

    /* write back to parameters */
    *buf = mybuf;
    *len = mylen;

    return 0;
}


Cheers
Michael
 
Walter Roberson

aside, it seems to me that assuming a "word" is a sequence of
consecutive alpha characters would yield better results, at least
depending on what OP wants to do with the "words" once he has them.

Using "alpha" as the boundary definition runs into difficulties
with possessives, contractions, joined-words, and words such as
re-enter in which the dash indicates separation of vowels that
would otherwise form a diphthong. It would likely also run
into problems with salutations such as Mr. and abbreviations such
as etc., in which the period is really part of the word.
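
One possible refinement, sketched under the assumption that an apostrophe or hyphen counts as part of a word only when a letter follows it. The helper name next_joined_word is made up for this example, and the rule still does not handle Mr. or etc.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: like a plain isalpha() scan, but an apostrophe or
   hyphen that is immediately followed by a letter stays inside the word,
   so "nation's" and "re-enter" come out whole. Stores the word length in
   *len, advances *p past the word, and returns a pointer to its first
   character, or NULL when no word is left. */
const char *next_joined_word(const char **p, size_t *len)
{
    const unsigned char *s = (const unsigned char *)*p;
    const unsigned char *start;

    /* A word must begin with a letter. */
    while (*s != '\0' && !isalpha(*s))
        s++;
    if (*s == '\0') {
        *p = (const char *)s;
        return NULL;
    }

    start = s;
    for (;;) {
        if (isalpha(*s))
            s++;
        else if ((*s == '\'' || *s == '-') && isalpha(s[1]))
            s += 2;     /* joiner plus the letter after it */
        else
            break;
    }

    *len = (size_t)(s - start);
    *p = (const char *)s;
    return (const char *)start;
}
```

Leading quotes and trailing punctuation are still dropped, so 'phone.' yields phone.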
 
Eric Sosman

Christopher Benson-Manica wrote on 10/04/05 15:50:
I must say that that is a truly bizarre example sentence :) That
aside, it seems to me that assuming a "word" is a sequence of
consecutive alpha characters would yield better results, at least
depending on what OP wants to do with the "words" once he has them.

This is a reasonable 1st approximation, but its tendency
to generate non-words (e.g., "st") isn't desirable.
 
Barry

dough said:
I want to read in lines from a file and then separate the words so I
can do a process on each of the words. Say the text file "readme.txt"
contains the following:

In the face of criticism from the left and right, President Bush
insisted Tuesday that Harriet Miers is the nation's best-qualified
candidate for the Supreme Court and assured skeptical conservatives
that his lawyer...

I could get an input to a char *s such that s = "In" and then I do
something with s, then s = "the" and then I do something with that,
etc., with no idea of the length of any string, line, or whitespace.

Here's what I have so far.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void process(char *s) /* what's here is not really important */
{
printf("%s", s);
}

int main() {

char buffer[80];
FILE *f = fopen("readme.txt", "r");
char *s;

while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
{
while( sscanf(buffer, "%s", s) ) /* scans for words in line */
{
process(s); /* do stuff to the words */
}
}

fclose(f);
return 0;

}

Also, is there any way to adjust the size of the buffer or reallocate
the memory so it doesn't overflow and get a seg error?

"process" is a terrible name for a function in any context.

Barry
 
Mabden

Interesting. No one has ever thought of doing that before. Where did you
come up with such a great idea for a program? It's unlike anything I've
ever heard of...

realloc() helps you do that.
Have a look at the comp.lang.c archives to see how to use it.


That would be like studying. If he wanted to study he would go to
school.

If you do not need the words in context, you can also use getc(), which
may be clearer:

<Homework answers snipped>

Nice job, you got him an A-.
 
Barry Schwarz

I want to read in lines from a file and then separate the words so I
can do a process on each of the words. Say the text file "readme.txt"
contains the following:

It would be nice if you mentioned what your problem was.

[snip]

Here's what I have so far.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void process(char *s) /* what's here is not really important */
{
printf("%s", s);
}

int main() {

char buffer[80];
FILE *f = fopen("readme.txt", "r");
char *s;

while( fgets(buffer, sizeof(buffer), f) != NULL ) /* reads a line */
{
while( sscanf(buffer, "%s", s) ) /* scans for words in line */

s doesn't point anywhere sscanf can write to. This invokes undefined
behavior.

{
process(s); /* do stuff to the words */
}
}

fclose(f);
return 0;

}

Also, is there any way to adjust the size of the buffer or reallocate
the memory so it doesn't overflow and get a seg error?

The seg error you experience has nothing to do with buffer, since you
never overflow it. It has everything to do with failing to have s
point somewhere.
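
A minimal sketch of giving sscanf() somewhere to write, assuming words of at most 79 characters. The helper name next_word and the WORD_MAX limit are assumptions for illustration. Note that sscanf() returns EOF, which is nonzero, when it runs out of input, so the result is compared against 1 rather than tested for truth.

```c
#include <stdio.h>

#define WORD_MAX 79  /* assumed limit; must match the width in the format */

/* Hypothetical helper: copies the next whitespace-delimited word from
   *line into word, which must have room for WORD_MAX + 1 chars, and
   advances *line past it. Returns 1 on success, 0 when no word is left.
   A word longer than WORD_MAX characters is delivered in pieces. */
int next_word(const char **line, char *word)
{
    int used = 0;

    /* "%79s" bounds the copy; "%n" records how many characters were
       consumed, including skipped leading whitespace. */
    if (sscanf(*line, "%79s%n", word, &used) != 1)
        return 0;

    *line += used;
    return 1;
}
```

In the original loop this would be used as: const char *p = buffer; char s[WORD_MAX + 1]; while (next_word(&p, s)) process(s);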


<<Remove the del for email>>
 
Michael Mair

Mabden said:
[snip]
If you do not need the words in context, you can also use getc(), which
may be clearer:

<Homework answers snipped>

Nice job, you got him an A-.

The original message was not too obviously a homework question
to me and contained a first shot at the problem, so I decided
to give the OP the benefit of the doubt. If "dough" posts something
like that again or does not respond to the answers he or she got
in this thread, I won't.


Cheers
Michael
 
