Count total no. of characters, words & sentences in a text file

James Kanze

Umesh said:

[...]
The problem to which you refer is one of the easier K&R examples, and
should cause you no difficulty.

Sort of. As I recall it, it was characters, words and lines,
not sentences. And K&R made a special point of defining what
they meant by "words" and "lines" (and pointing out that they
were simplistic definitions, which didn't necessarily correspond
to the "intuitive" definition).

So, independently of the language he chooses (C, C++ or
whatever), the first step should be to define exactly what the
code is supposed to do. Until he's done that, he shouldn't
write a single line of code. (Defining a "sentence" in a way
that can be programmed is not obvious, and defining "word" in a
way compatible with everyday use is perhaps not trivial either:
is "don't" one word or two?)
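
To make the "don't" question concrete, here is a minimal sketch in C. It is not taken from K&R or from this thread. It assumes a word is a maximal run of letters and digits, and a flag decides whether an apostrophe sitting between two such characters joins them into one word. The name count_words and its apostrophe_joins parameter are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts words in a NUL-terminated string.
 * Assumed definition: a word is a maximal run of letters and digits.
 * If apostrophe_joins is non-zero, an apostrophe that is inside a word
 * and directly followed by a letter or digit is part of that word. */
size_t count_words(const char *text, int apostrophe_joins)
{
    size_t count = 0;
    int in_word = 0;
    const unsigned char *p;

    assert(text != NULL);
    for (p = (const unsigned char *)text; *p != '\0'; ++p) {
        /* p[1] is safe to read: *p is not the terminating NUL */
        int is_word_char = isalnum(*p)
            || (apostrophe_joins && in_word && *p == '\'' && isalnum(p[1]));

        if (is_word_char && !in_word)
            ++count;            /* a new word starts here */
        in_word = is_word_char;
    }
    return count;
}
```

Under these assumptions, count_words("I don't know", 0) gives 4 and count_words("I don't know", 1) gives 3. Which one is right depends entirely on the definition chosen beforehand.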
 
James Kanze

if(ch=='.' && c!=' ') ++num3; /* '.' followed by '.' denotes end of a
sentence.*/}

So "Mr. and Mrs. Brown went out." is two sentences, and "I went
out." isn't a sentence.

I suggest that you start by defining exactly what is and what
isn't a sentence (and a word---and you might even ask the
question about characters; I work a lot with UTF-8, where a
character can require several char's).
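
As an illustration of the character question, here is a minimal sketch that counts UTF-8 code points instead of chars. It assumes the input is valid UTF-8 and counts every byte that is not a continuation byte (bit pattern 10xxxxxx). The name count_utf8_chars is hypothetical, and a code point is still not necessarily what a reader perceives as one character (combining accents, for example).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated string.
 * Assumption: the input is valid UTF-8; no validation is done. */
size_t count_utf8_chars(const char *text)
{
    size_t count = 0;
    const unsigned char *p;

    assert(text != NULL);
    for (p = (const unsigned char *)text; *p != '\0'; ++p) {
        /* continuation bytes look like 10xxxxxx; every other byte
         * starts a new code point */
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For "café" encoded as UTF-8 (five chars, because é takes two), this returns 4.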

Don't write a single line of code until you've defined the
problem space precisely.

FWIW: in my own line breaking algorithms, I defined a sentence
as anything ending with [.?!], optionally followed by ["'], and
then any amount of white space (not just ' '). In the context
I'm working in abbreviations aren't a problem; the only one I
encounter in practice is "etc.", and it's easy to special case
that. For learning code, handling abbreviations may be a bit
too complex (but you should at least document the restriction),
but the rest can easily be implemented by means of a simple
state machine.
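
The definition above can be sketched as a two-state machine. This is an illustration under stated assumptions, not the poster's actual line-breaking code: "any amount of white space" is read here as at least one white space character or the end of the text, runs such as "..." or "?!" count as a single terminator, and abbreviations are not handled. The name count_sentences_simple is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts sentences in a NUL-terminated string.
 * Assumed definition: one or more of [.?!], optionally followed by
 * quotes, followed by white space or the end of the text.
 * Restriction: abbreviations such as "Mr." are counted as sentence ends. */
size_t count_sentences_simple(const char *text)
{
    enum { IN_TEXT, AFTER_TERMINATOR } state = IN_TEXT;
    size_t count = 0;
    const unsigned char *p;

    assert(text != NULL);
    for (p = (const unsigned char *)text; *p != '\0'; ++p) {
        int c = *p;

        switch (state) {
        case IN_TEXT:
            if (c == '.' || c == '?' || c == '!')
                state = AFTER_TERMINATOR;
            break;
        case AFTER_TERMINATOR:
            if (c == '.' || c == '?' || c == '!' || c == '"' || c == '\'') {
                /* still inside the terminator, e.g. "..." or ?" */
            } else {
                if (isspace(c))
                    ++count;    /* terminator followed by white space */
                state = IN_TEXT; /* otherwise, e.g. the '.' in comp.lang.c */
            }
            break;
        }
    }
    if (state == AFTER_TERMINATOR)
        ++count;                /* terminator at the very end of the text */
    return count;
}
```

With this sketch "I went out." is one sentence and "comp.lang.c" contains none, but "Mr. and Mrs. Brown went out." is still counted as three, which is the restriction that should be documented.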
 
lovecreatesbea...

This program has numerous problems, which I'll be happy to discuss
with you if you pick *one* newsgroup to post to (if that newsgroup
happens to be comp.lang.c; I don't regularly read comp.lang.c++).
C and C++ are two different languages, and cross-posting between
comp.lang.c and comp.lang.c++ is almost never a good idea.

But the first thing you should do is to run the program and take a
look at its output (hint: the results it reports are incorrect).

I have done an exercise on this. But it doesn't deal with this
condition: comp.lang.c - this group name will be treated as two
sentences. How can I improve it :)

$
$ type a.c
#include <stdio.h>
#include <ctype.h>

int wc2(const char *filename)
{
    FILE *fp;
    int ch;
    int nc; /*num of chars (only letters and digits are counted)*/
    int nw; /*num of words*/
    int ns; /*num of sentences*/
    int inw; /*inside a word*/

    nc = nw = ns = 0;
    inw = 0;
    if ((fp = fopen(filename, "r")) == NULL)
        return -1;
    while ((ch = fgetc(fp)) != EOF){
        if (isalnum(ch)){
            nc++;
            inw = 1;
        } else if ((ispunct(ch) || ch == ' ') && (inw == 1)){
            /*only punctuation or ' ' ends a word; a newline does not*/
            nw++;
            if (ch == '!' || ch == '?' || ch == '.' || ch == ';')
                ns++;
            inw = 0;
        }
    }
    fprintf(stdout, "num of chars: %d\nnum of words: %d\nnum of sentences: %d\n",
            nc, nw, ns);
    fclose(fp);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2)
        fprintf(stdout, "Usage: %s <filename>\n", argv[0]);
    else
        wc2(argv[1]); /*argv[1] is only valid when an argument was given*/
    return 0;
}

$ type test.txt
This program has numerous problems, which I'll be happy to discuss
with you if you pick *one* newsgroup to post to (if that newsgroup
happens to be comp.lang.c; I don't regularly read comp.lang.c++).
C and C++ are two different languages, and cross-posting between
comp.lang.c and comp.lang.c++ is almost never a good idea.

But the first thing you should do is to run the program and take a
look at its output (hint: the results it reports are incorrect).


--
Keith Thompson (The_Other_Keith) (e-mail address removed) <http://www.ghoti.net/
~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/
~kst>
"We must do something. This is something. Therefore, we must do
this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"




$ a.out test.txt
num of chars: 533
num of words: 130
num of sentences: 19

$
 
lovecreatesbea...

Please try to do it while I try myself!

#include <stdio.h>
#include <ctype.h>

int wc2(const char *filename)
{
    FILE *fp;
    int ch;
    int nc; /*num of chars*/
    int nw; /*num of words*/
    int ns; /*num of sentences*/
    int inw; /*inside a word*/

    nc = nw = ns = 0;
    inw = 0;
    if ((fp = fopen(filename, "r")) == NULL){
        return -1;
    }
    while ((ch = fgetc(fp)) != EOF){
        if (isalnum(ch)){
            nc++;
            inw = 1;
        } else if ((ispunct(ch) || ch == ' ') && (inw == 1)){
            nw++;
            if (ch == '!' || ch == '?' || ch == '.' || ch == ';')
                ns++;
            inw = 0;
        }
    }
    fprintf(stdout, "num of chars: %d\nnum of words: %d\nnum of sentences: %d\n",
            nc, nw, ns);
    fclose(fp);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2){
        fprintf(stdout, "Usage: %s <filename>\n", argv[0]);
        return -1;
    }
    wc2(argv[1]);
    return 0;
}
 
lovecreatesbea...

Please try to do it while I try myself!

$ cat a.c
#include <stdio.h>
#include <ctype.h> /*needed for isalnum() and ispunct()*/

int wc2(char *filename)
{
    FILE *fp;
    int ch;
    int nc; /*num of characters*/
    int nw; /*num of words*/
    int ns; /*num of sentences*/
    int inw; /*inside a word or not*/

    nc = nw = ns = 0; inw = 0;
    if ((fp = fopen(filename, "r")) == NULL)
        return -1;
    while ((ch = fgetc(fp)) != EOF){
        if (isalnum(ch)){
            nc++;
            inw = 1;
        } else if ((ispunct(ch) || ch == ' ') && inw == 1){
            nw++;
            if (ch == '.' || ch == '!' || ch == '?' || ch == ';')
                ns++;
            inw = 0;
        }
    }
    fprintf(stdout, "characters: %d, ", nc);
    fprintf(stdout, "words: %d, ", nw);
    fprintf(stdout, "sentences: %d ", ns);
    fprintf(stdout, "\n");
    fclose(fp);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc == 2)
        wc2(argv[1]);
    else
        fprintf(stdout, "Usage: %s <filename>\n", argv[0]);
    return 0;
}
$ cc a.c
$ cat test.txt
getc() and getchar() are implemented both as library functions and
macros. The macro versions, which are used by default, are
defined in
<stdio.h>. To obtain the library function either use a #undef
to
remove the macro definition or, if compiling in ANSI-C mode,
enclose
the function name in parenthesis or use the function address.
The
following example illustrates each of these methods :
$ ./a.out test.txt
characters: 311, words: 64, sentences: 3
$
 
James Kanze

I have done an exercise on this. But it doesn't deal with this
condition: comp.lang.c - this group name will be treated as two
sentences. How can I improve it :)

Define what you mean by sentence more exactly.

IIRC, the definition TeX uses, adapted to the C/C++ character
set, would be:

-- one of [.?!],
-- followed by zero or more of ['"],
-- followed by
   . either zero or more whitespace, followed by the end of
     file, or
   . one or more whitespace, followed by a capital letter.

This works because TeX also expects things like "Mr. Brown" to
contain a non-breaking whitespace ('~' in TeX, 0xA0 in ISO
8859-1), which doesn't count as a whitespace. If you don't
require that, I can't think of anything but special casing to
handle Mr. and Mrs. (and Dr. and... any other abbreviation that
is often followed by a noun).

This is probably most easily handled by some sort of state
machine.
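
A minimal sketch of such a state machine, following the rules as listed above rather than TeX's actual implementation. It assumes the default "C" locale, where isspace() does not treat 0xA0 as white space, so a non-breaking space after "Mrs." prevents a sentence break. The name count_sentences_tex and the two small predicates are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static int is_terminator(int c) { return c == '.' || c == '?' || c == '!'; }
static int is_quote(int c) { return c == '\'' || c == '"'; }

/* Hypothetical helper: counts sentences in a NUL-terminated string.
 * A sentence ends at [.?!], then zero or more quotes, then either
 * optional white space and the end of the text, or at least one
 * white space character and a capital letter. */
size_t count_sentences_tex(const char *text)
{
    enum { IN_TEXT, AFTER_MARK, AFTER_SPACE } state = IN_TEXT;
    size_t count = 0;
    const unsigned char *p;

    assert(text != NULL);
    for (p = (const unsigned char *)text; *p != '\0'; ++p) {
        int c = *p;

        switch (state) {
        case IN_TEXT:
            if (is_terminator(c))
                state = AFTER_MARK;
            break;
        case AFTER_MARK:        /* seen [.?!] and possibly quotes */
            if (isspace(c))
                state = AFTER_SPACE;
            else if (!is_terminator(c) && !is_quote(c))
                state = IN_TEXT; /* e.g. the '.' in comp.lang.c */
            break;
        case AFTER_SPACE:       /* seen [.?!], quotes, white space */
            if (isupper(c)) {
                ++count;        /* capital letter: a sentence ended */
                state = IN_TEXT;
            } else if (is_terminator(c)) {
                state = AFTER_MARK;
            } else if (!isspace(c)) {
                state = IN_TEXT; /* lower case etc.: not a sentence end */
            }
            break;
        }
    }
    if (state != IN_TEXT)
        ++count;                /* terminator reached the end of the text */
    return count;
}
```

With plain spaces, "Mr. and Mrs. Brown went out." gives 2 (the break after "Mr." is rejected because "and" is lower case, the one after "Mrs." is not). With 0xA0 between "Mrs." and "Brown" it gives 1, and "comp.lang.c" on its own gives 0.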
 
