Count total no. of characters,words & sentences in a text file

James Kanze · Apr 24, 2007

Umesh said:

[...]

The problem to which you refer is one of the easier K&R examples, and
should cause you no difficulty.

Sort of. As I recall it, it was characters, words and lines,
not sentences. And K&R made a special point of defining what
they meant by "words" and "lines" (and pointing out that they
were simplistic definitions, which didn't necessarily correspond
to the "intuitive" definition).

So, independently of the language he chooses (C, C++ or
whatever), the first step should be to define exactly what the
code is supposed to do. Until he's done that, he shouldn't
write a single line of code. (Defining a "sentence" in a way
that can be programmed is not obvious, and defining "word" in a
way compatible with everyday use is perhaps not trivial either:
is "don't" one word or two?)
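To make the "don't" question concrete, here is a minimal sketch of two possible word definitions. Both function names are made up for illustration. The first treats a word as any run of non-whitespace characters, which is roughly the simplistic kind of definition K&R use; the second treats a word as any run of letters and digits. Neither is presented as the right answer; the point is only that the two disagree.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Definition A (hypothetical name): a word is a maximal run of
   non-whitespace characters. */
size_t count_words_by_space(const char *text)
{
    size_t words = 0;
    int in_word = 0;

    assert(text != NULL);
    for (; *text != '\0'; ++text) {
        if (isspace((unsigned char)*text)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;        /* first character of a new word */
            ++words;
        }
    }
    return words;
}

/* Definition B (hypothetical name): a word is a maximal run of
   letters and digits; everything else is a separator. */
size_t count_words_by_alnum(const char *text)
{
    size_t words = 0;
    int in_word = 0;

    assert(text != NULL);
    for (; *text != '\0'; ++text) {
        if (!isalnum((unsigned char)*text)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;        /* first character of a new word */
            ++words;
        }
    }
    return words;
}
```

For the input "I don't know", definition A reports 3 words and definition B reports 4, because the apostrophe splits "don't" into "don" and "t".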

James Kanze · Apr 24, 2007

if(ch=='.' && c!=' ') ++num3; /* '.' followed by '.' denotes end of a
sentence.*/}

So "Mr. and Mrs. Brown went out." is two sentences, and "I went
out." isn't a sentence.

I suggest that you start by defining exactly what is and what
isn't a sentence (and a word---and you might even ask the
question about characters; I work a lot with UTF-8, where a
character can require several char's).
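As a rough illustration of the UTF-8 point, the sketch below counts code points rather than bytes. It assumes the input is valid UTF-8 in a NUL-terminated string, and the function name is invented for this example. Note that a code point is still not necessarily what a reader would call a character (combining accents are separate code points), so this is one possible definition, not the definitive one.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a string that is assumed to be valid UTF-8.
   Continuation bytes have the bit pattern 10xxxxxx; every other byte
   starts a new code point. */
size_t count_utf8_code_points(const char *text)
{
    size_t count = 0;

    assert(text != NULL);
    for (; *text != '\0'; ++text) {
        if (((unsigned char)*text & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With the six-byte string "na\xC3\xAF" "ve" (the word "naive" written with a diaeresis), this returns 5, while a plain byte count gives 6.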

Don't write a single line of code until you've defined the
problem space precisely.

FWIW: in my own line breaking algorithms, I defined a sentence
as anything ending with [.?!], optionally followed by ["'], and
then any amount of white space (not just ' '). In the context
I'm working in abbreviations aren't a problem; the only one I
encounter in practice is "etc.", and it's easy to special case
that. For learning code, handling abbreviations may be a bit
too complex (but you should at least document the restriction),
but the rest can easily be implemented by means of a simple
state machine.
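A minimal sketch of such a state machine, under stated assumptions: a sentence ends at one of [.?!], followed by any number of quote characters, followed by at least one white space character or the end of the input. Requiring at least one white space character (rather than allowing zero) is my reading of the definition, and abbreviations are deliberately not handled. The function and state names are invented for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum scan_state { IN_TEXT, AFTER_TERMINATOR };

static int is_terminator(int c) { return c == '.' || c == '?' || c == '!'; }
static int is_quote(int c)      { return c == '"' || c == '\''; }

/* Restriction: abbreviations such as "Mr." or "etc." followed by white
   space are counted as sentence ends. */
size_t count_sentences_simple(const char *text)
{
    enum scan_state state = IN_TEXT;
    size_t sentences = 0;

    assert(text != NULL);
    for (; *text != '\0'; ++text) {
        int c = (unsigned char)*text;

        if (is_terminator(c)) {
            state = AFTER_TERMINATOR;
        } else if (state == AFTER_TERMINATOR && is_quote(c)) {
            /* closing quote after the terminator: keep waiting */
        } else {
            if (state == AFTER_TERMINATOR && isspace(c))
                ++sentences;    /* terminator, optional quotes, white space */
            state = IN_TEXT;
        }
    }
    if (state == AFTER_TERMINATOR)
        ++sentences;            /* terminator at the very end of the input */
    return sentences;
}
```

With this definition "I went out." is one sentence and "comp.lang.c is a newsgroup." is also one, but "Mr. and Mrs. Brown went out." still comes out as three, which is exactly the documented restriction.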

lovecreatesbea... · Apr 24, 2007

This program has numerous problems, which I'll be happy to discuss
with you if you pick *one* newsgroup to post to (if that newsgroup
happens to be comp.lang.c; I don't regularly read comp.lang.c++).
C and C++ are two different languages, and cross-posting between
comp.lang.c and comp.lang.c++ is almost never a good idea.

But the first thing you should do is to run the program and take a
look at its output (hint: the results it reports are incorrect).

I have done an exercise on this. But it doesn't deal with this
condition: comp.lang.c - this group name will be treated as two
sentences. How can I improve it?

$
$ type a.c
#include <stdio.h>
#include <ctype.h>

int wc2(const char *filename)
{
    FILE *fp;
    int ch;
    int nc; /*num of chars*/
    int nw; /*num of words*/
    int ns; /*num of sentences*/
    int inw; /*inside a word*/

    nc = nw = ns = 0;
    inw = 0;
    if ((fp = fopen(filename, "r")) == NULL)
        return -1;
    while ((ch = fgetc(fp)) != EOF){
        if (isalnum(ch)){
            nc++;
            inw = 1;
        } else if ((ispunct(ch) || ch == ' ') && (inw == 1)){
            nw++;
            if (ch == '!' || ch == '?' || ch == '.' || ch == ';')
                ns++;
            inw = 0;
        }
    }
    fprintf(stdout, "num of chars: %d\nnum of words: %d\nnum of sentences: %d\n",
            nc, nw, ns);
    fclose(fp);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2)
        fprintf(stdout, "Usage: %s <filename>", argv[0]);
    wc2(argv[1]);
    return 0;
}

$ type test.txt
This program has numerous problems, which I'll be happy to discuss
with you if you pick *one* newsgroup to post to (if that newsgroup
happens to be comp.lang.c; I don't regularly read comp.lang.c++).
C and C++ are two different languages, and cross-posting between
comp.lang.c and comp.lang.c++ is almost never a good idea.

But the first thing you should do is to run the program and take a
look at its output (hint: the results it reports are incorrect).

--
Keith Thompson (The_Other_Keith) (e-mail address removed) <http://www.ghoti.net/
~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/
~kst>
"We must do something. This is something. Therefore, we must do
this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

$ a.out test.txt
num of chars: 533
num of words: 130
num of sentences: 19

$

lovecreatesbea... · Apr 24, 2007

Please try to do it while I try myself!

#include <stdio.h>
#include <ctype.h>

int wc2(const char *filename)
{
    FILE *fp;
    int ch;
    int nc; /*num of chars*/
    int nw; /*num of words*/
    int ns; /*num of sentences*/
    int inw; /*inside a word*/

    nc = nw = ns = 0;
    inw = 0;
    if ((fp = fopen(filename, "r")) == NULL){
        return -1;
    }
    while ((ch = fgetc(fp)) != EOF){
        if (isalnum(ch)){
            nc++;
            inw = 1;
        } else if ((ispunct(ch) || ch == ' ') && (inw == 1)){
            nw++;
            if (ch == '!' || ch == '?' || ch == '.' || ch == ';')
                ns++;
            inw = 0;
        }
    }
    fprintf(stdout, "num of chars: %d\nnum of words: %d\nnum of sentences: %d\n",
            nc, nw, ns);
    fclose(fp);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 2){
        fprintf(stdout, "Usage: %s <filename>\n", argv[0]);
        return -1;
    }
    wc2(argv[1]);
    return 0;
}

lovecreatesbea... · Apr 25, 2007

Please try to do it while I try myself!

$ cat a.c
#include <stdio.h>
#include <ctype.h> /* isalnum, ispunct */

int wc2(char *filename)
{
    FILE *fp;
    int ch;
    int nc; /*num of characters*/
    int nw; /*num of words*/
    int ns; /*num of sentences*/
    int inw; /*inside a word or not*/

    nc = nw = ns = 0; inw = 0;
    if ((fp = fopen(filename, "r")) == NULL)
        return -1;
    while ((ch = fgetc(fp)) != EOF){
        if (isalnum(ch)){
            nc++;
            inw = 1;
        } else if ((ispunct(ch) || ch == ' ') && inw == 1){
            nw++;
            if (ch == '.' || ch == '!' || ch == '?' || ch == ';')
                ns++;
            inw = 0;
        }
    }
    fprintf(stdout, "characters: %d, ", nc);
    fprintf(stdout, "words: %d, ", nw);
    fprintf(stdout, "sentences: %d ", ns);
    fprintf(stdout, "\n");
    fclose(fp);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc == 2)
        wc2(argv[1]);
    else
        fprintf(stdout, "Usage: %s <filename>\n", argv[0]);
    return 0;
}
$ cc a.c
$ cat test.txt
getc() and getchar() are implemented both as library functions and
macros. The macro versions, which are used by default, are
defined in
<stdio.h>. To obtain the library function either use a #undef
to
remove the macro definition or, if compiling in ANSI-C mode,
enclose
the function name in parenthesis or use the function address.
The
following example illustrates each of these methods :
$ ./a.out test.txt
characters: 311, words: 64, sentences: 3
$

James Kanze · Apr 26, 2007

I have done an exercise on this. But it doesn't deal with this
condition: comp.lang.c - this group name will be treated as two
sentences. How can I improve it?

Define what you mean by sentence more exactly.

IIRC, the definition TeX uses, adapted to the C/C++ character
set, would be:

-- one of [.?!],
-- followed by zero or more of ['"],
-- followed by
   . either zero or more whitespace, followed by the end of
     file, or
   . one or more whitespace, followed by a capital letter.

This works because TeX also expects things like "Mr. Brown" to
contain a non-breaking whitespace ('~' in TeX, 0xA0 in ISO
8859-1), which doesn't count as a whitespace. If you don't
require that, I can't think of anything but special casing to
handle Mr. and Mrs. (and Dr. and... any other abbreviation that
is often followed by a noun).

This is probably most easily handled by some sort of state
machine.
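One way the rules above might be sketched as a three-state machine in C. The function and state names are invented for this example, and two things are assumptions on my part: only the six ASCII white space characters count as white space (so a 0xA0 byte never does), and "capital letter" means whatever isupper() accepts in the current locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum scan_state { IN_TEXT, AFTER_TERMINATOR, AFTER_SPACE };

static int is_terminator(int c) { return c == '.' || c == '?' || c == '!'; }
static int is_quote(int c)      { return c == '\'' || c == '"'; }

/* ASCII white space only, so that 0xA0 (non-breaking space in
   ISO 8859-1) is never treated as white space. */
static int is_ascii_space(int c)
{
    return c == ' ' || c == '\t' || c == '\n'
        || c == '\r' || c == '\f' || c == '\v';
}

size_t count_sentences_tex_like(const char *text)
{
    enum scan_state state = IN_TEXT;
    size_t sentences = 0;

    assert(text != NULL);
    for (; *text != '\0'; ++text) {
        int c = (unsigned char)*text;

        if (is_terminator(c)) {
            state = AFTER_TERMINATOR;
        } else if (state == AFTER_TERMINATOR && is_quote(c)) {
            /* zero or more quotes after the terminator: keep waiting */
        } else if (state != IN_TEXT && is_ascii_space(c)) {
            state = AFTER_SPACE;    /* one or more white space characters */
        } else {
            if (state == AFTER_SPACE && isupper(c))
                ++sentences;        /* white space, then a capital letter */
            state = IN_TEXT;
        }
    }
    if (state != IN_TEXT)
        ++sentences;                /* terminator (and white space) at end of input */
    return sentences;
}
```

Under these assumptions "I read comp.lang.c daily. It helps." counts as two sentences, since neither '.' inside the group name is followed by white space. "Mr. Brown went out." also counts as two unless the space after "Mr." is a non-breaking one, which is the special-casing problem described above.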
