Counting no.of sentences

Guru Nathan via JavaKB.com · Feb 28, 2005

Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
..of sentences..Plz help me to solve this in java...
Eagerly waiting for ur reply..

Thanking you
Your Friend guru

Oscar kind · Feb 28, 2005

Guru Nathan via JavaKB.com said:
Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
.of sentences..Plz help me to solve this in java...

As this is an algorithm question, a solution is not necessarily in Java, but
possible in any language. But let's put that aside.

In a clean text, regardless of the number of abbreviations (with dots),
ellipsis (...), etc., every sentence ends with a dot, followed by
whitespace.

So if you're lucky enough to have clean input, the number of sentences
equals the number of occurrences of the following regular expression (plus
one if the last character is a dot):
"\.\s"

For more information, read the API docs of java.util.regex.Pattern.
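
As a rough sketch of the count described above, the pattern can be applied with java.util.regex as follows. The class and method names are made up for illustration, and the result is only meaningful for clean input: an abbreviation that is followed by a space gets counted as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class DotWhitespaceCount {
    // A dot followed by exactly one whitespace character.
    private static final Pattern DOT_SPACE = Pattern.compile("\\.\\s");

    static int countSentences(String text) {
        int count = 0;
        Matcher m = DOT_SPACE.matcher(text);
        while (m.find()) {
            count++;
        }
        // The last sentence has no whitespace after its dot, so add one for it.
        if (text.endsWith(".")) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two dots followed by a space, plus the final dot: prints 3.
        System.out.println(countSentences("It rained. We stayed in. Nobody minded."));
    }
}
```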

Rhino · Feb 28, 2005

Guru Nathan via JavaKB.com said:
Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
.of sentences..Plz help me to solve this in java...
Eagerly waiting for ur reply..

Thanking you
Your Friend guru

I think you're going to have to build a list of all of the abbreviations
that end with periods which are possible within the document you are
reading.

But I don't know exactly how you can do that with any certainty because
there are an awful lot of abbreviations and you'll probably never get them
all.

In years gone by, a lot of documents were written so that a sentence was
ended with a period followed by TWO spaces; if that were still true today,
I'd say that you could just search for periods followed by two spaces.
However, a great many people use only a single space after their periods
these days so that makes 'Mr. Ram' look like the end of one sentence and the
start of another.

Personally, I don't think you're likely to find a satisfactory solution to
this problem.

For what it's worth, most times that I see analyses of documents, they only
tell you the number of words, not the number of sentences. Maybe your
requirement is unrealistic and you can convince your user that this is the
case?

Rhino

Eric Sosman · Feb 28, 2005

Oscar said:
As this is an algorithm question, a solution is not necessarily in Java, but
possible in any language. But let's put that aside.

In a clean text, regardless of the number of abbreviations (with dots),
ellipsis (...), etc., every sentence ends with a dot, followed by
whitespace.

I dispute your claim! Would you care to bet? "There
are more things in Heaven and Earth, Horatio, than are
dreamt of in your philosophy."

Tilman Bohn · Feb 28, 2005

Oscar kind wrote: [...]

In a clean text, regardless of the number of abbreviations (with dots),
ellipsis (...), etc., every sentence ends with a dot, followed by
whitespace.

I dispute your claim! Would you care to bet? "There
are more things in Heaven and Earth, Horatio, than are
dreamt of in your philosophy."

(I think you're on to something.)

Hal Rosser · Mar 1, 2005

If the purpose is to estimate the actual number of sentences, then counting
periods may work, with a twist.
To get closer to the actual number, I would not count sentences of fewer than 5
or 6 characters, or periods that follow other periods, so as to eliminate
abbreviations, decimal numbers, and ellipses.
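
One way to read this heuristic in Java, as a sketch only. The class name, the method name and the threshold of 5 are assumptions, and the "length" of a sentence is taken to be the number of characters since the previous period. It remains an estimate: a period that comes after a long enough stretch of text is counted even if it belongs to an abbreviation.

```java
// Hypothetical helper, not part of any library.
class PeriodHeuristicCount {
    // Assumed threshold; the post suggests 5 or 6 characters.
    private static final int MIN_LENGTH = 5;

    static int estimateSentences(String text) {
        int count = 0;
        int lastPeriod = -1; // index of the previous period, if any
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '.') {
                continue;
            }
            // Skip periods that directly follow another period (ellipsis).
            boolean followsPeriod = i > 0 && text.charAt(i - 1) == '.';
            // Characters between the previous period and this one.
            int length = i - lastPeriod - 1;
            if (!followsPeriod && length >= MIN_LENGTH) {
                count++;
            }
            lastPeriod = i;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Mr." is too short and the ellipsis counts once: prints 2.
        System.out.println(estimateSentences("Mr. Ram is here... He is a doctor."));
    }
}
```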

SMC · Mar 1, 2005

Eric Sosman said:

Oscar kind wrote: [...]

In a clean text, regardless of the number of abbreviations (with
dots), ellipsis (...), etc., every sentence ends with a dot, followed
by whitespace.

I dispute your claim! Would you care to bet? "There
are more things in Heaven and Earth, Horatio, than are dreamt of in
your philosophy."

(I think you're on to something.)

Surely not! Doesn't every sentence end with a full stop?

Oscar kind · Mar 1, 2005

SMC said:
Eric Sosman said:

Oscar kind wrote: [...]
In a clean text, regardless of the number of abbreviations (with
dots), ellipsis (...), etc., every sentence ends with a dot, followed
by whitespace.

I dispute your claim! Would you care to bet? "There
are more things in Heaven and Earth, Horatio, than are dreamt of in
your philosophy."

(I think you're on to something.)

Surely not! Doesn't every sentence end with a full stop?

Yes, but I realized that that's not all that ends in a dot with
whitespace. Consider this example:
"Oscar Kind is a M.Sc. with several years of programming experience."

Roland · Mar 1, 2005

Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
..of sentences..Plz help me to solve this in java...
Eagerly waiting for ur reply..

Thanking you
Your Friend guru

You could use BreakIterator. The program below counts 5 sentences in your
post (though I replaced "sentences..Plz" by "sentences? Plz").

import java.text.BreakIterator;
import java.util.Locale;

public class SentenceCount {
    public static void main(String[] args) {
        String text = "Friends No of sentences can be counted "
                + "based on no.of (.) full stop... But if there is a "
                + "word like M.B.B.S , Mr.Ram..etc.., How to count "
                + "the no. of sentences? Plz help me to solve this "
                + "in java... Eagerly waiting for ur reply. Thanking "
                + "you Your Friend guru";

        BreakIterator boundary = BreakIterator.getSentenceInstance(
                Locale.getDefault());

        printAll(boundary, text);
        System.out.println("No. of sentences: "
                + count(boundary, text));
    }

    // Counts the sentence boundaries the iterator finds in source.
    public static int count(BreakIterator boundary, String source) {
        boundary.setText(source);
        int count = 0;
        boundary.first();
        while (boundary.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Prints each sentence on its own line, numbered from 1.
    public static void printAll(BreakIterator boundary, String source) {
        boundary.setText(source);
        int count = 0;
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE;
                start = end, end = boundary.next(), count++) {
            System.out.print(count + 1);
            System.out.print('\t');
            System.out.println(source.substring(start, end));
        }
    }
}

--
Regards,

Roland de Ruiter
___ ___
/__/ w_/ /__/
/ \ /_/ / \

HK · Mar 1, 2005

Guru said:
Friends No of sentences can be counted based on no.of (.) full stop...

This is a reasonable heuristic:

a) a sentence ends with "[.?!][\r \n\t]+" if it is followed by [^a-z].
b) This is not true if you find an abbreviation just before the
dot, e.g. "Mr|Dr|vs|e.g|i.e|c.f|Fig|fig|No|no"
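
A possible Java rendering of rules (a) and (b), offered as a sketch rather than a finished tokenizer. The class and method names are invented here. Two details are assumptions that go beyond the rules as stated: the possessive quantifier (so that a run of whitespace is never split to satisfy the lookahead) and the extra count for a terminator at the very end of the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class HeuristicSentenceCount {
    // Rule (a): terminator, whitespace, then anything but a lowercase letter.
    // "++" is possessive, so all of the whitespace is consumed before the lookahead.
    private static final Pattern BOUNDARY =
            Pattern.compile("[.?!][\\r \\n\\t]++(?=[^a-z])");

    // Rule (b): one of the listed abbreviations, as a whole word, at the very
    // end of the text that precedes the dot.
    private static final Pattern ABBREVIATION = Pattern.compile(
            "(?:^|[^A-Za-z])(?:Mr|Dr|vs|e\\.g|i\\.e|c\\.f|Fig|fig|No|no)\\z");

    static int countSentences(String text) {
        int count = 0;
        Matcher m = BOUNDARY.matcher(text);
        while (m.find()) {
            boolean afterAbbreviation = text.charAt(m.start()) == '.'
                    && ABBREVIATION.matcher(text.substring(0, m.start())).find();
            if (!afterAbbreviation) {
                count++;
            }
        }
        // Assumption: a terminator at the very end closes one more sentence,
        // because rule (a) needs a character after the whitespace.
        String trimmed = text.trim();
        if (!trimmed.isEmpty()
                && ".?!".indexOf(trimmed.charAt(trimmed.length() - 1)) >= 0) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Mr.", "Dr." and "i.e." are not treated as sentence ends: prints 4.
        System.out.println(countSentences(
                "Mr. Ram met Dr. Rao, i.e. his doctor. They talked! Did it help? Yes."));
    }
}
```

The abbreviation list is only the one given in rule (b); per the remark on Zipf's law, any real text will need more entries.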

Like almost every phenomenon in natural language analysis, this
is subject to Zipf's law, which says that a few rules cover
most of the cases, while almost all other cases each need
their own rule.

http://en.wikipedia.org/wiki/Zipfs_law

Harald.

Michael · Mar 1, 2005

Guru Nathan via JavaKB.com said:
Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
.of sentences..Plz help me to solve this in java...
Eagerly waiting for ur reply..

Thanking you
Your Friend guru

How about, in addition to looking for the special characters that
end a sentence (followed by 1 or 2 spaces), making sure that the
first letter of the next word is a capital? Every sentence
begins with a capital letter.
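
A minimal sketch of that check in Java, with made-up class and method names. It assumes the sentence terminators are '.', '?' and '!', and it adds one for a terminator at the very end of the text (an assumption, since nothing follows it). A title before a name, as in "Mr. Michael", still passes the capital-letter test and is counted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class CapitalAfterStopCount {
    // Terminator, one or two spaces, then an uppercase letter.
    private static final Pattern BOUNDARY =
            Pattern.compile("[.?!] {1,2}(?=\\p{Lu})");

    static int countSentences(String text) {
        int count = 0;
        Matcher m = BOUNDARY.matcher(text);
        while (m.find()) {
            count++;
        }
        // Assumption: a terminator at the very end closes the last sentence.
        String trimmed = text.trim();
        if (!trimmed.isEmpty()
                && ".?!".indexOf(trimmed.charAt(trimmed.length() - 1)) >= 0) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "no. of" is skipped because "of" is lowercase: prints 2.
        System.out.println(countSentences("The no. of words is small. The rest is easy."));
    }
}
```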

Michael

Eric Sosman · Mar 1, 2005

Michael said:
How about in addition to the looking for the special characters that
end a sentence - (followed by the 1 or 2 spaces) - to then make sure
that the first letter of the next word is a capital. every sentence
begins with a capital letter.

Mr. Michael is a devious person.

Various people have suggested lexical rules of this
kind, and others (sometimes the suggesters themselves)
have offered examples where the rules fail. The whole
issue rests on what is meant by "sentence:" if Guru Nathan
defines "sentence" the way a grammar book does, I don't
think purely lexical rules can possibly succeed. Even a
full-fledged formal grammar would most likely fail, since
natural languages are probably not context-free. This is
the stuff of Ph.D. theses, and possibly of patents.

Guru Nathan needs to think about what the count will
be used for, and from this decide how accurate he requires
it to be. The accuracy requirement will tell him how
fancy he needs to make his lexical and/or syntactic rules.
In the meantime, we'll all have fun dreaming up cases that
make trouble for this or that plausible rule.

"The capital letters are A, B, ... Z," said Tom,
elliptically.

HK · Mar 2, 2005

Michael said:
How about in addition to the looking for the special characters that
end a sentence - (followed by the 1 or 2 spaces) - to then make sure
that the first letter of the next word is a capital. every sentence
begins with a capital letter.

Like everything in natural language processing, this is only
part of the truth. Some sentences start with other things, e.g.
"(i)". Also very nice are gene names some authors put
at the beginning of a sentence and don't upcase them,
because this would not be the gene name anymore.

Harald.

opalpa · Mar 3, 2005

Hi there, I've got some experience in this!

It's been a couple of years, but I feel acquainted with it. The key thought,
IMHO, is that if you give people sentences and ask how many there are, they'll
disagree, so it's okay to have programs disagree on this too.

IMHO a good program to do this would give results similar to humans.
There are multiple processes to have that happen. Check out some
machine learning methodologies and have them tell you which things are
sentences. And then give a count.

That's the way I've done it and it's had some surprisingly great
results.

Laters, Pawel Opalinski
