Counting the number of sentences

  • Thread starter Guru Nathan via JavaKB.com

Guru Nathan via JavaKB.com

Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
..of sentences..Plz help me to solve this in java...
Eagerly waiting for ur reply..

Thanking you
Your Friend guru
 

Oscar kind

Guru Nathan via JavaKB.com said:
Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
.of sentences..Plz help me to solve this in java...

As this is an algorithm question, a solution is not necessarily in Java;
it is possible in any language. But let's put that aside.

In a clean text, regardless of the number of abbreviations (with dots),
ellipses (...), etc., every sentence ends with a dot, followed by
whitespace.

So if you're lucky enough to have clean input, the number of sentences
equals the number of occurrences of the following regular expression (plus
one if the last character is a dot):
"\.\s"

For more information, read the API docs of java.util.regex.Pattern.
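
A minimal sketch of the counting rule described above, assuming the whole text is already in a String. The class and method names are made up for illustration; only java.util.regex.Pattern and Matcher come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class DotWhitespaceCounter {
    // A full stop followed by a single whitespace character: "\.\s"
    private static final Pattern DOT_THEN_SPACE = Pattern.compile("\\.\\s");

    static int countSentences(String text) {
        Matcher m = DOT_THEN_SPACE.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        // The last sentence has no whitespace after its dot, so add one for it.
        if (text.endsWith(".")) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Prints 2: one ". " match plus the trailing dot.
        System.out.println(countSentences("It rained. We stayed in."));
        // Also prints 2, although this is one sentence: "Sc. " looks like an end.
        System.out.println(countSentences(
                "Oscar Kind is a M.Sc. with several years of programming experience."));
    }
}
```

The second call shows where the rule overcounts: an abbreviation followed by a space is indistinguishable from a sentence end.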
 

Rhino

Guru Nathan via JavaKB.com said:
Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
.of sentences..Plz help me to solve this in java...
Eagerly waiting for ur reply..

Thanking you
Your Friend guru

I think you're going to have to build a list of all the abbreviations
ending in a period that can occur in the document you are reading.

But I don't know exactly how you can do that with any certainty because
there are an awful lot of abbreviations and you'll probably never get them
all.

In years gone by, a lot of documents were written so that a sentence was
ended with a period followed by TWO spaces; if that were still true today,
I'd say that you could just search for periods followed by two spaces.
However, a great many people use only a single space after their periods
these days so that makes 'Mr. Ram' look like the end of one sentence and the
start of another.

Personally, I don't think you're likely to find a satisfactory solution to
this problem.

For what it's worth, most times that I see analyses of documents, they only
tell you the number of words, not the number of sentences. Maybe your
requirement is unrealistic and you can convince your user that this is the
case?

Rhino
 

Eric Sosman

Oscar said:
As this is an algorithm question, a solution is not necessarily in Java;
it is possible in any language. But let's put that aside.

In a clean text, regardless of the number of abbreviations (with dots),
ellipses (...), etc., every sentence ends with a dot, followed by
whitespace.

I dispute your claim! Would you care to bet? "There
are more things in Heaven and Earth, Horatio, than are
dreamt of in your philosophy."
 

Tilman Bohn

Oscar kind wrote: [...]
In a clean text, regardless of the number of abbreviations (with dots),
ellipses (...), etc., every sentence ends with a dot, followed by
whitespace.

I dispute your claim! Would you care to bet? "There
are more things in Heaven and Earth, Horatio, than are
dreamt of in your philosophy."

(I think you're on to something.)
 

Hal Rosser

If the purpose is to estimate the actual number of sentences, then counting
periods may work, with a twist.
To get closer to the actual count, I would not count sentences of fewer
than 5 or 6 characters, or periods that follow other periods, so as to
eliminate abbreviations, decimal numbers, and ellipses.
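
One possible reading of this heuristic, as a rough sketch: a period is counted only if it does not directly follow another period and the text since the previous period is at least 5 characters long after trimming. The class name, the threshold constant and this exact interpretation are assumptions, not something the post spells out.

```java
// Hypothetical class; names and the threshold are illustrative assumptions.
final class PeriodHeuristicCounter {
    // The post says "5 or 6 characters"; 5 is picked arbitrarily here.
    private static final int MIN_SENTENCE_LENGTH = 5;

    static int countSentences(String text) {
        int count = 0;
        int segmentStart = 0; // index just after the previous period
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '.') {
                continue;
            }
            // Periods inside "..." directly follow another period.
            boolean followsPeriod = i > 0 && text.charAt(i - 1) == '.';
            // Length of the text between the previous period and this one.
            int length = text.substring(segmentStart, i).trim().length();
            if (!followsPeriod && length >= MIN_SENTENCE_LENGTH) {
                count++;
            }
            segmentStart = i + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        // Prints 2: "Mr" is too short to count as a sentence.
        System.out.println(countSentences("Mr. Ram went home. He slept."));
        // Prints 1: "Wait" is too short and the other ellipsis dots are skipped.
        System.out.println(countSentences("Wait... I forgot my keys."));
    }
}
```

This is only an estimate. The first dot of an abbreviation such as M.B.B.S. is still counted when enough text precedes it, and sentences ending in ? or ! are ignored entirely.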

SMC

Eric Sosman wrote:
Oscar kind wrote: [...]
In a clean text, regardless of the number of abbreviations (with
dots), ellipses (...), etc., every sentence ends with a dot, followed
by whitespace.

I dispute your claim! Would you care to bet? "There
are more things in Heaven and Earth, Horatio, than are dreamt of in
your philosophy."

(I think you're on to something.)

Surely not! Doesn't every sentence end with a fullstop?
 

Oscar kind

SMC said:
Eric Sosman wrote:
Oscar kind wrote: [...]
In a clean text, regardless of the number of abbreviations (with
dots), ellipses (...), etc., every sentence ends with a dot, followed
by whitespace.

I dispute your claim! Would you care to bet? "There
are more things in Heaven and Earth, Horatio, than are dreamt of in
your philosophy."

(I think you're on to something.)

Surely not! Doesn't every sentence end with a fullstop?

Yes, but I realized that that's not all that ends in a dot with
whitespace. Consider this example:
"Oscar Kind is a M.Sc. with several years of programming experience."
 

Roland

Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
..of sentences..Plz help me to solve this in java...
Eagerly waiting for ur reply..

Thanking you
Your Friend guru


You could use BreakIterator. The program below counts 5 sentences in your
post (though I replaced "sentences..Plz" with "sentences? Plz").


import java.text.BreakIterator;
import java.util.Locale;

public class SentenceCount {
    public static void main(String[] args) {
        String text = "Friends No of sentences can be counted "
                + "based on no.of (.) full stop... But if there is a "
                + "word like M.B.B.S , Mr.Ram..etc.., How to count "
                + "the no. of sentences? Plz help me to solve this "
                + "in java... Eagerly waiting for ur reply. Thanking "
                + "you Your Friend guru";

        // Sentence boundaries follow the rules of the default locale.
        BreakIterator boundary = BreakIterator.getSentenceInstance(
                Locale.getDefault());

        printAll(boundary, text);
        System.out.println("No. of sentences: "
                + count(boundary, text));
    }

    // Counts the segments between consecutive sentence boundaries.
    public static int count(BreakIterator boundary, String source) {
        boundary.setText(source);
        int count = 0;
        boundary.first();
        while (boundary.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Prints every sentence on its own line, numbered from 1.
    public static void printAll(BreakIterator boundary, String source) {
        boundary.setText(source);
        int count = 0;
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE;
                start = end, end = boundary.next(), count++) {
            System.out.print(count + 1);
            System.out.print('\t');
            System.out.println(source.substring(start, end));
        }
    }
}

--
Regards,

Roland de Ruiter
 

HK

Guru said:
Friends No of sentences can be counted based on no.of (.) full
stop...

This is a reasonable heuristic:

a) a sentence ends with "[.?!][\r \n\t]+" if it is followed by [^a-z].
b) This is not true if you find an abbreviation just before the
dot, e.g. "Mr|Dr|vs|e.g|i.e|c.f|Fig|fig|No|no"
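
Rules (a) and (b) could be sketched roughly as follows. The class name, the way the word before the dot is extracted, and the choice to count a terminator at the very end of the (trimmed) input are assumptions added for the example; the regular expression and the abbreviation list are the ones given above.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; names are illustrative only.
final class HeuristicSentenceCounter {
    // Rule (a): terminator, whitespace, then anything but a lowercase letter.
    // Assumption: a terminator at the very end of the input also ends a sentence.
    private static final Pattern BOUNDARY =
            Pattern.compile("[.?!](?:[\\r \\n\\t]+(?=[^a-z])|\\z)");

    // Rule (b): abbreviations written without their final dot.
    private static final Set<String> ABBREVIATIONS = Set.of(
            "Mr", "Dr", "vs", "e.g", "i.e", "c.f", "Fig", "fig", "No", "no");

    static int countSentences(String input) {
        String text = input.trim();
        Matcher m = BOUNDARY.matcher(text);
        int count = 0;
        while (m.find()) {
            int stop = m.start(); // index of the terminator itself
            if (text.charAt(stop) == '.' && ABBREVIATIONS.contains(wordBefore(text, stop))) {
                continue; // the dot belongs to a known abbreviation
            }
            count++;
        }
        return count;
    }

    // Returns the run of letters and dots that ends just before index 'end'.
    private static String wordBefore(String text, int end) {
        int start = end;
        while (start > 0) {
            char c = text.charAt(start - 1);
            if (!Character.isLetter(c) && c != '.') {
                break;
            }
            start--;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Prints 4: "Mr." and "Dr." are skipped, "e.g. about" fails rule (a).
        System.out.println(countSentences(
                "Mr. Ram met Dr. Rao. They talked, e.g. about Java. Did it help? Yes!"));
    }
}
```

Any abbreviation missing from the list, followed by a capitalized word, still produces a false boundary, which is exactly the long tail described below.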

Like almost every phenomenon in natural language analysis, this
is subject to Zipf's law, which says that a few rules cover
most of the cases, while almost all other cases each need
their own rule.

http://en.wikipedia.org/wiki/Zipfs_law

Harald.
 

Michael

Guru Nathan via JavaKB.com said:
Friends No of sentences can be counted based on no.of (.) full stop...
But if there is a word like M.B.B.S , Mr.Ram..etc.., How to count the no
.of sentences..Plz help me to solve this in java...
Eagerly waiting for ur reply..

Thanking you
Your Friend guru


How about, in addition to looking for the special characters that
end a sentence (followed by 1 or 2 spaces), also making sure
that the first letter of the next word is a capital? Every sentence
begins with a capital letter.
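
As a rough sketch, this check can be written as a regular expression with a lookahead for an uppercase letter. The class name is made up, and counting a terminator at the end of the trimmed input is an added assumption so that the last sentence is included.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; names are illustrative only.
final class CapitalAfterStopCounter {
    // Terminator, then 1 or 2 spaces, then an uppercase letter (not consumed).
    // Assumption: a terminator at the very end of the input also counts.
    private static final Pattern BOUNDARY =
            Pattern.compile("[.?!](?: {1,2}(?=\\p{Lu})|\\z)");

    static int countSentences(String text) {
        Matcher m = BOUNDARY.matcher(text.trim());
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Prints 2: "today. we" is not a boundary because "we" is lowercase.
        System.out.println(countSentences(
                "It rained today. we stayed in. Then we slept."));
        // Prints 2: two spaces after the question mark are accepted.
        System.out.println(countSentences("Is it late?  Yes."));
    }
}
```

An abbreviation followed by a capitalized name still matches this pattern, so the check narrows the problem without removing it.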

Michael
 

Eric Sosman

Michael said:
How about, in addition to looking for the special characters that
end a sentence (followed by 1 or 2 spaces), also making sure
that the first letter of the next word is a capital? Every sentence
begins with a capital letter.

Mr. Michael is a devious person.

Various people have suggested lexical rules of this
kind, and others (sometimes the suggesters themselves)
have offered examples where the rules fail. The whole
issue rests on what is meant by "sentence": if Guru Nathan
defines "sentence" the way a grammar book does, I don't
think purely lexical rules can possibly succeed. Even a
full-fledged formal grammar would most likely fail, since
natural languages are probably not context-free. This is
the stuff of Ph.D. theses, and possibly of patents.

Guru Nathan needs to think about what the count will
be used for, and from this decide how accurate he requires
it to be. The accuracy requirement will tell him how
fancy he needs to make his lexical and/or syntactic rules.
In the meantime, we'll all have fun dreaming up cases that
make trouble for this or that plausible rule.

"The capital letters are A, B, ... Z," said Tom,
elliptically.
 

HK

Michael said:
How about, in addition to looking for the special characters that
end a sentence (followed by 1 or 2 spaces), also making sure
that the first letter of the next word is a capital? Every sentence
begins with a capital letter.

Like everything in natural language processing, this is only
part of the truth. Some sentences start with other things, e.g.
"(i)". Also very nice are gene names, which some authors put
at the beginning of a sentence and do not capitalize,
because then it would no longer be the gene name.

Harald.
 

opalpa

Hi there, I've got some experience in this!

It's been a couple of years, but I feel acquainted with the problem. The
key thought, IMHO, is that if you give people a text and ask how many
sentences it contains, they'll disagree, so it's okay for programs to
disagree on this too.

IMHO a good program for this would give results similar to those of humans.
There are multiple ways to make that happen. Check out some machine
learning methodologies, have them tell you which things are sentences,
and then give a count.

That's the way I've done it, and it's had some surprisingly great
results.

Laters, Pawel Opalinski
 
