NLP, Python and period

Fred Mangusta · Aug 4, 2008

Hi,

are you aware of any NLP packages or algorithms in Python to spot
whether a '.' represents the end of a sentence or rather something else
(e.g. Mr., (e-mail address removed), etc.)?

Thanks

F.

Paul Boddie · Aug 4, 2008

I wouldn't mind finding out about such packages, either. I see that
NLTK offers a few options, with the following tokeniser being
interesting if you don't mind training the software:

http://nltk.org/doc/guides/tokenize.html#punkt-tokenizer

There was also discussion of this topic on Ned Batchelder's blog a
while back:

http://nedbatchelder.com/blog/200804/separating_sentences.html

My comment on there (that I'm using a regular expression with some
postprocessing) still stands.

Paul

John Machin · Aug 4, 2008

google("python nltk") ... it may do what you want.

Fred Mangusta · Aug 4, 2008

Hi Paul,

thanks for replying. I'm interested in knowing more about your regex
approach, but as you point out in your comment, it seems that access to
the sourceforge mail archive is restricted. Is there any way I can read
about it? Would you be so kind as to cut and paste it here, for instance?

Thanks!
F.

Paul Boddie · Aug 4, 2008

I can't log into SourceForge, possibly because I've forgotten my
password, but I can give you a fairly similar regular expression which
does some of the work:

import re

sentence_pattern = re.compile(
    r'(' +
        r'[\(\"\[]*' +      # Quoting or bracketing (optional)
        r'[A-Za-z0-9]' +    # Match sentence with specific start character
        r'.+?' +            # Match sentence content - "?" means non-greedy
        r'[\.\!\?]' +       # End of sentence
        r'[\)\"\]]*' +      # End quoting or bracketing
    r')' +
    r'(\s+)' +              # Spaces
    r'[\(\"\[]*' +          # Quoting or bracketing (optional)
    r'[A-Z0-9]'             # Match sentence with specific start character
    )

This is mostly the same as that posted to SourceForge, but with some
enhancements; I've indented the part which actually produces the
matched sentence text in a group. Unfortunately, some postprocessing
is required to deal with abbreviations, and I maintain a list of these
against which I test the supposed ends of sentences that the regular
expression provides. In addition, I also try to detect initials (e.g.
G. van Rossum) which the regular expression may regard as the end of a
sentence.
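
The kind of postprocessing described above might look roughly like the following. This is only a minimal sketch and not the code from the SourceForge post: the names ABBREVIATIONS, BOUNDARY, is_abbreviation, is_initial and split_sentences are hypothetical, the abbreviation list is a tiny illustrative sample, and the boundary pattern uses a lookahead instead of consuming the first character of the next sentence.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real one would be
# much longer and probably domain-specific.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e.", "etc."}

# Candidate boundary: a terminator, optional closing quotes/brackets, then
# whitespace, followed (lookahead only) by what looks like a sentence start.
BOUNDARY = re.compile(r'[.!?][)"\]]*(\s+)(?=[("\[]*[A-Z0-9])')


def is_abbreviation(word):
    """True if the word, minus surrounding quotes/brackets, is a known abbreviation."""
    return word.strip('()[]"').lower() in ABBREVIATIONS


def is_initial(word):
    """True for a single capital letter plus a period, such as 'G.'."""
    return re.fullmatch(r'[A-Z]\.', word) is not None


def split_sentences(text):
    """Split text at candidate boundaries, skipping abbreviations and initials."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start(1)]
        last_word = candidate.split()[-1]
        if is_abbreviation(last_word) or is_initial(last_word):
            continue  # Not a real end of sentence; keep accumulating text.
        sentences.append(candidate)
        start = match.end()  # The next sentence begins after the whitespace.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = ('Mr. Smith met J. R. Hacker in Amsterdam. '
              'They talked about Python 3. "It was fun!" he said.')
    for sentence in split_sentences(sample):
        print(sentence)
```

Periods inside e-mail addresses or host names are never candidates here, because they are not followed by whitespace. The initials test is a blunt heuristic: a sentence that really does end in a single capital letter ("We chose plan B.") will be merged with the next one.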

As I noted, I'd be interested to hear of any better solutions which
don't involve training.

Paul
