NLP, Python and period

Fred Mangusta

Hi,

are you aware of any NLP packages or algorithms in Python to spot
whether a '.' represents an end of sentence or rather something else (e.g.
Mr., (e-mail address removed), etc.)?

Thanks

F.
 
Paul Boddie

> Hi,
>
> are you aware of any NLP packages or algorithms in Python to spot
> whether a '.' represents an end of sentence or rather something else (e.g.
> Mr., (e-mail address removed), etc.)?

I wouldn't mind finding out about such packages, either. I see that
NLTK offers a few options, with the following tokeniser being
interesting if you don't mind training the software:

http://nltk.org/doc/guides/tokenize.html#punkt-tokenizer

There was also discussion of this topic on Ned Batchelder's blog a
while back:

http://nedbatchelder.com/blog/200804/separating_sentences.html

My comment on there (that I'm using a regular expression with some
postprocessing) still stands.

Paul
 
John Machin

> Hi,
>
> are you aware of any NLP packages or algorithms in Python to spot
> whether a '.' represents an end of sentence or rather something else (e.g.
> Mr., (e-mail address removed), etc.)?

google("python nltk") ... it may do what you want.
 
Fred Mangusta

Hi Paul,

thanks for replying. I'm interested in knowing more about your regex
approach, but as you point out in your comment, it seems that access to the
SourceForge mail archive is restricted. Is there any way I can read
about it? Would you be so kind as to cut and paste it here, for instance?

Thanks!
F.
 
Paul Boddie

> thanks for replying. I'm interested in knowing more about your regex
> approach, but as you point out in your comment, it seems that access to the
> SourceForge mail archive is restricted. Is there any way I can read
> about it? Would you be so kind as to cut and paste it here, for instance?

I can't log into SourceForge, possibly because I've forgotten my
password, but I can give you a fairly similar regular expression which
does some of the work:

import re

sentence_pattern = re.compile(
    r'(' +
        r'[\(\"\[]*' +      # Quoting or bracketing (optional)
        r'[A-Za-z0-9]' +    # Match sentence with specific start character
        r'.+?' +            # Match sentence content - "?" means non-greedy
        r'[\.\!\?]' +       # End of sentence
        r'[\)\"\]]*' +      # End quoting or bracketing
    r')' +
    r'(\s+)' +              # Spaces
    r'[\(\"\[]*' +          # Quoting or bracketing (optional)
    r'[A-Z0-9]'             # Match next sentence with specific start character
)

This is mostly the same as that posted to SourceForge, but with some
enhancements; I've indented the part which actually produces the
matched sentence text in a group. Unfortunately, some postprocessing
is required to deal with abbreviations, and I maintain a list of these
against which I test the supposed ends of sentences that the regular
expression provides. In addition, I also try to detect initials (e.g.
G. van Rossum) which the regular expression may regard as the end of a
sentence.
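
For readers who want to try the idea, here is a minimal sketch of that kind
of postprocessing. It is not the code from the SourceForge post: the
abbreviation list, the helper names is_false_end and split_sentences, and
the use of a lookahead (so that the start of the next sentence is checked
but not consumed) are all assumptions made for illustration, built around a
variant of the pattern above.

```python
import re

# Variant of the pattern above. Assumption: the start of the next sentence
# is checked with a lookahead so that it is not consumed by the match.
sentence_pattern = re.compile(
    r'([\(\"\[]*[A-Za-z0-9].+?[\.\!\?][\)\"\]]*)'  # candidate sentence
    r'(\s+)'                                        # spaces
    r'(?=[\(\"\[]*[A-Z0-9])'                        # start of next sentence
)

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e.", "etc."}


def is_false_end(candidate):
    """Return True if the candidate ends in an abbreviation or an initial."""
    last_word = candidate.split()[-1].strip('()[]"')
    if last_word.lower() in ABBREVIATIONS:
        return True
    # A single capital letter followed by a period looks like an initial.
    return re.fullmatch(r'[A-Z]\.', last_word) is not None


def split_sentences(text):
    """Split text at the ends proposed by the regex, skipping false ends."""
    sentences = []
    start = 0
    for match in sentence_pattern.finditer(text):
        candidate = text[start:match.end(1)]
        if is_false_end(candidate):
            continue  # keep accumulating into the same sentence
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for s in split_sentences("Mr. Smith spoke to J. Jones. He agreed."):
        print(s)
```

The sketch still has obvious gaps: a sentence that really does end in "etc."
or in the pronoun "I" is wrongly joined to the next one, which is the kind
of case where a trained tokeniser may do better.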

As I noted, I'd be interested to hear of any better solutions which
don't involve training.

Paul
 
