Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It's not robust, but if you know that
convention will be followed by the authors, then it can work.
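A minimal Ruby sketch of that idea, assuming the text really does put two or more spaces after every sentence-ending mark. The method name split_on_double_space is made up for illustration:

```ruby
# Split on sentence-ending punctuation followed by two or more spaces.
# Assumption: the authors consistently put two spaces between sentences.
def split_on_double_space(text)
  # The lookbehind keeps the punctuation attached to its sentence.
  text.split(/(?<=[.!?]) {2,}/).map(&:strip).reject(&:empty?)
end

text = "Dr. Smith arrived late.  He apologized.  Was anyone upset?  No."
puts split_on_double_space(text)
```

"Dr. Smith" survives here only because a single space follows "Dr."; a sentence that ends at a line break rather than before two spaces would not be split by this pattern.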
_Kevin
-----Original Message-----
From: Matthew Smillie [mailto:[email protected]]
Sent: Tuesday, November 29, 2005 09:06 PM
To: ruby-talk ML
Subject: Re: Splitting a text file into sentences
Looking for ideas on how to split a text file into sentences. I see
the problem of basing the split on [.!?] -- they're also used in ways
other than to end a sentence. If I have to do manual pre-processing of
the text file, what editing might I do? Has anyone had to deal with
this problem and how did you make life easier for you?
Thanks for the help.
basi
Doing really, really good sentence boundary detection is an ongoing problem
in natural language processing. I'm not aware of any Ruby-based NLP
packages, but if you want better accuracy than just using [.!?:] there are
several free NLP packages around (NLTK in Python and Stanford's Java NLP
package spring to mind) that might help you.
A googling of "sentence tokenization" may also yield some help.
If that sounds like overkill, then you can get accuracy "good enough for
government work" by making a list of regular expressions to catch exceptions
to the punctuation rule. These will necessarily vary a little depending on
your source text, but typical examples are catching titles like "Mr.",
"Mrs.", "Dr.", and all-caps abbreviations like "U.S.A." or "M.D." (something
like this: /([A-Z]\.([A-Z]\.)+)/).
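One way the exception list could look in Ruby, as a rough sketch rather than a tested tokenizer. The names EXCEPTIONS, PLACEHOLDER and split_sentences are hypothetical, and the two patterns are only a starting point you would extend for your own text:

```ruby
# Hypothetical exception patterns; extend these for your own source text.
EXCEPTIONS = [
  /\b(?:Mr|Mrs|Ms|Dr)\./,    # common titles
  /\b[A-Z]\.(?:[A-Z]\.)+/    # all-caps abbreviations such as U.S.A. or M.D.
]

# Stand-in for a "protected" period; assumed never to occur in the input.
PLACEHOLDER = "\u0000"

def split_sentences(text)
  # Step 1: hide the periods inside every exception match.
  protected_text = EXCEPTIONS.inject(text) do |t, pattern|
    t.gsub(pattern) { |match| match.gsub(".", PLACEHOLDER) }
  end

  # Step 2: split after the remaining [.!?], then restore the hidden periods.
  protected_text
    .split(/(?<=[.!?])\s+/)
    .map { |s| s.gsub(PLACEHOLDER, ".").strip }
    .reject(&:empty?)
end

sample = "Dr. Jones moved to the U.S.A. in 1990. She earned her M.D. there! Mr. Lee disagreed."
puts split_sentences(sample)
```

The obvious weakness is an abbreviation that really does end a sentence ("She moved to the U.S.A. He stayed."): the protected period hides that boundary, so the two sentences come back as one.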
good luck,
matthew smillie.