Splitting a text file into sentences

basi

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi
 
Matthew Smillie

basi said:
Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi


Doing really, really good sentence boundary detection is an ongoing
problem in natural language processing. I'm not aware of any Ruby-based
NLP packages, but if you want better accuracy than just splitting on
[.!?:], there are several free NLP packages around (NLTK in Python and
Stanford's Java NLP package spring to mind) that might help you.
Googling "sentence tokenization" may also turn up some help.

If that sounds like overkill, you can get accuracy "good enough
for government work" by making a list of regular expressions to catch
exceptions to the punctuation rule. These will necessarily vary a
little depending on your source text, but typical examples are
titles like "Mr.", "Mrs.", and "Dr.", and all-caps abbreviations
like "U.S.A." or "M.D." (something like this: /([A-Z]\.([A-Z]\.)+)/).
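
A minimal Ruby sketch of the exception-list idea might look like the following. The EXCEPTIONS list and the split_sentences name are assumptions made up for illustration, not part of any library, and the list would need extending for real text.

```ruby
# Hypothetical exception list: words that end in a period
# but usually do not end a sentence.
EXCEPTIONS = [
  /\A(?:Mr|Mrs|Dr)\.\z/,      # titles
  /\A[A-Z]\.(?:[A-Z]\.)+\z/   # all-caps abbreviations such as U.S.A. or M.D.
].freeze

# Walks the text word by word and closes a sentence whenever a word
# ends in . ! or ? and is not on the exception list.
def split_sentences(text)
  sentences = []
  current = []
  text.split.each do |word|
    current << word
    next unless word.match?(/[.!?]\z/)
    next if EXCEPTIONS.any? { |re| word.match?(re) }
    sentences << current.join(" ")
    current = []
  end
  sentences << current.join(" ") unless current.empty?
  sentences
end

puts split_sentences("Dr. Smith moved to the U.S.A. in 1999. He is an M.D. now! Is that right?")
```

Note the trade-off: a sentence that really does end in "U.S.A." will be joined to the one after it, so this only gets you to "good enough".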

good luck,
matthew smillie.
 
Kevin Olbrich

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It's not robust, but if you know that
convention will be followed by the authors, then it can work.
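
As a rough sketch of that idea in Ruby (the method name is made up for illustration, and it assumes the authors really do put two or more spaces after every sentence):

```ruby
# Splits wherever . ! or ? is followed by two or more spaces.
# The lookbehind keeps the punctuation attached to its sentence.
def split_on_double_space(text)
  text.split(/(?<=[.!?]) {2,}/)
end

puts split_on_double_space("Mr. Smith left.  He came back!  Why?")
```

A single space after "Mr." is left alone, which is the whole appeal of the convention. Sentences that end at a line break are not handled here.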

_Kevin

 
Nicholas Van Weerdenburg

basi said:
Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

I dimly recall something on this list about 9 months ago or so.

Nick
 
Nicholas Van Weerdenburg

http://www.pressure.to/ruby/ is the reference I found in an old email thread
on this list.

Nick
 
Jeffrey Schwab

basi said:
Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?

It's a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.
 
basi

Hi,
I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
sentence-splitting algorithm of some sort. And thanks for the tip on
Stanford's Java NLP package. Yes, those abbreviations are pesky, and I
may have to resort to an exceptions list containing the most common ones.
Thanks much,
basi
 
basi

Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of its advice was to use only one space to
separate sentences. The reason given was that in justified text, those
two spaces can stretch to four spaces or even more. Anyway, a lot of
text now has either one or two spaces between sentences, so double
spacing wouldn't be a reliable indicator of a sentence boundary.
Cheers!
basi
 
Ryan Leavengood

basi said:
Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of its advice was to use only one space to
separate sentences. The reason given was that in justified text, those
two spaces can stretch to four spaces or even more. Anyway, a lot of
text now has either one or two spaces between sentences, so double
spacing wouldn't be a reliable indicator of a sentence boundary.

I too learned the two-spaces-after-a-period convention years ago, and
also recently learned that with modern fonts and word processors it is
not necessary. It was tricky to retrain myself, but I did, and have
been using just one space ever since.

So, as you say, that isn't a reliable way to discern sentences.

I would recommend following the advice of first filtering out false
positives (possibly even replacing them with temporary markers, Mr.
becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.
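
The marker approach could be sketched in Ruby roughly as follows. The MARKERS table and the method name are hypothetical, chosen only to illustrate the idea, and the table would grow as testing turns up more false positives.

```ruby
# Hypothetical table of false positives and their temporary markers.
MARKERS = {
  'Mr.'  => '$MISTER$',
  'Mrs.' => '$MISSUS$',
  'Dr.'  => '$DOCTOR$'
}.freeze

def split_with_markers(text)
  # Step 1: hide the known false positives behind markers.
  masked = text.dup
  MARKERS.each { |abbr, marker| masked.gsub!(abbr, marker) }

  # Step 2: split after sentence punctuation followed by whitespace.
  pieces = masked.split(/(?<=[.!?])\s+/)

  # Step 3: put the original abbreviations back.
  pieces.map do |piece|
    MARKERS.each { |abbr, marker| piece = piece.gsub(marker, abbr) }
    piece
  end
end

puts split_with_markers("Mr. Jones met Dr. Lee. They talked. Mrs. Lee left!")
```

This assumes the marker strings never occur in the source text; if they might, pick markers that cannot.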

Ryan
 
basi

Hi,
This just might be easier than what I have in mind. I will try this
first.
Thanks!
basi
 
Damphyr

Ryan said:
I too learned the two-spaces-after-a-period convention years ago, and
also recently learned that with modern fonts and word processors it
is not necessary. It was tricky to retrain myself, but I did, and
have been using just one space ever since.

So, as you say, that isn't a reliable way to discern sentences.

I would recommend following the advice of first filtering out false
positives (possibly even replacing them with temporary markers, Mr.
becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.

That will not help you at all with foreign languages. And don't forget
to put "i.e.", "e.g.", and "etc." in the list.
This is an ongoing problem (think of the auto-correction 'feature' in
OpenOffice or Word that capitalizes the first letter of every sentence,
which I always turn off because it is so insistent when it's wrong).
Cheers,
V.-
--
http://www.braveworld.net/riva

 
Edwin van Leeuwen

basi_lio said:
Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

If you use a regexp like [.!?]\s+[A-Z], you will already catch most
sentence boundaries. Most abbreviations aren't followed by whitespace
and a capital letter.

One exception to this rule that I can think of is "Mr. Name" or
"Mrs. Name". But as you can see, these have an uppercase letter followed
by only one or two lowercase letters. Most sentences have at least five
non-uppercase characters in front of the period, so a short capitalized
word before a period can be caught with:
[A-Z]\w\w?\w?\w?\.
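
Put together in Ruby, the two patterns might be used like this. This is only a sketch: the constant and method names are invented, and the \b and \z anchors on the second pattern are an added assumption so that it tests just the word right before the candidate boundary.

```ruby
# Candidate boundary: end punctuation, whitespace, then a capital letter.
BOUNDARY = /[.!?]\s+(?=[A-Z])/
# Short capitalized word followed by a period, e.g. "Mr." or "Mrs."
# (the \b and \z anchors are an assumption added for this sketch).
SHORT_CAPITALIZED = /\b[A-Z]\w\w?\w?\w?\.\z/

def split_by_capital(text)
  sentences = []
  start = 0
  text.scan(BOUNDARY) do
    match = Regexp.last_match
    # Text from the current start up to and including the punctuation.
    candidate = text[start..match.begin(0)]
    # Skip boundaries that directly follow a short capitalized word.
    next if candidate.match?(SHORT_CAPITALIZED)
    sentences << candidate
    start = match.end(0)
  end
  rest = text[start..-1]
  sentences << rest unless rest.nil? || rest.empty?
  sentences
end

puts split_by_capital("Mr. Smith arrived yesterday. He brought e.g. apples. Nobody complained!")
```

The heuristic has an obvious cost: a sentence that ends in a short capitalized word ("I met Bob.") is glued to the next one.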
 
Austin Ziegler

Kevin said:
Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It's not robust, but if you know that
convention will be followed by the authors, then it can work.

That, in fact, is a very *bad* metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space between sentence endings
in typeset work is a little wider than the space between words (an
em-space vs. an en-space).

-austin
 
Austin Ziegler

basi said:
Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?

Jeffrey said:
It's a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.

As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should *never* use two spaces.

-austin
 
Austin Ziegler

basi said:
I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
sentence-splitting algorithm of some sort. And thanks for the tip on
Stanford's Java NLP package. Yes, those abbreviations are pesky, and I
may have to resort to an exceptions list containing the most common ones.

Look at Text::Format for some indication of how abbreviations could be handled.

-austin
 
Jeffrey Schwab

Austin said:
That, in fact, is a very *bad* metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space between sentence endings
in typeset work is a little wider than the space between words (an
em-space vs. an en-space).

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.
 
James Edward Gray II

Jeffrey said:
Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

Many of us were, and I'll admit that I can't shake the habit. I still
know it's wrong though. ;)

James Edward Gray II
 
Austin Ziegler

Jeffrey said:
Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

Then, quite honestly, you were taught wrong. I was taught to use
double spaces with a typewriter or when using fixed-pitch fonts
(although that was later, since most computers and printers didn't
have reliable kerning routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong *even
with fixed-pitch fonts*, but it was done for clarity, since an
em-space and an en-space on a typewriter with a Courier-like font
are exactly the same width. The two spaces *simulate* an em-space in
typeset work. (And that is *fact*, not opinion.)

-austin
 
