Splitting a text file into sentences

basi · Nov 29, 2005

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

Matthew Smillie · Nov 29, 2005

Doing really, really good sentence boundary detection is an on-going
problem in natural language processing. I'm not aware of any Ruby-based
NLP packages, but if you want better accuracy than just using
[.!?:] there are several free NLP packages around (NLTK in Python,
and Stanford's Java NLP package spring to mind) that might help you.
A googling of "sentence tokenization" may also yield some help.

If that sounds like overkill, then you can get accuracy "good enough
for government work" by making a list of regular expressions to catch
exceptions to the punctuation rule. These will necessarily vary a
little depending on your source text, but typical examples are
catching titles like "Mr.", "Mrs.", "Dr.", and all-caps abbreviations
like "U.S.A." or "M.D." (something like this: /([A-Z]\.([A-Z]\.)+)/).
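A minimal Ruby sketch of the exception-list idea described above. The pattern list and the name split_sentences are illustrative assumptions, not something from an existing library; the list would need tuning for real text.

```ruby
# Hypothetical exception patterns; extend them for your own source text.
EXCEPTIONS = [
  /\b(?:Mr|Mrs|Ms|Dr)\.\z/,   # titles
  /\b[A-Z]\.(?:[A-Z]\.)+\z/   # all-caps abbreviations such as U.S.A. or M.D.
].freeze

def split_sentences(text)
  sentences = []
  start = 0
  # Visit every run of [.!?] that is followed by whitespace or end of text.
  text.scan(/[.!?]+(?=\s|\z)/) do
    stop = Regexp.last_match.end(0)
    candidate = text[start...stop]
    # Do not break when the text so far ends in a known exception.
    next if EXCEPTIONS.any? { |re| candidate =~ re }
    sentences << candidate.strip
    start = stop
  end
  # Keep whatever is left after the last accepted boundary.
  rest = text[start..-1].to_s.strip
  sentences << rest unless rest.empty?
  sentences
end

puts split_sentences("Dr. Smith moved to the U.S.A. in 1999. He is an M.D. now! Really?")
```

One known weakness of this sketch: a sentence that genuinely ends in a listed abbreviation is not split at that point.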

good luck,
matthew smillie.

Kevin Olbrich · Nov 29, 2005

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It's not robust, but if you know that
convention will be followed by the authors, then it can work.
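
If the two-space convention really does hold for a given text, the split can be sketched in Ruby as below. The method name split_on_double_space is only an illustrative assumption.

```ruby
# Split after sentence punctuation only when it is followed by two or more
# spaces. Assumes the authors consistently put two spaces between sentences.
def split_on_double_space(text)
  text.split(/(?<=[.!?]) {2,}/)
end

puts split_on_double_space("Mr. Smith arrived.  He sat down.  Why?")
```

A single space after "Mr." is left alone, so abbreviations survive as long as nobody double-spaces after them.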

_Kevin

Nicholas Van Weerdenburg · Nov 29, 2005

I dimly recall something on this list about 9 months ago or so.

Nick

Nicholas Van Weerdenburg · Nov 29, 2005

http://www.pressure.to/ruby/ is the reference I found in an old email thread
on this list.

Nick

Jeffrey Schwab · Nov 29, 2005

basi said:
Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?

It's a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.

basi · Nov 29, 2005

Hi,
I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
split-sentence algorithm of some sort. And thanks for the tip on Stanford's
Java NLP package. Yes, those abbreviations are pesky, and I may have to
resort to an exceptions list containing the most common ones.
Thanks much,
basi

basi · Nov 29, 2005

Hi,
I will google. Thanks!
basi

basi · Nov 29, 2005

Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of its advice was to use only one space to separate
sentences. The reason given is that in a justified format, those two
spaces can become four spaces, or even more. Anyway, a lot of text now
has one or two spaces between sentences, and this wouldn't be a
reliable indicator of sentence boundary.
Cheers!
basi

basi · Nov 29, 2005

Hi,
This looks promising. I'm downloading as I write.
Thanks!
basi

Ryan Leavengood · Nov 29, 2005

Yes, I learned this convention when I took a keyboarding (i.e., typing)
lesson in high school. Some time ago, a style manual for word processing
appeared, and one piece of its advice was to use only one space to separate
sentences. The reason given is that in a justified format, those two
spaces can become four spaces, or even more. Anyway, a lot of text now
has one or two spaces between sentences, and this wouldn't be a
reliable indicator of sentence boundary.

I too learned the two-spaces-after-a-period convention years ago and
also recently learned that with modern fonts and word processors it is
not necessary. It was tricky to retrain myself, but I did, and have
been using just one space ever since.

So like you say, that isn't a reliable way to discern sentences.

I would recommend following the advice of first filtering out false
positives (possibly even replacing them with temporary markers, Mr.
becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.
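
The mask-then-split idea above might look roughly like this in Ruby. Instead of word markers such as $MISTER$, this sketch swaps the periods inside known abbreviations for a placeholder character and restores them afterwards; the abbreviation list, the placeholder, and the method name are all assumptions for illustration.

```ruby
# Hypothetical list of false positives; grow it as sample texts reveal more.
ABBREVIATIONS = %w[Mr. Mrs. Ms. Dr. e.g. i.e. etc.].freeze
DOT = "\u0000" # placeholder, assumed never to occur in the input

def split_with_markers(text)
  # Step 1: hide the periods that belong to known abbreviations.
  masked = ABBREVIATIONS.reduce(text) do |acc, abbr|
    acc.gsub(/(?<![[:alnum:]])#{Regexp.escape(abbr)}/) { |m| m.tr(".", DOT) }
  end
  # Step 2: split on the remaining punctuation, then restore the periods.
  masked.split(/(?<=[.!?])\s+/).map { |s| s.tr(DOT, ".") }
end

puts split_with_markers("Mr. Jones met Dr. Lee, i.e. the surgeon. They talked! Then they left.")
```

Running it over a few sample texts and adding each newly found false positive to the list is the testing loop suggested above.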

Ryan

basi · Nov 30, 2005

Hi,
This just might be easier than what I have in mind. I will try this
first.
Thanks!
basi

Damphyr · Nov 30, 2005

Ryan said:
I too learned the two space after a period convention years ago and
also recently learned that with modern fonts and word processors it
is not necessary. It was tricky to retrain myself, but I did, and
have been using just one space ever since.

So like you say, that isn't a reliable way to discern sentences.

I would recommend following the advice of first filtering out false
positives (possibly even replacing them with temporary markers, Mr.
becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.

Which will not help you at all with foreign languages. And don't forget
to put i.e., e.g. or etc. in the list.
This is an ongoing problem (think about the auto-correction 'feature' of
capitalizing the first letter of every sentence in OpenOffice or Word -
something I always turn off because it is so insistent when it's wrong)
Cheers,
V.-
--
http://www.braveworld.net/riva

Edwin van Leeuwen · Nov 30, 2005

basi_lio said:
Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?
Thanks for the help.
basi

If you make a regexp: [.!?]\s+[A-Z] you will already capture most. Most
abbreviations normally aren't followed by a space/capital letter.

One exception to this rule that I can think of is Mr. Name, Mrs. Name. But
as you can see, these have an <uppercase> followed by only one or two
lowercase letters. Most sentences would have at least five non-uppercase
characters in front of the <.>, so a pattern like this catches the short ones:
[A-Z]\w\w?\w?\w?\.
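
A rough Ruby sketch of this heuristic: break on [.!?] followed by whitespace and a capital letter, unless the period directly follows a short capitalised word. The guard here only covers a capital plus one or two lowercase letters (as in Mr. or Mrs.), which is narrower than the pattern above; that choice and the method name are assumptions.

```ruby
def split_on_capital(text)
  sentences = []
  start = 0
  # Candidate boundary: punctuation, whitespace, then an uppercase letter.
  text.scan(/[.!?]\s+(?=[A-Z])/) do
    m = Regexp.last_match
    head = text[start...m.begin(0) + 1] # up to and including the punctuation
    # A capital followed by one or two lowercase letters looks like a title.
    next if head =~ /\b[A-Z][a-z]{1,2}\.\z/
    sentences << head.strip
    start = m.end(0)
  end
  # Keep whatever is left after the last accepted boundary.
  rest = text[start..-1].to_s.strip
  sentences << rest unless rest.empty?
  sentences
end

puts split_on_capital("He met Mr. Smith and Mrs. Jones at 5 p.m. on Friday. They spoke to Dr. Brown. Then they left.")
```

The guard also swallows real boundaries after short capitalised words, for example a sentence ending in a name like "Lee.", so it trades one kind of error for another.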

Austin Ziegler · Nov 30, 2005

Depending on the text you might be able to search for a period (or other
punctuation) followed by two spaces. It's not robust, but if you know that
convention will be followed by the authors, then it can work.

That, in fact, is a very *bad* metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space used between sentence endings
in typeset work is a little wider than that used between words (an
em-space vs. an en-space).

-austin

Austin Ziegler · Nov 30, 2005

basi said:

Looking for ideas on how to split a text file into sentences. I see the
problem of basing the split on [.!?] -- they're also used in ways other
than to end a sentence. If I have to do manual pre-processing of the
text file, what editing might I do? Has anyone had to deal with this
problem and how did you make life easier for you?

It's a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.

As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should *never* use two spaces.

-austin

Austin Ziegler · Nov 30, 2005

I have looked at NLTK in Python (and had been hoping a Rubyist would
rewrite it in Ruby). I will go back to NLTK and see if it has a
split-sentence algorithm of sort. And thanks for the tip on Stanfords
Java NLP package. Yes, those abbreviations are pesky, and I may have to
resort to an exceptions list containing the most common ones.

Look at Text::Format for some indication on how abbreviations could be handled.

-austin

Jeffrey Schwab · Nov 30, 2005

Austin said:
That, in fact, is a very *bad* metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space used between sentence endings
in typeset work is a little wider than that used between words (an
em-space vs. an en-space).

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

James Edward Gray II · Nov 30, 2005

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

Many of us were and I'll admit that I can't shake the habit. I still
know it's wrong though.

James Edward Gray II

Austin Ziegler · Nov 30, 2005

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

Then, quite honestly, you were taught wrong. I was taught to use
double spaces with a typewriter or when using fixed-pitch fonts
(although that was later, since most computers and printers didn't
have reliable kerning routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong *even
with fixed-pitch fonts*, but it was done to be clearer since the width
of the em-space and an en-space on a typewriter with a Courier-like
font is exactly the same. The two spaces *simulate* an em-space in a
typeset piece of work. (And that is *fact*, not opinion.)

-austin
