Splitting a text file into sentences

Kevin Olbrich

Whatever the original reason for double spaces at the end of a sentence,
the practice continues.
In fact, MS Word has an option in its grammar checker to enforce one or two
spaces at the end of a sentence. For a lot of people (like me), it is
nothing more than an old habit that is hard to break.

The utility of this method for determining the end of a sentence depends
entirely on the purpose of the program. If I were to write a routine to
parse text that I wrote, it would probably work pretty well, and it would
save me several hours of work trying to implement a fancier, more robust
routine.

The same routine would probably fail horribly for other users or a more
generic corpus of text.
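
For what it's worth, the two-space approach fits in a few lines. This is only a minimal sketch in Python, not a routine anyone in this thread actually posted. The name `split_two_space` and the sample strings are made up for illustration, and it assumes a sentence ends in `.`, `?` or `!` followed by at least two spaces.

```python
import re

# Assumption: end punctuation followed by two or more spaces marks a boundary.
SENTENCE_END = re.compile(r'(?<=[.?!]) {2,}')

def split_two_space(text):
    """Split text wherever end punctuation is followed by two or more spaces."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

mine = "Dr. Smith arrived.  He was late.  Nobody minded."
theirs = "Dr. Smith arrived. He was late. Nobody minded."

print(split_two_space(mine))    # three sentences; "Dr." survives intact
print(split_two_space(theirs))  # one long "sentence": no double spaces to find
```

On double-spaced text the abbreviation problem mostly disappears, because "Dr." is followed by a single space. On single-spaced text the routine finds no boundaries at all.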

As a general rule, I like to use algorithms that are as simple as possible
for the job. That, of course, depends a lot on what the job is.

Funny, I never thought something like spacing between sentences would be so
controversial. I can almost envision _why making an esoteric remark about
the beauty of 'negative space' in text files.

_Kevin


-----Original Message-----
From: Austin Ziegler [mailto:[email protected]]
Sent: Wednesday, November 30, 2005 12:40 PM
To: ruby-talk ML
Subject: Re: Splitting a text file into sentences

Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.

Then, quite honestly, you were taught wrong. I was taught to use double
spaces with a typewriter or when using fixed-pitch fonts (although that was
later, since most computers and printers didn't have reliable kerning
routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong *even with
fixed-pitch fonts*, but it was done to be clearer since the width of the
em-space and an en-space on a typewriter with a Courier-like font is exactly
the same. The two spaces *simulate* an em-space in a typeset piece of work.
(And that is *fact*, not opinion.)

-austin
 
Jeffrey Schwab

Austin said:
Then, quite honestly, you were taught wrong. I was taught to use
double spaces with a typewriter or when using fixed-pitch fonts
(although that was later, since most computers and printers didn't
have reliable kerning routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong *even
with fixed-pitch fonts*, but it was done to be clearer since the width
of the em-space and an en-space on a typewriter with a Courier-like
font is exactly the same. The two spaces *simulate* an em-space in a
typeset piece of work. (And that is *fact*, not opinion.)

The Bedford Handbook, which has been my bible for writing conventions
through the past ten years, lists two sets of guidelines: Those
recommended by the Modern Language Association (MLA), and those
recommended by the American Psychological Association (APA). It says
that the MLA style is typically taught in English classes, but that the
APA style is common in the social sciences. Here is the explanation of
the MLA guidelines, from page 633 of the Bedford Handbook for Writers,
(c) 1994:


MLA Guidelines [for essays]:

In typing the text of the essay, leave one space after words, commas,
colons, and semicolons and between the dots in ellipsis marks. Leave
two spaces after periods, question marks, and exclamation points.
To form a dash, type two hyphens with no space between them. Do not
put a space on either side of a dash.


The Handbook goes on to say (p. 635):


Although the APA guidelines call for one space after all punctuation,
most college professors prefer two spaces at the end of a sentence. Use
one space after all other punctuation.
Although two spaces are used after a period that ends a sentence, use
only one space after a period that follows a person's initial (B.F.
Skinner).
To form a dash, type two hyphens with no space between them. Do not
put a space on either side of a dash.


The Handbook itself uses only single spaces at the ends of sentences.
Still, I hardly think there is one conclusively "right" or "wrong"
convention. Until I am convinced otherwise, I will continue to use two
spaces to separate sentences. This makes sentences easier to lex with
regular expressions, and makes them stand out to text editors and human
readers.
 
Jallan

Jeffrey said:
The Handbook itself uses only single spaces at the ends of sentences.
Still, I hardly think there is one conclusively "right" or "wrong"
convention. Until I am convinced otherwise, I will continue to use two
spaces to separate sentences. This makes sentences easier to lex with
regular expressions, and makes them stand out to text editors and human
readers.

"Right" or "wrong" in this kind of styling has to do with whether
something is right or wrong according to a particular convention.

The normal convention for professional typography is to use one space
between sentences, whether you are convinced or not, whether using hard
type, a professional typesetting program, a desktop publishing program,
or a word processing program.

The older typewriter conventions are still often requested for
manuscripts for academic essays and manuscripts for submission to
publishing houses. These conventions also require underlining rather
than italics, use of double-hyphen for a dash rather than the specific
dash character, and so forth. But should this same manuscript be
professionally printed, even if the text is actually to be set by a
word processor, it would almost certainly be edited first to convert it
to typographical standard: changing all double-spaces to single spaces,
all occurrences of double-hyphen to em-dash or en-dash, using fancy
quotation marks instead of possible straight typewriter quotation
marks, italics instead of underlining, and so forth.
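
Two of those conversions are mechanical enough to sketch. The snippet below is only an illustration in Python, under the assumption that the manuscript is plain text. The function name `manuscript_to_typeset` is made up, and quotation marks and underlining are left out because they need more context than a simple substitution has.

```python
import re

def manuscript_to_typeset(text):
    """Apply two of the manuscript-to-print conversions described above."""
    # Collapse every run of two or more spaces into a single space.
    text = re.sub(r' {2,}', ' ', text)
    # Turn an isolated double hyphen into an em-dash (U+2014); longer runs
    # of hyphens, such as separator lines, are left alone.
    text = re.sub(r'(?<!-)--(?!-)', '\u2014', text)
    return text

print(manuscript_to_typeset("It worked.  Mostly--but not always."))
```

Collapsing all runs of spaces would also flatten indentation and space-aligned tables, so a real converter would have to be more selective.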

Note that HTML has from the beginning automatically changed any
multiple runs of spaces into a single space when displaying text.

Yes, a convention of always using two spaces would make sentences
easier to lex with regular expressions. Similarly, enforcing one single
spelling of English throughout the world would make searches and
matches easier. However, it is philosophically unsound to ask that the
world change to fit particular data-processing routines, rather than
that data-processing routines be built to deal properly with
real-world situations.

If your lexing routine fails because many people don't end
non-paragraph-final sentences with double spaces, or do so only in
particular plain text files, it is the fault of your lexing routine for
failing to handle common formatting, unless your lexing is intended
to be a limited tool that works only with manuscript-formatted text.

The best general sentence lexing algorithm I've seen is the one set
forth by the Unicode Consortium at
http://www.unicode.org/reports/tr29/tr29-4.html#Sentence_Boundaries .
This is designed to work reasonably well in any language and writing
system supported by Unicode, not just in English.

Jallan
 
Dave Howell

I think "right" or "wrong" are a tad strong for most of the cases
cited. But as a professional book designer and typographer, there's
unquestionably "better" and "worse."

For improved legibility, inter-sentence space should generally be a bit
greater than inter-word space.

Typewriters only had one distance they could travel. Either 1/10th of
an inch ("Pica") or 1/12th ("Elite"). So the only way to add extra
space after a sentence was to double it. That's way too much extra
space, but it was generally better than the alternative. The real
problem was that the words were too far apart, not that the sentences
were too close, but again, the fixed spacing was already an abominable
situation.

Proportional type, dating all the way back to Gutenberg, would
generally use 1/3rd or 1/4th of the height of the type as the
inter-word spacing. This would usually work out to about the width of a
lower case "t" or "l".

When setting modern (by which you may also read "all type before
typewriters" as well) proportional type in fully justified form (left
and right margins both even), the spaces must be stretched out on a
line-by-line basis to fit. Really good typesetting programs (and really
good typesetters sticking little bits of lead between their words (and
I've done that, too)) will add more of the space between sentences than
between words, so as the line stretches, the inter-word space to
inter-sentence space ratio actually changes. (Take a look at a narrow
newspaper column sometime.)

More sophisticated approaches to space will ignore a user's attempt to
sprinkle extraneous space in. Less sophisticated ones might allow it,
and even treat them as individual spaces, stretching both of them
during expansion. {shudder}

The fact that both the MLA Guidelines and the Bedford Handbook
encourage poor typography is regrettable. ("If you cannot type
appropriate punctuation, e.g. an em-dash or en-dash, please use
appropriate substitutions. For both dashes, substitute a pair of
hyphens, which, like true dashes, are typed without adjacent spaces."
There's still software out there that will happily wrap a line between
the two hyphens. Ick!) Nevertheless, if you're submitting a paper to an
institution that expects or requires that, then to not follow them is
wrong, even if the legibility of the submission is better.

What it all boils down to is "Putting two spaces after a period at the
end of a sentence is an artifact left over from the days when the
typewriter was the prevalent text-making tool. Unless you have a
specific reason or requirement to do otherwise, it's preferable to put
only one space between sentences."

*****

For breaking text into sentences, sometimes I find it easier to work
backwards. Also, only very colloquial writing will have a one-word
sentence, so you can solve all "Mr./Dr./Ph.D." cases by the fact that
if a word starts with a cap and ends with a period, it's not a
sentence. For a more sophisticated approach that's still not too
complex to program, check the final word of a sentence against a
dictionary. If it's found there without a final dot, then you're almost
certainly looking at the end of a sentence. If it isn't, then is it
found anywhere else in the document without a dot? If not, then you're
probably looking at an abbreviation. (My mail program uses a monospaced
font. If I thought most readers would read it with a proportional font,
I'd have typed "Ph. D." above, since it should have a thin space before
the D.)
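
The dictionary check described above might look roughly like this in Python. It is a sketch under loud assumptions: `DICTIONARY` is a tiny made-up stand-in for a real word list, `classify_dotted_word` is a hypothetical name, and the caller is assumed to have already isolated the word in front of the period.

```python
import re

# Hypothetical stand-in for a real dictionary; a real program would load a word list.
DICTIONARY = {"the", "office", "was", "closed", "met", "yesterday", "saw"}

def classify_dotted_word(word, document, dictionary=DICTIONARY):
    """Guess what the period after `word` means (pass `word` without the dot)."""
    # Step 1: an ordinary dictionary word followed by a dot is almost
    # certainly the end of a sentence.
    if word.lower() in dictionary:
        return "sentence end"
    # Step 2: otherwise, look for the same word elsewhere in the document
    # with no dot right after it.
    undotted = re.compile(r'\b' + re.escape(word) + r'\b(?!\.)')
    if undotted.search(document):
        return "sentence end"
    # Step 3: never seen without its dot, so probably an abbreviation.
    return "abbreviation"

document = "Dr. Howell met Dave yesterday. The office was closed. I saw Dave."
print(classify_dotted_word("closed", document))  # sentence end
print(classify_dotted_word("Dave", document))    # sentence end
print(classify_dotted_word("Dr", document))      # abbreviation
```

The second step is what lets a name like "Dave" count as a sentence-final word even though no dictionary lists it.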
 
Gavin Sinclair

Austin said:
Then, quite honestly, you were taught wrong. I was taught to use
double spaces with a typewriter or when using fixed-pitch fonts
(although that was later, since most computers and printers didn't
have reliable kerning routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong *even
with fixed-pitch fonts*, but it was done to be clearer since the width
of the em-space and an en-space on a typewriter with a Courier-like
font is exactly the same. The two spaces *simulate* an em-space in a
typeset piece of work. (And that is *fact*, not opinion.)

What rot. How can anything like that be a fact? You're regurgitating
the opinion of a style manual.

Gavin
 
Matthew Smillie

you can solve all "Mr./Dr./Ph.D." cases by the fact that if a word
starts with a cap and ends with a period, it's not a sentence.

I'm not sure that's a very good rule, Dave. There are two sentences
here.

The above rule may catch titular abbreviations, but over-generalises
to produce a false negative in the above example. So in solving one
problem, you introduce another one. It's relatively easy to make
another rule to catch the problem in this case, but it would probably
have been simpler to just make a specific rule to eliminate titular
abbreviations, since there really aren't that many of them.
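
A specific rule of that kind could be sketched as follows. This is Python and purely illustrative: the `TITLES` list is an assumed, deliberately short sample, and `split_sentences` is a made-up name, not something from an existing library.

```python
import re

# Assumed, deliberately short list of titular abbreviations; extend as needed.
TITLES = {"Mr", "Mrs", "Ms", "Dr", "Prof"}

def split_sentences(text):
    """Split after . ? ! plus whitespace, unless the period belongs to a title."""
    sentences, start = [], 0
    for match in re.finditer(r'[.?!]\s+', text):
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        if text[match.start()] == "." and last_word in TITLES:
            continue  # "Dr." and friends never end a sentence under this rule
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # whatever follows the last boundary
    return sentences

print(split_sentences("I'm not sure that's a very good rule, Dave. "
                      "There are two sentences here."))
print(split_sentences("Dr. Smith met Mr. Jones. They talked."))
```

The price of the simplicity is that a sentence which genuinely ends in a listed title gets glued to the one after it.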

matthew smillie.
 
Jeffrey Schwab

Dave said:
I think "right" or "wrong" are a tad strong for most of the cases cited.
But as a professional book designer and typographer, there's
unquestionably "better" and "worse."

For improved legibility, inter-sentence space should generally be a bit
greater than inter-word space.

Typewriters only had one distance they could travel. Either 1/10th of an
inch ("Pica") or 1/12th ("Elite"). So the only way to add extra space
after a sentence was to double it. That's way too much extra space, but
it was generally better than the alternative. The real problem was that
the words were too far apart, not that the sentences were too close, but
again, the fixed spacing was already an abominable situation.

Proportional type, dating all the way back to Gutenberg, would generally
use 1/3rd or 1/4th of the height of the type as the inter-word spacing.
This would usually work out to about the width of a lower case "t" or "l".

When setting modern (by which you may also read "all type before
typewriters" as well) proportional type in fully justified form (left
and right margins both even), the spaces must be stretched out on a
line-by-line basis to fit. Really good typesetting programs (and really
good typesetters sticking little bits of lead between their words (and
I've done that, too)) will add more of the space between sentences than
between words, so as the line stretches, the inter-word space to
inter-sentence space ratio actually changes. (Take a look at a narrow
newspaper column sometime.)

More sophisticated approaches to space will ignore a user's attempt to
sprinkle extraneous space in. Less sophisticated ones might allow it,
and even treat them as individual spaces, stretching both of them during
expansion. {shudder}

The fact that both the MLA Guidelines and the Bedford Handbook encourage
poor typography is regrettable. ("If you cannot type appropriate
punctuation, e.g. an em-dash or en-dash, please use appropriate
substitutions. For both dashes, substitute a pair of hyphens, which,
like true dashes, are typed without adjacent spaces." There's still
software out there that will happily wrap a line between the two
hyphens. Ick!) Nevertheless, if you're submitting a paper to an
institution that expects or requires that, then to not follow them is
wrong, even if the legibility of the submission is better.

What it all boils down to is "Putting two spaces after a period at the
end of a sentence is an artifact left over from the days when the
typewriter was the prevalent text-making tool. Unless you have a
specific reason or requirement to do otherwise, it's preferable to put
only one space between sentences."

*****

For breaking text into sentences, sometimes I find it easier to work
backwards. Also, only very colloquial writing will have a one-word
sentence, so you can solve all "Mr./Dr./Ph.D." cases by the fact that if
a word starts with a cap and ends with a period, it's not a sentence.
For a more sophisticated approach that's still not too complex to
program, check the final word of a sentence against a dictionary. If
it's found there without a final dot, then you're almost certainly
looking at the end of a sentence. If it isn't, then is it found anywhere
else in the document without a dot? If not, then you're probably looking
at an abbreviation. (My mail program uses a monospaced font. If I
thought most readers would read it with a proportional font, I'd have
typed "Ph. D." above, since it should have a thin space before the D.)

This is what I love about Usenet. :)
 
Austin Ziegler

What rot. How can anything like that be a fact? You're regurgitating
the opinion of a style manual.

Um. No, I'm stating fact. This isn't mere opinion: two spaces were done
to simulate em-spaces in fixed pitch environments. That's a fact. The
reason for that may often be forgotten, but it *remains* a fact. Please
remember that I've done quite a bit of typesetting-style work in the
last year with PDF::Writer and I have to know a bit more about this than
most folks, and it's something of a hobby of mine in any case to know
about printing mechanisms.

The only *opinion* I stated was that the first poster in the chain above
(I think Jeffrey) was taught wrongly. I maintain that as true
regardless, because if he was taught two spaces without the reason why,
then there's a practice being repeated for no good reason.

The practice is nonsense these days in most contexts.

-austin
 
Shot - Piotr Szotkowski

Hello.

Dave Howell:
For improved legibility, inter-sentence space should
generally be a bit greater than inter-word space.

It's worth noting that actually turning this theory into reality seems
to apply to 'Western' (American, British, others?) typography (mostly?
only?).

I've yet to see a typical modern Polish book typeset with greater
inter-sentence spaces. Also (and, I guess, as a result of this),
I doubt I ever saw any Polish email or Usenet post with two
inter-sentence spaces, and I remember how happy I was to find
out about the 'joinspaces' vim option that finally let me reflow
paragraphs properly, without doing a s/  / /g on them afterwards. :o)

Cheers,
-- Shot
--
He has never been known to use a word that might send a reader
to the dictionary. -- William Faulkner on Ernest Hemingway
====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===

 
Dave Howell

I'm not sure that's a very good rule, Dave. There are two sentences
here.

The above rule may catch titular abbreviations, but over-generalises
to produce a false negative in the above example.

I hadn't intended to provide a single magical rule that was perfect in
isolation, after all. {chuckle}


"Ph. D." is not a sentence. But where do you break
My name is Dave, Ph. D. Pleased to meet you.
vs.
You need my Ph. D. friend Dave to help you.

I don't think having a list of abbreviations and titles will improve
that situation much, although it's a lot more work and almost certain
to be incomplete. Any/every rule will have failures; avoiding them is
what takes you into that whole natural language high-octane engine
situation.

However, if you also use the *other* "rule" I mentioned, then you don't
have a problem. "Dave Howell" appears just a couple lines earlier,
establishing "Dave" as a word that doesn't require a period. Therefore,
it's more likely to be at the end of a sentence. The following word
("There") can be found in a dictionary, and in a non-capitalized form,
which means that its capitalization here following a dot strongly
indicates that it's beginning a sentence.

The capital "P" of "Ph." is not preceded by a period either time, so
it's not starting a sentence. After it, "friend" isn't capitalized, so
it's not ending a sentence. But "Pleased" is, and dictionary says "not
normally capitalized" so that's probably a sentence break.
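
Combining the two kinds of evidence might go something like the sketch below. It is Python and only a rough illustration: `LOWERCASE_WORDS` is a hypothetical mini-dictionary of words that are normally uncapitalized, `is_sentence_break` is a made-up name, and treating either piece of evidence as sufficient is an assumption of the sketch rather than part of the reasoning above.

```python
import re

# Hypothetical mini-dictionary of words that are normally written in lower case.
LOWERCASE_WORDS = {"there", "pleased", "friend", "to", "meet", "you", "my",
                   "name", "is", "need", "help"}

def is_sentence_break(prev_word, next_word, document,
                      lowercase_words=LOWERCASE_WORDS):
    """Guess whether the period after prev_word (given without its dot)
    ends a sentence, judging by the word that follows it."""
    if not next_word[:1].isupper():
        return False  # a lower-case follower means the sentence continues
    # Evidence 1: the follower is normally lower case, so its capital letter
    # is best explained by a sentence starting here.
    next_is_ordinary = next_word.strip('.,;:!?').lower() in lowercase_words
    # Evidence 2: prev_word shows up elsewhere in the document without a dot,
    # so it is a word that does not require a period.
    pattern = r'\b' + re.escape(prev_word) + r'\b(?!\.)'
    prev_seen_undotted = re.search(pattern, document) is not None
    return next_is_ordinary or prev_seen_undotted

doc = ("My name is Dave, Ph. D. Pleased to meet you. "
       "You need my Ph. D. friend Dave to help you.")
print(is_sentence_break("D", "Pleased", doc))  # True
print(is_sentence_break("D", "friend", doc))   # False
print(is_sentence_break("Ph", "D.", doc))      # False
```

A proper noun after an abbreviation ("Dr. Smith") is treated as a continuation here, since "Smith" is not in the lower-case list and "Dr" never appears without its dot.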
 
Gavin Sinclair

Austin said:
[...] The two spaces *simulate* an
em-space in a typeset piece of work. (And that is *fact*, not
opinion.)
What rot. How can anything like that be a fact? You're regurgitating
the opinion of a style manual.

Um. No, I'm stating fact. This isn't mere opinion: two spaces were done
to simulate em-spaces in fixed pitch environments. That's a fact. [...]

Fair enough.

Gavin
 
Matthew Smillie

I hadn't intended to provide a single magical rule that was perfect
in isolation, after all. {chuckle}

Didn't assume you were! It was just a good example to use for a
"this can be harder than it looks" couple of lines of warning, since
it's been my experience that people don't anticipate false negatives
as well as they do false positives.


matthew smillie.
 
basi

All was well with this strategy, until I hit a sentence similar to:

The abbreviation for Mister is Mr.
The head office is in New York, N.Y.

In other words, abbreviations that end a sentence. These sentences
don't end with a double dot, so if we replace Mr. with $MISTER$, the
sentence has no end marker.
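
To make the failure concrete, here is a rough Python reconstruction of the placeholder idea. It is a guess at the general shape, not the actual code under discussion: the `PLACEHOLDERS` table (including the `$NEWYORK$` token) and the name `split_with_placeholders` are invented for the example.

```python
import re

# Assumed placeholder table, modelled on the $MISTER$ example above.
PLACEHOLDERS = {"Mr.": "$MISTER$", "N.Y.": "$NEWYORK$"}

def split_with_placeholders(text):
    """Mask known abbreviations, split on periods, then restore them."""
    for abbreviation, token in PLACEHOLDERS.items():
        text = text.replace(abbreviation, token)
    # Each chunk is a run of non-period characters plus its closing period.
    parts = [p.strip() for p in re.findall(r'[^.]+\.?', text) if p.strip()]
    restored = []
    for part in parts:
        for abbreviation, token in PLACEHOLDERS.items():
            part = part.replace(token, abbreviation)
        restored.append(part)
    return restored

ok = "Mr. Smith is here. He is early."
bad = ("The abbreviation for Mister is Mr. "
       "The head office is in New York, N.Y. It opens at nine.")

print(split_with_placeholders(ok))        # two sentences, as hoped
print(len(split_with_placeholders(bad)))  # 1: the masked periods were the only end markers
```

Once "Mr." and "N.Y." are masked, the first two sentences of `bad` have no period left to split on, so all three come back as a single chunk.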

Hmmm.
basi
 
Adam i Agnieszka Gasiorowski FNORD

I hadn't intended to provide a single magical rule that was perfect in
isolation, after all. {chuckle}
Want some magick? You are stuck in wrong coordinate
system, like Newton. Stop thinking in terms of words and syntax
rules governing how to put them in correct order. Think
links (alinka). Think relations and revelation.
Words (symbols) have no meaning. None. They *are* empty.
If you want to infiltrate enemy organization the most
effective method is not drilling into individual agents,
but monitoring their communications (that is, relations).
If you acquire enough of those relations (and recursively,
but set some boundary unless you are Goddess and can
do anything you fancy) you don't even need to decrypt the
messages, unless you are bored. To destroy enemy
organization, mess with the relations. Agents (symbols,
words, punctuation marks...) are of no importance
whatsowherever. That is why a person, if immersed enough
in an alien language needs no dictionary day-to-day - if one does
need to check, it's not the meaning you are after -
it's definition, that is MORE SYMBOLS, so you can
augment MORE RELATIONS from unfamiliar context (SYMBOL
CLOUD, think quantum mechanics and particles) until
you actually GET the pointer to "meaning" and can call on it
(how to relate that
symbol to some other symbol mesh, you can still have
no idea what the hell fermion "means", but you can use it and
fail to be misunderstood unless you want to).

I have no idea how many "syntax errors" there are
in above paragraph - for the reason sublime, my total
lack of knowledge aboot rules of grammar for the
language used to convey meaning heretofore. HTH.

P.S. It makes me wonder, what 't bony "heretofore" word
"means" right now to you, Reader. Compose witty remarks if
it's a-kind funny miss-take, I enjoy my Self when people
smirk. Yes, I did stick-in a word possessing none of it's
meaning in my poor head. I must be mad? Or contrary-wise.
I'm not sure, to be frank with you a-like Frank
Herbert iff there was such word in usage "then". She
will compensate for that - any dictionary dug
up shall (she can't help it) explain in detail or else - she always does
that when I go at a genuine miracle in open source. It's
the game we play. I need some time, we make
a beautiful team... Prop me up with another
pill! A-musing...

-- 
I am the One. I am A vampire A-calling for your love! A.A!
I am the fire that burns within your blood. I am the One!!
No bars or chains can keep me from your bed! I am the One!
Nothing on earth can get me from your head! I am the One!!
 
Adam i Agnieszka Gasiorowski FNORD

He has never been known to use a word that might send a reader
to the dictionary. -- William Faulkner on Ernest Hemingway

Now, that is a wise one - it actually helps
to comprehend my jabber in the other post I
spontaneously generated today...

-- 
I am the One. I am A vampire A-calling for your love! A.A!
I am the fire that burns within your blood. I am the One!!
No bars or chains can keep me from your bed! I am the One!
Nothing on earth can get me from your head! I am the One!!
 
