[QUIZ] Proper Case (#89)

Ruby Quiz

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by elliot temple

sometimes i type in all or mostly lowercase. a friend of mine says it's hard to
read essays with no capital letters. so the problem is to write a method which
takes a string (which could include many paragraphs), and capitalizes words that
should be capitalized. at minimum it should do the starts of sentences.

solutions could range from simple (a few regexes) to complex (lots of special
cases are possible, like abbreviations that use a period). an addition would be
using a dictionary to find proper nouns and capitalize those. it could also ask
the user about cases the program can't figure out. or log them.

i can provide an example solution (regex based) and a list of reasons it doesn't
work very well, if you want.

sample input:

- this email itself works nicely

- this one is hard. sometimes i might want to write about gsub vs. gsub! without
the "." or "!" causing any capitalization (or the punctuation in quotes).

one problem is maybe dealing with sentences that contain periods is too hard. i
don't know.
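
For reference, the simple end of the range described in the quiz (a few regexes plus a short list of period-bearing abbreviations) might look like the sketch below. The method name `capitalize_sentences` and the `ABBREVIATIONS` list are made up for illustration and are not taken from any posted solution.

```ruby
# Hypothetical list of abbreviations whose period does not end a sentence.
ABBREVIATIONS = %w[vs. etc. e.g. i.e.].freeze

def capitalize_sentences(text)
  # Capitalize the first letter of the whole text.
  result = text.sub(/\A(\s*)([a-z])/) { "#{$1}#{$2.upcase}" }
  # Look at every lowercase letter that follows ., ! or ? plus whitespace.
  result.gsub(/([.!?])(\s+)([a-z])/) do
    mark, gap, letter = $1, $2, $3
    # Rebuild the word that carries the punctuation mark from the text
    # preceding the match, e.g. "vs" + "." => "vs.".
    word = Regexp.last_match.pre_match[/\S*\z/] + mark
    letter = letter.upcase unless ABBREVIATIONS.include?(word.downcase)
    "#{mark}#{gap}#{letter}"
  end
end

puts capitalize_sentences("this one is hard. i write about gsub vs. sub a lot. ok?")
```

This copes with "vs." but would still treat the "!" in "gsub!" as the end of a sentence, which is exactly the hard case the quiz points at.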
 
Mike Harris

Ruby said:
*snip*

It would be nice if you could assume two spaces after an end of sentence
with punctuation. Generally I think that's correct grammar, although my
grammar stinks so I could easily be wrong. If you have to get into
parsing incorrect grammar it becomes much more difficult.
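
As a rough illustration of how much the two-space assumption would simplify things, here is a minimal sketch. The method name `capitalize_two_space_sentences` is hypothetical, and the rule it encodes is only the assumption stated above, not an established standard.

```ruby
# Assumption: a sentence ends with ., ! or ? followed by at least two spaces
# or a line break; a single space after a period (as in "vs. ") is ignored.
def capitalize_two_space_sentences(text)
  text.gsub(/(\A|[.!?](?: {2,}|\n+))([a-z])/) { "#{$1}#{$2.upcase}" }
end

puts capitalize_two_space_sentences("gsub vs. gsub! are methods.  they differ.")
```

Under that assumption "vs." and "gsub!" stop being a problem, but any text typed with single spaces gets no capitals beyond the very first letter.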
 
James Edward Gray II

It would be nice if you could assume two spaces after an end of
sentence with punctuation. Generally I think that's correct
grammar, although my grammar stinks so I could easily be wrong. If
you have to get into parsing incorrect grammar it becomes much more
difficult.

Actually, that's an old typographical convention that we can't seem
to shake. Here's a report that talks a little about the issue:

http://webword.com/reports/period.html

Here's an explanation from that report:

The only reason that two spaces were used after a period during the
'typewriter' age was because original typewriters had monospaced
fonts -- the extra space was needed for the eye to pick up on the
beginning of a new sentence. That need is negated w/proportional
space type, hence [it is] the typographic standard.

James Edward Gray II
 
transfire

i say your friend is just being hard headed. we don't need no stinking
caps! ;-)

t.
 
Matthew Smillie

It would be nice if you could assume two spaces after an end of
sentence with punctuation. Generally I think that's correct
grammar, although my grammar stinks so I could easily be wrong. If
you have to get into parsing incorrect grammar it becomes much more
difficult.

It's not correct grammar, just a typographical convention; one which
is sort of semi-obsolete and regularly gives rise to great debate in
typographical circles over its perceived rightness, wrongness, and
pragmatic value.

That isn't to say you shouldn't use it, since it'll be very accurate
in the general case, but redefining the problem to say "anything that
doesn't use two spaces is wrong" is a bit of a dodge.
 
jason r tibbetts

My day job is developing natural language processing apps, and we've had
to implement a similar case-correcting tool. What we found is that a
simple regex-based approach is correct about 90% of the time. When we
used machine learning to do the same thing, the results went up to about
95%. Compare this to human performance (i.e. have two or more people
manually correct a text, then compare how often their corrections were
in agreement), which was, IIRC, about 97%.

It would be nice if you could assume two spaces after an end of sentence
with punctuation. Generally I think that's correct grammar, although my
grammar stinks so I could easily be wrong. If you have to get into
parsing incorrect grammar it becomes much more difficult.

The two-spaces-after-period rule is not a grammatical one; it's a
typographic convention that grew out of typewriter (i.e. monospaced) fonts.
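
The human-agreement figure mentioned above can be computed in several ways; the post does not say which unit was compared. A minimal sketch, assuming agreement is counted per whitespace-separated word and that both corrections of the text differ only in letter case (the method name `word_agreement` is hypothetical):

```ruby
# Fraction of words on which two manually corrected versions of the same
# text agree. Assumes both versions contain the same number of words.
def word_agreement(first, second)
  a = first.split
  b = second.split
  unless a.size == b.size
    raise ArgumentError, "corrections must contain the same number of words"
  end
  return 1.0 if a.empty?

  a.zip(b).count { |x, y| x == y }.to_f / a.size
end

puts word_agreement("The cat sat. It purred", "The cat sat. it purred")
```

The same function could score a program's output against a single human correction, which is one way to read the 90% and 95% figures.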
 
Paul Battley

It would be nice if you could assume two spaces after an end of sentence
with punctuation. Generally I think that's correct grammar, although my
grammar stinks so I could easily be wrong. If you have to get into
parsing incorrect grammar it becomes much more difficult.

There's an old typewriter convention to use two spaces, but I'd be
surprised if you can find a single printed English book that uses two
spaces after a sentence.

Paul.
 
Paul Battley

sometimes i type in all or mostly lowercase. a friend of mine says it's hard to
read essays with no capital letters. so the problem is to write a method which
takes a string (which could include many paragraphs), and capitalizes words that
should be capitalized. at minimum it should do the starts of sentences.

perhaps u could also correct rly annoying abbreviations used by ppl
for whom typing a few extra letters is 2 hard! thx!111

(Ugh - did I just type that?!)

Paul.
 
Mike Harris

James said:
It would be nice if you could assume two spaces after an end of
sentence with punctuation. Generally I think that's correct grammar,
although my grammar stinks so I could easily be wrong. If you have
to get into parsing incorrect grammar it becomes much more difficult.

Actually, that's an old typographical convention that we can't seem
to shake. Here's a report that talks a little about the issue:

http://webword.com/reports/period.html

Here's an explanation from that report:

The only reason that two spaces were used after a period during the
'typewriter' age was because original typewriters had monospaced
fonts -- the extra space was needed for the eye to pick up on the
beginning of a new sentence. That need is negated w/proportional
space type, hence [it is] the typographic standard.

James Edward Gray II

I stand corrected.
 
jason r tibbetts

Paul said:
perhaps u could also correct rly annoying abbreviations used by ppl
for whom typing a few extra letters is 2 hard! thx!111

Joking aside, this kind of tool would have been most welcome when I
taught freshman-level programming a few years back. We're showing our
age here.
 
Hans Fugal

James said:
It would be nice if you could assume two spaces after an end of
sentence with punctuation. Generally I think that's correct grammar,
although my grammar stinks so I could easily be wrong. If you have to
get into parsing incorrect grammar it becomes much more difficult.

Actually, that's an old typographical convention that we can't seem to
shake. Here's a report that talks a little about the issue:

http://webword.com/reports/period.html

Here's an explanation from that report:

The only reason that two spaces were used after a period during the
'typewriter' age was because original typewriters had monospaced
fonts -- the extra space was needed for the eye to pick up on the
beginning of a new sentence. That need is negated w/proportional
space type, hence [it is] the typographic standard.

Very interesting. It's also very interesting to me that I spend most of
my time reading and writing in monospaced fonts and I think two spaces
looks worse in monospace, so I only ever use one. When typing in
proportional fonts I sometimes still do a double-space, but mostly I've
given up caring what others think and just do what I want (one space),
similar to the situation with punctuation inside or outside of quotation
marks. I blame LaTeX for my nonchalant attitude; however, no matter how
much I use LaTeX, I will never fall for the horrendously wrong `` ''
convention.
 
James Edward Gray II

James said:
It would be nice if you could assume two spaces after an end of
sentence with punctuation. Generally I think that's correct
grammar, although my grammar stinks so I could easily be wrong.
If you have to get into parsing incorrect grammar it becomes much
more difficult.

Actually, that's an old typographical convention that we can't
seem to shake. Here's a report that talks a little about the issue:

http://webword.com/reports/period.html

Here's an explanation from that report:

The only reason that two spaces were used after a period during the
'typewriter' age was because original typewriters had monospaced
fonts -- the extra space was needed for the eye to pick up on the
beginning of a new sentence. That need is negated w/proportional
space type, hence [it is] the typographic standard.

Very interesting. It's also very interesting to me that I spend
most of my time reading and writing in monospaced fonts and I think
two spaces looks worse in monospace, so I only ever use one.

The report mentions this as well:

In short, the "rivers" of whitespace, caused by using two spaces,
invariably annoy graphic designers and typographers.

James Edward Gray II
 
Hans Fugal

James said:
The report mentions this as well:

In short, the "rivers" of whitespace, caused by using two spaces,
invariably annoy graphic designers and typographers.

That sounds like a noble cause. Maybe I'll reconsider...
 
Rick DeNatale

That sounds like a noble cause. Maybe I'll reconsider...

The noble cause being to annoy graphic designers and typographers?

Or maybe you meant something else. <G>

Sorry for the two empty replies. Gmail went crazy on me.
 
Gautam Dey

The noble cause being to annoy graphic designers and typographers?

Or maybe you meant something else. <G>

Sorry for the two empty replies. Gmail went crazy on me.

I thought you were trying to start on the noble cause, by adding to
the cause.
 
William James

James said:
Actually, that's an old typographical convention that we can't seem
to shake.

What sort of perversion would make anyone want to shake
an old convention that is useful?

Here's a report that talks a little about the issue.

http://webword.com/reports/period.html

Here's an explanation from that report:

The only reason that two spaces were used after a period during the
'typewriter' age was because original typewriters had monospaced
fonts -- the extra space was needed for the eye to pick up on the
beginning of a new sentence. That need is negated w/proportional
space type, hence [it is] the typographic standard.

Most people view the posts here in a monospaced font.
If they didn't, source code would look too chaotic.

TeX and LaTeX, for example, quite properly put extra space
after the end of a sentence. Since what we type here will
usually be displayed monospaced, a sensible person who is
trying to make his message as readable as possible will put
two spaces between sentences.
 
William James

William said:
TeX and LaTeX, for example, quite properly put extra space
after the end of a sentence. Since what we type here will
usually be displayed monospaced, a sensible person who is
trying to make his message as readable as possible will put
two spaces between sentences.

Two spaces are needed even when the posts are seen in
a proportional font; without them, there is no extra space
between sentences.
 
Matthew Smillie

What sort of perversion would make anyone want to shake
an old convention that is useful?

I would consider it a vast personal favour if we didn't have to
rehash this never-ending argument in the quiz thread. A quick poke
around Google should familiarise anyone who's interested with the
basic propositions for and against using two spaces at the end of a
sentence; Wikipedia makes a decent start.

matthew smillie.
 
Elliot Temple

here's what i have. it does a few abbreviations, proper nouns, and
some regexes.


require "yaml"

# Shorthand => expansion. Keys are matched case-insensitively as whole words.
Abbreviations = {
  "ppl" => "people", "btwn" => "between", "ur" => "your", "u" => "you",
  "diff" => "different", "ofc" => "of course", "liek" => "like",
  "rly" => "really", "i" => "I", "i'm" => "I'm"
}

def fix_abbreviations(text)
  Abbreviations.each do |abbrev, expansion|
    # Replace only the abbreviation itself. The lookahead leaves the trailing
    # space or punctuation unconsumed, so adjacent abbreviations still match,
    # and keys containing an apostrophe ("i'm") are replaced exactly once.
    pattern = /(^|\s)#{Regexp.escape(abbrev)}(?=\s|[.,?!]|$)/i
    text = text.gsub(pattern) { "#{$1}#{expansion}" }
  end
  text
end

def capitalize_proper_nouns(text)
  # File.exists? was removed in Ruby 3.2; File.exist? works everywhere.
  make_capitalize_proper_nouns_file unless File.exist?("proper_nouns.yaml")
  proper_nouns = YAML.load_file("proper_nouns.yaml")
  # Look each word up in the proper-noun table; leave unknown words alone.
  text.gsub(/\w+/) { |word| proper_nouns[word] || word }
end

def make_capitalize_proper_nouns_file
  words = File.read("/Users/curi/me/words.txt").split("\n")
  lowercase_words = words.select { |w| w =~ /^[a-z]/ }.map { |w| w.downcase }
  # Keep only the words that never appear in lowercase in the word list.
  words = words.map { |w| w.downcase } - lowercase_words
  proper_nouns = words.inject({}) { |h, w| h[w] = w.capitalize; h }
  File.open("proper_nouns.yaml", "w") { |f| YAML.dump(proper_nouns, f) }
end

def capitalize(text)
  return "" if text.nil?
  text = fix_abbreviations(text)
  # First word after sentence-ending punctuation (or a dash) and whitespace.
  text = text.gsub(/([?!.-]\s+)(\w+)/) { "#{$1}#{$2.capitalize}" }
  # First word of every line.
  text = text.gsub(/(\n)(\w+)/) { "#{$1}#{$2.capitalize}" }
  # First word of the whole text.
  text = text.gsub(/\A(\w+)/) { $1.capitalize }
  # Undo the capitalization of URLs.
  text = text.gsub(%r[\sHttp://]) { $&.downcase }
  capitalize_proper_nouns(text)
end
 
