[QUIZ] Proper Case (#89)

Mitchell Koch

My solution will at the minimum capitalize the starts of sentences
just judging by periods.

If supplied with an example source using the -s option, it will try to
find words that should always be capitalized (I, Ruby, proper nouns in
general), words that imply that the next word should be capitalized
(Lake, General) and words in which punctuation does not imply an end
of a sentence (abbreviations) although this is only helpful if there
is some capitalization in the text.

Attachment: propercase.rb

#!/usr/bin/env ruby

require 'optparse' # provides ARGV.getopts; the old 'getopts' library is gone from modern Ruby

class String
  def capitalized?
    if self == self.capitalize then true else false end
  end
end

EOSPunc = ['.', '?', '!']
Thresh = 0.5
should_be_capt = [] # words that should always be capitalized
can_have_punct = [] # words whose punctuation does not end a sentence
next_word_capt = [] # words that imply the next word is capitalized

source = ARGV.getopts('s:')['s']

if source # learn from a source
  tot_count = Hash.new{0}
  sbc_count = Hash.new{0} # should be capitalized
  nwc_count = Hash.new{0} # next word capitalized
  chp_count = Hash.new{0} # can have punctuation
  prev_bare = ''
  eos = true
  File.foreach(source) do |line|
    line.split(' ').each do |word|
      bare = word.scan(/\w+/).first
      next unless bare
      tot_count[bare.downcase] += 1
      if eos
        # a lowercase word right after punctuation suggests the previous
        # word was an abbreviation rather than the end of a sentence
        unless bare.capitalized? or prev_bare.empty?
          chp_count[prev_bare.downcase] += 1
        end
      elsif bare.capitalized?
        sbc_count[bare.downcase] += 1
        nwc_count[prev_bare.downcase] += 1
      end
      eos = if EOSPunc.index word[-1].chr then true else false end
      prev_bare = bare
    end
  end
  sbc_count.each do |word, count|
    if count > Thresh * tot_count[word]
      should_be_capt << word
    end
  end
  nwc_count.each do |word, count|
    if count > Thresh * tot_count[word]
      next_word_capt << word
    end
  end
  chp_count.each do |word, count|
    if count > Thresh * tot_count[word]
      can_have_punct << word
    end
  end
end

cnw = true # capitalize next word
ARGF.each do |line|
  line.split(' ').each do |word|
    bare = (word.scan(/\w+/).first || '').downcase # '' for punctuation-only tokens
    if cnw or should_be_capt.index(bare)
      word.capitalize!
    end
    print word + ' '
    cnw = if EOSPunc.index word[-1].chr then true else false end
    cnw = false if can_have_punct.index(bare) # learned abbreviation
    cnw = true if next_word_capt.index(bare)
  end
  puts
end

 
Elliot Temple

> My solution will at the minimum capitalize the starts of sentences
> just judging by periods.
>
> If supplied with an example source using the -s option, it will try to
> find words that should always be capitalized (I, Ruby, proper nouns in
> general), words that imply that the next word should be capitalized
> (Lake, General) and words in which punctuation does not imply an end
> of a sentence (abbreviations) although this is only helpful if there
> is some capitalization in the text.

I'm still reading through the code but just a minor tip:

lines like:

if EOSPunc.index word[-1].chr then true else false end

can be replaced with

EOSPunc.index word[-1].chr

It will return either an index (which counts as true) or nil (which
counts as false), so in a condition the if statement gives the same
result.

If you really want to have true or false (and not 3 or "hi" or nil,
even though those will work fine if you treat the variable as a
boolean), one way to do it is !!var. Using not twice gets you true or
false. There's probably something more readable, though.
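
A minimal sketch of the three variants, assuming Ruby 1.9 or later (where word[-1] is already a one-character string). EOS_PUNCT and the method names are made up for illustration; Array#include? is one candidate for the "more readable" option:

```ruby
EOS_PUNCT = ['.', '?', '!'].freeze # mirrors EOSPunc from the script

# Array#index returns the position of the match, or nil when there is none.
def eos_index(word)
  EOS_PUNCT.index(word[-1])
end

# Double negation collapses any value to a strict true or false.
def eos_strict?(word)
  !!EOS_PUNCT.index(word[-1])
end

# Array#include? already answers with true or false.
def eos_include?(word)
  EOS_PUNCT.include?(word[-1])
end

p eos_index('done!')    # 2
p eos_index('done')     # nil
p eos_strict?('done!')  # true
p eos_include?('done')  # false
```

All three behave the same inside an if; they differ only in what gets stored when the result is assigned to a variable.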

-- Elliot Temple
http://www.curi.us/blog/
 
James Edward Gray II

> If supplied with an example source using the -s option, it will try to
> find words that should always be capitalized (I, Ruby, proper nouns in
> general), words that imply that the next word should be capitalized
> (Lake, General) and words in which punctuation does not imply an end
> of a sentence (abbreviations) although this is only helpful if there
> is some capitalization in the text.

This is quite clever Mitchell. Thanks so much for sharing it with us!

Sadly, I wrote the summary earlier today when I had a few free
moments. Don't take it personally that it doesn't mention this
code. :(

James Edward Gray II
 
Mitchell Koch

* Elliot Temple said:
> if EOSPunc.index word[-1].chr then true else false end
> can be replaced with
> EOSPunc.index word[-1].chr

Ah, yeah that's a shorter way to do it. For some reason I had it
stuck in my head that I just wanted to express truth value and tried
to avoid passing on extraneous information (like in this case the
index in the punctuation array of the entry with the punctuation
attached to the last word).

I didn't spend too much time refactoring; at first I was dreaming up
abstracting parts of both the source reading and the proper casing
into a token parser kind of thing, but then it was more like, okay it
works and it's the Wednesday before the quiz summary goes up, so let's
send it off. ;-)
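
For what it's worth, here is a rough sketch of what such a shared token parser might look like. Token, each_token, and EOS are hypothetical names, not part of the submitted script; the idea is only that both the learning pass and the casing pass could walk the same stream of tokens:

```ruby
EOS = ['.', '?', '!'].freeze

# One whitespace-separated word plus the facts both passes need.
Token = Struct.new(:word, :bare, :ends_sentence)

# Yields a Token for every word in the text; returns an Enumerator
# when called without a block.
def each_token(text)
  return enum_for(:each_token, text) unless block_given?
  text.split(' ').each do |word|
    bare = word[/\w+/] # first run of word characters, or nil
    yield Token.new(word, bare && bare.downcase, EOS.include?(word[-1]))
  end
end

tokens = each_token('hello there. ruby rocks!').to_a
p tokens.map(&:bare)          # ["hello", "there", "ruby", "rocks"]
p tokens.map(&:ends_sentence) # [false, true, false, true]
```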

Mitchell Koch
 
Mitchell Koch

* James Edward Gray II said:
> This is quite clever Mitchell. Thanks so much for sharing it with us!
>
> Sadly, I wrote the summary earlier today when I had a few free
> moments. Don't take it personally that it doesn't mention this
> code. :(

No worries. I shouldn't have put it off to the last minute anyway. :)

It's interesting to me that so few of us submitted code for this
quiz. It's a problem that has no clear solutions, partly because
capitalization isn't just about grammatical rules, but does
communicate unique things as hinted in Elliot's initial examples.

For example, if the word "gray" appears in a lowercase message, it
could mean the color, in which case it should stay lowercase, or it
could be a surname, in which case it should be capitalized. A computer
reader has no way to know and a human reader should be able to tell,
but really only the original author knows for sure.
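
To make that concrete, here is a small sketch of the frequency threshold the script relies on. The counts are made up for illustration, and usually_capitalized is a hypothetical helper, not something from propercase.rb:

```ruby
THRESH = 0.5 # same value as Thresh in the script

# Returns the words that were capitalized mid-sentence in more than
# THRESH of their total occurrences in the learning source.
def usually_capitalized(cap_counts, total_counts)
  cap_counts.select { |word, caps| caps > THRESH * total_counts.fetch(word, 0) }.keys
end

totals = { 'gray' => 10, 'ruby' => 4 } # made-up occurrence counts
caps   = { 'gray' => 3,  'ruby' => 4 } # made-up mid-sentence capitalized counts

p usually_capitalized(caps, totals) # ["ruby"]
```

With these numbers "gray" is left lowercase every time, so the three surname uses come out wrong. A threshold can only pick the majority reading of a word, not the right one for each occurrence.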

It's like image interpolation. If you start out with a small photo,
expand it, and try to infer the extra pixels, a good algorithm will
give you something that looks okay, but it will not be as good as if
you had taken it at the larger size in the first place.

Incidentally, that's a good reason to not type in lowercase. ;-)

Mitchell Koch
 