[SUMMARY] Proper Case (#89)

Ruby Quiz

I'm sad no one but the quiz creator himself gave this problem a shot. This is a
very real problem with all manner of source texts and fixing it is tricky.
There was even a discussion on the mailing list about how you can't count on
there being two spaces at the end of a sentence.

You really need natural language processing to correctly determine which words
to capitalize. Unfortunately, natural language processing is complex and often
not a perfect solution anyway.

The good news is that we can use some heuristics to get close. A heuristic is a
loosely defined rule or, put another way, the computer science equivalent to a
guess. These are often developed by just trying to get close to a solution and
then tweaking little things here and there to close in on the target. The
result won't be perfect, of course, but it may be good enough. It's a very
agile process and Rubyists love that.

Let's see what heuristics Elliot came up with now, starting with some code used
to correct common Netspeak misspellings:

  Abbreviations = { "ppl"  => "people",
                    "btwn" => "between",
                    "ur"   => "your",
                    "u"    => "you",
                    "diff" => "different",
                    "ofc"  => "of course",
                    "liek" => "like",
                    "rly"  => "really",
                    "i"    => "I",
                    "i'm"  => "I'm" }

  def fix_abbreviations text
    Abbreviations.each_key do |abbrev|
      text = text.gsub %r[(^|(\s))#{abbrev}((\s)|[.,?!]|$)]i do |m|
        m.gsub(/\w+/, "#{Abbreviations[abbrev]}")
      end
    end
    text
  end

# ...

This code is fairly trivial, but still quite effective. Using a predefined
Hash, the method just scans the text for the keys, swapping them out for the
provided values when found. Note that the expression used to find the key tries
to ensure it is not in the middle of some larger word by looking for leading and
trailing whitespace or punctuation.

That expression could probably be simplified to %r[\b#{abbrev}\b] which looks
for word boundaries (a \W\w or \w\W transition, or the start or end of the
text) and means close to the same thing. This would allow Elliot to do the
search and replace in a single call to gsub(), instead of the current nested
call used to avoid replacing the surrounding space or punctuation. (You can do
it with a single gsub() call even without using \b, just FYI, as long as the
replacement puts the captured context back:
text.gsub(%r[(^|\s)#{abbrev}(\s|[.,?!]|$)]i,
"\\1#{Abbreviations[abbrev]}\\2").)
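
As a rough sketch of the word-boundary idea (this is not Elliot's code; the
method name fix_abbreviations_with_boundaries and the trimmed-down
ABBREVIATIONS table are made up for illustration), a single gsub() per
abbreviation might look like the following. Regexp.escape is an added
precaution in case a key ever contains a regular expression metacharacter.

```ruby
# Trimmed-down copy of the abbreviation table, for illustration only.
ABBREVIATIONS = { "ppl" => "people",
                  "ur"  => "your",
                  "u"   => "you",
                  "i"   => "I",
                  "i'm" => "I'm" }

# Hypothetical variant of fix_abbreviations that relies on \b, so the
# surrounding whitespace and punctuation are never part of the match.
def fix_abbreviations_with_boundaries(text)
  ABBREVIATIONS.inject(text) do |result, (abbrev, expansion)|
    result.gsub(/\b#{Regexp.escape(abbrev)}\b/i, expansion)
  end
end

puts fix_abbreviations_with_boundaries(
  "u said ppl like ur code, but i'm not sure i agree"
)
# prints: you said people like your code, but I'm not sure I agree
```

One thing to keep in mind: \b treats an apostrophe as a boundary, so the "i"
key also matches the first letter of "i'm" here. Replacing the whole match in
one step also seems to sidestep a quirk of the nested version, where
m.gsub(/\w+/, ...) rewrites every run of word characters in the match and so
would appear to expand a key like "i'm" twice.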

The important aspect of this solution though is that it knows it's not perfect
and gives you the Hash as a means to make it better. If it doesn't handle your
text correctly, you can always add or delete entries from the Hash to improve
the results.

Let's look at some more code, this time for capitalizing proper nouns:

  require "yaml"

  # ...

  def capitalize_proper_nouns text
    # File.exist? replaces File.exists?, which newer Rubies no longer provide
    unless File.exist?("proper_nouns.yaml")
      make_capitalize_proper_nouns_file
    end
    proper_nouns = YAML.load_file "proper_nouns.yaml"
    text = text.gsub /\w+/ do |word|
      proper_nouns[word] || word
    end
    text
  end

  def make_capitalize_proper_nouns_file
    words = File.read("/Users/curi/me/words.txt").split "\n"
    # words that appear in the list starting with a lowercase letter
    lowercase_words = words.select {|w| w =~ /^[a-z]/}.map{|w| w.downcase}
    # what remains only ever appears capitalized: treat those as proper nouns
    words = words.map{|w| w.downcase} - lowercase_words
    proper_nouns = words.inject({}) { |h, w| h[w] = w.capitalize; h }
    File.open("proper_nouns.yaml", "w") {|f| YAML.dump(proper_nouns, f)}
  end

# ...

This is an interesting two-tiered approach. If the program can locate a
proper_nouns.yaml file, a Hash is pulled from it and used to capitalize the
listed nouns. If the file cannot be found, a hand-off is made to
make_capitalize_proper_nouns_file(). The code in that method appears to read a
word list file and build up its own list of proper nouns. This list is then
flushed to the YAML file, so it will be found on future loads.
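
The load-or-build pattern described above can be sketched in isolation. This
is only an illustration under a few assumptions: load_or_build is a
hypothetical helper name, the table is supplied by a block instead of a word
list file, and a temporary directory stands in for the working directory.

```ruby
require "yaml"
require "tmpdir"

# Hypothetical helper: return the Hash stored at +path+, first building it
# with the given block and saving it as YAML if the file does not exist yet.
def load_or_build(path)
  File.write(path, YAML.dump(yield)) unless File.exist?(path)
  YAML.load_file(path)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "proper_nouns.yaml")

  # First call: the file is missing, so the block runs and the result is saved.
  first = load_or_build(path) { { "elliot" => "Elliot", "temple" => "Temple" } }

  # Second call: the file exists, so the block is never evaluated.
  second = load_or_build(path) { raise "should not rebuild" }

  p first == second
end
```

The appeal of the design is that the expensive step (scanning a word list)
happens at most once, and the saved file can be edited by hand afterwards.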

What I liked about this was how I could customize it, yet again. When testing
Elliot's code against the quiz text, I just built a quick Hash with the needed
keys and values:

  $ ruby -r yaml -e 'y Hash[*%w[Elliot Temple].map { |pn| [pn.downcase, pn] }.
  flatten]' > proper_nouns.yaml
  $ cat proper_nouns.yaml
  ---
  temple: Temple
  elliot: Elliot
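
The one-liner above leans on the y() shortcut, which as far as I know is not
defined outside of irb on current Rubies. A rough equivalent using to_yaml is
sketched here; proper_noun_table is a hypothetical helper name, and the block
form of to_h assumes Ruby 2.6 or later.

```ruby
require "yaml"

# Hypothetical helper: map each lowercased name to its proper capitalization.
def proper_noun_table(names)
  names.to_h { |name| [name.downcase, name] }
end

# Print the YAML; redirect it to proper_nouns.yaml as in the session above.
puts proper_noun_table(%w[Elliot Temple]).to_yaml
```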

Getting back to the code, we're again using a trivial regular expression based
swap, which you can see in the second half of capitalize_proper_nouns(). It
matches all words (well, a run of \w characters) and replaces them with the
proper noun capitalization, if there is such a thing, or the word itself,
causing no change.

Now we can put all of that together with a few more heuristics to get a complete
solution:

# ...

  def capitalize text
    return "" if text.nil?
    text = fix_abbreviations text
    text = text.gsub /([?!.-]\s+)(\w+)/ do |m|
      "#$1#{$2.capitalize}"
    end
    text = text.gsub /(\n)(\w+)/ do |m|
      "#$1#{$2.capitalize}"
    end
    text = text.gsub /\A(\w+)/ do |m|
      "#{$1.capitalize}"
    end
    text = text.gsub %r[\sHttp://] do |m|
      "#{$&.downcase}"
    end
    text = capitalize_proper_nouns text
    text
  end

  puts capitalize(ARGF.read)

This method triggers the fixes for abbreviations and proper nouns that we have
already examined. In addition, it uses regular expressions to capitalize word
characters following sentence-ending punctuation as well as word characters at
the beginning of a line or the document. It then corrects the protocol
identifier for inline links it may have damaged in the process.
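
To see why the link repair is needed, here is a small sketch that pulls two of
the substitutions out of capitalize(). The method names and the example.com
address are placeholders chosen for this illustration; the regular expressions
are the ones from the listing.

```ruby
# Same substitution capitalize() applies after sentence-ending punctuation.
def capitalize_after_punctuation(text)
  text.gsub(/([?!.-]\s+)(\w+)/) { "#{$1}#{$2.capitalize}" }
end

# Same repair capitalize() applies to a link whose protocol got capitalized.
def repair_links(text)
  text.gsub(%r[\sHttp://]) { $&.downcase }
end

line    = "see the docs. http://example.com has more"
damaged = capitalize_after_punctuation(line)

puts damaged                # see the docs. Http://example.com has more
puts repair_links(damaged)  # see the docs. http://example.com has more
```

Since the repair pattern requires whitespace before Http://, a link at the very
start of the document would presumably stay capitalized.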

So, how does this do on the quiz document? Generally quite well. It makes
obvious errors in only two places:

By Elliot Temple

and:

Sometimes I might want to write about gsub vs. Gsub! Without the...

The first error is that we generally do not capitalize the by in a byline. That
could probably be worked around with another regular expression correction.
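
One possible shape for such a correction is sketched below. It is a guess at a
workable rule, not something from Elliot's solution: fix_byline is a
hypothetical method, and it assumes it runs after proper nouns have been
capitalized, so that a byline is a line holding nothing but "By" followed by
capitalized names.

```ruby
# Hypothetical post-processing step: lowercase "By" on a line that consists
# only of "By" followed by one or more capitalized words.
def fix_byline(text)
  text.gsub(/^By((?: [A-Z][\w.'-]*)+)$/) { "by#{$1}" }
end

puts fix_byline("Proper Case\nBy Elliot Temple\n\nBy the way, this stays.")
```

A sentence that merely starts with "By" is left alone because its second word
is not capitalized, though a short line such as "By Friday" would still be
misread as a byline.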

The second issue is much harder to get right, and here is where we start to
miss a natural language processing facility. When humans read that line, we
know that gsub!() and without should not be capitalized because of the context
they are used in. The script is not so clever, though, and the period and
exclamation point throw it off. You could add rules to work around these cases
as well, but you will definitely be fighting an uphill battle at that point.
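
To give a feel for what one such rule might look like, here is a sketch that
skips capitalization after a short list of abbreviations ending in a period.
Both NON_TERMINAL_ABBREVIATIONS and capitalize_sentence_starts are invented
for this example, and the rule only addresses the "vs." case; the exclamation
point in gsub! would still trip it up.

```ruby
# Hypothetical list of tokens whose trailing period does not end a sentence.
NON_TERMINAL_ABBREVIATIONS = %w[vs. e.g. i.e. etc.]

# Capitalize the word after sentence-ending punctuation, unless the token
# carrying that punctuation is a known abbreviation.
def capitalize_sentence_starts(text)
  text.gsub(/(\S*[?!.])(\s+)(\w+)/) do
    token, space, word = $1, $2, $3
    if NON_TERMINAL_ABBREVIATIONS.include?(token.downcase)
      "#{token}#{space}#{word}"
    else
      "#{token}#{space}#{word.capitalize}"
    end
  end
end

puts capitalize_sentence_starts("write about gsub vs. gsub directly. then stop")
# prints: write about gsub vs. gsub directly. Then stop
```

Every entry added to the list fixes one case and risks breaking another (a
sentence can legitimately end in "etc."), which is the uphill battle in
miniature.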

I still say the end result is quite good though. Count how many characters are
wrong in the quiz and subtract from that the three miscapitalized words in the
output (By, Gsub!, and Without). It's a big improvement.

My thanks to Elliot Temple for the problem and being brave enough to put
together a solution.

Tomorrow we'll try our hand at another simple pen and paper game and see who can
solve it in record time...
 
Rick DeNatale

> I'm sad no one but the quiz creator himself gave this problem a shot. This is a
> very real problem with all manner of source texts and fixing it is tricky.
> There was even a discussion on the mailing list about how you can't count on
> there being two spaces at the end of a sentence.

Oh well, I was working on this, and got a fair ways, but got
distracted by a problem with my server. I guess I'll have to post my
solution on my blog in a few days.
 
