[SUMMARY] Proper Case (#89)

Discussion in 'Ruby' started by Ruby Quiz, Aug 10, 2006.

  1. Ruby Quiz

    Ruby Quiz Guest

    I'm sad no one but the quiz creator himself gave this problem a shot. This is a
    very real problem with all manner of source texts and fixing it is tricky.
    There was even a discussion on the mailing list about how you can't count on
    there being two spaces at the end of a sentence.

    You really need natural language processing to correctly determine which words
    to capitalize. Unfortunately, natural language processing is complex and often
    not a perfect solution anyway.

    The good news is that we can use some heuristics to get close. A heuristic is a
    loosely defined rule or, put another way, the computer science equivalent of a
    guess. These are often developed by just trying to get close to a solution and
    then tweaking little things here and there to close in on the target. The
    result won't be perfect, of course, but it may be good enough. It's a very
    agile process and Rubyists love that.

    Let's see what heuristics Elliot came up with now, starting with some code used
    to correct common Netspeak misspellings:

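    Abbreviations = { "ppl"  => "people",
                      "btwn" => "between",
                      "ur"   => "your",
                      "u"    => "you",
                      "diff" => "different",
                      "ofc"  => "of course",
                      "liek" => "like",
                      "rly"  => "really",
                      "i"    => "I",
                      "i'm"  => "I'm" }

    def fix_abbreviations text
      Abbreviations.each_key do |abbrev|
        # Match the abbreviation only when whitespace, punctuation, or a line
        # edge surrounds it.
        text = text.gsub %r[(^|(\s))#{abbrev}((\s)|[.,?!]|$)]i do |m|
          # Swap just the word and keep the surrounding characters. Including
          # the apostrophe stops "i'm" from being replaced twice.
          m.sub(/[\w']+/, Abbreviations[abbrev])
        end
      end
      text
    end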

    # ...

    This code is fairly trivial, but still quite effective. Using a predefined
    Hash, the method just scans the text for the keys, swapping them out for the
    provided values when found. Note that the expression used to find the key tries
    to ensure it is not in the middle of some larger word by looking for leading and
    trailing whitespace or punctuation.

    That expression could probably be simplified to %r[\b#{abbrev}\b], which looks
    for word boundaries (a \W\w or \w\W transition) and means close to the same
    thing. This would allow Elliot to do the search and replace in a single call to
    gsub(), instead of the current nested call used to avoid replacing the
    surrounding space or punctuation. (You can do it with a single gsub() call even
    without using \b, just FYI, by putting the captured groups back in the
    replacement: text.gsub(%r[(^|(\s))#{abbrev}((\s)|[.,?!]|$)]i,
    "\\1#{Abbreviations[abbrev]}\\3").)
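    As a rough illustration of the \b simplification, here is a minimal sketch.
    The ABBREVIATIONS constant is a trimmed copy of Elliot's Hash, and the method
    name fix_abbreviations_with_boundaries is made up for this example; neither
    is part of the submitted solution.

```ruby
# Trimmed, illustrative copy of the Abbreviations Hash from the solution.
ABBREVIATIONS = {
  "ppl" => "people",
  "ur"  => "your",
  "u"   => "you",
  "i"   => "I",
  "i'm" => "I'm"
}

# Hypothetical variant of fix_abbreviations: \b anchors the match to word
# boundaries, so one gsub call per abbreviation is enough and the
# surrounding whitespace or punctuation is never part of the match.
def fix_abbreviations_with_boundaries(text)
  ABBREVIATIONS.inject(text) do |result, (abbrev, expansion)|
    result.gsub(/\b#{Regexp.escape(abbrev)}\b/i, expansion)
  end
end

puts fix_abbreviations_with_boundaries("i think ur right, but ppl like u disagree.")
```

    One difference to keep in mind: \b also treats hyphens and apostrophes as
    boundaries, so this version would expand the "u" in "u-turn" where the
    original expression would leave it alone.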

    The important aspect of this solution though is that it knows it's not perfect
    and gives you the Hash as a means to make it better. If it doesn't handle your
    text correctly, you can always add or delete entries from the Hash to improve
    the results.

    Let's look at some more code, this time for capitalizing proper nouns:

    require "yaml"

    # ...

    def capitalize_proper_nouns text
      # Build the YAML word list on first use. (File.exists? was removed in
      # Ruby 3.2; File.exist? works everywhere.)
      unless File.exist?("proper_nouns.yaml")
        make_capitalize_proper_nouns_file
      end
      proper_nouns = YAML.load_file "proper_nouns.yaml"
      text = text.gsub /\w+/ do |word|
        proper_nouns[word] || word
      end
      text
    end

    def make_capitalize_proper_nouns_file
      words = File.read("/Users/curi/me/words.txt").split "\n"
      # Words listed in lowercase are ordinary words.
      lowercase_words = words.select {|w| w =~ /^[a-z]/}.map{|w| w.downcase}
      # Whatever is left appears only capitalized, so treat it as a proper noun.
      words = words.map{|w| w.downcase} - lowercase_words
      proper_nouns = words.inject({}) { |h, w| h[w] = w.capitalize; h }
      File.open("proper_nouns.yaml", "w") {|f| YAML.dump(proper_nouns, f)}
    end

    # ...

    This is an interesting two-tiered approach. If the program can locate a
    proper_nouns.yaml file, a Hash is pulled from it and used to capitalize the
    listed nouns. If the file cannot be found, a hand-off is made to
    make_capitalize_proper_nouns_file(). The code in that method appears to read a
    word list file and build up its own list of proper nouns. This list is then
    flushed to the YAML file, so it will be found on future loads.

    What I liked about this was how I could customize it, yet again. When testing
    Elliot's code against the quiz text, I just built a quick Hash with the needed
    keys and values:

    $ ruby -r yaml -e 'y Hash[*%w[Elliot Temple].map { |pn| [pn.downcase, pn] }.
    flatten]' > proper_nouns.yaml
    $ cat proper_nouns.yaml
    ---
    temple: Temple
    elliot: Elliot

    Getting back to the code, we're again using a trivial regular expression based
    swap, which you can see in the second half of capitalize_proper_nouns(). It
    matches all words (well, a run of \w characters) and replaces them with the
    proper noun capitalization, if there is such a thing, or the word itself,
    causing no change.
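    To see the word list logic without touching the file system, here is a small
    sketch that works on an in-memory Array. The WORDS list and the method
    build_proper_nouns are assumptions for illustration only; the real code
    reads a dictionary file and caches the result in proper_nouns.yaml.

```ruby
# Hypothetical word list; a real run would read a dictionary file instead.
WORDS = %w[apple Bill bill Elliot ruby Ruby Temple zebra]

# Same idea as make_capitalize_proper_nouns_file, minus the file handling:
# a word counts as a proper noun only if it never appears in lowercase.
def build_proper_nouns(words)
  lowercase_words = words.select { |w| w =~ /\A[a-z]/ }.map(&:downcase)
  capitalized_only = words.map(&:downcase) - lowercase_words
  capitalized_only.each_with_object({}) { |w, h| h[w] = w.capitalize }
end

# Look up each run of word characters; unknown words pass through unchanged.
def capitalize_proper_nouns(text, proper_nouns)
  text.gsub(/\w+/) { |word| proper_nouns[word] || word }
end

nouns = build_proper_nouns(WORDS)
p nouns
puts capitalize_proper_nouns("elliot temple paid the bill in ruby", nouns)
```

    Because "bill" and "ruby" also appear in lowercase in the list, they are
    treated as ordinary words and left alone.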

    Now we can put all of that together with a few more heuristics to get a complete
    solution:

    # ...

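    def capitalize text
      return "" if text.nil?
      text = fix_abbreviations text
      # Capitalize the word after sentence-ending punctuation.
      text = text.gsub /([?!.-]\s+)(\w+)/ do |m|
        "#$1#{$2.capitalize}"
      end
      # Capitalize the first word of each line.
      text = text.gsub /(\n)(\w+)/ do |m|
        "#$1#{$2.capitalize}"
      end
      # Capitalize the first word of the document.
      text = text.gsub /\A(\w+)/ do |m|
        "#{$1.capitalize}"
      end
      # Undo the capitalization of links caught by the rules above.
      text = text.gsub %r[\sHttp://] do |m|
        "#{$&.downcase}"
      end
      text = capitalize_proper_nouns text
      text
    end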

    puts capitalize(ARGF.read)

    This method triggers the fixes for abbreviations and proper nouns that we have
    already examined. In addition, it uses regular expressions to capitalize word
    characters following sentence-ending punctuation as well as word characters at
    the beginning of a line or the document. It then corrects the protocol
    identifier for inline links it may have damaged in the process.
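    The sentence-level steps can be tried in isolation. The sketch below copies
    the four regular expressions from capitalize() into a hypothetical helper,
    capitalize_sentences, and leaves out the abbreviation and proper noun
    passes. The sample text and its example.com link are invented for the demo.

```ruby
# Sketch of the sentence-level steps from capitalize(), without the
# abbreviation and proper noun passes.
def capitalize_sentences(text)
  # Word after sentence-ending punctuation (or a dash) and whitespace.
  text = text.gsub(/([?!.-]\s+)(\w+)/) { "#{$1}#{$2.capitalize}" }
  # First word of each line.
  text = text.gsub(/(\n)(\w+)/) { "#{$1}#{$2.capitalize}" }
  # First word of the document.
  text = text.gsub(/\A(\w+)/) { $1.capitalize }
  # Links that start a sentence were capitalized above; put them back.
  text.gsub(%r[\sHttp://]) { $&.downcase }
end

sample = "hello there. http://example.com has more!\nis that all? yes."
puts capitalize_sentences(sample)
```

    Note that String#capitalize also lowercases the rest of the word, so a
    mixed-case word at the start of a sentence would lose its inner capitals.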

    So, how does this do on the quiz document? Generally quite well. It goes wrong
    in only two places:

    By Elliot Temple

    and:

    Sometimes I might want to write about gsub vs. Gsub! Without the...

    The first error is that we generally do not capitalize the by in a byline. That
    could probably be worked around with another regular expression correction.
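    One possible shape for that correction, offered only as a guess at what such
    a rule might look like: lowercase a leading "By" when the rest of the line
    consists solely of capitalized words. The method name fix_byline is
    hypothetical and not part of Elliot's code.

```ruby
# Hypothetical extra correction: lowercase "By" when it starts a line and
# every remaining word on that line is capitalized, as in a byline.
def fix_byline(text)
  text.gsub(/^By(?=(?: [A-Z]\w*)+$)/, "by")
end

puts fix_byline("Proper Case\nBy Elliot Temple\nBy now you know.")
```

    Like the other heuristics, this one can misfire: a title line such as
    "By Design" would be lowercased too.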

    The second issue is much harder to get right, and here is where we start to miss
    a natural language processing facility. When humans read that line, we know that
    gsub!() and without should not be capitalized because of the context they are
    used in. The script is not so clever, though, and the period and exclamation
    point throw it off. You could add rules to work around these cases as well, but
    you will definitely be fighting an uphill battle at that point.
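    To give a feel for what one such rule might look like, the sketch below
    skips capitalization after a short list of abbreviations. The NON_TERMINAL
    list and the method name capitalize_after_punctuation are assumptions made
    up for this example.

```ruby
# Hypothetical list of abbreviations whose trailing period does not end a
# sentence.
NON_TERMINAL = %w[vs. e.g. i.e. etc.]

# Capitalize the word after end punctuation unless the token carrying that
# punctuation is a known abbreviation.
def capitalize_after_punctuation(text)
  text.gsub(/(\S*[?!.-])(\s+)(\w+)/) do
    before, space, word = $1, $2, $3
    word = word.capitalize unless NON_TERMINAL.include?(before.downcase)
    "#{before}#{space}#{word}"
  end
end

puts capitalize_after_punctuation("i write about gsub vs. gsub! without the bang. it works.")
```

    This handles the "vs." case, but "without" is still capitalized after the
    exclamation point in gsub!, which shows how quickly the special cases pile
    up.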

    I still say the end result is quite good though. Count how many characters are
    wrong in the quiz and subtract from that the three output issues. It's a big
    improvement.

    My thanks to Elliot Temple for the problem and being brave enough to put
    together a solution.

    Tomorrow we'll try our hand at another simple pen and paper game and see who can
    solve it in record time...
     
    Ruby Quiz, Aug 10, 2006

  2. On 8/10/06, Ruby Quiz wrote:
    > I'm sad no one but the quiz creator himself gave this problem a shot. This is a
    > very real problem with all manner of source texts and fixing it is tricky.
    > There was even a discussion on the mailing list about how you can't count on
    > there being two spaces at the end of a sentence.


    Oh well, I was working on this, and got a fair ways, but got
    distracted by a problem with my server. I guess I'll have to post my
    solution on my blog in a few days.

    --
    Rick DeNatale
     
    Rick DeNatale, Aug 10, 2006
