[SUMMARY] TumbleDRYer (#53)

Discussion in 'Ruby' started by Ruby Quiz, Nov 3, 2005.

  1. Ruby Quiz

    Ruby Quiz Guest

    Great quiz Hugh! This is one of those unique ideas that was a lot of fun to
    play with. There's always some neat things to be accomplished with code that
    writes code.

    I want to take a look at Bob Showalter's solution below. It's really just one
    linear procedure, so I'll show all the code and then break it down:

    require 'strscan'
    require 'abbrev'

    class TumbleDRYer

    # minimum length of a phrase to consider
    MIN_PHRASE = 10

    # minimum times a phrase must occur to consider
    MIN_OCCUR = 3

    # minimum length for abbreviation
    MIN_ABBR = 2

    def initialize(string)
    @input = string
    end

    def dry

    # this will accumulate a list of repeated phrases to condense
    phrases = Array.new

    # this will receive the abbreviation for each phrase
    abbr = Hash.new

    lines = @input.to_a
    loop do

    # process the input data by lines. we find "phrases" by
    # first finding the start and end of each "word" in the line,
    # and then combining those words into longer phrases. for
    # each phrase, we count the number of times it occurs in the
    # total input.
    phr = Hash.new
    lines.each do |line|
    s = StringScanner.new(line)
    words = Array.new
    loop do
    s.scan_until(/(?=\S)/) or break
    beg = s.pos
    s.scan(/\S+/)
    words << [ beg, s.pos ]
    end

    # combine words to make 'phrases'
    combos(words)

    # accumulate phrases, counting their occurences.
    # skip phrases that are too short.
    words.each do |from, to|
    p = line[from, to - from]
    next unless p.length >= MIN_PHRASE
    phr[p] ||= 0
    phr[p] += 1
    end
    end

    # get the longest phrase that occurs the most times
    longest = phr.sort_by { |k,v| -(k.length * 1000 + v)
    }.find { |k,v| v >= MIN_OCCUR } or break
    phrase, occurs = longest

    # save the phrase, and then blank it out of the input data
    # so we can search for more phrases
    phrases << phrase
    lines.each { |line| line.gsub!(phrase, ' ' * phrase.length) }

    end

    # now we have all the phrases we want to replace.
    # find unique abbreviations for each phrase.
    temp = Hash.new
    phrases.each do |phrase|
    key = phrase.scan(/\w+/).flatten.to_s.downcase
    key = '_' + key unless key =~ /^[_a-zA-Z]/
    key += '_' while temp.has_key? key
    temp[key] = phrase
    end
    temp.keys.abbrev.sort.each do |s, key|
    phrase = temp[key]
    abbr[phrase] = s if abbr[phrase].nil? ||
    abbr[phrase].length < MIN_ABBR
    end

    # generate the output class
    puts "class WashingMachine"
    puts " def initialize"
    phrases.each do |phrase|
    puts ' @' + abbr[phrase] + " = '" +
    phrase.gsub("'", "\\\\'") + "'"
    @input.gsub!(phrase, '#{@' + abbr[phrase] + '}')
    end
    puts " end\n"
    puts " def output\nprint <<EOF"
    puts @input
    puts "EOF\n end\n"
    puts "end"

    end

    private

    def combos(arr, max = arr.size - 1, i = 0)
    (i+1..max).each do |j|
    arr << [ arr[0], arr[j][1] ]
    end
    combos(arr, max, i + 1) if i < max - 1
    end

    end

    TumbleDRYer.new(ARGF.read).dry

    That's the code, save one unused line I removed.

    The first thing I do when reading code is to try and understand the process. To
    get the lay of the land, we need to know what's called where. The path of
    execution is trivial to find here. The last line is the only code, besides a
    couple of require statements, outside of a class definition. We know that's
    what Ruby will run.

    That line tells us that we need to be examining two methods
    TumbleDRYer#initialize and TumbleDRYer#dry. Don't be shy with that scrollbar
    now. Zip around until you find them. TumbleDRYer#initialize is easy enough,
    all it does is record the passed input. TumbleDRYer#dry is a monster, so let's
    come back to it.

    Anything else in this file? Well, we can see that two standard libraries are
    used, strscan and abbrev. We can also see three constants at the top of the
    class that start to give us ideas about how the code will break down the
    sections it is meant to simplify.

    Beyond that, there's only one other method, which we can assume is a helper to
    TumbleDRYer#dry. It's smaller, so let's see if we can figure it out first. Uh
    oh, recursion! My brain just melted and leaked out of my ear. Okay, let's see
    if we can cheat and just call it. Copy and paste and irb to the rescue:

    >> def combos(arr, max = arr.size - 1, i = 0)
    >> (i+1..max).each do |j|

    ?> arr << [ arr[0], arr[j][1] ]
    >> end
    >> combos(arr, max, i + 1) if i < max - 1
    >> end

    => nil
    >> combos( (1..10).to_a )

    => nil

    Or not. I was hoping for a more informative return value obviously.

    Okay, it looks like I actually need to read a little. I see that `arr` is a
    parameter of the method. That's how I decided on a sample data set. A little
    farther in I can see `arr << ...`. Bingo. It modifies the array. Now I know
    how to check it:

    >> test_data = (1..10).to_a

    => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    >> combos test_data

    => nil
    >> test_data

    => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, [1, 1], [1, 1], [1, 0], [1, 0], [1, 1],
    [1, 1], [1, 0], [1, 0], [1, 1], [0, 1], [0, 0], [0, 0], [0, 1], [0, 1],
    [0, 0], [0, 0], [0, 1], [1, 0], [1, 0], [1, 1], [1, 1], [1, 0], [1, 0],
    [1, 1], [0, 0], [0, 1], [0, 1], [0, 0], [0, 0], [0, 1], [1, 1], [1, 1],
    [1, 0], [1, 0], [1, 1], [0, 1], [0, 0], [0, 0], [0, 1], [1, 0], [1, 0],
    [1, 1], [0, 0], [0, 1], [1, 1]]

    Yuck. In the dictionary under "unhelpful", you will find the above listing.
    More reading is required.

    Well, if I finish the line I stopped at last time, I find this little nugget:
    `[ arr[0], arr[j][1] ]`. That makes me think `arr` is expected to be an
    Array of Arrays. It looks like the subarrays only need two items, since the
    indexes are 0 and 1. Third try is a charm:

    >> test_data = [1, 3, 5, 7, 9].map { |n| [n, n + 1] }

    => [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    >> combos test_data

    => nil
    >> test_data

    => [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [1, 4], [1, 6], [1, 8], [1, 10],
    [3, 6], [3, 8], [3, 10], [5, 8], [5, 10], [7, 10]]

    That looks promising. Now what are we seeing? Obviously it added some
    groupings to the end of the Array. 1 was already seen with 2, but it also
    paired it with 4, 6, 8, and 10. That's all of the other second elements. Then
    3 is paired with 6, 8, and 10 and it was already paired with 4, which leaves
    only 2 missing. Now I see the pattern. Each first element is paired with all
    of the second elements that come after it, in the original Array.

    Even knowing how that works, I'm not sure I understand what that is actually for
    yet. I'll file it away in the back of my mind and see if I'm ready to tackle
    TumbleDRYer#dry.

    Here's the first chunk of that method, to save you some scrolling:

    def dry

    # this will accumulate a list of repeated phrases to condense
    phrases = Array.new

    # this will receive the abbreviation for each phrase
    abbr = Hash.new

    lines = @input.to_a
    loop do

    # process the input data by lines. we find "phrases" by
    # first finding the start and end of each "word" in the line,
    # and then combining those words into longer phrases. for
    # each phrase, we count the number of times it occurs in the
    # total input.
    phr = Hash.new
    lines.each do |line|
    s = StringScanner.new(line)
    words = Array.new
    loop do
    s.scan_until(/(?=\S)/) or break
    beg = s.pos
    s.scan(/\S+/)
    words << [ beg, s.pos ]
    end

    # combine words to make 'phrases'
    combos(words)

    # accumulate phrases, counting their occurrences.
    # skip phrases that are too short.
    words.each do |from, to|
    p = line[from, to - from]
    next unless p.length >= MIN_PHRASE
    phr[p] ||= 0
    phr[p] += 1
    end
    end

    # get the longest phrase that occurs the most times
    longest = phr.sort_by { |k,v| -(k.length * 1000 + v)
    }.find { |k,v| v >= MIN_OCCUR } or break
    phrase, occurs = longest

    # save the phrase, and then blank it out of the input data
    # so we can search for more phrases
    phrases << phrase
    lines.each { |line| line.gsub!(phrase, ' ' * phrase.length) }

    end

    # ...

    Luckily, that's well commented code. The comments easily explain what the
    variables are for, and then we get an introduction to the process in `loop do
    ... end`. As soon as I read that, TumbleDRYer#combos clicks into place.

    The Array of Arrays passed to TumbleDRYer#combos holds a start and end position
    for each word. The start positions are combined with all of the end positions
    after that word to give the possible phrases.

    Before the call to TumbleDRYer#combos, we can see the Array of Arrays being
    created, with a little StringScanner work. Words are found by scanning
    non-whitespace characters, because replacing at any other boundaries pretty much
    kills the human readable goal of the output.

    Just beneath that call to TumbleDRYer#combos the phrases are assembled, checked
    for the minimum length, and tallied in the phrase count. The chunk of code
    right after that selects the longest phrase that occurs the most times for
    substitution. Note the clever use of negation in #sort_by here, to reverse the
    order. The last couple of lines save the phrase and replace it with whitespace,
    so it won't match in future iterations.

    Remember, this is all inside of a big `loop do ... end`, so by the time we break
    out of here, all the replacement phrases will have been selected.

    # ...

    # now we have all the phrases we want to replace.
    # find unique abbreviations for each phrase.
    temp = Hash.new
    phrases.each do |phrase|
    key = phrase.scan(/\w+/).flatten.to_s.downcase
    key = '_' + key unless key =~ /^[_a-zA-Z]/
    key += '_' while temp.has_key? key
    temp[key] = phrase
    end
    temp.keys.abbrev.sort.each do |s, key|
    phrase = temp[key]
    abbr[phrase] = s if abbr[phrase].nil? ||
    abbr[phrase].length < MIN_ABBR
    end

    # ...

    This section of the method just finds names for each of the replaced phrases.
    It generally choses the first few letters or numbers of the phrase, unless it
    needs to insert a _ character or two to maintain Ruby syntax or break
    ambiguities. A lot of the work here is done by the abbrev library. If you're
    not familiar with how that works, ask irb:

    >> require "abbrev"

    => true
    >> names = %w{James Edward Gray II}

    => ["James", "Edward", "Gray", "II"]
    >> names.respond_to? :abbrev

    => true
    >> names.abbrev

    => {"II"=>"II", "Gr"=>"Gray", "Gra"=>"Gray", "Jam"=>"James", "Ja"=>"James",
    "Edw"=>"Edward", "Edwa"=>"Edward", "Edwar"=>"Edward", "Gray"=>"Gray",
    "James"=>"James", "Edward"=>"Edward", "E"=>"Edward", "G"=>"Gray",
    "Jame"=>"James", "I"=>"II", "Ed"=>"Edward", "J"=>"James"}

    It turns an Array into a Hash of all the unique abbreviations that can be used
    to form the words in the original Array.

    Last little bit of code:

    # ...

    # generate the output class
    puts "class WashingMachine"
    puts " def initialize"
    phrases.each do |phrase|
    puts ' @' + abbr[phrase] + " = '" +
    phrase.gsub("'", "\\\\'") + "'"
    @input.gsub!(phrase, '#{@' + abbr[phrase] + '}')
    end
    puts " end\n"
    puts " def output\nprint <<EOF"
    puts @input
    puts "EOF\n end\n"
    puts "end"

    end

    That just outputs the DRYed code to STDOUT. This output uses a Ruby heredoc
    with String interpolation. Unfortunately, I think that means that Ruby code,
    with its own String interpolation, will confuse it. Have a look at Dominik
    Bathon's code for custom interpolation or Daniel Sheppard's code for ERb
    templates that even make use of arguments in replacement.

    As always my thanks to all the creative souls that gave this quiz a try. All of
    the solutions are educational.

    Ruby Quiz will take a week off now so I can run up to Denver and wish my sister,
    Nicole Blake Thanner, a happy sweet sixteenth birthday!
    Ruby Quiz, Nov 3, 2005
    #1
    1. Advertising

  2. Late Ruby Quiz Notice (was Re: [SUMMARY] TumbleDRYer (#53))

    On Nov 3, 2005, at 7:44 AM, Ruby Quiz wrote:

    > Ruby Quiz will take a week off now...


    Just FYI, there will be a Ruby Quiz on the 11th, but it will be
    later. My return flight doesn't land until that evening and I can't
    launch it until I'm back home. It will make the right day, but not
    at the usual morning time. Thanks for your patients.

    James Edward Gray II
    James Edward Gray II, Nov 5, 2005
    #2
    1. Advertising

  3. Re: Late Ruby Quiz Notice (was Re: [SUMMARY] TumbleDRYer (#53))

    On 11/4/05, James Edward Gray II <> wrote:

    > Thanks for your patients.


    But I'm not a doctor ;)
    Thanks James for all of your hard work making the Quz happen.

    Anthony Moralez
    Anthony Moralez, Nov 6, 2005
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Adam Barr
    Replies:
    1
    Views:
    496
  2. Libs
    Replies:
    0
    Views:
    1,485
  3. zPaul

    Validation Summary

    zPaul, Jun 26, 2003, in forum: ASP .Net
    Replies:
    0
    Views:
    457
    zPaul
    Jun 26, 2003
  4. Ruby Quiz

    [QUIZ] TumbleDRYer (#53)

    Ruby Quiz, Oct 28, 2005, in forum: Ruby
    Replies:
    9
    Views:
    127
    Hugh Sasse
    Nov 2, 2005
  5. Dale Martenson

    TumbleDRYer (#53) - my solution

    Dale Martenson, Oct 31, 2005, in forum: Ruby
    Replies:
    6
    Views:
    113
    Hugh Sasse
    Nov 1, 2005
Loading...

Share This Page