[SUMMARY] Mailing List Files (#115)


Ruby Quiz

I've been playing a little with TMail lately, which is what really inspired this
quiz. I thought that a simple solution to this problem would be to pull the
pages down with open-uri and then dump them into TMail and just pull the
attachments from that. It turns out to be a bit harder to do that than I
expected, but one solution did follow that path.

What I love about this plan is the fact that you are just stitching the real
tools together. I like leaning on libraries to get tons of functionality with
just a few lines of code. Apparently, so does Louis J Scoras! Check out this
list of dependencies that kick-starts his solution (I've removed the excellent
comments in the code to save space):

  #!/usr/bin/env ruby

  require 'action_mailer'
  require 'cgi'
  require 'delegate'
  require 'elif'
  require 'fileutils'
  require 'hpricot'
  require 'open-uri'
  require 'tempfile'

  # ...

Wow.

Let's start with the standard libraries. Louis pulls in cgi to handle HTML
escapes, delegate to wrap existing classes, fileutils for easy directory
creation, open-uri to fetch web pages, and tempfile for creating temporary
files, of course. That's an impressive set of tools, all of which ship with
Ruby.

The other three dependencies are external. You can get them all as gems.
action_mailer is a component of the Rails framework used to handle email. Louis
doesn't actually use the action_mailer part, just the bundled TMail dependency.
This is a trick for getting TMail as a gem.

elif is a little library I wrote as a solution to an earlier quiz (#64). It
reads files line by line, but in reverse order. In other words, you get the
last line first, then the next to last line, all the way up to the first line.
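
The reverse-reading idea can be sketched with nothing but the standard library.
This is only an illustration, not how elif is implemented: it assumes the file
is small enough to load into memory in one go, and lines_reversed() is a
made-up helper name.

```ruby
require 'tempfile'

# Hypothetical helper: return the lines of the file at path, last line first.
# Assumption: the whole file fits in memory.
def lines_reversed(path)
  File.readlines(path).reverse
end

# Demonstrate on a throwaway file.
Tempfile.create('reverse_demo') do |file|
  file.write("first\nsecond\nthird\n")
  file.flush
  lines_reversed(file.path).each { |line| puts line }
end
```

Run as-is, this prints third, second, first.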

hpricot is a fun little HTML parser from Why the Lucky Stiff. It has a
unique interface that makes it popular for web scraping.

Now that Louis has imported all the tools he could find, he's ready to do some
fetching. Here's the start of that code:

  module Quiz115
    class QuizMail < DelegateClass(TMail::Mail)
      class << self
        attr_reader :archive_base_url

        def archive_base_url
          @archive_base_url ||
            "http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/"
        end

        def solutions(quiz_number)
          doc = Hpricot(
            open("http://www.rubyquiz.com/quiz#{quiz_number}.html")
          )
          (doc/'#links'/'li/a').collect do |link|
            [CGI.unescapeHTML(link.inner_text), link['href']]
          end
        end
      end

      # ...

This object we are examining now is a TMail enhancement, via delegation. This
section has some class methods added for easy usability. I believe the
attr_reader line is actually intended to be attr_writer though, giving you a way
to override the base URL. The reader is defined manually and just defaults to
the Ruby Talk mailing list.
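
A minimal sketch of the writer-plus-defaulting-reader pattern described above.
The ArchiveClient class and the example.com URLs are placeholders invented for
illustration; they are not part of the solution.

```ruby
# Hypothetical class showing a class-level setting with a default value.
class ArchiveClient
  class << self
    # The writer lets callers override the setting...
    attr_writer :base_url

    # ...while the hand-written reader falls back to a default.
    def base_url
      @base_url || "http://example.com/archive/"
    end
  end
end

puts ArchiveClient.base_url                  # the default
ArchiveClient.base_url = "http://mirror.example.com/archive/"
puts ArchiveClient.base_url                  # the override
```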

The solutions() method is a neat added feature of the code which allows you
to pass in a Ruby Quiz number in order to fetch all the solution emails for that
quiz. Here you can see some Hpricot parsing. Its XPath-in-Ruby style syntax is
used to pull the solution links off of the quiz page at rubyquiz.com.

Let's get to the real meat of this class now:

      # ...

      def initialize(mail)
        temp_path = to_temp_file(mail)
        boundary = MIME::BoundaryFinder.new(temp_path).find_boundary

        @tmail = TMail::Mail.load(temp_path)
        @tmail.set_content_type 'multipart', 'mixed',
          'boundary' => boundary if boundary

        super(@tmail)
      end

      private

      def to_temp_file(mail)
        temp = Tempfile.new('qmail')

        temp.write(if (Integer(mail) rescue nil)
          url = self.class.archive_base_url + mail
          open(url) { |f| x = cleanse_html f.read }
        else
          web = URI.parse(mail).scheme == 'http'
          open(mail) { |m| web ? cleanse_html(m.read) : m.read }
        end)

        temp.close
        temp.path
      end

      def cleanse_html(str)
        CGI.unescapeHTML(
          str.gsub(/\A.*?<div id="header">/mi,'').gsub(/<[^>]*>/m, '')
        )
      end
    end

    # ...

In initialize() the passed mail reference is fetched into a temporary file and a
special boundary search is performed, which we will examine in detail in just a
moment. The temp file is then handed off to TMail. After that a content_type
header is synthesized, as long as we found a boundary.

The actual fetch is made in to_temp_file(). The code that fills the Tempfile is
a little tricky there, but all it really does is recognize when we are loading
via the web so it can cleanse_html(). That method just strips the tags around
the message and unescapes entities.
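
To make the branching and the cleanup concrete, here is a small standalone
sketch. cleanse_html() is copied from the listing above; reference_kind() is a
hypothetical helper that only mirrors the tests to_temp_file() performs, and
the HTML page is invented sample input, not a real archive page.

```ruby
require 'cgi'
require 'uri'

# Hypothetical helper: name the branch to_temp_file() would take for a reference.
def reference_kind(ref)
  return :message_number if (Integer(ref) rescue nil)
  URI.parse(ref).scheme == 'http' ? :web_page : :local_file
end

# Same logic as the solution: drop everything up to the header div,
# strip the remaining tags, then unescape HTML entities.
def cleanse_html(str)
  CGI.unescapeHTML(
    str.gsub(/\A.*?<div id="header">/mi, '').gsub(/<[^>]*>/m, '')
  )
end

# Invented sample page for illustration.
page = <<~HTML
  <html><body><div id="nav">links</div>
  <div id="header">From: someone@example.com
  Subject: a &lt;test&gt; message</div>
  <pre>puts 1 &amp;&amp; 2</pre></body></html>
HTML

p reference_kind('12345')
p reference_kind('http://example.com/mail.html')
p reference_kind('saved_message.txt')
puts cleanse_html(page)
```

The cleaned text keeps the header lines and the message body, with the
entities turned back into <, > and &.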

Now we need to dig into that boundary problem I sidestepped earlier. The
messages on the web archives are missing their Content-type header and we need
to restore it in order to get TMail to accept the message. With messages that
contain attachments, that header should be multipart/mixed. However, the header
also points to a special boundary string that divides the parts of the message.
We have to find that string so we can set it in the header.
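
For readers who have not looked inside a multipart message, the following
sketch shows how the header and the boundary relate. The message and its
boundary string "XYZ" are made up for illustration, and the parsing is a rough
approximation rather than a complete MIME parser.

```ruby
# Hypothetical message; a real boundary is usually a long random string.
raw = <<~MAIL
  Content-Type: multipart/mixed; boundary="XYZ"

  --XYZ
  Content-Type: text/plain

  Here is my solution.
  --XYZ
  Content-Disposition: attachment; filename="solution.rb"

  puts "hello"
  --XYZ--
MAIL

# The header names the boundary...
headers, body = raw.split("\n\n", 2)
boundary = headers[/boundary="?([^";\s]+)"?/i, 1]

# ...and each part of the body is introduced by "--" plus that boundary.
delimiter = /^--#{Regexp.escape(boundary)}(?:--)?[ \t]*$/
parts = body.split(delimiter).map(&:strip).reject(&:empty?)

puts "boundary: #{boundary}"
puts "parts:    #{parts.size}"
```

Without the Content-Type header, a parser has no way to know that "XYZ" is the
separator, which is why the header has to be rebuilt.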

The next class handles that operation:

    # ...

    module MIME
      class BoundaryFinder
        def initialize(file)
          @elif = ::Elif.new(file)
          @in_attachment_headers = false
        end

        def find_boundary
          while line = @elif.gets
            if @in_attachment_headers
              if boundary = look_for_mime_boundary(line)
                return boundary
              end
            else
              look_for_attachment(line)
            end
          end
          nil
        end

        private

        def look_for_attachment line
          if line =~ /^content-disposition\s*:\s*attachment/i
            puts "Found an attachment" if $DEBUG
            @in_attachment_headers = true
          end
        end

        def look_for_mime_boundary line
          unless line =~ /^\S+\s*:\s*/ ||  # Not a mail header
                 line =~ /^\s+/            # Continuation line?
            puts "I think I found it...#{line}" if $DEBUG
            line.strip.gsub(/^--/, '')
          else
            nil
          end
        end
      end
    end
  end

  # ...

This class is a trivial parser that hunts for the missing boundary. It uses
Elif to read the file backwards, watching for an attachment to come up. When it
detects that it is inside an attachment, it switches modes. In the new mode it
skips over headers and continuation lines until it reaches the first line that
doesn't seem to be part of the headers. That's the boundary.
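
The same two-mode scan can be tried without Elif or a file on disk. This is a
condensed adaptation, not the solution's class: it works on a String held in
memory, find_boundary() here is a standalone function, and the sample message
with its "BOUNDARY42" marker is invented.

```ruby
# Walk the lines from last to first. Once an attachment's Content-Disposition
# header is seen, skip header and continuation lines; the first other line
# is taken to be the boundary.
def find_boundary(message)
  in_attachment_headers = false
  message.lines.reverse_each do |line|
    if in_attachment_headers
      next if line =~ /^\S+\s*:\s*/ || line =~ /^\s+/
      return line.strip.sub(/^--/, '')
    elsif line =~ /^content-disposition\s*:\s*attachment/i
      in_attachment_headers = true
    end
  end
  nil
end

# Invented sample message whose Content-Type header has been lost.
sample = <<~'MAIL'
  From: someone@example.com
  Subject: solution

  --BOUNDARY42
  Content-Type: text/plain

  Here is my solution.
  --BOUNDARY42
  Content-Type: text/plain; name="solution.rb"
  Content-Disposition: attachment; filename="solution.rb"

  puts "hello"
  --BOUNDARY42--
MAIL

p find_boundary(sample)
p find_boundary("no attachments here\n")
```

Reading from the end, the scan passes the closing marker and the attachment
body, hits Content-Disposition, skips the Content-Type header above it, and
stops at the --BOUNDARY42 line. A message without attachments yields nil.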

The rest of the code just puts these tools to work:

  # ...

  include Quiz115
  include FileUtils

  def process_mail(mailh, outdir)
    begin
      t = QuizMail.new(mailh)
      if t.has_attachments?
        t.attachments.each do |attachment|
          outpath = File.join(outdir, attachment.original_filename)
          puts "\tWriting: #{outpath}"
          File.open(outpath, 'w') do |out|
            out.puts attachment.read
          end
        end
      else
        outfile = File.join(outdir, 'solution.txt')
        File.open(outfile, 'w') {|f| f.write t.body}
      end
    rescue => e
      puts "Couldn't parse mail correctly. Sorry! (E: #{e})"
    end
  end

  def to_dirname(solver)
    solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
  end

  # ...

process_mail() builds a QuizMail object out of the passed reference number, then
copies the attachments from TMail to files in the indicated directory. If the
message has no attachments, you just get the full message instead.
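
The write-out step can be exercised without TMail. In this sketch Attachment
is a hypothetical Struct standing in for a TMail attachment, save_message() is
an invented name, and the files go to a temporary directory. Unlike the
listing above, it writes the content verbatim instead of using puts.

```ruby
require 'tmpdir'

# Hypothetical stand-in for a TMail attachment: a file name plus its content.
Attachment = Struct.new(:original_filename, :content)

# Write each attachment into outdir, or the body to solution.txt when there
# are no attachments. Returns the list of paths written.
def save_message(attachments, body, outdir)
  if attachments.empty?
    outfile = File.join(outdir, 'solution.txt')
    File.write(outfile, body)
    [outfile]
  else
    attachments.map do |attachment|
      outpath = File.join(outdir, attachment.original_filename)
      File.write(outpath, attachment.content)
      outpath
    end
  end
end

Dir.mktmpdir do |dir|
  files = [Attachment.new('solution.rb', "puts 'hi'\n")]
  save_message(files, 'message text', dir).each { |path| puts "Wrote #{path}" }
  save_message([], 'message text', dir).each { |path| puts "Wrote #{path}" }
end
```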

to_dirname() is a directory name sanitizer for when the code is downloading the
solutions from a quiz, as mentioned earlier.
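
A quick way to see what the sanitizer does is to run it on a couple of names.
The method is copied from the listing above; the second input is an invented
example chosen to exercise the deleted characters.

```ruby
def to_dirname(solver)
  solver.downcase.delete('!#$&*?(){}').gsub(/\s+/, '_')
end

puts to_dirname('Louis J Scoras')  # prints louis_j_scoras
puts to_dirname('Who? (Me!)')      # prints who_me
```

The name is lowercased, the listed punctuation is dropped, and each run of
whitespace becomes a single underscore.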

Here's the application code:

  # ...

  query = ARGV[0]
  outdir = ARGV[1] || '.'

  unless query
    $stderr.puts "You must specify either a ruby-talk message id, or a\n" \
                 "quiz number (prefixed by 'q')"
    exit 1
  end

  if query =~ /\Aq/i
    quiz_number = query.sub(/\Aq/i, '')
    puts "Fetching all solutions for quiz \##{quiz_number}"

    QuizMail.solutions(quiz_number).each do |solver, url|
      puts "Fetching solution from #{solver}."

      dirname = to_dirname(solver)
      solver_dir = File.join(outdir, dirname)

      mkdir_p solver_dir
      process_mail(url, solver_dir)
    end
  else
    process_mail(query, outdir)
  end

  exit 0

This code just pulls in the arguments, and runs them through one of two
processes. If the number is prefixed with a q, the code scrapes rubyquiz.com
for that quiz number and pulls all the solutions. It creates a directory for
each solution, then processes each of those messages. Otherwise, it handles
just the individual message.
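
The dispatch on the first argument boils down to one regular expression. In
this sketch parse_query() is a hypothetical helper that reports which path
would be taken, and the message number is an arbitrary example.

```ruby
# Hypothetical helper: decide which of the two processes a query selects.
def parse_query(query)
  if query =~ /\Aq/i
    [:quiz, query.sub(/\Aq/i, '')]
  else
    [:message, query]
  end
end

p parse_query('q115')    # a quiz number, prefix stripped
p parse_query('Q64')     # the prefix is case-insensitive
p parse_query('123456')  # anything else is treated as a single message
```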

My thanks to those who helped me solve this problem for all quiz fans. We now
have an excellent resource to share with people who ask about retrieving the
garbled solutions.

Tomorrow, it's back to fun and games for the quiz, but this time we're on a
search for pure strategy...
 