[QUIZ] Code Heuristics (#172)

Matthew Moss · Aug 1, 2008

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The three rules of Ruby Quiz 2:

1. Please do not post any solutions or spoiler discussion for this
quiz until 48 hours have passed from the time on this message.

2. Support Ruby Quiz 2 by submitting ideas as often as you can! (A
permanent, new website is in the works for Ruby Quiz 2. Until then,
please visit the temporary website at

<http://splatbang.com/rubyquiz/>.

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem
helps everyone on Ruby Talk follow the discussion. Please reply to
the original quiz message, if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

## Code Heuristics (#172)

This week, your task is to make my job simpler.

Each week, coders send in their submissions to Ruby Quiz problems,
usually as a mix of quiz discussion and actual code. Your task, then,
is to take a submission as input, and generate output that is the
extracted code. Every line of the input that isn't code should be
prefixed with the comment marker, `#`.

Take the following submission to quiz #158:

I may have missed it, but I didn't see this solution:

require 'enumerator'
"Hello world!\n".enum_for

each_byte).inject($stdout) { |res, b|
res << b.chr
}

Which in 1.9 looks much better:

"Hello world!\n".each_char.inject($stdout) { |res, c|
res << c
}

Stefano

Your code should take that as input and generate this output:

# I may have missed it, but I didn't see this solution:

require 'enumerator'
"Hello world!\n".enum_for

each_byte).inject($stdout) { |res, b|
res << b.chr
}

# Which in 1.9 looks much better:

"Hello world!\n".each_char.inject($stdout) { |res, c|
res << c
}

# Stefano

Assuming you knew for certain which lines should be commented and
which should not, your submission would be little more than a
glorified `cat`. This quiz is more about identifying Ruby-looking
text: what heuristics or patterns will you use to determine what lines
or groups of lines are Ruby code or not.

Fortunately, there is a large sample set from which you can
experiment: all existing Ruby Quiz submissions. Unfortunately -- for
this quiz, not in general -- Ruby is very dynamic and flexible, which
can make code identification a difficult problem. So I am not
expecting perfection here, but I want to see what tricks, or general
process, you can come up with for this problem.

Extra credit: Provide an optional command-line argument to "split" the
output into multiple files. Each separate file would be a different
chunk of the code, where code is separated by discussion. Commented
discussion should be kept with the code that follows it.

Joshua Ballanco · Aug 3, 2008

So, based on the sample test case, do we have to worry about Ruby 1.9 as
well? and the mixed case? That would make this test quite a bit tougher!

Frederick Cheung · Aug 3, 2008

-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D= -=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-

## Code Heuristics (#172)

This week, your task is to make my job simpler.

Each week, coders send in their submissions to Ruby Quiz problems,
usually as a mix of quiz discussion and actual code. Your task, then,
is to take a submission as input, and generate output that is the
extracted code. Every line of the input that isn't code should be
prefixed with the comment marker, `#`.

I got lazy and thought I'd let ruby do the hard work. Given some text, =20=

I feed it through

eval("BEGIN {return true}\n#{code}", nil)

and see if an exception is raised or not (this does have limitations). =20=

It's not enough to do this line by line, for example

if foo
puts "bar"
end

the first and last lines are not, on their own, valid ruby, but as a =20
whole it is of course valid.

For a given chunk of text we first try and find the maximal prefix =20
(prefix isn't quite the right word, since we only split at lines) that =20=

is valid ruby.
To do this we take an empty string and just add a line of the input at =20=

a time, running eval each time to see if what we have is valid.

if there is no such prefix, then the first line must be comment text =20
and so we tag that line as a comment. We remove the line and repeat =20
processing on the remaining lines.
if there is such a prefix then that prefix is tagged as code, we =20
remove it and process the remaining lines.

The output formatter sort of does the splitting into separate files - =20=

it prints a mark to the screen where it would split (I was too lazy to =20=

start messing around with files).

What this code doesn't deal well with is lines with not much text, for =20=

example:

I think this does it:
if foo
bar
end
Fred

The line Fred is marked as code, because that is perfectly legal ruby, =20=

it's just the value of the constant Fred. Of course that would =20
probably blowup if you actually evaluated that line but my valid code =20=

detector can't handle that (I can't really think how you could handle =20=

this with true certainty without actually executing the code).

Sequences like

hope this helps

also look like legal code (but will produce warnings). To get around =20
this we require that our evaling produces no warnings (and thus we =20
trust that ruby quiz submitters squash warnings from their code

).
Another limitation was that if you were trying to say 'this works in =20
ruby 1.8 but only this works in 1.9' then this solution would fail if =20=

the 1.9 code used some ruby 1.9 specific bit of syntax (obviously if =20
the example just uses differences in the standard library that is =20
irrelevant) and you were running this script on ruby 1.8

Fred

Usage: (example stolen from Mikael Hoilund's submission to 171)

CodeExtractor.extract(<<TEXT
Oh hi, I just thought I'd golf a solution. I'm sure other people can =20
do a much better job than I making a full hexdumping suite, so I just =20=

had some fun. Can't seem to get it lower than 78 characters, =20
unfortunately.

i=3D0;$<.read.scan(/.{0,16}/m){puts"%08x "%i+$&.unpack('H4'*8).join(' =20=

');i+=3D16}

Expanded and parenthesified, clarified:

i =3D 0
ARGF.read.scan(/.{0,16}/m) {
puts(("%08x " % i) + $&.unpack('H4'*8).join(' '))
i +=3D 16
}

ARGF (aliased as $<) is the file handle of all file names given in the =20=

arguments concatenated, STDIN if none =97 exactly what we need. The =20
regex to scan matches between 0 and 16 characters (including newline) =20=

greedily. Change it to 1,16 if you don't want the empty line at the end.

Instead of letting the block to scan take an argument, I used a trick =20=

I picked up from the last Ruby Quiz I participated in (Obfuscated =20
Email), and use $& inside the block, which is the last regex match. =20
Saves two characters \o/
TEXT
)

produces as output:

#Oh hi, I just thought I'd golf a solution. I'm sure other people can =20=

do a much better job than I making a full hexdumping suite, so I just =20=

had some fun. Can't seem to get it lower than 78 characters, =20
unfortunately.

i=3D0;$<.read.scan(/.{0,16}/m){puts"%08x "%i+$&.unpack('H4'*8).join(' =20=

');i+=3D16}

-------
#Expanded and parenthesified, clarified:

i =3D 0
ARGF.read.scan(/.{0,16}/m) {
puts(("%08x " % i) + $&.unpack('H4'*8).join(' '))
i +=3D 16
}

-------
#ARGF (aliased as $<) is the file handle of all file names given in =20
the arguments concatenated, STDIN if exactly what we need. The regex =20=

to scan matches between 0 and 16 characters (including newline) =20
greedily. Change it to 1,16 if you don't want the empty line at the =20
end.none

#Instead of letting the block to scan take an argument, I used a trick =20=

I picked up from the last Ruby Quiz I participated in (Obfuscated =20
Email), and use $& inside the block, which is the last regex match. =20
Saves two characters o/

The code:

require 'stringio'
Struct.new 'Line', :data, :code
class CodeExtractor
attr_reader :lines,

utput

def initialize(text)
@output =3D []
@lines =3D text.split(/[\r\n]/)
end

def extract
while lines.any?
process lines
end
end

def valid_syntax?(code)
io =3D StringIO.new
original_err, $stderr=3D $stderr, io
eval("BEGIN {return true}\n#{code}")
raise 'here'
rescue Exception
false
ensure
$stderr =3D original_err
return false if io.length > 0
end

#returns the maximum number of lines (contiguous from the start) =20
that are valid ruby
def valid_code_prefix_length lines
max_valid_lines =3D 0
code =3D ""
lines.each_with_index do |line, index|
code << line
code << "\n"
if valid_syntax? code
max_valid_lines =3D index + 1
end
end
return max_valid_lines
end

def append_output(line, code)
@output << Struct::Line.new(line, code)
end

def process lines
if (prefix_length =3D valid_code_prefix_length lines) > 0
prefix_length.times { append_output lines.shift, true }
else
append_output lines.shift, false
end
end

def format_output
last_line =3D nil
@output.each do |line|
if line.data =3D~ /^\s*$/
puts ""
next
end
if last_line && last_line.code && !line.code #transition from =20
code to comment
puts "-------"
end
puts "#{line.code ? '':'#'}#{line.data}"
last_line =3D line
end
end

def self.extract(text)
c=3D CodeExtractor.new text
c.extract
c.format_output
nil
end
end=

Jesse Brown · Aug 3, 2008

Here is my attempt.
I made the huge assumption that most code adheres to some indentation
style. Based on that all I really had to worry about was what lines
would usually not be indented (those that start a new scope).
I also changed the prefixing of the output since lines with # might
actually be part of the code as comments.
After dealing with this in several submissions I decided to allow for
text from a copied email to be removed (usually prefixed with "> ") so
that only the submitters comments were included.
With the provided example...

172 $ ruby rubyish.rb < test.eml
[TEXT] I may have missed it, but I didn't see this solution:

Code:

 require 'enumerator'
[CODE] "Hello world!\n".enum_for(:each_byte).inject($stdout) { |res, b|
[CODE]   res << b.chr
[CODE] }

[TEXT] Which in 1.9 looks much better:

[CODE] "Hello world!\n".each_char.inject($stdout) { |res, c|
[CODE]   res << c
[CODE] }

[TEXT] Stefano


rubyish.rb:

#!/usr/bin/env ruby

KEYWORDS = %{
begin do end while for in if else break redo next loop retry ensure
rescue case when exit unless and or not class module new def raise
puts print p
}

THRESHOLD = 0.60

def load_keywords(file)
return KEYWORDS unless file
keywords = []
begin
File.open(file).each do |line|
keywords << line.chomp.split(/\s+/)
end
rescue
keywords = KEYWORDS
end
keywords.flatten
end

# Determine if the line looks as though it begins/ends a block of code
def block_statement?(text)

# Assignment is assumed to be code
# This can lead to several false positives...keep?
return true if text =~ /[^=]+=[^=]/

case text
when /^(def|class|module)/
true
when /^(begin|do|while|ARG.[\[\.]|for|case|if|loop|puts|print|p[( ])/
true
when /(\{|do) (\|.*?\|)?\s*$/
true
when /^(end|\})/
true
else
false
end
end

# Assume that symbols are language items. And words that match
# a set of keywords are language items. Everything else is text
# Whitespace is not considered
# Returns [# language tokens, # total tokens]
def separate(text,keywords)
words = text.scan(/\w+/)
s = words.join
symbols = (text.split(//) - s.split(//)).delete_if { |x| x =~ /\s+/ }

special = symbols.size
total = words.size + special

words.each { |w| special += 1 if keywords.include?(w) }

[special,total]
end

def usage
$stderr.puts "USAGE: #{File.basename($0)} [options] [-f <file>]"
$stderr.puts "-"*40
$stderr.puts " -c <prefix>"
$stderr.puts "      When text is copied from a previous email and
should be"
$stderr.puts "      removed, prefix is the start of the copied lines
(ex. -c '> ')"
$stderr.puts " -s [name]"
$stderr.puts "      Split comment/code pairs into separate files"
$stderr.puts "      If provided, files are named name.N (Default:
submission)"
$stderr.puts " -t <value>"
$stderr.puts "      Threshold for assuming a line is code vs.
comment."
$stderr.puts "      0.0 - 1.0 : percent of symbols that need to be
code-like (Default: 0.60)"
$stderr.puts " -k <keyword file>"
$stderr.puts "      File containing the whitespace-separated list of
keywords"
$stderr.puts "      to use in code-like matching"
$stderr.puts " -p [on|off]"
$stderr.puts "      If on, prefix code output with [CODE] and comment"
$stderr.puts "      output with [TEXT]. (Default: on)"
exit 42
end

def parse_options
opts = {:prefix => true}
ARGV.join.split('-').delete_if { |x| x == "" }.each do |arg|
case arg
when /^c(.+)/
opts[:copied] = $1.strip
when /^s(.+)/
opts[:outfile] = $1.strip
when /^t(.)/
t = $1.strip.to_f
raise "Invalid Threshold #{$1}" if t < 0.0 || t > 1.0
opts[:threshold] = t
when /^k(.+)/
opts[:keyfile] = $1.strip
when /^f(.+)/
opts[:input] = $1.strip
when /^p(.+)/
opts[:prefix] = $1 == "on"
else
usage
end
end
opts
end


# Begin execution

opts = parse_options
input = opts[:input] ? File.open(opts[:input]) : $stdin
keywords = load_keywords(opts[:keywords])
threshold = opts[:threshold] || THRESHOLD

# Initial classification
classified = []
input.each do |line|
# ignore all lines copied from previous emails
next if opts[:copied] && line =~ /^#{opts[:copied]}/
case line
when /^\s*$/
classified << [:blank, line]
when /^\s*#/, /^\s+/, /^require/
classified << [:code, line]
else
classified << [:text, line]
end
end

# Make educated guesses based on content of remaining lines
estimated = []
classified.each do |type,line|
case type
when :code, :blank
estimated << [type, line]
when :text
if block_statement?(line)
estimated << [:code, line]
else
# Compare words to 'guessed' language characters
special,total = separate(line, keywords)
if special.to_f/total.to_f >= threshold
estimated << [:code,line]
else
estimated << [:text,line]
end
end
end
end

# Assume that one line of code surrounded by two non-code lines
# is just an example and not part of the actual submission
size = estimated.size
(0...size).each do |i|
next if i < 2
next if i > size - 3
a,_,b,_,type,line,y,_,z_ = estimated.slice((i-2)..(i+2))
next unless type == :code
next if [a,b,c,d].include? :code
estimated[i] = [:text, line]
end

# Output modified submission
n, last, out = 0, nil, $stdout
if opts[:outfile]
file = "%s.%d" % [opts[:outfile], n]
out = File.open(file, 'w')
end

estimated.each do |type,line|
case type
when :blank
out.puts
next
when :text
if last == :code && out != $stdout
out.close
file = "%s.%d" % [opts[:outfile], n += 1]
out = File.open(file, 'w')
end
prefix = opts[:prefix] ? " [TEXT] " : ""
out.puts prefix + line
when :code
prefix = opts[:prefix] ? " [CODE] " : ""
out.puts prefix + line
end
last = type
end
out.close

Matthew Moss · Aug 7, 2008

Summary for the Code Heuristics quiz delayed; should be up tomorrow.

Matthew Moss · Aug 9, 2008

Summary for the Code Heuristics quiz delayed; should be up tomorrow.

Apologies, things have come up that have prevented me from finishing
the summary. I'll get it done this weekend. Needless to say, there
won't be a new quiz this week.

[QUIZ] The Matrix: Code Screen (#230)	0	Mar 20, 2010
[QUIZ] Morse Code (#121)	25	Apr 20, 2007
[QUIZ] Syntax Highlighting (#218)	1	Aug 24, 2009
[QUIZ] Quine (#207)	42	May 29, 2009
[QUIZ] Ruby BASIC (#228)	1	Jan 30, 2010
[QUIZ] Gathering Ruby Quiz 2 Data (#189)	10	Jan 23, 2009
[QUIZ] Hello, world? (#158)	40	Mar 1, 2008
[QUIZ] Distinct Sets (#225)	19	Nov 21, 2009

[QUIZ] Code Heuristics (#172)

Matthew Moss

Joshua Ballanco

Frederick Cheung

Jesse Brown

Matthew Moss

Matthew Moss

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads