[SOLUTION] Quoted Printable (#23)

Patrick Hurley

I am a Ruby newbie, so be kind. I wrote the code myself, but blatantly
stole Dave Burt's test cases - thank you. I also found one test case
that breaks my code (and Dave's). I am not sure what the correct
answer is, but I know mine is wrong:

Consider:
"===
\n"
which will leave a space at the end of a line once the line is
wrapped. Is it the case that all whitespace at the end of a line must
be encoded (increasing size rather needlessly, but simplifying this
case)? Either way, I am too tired and have other important stuff to
do, so I will let it go.

Please feel free to let me know where I did not do things the "Ruby
way" as I am primarily a C++ and Perl guy, but very interested in
getting better at Ruby.

Thanks
pth


#
# == Synopsis
#
# Ruby Quiz #23
#
# The quoted printable encoding is used primarily in email, though it has
# recently seen some use in XML areas as well. The encoding is simple to
# translate to and from.
#
# This week's quiz is to build a filter that handles quoted printable
# translation.
#
# Your script should be a standard Unix filter, reading from files listed on
# the command-line or STDIN and writing to STDOUT. In normal operation, the
# script should encode all text read in the quoted printable format. However,
# your script should also support a -d command-line option and when present,
# text should be decoded from quoted printable instead. Finally, your script
# should understand a -x command-line option and when given, it should encode
# <, > and & for use with XML.
#
# == Usage
#
# ruby quiz23.rb [-d | --decode ] [ -x | --xml ]
#
# == Author
# Patrick Hurley, Cornell-Mayo Assoc
#
# == Copyright
# Copyright (c) 2005 Cornell-Mayo Assoc
# Licensed under the same terms as Ruby.
#

require 'optparse'
require 'rdoc/usage'

module QuotedPrintable
  MAX_LINE_PRINTABLE_ENCODE_LENGTH = 76

  def from_qp
    result = self.gsub(/=\r\n/, "")
    result.gsub!(/\r\n/m, $/)
    result.gsub!(/=([\dA-F]{2})/) { $1.hex.chr }
    result
  end

  def to_qp(handle_xml = false)
    char_mask = if (handle_xml)
                  /[^!-%,-;=?-~\s]/
                else
                  /[^!-<>-~\s]/
                end

    # encode the non-space characters
    result = self.gsub(char_mask) { |ch| "=%02X" % ch[0] }
    # encode the last space character at end of line
    result.gsub!(/(\s)(?=#{$/})/o) { |ch| "=%02X" % ch[0] }

    lines = result.scan(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
    lines.join("=\n").gsub(/#{$/}/m, "\r\n")
  end

  def QuotedPrintable.encode
    STDOUT.binmode
    while (line = gets) do
      print line.to_qp
    end
  end

  def QuotedPrintable.decode
    STDIN.binmode
    while (line = gets) do
      # I am a ruby newbie, and I could
      # not get gets to get the \r\n pairs
      # no matter how I set $/ - any pointers?
      line = line.chomp + "\r\n"
      print line.from_qp
    end
  end

end

class String
  include QuotedPrintable
end

if __FILE__ == $0

  opts = OptionParser.new
  opts.on("-h", "--help") { RDoc::usage; }
  opts.on("-d", "--decode") { $decode = true }
  opts.on("-x", "--xml") { $handle_xml = true }

  opts.parse!(ARGV) rescue RDoc::usage('usage')

  if ($decode)
    QuotedPrintable.decode()
  else
    QuotedPrintable.encode()
  end
end
 
Dave Burt

Patrick Hurley said:
I am a ruby newbie, so be kind. I wrote the code myself, but blatantly
stole Dave Burt's test cases - thank you. I also found one test case

Quiz tests are for sharing - I think that's established. In any case, you're
welcome to them.

that breaks my code (and Dave's) that I am not sure what the correct
answer is, but I know mine is wrong:

Consider:
"===
\n"
which will cause a new space to be found at the end of a string - is
it the case that all space at the end of the line is encoded
(increasing size rather needlessly), but simplifying this case? Either
way, I am too tired and have other important stuff to do so I will let
it go.

I see no problem. I've added that test case, and both our solutions
pass.

http://www.dave.burt.id.au/ruby/test-quoted-printable.rb

Please feel free to let me know where I did not do things the "Ruby
way" as I am primarily a C++ and Perl guy, but very interested in
getting better at Ruby.
...
/[^!-<>-~\s]/

Bug: "\f" doesn't get escaped (it's part of /\s/). Probably "\r" as well;
that's harder to test on windows.
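
A minimal check of the point about /\s/, assuming a current Ruby. The constant and method names are invented for illustration, and NARROWER_MASK is only one possible fix, not the one Patrick later posted. Because \s covers form feed and carriage return, a negated class that lists \s lets both through unescaped:

```ruby
# Patrick's original mask: anything NOT listed in the class gets escaped.
ORIGINAL_MASK = /[^!-<>-~\s]/
# Hypothetical alternative: allow only space, tab and newline through.
NARROWER_MASK = /[^!-<>-~ \t\n]/

# Replace every character matched by the mask with its =XX escape.
def escape_with(mask, text)
  text.gsub(mask) { |ch| format("=%02X", ch.ord) }
end

p escape_with(ORIGINAL_MASK, "a\fb\rc")  # => "a\fb\rc" (nothing escaped)
p escape_with(NARROWER_MASK, "a\fb\rc")  # => "a=0Cb=0Dc"
```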

I see no other problems. Your optparse is better (i.e. shorter) than mine
:). Your
(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
makes you look like a Perl 5 junkie, though. Also, you use global
variables - we rubyists shun these: use locals.
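
One way to follow the "use locals" advice, sketched under the assumption that a small helper method is acceptable. The name parse_options and the hash keys are hypothetical, not taken from either solution:

```ruby
require 'optparse'

# Hypothetical helper: collect the quiz's switches in a local hash
# instead of the globals $decode and $handle_xml.
def parse_options(argv)
  options = { decode: false, xml: false }
  parser = OptionParser.new
  parser.on("-d", "--decode") { options[:decode] = true }
  parser.on("-x", "--xml")    { options[:xml] = true }
  parser.parse!(argv)  # removes the recognised switches from argv
  options
end

p parse_options(%w[-d])              # => {:decode=>true, :xml=>false} (inspect format varies by Ruby version)
p parse_options(%w[--xml file.txt])
```

The blocks close over the local hash, so nothing leaks outside the method.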

Cheers,
Dave
 
James Edward Gray II

(from Patrick's solution--for those who missed it)

while (line = gets) do
# I am a ruby newbie, and I could
# not get gets to get the \r\n pairs
# no matter how I set $/ - any pointers?
...

James Edward Gray II
 
Patrick Hurley

Thanks for the kind response.

When I said the test case failed, I meant that the actual output from
encoding the line has trailing space at the end of a line. We both
escape trailing spaces before we break lines - if the line breaking
then leaves unescaped space at the end of a line, is that not an
issue? (The continuation = might mean that it is not.)

Yup, there was an issue with the masks; I fixed that and removed the
globals (my Perl habit of just throwing in a $ when in doubt :). There
was also a bug in the command line driver, which I have fixed. The
patched code follows.

(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
makes you look like a Perl 5 junkie,

I did this to allow the use of a gsub, which is much faster than the
looping solution. The lookaheads and general ugliness handle the
special cases. I probably should use /x, space it out and comment it,
but when I am in the regexp zone, I know what I am typing <grin>.

#
# == Synopsis
#
# Ruby Quiz #23
#
# The quoted printable encoding is used primarily in email, though it has
# recently seen some use in XML areas as well. The encoding is simple to
# translate to and from.
#
# This week's quiz is to build a filter that handles quoted printable
# translation.
#
# Your script should be a standard Unix filter, reading from files listed on
# the command-line or STDIN and writing to STDOUT. In normal operation, the
# script should encode all text read in the quoted printable format. However,
# your script should also support a -d command-line option and when present,
# text should be decoded from quoted printable instead. Finally, your script
# should understand a -x command-line option and when given, it should encode
# <, > and & for use with XML.
#
# == Usage
#
# ruby quiz23.rb [-d | --decode ] [ -x | --xml ]
#
# == Author
# Patrick Hurley, Cornell-Mayo Assoc
#
# == Copyright
# Copyright (c) 2005 Cornell-Mayo Assoc
# Licensed under the same terms as Ruby.
#

require 'optparse'
require 'rdoc/usage'

module QuotedPrintable
  MAX_LINE_PRINTABLE_ENCODE_LENGTH = 76

  def from_qp
    result = self.gsub(/=\r\n/, "")
    result.gsub!(/\r\n/m, $/)
    result.gsub!(/=([\dA-F]{2})/) { $1.hex.chr }
    result
  end

  def to_qp(handle_xml = false)
    char_mask = if (handle_xml)
                  /[\x00-\x08\x0b-\x1f\x7f-\xff=<>&]/
                else
                  /[\x00-\x08\x0b-\x1f\x7f-\xff=]/
                end

    # encode the non-space characters
    result = self.gsub(char_mask) { |ch| "=%02X" % ch[0] }
    # encode the last space character at end of line
    result.gsub!(/(\s)(?=#{$/})/o) { |ch| "=%02X" % ch[0] }

    lines = result.scan(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
    lines.join("=\n").gsub(/#{$/}/m, "\r\n")
  end

  def QuotedPrintable.encode(handle_xml=false)
    STDOUT.binmode
    while (line = gets) do
      print line.to_qp(handle_xml)
    end
  end

  def QuotedPrintable.decode
    STDIN.binmode
    while (line = gets) do
      # I am a ruby newbie, and I could
      # not get gets to get the \r\n pairs
      # no matter how I set $/ - any pointers?
      line = line.chomp + "\r\n"
      print line.from_qp
    end
  end

end

class String
  include QuotedPrintable
end

if __FILE__ == $0

  decode = false
  # off by default; <, > and & are only escaped when -x is given
  handle_xml = false
  opts = OptionParser.new
  opts.on("-h", "--help") { RDoc::usage; }
  opts.on("-d", "--decode") { decode = true }
  opts.on("-x", "--xml") { handle_xml = true }

  opts.parse!(ARGV) rescue RDoc::usage('usage')

  if (decode)
    QuotedPrintable.decode()
  else
    QuotedPrintable.encode(handle_xml)
  end
end
 
Dave Burt

Patrick Hurley said:
Thanks for the kind response.

When I said the test case failed, I meant that the actual output from
encoding the line has trailing space at the end of a line. We both
escape trailing spaces before we break lines - if the line breaking
then leaves unescaped space at the end of a line, is that not an
issue? (The continuation = might mean that it is not.)

From the RFC (2045, section 6.7):
Any TAB (HT) or SPACE characters
on an encoded line MUST thus be followed on that line
by a printable character. In particular, an "=" at the
end of an encoded line, indicating a soft line break
(see rule #5) may follow one or more TAB (HT) or SPACE
characters.

So it's all good - unescaped tabs and spaces are fine as long as it's got a
printable non-whitespace character after it, and "=" is fine for that.
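
The rule can be turned into a quick check on encoder output. This is only a sketch for a current Ruby, and the method name is made up here: it reports whether any encoded line ends in a raw SPACE or TAB.

```ruby
# Hypothetical helper: true when no encoded line ends in an unescaped
# SPACE or TAB, which is what the quoted RFC paragraph requires.
def trailing_whitespace_free?(encoded)
  encoded.split(/\r?\n/).none? { |line| line.match?(/[ \t]\z/) }
end

p trailing_whitespace_free?("=3D=3D=3D   =\r\n  =20\r\n")  # => true
p trailing_whitespace_free?("abc \r\n")                     # => false
```

In the first string the spaces are followed by the soft-break "=" on one line and by "=20" on the next, so both lines pass.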

... Therefore, when decoding a Quoted-Printable
body, any trailing white space on a line must be
deleted, as it will necessarily have been added by
intermediate transport agents.

There's something I think we've all forgotten to do -- strip trailing unescaped
whitespace. I've added the following test:

def test_decode_strip_trailing_space
  assert_equal(
    "The following whitespace must be ignored: \r\n".from_quoted_printable,
    "The following whitespace must be ignored:\n")
end

And the following line to decode_string:
result.gsub!(/[\t ]+(?=\r\n|$)/, '')
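
For context, here is how that stripping step might sit in a complete decoder. This is a sketch for a current Ruby, not Dave's actual decode_string (which is not shown in the thread); the name decode_qp is invented. The stripping runs before the =XX escapes are decoded, so an escaped trailing space such as "=20" survives.

```ruby
# Hypothetical decoder sketch; works on a binary copy of the input so that
# decoded bytes above 127 do not clash with the source encoding.
def decode_qp(text)
  text.b
      .gsub(/[\t ]+(?=\r?\n|\z)/, "")        # drop transport-added trailing whitespace
      .gsub(/=\r?\n/, "")                    # remove soft line breaks
      .gsub(/\r\n/, "\n")                    # normalise hard line breaks
      .gsub(/=([0-9A-F]{2})/) { $1.hex.chr } # decode =XX escapes
end

p decode_qp("The following whitespace must be ignored: \r\n")
# => "The following whitespace must be ignored:\n"
p decode_qp("a=3Db=\r\nc=20\r\n")
# => "a=bc \n"
```
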

Yup, there was an issue with the masks; I fixed that and removed the
globals (my Perl habit of just throwing in a $ when in doubt :). There
was also a bug in the command line driver, which I have fixed. The
patched code follows.

(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
makes you look like a Perl 5 junkie,

I did this to allow the use of a gsub, which is much faster than the
looping solution. The lookaheads and general ugliness handle the
special cases. I probably should use /x, space it out and comment it,
but when I am in the regexp zone, I know what I am typing <grin>.

Write-only? No, I'm not in a fantastic position to comment, mine is not that
much shorter.
...
def QuotedPrintable.decode
STDIN.binmode
while (line = gets) do
# I am a ruby newbie, and I could
# not get gets to get the \r\n pairs
# no matter how I set $/ - any pointers?

| C:\WINDOWS>ruby
| STDIN.binmode
| gets.each_byte do |b| puts b end
| ^Z
|
| 13
| 10
|
Seems to work for me - that output says I wouldn't need the following line
line = line.chomp + "\r\n"
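
The same behaviour can be reproduced without a Windows console. In this sketch a StringIO stands in for binary-mode STDIN (an assumption made to keep the example self-contained): gets returns the "\r\n" untouched, and chomp removes either "\r\n" or a bare "\n", so the chomp-and-append line only changes input that arrives with bare LF endings.

```ruby
require 'stringio'

# StringIO used here as a stand-in for STDIN after STDIN.binmode.
input = StringIO.new("soft break=\r\nplain\n")

first  = input.gets   # CRLF is kept as read
second = input.gets   # bare LF stays a bare LF

p first                    # => "soft break=\r\n"
p first.chomp + "\r\n"     # => "soft break=\r\n" (unchanged)
p second.chomp + "\r\n"    # => "plain\r\n" (LF turned into CRLF)
```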

Cheers,
Dave
 
Patrick Hurley

Thanks for the update on the RFC, guess I should have just read that myself.

Well I don't want to "litter" the news group, but I hate to have
incorrect code out there with my name on it so. If you want follow the
link (http://hurleyhome.com/~patrick/quiz23.rb) to see the fixed code.
Also of note is the now commented (just for Dave) regexp for parsing
long lines, for the curious:

lines = result.scan(/
  # Match one of the three following cases
  (?:
    # This will match the special case of an escape that would otherwise
    # be split across line boundaries
    (?: [^\n]{74}(?==[\dA-F]{2}) ) |
    # This will match the case of a line of text that does not need to be split
    (?: [^\n]{0,76}(?=\n) ) |
    # This will match the case of a line of text that needs to be
    # split without special adjustment
    (?: [^\n]{1,75}(?!\n{2}) )
  )
  # Match zero or more newlines
  (?-x:#{$/}*)/x)
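
For comparison, the looping approach that the regexp replaces could look roughly like this. It is a sketch for a current Ruby, not code from any posted solution; soft_wrap is a made-up name. It takes one already-escaped line without its line ending and inserts soft breaks without ever splitting an =XX escape.

```ruby
# Hypothetical helper: no output line exceeds `limit` characters,
# counting the trailing "=" of each soft break.
def soft_wrap(line, limit = 76)
  chunks = []
  rest = line.dup
  while rest.length > limit
    cut = limit - 1              # leave room for the "=" soft break
    # In escaped text every "=" starts an =XX escape, so back up if the
    # cut would fall right after "=" or after "=X".
    if rest[cut - 1] == "="
      cut -= 1
    elsif rest[cut - 2] == "="
      cut -= 2
    end
    chunks << (rest.slice!(0, cut) + "=")
  end
  chunks << rest
  chunks.join("\r\n")
end

puts soft_wrap("a" * 73 + "=3D" + "b" * 10)
```

Here the first output line is 73 "a" characters plus the soft-break "=", and the intact "=3D" moves to the second line.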

pth


 
Dave Burt

Florian Gross said:
And here's mine as well. Sorry for being late -- I coded this up on
Friday and forgot about it until today.

It ought to handle everything correctly (including proper wrapping of
lines that end in encoded characters) and it does most of the work with
a few simple regular expressions.

Hi Florian,

As always, I'm amazed by your concise code. But your solution seems to be
failing a bunch of my tests (and not just by chopping lines early, which is
allowed):

encoding:
- escapes mid-line whitespace
- escapes '~'
- allows too-long lines (my tests saw up to 104 characters on a line)
- allows unescaped whitespace on the end of a line (as long as it's preceded
by escaped whitespace)
decoding:
- doesn't ignore trailing literal whitespace

Cheers,
Dave
 
Florian Gross

Dave said:
Hi Florian,

Moin Dave.

As always, I'm amazed by your concise code. But your solution seems to be
failing a bunch of my tests (and not just by chopping lines early, which is
allowed):

Thanks, I'll have a look.

encoding:
- escapes mid-line whitespace

I'm not sure I get this. Am I incorrectly escaping mid-line whitespace
or am I incorrectly not escaping it? And what is mid-line whitespace?

- escapes '~'

Heh, classic off-by-one. Easily fixed by changing the Regexp. See source
below.

- allows too-long lines (my tests saw up to 104 characters on a line)

Any hints on when this is happening? I can't see why and when this would
happen.

- allows unescaped whitespace on the end of a line (as long as it's preceded
by escaped whitespace)

Fixed. See code below.

decoding:
- doesn't ignore trailing literal whitespace

Well, I don't think that's much of an issue as I'm not sure when
trailing whitespace would be prepended to lines, but I've fixed it anyway.

Here's the new code:
def encode(text, also_encode = "")
  text.gsub(/[\t ](?:[\v\t ]|$)|[=\x00-\x08\x0B-\x1F\x7F-\xFF#{also_encode}]/) do |char|
    char[0 ... -1] + "=%02X" % char[-1]
  end.gsub(/^(.{75})(.{2,})$/) do |match|
    base, continuation = $1, $2
    continuation = base.slice!(/=(.{0,2})\Z/).to_s + continuation
    base + "=\n" + continuation
  end.gsub("\n", "\r\n")
end

def decode(text, allow_lowercase = false)
  encoded_re = Regexp.new("=([0-9A-F]{2})", allow_lowercase ? "i" : "")
  text.gsub("\r\n", "\n").gsub("=\n", "").gsub(encoded_re) do
    $1.to_i(16).chr
  end
end

I'll repost the full source when I've sorted out that other problem as well.
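
As an independent cross-check for any of the hand-written encoders in this thread, current Ruby ships a quoted-printable codec behind the "M" directive of Array#pack and String#unpack1. This is not part of anyone's submission, and it differs from the quiz solutions in at least one respect: it emits bare "\n" line endings rather than "\r\n".

```ruby
# Round trip through Ruby's built-in quoted-printable codec.
sample  = "1 + 1 = 2\n"
encoded = [sample].pack("M")     # "=" becomes "=3D", mid-line spaces stay literal
decoded = encoded.unpack1("M")   # decodes back to the original bytes

p encoded            # => "1 + 1 =3D 2\n"
p decoded == sample  # => true
```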
 
Dave Burt

Florian Gross said:
Moin Dave.


Thanks, I'll have a look.


I'm not sure I get this. Am I incorrectly escaping mid-line whitespace or
am I incorrectly not escaping it? And what is mid-line whitespace?

Tabs and spaces that are followed by something printable on the same line
should not be escaped; see the following:

5) Failure:
test_encode_12(TC_QuotedPrintable) [(eval):2]:
<"=3D=3D=3D
=\r\n =20\r\n"> expected but was
<"=3D=3D=3D=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
=\r\n=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
=20 =20 =20 =20 =20 =20 =20 =20\r\n">.

Heh, classic off-by-one. Easily fixed by changing the Regexp. See source
below.

Too easy :)

Any hints on when this is happening? I can't see why and when this would
happen.

test_encode_12 also demonstrates this. I fixed it by changing
/[\t ](?:[\v\t ]|$)../ to /[\t ]$../.
This (obviously) fixes the mid-line whitespace as well.
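
The effect of that change can be seen on a short line. The sketch below assumes a current Ruby and isolates only the whitespace branch of the pattern; the constant and method names are invented for illustration.

```ruby
# Escape the last character of each match, as the encode block does.
def escape_whitespace(text, pattern)
  text.gsub(pattern) { |m| m[0...-1] + format("=%02X", m[-1].ord) }
end

ORIGINAL_PATTERN = /[\t ](?:[\v\t ]|$)/   # whitespace branch as posted
FIXED_PATTERN    = /[\t ]$/               # after the change described above

line = "a   b \n"
p escape_whitespace(line, ORIGINAL_PATTERN)  # => "a =20 b=20\n"
p escape_whitespace(line, FIXED_PATTERN)     # => "a   b=20\n"
```

With the original pattern any space followed by another space is matched, so runs of mid-line spaces pick up escapes and lines grow; with the fixed pattern only the space directly before the line end is escaped.
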
Fixed. See code below.


Well, I don't think that's much of an issue as I'm not sure when trailing
whitespace would be prepended to lines, but I've fixed it anyway.

It's not mentioned in the quiz question, although you can infer that it is
illegal from the quiz question. The idea is that if there is trailing
whitespace, it has been added in transit and should be removed (it's not
actually part of the data that was encoded).

Also, this, on line 10: "char[0 ... -1] + ...", seems redundant - with char
as a one-character match, it's an empty string.

Here's the new code:


I'll repost the full source when I've sorted out that other problem as
well.

Cheers,
Dave
 
