[QUIZ] Huffman Encoder (#123)

R

Ruby Quiz

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Harlan

Huffman Coding is a common form of data compression where none of the original
data gets lost. It begins by analyzing a string of data to determine which
pieces occur with the highest frequencies. These frequencies and pieces are
used to construct a binary tree. It is the “path†from root node to the
leaf with this data that forms its encoding. The following example should
explain things:

Data: ABRRKBAARAA (11 bytes)

Frequency counts:
A 5
R 3
B 2
K 1

In Huffman Tree form, with frequency weights in parentheses:
ARBK (11)
/ \
0 1
/ \
A (5) RBK (6)
/ \
0 1
/ \
R (3) BK (3)
/ \
0 1
/ \
B (2) K (1)

The encoding for each character is simply the path to that character:
A 0
R 10
B 110
K 111

Here is the original data encoded:
01101010 11111000 1000 (fits in 3 bytes)

We have compressed the original information by 80%!

A key point to note is that every character encoding has a unique prefix,
corresponding to the unique path to that character within the tree. If this
were not so, then decoding would be impossible due to ambiguity.

The quiz this time is to write a program to implement a compression program
using Huffman encoding.

Extra Credit:
Perform the actual encoding using your tree. You may encounter one issue
during the decompression/decoding phase. Your encoded string may not be a
multiple of 8. This means that when you compress your encoding into a
binary number, padding 0’s get added. Then, upon decompression, you may
see extra characters. To counter this, one solution is to add your own
padding of 1 extra character every time. And then simply strip it off
once you have decoded.

You may also wish to provide a way to serialize the Huffman Tree so it
can be shared among copies of your program.

./huffman_encode.rb I want this message to get encoded!

Encoded:
11111111 11111110 11111111 11101111 10111111
01100110 11111111 11110111 11111111 11011100
11111111 11010111 01110111 11011110 10011011
11111100 11110101 10010111 11101111 11111011
11111101 11111101 01111111 01111111 11111110
Encoded Bytes:
25

Original:
I want this message to get encoded!
Original Bytes:
35

Compressed: 28%
 
D

Daniel Martin

Ruby Quiz said:
The quiz this time is to write a program to implement a compression
program using Huffman encoding.

I'd like to comment on and regularize a few of the "extra credit"
portions.

First off, without knowledge of the tree, one can't decode the Huffman
encoding given in the quiz since the message:
I want this message to get encoded!
encodes to the same thing as:
V jnag guvf zrffntr gb trg rapbqrq!

So, with a Huffman encoding one always needs to know the tree
structure when doing the decoding. This means that usually what one
does with Huffman encoding is use some big training database to build
up your tree, and then one re-uses the same tree again and again.

Therefore, I want to add these switches:

-d Decoding mode. (Default is encoding)

-T file Generate tree and store it to file. Do not produce other
output.

-t file Use file to retrieve your tree. (for either encoding or
decoding)

-i file Use file as input instead of reading command line arguments

-o file Use file as output instead of writing to standard output
(Note: compression statistics should still be written to
standard output. file should contain only the message)

Also, I'll note that the quiz description can give a false impression
of Huffman encoding and how to build up the tree. For example, after
reading that I'd forgive anyone who thought that Huffman encoding
always assigned to each letter a code of some number of 1s followed
by a zero, except the least frequent letter which is assigned a code
of all 1s. In fact, it only works out to that for certain frequency
distributions.

Although I don't want to spoil things I'll point at the section
labeled "Basic technique" in the wikipedia article on Huffman coding
(http://en.wikipedia.org/wiki/Huffman_coding) and mention that I won't
quite be using the algorithm they mention there, but will be using
something vaguely similar. (That is, I will be building the tree from
the bottom up rather than the top down)
 
M

Matthew Moss

First off, without knowledge of the tree, one can't decode the Huffman
encoding given in the quiz since the message:
I want this message to get encoded!
encodes to the same thing as:
V jnag guvf zrffntr gb trg rapbqrq!

So, with a Huffman encoding one always needs to know the tree
structure when doing the decoding. This means that usually what one
does with Huffman encoding is use some big training database to build
up your tree, and then one re-uses the same tree again and again.


Another alternative is to do an adaptive Huffman encoding, which does
not require a dictionary at the front of the file, nor a training
database. The dictionary (i.e. tree) is built up during the process of
decoding.

The tradeoff is less compression for small files, or early on in larger files.
 
D

Daniel Martin

Ruby Quiz said:
Encoded Bytes:
25

Original:
I want this message to get encoded!
Original Bytes:
35

Compressed: 28%

I'll note that this phrase includes 16 distinct characters (" ", "!",
"I", "a", "c", "d", "e", "g", "h", "i", "m", "n", "o", "s", "t", "w").
Even if you add a special "end-of-data" character, that's only 17
distinct characters, so you would expect any Huffman code to perform
at least as well 5 bits per character; since your code is in theory
being developed to encode this exact phrase, I'd expect better
performance than that.

At 5 bits per character, if you add a special "end of data" character
to the end of the phrase so that you're encooding 36 values, that's
only 180 bits of output. 180 bits fits into 23 bytes.

I'm pointing out that the initial reference code probably has a bug in
its algorithm, since it does even worse than a naive
5-bits-per-character encoding.

For what it's worth, my current Huffman encoder when tuned to that
phrase alone compresses it into 18 bytes; when tuned to
http://norvig.com/big.txt (a large chunk of English text) it still
manages to compress that phrase to 22 bytes.
 
E

Erik Veenstra

I used a bit of freedom:

* It reads from STDIN instead of ARGV.

* Output is binary instead of 0-and-1-as-text.

* Both encoding and decoding, for the extra points...

* In order to avoid juggling with padding characters, the least
frequent character is 1110 instead of 111 (in the example in
the quiz). Every "character" in the encoded bit stream ends
with a 0. Yes, it costs one bit, but since it's not used
frequently, it doesn't really hurt.

* 0 and 1 become 1 and 0, respectively, because we get the
padding zeroes for free when we do "001".pack("B*"). So,
rewriting the previous section: In order to avoid juggling
with padding characters, the least frequent character is 0001
instead of 000. Every "character" in the encoded bit stream
ends with a 1.

Here's the code: http://tinyurl.com/3x5nsy

Example:

$ echo -n Scooby-Dooby-Doo | ruby huffman.rb | ruby huffman.rb -d
Compression: 14/16 (87%)
Decompression: 14/16 (87%)
Scooby-Dooby-Doo

How does it work?

First of all, we count each unique byte in the original byte
stream (e.g. "Scooby-Dooby-Doo"):

{"-"=>2, "b"=>2, "y"=>2, "c"=>1, "D"=>2, "o"=>6, "S"=>1}

... which is sorted:

[["o", 6], ["-", 2], ["D", 2], ["b", 2], ["y", 2], ["S", 1], ["c",
1]]

... and cleaned:

["o", "-", "D", "b", "y", "S", "c"]

The leftmost character is used most often and the rightmost
character is used least often. We call this ordered list the
"alphabet".

Secondly, we build the translation table, where each character
in the "alphabet" is mapped to a bit sequence:

{"b"=>"0001", "-"=>"01", "c"=>"0000001", "y"=>"00001", "D"=>"001",
"o"=>"1", "S"=>"000001"}

We can translated each byte of the original byte stream:

["S", "c", "o", "o", "b", "y", "-", "D", "o", "o", "b", "y", "-",
"D", "o", "o"]

... to the corresponding bit stream:

["000001", "0000001", "1", "1", "0001", "00001", "01", "001", "1",
"1", "0001", "00001", "01", "001", "1", "1"]

... which is joined into a string:

"00000100000011100010000101001110001000010100111"

... which is packed (with pack("B*")) into another string:

"\004\016!N!N"

... which we'll call the "encoded bit stream".

In order to be able to decode the message, we have to serialize
the "alphabet" and concatenate it with the "encoded bit stream":

alphabet + bit stream -> message
"o-DbySc" + "\004\016!N!N" -> "o-DbySc\004\016!N!N"

But where's the boundary between "alphabet" and "encoded bit
stream"? Well, let's add the length of the "alphabet"
([7]..pack("C")):

len + alphabet + bit stream -> message
"\a" + "o-DbySc" + "\004\016!N!N" -> "\ao-DbySc\004\016!N!N"

Conclusion? "Scooby-Dooby-Doo" becomes "\ao-DbySc\004\016!N!N".

Well, we've saved 2 bytes, but it doesn't sound that well...

gegroet,
Erik V. - http://www.erikveen.dds.nl/
 
D

Daniel Martin

Ruby Quiz said:
The quiz this time is to write a program to implement a compression program
using Huffman encoding.

And here's mine. Instead of adding something to each character
encoded, I used an additional symbol to mean "end of data". Although
the program will behave as in the quiz when given no options,
(encoding the phrase given in the command line arguments to 1s and 0s
displayed as that) the really interesting stuff happens when you give
one or more options.

Here's the usage:

Generate tree: q123.rb -g -t treefile [opts|phrase]
encode: q123.rb -t treefile [opts|phrase]
decode: q123.rb -d -t treefile [opts|phrase]
[opts]: -i input -o output
-b flip binary mode (default is 0s and 1s
if -o is not given, and bytes if -o is)
-x generate extended tree to also cover bytes
not present in the training input
-q be quiet - no statistics

ObTestOutput:
C:\Temp>ruby q123.rb I want this message to get encoded!
Encoded:
11010000 10100110 00101111 00011110 00110101
01000001 01110010 10001001 10001100 01000111
10010000 11000111 10000010 10110000 10010111
00101111 01101101 10000000
Original Bytes: 35
Encoded Bytes: 18
Compressed: 48.6%

(big.txt is http://norvig.com/big.txt)
C:\Temp>ruby q123.rb -g -t english.tree big.txt
Generated tree from input size 7

C:\Temp>ruby q123.rb -g -t english.tree -i big.txt
Generated tree from input size 6500961

C:\Temp>ruby q123.rb -t english.tree -i big.txt -o nul:
Original Bytes: 6500961
Encoded Bytes: 3707406
Compressed: 43.0%

And now the script itself:

#!ruby
require 'strscan'
require 'stringio'

# A useful method for my priority queue implementation
class Array
def swap(a,b)
self[a], self = self, self[a]
end
end

# Inspired by, but totally different from, the heap in
# ruby quiz 40
class PriorityQueue
# Actually, this is more inspired by the summary on
# quiz number 98 than the implementation in quiz 40.
def initialize
@items=[]
@priorities=[]
end

def add(priority, item)
@priorities.push priority
@items.push item
bubble_up
self
end

def empty?
@items.empty?
end

def shift
return nil if empty?
retval = @items.shift
@priorities.shift
if ! @items.empty? then
@priorities.unshift @priorities.pop
@items.unshift @items.pop
bubble_down
end
retval
end

# Inspired by mapn in quiz 122
def eachn(&b)
arglen = b.arity
while !empty?
args = Array.new(arglen){shift}
b.call(*args)
end
end

private

def bubble_up
wi = @priorities.size - 1
pr = @priorities[wi]
until wi == 0
pi = (wi-1)/2
return if @priorities[pi] <= pr
@items.swap(pi,wi)
@priorities.swap(pi,wi)
wi = pi
end
end

def bubble_down
wi = 0
pr = @priorities[wi]
until false # i.e. until I return
ci = 2*wi + 1
return if ci >= @priorities.size
ci += 1 if ci + 1 < @priorities.size and
@priorities[ci+1] < @priorities[ci]
return if @priorities[ci] >= pr
@items.swap(ci,wi)
@priorities.swap(ci,wi)
wi = ci
end
end
end

# Okay, that's all for utilities. Now on with stuff for this quiz;
# basically, I have two classes: HuffmanNode and HuffmanCoder.
# HuffmanNode is the tree structure. HuffmanCoder handles the
# dirty business of encoding and decoding given the tree structure.

# A HuffmanNode either has data or it has both left and right children,
# but not both.

HuffmanNode = Struct.new("HuffmanNode", :frequency, :data, :left, :right)

class HuffmanNode
def inspect
"HuffmanNode.from_s(#{to_s.inspect})"
end

def to_s
if self.data.nil? then
"(" + self.left.to_s + ' ' + self.right.to_s + ")"
else
self.data.to_s
end
end

def to_h(forencode, prefix='')
if self.data.nil? then
l = self.left.to_h(forencode,prefix + '0')
r = self.right.to_h(forencode,prefix + '1')
l.update(r)
else
if forencode then
{self.data => prefix}
else
{prefix => self.data}
end
end
end

def HuffmanNode.from_s(s)
begin
return from_s_internal(StringScanner.new(s))
rescue
raise "Malformed string: '#{s}'"
end
end

def HuffmanNode.from_s_internal(scanner)
data = scanner.scan(/\s*-?\d+/)
if data.nil? then
scanner.scan(/\s*\(/) or raise 'err'
rei = from_s_internal(scanner)
scanner.scan(/\s+/) or raise 'err'
ichi = from_s_internal(scanner)
scanner.scan(/\s*\)/) or raise 'err'
return new(0,nil,rei,ichi)
else
return new(0,data.to_i,nil,nil)
end
end

def HuffmanNode.make_tree(freqhash, add_everything=false)
pqueue = PriorityQueue.new
# node with data -1 is used to mean "end of data"
universe = {-1=>1}
if add_everything then
# Assume anything we haven't seen at all is an order of
# magnitude less likely than those things we've seen once
256.times{|i|universe=0.1;}
end
universe.update(freqhash)
universe.each {|charcode,freq|
pqueue.add(freq, new(freq,charcode,nil,nil))
}
pqueue.eachn {|node1, node2|
return node1 if node2.nil?

n = new(node1.frequency + node2.frequency, \
nil, node2, node1)
pqueue.add(n.frequency, n)
}
end
end

class HuffmanCoder
attr :enchash
attr :dechash
attr :decre
def initialize(nodetree)
@enchash = nodetree.to_h(true)
@dechash = nodetree.to_h(false)
@decre = Regexp.new('^(' + @dechash.keys.join('|') + ')(.*)')
end

def encode(io_in,io_out,crunch_binary=true,io_stats=$stderr)
buff = ''
outbytes = 0
inbytes = 0
io_out.puts "Encoded:" unless crunch_binary
encode_bits(io_in) { |bits|
inbytes += 1
if bits then
buff += bits
else
buff += '0' * ((-buff.length) % 8 )
end
while buff.length >= 8 do
binary = buff.slice!(0..7)
if crunch_binary then
binary = [binary].pack("b*")
else
binary = binary + ' '
binary += "\n" if 4 == outbytes % 5
end
io_out.print binary
outbytes += 1
end
}
if 0 != outbytes % 5 and not crunch_binary then
io_out.print "\n"
end
inbytes -= 2
if io_stats then
io_stats.puts "Original Bytes: #{inbytes}"
io_stats.puts "Encoded Bytes: #{outbytes}"
io_stats.puts "Compressed: %2.1f%%"% [100.0 - (100.0*outbytes)/inbytes]
end
end

def decode(io_in,io_out,crunched_binary=true,io_stats=$stderr)
buff = ''
outbytes = 0
inbytes = decode_bits(io_in,crunched_binary) { |bits|
buff += bits
m = @decre.match buff
while m do
ch = @dechash[m[1]]
if ch == -1
if m[2] !~ /^0*$/ then
raise "Garbage after EOD marker"
end
break
end
io_out.putc ch
outbytes += 1
buff = m[2]
m = @decre.match buff
end
}
if io_stats then
io_stats.puts "Encoded Bytes: #{inbytes}"
io_stats.puts "Original Bytes: #{outbytes}"
io_stats.puts "Compressed: %2.1f%%"% [100.0 - (100.0*inbytes)/outbytes]
end
end

def HuffmanCoder.from_file(treefile)
tree = nil
File.open(treefile, "r") { |f|
treetxt = ''
f.each{ |treeline| treetxt += treeline }
tree = HuffmanNode.from_s(treetxt)
}
new(tree)
end

def HuffmanCoder.generate(io_in, treefile, generate_extended, io_stats=$stderr)
bytecount = 0
d = Hash.new(0);
io_in.each_byte {|b| d += 1; bytecount += 1}
tree = HuffmanNode.make_tree(d,generate_extended)
if ! treefile.nil?
File.open(treefile, "w") {|f|
f.puts tree.to_s
}
end
if io_stats then
io_stats.puts "Generated tree from input size #{bytecount}"
end
new(tree)
end

private

def encode_bits(io_in)
c = io_in.getc
until c.nil?
bits = @enchash[c]
raise "no code for character #{c}" unless bits
yield bits
c = io_in.getc
end
yield @enchash[-1]
yield nil
end

def decode_bits(io_in,crunched_binary)
inbytes = 0
if crunched_binary then
until io_in.eof?
b = io_in.read(4096)
return if b.nil?
inbytes += b.length
yield b.unpack('b*')[0]
end
else
until io_in.eof?
b = io_in.read(4096)
return if b.nil?
b.tr!('^01','')
inbytes += b.length
yield b if b.length > 0
end
inbytes = (inbytes+7)/8
end
inbytes
end
end

# That's all the interesting stuff. Everything from here down is
# argument handling. Basically uninteresting.

if __FILE__ == $0 then
mode = :encode
input = nil
output = nil
treefile = nil
crunched_binary = true
generate_extended = false
statschannel = $stderr
while ARGV[0] and ARGV[0] =~ /^-/
opt = ARGV.shift
case opt
when '--'
break
when '-t'
treefile = ARGV.shift
when '-i'
input = ARGV.shift
when '-o'
output = ARGV.shift
when '-d'
mode = :decode
when '-b'
crunched_binary = false
when '-q'
statschannel = nil
when '-g'
mode = :gentree
when '-x'
mode = :gentree
generate_extended = true
when '-h'
puts "Usage:"
puts "Generate tree: #{$0} -g -t treefile [opts|phrase]"
puts " encode: #{$0} -t treefile [opts|phrase]"
puts " decode: #{$0} -d -t treefile [opts|phrase]"
puts "[opts]: -i input -o output"
puts " -b flip binary mode (default is 0s and 1s"
puts " if -o is not given, and bytes if -o is)"
puts " -x generate extended tree to also cover bytes"
puts " not present in the training input"
puts " -q be quiet - no statistics"
exit 0
else
$stderr.puts "Unrecognized option #{opt} -- use -h for help"
exit 1
end
end
if treefile.nil? then
# allow no treefile only when encoding command line
# stuff as a demo.
if ! (input.nil? and mode == :encode)
$stderr.puts "Error: no -t option given"
exit 1
end
end
in_io = nil
if input.nil?
in_io = StringIO.new(ARGV.join(' '))
crunched_binary = ! crunched_binary if mode == :decode
elsif input == '-'
in_io = $stdin
else
in_io = File.open(input, "rb")
end
out_io = nil
if output.nil?
out_io = $stdout
crunched_binary = ! crunched_binary if mode == :encode
elsif output == '-'
out_io = $stdout
else
out_io = File.open(output, "wb")
end
if mode == :gentree then
HuffmanCoder.generate(in_io, treefile, generate_extended,
statschannel)
else
coder = nil
if (treefile.nil?)
coder = HuffmanCoder.generate(in_io, nil, false, nil)
in_io.rewind
else
coder = HuffmanCoder.from_file(treefile)
end
coder.send(mode, in_io,out_io, crunched_binary, statschannel)
end
end

__END__
 
D

Drew Olson

Here's my solution. I'm getting different values, but I'm fairly sure
I'm building the tree correctly. I suppose it all depends on how you
choose to encode the values (0 = left, 1 = right in my case). I may have
some glaring stupid error, but I can't find it if I do :)

Output:

--------------------------------------------------------
ruby huffman.rb ABRRKBAARAA

Encoded
10100000 01101011 0011
3 bytes

Original
ABRRKBAARAA
11 bytes

72.7272727272727% compression

--------------------------------------------------------

ruby huffman.rb I want this message to get encoded!

Encoded
01010001 10101100 10000110 00011010 01010111
10001101 10011111 11110010 10001000 01110110
00101000 10110000 01100001 00011011 10010011
00101000 0
17 bytes

Original
I want this message to get encoded!
35 bytes

51.4285714285714% compression

--------------------------------------------------------

- Drew

# file: huffman.rb
# author: Drew Olson

require 'enumerator'

# class to hold nodes in the huffman tree
class Node
attr_accessor :val,:weight,:left,:right

def initialize(val="",weight=0)
@val,@weight = val,weight
end
end

# priority queue for nodes
class NodeQueue
def initialize
@queue = []
end

def enqueue node
@queue << node
@queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
end

def dequeue
@queue.pop
end

def size
@queue.size
end
end

# HuffmanTree represents the tree with which we perform
# the encoding
class HuffmanTree

# initialize the tree based on data
def initialize data
@freqs = build_frequencies(data)
@root = build_tree
end

#encode the given data
def encode data
data.downcase.split(//).inject("") do |code,char|
code + encode_char(char)
end
end

private

# this method encodes a given character based on our
# tree representation
def encode_char char
node = @root
coding = ""

# we do a binary search, building the representation
# of the character based on which branch we follow
while node.val != char
if node.right.val.include? char
node = node.right
coding += "1"
else
node = node.left
coding += "0"
end
end
coding
end

# get word frequencies in a given phrase
def build_frequencies phrase
phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
hash[item] += 1
hash
end
end

# build huffmantree using the priority queue method
def build_tree
queue = NodeQueue.new

# build a node for each character and place in pqueue
@freqs.keys.each do |char|
queue.enqueue(Node.new(char,@freqs[char]))
end

while !queue.size.zero?

# if only one node exists, it is the root. return it
return queue.dequeue if queue.size == 1

# dequeue two lightest nodes, create parent,
# add children and enqueue newly created node
node = Node.new
node.right = queue.dequeue
node.left = queue.dequeue
node.val = node.left.val+node.right.val
node.weight = node.left.weight+node.right.weight
queue.enqueue node
end
end
end

# get command lines args, build tree and encode data
if __FILE__ == $0
data = ARGV.join(" ")
tree = HuffmanTree.new data
encoded = tree.encode(data).scan(/\d{1,8}/)
puts
puts "Encoded"
encoded.each_slice(5) do |slice|
puts slice.join(" ")
end
puts "#{encoded.size} bytes"
puts
puts "Original"
puts data
puts "#{data.size} bytes"
puts
compression = (100.0 - (encoded.size.to_f/data.size.to_f)*100)
puts "#{compression}\% compression"
end
 
E

Erik Veenstra

The better solution...

Here's the code: http://tinyurl.com/32tjfo

Example:

$ echo -n Scooby-Dooby-Doo | ruby huffman.rb | ruby huffman.rb -d
Compression: 27/17 (158%)
Decompression: 27/17 (158%)
Scooby-Dooby-Doo

How does it work?

Count each unique byte in the original byte stream (e.g.
"Scooby-Dooby-Doo"):

[["S", 1], ["c", 1], ["o", 6], ["b", 2], ["y", 2], ["-", 2], ["D",
2]]

This list is then converted to a tree, using this code:

while tree.length > 1
a, b, *tree = tree.sort_by{|_, count| count} # Not deterministic.
tree << [[a[0], b[0]], a[1]+b[1]]
end

In English: The two least frequently used leaves/nodes are
replaced by one node (a combination of both leaves),
recursively, until the list is only one leaf/node long. Step by
step, this results in the following:

[["S", 1], ["c", 1], ["b", 2], ["D", 2], ["y", 2], ["-", 2], ["o",
6]]
[["b", 2], ["D", 2], ["y", 2], [["S", "c"], 2], ["-", 2], ["o", 6]]
[["y", 2], [["S", "c"], 2], ["-", 2], [["b", "D"], 4], ["o", 6]]
[["-", 2], [["b", "D"], 4], [["y", ["S", "c"]], 4], ["o", 6]]
[[["y", ["S", "c"]], 4], ["o", 6], [["-", ["b", "D"]], 6]]
[[["-", ["b", "D"]], 6], [[["y", ["S", "c"]], "o"], 10]]
[[[["-", ["b", "D"]], [["y", ["S", "c"]], "o"]], 16]]

The tree is actually still in the list (the only element in the
list), so we get it and strip the weight:

[["-", ["b", "D"]], [["y", ["S", "c"]], "o"]]

Secondly, we build the translation table, where each leaf in
the tree is mapped to a bit sequence, by using the path to each
leaf:

{"-"=>"00", "b"=>"010", "y"=>"100", "c"=>"1011", "D"=>"011",
"o"=>"11", "S"=>"1010"}

We can translated each byte of the original byte stream:

["S", "c", "o", "o", "b", "y", "-", "D", "o", "o", "b", "y", "-",
"D", "o", "o"]

... to the corresponding bit stream:

["1010", "1011", "11", "11", "010", "100", "00", "011", "11", "11",
"010", "100", "00", "011", "11", "11"]

... which is joined into a string:

"101010111111010100000111111010100000111111"

The rest, serializing the translation table and adding and
registering padding bits is not that complicated and won't be
explained here.

The decoding (decompressing?) is, well, just the reversed order
of the steps...

I want to emphasize one line of the decode code. After having
deserialized the translation table, we can decode the whole
message with one line:

bitstring.scan(/#{table.keys.join("|")}/).collect{|bits|
table[bits]}.join("")

gegroet,
Erik V. - http://www.erikveen.dds.nl/
 
E

Eric I.

My solution to the quiz can be found here:

http://pastie.caboo.se/61613

This solution has a number of features which might make it different
to other solutions submitted. It encodes to and decodes from actual
binary data, not strings of "0" and "1" characters. It is able to
serialize and deserialize the Huffman encoder (which is a hash
table, not a tree). It uses a linear algorithm to create the
Huffman code from the character frequency data. And beyond that, it
was written with an attempt to be efficient time-wise and
memory-wise. It uses an end-of-message token :)eom) to mark the end
of the encoded message. It therefore is able to handle messages
that use 256 characters plus the 1 token. It recognizes that with
large messages some characters could be encoded with more than 8
bits, and it handles those situations. And finally it makes an
attempt to recognize corrupt data, both in terms of the Huffman
encoded message and the serialized encoder.

Eric
 
R

Raf Coremans

2007/5/15 said:
I also tried writing a binary encoder which packs
and unpacks the bit strings generated by my string encoder, but it fails
to decompress the entire text for long files, e.g. the program source
code itself. I'm unsure why, perhaps someone more experienced in this
arena can suggest why.

Perhaps because the file you want to encode already contains your
end-marker. The decoder stops when it encounters the first end-marker
(line 161). A binary file would not be unlikely to contain a NULL
character somewhere.

Regards,
Raf
 
R

Raf Coremans

2007/5/15 said:
That was my first thought, but I threw in a check for the existence of
NULL characters in the frequency hash before merging the EOS=>0 hash,
and it did not complain. Besides, the file I'm trying to decompress is
the ruby source of my quiz solution, which is just plain ol' text.

- donald

Strange... I get this:


Encoding-decoding paste-61871.rb:
$ ruby -w paste-61871.rb -b -f paste-61871.rb | ruby -w paste-61871.rb
-b -d > ref
$ cmp paste-61871.rb ref

=>Identical


Encoding-decoding a binary file:
$ ruby -w paste-61871.rb -b -f dilbert2073317070515.gif | ruby -w
paste-61871.rb -b -d > ref
$ cmp dilbert2073317070515.gif ref
cmp: EOF on ref
$ hexdump -C ref
00000000 47 49 46 38 39 61 58 02 d5 |GIF89aX..|
00000009
$ hexdump -C dilbert2073317070515.gif | head -n 1
00000000 47 49 46 38 39 61 58 02 d5 00 c4 00 00 58 58 58 |GIF89aX......XXX|

=> Encoded-decoded terminates just before first NULL character in original file


Regards,
Raf
 
D

Drew Olson

A little late, but here's my 2nd solution. It does decoding as well, but
I didn't get a chance to add serialization. Nothing too fancy in here:

Output:

-------------------------------------------------------------
ruby huffman.rb ABRRKBAARAA
Original
ABRRKBAARAA
11 bytes

Encoded
10100000 01101011 0011
3 bytes

72 percent compression

Decoded
abrrkbaaraa

-------------------------------------------------------------
ruby huffman.rb I want this message to get encoded!

Original
I want this message to get encoded!
35 bytes

Encoded
01010001 10101100 10000110 00011010 01010111
10001101 10011111 11110010 10001000 01110110
00101000 10110000 01100001 00011011 10010011
00101000 0
17 bytes

51 percent compression

Decoded
i want this message to get encoded!

-------------------------------------------------------------

Code:

# file: huffman.rb
# author: Drew Olson

require 'enumerator'

# class to hold nodes in the huffman tree
class Node
attr_accessor :val,:weight,:left,:right

def initialize(val="",weight=0)
@val,@weight = val,weight
end

def children?
return @left || @right
end
end

# priority queue for nodes
class NodeQueue
def initialize
@queue = []
end

def enqueue node
@queue << node
@queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
end

def dequeue
@queue.pop
end

def size
@queue.size
end
end

# HuffmanTree represents the tree with which we perform
# the encoding
class HuffmanTree

# initialize the tree based on data
def initialize data
@freqs = build_frequencies(data)
@root = build_tree
end

#encode the given data
def encode data
data.downcase.split(//).inject("") do |code,char|
code + encode_char(char)
end
end

def decode data
node = @root

if [email protected]?
return @root.val
end

data.split(//).inject("") do |phrase,digit|
if digit == "0"
node = node.left
else
node = node.right
end
if !node.children?
phrase += node.val
node = @root
end
phrase
end
end

private

# this method encodes a given character based on our
# tree representation
def encode_char char
node = @root
coding = ""

# encode to 0 if only one character
if [email protected]?
return "0"
end

# we do a binary search, building the representation
# of the character based on which branch we follow
while node.val != char
if node.right.val.include? char
node = node.right
coding += "1"
else
node = node.left
coding += "0"
end
end
coding
end

# get word frequencies in a given phrase
def build_frequencies phrase
phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
hash[item] += 1
hash
end
end

# build huffmantree using the priority queue method
def build_tree
queue = NodeQueue.new

# build a node for each character and place in pqueue
@freqs.keys.each do |char|
queue.enqueue(Node.new(char,@freqs[char]))
end

while !queue.size.zero?

# if only one node exists, it is the root. return it
return queue.dequeue if queue.size == 1

# dequeue two lightest nodes, create parent,
# add children and enqueue newly created node
node = Node.new
node.right = queue.dequeue
node.left = queue.dequeue
node.val = node.left.val+node.right.val
node.weight = node.left.weight+node.right.weight
queue.enqueue node
end
end
end

# get command lines args, build tree and encode data
if __FILE__ == $0
data = ARGV.join(" ")
tree = HuffmanTree.new data

# get encoded data and split into bits
code = tree.encode(data)
encoded_bits = code.scan(/\d{1,8}/)

# output
puts
puts "Original"
puts data
puts "#{data.size} bytes"
puts
puts "Encoded"
encoded_bits.each_slice(5) do |slice|
puts slice.join(" ")
end
puts "#{encoded_bits.size} bytes"
puts
puts "%d percent compression" % (100.0 -
(encoded_bits.size.to_f/data.size.to_f)*100.0)
puts
puts "Decoded"
puts tree.decode(code)
end
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,904
Latest member
HealthyVisionsCBDPrice

Latest Threads

Top