[QUIZ] Huffman Encoder (#123)

Discussion in 'Ruby' started by Ruby Quiz, May 11, 2007.

  1. Ruby Quiz

    Ruby Quiz Guest

    The three rules of Ruby Quiz:

    1. Please do not post any solutions or spoiler discussion for this quiz until
    48 hours have passed from the time on this message.

    2. Support Ruby Quiz by submitting ideas as often as you can:

    http://www.rubyquiz.com/

    3. Enjoy!

    Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
    on Ruby Talk follow the discussion. Please reply to the original quiz message,
    if you can.

    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

    by Harlan

    Huffman Coding is a common form of data compression where none of the original
    data gets lost. It begins by analyzing a string of data to determine which
    pieces occur with the highest frequencies. These frequencies and pieces are
    used to construct a binary tree. It is the “path†from root node to the
    leaf with this data that forms its encoding. The following example should
    explain things:

    Data: ABRRKBAARAA (11 bytes)

    Frequency counts:
    A 5
    R 3
    B 2
    K 1

    In Huffman Tree form, with frequency weights in parentheses:
    ARBK (11)
    / \
    0 1
    / \
    A (5) RBK (6)
    / \
    0 1
    / \
    R (3) BK (3)
    / \
    0 1
    / \
    B (2) K (1)

    The encoding for each character is simply the path to that character:
    A 0
    R 10
    B 110
    K 111

    Here is the original data encoded:
    01101010 11111000 1000 (fits in 3 bytes)

    We have compressed the original information by 80%!

    A key point to note is that every character encoding has a unique prefix,
    corresponding to the unique path to that character within the tree. If this
    were not so, then decoding would be impossible due to ambiguity.

    The quiz this time is to write a program to implement a compression program
    using Huffman encoding.

    Extra Credit:
    Perform the actual encoding using your tree. You may encounter one issue
    during the decompression/decoding phase. Your encoded string may not be a
    multiple of 8. This means that when you compress your encoding into a
    binary number, padding 0’s get added. Then, upon decompression, you may
    see extra characters. To counter this, one solution is to add your own
    padding of 1 extra character every time. And then simply strip it off
    once you have decoded.

    You may also wish to provide a way to serialize the Huffman Tree so it
    can be shared among copies of your program.

    ./huffman_encode.rb I want this message to get encoded!

    Encoded:
    11111111 11111110 11111111 11101111 10111111
    01100110 11111111 11110111 11111111 11011100
    11111111 11010111 01110111 11011110 10011011
    11111100 11110101 10010111 11101111 11111011
    11111101 11111101 01111111 01111111 11111110
    Encoded Bytes:
    25

    Original:
    I want this message to get encoded!
    Original Bytes:
    35

    Compressed: 28%
     
    Ruby Quiz, May 11, 2007
    #1
    1. Advertising

  2. Ruby Quiz <> writes:

    > The quiz this time is to write a program to implement a compression
    > program using Huffman encoding.


    I'd like to comment on and regularize a few of the "extra credit"
    portions.

    First off, without knowledge of the tree, one can't decode the Huffman
    encoding given in the quiz since the message:
    I want this message to get encoded!
    encodes to the same thing as:
    V jnag guvf zrffntr gb trg rapbqrq!

    So, with a Huffman encoding one always needs to know the tree
    structure when doing the decoding. This means that usually what one
    does with Huffman encoding is use some big training database to build
    up your tree, and then one re-uses the same tree again and again.

    Therefore, I want to add these switches:

    -d Decoding mode. (Default is encoding)

    -T file Generate tree and store it to file. Do not produce other
    output.

    -t file Use file to retrieve your tree. (for either encoding or
    decoding)

    -i file Use file as input instead of reading command line arguments

    -o file Use file as output instead of writing to standard output
    (Note: compression statistics should still be written to
    standard output. file should contain only the message)

    Also, I'll note that the quiz description can give a false impression
    of Huffman encoding and how to build up the tree. For example, after
    reading that I'd forgive anyone who thought that Huffman encoding
    always assigned to each letter a code of some number of 1s followed
    by a zero, except the least frequent letter which is assigned a code
    of all 1s. In fact, it only works out to that for certain frequency
    distributions.

    Although I don't want to spoil things I'll point at the section
    labeled "Basic technique" in the wikipedia article on Huffman coding
    (http://en.wikipedia.org/wiki/Huffman_coding) and mention that I won't
    quite be using the algorithm they mention there, but will be using
    something vaguely similar. (That is, I will be building the tree from
    the bottom up rather than the top down)

    --
    s=%q( Daniel Martin --
    puts "s=%q(#{s})",s.to_a.last )
    puts "s=%q(#{s})",s.to_a.last
     
    Daniel Martin, May 11, 2007
    #2
    1. Advertising

  3. Ruby Quiz

    Matthew Moss Guest

    > First off, without knowledge of the tree, one can't decode the Huffman
    > encoding given in the quiz since the message:
    > I want this message to get encoded!
    > encodes to the same thing as:
    > V jnag guvf zrffntr gb trg rapbqrq!
    >
    > So, with a Huffman encoding one always needs to know the tree
    > structure when doing the decoding. This means that usually what one
    > does with Huffman encoding is use some big training database to build
    > up your tree, and then one re-uses the same tree again and again.



    Another alternative is to do an adaptive Huffman encoding, which does
    not require a dictionary at the front of the file, nor a training
    database. The dictionary (i.e. tree) is built up during the process of
    decoding.

    The tradeoff is less compression for small files, or early on in larger files.
     
    Matthew Moss, May 11, 2007
    #3
  4. Ruby Quiz <> writes:

    > Encoded Bytes:
    > 25
    >
    > Original:
    > I want this message to get encoded!
    > Original Bytes:
    > 35
    >
    > Compressed: 28%


    I'll note that this phrase includes 16 distinct characters (" ", "!",
    "I", "a", "c", "d", "e", "g", "h", "i", "m", "n", "o", "s", "t", "w").
    Even if you add a special "end-of-data" character, that's only 17
    distinct characters, so you would expect any Huffman code to perform
    at least as well 5 bits per character; since your code is in theory
    being developed to encode this exact phrase, I'd expect better
    performance than that.

    At 5 bits per character, if you add a special "end of data" character
    to the end of the phrase so that you're encooding 36 values, that's
    only 180 bits of output. 180 bits fits into 23 bytes.

    I'm pointing out that the initial reference code probably has a bug in
    its algorithm, since it does even worse than a naive
    5-bits-per-character encoding.

    For what it's worth, my current Huffman encoder when tuned to that
    phrase alone compresses it into 18 bytes; when tuned to
    http://norvig.com/big.txt (a large chunk of English text) it still
    manages to compress that phrase to 22 bytes.

    --
    s=%q( Daniel Martin --
    puts "s=%q(#{s})",s.to_a.last )
    puts "s=%q(#{s})",s.to_a.last
     
    Daniel Martin, May 12, 2007
    #4
  5. Re: Huffman Encoder (#123)

    I used a bit of freedom:

    * It reads from STDIN instead of ARGV.

    * Output is binary instead of 0-and-1-as-text.

    * Both encoding and decoding, for the extra points...

    * In order to avoid juggling with padding characters, the least
    frequent character is 1110 instead of 111 (in the example in
    the quiz). Every "character" in the encoded bit stream ends
    with a 0. Yes, it costs one bit, but since it's not used
    frequently, it doesn't really hurt.

    * 0 and 1 become 1 and 0, respectively, because we get the
    padding zeroes for free when we do "001".pack("B*"). So,
    rewriting the previous section: In order to avoid juggling
    with padding characters, the least frequent character is 0001
    instead of 000. Every "character" in the encoded bit stream
    ends with a 1.

    Here's the code: http://tinyurl.com/3x5nsy

    Example:

    $ echo -n Scooby-Dooby-Doo | ruby huffman.rb | ruby huffman.rb -d
    Compression: 14/16 (87%)
    Decompression: 14/16 (87%)
    Scooby-Dooby-Doo

    How does it work?

    First of all, we count each unique byte in the original byte
    stream (e.g. "Scooby-Dooby-Doo"):

    {"-"=>2, "b"=>2, "y"=>2, "c"=>1, "D"=>2, "o"=>6, "S"=>1}

    ... which is sorted:

    [["o", 6], ["-", 2], ["D", 2], ["b", 2], ["y", 2], ["S", 1], ["c",
    1]]

    ... and cleaned:

    ["o", "-", "D", "b", "y", "S", "c"]

    The leftmost character is used most often and the rightmost
    character is used least often. We call this ordered list the
    "alphabet".

    Secondly, we build the translation table, where each character
    in the "alphabet" is mapped to a bit sequence:

    {"b"=>"0001", "-"=>"01", "c"=>"0000001", "y"=>"00001", "D"=>"001",
    "o"=>"1", "S"=>"000001"}

    We can translated each byte of the original byte stream:

    ["S", "c", "o", "o", "b", "y", "-", "D", "o", "o", "b", "y", "-",
    "D", "o", "o"]

    ... to the corresponding bit stream:

    ["000001", "0000001", "1", "1", "0001", "00001", "01", "001", "1",
    "1", "0001", "00001", "01", "001", "1", "1"]

    ... which is joined into a string:

    "00000100000011100010000101001110001000010100111"

    ... which is packed (with pack("B*")) into another string:

    "\004\016!N!N"

    ... which we'll call the "encoded bit stream".

    In order to be able to decode the message, we have to serialize
    the "alphabet" and concatenate it with the "encoded bit stream":

    alphabet + bit stream -> message
    "o-DbySc" + "\004\016!N!N" -> "o-DbySc\004\016!N!N"

    But where's the boundary between "alphabet" and "encoded bit
    stream"? Well, let's add the length of the "alphabet"
    ([7]..pack("C")):

    len + alphabet + bit stream -> message
    "\a" + "o-DbySc" + "\004\016!N!N" -> "\ao-DbySc\004\016!N!N"

    Conclusion? "Scooby-Dooby-Doo" becomes "\ao-DbySc\004\016!N!N".

    Well, we've saved 2 bytes, but it doesn't sound that well...

    gegroet,
    Erik V. - http://www.erikveen.dds.nl/
     
    Erik Veenstra, May 13, 2007
    #5
  6. Ruby Quiz <> writes:

    > The quiz this time is to write a program to implement a compression program
    > using Huffman encoding.


    And here's mine. Instead of adding something to each character
    encoded, I used an additional symbol to mean "end of data". Although
    the program will behave as in the quiz when given no options,
    (encoding the phrase given in the command line arguments to 1s and 0s
    displayed as that) the really interesting stuff happens when you give
    one or more options.

    Here's the usage:

    Generate tree: q123.rb -g -t treefile [opts|phrase]
    encode: q123.rb -t treefile [opts|phrase]
    decode: q123.rb -d -t treefile [opts|phrase]
    [opts]: -i input -o output
    -b flip binary mode (default is 0s and 1s
    if -o is not given, and bytes if -o is)
    -x generate extended tree to also cover bytes
    not present in the training input
    -q be quiet - no statistics

    ObTestOutput:
    C:\Temp>ruby q123.rb I want this message to get encoded!
    Encoded:
    11010000 10100110 00101111 00011110 00110101
    01000001 01110010 10001001 10001100 01000111
    10010000 11000111 10000010 10110000 10010111
    00101111 01101101 10000000
    Original Bytes: 35
    Encoded Bytes: 18
    Compressed: 48.6%

    (big.txt is http://norvig.com/big.txt)
    C:\Temp>ruby q123.rb -g -t english.tree big.txt
    Generated tree from input size 7

    C:\Temp>ruby q123.rb -g -t english.tree -i big.txt
    Generated tree from input size 6500961

    C:\Temp>ruby q123.rb -t english.tree -i big.txt -o nul:
    Original Bytes: 6500961
    Encoded Bytes: 3707406
    Compressed: 43.0%

    And now the script itself:

    #!ruby
    require 'strscan'
    require 'stringio'

    # A useful method for my priority queue implementation
    class Array
    def swap(a,b)
    self[a], self = self, self[a]
    end
    end

    # Inspired by, but totally different from, the heap in
    # ruby quiz 40
    class PriorityQueue
    # Actually, this is more inspired by the summary on
    # quiz number 98 than the implementation in quiz 40.
    def initialize
    @items=[]
    @priorities=[]
    end

    def add(priority, item)
    @priorities.push priority
    @items.push item
    bubble_up
    self
    end

    def empty?
    @items.empty?
    end

    def shift
    return nil if empty?
    retval = @items.shift
    @priorities.shift
    if ! @items.empty? then
    @priorities.unshift @priorities.pop
    @items.unshift @items.pop
    bubble_down
    end
    retval
    end

    # Inspired by mapn in quiz 122
    def eachn(&b)
    arglen = b.arity
    while !empty?
    args = Array.new(arglen){shift}
    b.call(*args)
    end
    end

    private

    def bubble_up
    wi = @priorities.size - 1
    pr = @priorities[wi]
    until wi == 0
    pi = (wi-1)/2
    return if @priorities[pi] <= pr
    @items.swap(pi,wi)
    @priorities.swap(pi,wi)
    wi = pi
    end
    end

    def bubble_down
    wi = 0
    pr = @priorities[wi]
    until false # i.e. until I return
    ci = 2*wi + 1
    return if ci >= @priorities.size
    ci += 1 if ci + 1 < @priorities.size and
    @priorities[ci+1] < @priorities[ci]
    return if @priorities[ci] >= pr
    @items.swap(ci,wi)
    @priorities.swap(ci,wi)
    wi = ci
    end
    end
    end

    # Okay, that's all for utilities. Now on with stuff for this quiz;
    # basically, I have two classes: HuffmanNode and HuffmanCoder.
    # HuffmanNode is the tree structure. HuffmanCoder handles the
    # dirty business of encoding and decoding given the tree structure.

    # A HuffmanNode either has data or it has both left and right children,
    # but not both.

    HuffmanNode = Struct.new("HuffmanNode", :frequency, :data, :left, :right)

    class HuffmanNode
    def inspect
    "HuffmanNode.from_s(#{to_s.inspect})"
    end

    def to_s
    if self.data.nil? then
    "(" + self.left.to_s + ' ' + self.right.to_s + ")"
    else
    self.data.to_s
    end
    end

    def to_h(forencode, prefix='')
    if self.data.nil? then
    l = self.left.to_h(forencode,prefix + '0')
    r = self.right.to_h(forencode,prefix + '1')
    l.update(r)
    else
    if forencode then
    {self.data => prefix}
    else
    {prefix => self.data}
    end
    end
    end

    def HuffmanNode.from_s(s)
    begin
    return from_s_internal(StringScanner.new(s))
    rescue
    raise "Malformed string: '#{s}'"
    end
    end

    def HuffmanNode.from_s_internal(scanner)
    data = scanner.scan(/\s*-?\d+/)
    if data.nil? then
    scanner.scan(/\s*\(/) or raise 'err'
    rei = from_s_internal(scanner)
    scanner.scan(/\s+/) or raise 'err'
    ichi = from_s_internal(scanner)
    scanner.scan(/\s*\)/) or raise 'err'
    return new(0,nil,rei,ichi)
    else
    return new(0,data.to_i,nil,nil)
    end
    end

    def HuffmanNode.make_tree(freqhash, add_everything=false)
    pqueue = PriorityQueue.new
    # node with data -1 is used to mean "end of data"
    universe = {-1=>1}
    if add_everything then
    # Assume anything we haven't seen at all is an order of
    # magnitude less likely than those things we've seen once
    256.times{|i|universe=0.1;}
    end
    universe.update(freqhash)
    universe.each {|charcode,freq|
    pqueue.add(freq, new(freq,charcode,nil,nil))
    }
    pqueue.eachn {|node1, node2|
    return node1 if node2.nil?

    n = new(node1.frequency + node2.frequency, \
    nil, node2, node1)
    pqueue.add(n.frequency, n)
    }
    end
    end

    class HuffmanCoder
    attr :enchash
    attr :dechash
    attr :decre
    def initialize(nodetree)
    @enchash = nodetree.to_h(true)
    @dechash = nodetree.to_h(false)
    @decre = Regexp.new('^(' + @dechash.keys.join('|') + ')(.*)')
    end

    def encode(io_in,io_out,crunch_binary=true,io_stats=$stderr)
    buff = ''
    outbytes = 0
    inbytes = 0
    io_out.puts "Encoded:" unless crunch_binary
    encode_bits(io_in) { |bits|
    inbytes += 1
    if bits then
    buff += bits
    else
    buff += '0' * ((-buff.length) % 8 )
    end
    while buff.length >= 8 do
    binary = buff.slice!(0..7)
    if crunch_binary then
    binary = [binary].pack("b*")
    else
    binary = binary + ' '
    binary += "\n" if 4 == outbytes % 5
    end
    io_out.print binary
    outbytes += 1
    end
    }
    if 0 != outbytes % 5 and not crunch_binary then
    io_out.print "\n"
    end
    inbytes -= 2
    if io_stats then
    io_stats.puts "Original Bytes: #{inbytes}"
    io_stats.puts "Encoded Bytes: #{outbytes}"
    io_stats.puts "Compressed: %2.1f%%"% [100.0 - (100.0*outbytes)/inbytes]
    end
    end

    def decode(io_in,io_out,crunched_binary=true,io_stats=$stderr)
    buff = ''
    outbytes = 0
    inbytes = decode_bits(io_in,crunched_binary) { |bits|
    buff += bits
    m = @decre.match buff
    while m do
    ch = @dechash[m[1]]
    if ch == -1
    if m[2] !~ /^0*$/ then
    raise "Garbage after EOD marker"
    end
    break
    end
    io_out.putc ch
    outbytes += 1
    buff = m[2]
    m = @decre.match buff
    end
    }
    if io_stats then
    io_stats.puts "Encoded Bytes: #{inbytes}"
    io_stats.puts "Original Bytes: #{outbytes}"
    io_stats.puts "Compressed: %2.1f%%"% [100.0 - (100.0*inbytes)/outbytes]
    end
    end

    def HuffmanCoder.from_file(treefile)
    tree = nil
    File.open(treefile, "r") { |f|
    treetxt = ''
    f.each{ |treeline| treetxt += treeline }
    tree = HuffmanNode.from_s(treetxt)
    }
    new(tree)
    end

    def HuffmanCoder.generate(io_in, treefile, generate_extended, io_stats=$stderr)
    bytecount = 0
    d = Hash.new(0);
    io_in.each_byte {|b| d += 1; bytecount += 1}
    tree = HuffmanNode.make_tree(d,generate_extended)
    if ! treefile.nil?
    File.open(treefile, "w") {|f|
    f.puts tree.to_s
    }
    end
    if io_stats then
    io_stats.puts "Generated tree from input size #{bytecount}"
    end
    new(tree)
    end

    private

    def encode_bits(io_in)
    c = io_in.getc
    until c.nil?
    bits = @enchash[c]
    raise "no code for character #{c}" unless bits
    yield bits
    c = io_in.getc
    end
    yield @enchash[-1]
    yield nil
    end

    def decode_bits(io_in,crunched_binary)
    inbytes = 0
    if crunched_binary then
    until io_in.eof?
    b = io_in.read(4096)
    return if b.nil?
    inbytes += b.length
    yield b.unpack('b*')[0]
    end
    else
    until io_in.eof?
    b = io_in.read(4096)
    return if b.nil?
    b.tr!('^01','')
    inbytes += b.length
    yield b if b.length > 0
    end
    inbytes = (inbytes+7)/8
    end
    inbytes
    end
    end

    # That's all the interesting stuff. Everything from here down is
    # argument handling. Basically uninteresting.

    if __FILE__ == $0 then
    mode = :encode
    input = nil
    output = nil
    treefile = nil
    crunched_binary = true
    generate_extended = false
    statschannel = $stderr
    while ARGV[0] and ARGV[0] =~ /^-/
    opt = ARGV.shift
    case opt
    when '--'
    break
    when '-t'
    treefile = ARGV.shift
    when '-i'
    input = ARGV.shift
    when '-o'
    output = ARGV.shift
    when '-d'
    mode = :decode
    when '-b'
    crunched_binary = false
    when '-q'
    statschannel = nil
    when '-g'
    mode = :gentree
    when '-x'
    mode = :gentree
    generate_extended = true
    when '-h'
    puts "Usage:"
    puts "Generate tree: #{$0} -g -t treefile [opts|phrase]"
    puts " encode: #{$0} -t treefile [opts|phrase]"
    puts " decode: #{$0} -d -t treefile [opts|phrase]"
    puts "[opts]: -i input -o output"
    puts " -b flip binary mode (default is 0s and 1s"
    puts " if -o is not given, and bytes if -o is)"
    puts " -x generate extended tree to also cover bytes"
    puts " not present in the training input"
    puts " -q be quiet - no statistics"
    exit 0
    else
    $stderr.puts "Unrecognized option #{opt} -- use -h for help"
    exit 1
    end
    end
    if treefile.nil? then
    # allow no treefile only when encoding command line
    # stuff as a demo.
    if ! (input.nil? and mode == :encode)
    $stderr.puts "Error: no -t option given"
    exit 1
    end
    end
    in_io = nil
    if input.nil?
    in_io = StringIO.new(ARGV.join(' '))
    crunched_binary = ! crunched_binary if mode == :decode
    elsif input == '-'
    in_io = $stdin
    else
    in_io = File.open(input, "rb")
    end
    out_io = nil
    if output.nil?
    out_io = $stdout
    crunched_binary = ! crunched_binary if mode == :encode
    elsif output == '-'
    out_io = $stdout
    else
    out_io = File.open(output, "wb")
    end
    if mode == :gentree then
    HuffmanCoder.generate(in_io, treefile, generate_extended,
    statschannel)
    else
    coder = nil
    if (treefile.nil?)
    coder = HuffmanCoder.generate(in_io, nil, false, nil)
    in_io.rewind
    else
    coder = HuffmanCoder.from_file(treefile)
    end
    coder.send(mode, in_io,out_io, crunched_binary, statschannel)
    end
    end

    __END__



    --
    s=%q( Daniel Martin --
    puts "s=%q(#{s})",s.to_a.last )
    puts "s=%q(#{s})",s.to_a.last
     
    Daniel Martin, May 14, 2007
    #6
  7. Ruby Quiz

    Drew Olson Guest

    Re: Huffman Encoder (#123)

    Here's my solution. I'm getting different values, but I'm fairly sure
    I'm building the tree correctly. I suppose it all depends on how you
    choose to encode the values (0 = left, 1 = right in my case). I may have
    some glaring stupid error, but I can't find it if I do :)

    Output:

    --------------------------------------------------------

    >ruby huffman.rb ABRRKBAARAA


    Encoded
    10100000 01101011 0011
    3 bytes

    Original
    ABRRKBAARAA
    11 bytes

    72.7272727272727% compression

    --------------------------------------------------------


    >ruby huffman.rb I want this message to get encoded!


    Encoded
    01010001 10101100 10000110 00011010 01010111
    10001101 10011111 11110010 10001000 01110110
    00101000 10110000 01100001 00011011 10010011
    00101000 0
    17 bytes

    Original
    I want this message to get encoded!
    35 bytes

    51.4285714285714% compression

    --------------------------------------------------------

    - Drew

    # file: huffman.rb
    # author: Drew Olson

    require 'enumerator'

    # class to hold nodes in the huffman tree
    class Node
    attr_accessor :val,:weight,:left,:right

    def initialize(val="",weight=0)
    @val,@weight = val,weight
    end
    end

    # priority queue for nodes
    class NodeQueue
    def initialize
    @queue = []
    end

    def enqueue node
    @queue << node
    @queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
    end

    def dequeue
    @queue.pop
    end

    def size
    @queue.size
    end
    end

    # HuffmanTree represents the tree with which we perform
    # the encoding
    class HuffmanTree

    # initialize the tree based on data
    def initialize data
    @freqs = build_frequencies(data)
    @root = build_tree
    end

    #encode the given data
    def encode data
    data.downcase.split(//).inject("") do |code,char|
    code + encode_char(char)
    end
    end

    private

    # this method encodes a given character based on our
    # tree representation
    def encode_char char
    node = @root
    coding = ""

    # we do a binary search, building the representation
    # of the character based on which branch we follow
    while node.val != char
    if node.right.val.include? char
    node = node.right
    coding += "1"
    else
    node = node.left
    coding += "0"
    end
    end
    coding
    end

    # get word frequencies in a given phrase
    def build_frequencies phrase
    phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
    hash[item] += 1
    hash
    end
    end

    # build huffmantree using the priority queue method
    def build_tree
    queue = NodeQueue.new

    # build a node for each character and place in pqueue
    @freqs.keys.each do |char|
    queue.enqueue(Node.new(char,@freqs[char]))
    end

    while !queue.size.zero?

    # if only one node exists, it is the root. return it
    return queue.dequeue if queue.size == 1

    # dequeue two lightest nodes, create parent,
    # add children and enqueue newly created node
    node = Node.new
    node.right = queue.dequeue
    node.left = queue.dequeue
    node.val = node.left.val+node.right.val
    node.weight = node.left.weight+node.right.weight
    queue.enqueue node
    end
    end
    end

    # get command lines args, build tree and encode data
    if __FILE__ == $0
    data = ARGV.join(" ")
    tree = HuffmanTree.new data
    encoded = tree.encode(data).scan(/\d{1,8}/)
    puts
    puts "Encoded"
    encoded.each_slice(5) do |slice|
    puts slice.join(" ")
    end
    puts "#{encoded.size} bytes"
    puts
    puts "Original"
    puts data
    puts "#{data.size} bytes"
    puts
    compression = (100.0 - (encoded.size.to_f/data.size.to_f)*100)
    puts "#{compression}\% compression"
    end

    --
    Posted via http://www.ruby-forum.com/.
     
    Drew Olson, May 14, 2007
    #7
  8. Re: Huffman Encoder (#123)

    The better solution...

    Here's the code: http://tinyurl.com/32tjfo

    Example:

    $ echo -n Scooby-Dooby-Doo | ruby huffman.rb | ruby huffman.rb -d
    Compression: 27/17 (158%)
    Decompression: 27/17 (158%)
    Scooby-Dooby-Doo

    How does it work?

    Count each unique byte in the original byte stream (e.g.
    "Scooby-Dooby-Doo"):

    [["S", 1], ["c", 1], ["o", 6], ["b", 2], ["y", 2], ["-", 2], ["D",
    2]]

    This list is then converted to a tree, using this code:

    while tree.length > 1
    a, b, *tree = tree.sort_by{|_, count| count} # Not deterministic.
    tree << [[a[0], b[0]], a[1]+b[1]]
    end

    In English: The two least frequently used leaves/nodes are
    replaced by one node (a combination of both leaves),
    recursively, until the list is only one leaf/node long. Step by
    step, this results in the following:

    [["S", 1], ["c", 1], ["b", 2], ["D", 2], ["y", 2], ["-", 2], ["o",
    6]]
    [["b", 2], ["D", 2], ["y", 2], [["S", "c"], 2], ["-", 2], ["o", 6]]
    [["y", 2], [["S", "c"], 2], ["-", 2], [["b", "D"], 4], ["o", 6]]
    [["-", 2], [["b", "D"], 4], [["y", ["S", "c"]], 4], ["o", 6]]
    [[["y", ["S", "c"]], 4], ["o", 6], [["-", ["b", "D"]], 6]]
    [[["-", ["b", "D"]], 6], [[["y", ["S", "c"]], "o"], 10]]
    [[[["-", ["b", "D"]], [["y", ["S", "c"]], "o"]], 16]]

    The tree is actually still in the list (the only element in the
    list), so we get it and strip the weight:

    [["-", ["b", "D"]], [["y", ["S", "c"]], "o"]]

    Secondly, we build the translation table, where each leaf in
    the tree is mapped to a bit sequence, by using the path to each
    leaf:

    {"-"=>"00", "b"=>"010", "y"=>"100", "c"=>"1011", "D"=>"011",
    "o"=>"11", "S"=>"1010"}

    We can translated each byte of the original byte stream:

    ["S", "c", "o", "o", "b", "y", "-", "D", "o", "o", "b", "y", "-",
    "D", "o", "o"]

    ... to the corresponding bit stream:

    ["1010", "1011", "11", "11", "010", "100", "00", "011", "11", "11",
    "010", "100", "00", "011", "11", "11"]

    ... which is joined into a string:

    "101010111111010100000111111010100000111111"

    The rest, serializing the translation table and adding and
    registering padding bits is not that complicated and won't be
    explained here.

    The decoding (decompressing?) is, well, just the reversed order
    of the steps...

    I want to emphasize one line of the decode code. After having
    deserialized the translation table, we can decode the whole
    message with one line:

    bitstring.scan(/#{table.keys.join("|")}/).collect{|bits|
    table[bits]}.join("")

    gegroet,
    Erik V. - http://www.erikveen.dds.nl/
     
    Erik Veenstra, May 14, 2007
    #8
  9. Ruby Quiz

    Eric I. Guest

    Re: Huffman Encoder (#123)

    My solution to the quiz can be found here:

    http://pastie.caboo.se/61613

    This solution has a number of features which might make it different
    to other solutions submitted. It encodes to and decodes from actual
    binary data, not strings of "0" and "1" characters. It is able to
    serialize and deserialize the Huffman encoder (which is a hash
    table, not a tree). It uses a linear algorithm to create the
    Huffman code from the character frequency data. And beyond that, it
    was written with an attempt to be efficient time-wise and
    memory-wise. It uses an end-of-message token :)eom) to mark the end
    of the encoded message. It therefore is able to handle messages
    that use 256 characters plus the 1 token. It recognizes that with
    large messages some characters could be encoded with more than 8
    bits, and it handles those situations. And finally it makes an
    attempt to recognize corrupt data, both in terms of the Huffman
    encoded message and the serialized encoder.

    Eric
     
    Eric I., May 15, 2007
    #9
  10. Ruby Quiz

    Raf Coremans Guest

    2007/5/15, Ball, Donald A Jr (Library) <>:
    <snip>
    > I also tried writing a binary encoder which packs
    > and unpacks the bit strings generated by my string encoder, but it fails
    > to decompress the entire text for long files, e.g. the program source
    > code itself. I'm unsure why, perhaps someone more experienced in this
    > arena can suggest why.


    Perhaps because the file you want to encode already contains your
    end-marker. The decoder stops when it encounters the first end-marker
    (line 161). A binary file would not be unlikely to contain a NULL
    character somewhere.

    Regards,
    Raf
     
    Raf Coremans, May 15, 2007
    #10
  11. Ruby Quiz

    Raf Coremans Guest

    2007/5/15, Ball, Donald A Jr (Library) <>:
    > > 2007/5/15, Ball, Donald A Jr (Library) <>:
    > > <snip>
    > > > I also tried writing a binary encoder which packs and

    > > unpacks the bit
    > > > strings generated by my string encoder, but it fails to

    > > decompress the
    > > > entire text for long files, e.g. the program source code

    > > itself. I'm
    > > > unsure why, perhaps someone more experienced in this arena

    > > can suggest
    > > > why.

    > >
    > > Perhaps because the file you want to encode already contains
    > > your end-marker. The decoder stops when it encounters the
    > > first end-marker (line 161). A binary file would not be
    > > unlikely to contain a NULL character somewhere.

    >
    > That was my first thought, but I threw in a check for the existence of
    > NULL characters in the frequency hash before merging the EOS=>0 hash,
    > and it did not complain. Besides, the file I'm trying to decompress is
    > the ruby source of my quiz solution, which is just plain ol' text.
    >
    > - donald
    >


    Strange... I get this:


    Encoding-decoding paste-61871.rb:
    $ ruby -w paste-61871.rb -b -f paste-61871.rb | ruby -w paste-61871.rb
    -b -d > ref
    $ cmp paste-61871.rb ref

    =>Identical


    Encoding-decoding a binary file:
    $ ruby -w paste-61871.rb -b -f dilbert2073317070515.gif | ruby -w
    paste-61871.rb -b -d > ref
    $ cmp dilbert2073317070515.gif ref
    cmp: EOF on ref
    $ hexdump -C ref
    00000000 47 49 46 38 39 61 58 02 d5 |GIF89aX..|
    00000009
    $ hexdump -C dilbert2073317070515.gif | head -n 1
    00000000 47 49 46 38 39 61 58 02 d5 00 c4 00 00 58 58 58 |GIF89aX......XXX|

    => Encoded-decoded terminates just before first NULL character in original file


    Regards,
    Raf
     
    Raf Coremans, May 15, 2007
    #11
  12. Ruby Quiz

    Drew Olson Guest

    Re: Huffman Encoder (#123)

    A little late, but here's my 2nd solution. It does decoding as well, but
    I didn't get a chance to add serialization. Nothing too fancy in here:

    Output:

    -------------------------------------------------------------

    >ruby huffman.rb ABRRKBAARAA

    Original
    ABRRKBAARAA
    11 bytes

    Encoded
    10100000 01101011 0011
    3 bytes

    72 percent compression

    Decoded
    abrrkbaaraa

    -------------------------------------------------------------

    >ruby huffman.rb I want this message to get encoded!


    Original
    I want this message to get encoded!
    35 bytes

    Encoded
    01010001 10101100 10000110 00011010 01010111
    10001101 10011111 11110010 10001000 01110110
    00101000 10110000 01100001 00011011 10010011
    00101000 0
    17 bytes

    51 percent compression

    Decoded
    i want this message to get encoded!

    -------------------------------------------------------------

    Code:

    # file: huffman.rb
    # author: Drew Olson

    require 'enumerator'

    # class to hold nodes in the huffman tree
    class Node
    attr_accessor :val,:weight,:left,:right

    def initialize(val="",weight=0)
    @val,@weight = val,weight
    end

    def children?
    return @left || @right
    end
    end

    # priority queue for nodes
    class NodeQueue
    def initialize
    @queue = []
    end

    def enqueue node
    @queue << node
    @queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
    end

    def dequeue
    @queue.pop
    end

    def size
    @queue.size
    end
    end

    # HuffmanTree represents the tree with which we perform
    # the encoding
    class HuffmanTree

    # initialize the tree based on data
    def initialize data
    @freqs = build_frequencies(data)
    @root = build_tree
    end

    #encode the given data
    def encode data
    data.downcase.split(//).inject("") do |code,char|
    code + encode_char(char)
    end
    end

    def decode data
    node = @root

    if !@root.children?
    return @root.val
    end

    data.split(//).inject("") do |phrase,digit|
    if digit == "0"
    node = node.left
    else
    node = node.right
    end
    if !node.children?
    phrase += node.val
    node = @root
    end
    phrase
    end
    end

    private

    # this method encodes a given character based on our
    # tree representation
    def encode_char char
    node = @root
    coding = ""

    # encode to 0 if only one character
    if !@root.children?
    return "0"
    end

    # we do a binary search, building the representation
    # of the character based on which branch we follow
    while node.val != char
    if node.right.val.include? char
    node = node.right
    coding += "1"
    else
    node = node.left
    coding += "0"
    end
    end
    coding
    end

    # get word frequencies in a given phrase
    def build_frequencies phrase
    phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
    hash[item] += 1
    hash
    end
    end

    # build huffmantree using the priority queue method
    def build_tree
    queue = NodeQueue.new

    # build a node for each character and place in pqueue
    @freqs.keys.each do |char|
    queue.enqueue(Node.new(char,@freqs[char]))
    end

    while !queue.size.zero?

    # if only one node exists, it is the root. return it
    return queue.dequeue if queue.size == 1

    # dequeue two lightest nodes, create parent,
    # add children and enqueue newly created node
    node = Node.new
    node.right = queue.dequeue
    node.left = queue.dequeue
    node.val = node.left.val+node.right.val
    node.weight = node.left.weight+node.right.weight
    queue.enqueue node
    end
    end
    end

    # get command lines args, build tree and encode data
    if __FILE__ == $0
    data = ARGV.join(" ")
    tree = HuffmanTree.new data

    # get encoded data and split into bits
    code = tree.encode(data)
    encoded_bits = code.scan(/\d{1,8}/)

    # output
    puts
    puts "Original"
    puts data
    puts "#{data.size} bytes"
    puts
    puts "Encoded"
    encoded_bits.each_slice(5) do |slice|
    puts slice.join(" ")
    end
    puts "#{encoded_bits.size} bytes"
    puts
    puts "%d percent compression" % (100.0 -
    (encoded_bits.size.to_f/data.size.to_f)*100.0)
    puts
    puts "Decoded"
    puts tree.decode(code)
    end

    --
    Posted via http://www.ruby-forum.com/.
     
    Drew Olson, May 16, 2007
    #12
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Eddy Soeparmin

    Phone Format (770) 123-1234

    Eddy Soeparmin, Jul 28, 2003, in forum: ASP .Net
    Replies:
    3
    Views:
    3,197
    Swanand Mokashi
    Jul 28, 2003
  2. huffman encoder

    , Dec 21, 2005, in forum: C Programming
    Replies:
    16
    Views:
    818
    Chuck F.
    Dec 23, 2005
  3. Jesse Merriman

    [SOLUTION][QUIZ] Huffman Encoder (#123)

    Jesse Merriman, May 13, 2007, in forum: Ruby
    Replies:
    1
    Views:
    173
    Charles Lowe
    May 13, 2007
  4. Kyle Schmitt

    Late huffman (quiz 123)

    Kyle Schmitt, May 16, 2007, in forum: Ruby
    Replies:
    0
    Views:
    91
    Kyle Schmitt
    May 16, 2007
  5. Ruby Quiz

    [SUMMARY] Huffman Encoder (#123)

    Ruby Quiz, May 17, 2007, in forum: Ruby
    Replies:
    0
    Views:
    144
    Ruby Quiz
    May 17, 2007
Loading...

Share This Page