Python Huffman encoding

Discussion in 'Python' started by dot, Nov 25, 2004.

  1. dot

    dot Guest

    Hi all, I have written a Python huffman Encoding Module, for my own
    amusement. I thought it might be educational/entertaining for other
    people, so I've put it on my website and wrote about it a little.

    http://gumuz.looze.net/wordpress/index.php/archives/2004/11/25/huffman-encoding/

    Your comments are highly appreciated!

    cheers,

    guyon

    dot, Nov 25, 2004
    #1

  2. David Fraser

    David Fraser Guest

    dot wrote:
    > Hi all, I have written a Python huffman Encoding Module, for my own
    > amusement. I thought it might be educational/entertaining for other
    > people, so I've put it on my website and wrote about it a little.
    >
    > http://gumuz.looze.net/wordpress/index.php/archives/2004/11/25/huffman-encoding/
    >
    >
    > Your comments are highly appreciated!
    >


    Looks cool. Now if only I had seen this before I implemented my own
    huffman decoding for JPEG ... I wonder if your module could be used
    instead ...
     
    David Fraser, Nov 25, 2004
    #2

  3. Paul McGuire

    Paul McGuire Guest

    "dot" <gumuz(at)looze(dot)net> wrote in message
    news:41a525c6$0$76502$...
    > Hi all, I have written a Python huffman Encoding Module, for my own
    > amusement. I thought it might be educational/entertaining for other
    > people, so I've put it on my website and wrote about it a little.
    >
    > http://gumuz.looze.net/wordpress/index.php/archives/2004/11/25/huffman-encoding/
    >
    > Your comments are highly appreciated!
    >
    > cheers,
    >
    > guyon


    Guyon -

    Great first step at some self-learning with a non-trivial test program.
    Here are some performance speedups that will help you out.

    1. Measure, measure, measure!!! Don't just guess where the performance
    hot spots may lie; use time, timeit, or the hotshot profiler to help you
    find your performance problems. I used a very crude form of profiling, in
    the same vein as using printf for debugging: I created a method called
    elapsed(), and then just dropped in elapsed("A"), elapsed("B"), etc. calls
    to help me identify where the time is going. (Calling elapsed() with no
    label resets the timer.)

    import time

    elapsedLast = 0
    def elapsed(label=''):
        global elapsedLast
        cur = time.time()
        if label:
            print "%s %.2f sec" % (label, cur - elapsedLast)
        elapsedLast = cur
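    (A sketch of the timeit alternative mentioned above, for comparison; the
    string-joining snippet is just a made-up workload for illustration:)

    ```python
    import timeit

    # time a snippet by running it many times; timeit uses a suitable
    # timer and avoids some pitfalls of hand-rolled timing like elapsed()
    t = timeit.timeit("''.join(str(i) for i in range(100))", number=1000)
    # t is the total elapsed time in seconds for the 1000 runs
    ```
    
    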


    2. Your main encoding hotspot is in the bin2dec() routine. In your
    newbie-ness, you have reinvented the int() built-in. Adding the code (to
    replace your implementation of bin2dec):

    def bin2dec(bin_number):
        return int(bin_number, 2)

    cuts out 75% of your encoding time.
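    A quick check of the built-in at work:

    ```python
    # int() with an explicit base parses a binary string directly,
    # which is exactly what bin2dec was doing by hand
    assert int('1011', 2) == 11
    assert int('11111111', 2) == 255
    ```
    
    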

    3. Your decoding method has 2 hot spots. One, predictably, is in the
    conversion from decimal to binary. Unfortunately, there *is* no builtin for
    this (did I read somewhere that this would be added in 2.4?), so instead, I
    resorted to the time-honored lookup table approach. You only ever call
    dec2bin with numbers from 0 to 255. So I followed your slow implementation
    of dec2bin with the following code:

    # use slow dec2bin to build lookup dict
    binmap = {}
    for i in range(256):
        binmap[i] = dec2bin(i)
    # now redefine dec2bin using lookup dict
    def dec2bin(dec_number):
        return binmap[dec_number]

    This creates a fast lookup table using 256 calls to your slow routine, then
    defines a new dec2bin method that just returns the value from the lookup
    table. (In Python version 2.4, you could use a memoize decorator to
    accomplish this with a single @memoize decorator
    statement/declaration/well-what-is-this-thing-anyway? just before your
    original dec2bin method - this is a very cool technique worth checking out -
    see http://www.python.org/moin/PythonDecoratorLibrary.) This change crushes
    out almost *all* the decimal-to-binary time (which in my tests was about 58%
    of the decoding time).
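    For the curious, a minimal sketch of such a memoize decorator; note the
    slow dec2bin body below is a hypothetical stand-in for illustration, not
    the code from the module under discussion:

    ```python
    def memoize(func):
        cache = {}
        def wrapper(arg):
            # compute on first call, then serve from the cache
            if arg not in cache:
                cache[arg] = func(arg)
            return cache[arg]
        return wrapper

    @memoize
    def dec2bin(dec_number):
        # hypothetical slow digit-by-digit conversion
        bits = ''
        while dec_number > 0:
            bits = str(dec_number % 2) + bits
            dec_number //= 2
        return bits or '0'
    ```
    
    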

    Your second hot spot in decoding is the while loop in which you iterate over
    the encoded string looking for successively longer keys. With a few minor
    changes, I cut out about 48% of the processing time in your loop (or about
    19% of the total decoding time).
    Orig:
    while not end_idx > len(bit_string):
        if table.has_key(bit_string[start_idx:end_idx]):
            result.append(table[bit_string[start_idx:end_idx]])
            start_idx = end_idx
        end_idx += 1

        if bit_string[start_idx:end_idx] == '': break

    Modified:
    bit_string_len = len(bit_string)
    while not end_idx > bit_string_len:
        curkey = bit_string[start_idx:end_idx]
        if curkey in table:
            result.append(table[curkey])
            start_idx = end_idx
        end_idx += 1

    bit_string does not change length during this loop, so there is no reason to
    evaluate len(bit_string) each time. This alone cuts another 5% from the
    original decoding time. Then, I removed the curious 'if bit_string[...'
    test, which looks like some kind of vestigial debugging code - I'm not sure
    I can see any conditions when it will evaluate to true. Removing this drops
    another 8%. Finally, I saved the temporary value of the string slice
    bit_string[start_idx:end_idx], in case it was a key found in the table
    variable - no need to double-compute the slice if the key is in the table -
    this carves off another 7%.
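    To see the modified loop end to end, here is a self-contained run against
    a hypothetical three-symbol prefix code; the table and bit string are made
    up for illustration, not taken from the module:

    ```python
    # hypothetical prefix-code table; the real one would come from
    # walking the Huffman tree built by the encoder
    table = {'0': 'a', '10': 'b', '11': 'c'}
    bit_string = '0101100'

    result = []
    start_idx, end_idx = 0, 1
    bit_string_len = len(bit_string)
    while not end_idx > bit_string_len:
        curkey = bit_string[start_idx:end_idx]
        if curkey in table:
            result.append(table[curkey])
            start_idx = end_idx
        end_idx += 1
    # result now spells out the decoded symbols in order
    ```
    
    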

    Just some general comments on performance:
    1. Avoid function calls. These are major performance killers. The worst
    part of your original dec2bin was that it called itself recursively.
    2. try: except: overhead. Setting up the try/except frames can be
    expensive. One of my early attempts at optimizing your while loop was to
    use:
    try:
        result.append(table[bit_string[start_idx:end_idx]])
    except KeyError:
        pass
    else:
        start_idx = end_idx
    But this made things much slower!
    3. Loop invariants and dotted refs. If you have a while-loop, remember that
    the while condition is reevaluated every iteration. If you are working your
    way through a long string, don't keep calling len(reallyLongString)! Also,
    once you start to work more with classes and objects, you'll find that
    repeated refs to object attributes within a while loop can be improved by
    copying to a local variable. That is,
    while countingSomethings():
        if numberOfSomethings > self.maxSomethings:
            reactAccordingly()
    can be performance-improved with:
    maxSomethingLimit = self.maxSomethings
    while countingSomethings():
        if numberOfSomethings > maxSomethingLimit:
            reactAccordingly()
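    A runnable toy version of the same idea, with made-up names (collecting
    the values that exceed a limit stored on an object):

    ```python
    class Counter(object):
        def __init__(self, max_somethings):
            self.maxSomethings = max_somethings

    c = Counter(10)
    # hoist the attribute lookup out of the loop so self.maxSomethings
    # is not re-resolved on every iteration
    maxSomethingLimit = c.maxSomethings
    over_limit = []
    for numberOfSomethings in range(20):
        if numberOfSomethings > maxSomethingLimit:
            over_limit.append(numberOfSomethings)
    ```
    
    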

    Lastly, performance enhancements can be in direct conflict with
    maintainability and generally accepted practices - so use them sparingly!!!
    and comment liberally!!!

    -- Paul
     
    Paul McGuire, Nov 28, 2004
    #3
  4. Guyon Morée

    Guyon Morée Guest

    Wow Paul!

    thanks a lot for your comments! I learned a lot just by reading them; I
    will implement them and see what the speed gains are.

    Keep an eye on my blog, I will post an update on it soon!


    thanks,
    guyon

    ps. love this group


    "Paul McGuire" <._bogus_.com> wrote in message

    > http://gumuz.looze.net/wordpress/index.php/archives/2004/11/25/huffman-encoding/
    >
    > Great first step at some self-learning with a non-trivial test program.
    > Here are some performance speedups that will help you out.
     
    Guyon Morée, Nov 30, 2004
    #4
