creating very small types

andrea · Apr 27, 2005

I was thinking to code the huffman algorithm and trying to compress
something with it, but I've got a problem.
How can I represent for example a char with only 3 bits??
I had a look to the compression modules but I can't understand them much...

Thank you very much
Any good link would be appreciated of course

Jeremy Bowers · Apr 27, 2005

I was thinking to code the huffman algorithm and trying to compress
something with it, but I've got a problem.
How can I represent for example a char with only 3 bits??
I had a look to the compression modules but I can't understand them much...

Thank you very much
Any good link would be appreciated of course

I think the answer to this question very much depends on why you want to
do this. Easy and fast are pretty opposed in this particular domain in
Python (IMHO anyways, it's an easy bit-bashing language but it's slow
for that), and you're in one of the rare domains where it matters.

The answers strongly vary if you're trying to create something performant,
or just for your learning purposes. Or homework purposes... ;-)

andrea · Apr 27, 2005

Jeremy said:
I think the answer to this question very much depends on why you want to
do this. Easy and fast are pretty opposed in this particular domain in
Python (IMHO anyways, it's an easy bit-bashing language but it's slow
for that), and you're in one of the rare domains where it matters.

The answers strongly vary if you're trying to create something performant,
or just for your learning purposes. Or homework purposes... ;-)

No it's not for homework but for learning purposes...
I understand I can't do it easily in python, but maybe I could define a
new type in C and use it to do those dirty works, what do you think?

Michael Spencer · Apr 27, 2005

andrea said:
....
I understand I can't do it easily in python, but maybe I could define a
new type in C and use it to do those dirty works, what do you think?

Why do you need to create 'very small types'?

You only need actual bit-twiddling when you do the encoding/de-coding right?
If you create an encoding map 'codes' as a dict of strings of '1' and '0',
encoding might look like (untested):

def encode(stream):
outchar = count = 0
for char in stream:
for bit in codes[char]:
(outchar << 1) | (bit == "1")
count +=1
if count ==8:
yield chr(outchar)
outchar = count = 0
if count:
yield chr(outchar)

HTH
Michael

Dan Bishop · Apr 28, 2005

Michael said:
them much...
...

I understand I can't do it easily in python, but maybe I could define a
new type in C and use it to do those dirty works, what do you

Click to expand...

think?

Why do you need to create 'very small types'?

You only need actual bit-twiddling when you do the encoding/de-coding right?
If you create an encoding map 'codes' as a dict of strings of '1' and '0',
encoding might look like (untested):

def encode(stream):
outchar = count = 0
for char in stream:
for bit in codes[char]:
(outchar << 1) | (bit == "1")
count +=1
if count ==8:
yield chr(outchar)
outchar = count = 0
if count:
yield chr(outchar)

I wrote some Huffman compression code a few years ago, with

class BitWriter(object):
# writes individual bits to an output stream
def __init__(self, outputStream):
self.__out = outputStream
self.__bitCount = 0 # number of unwritten bits
self.__currentByte = 0 # buffer for unwritten bits
def write(self, bit):
self.__currentByte = self.__currentByte << 1 | bit
self.__bitCount += 1
if self.__bitCount == BYTE_SIZE:
self.__out.write(chr(self.__currentByte))
self.__bitCount = 0
self.__currentByte = 0
def flush(self):
while self.__bitCount > 0:
self.write(0)

class BitReader(object):
# reads individual bits from an input stream
def __init__(self, inputStream):
self.__in = inputStream
self.__bits = [] # buffer to hold incoming bits
def readBit(self):
if len(self.__bits) == 0:
# read the next byte
b = ord(self.__in.read(1))
# unpack the bits
self.__bits = [(b & (1<<i)) != 0 for i in range(BYTE_SIZE-1,
-1, -1)]
return self.__bits.pop(0)

Bengt Richter · Apr 28, 2005

andrea said:
andrea said:

...
I understand I can't do it easily in python, but maybe I could define a
new type in C and use it to do those dirty works, what do you think?

Click to expand...

Why do you need to create 'very small types'?

You only need actual bit-twiddling when you do the encoding/de-coding right?
If you create an encoding map 'codes' as a dict of strings of '1' and '0',
encoding might look like (untested):

def encode(stream):
outchar = count = 0
for char in stream:
for bit in codes[char]:
(outchar << 1) | (bit == "1")
count +=1
if count ==8:
yield chr(outchar)
outchar = count = 0
if count:

outchar <<= 8-count # make last bit field big-endian like others?

yield chr(outchar)

I think I prefer little-endian bit streams though, e.g. (just hacked, not tested beyond
what you see and not recommended as efficient either way, esp the decode, but it might
be illustrative of something for Andrea ;-)

----< trivcodec.py >------------------------------------------------
#trivcodec.py
def encode(stream):
outchar = count = 0
for char in stream:
#print '%r code: %r'%(char, codes[char])
bits, width = codes[char]
outchar |= (bits<<count)
count += width
while count >= 8:
yield chr(outchar%0xff)
outchar >>= 8
count -= 8
if count:
yield chr(outchar)

def decode(stream):
codebuf = count = 0
width = 1
for codebyte in stream:
codebyte = ord(codebyte)
codebuf |= (codebyte<<count)
count += 8
if width>count:
continue
while width <= count:
code = (codebuf&((1<<width)-1), width)
if code not in chars:
width += 1
continue
yield chars

Code:

            codebuf >>= width
            count -= width
            width = 1
        width = 1 

#trivial encoding dict: a=0 b=01 C=0011 D=0111 E=1011 F=1111
codes = {'a':(0x00,1), 'b':(0x01,2), 
         'C':(0x3+0x4*0,4), 'D':(0x3+0x4*1,4),
         'E':(0x3+0x4*2,4), 'F':(0x3+0x4*3,4)}
chars = dict([(v,k) for k,v in codes.items()]) # reverse

def test(*tests):
    if not tests: tests = ['abaFECbaabbDF']
    for charstream in tests:
        print
        codestream = ''.join(list(encode(charstream)))
        print '%r [%s] -(encode)-> %r [%s]' % (
            charstream, len(charstream), codestream, len(codestream))
        recovered = ''.join(list(decode(codestream)))
        print '%r [%s] -(decode)-> %r [%s]' % (
            codestream, len(codestream), charstream, len(charstream))
     
if __name__ == '__main__':
    import sys
    test(*sys.argv[1:])
--------------------------------------------------------------------

Result:

[22:02] C:\pywk\clp>py24 trivcodec.py

'abaFECbaabbDF' [13] -(encode)-> '\xf2;Q\xf7' [4]
'\xf2;Q\xf7' [4] -(decode)-> 'abaFECbaabbDF' [13]

[22:02] C:\pywk\clp>py24 trivcodec.py  a b C D E F

'a' [1] -(encode)-> '\x00' [1]
'\x00' [1] -(decode)-> 'a' [1]

'b' [1] -(encode)-> '\x01' [1]
'\x01' [1] -(decode)-> 'b' [1]

'C' [1] -(encode)-> '\x03' [1]
'\x03' [1] -(decode)-> 'C' [1]

'D' [1] -(encode)-> '\x07' [1]
'\x07' [1] -(decode)-> 'D' [1]

'E' [1] -(encode)-> '\x0b' [1]
'\x0b' [1] -(decode)-> 'E' [1]

'F' [1] -(encode)-> '\x0f' [1]
'\x0f' [1] -(decode)-> 'F' [1]

[22:02] C:\pywk\clp>py24 trivcodec.py

'abaFECbaabbDF' [13] -(encode)-> '\xf2;Q\xf7' [4]
'\xf2;Q\xf7' [4] -(decode)-> 'abaFECbaabbDF' [13]

[22:02] C:\pywk\clp>py24 trivcodec.py  aaaaaaaa bbbb CC DD EE FF

'aaaaaaaa' [8] -(encode)-> '\x00' [1]
'\x00' [1] -(decode)-> 'aaaaaaaa' [8]

'bbbb' [4] -(encode)-> 'U' [1]
'U' [1] -(decode)-> 'bbbb' [4]

'CC' [2] -(encode)-> '3' [1]
'3' [1] -(decode)-> 'CC' [2]

'DD' [2] -(encode)-> 'w' [1]
'w' [1] -(decode)-> 'DD' [2]

'EE' [2] -(encode)-> '\xbb' [1]
'\xbb' [1] -(decode)-> 'EE' [2]

'FF' [2] -(encode)-> '\x00' [1]
'\x00' [1] -(decode)-> 'FF' [2]

[22:03] C:\pywk\clp>py24 trivcodec.py  aaaaaaaabbbbCCDDEEFF

'aaaaaaaabbbbCCDDEEFF' [20] -(encode)-> '\x00U3w\xbb\x00' [6]
'\x00U3w\xbb\x00' [6] -(decode)-> 'aaaaaaaabbbbCCDDEEFF' [20]


Regards,
Bengt Richter

Michael Spencer · Apr 28, 2005

andrea said:
No it's not for homework but for learning purposes...

I think I prefer little-endian bit streams though,

good point: should lead to easier decoding via right-shifts

e.g. (just hacked, not tested beyond

what you see

Yep, yours looks better. Pretty soon there isn't going to be much learning left
in this for Andrea ;-)

and not recommended as efficient either way, esp the decode,

I expect that decoding by walking the coding tree is cleaner (not sure about
faster).

but it might

be illustrative of something for Andrea ;-)

Michael

Bengt Richter · Apr 28, 2005

On Thu, 28 Apr 2005 05:07:34 GMT, (e-mail address removed) (Bengt Richter) wrote:

.... some not quite correct code ;-/
(I copy/pasted and created an illusion. My code dict has no EOS, so
I decode pad zero bits as code that a single zero stands for ('a' in this case)
so that was an oversight. I should have extended the coding some more or reserved
perhaps 'F' as EOString, and tested for EOS in the decode stream.
This is revised, sorry.
You can make a 'way more efficient decoder as an exercise ;-)
Hint: if you make a dict of all the 2**width integers as keys where
width is your widest code (4 here for 2**4), then you can do a single
mask of 2**width-1 and look up the translation to (char, codewidth) directly,
according to the codewidth least significant bits, if you make all the states
of the other bits into keys that map to the same (char,codewidth).
So for example all the even values of out 2**16 would have to map to ('a', 1)
and all the values with 2 LSBs of 01 (of which there are 4) would map to ('b',2)
and so forth. Thus the incrementing of width and trying one thing after
another is not necessary. I think that will work ...

Well, now Andrea has something to do ;-)

----< trivcodec.py >------------------------------------------------

#trivcodec.py
EOS = ''
import itertools
def encode(stream):
outchar = count = 0
for char in itertools.chain(stream, [EOS]):
bits, width = codes[char]
outchar |= (bits<<count)
count += width
while count >= 8:
yield chr(outchar&0xff)
outchar >>= 8
count -= 8
if count:
yield chr(outchar)

def decode(stream):
codebuf = count = 0
width = 1; char = None
for codebyte in stream:
codebyte = ord(codebyte)
codebuf |= (codebyte<<count)
count += 8
if width>count:
continue
while width <= count:
code = (codebuf&((1<<width)-1), width)
if code not in chars:
width += 1
continue
char = chars

Code:

            if char == EOS: break
            yield char
            codebuf >>= width
            count -= width
            width = 1
        if char == EOS: break
        width = 1 

#trivial encoding dict: a=0 b=01 C=0011 D=0111 E=1011 EOS=1111
codes = {'a':(0x00,1), 'b':(0x01, 2), 
         'C':(0x3+0x4*0,4), 'D':(0x3+0x4*1,4),
         'E':(0x3+0x4*2,4), EOS:(0x3+0x4*3,4)}
chars = dict([(v,k) for k,v in codes.items()]) # reverse

def test(*tests):
    if not tests: tests = ['abaDECbaabbD']
    for charstream in tests:
        print
        codestream = ''.join(list(encode(charstream)))
        print '%r [%s] -(encode)-> %r [%s]' % (
            charstream, len(charstream), codestream, len(codestream))
        recovered = ''.join(list(decode(codestream)))
        print '%r [%s] -(decode)-> %r [%s]' % (
            codestream, len(codestream), recovered, len(recovered))
     
if __name__ == '__main__':
    import sys
    test(*sys.argv[1:])Not really. Tack enough a's on the recovered chars to account for zero fill
of the last encoded byte ;-/[QUOTE]
[22:02] C:\pywk\clp>py24 trivcodec.py

'abaFECbaabbDF' [13] -(encode)-> '\xf2;Q\xf7' [4]
'\xf2;Q\xf7' [4] -(decode)-> 'abaFECbaabbDF' [13][/QUOTE]

Fine, because the [4] bytes were totally filled.[QUOTE]
[22:02] C:\pywk\clp>py24 trivcodec.py  a b C D E F

'a' [1] -(encode)-> '\x00' [1]
'\x00' [1] -(decode)-> 'a' [1][/QUOTE]
Not so fine. There is no character serving as EOS mark.[QUOTE]
'b' [1] -(encode)-> '\x01' [1]
'\x01' [1] -(decode)-> 'b' [1]

'C' [1] -(encode)-> '\x03' [1]
'\x03' [1] -(decode)-> 'C' [1]

'D' [1] -(encode)-> '\x07' [1]
'\x07' [1] -(decode)-> 'D' [1]

'E' [1] -(encode)-> '\x0b' [1]
'\x0b' [1] -(decode)-> 'E' [1]

'F' [1] -(encode)-> '\x0f' [1]
'\x0f' [1] -(decode)-> 'F' [1][/QUOTE]

That really produced:

[ 8:03] C:\pywk\clp>py24  trivcodec.py a b C D E F

'a' [1] -(encode)-> '\x00' [1]
'\x00' [1] -(decode)-> 'aaaaaaaa' [8]

'b' [1] -(encode)-> '\x01' [1]
'\x01' [1] -(decode)-> 'baaaaaa' [7]

'C' [1] -(encode)-> '\x03' [1]
'\x03' [1] -(decode)-> 'Caaaa' [5]

'D' [1] -(encode)-> '\x07' [1]
'\x07' [1] -(decode)-> 'Daaaa' [5]

'E' [1] -(encode)-> '\x0b' [1]
'\x0b' [1] -(decode)-> 'Eaaaa' [5]

'F' [1] -(encode)-> '\x0f' [1]
'\x0f' [1] -(decode)-> 'Faaaa' [5]

So we need to assign a code as the EOS.
And probably translate that to '' as the return char,
and end on that.

So now it does (having given up 1111 for EOS instead of F):

[10:22] C:\pywk\clp>py24  trivcodec.py

'abaDECbaabbD' [12] -(encode)-> 'r;Q\xf7' [4]
'r;Q\xf7' [4] -(decode)-> 'abaDECbaabbD' [12]

[10:22] C:\pywk\clp>py24  trivcodec.py a b C D E

'a' [1] -(encode)-> '\x1e' [1]
'\x1e' [1] -(decode)-> 'a' [1]

'b' [1] -(encode)-> '=' [1]
'=' [1] -(decode)-> 'b' [1]

'C' [1] -(encode)-> '\xf3' [1]
'\xf3' [1] -(decode)-> 'C' [1]

'D' [1] -(encode)-> '\xf7' [1]
'\xf7' [1] -(decode)-> 'D' [1]

'E' [1] -(encode)-> '\xfb' [1]
'\xfb' [1] -(decode)-> 'E' [1]

Goes to show you, eh? Slice and dice similar lines carefully ;-)

Regards,
Bengt Richter

Data saving in condition of changing reality	0	Apr 29, 2022
Linux: using "clone3" and "waitid"	0	Oct 17, 2023
'Needless flexibilities' and structured records [very long]	10	Mar 15, 2013
I'm tempted to quit out of frustration	1	Aug 13, 2023
wxPython - very small frame	1	Sep 21, 2006
Help with Huffman Compression	0	Jul 18, 2010
Trying to creade method .between()	3	Sep 24, 2023
Very simple finite automaton (?)	6	Sep 23, 2009

creating very small types

andrea

Jeremy Bowers

andrea

Michael Spencer

Dan Bishop

Bengt Richter

Michael Spencer

Bengt Richter

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads