Re: Working with bytes.

Discussion in 'Python' started by Adam T. Gautier, Apr 3, 2004.

  1. I came up with a solution using the binascii module's hexlify method.
    Thanks

    Adam T. Gautier wrote:

    > I have been unable to solve a problem. I am working with MD5
    > signatures trying to put these in a database. The MD5 signatures are
    > not generated using the python md5 module but an external application
    > that is producing the valid 16 byte signature formats. Anyway, these
    > 16 byte signatures are not necessarily valid strings. How do I
    > manipulate the bytes? I need to concatenate the bytes with a SQL
    > statement which is a string. This works fine for most of the md5
    > signatures but some blow up with a TypeError, because there is a NULL
    > byte or something else. So I guess my ultimate question is how do I
    > get a prepared SQL statement to accept a series of bytes? How do I
    > convert the bytes to a valid string like:
    >
    > 'x%L9d\340\316\262\363\037\311\345<\262\357\215'
    >
    > that can be concatenated?
    >
    > Thanks
    >
    Adam T. Gautier, Apr 3, 2004
    #1
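A minimal sketch of the hexlify approach, written for Python 3 rather than the Python 2 of the thread. The `digest` value, the table name and the use of `sqlite3` are assumptions made for illustration; the original poster's database and driver are not named. It also shows that a parameter placeholder lets the driver take the raw bytes directly, so nothing has to be concatenated into the SQL string.

```python
import binascii
import sqlite3

# Hypothetical 16-byte signature containing NUL and high-bit bytes.
digest = bytes([0x00, 0x78, 0x25, 0xE0, 0xCE, 0xB2, 0xF3, 0x1F,
                0xC9, 0xE5, 0x3C, 0xB2, 0xEF, 0x8D, 0x00, 0xFF])

# hexlify turns every byte into two hex characters (4 bits per character).
hex_digest = binascii.hexlify(digest).decode("ascii")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signatures (md5_hex TEXT, md5_raw BLOB)")

# Placeholders pass the values separately from the SQL text, so the
# raw bytes never need to be a "valid string".
conn.execute("INSERT INTO signatures VALUES (?, ?)", (hex_digest, digest))

row = conn.execute("SELECT md5_hex, md5_raw FROM signatures").fetchone()
restored = binascii.unhexlify(row[0])
print(hex_digest, restored == digest)
```

The hex column costs 32 characters per signature; the BLOB column stores the 16 bytes unchanged.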

  2. "Adam T. Gautier" wrote:

    >I came up with a solution using the binascii module's hexlify method.


    That is the most obvious method, I think. However, the code below
    stores 7 bits per byte and still remains ascii-compliant (the
    binascii.hexlify method stores 4 bits per byte).

    Anton

    from itertools import islice

    def _bits(i):
        return [('01'[i>>j & 1]) for j in range(8)][::-1]

    _table = dict([(chr(i),_bits(i)) for i in range(256)])

    def _bitstream(bytes):
        for byte in bytes:
            for bit in _table[byte]:
                yield bit

    def _dropfirst(gen):
        while 1:
            gen.next()
            for x in islice(gen,7):
                yield x

    def sevens(bytes):
        """ stream normal bytes to bytes where bit 8 is always 1"""
        gen = _bitstream(bytes)
        while 1:
            R = list(islice(gen,7))
            if not R: break
            s = '1'+ "".join(R) + '0' * (7-len(R))
            yield chr(int(s,2))

    def eights(bytes,n):
        """ the reverse of the sevens function :) """
        gen = _bitstream(bytes)
        df = _dropfirst(gen)
        for i in xrange(n):
            s = ''.join(islice(df,8))
            yield chr(int(s,2))

    def test():
        from random import randint
        size = 40
        R = [chr(randint(0,255)) for i in xrange(size)]
        bytes = ''.join(R)
        sv = ''.join(sevens(bytes))
        check = ''.join(eights(sv,size))
        assert check == bytes
        print sv

    if __name__ == '__main__':
        test()

    sample output:

    Ÿæ‰®ÑëÍ¡¾÷ÁóÆú½Þ·ú˂זðÚ¿‹ˆ²ªÅíž›¾Ÿ£•Í£¬ô²ŸÕØ
    Anton Vredegoor, Apr 3, 2004
    #2
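For readers on Python 3, the same 7-bits-per-byte packing can be sketched with integer arithmetic instead of a bit-string generator. This is a re-expression under assumptions, not Anton's code: the function names are reused for readability, and the padding rule (zero bits appended at the end) is copied from the post above.

```python
def sevens(data: bytes) -> bytes:
    """Pack the bits of data into bytes whose high bit is always 1."""
    nbits = len(data) * 8
    value = int.from_bytes(data, "big")
    pad = -nbits % 7                  # zero bits appended to fill the last chunk
    value <<= pad
    nchunks = (nbits + pad) // 7
    out = bytearray()
    for k in range(nchunks - 1, -1, -1):
        out.append(0x80 | ((value >> (7 * k)) & 0x7F))
    return bytes(out)


def eights(packed: bytes, n: int) -> bytes:
    """Reverse of sevens; n is the original length in bytes."""
    value = 0
    for b in packed:
        value = (value << 7) | (b & 0x7F)   # drop the marker bit
    extra = len(packed) * 7 - n * 8         # padding bits added by sevens
    return (value >> extra).to_bytes(n, "big")


data = bytes(range(40))
packed = sevens(data)
print(len(packed), eights(packed, len(data)) == data)
```

Every output byte is 0x80 or higher, which is outside the 7-bit ASCII range.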

  3. >>>>> (Anton Vredegoor) (AV) wrote:

    AV> "Adam T. Gautier" wrote:
    >> I came up with a solution using the binascii module's hexlify method.


    AV> That is the most obvious method, I think. However, the code below
    AV> stores 7 bits per byte and still remains ascii-compliant (the
    AV> binascii.hexlify method stores 4 bits per byte).
    .....
    AV> sample output:

    AV> Ÿæ‰®Ñëá¾÷ÃóÆú½Þ·ú˂זðÚ¿‹ˆ²ªÅíž›¾Ÿ£•Ã£¬ô²ŸÂÕØ

    Which includes quite a few NON-ASCII characters.
    So what is ASCII-compliant about it?
    You can't store 7 bits per byte and still be ASCII-compliant. At least if
    you don't want to include control characters.
    --
    Piet van Oostrum
    URL: http://www.cs.uu.nl/~piet [PGP]
    Piet van Oostrum, Apr 4, 2004
    #3
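The objection can be checked by counting. The sketch below (Python 3, variable names chosen here for illustration) splits the 128 ASCII codes into control codes and printable characters and computes how many bits one printable character can carry at most.

```python
import math

# ASCII control codes are 0..31 plus 127 (DEL).
control = [c for c in range(128) if c < 32 or c == 127]

# Printable characters excluding the space: codes 33..126.
printable = [c for c in range(128) if 32 < c < 127]

# Upper bound on the bits of payload a single printable character can hold.
bits_per_char = math.log2(len(printable))

print(len(control), len(printable), round(bits_per_char, 2))
```

Seven full bits would need all 128 codes, control codes included, so any printable-only scheme has to stay below that.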
  4. Piet van Oostrum wrote:

    >AV>


    [snip]

    >Which includes quite a few NON-ASCII characters.
    >So what is ASCII-compliant about it?
    >You can't store 7 bits per byte and still be ASCII-compliant. At least if
    >you don't want to include control characters.


    Thanks, and yes you are right. I thought that getting rid of control
    codes just meant switching to the high bit codes, but of course
    control codes are part of the lower bit population and can't be
    removed that way. Worse than that: high bit codes are not
    ASCII-compliant at all!

    However, the code below has the 8th and 7th bits always set to 0 and 1
    respectively, so it should produce ASCII-compliant output using 6 bits
    per byte.

    I wonder whether it would be possible to use more than six bits per
    byte but less than seven? There seem to be some character codes left
    and these could be used too?

    Anton

    from itertools import islice

    def _bits(i):
        return [('01'[i>>j & 1]) for j in range(8)][::-1]

    _table = dict([(chr(i),_bits(i)) for i in range(256)])

    def _bitstream(bytes):
        for byte in bytes:
            for bit in _table[byte]:
                yield bit

    def _drop_first_two(gen):
        while 1:
            gen.next()
            gen.next()
            for x in islice(gen,6):
                yield x

    def sixes(bytes):
        """ stream normal bytes to bytes where bits 8,7 are 0,1 """
        gen = _bitstream(bytes)
        while 1:
            R = list(islice(gen,6))
            if not R: break
            s = '01'+ "".join(R) + '0' * (6-len(R))
            yield chr(int(s,2))

    def eights(bytes,n):
        """ the reverse of the sixes function :-| """
        gen = _bitstream(bytes)
        df = _drop_first_two(gen)
        for i in xrange(n):
            s = ''.join(islice(df,8))
            yield chr(int(s,2))

    def test():
        from random import randint
        size = 20
        R = [chr(randint(0,255)) for i in xrange(size)]
        bytes = ''.join(R)
        sx = ''.join(sixes(bytes))
        check = ''.join(eights(sx,size))
        assert check == bytes
        print sx

    if __name__ == '__main__':
        test()

    output:

    VMtdh[LII~Qexdyg}xFRhXRIVx
    Anton Vredegoor, Apr 5, 2004
    #4
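A Python 3 sketch of the same 6-bit scheme, using integer arithmetic. The names `sixes` and `eights_from_sixes` are chosen here for illustration, and the trailing zero padding follows the post above. It makes the output range easy to inspect: every byte has the form 01xxxxxx, so it lies between 0x40 and 0x7F inclusive.

```python
def sixes(data: bytes) -> bytes:
    """Pack the bits of data into bytes of the form 01xxxxxx."""
    nbits = len(data) * 8
    value = int.from_bytes(data, "big")
    pad = -nbits % 6                  # zero bits appended to fill the last chunk
    value <<= pad
    nchunks = (nbits + pad) // 6
    return bytes(0x40 | ((value >> (6 * k)) & 0x3F)
                 for k in range(nchunks - 1, -1, -1))


def eights_from_sixes(packed: bytes, n: int) -> bytes:
    """Reverse of sixes; n is the original length in bytes."""
    value = 0
    for b in packed:
        value = (value << 6) | (b & 0x3F)   # drop the two marker bits
    extra = len(packed) * 6 - n * 8         # padding bits added by sixes
    return (value >> extra).to_bytes(n, "big")


data = bytes(range(20))
packed = sixes(data)
print(len(packed), eights_from_sixes(packed, len(data)) == data)

# Six payload bits that are all 1 give 0b01111111 == 0x7F.
print(sixes(b"\xff\xff\xff"))
```

The last line shows the top of the range being reached: input bits that are all 1 produce the byte 0x7F.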
  5. Jason Harper

    Anton Vredegoor wrote:
    > I wonder whether it would be possible to use more than six bits per
    > byte but less than seven? There seem to be some character codes left
    > and these could be used too?


    Look up Base85 coding (a standard part of PostScript) for an example of
    how this can be done - 4 bytes encoded per 5 characters of printable ASCII.
    Jason Harper
    Jason Harper, Apr 5, 2004
    #5
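The 4-bytes-into-5-characters ratio follows from simple arithmetic: five base-85 digits are just enough to cover every 32-bit value. The sketch below checks that and then uses `base64.a85encode`, which exists in the Python 3.4+ standard library (it was not available when this thread was written). The sample values are arbitrary.

```python
import base64

# 85 is the smallest base whose 5-digit numbers cover all 32-bit values.
print(84**5 < 2**32 <= 85**5)

chunk = b"\x12\x34\x56\x78"
encoded = base64.a85encode(chunk)
print(encoded, len(encoded))

# 16 bytes (the size of an MD5 digest) become 20 characters, provided no
# 4-byte group is all zero; Ascii85 abbreviates such a group to b"z".
digest_like = bytes(range(16))
print(len(base64.a85encode(digest_like)))
```

That is 6.4 bits of payload per character, between the 6 bits of the scheme above and the 7 that were ruled out.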
  6. >>>>> (Anton Vredegoor) (AV) wrote:

    AV> Piet van Oostrum wrote:

    >> Which includes quite a few NON-ASCII characters.
    >> So what is ASCII-compliant about it?
    >> You can't store 7 bits per byte and still be ASCII-compliant. At least if
    >> you don't want to include control characters.


    AV> Thanks, and yes you are right. I thought that getting rid of control
    AV> codes just meant switching to the high bit codes, but of course
    AV> control codes are part of the lower bit population and can't be
    AV> removed that way. Worse than that: high bit codes are not
    AV> ASCII-compliant at all!

    AV> However the code below has the 8th and 7th bit always set to 0 and 1
    AV> respectively, so it should produce ASCII-compliant output using 6 bits
    AV> per byte.

    Except that the highest code you get is 0177 which is DEL, and is also a
    control code. If you store 6 bits per byte that is also what BASE64 does,
    so why reinvent the wheel?

    AV> I wonder whether it would be possible to use more than six bits per
    AV> byte but less than seven? There seem to be some character codes left
    AV> and these could be used too?

    Yes, you could in principle use 94 characters. There is a scheme called
    btoa that encodes 4 bytes into 5 ASCII characters by using BASE85, but I
    have never seen a Python implementation of it. It shouldn't be difficult,
    however.
    --
    Piet van Oostrum
    URL: http://www.cs.uu.nl/~piet [PGP]
    Piet van Oostrum, Apr 5, 2004
    #6
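For comparison, a short Python 3 sketch of the standard-library Base64 route mentioned here. The 16-byte sample value is an arbitrary stand-in for an MD5 signature.

```python
import base64

digest_like = bytes(range(16))        # stand-in for a 16-byte signature

encoded = base64.b64encode(digest_like)
print(encoded, len(encoded))

# Round trip back to the original bytes.
print(base64.b64decode(encoded) == digest_like)

# Octal 0177 is decimal 127, the DEL control code.
print(0o177 == 127)
```

Base64 also carries 6 bits per character, but its alphabet (letters, digits, `+`, `/`, and `=` for padding) contains no control codes.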
  7. Piet van Oostrum writes:

    > Yes, you could in principle use 94 characters. There is a scheme called
    > btoa that encodes 4 bytes into 5 ASCII characters by using BASE85, but I
    > have never seen a Python implementation of it. It shouldn't be difficult,
    > however.


    Is that the same as PDF/PostScript Ascii85? If so, there's an
    implementation somewhere in reportlab, IIRC.

    Bernhard

    --
    Intevation GmbH http://intevation.de/
    Skencil http://sketch.sourceforge.net/
    Thuban http://thuban.intevation.org/
    Bernhard Herzog, Apr 5, 2004
    #7
  8. >>>>> Bernhard Herzog (BH) wrote:

    BH> Piet van Oostrum writes:
    >> Yes, you could in principle use 94 characters. There is a scheme called
    >> btoa that encodes 4 bytes into 5 ASCII characters by using BASE85, but I
    >> have never seen a Python implementation of it. It shouldn't be difficult,
    >> however.


    BH> Is that the same as PDF/PostScript Ascii85? If so, there's an
    BH> implementation somewhere in reportlab, IIRC.

    They are slightly different AFAIK. PostScript uses '~' and btoa uses 'x'
    as the terminating character. For the OP's use it doesn't matter of course.
    --
    Piet van Oostrum
    URL: http://www.cs.uu.nl/~piet [PGP]
    Piet van Oostrum, Apr 5, 2004
    #8
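The PostScript/PDF flavour can be tried with the Python 3 standard library: `base64.a85encode` takes an `adobe` flag that adds the `<~` and `~>` framing. This is a sketch only; it does not cover the btoa variant, and the sample payload is made up.

```python
import base64

payload = b"sixteen byte md5"         # 16 arbitrary bytes

# Adobe (PostScript/PDF) framing wraps the data in <~ ... ~>.
framed = base64.a85encode(payload, adobe=True)
print(framed)

# The decoder needs the same flag to accept the framing.
print(base64.a85decode(framed, adobe=True) == payload)
```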
  9. Jason Harper wrote:

    >Anton Vredegoor wrote:
    >> I wonder whether it would be possible to use more than six bits per
    >> byte but less than seven? There seem to be some character codes left
    >> and these could be used too?

    >
    >Look up Base85 coding (a standard part of PostScript) for an example of
    >how this can be done - 4 bytes encoded per 5 characters of printable ASCII.


    Thanks to you and Piet for mentioning this. I found another
    interesting application of Base85 encoding: it is used in a scheme to
    encode IPv6 addresses (which use 128 bits). Since an MD5 digest is 16
    bytes (== 128 bits), this scheme could be used for it too. See

    http://www.faqs.org/rfcs/rfc1924.html

    for the details.

    Anton

    from string import digits, ascii_letters

    _rfc1924_chars = digits+ascii_letters+'!#$%&()*+-;<=>?@^_`{|}~'
    _rfc1924_table = dict([(c,i) for i,c in enumerate(_rfc1924_chars)])
    _rfc1924_bases = [85L**i for i in range(20)]

    def bytes_to_rfc1924(sixteen):
        res = []
        i = 0L
        for byte in sixteen:
            i <<= 8
            i |= ord(byte)
        for j in range(20):
            i,k = divmod(i,85)
            res.append(_rfc1924_chars[k])
        return "".join(res)

    def rfc1924_to_bytes(twenty):
        res = []
        i = 0L
        for b,byte in zip(_rfc1924_bases,twenty):
            i += b*_rfc1924_table[byte]
        for j in range(16):
            k = i & 255
            res.append(chr(k))
            i >>= 8
        res.reverse()
        return "".join(res)

    def test():
        import md5

        #md5.digest returns 16 bytes == 128 bits, an ipv6 address
        #also uses 128 bits (I don't know which format so I'm using md5
        #as a dummy placeholder to get 16 bytes of 'random' data)

        bytes = md5.new('9034572345asdf').digest()
        r = bytes_to_rfc1924(bytes)
        print r
        check = rfc1924_to_bytes(r)
        assert bytes == check

    if __name__=='__main__':
        test()

    output:

    k#llNFNo4sYFxKn*J<lB
    Anton Vredegoor, Apr 8, 2004
    #9
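Why 20 characters are enough for 16 bytes can be checked directly: 20 base-85 digits cover every 128-bit value, while 19 do not. A small Python 3 sketch, using `hashlib` in place of the Python 2 `md5` module from the post:

```python
import hashlib
import math

# 19 digits of base 85 fall short of 128 bits; 20 digits suffice.
print(85**19 < 2**128 <= 85**20)

# The same bound via logarithms: 128 bits / log2(85) bits per digit.
print(math.ceil(128 / math.log2(85)))

# An MD5 digest is 16 bytes, so it fits the same 128-bit slot.
digest = hashlib.md5(b"9034572345asdf").digest()
print(len(digest) * 8)
```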
  10. (Anton Vredegoor) wrote:

    > http://www.faqs.org/rfcs/rfc1924.html


    Replying to my own post. I was using lowercase letters before
    uppercase and did some other non-compliant things. Because I was using
    random data I didn't notice. The code below should reproduce the
    RFC 1924 example.

    Anton

    from binascii import hexlify
    from string import digits, ascii_lowercase, ascii_uppercase

    _rfc1924_letters = ascii_uppercase + ascii_lowercase
    _rfc1924_chars = digits+_rfc1924_letters+'!#$%&()*+-;<=>?@^_`{|}~'
    _rfc1924_table = dict([(c,i) for i,c in enumerate(_rfc1924_chars)])
    _rfc1924_bases = [85L**i for i in range(19,-1,-1)]

    def bytes_to_rfc1924(sixteen):
        res = []
        i = 0L
        for byte in sixteen:
            i <<= 8
            i |= ord(byte)
        for j in range(20):
            i,k = divmod(i,85)
            res.append(_rfc1924_chars[k])
        res.reverse()
        return "".join(res)

    def rfc1924_to_bytes(twenty):
        res = []
        i = 0L
        for b,byte in zip(_rfc1924_bases,twenty):
            i += b * _rfc1924_table[byte]
        for j in range(16):
            i,k = divmod(i,256)
            res.append(chr(k))
        res.reverse()
        return "".join(res)

    def bytes_as_ipv6(bytes):
        addr = [bytes[i:i+2] for i in range(0,16,2)]
        return ":".join(map(hexlify,addr))

    def test():
        s = "4)+k&C#VzJ4br>0wv%Yp"
        bytes = rfc1924_to_bytes(s)
        check = bytes_to_rfc1924(bytes)
        assert s == check
        addr = bytes_as_ipv6(bytes)
        print addr

    if __name__=='__main__':
        test()

    output:

    1080:0000:0000:0000:0008:0800:200c:417a
    Anton Vredegoor, Apr 8, 2004
    #10
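A Python 3 sketch of the same RFC 1924 conversion, for readers who cannot run the Python 2 code above. It is a port under assumptions, not Anton's original: the upper-case constant names are introduced here, and `int.from_bytes`, `int.to_bytes` and the `ipaddress` module replace the manual shifting and `hexlify` formatting.

```python
import ipaddress
import string

# RFC 1924 alphabet: digits, upper case, lower case, then 23 symbols.
RFC1924_CHARS = (string.digits + string.ascii_uppercase +
                 string.ascii_lowercase + "!#$%&()*+-;<=>?@^_`{|}~")
RFC1924_INDEX = {c: i for i, c in enumerate(RFC1924_CHARS)}


def bytes_to_rfc1924(sixteen: bytes) -> str:
    """Encode 16 bytes as 20 base-85 digits, most significant first."""
    value = int.from_bytes(sixteen, "big")
    digits = []
    for _ in range(20):
        value, rem = divmod(value, 85)
        digits.append(RFC1924_CHARS[rem])
    return "".join(reversed(digits))


def rfc1924_to_bytes(twenty: str) -> bytes:
    """Decode 20 base-85 digits back into 16 bytes."""
    value = 0
    for ch in twenty:
        value = value * 85 + RFC1924_INDEX[ch]
    # Raises OverflowError if the digits encode more than 128 bits.
    return value.to_bytes(16, "big")


addr = ipaddress.IPv6Address("1080:0:0:0:8:800:200C:417A")
encoded = bytes_to_rfc1924(addr.packed)
print(encoded)
print(ipaddress.IPv6Address(rfc1924_to_bytes(encoded)).exploded)
```

The same two functions work unchanged on a 16-byte MD5 digest, since it occupies the same 128 bits.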