[newbie] String to binary conversion

Discussion in 'Python' started by Mok-Kong Shen, Aug 6, 2012.

  1. If I have a string "abcd" then, with 8-bit encoding of each character,
    there is a corresponding 32-bit binary integer. How could I best
    obtain that integer and from that integer backwards again obtain the
    original string? Thanks in advance.

    M. K. Shen
    Mok-Kong Shen, Aug 6, 2012
    #1

  2. Tobiah (Guest)

    The binascii module looks like it might have
    something for you. I've never used it.

    Tobiah

    http://docs.python.org/library/binascii.html

    On 08/06/2012 01:46 PM, Mok-Kong Shen wrote:
    >
    > If I have a string "abcd" then, with 8-bit encoding of each character,
    > there is a corresponding 32-bit binary integer. How could I best
    > obtain that integer and from that integer backwards again obtain the
    > original string? Thanks in advance.
    >
    > M. K. Shen
    Tobiah, Aug 6, 2012
    #2

  3. Tobiah (Guest)

    On 08/06/2012 01:59 PM, Tobiah wrote:
    > The binascii module looks like it might have
    > something for you. I've never used it.


    Having actually read some of that doc, I see
    it's not what you want at all. Sorry.
    Tobiah, Aug 6, 2012
    #3
  4. On 06.08.2012 22:59, Tobiah wrote:
    > The binascii module looks like it might have
    > something for you. I've never used it.


    Thanks for the hint, but if I don't err, the module binascii doesn't
    seem to work. I typed:

    import binascii

    and a line that's given as example in the document:

    crc = binascii.crc32("hello")

    but got the following error message:

    TypeError: 'str' does not support the buffer interface.

    The same error message appeared when I tried the other functions.

    M. K. Shen
    Mok-Kong Shen, Aug 6, 2012
    #4
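A note on the TypeError above: the message suggests the code was run under Python 3, where the binascii functions expect bytes rather than str, while the documented example was written for Python 2. A minimal sketch, assuming Python 3 and Latin-1 as the 8-bit encoding (the variable names are illustrative only):

```python
import binascii

# In Python 3, encode the text to bytes first (or use a bytes literal).
data = "hello".encode("latin-1")
crc = binascii.crc32(data)  # accepted: the argument is bytes, not str

# hexlify/unhexlify can also serve the original question:
# bytes -> hex digits -> integer, and back again.
raw = "abcd".encode("latin-1")
value = int(binascii.hexlify(raw), 16)         # big-endian: 0x61626364
hex_digits = "%0*x" % (2 * len(raw), value)    # pad to two hex digits per byte
restored = binascii.unhexlify(hex_digits.encode("ascii")).decode("latin-1")

print(hex(value), restored)
```

So the module does work, although going through hex digits is a roundabout way to reach the integer.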
  5. MRAB (Guest)

    On 06/08/2012 21:46, Mok-Kong Shen wrote:
    >
    > If I have a string "abcd" then, with 8-bit encoding of each character,
    > there is a corresponding 32-bit binary integer. How could I best
    > obtain that integer and from that integer backwards again obtain the
    > original string? Thanks in advance.
    >

    Try this (Python 3, in which strings are Unicode):

    >>> import struct
    >>> # For a little-endian integer
    >>> struct.unpack("<I", "abcd".encode("latin-1"))[0]
    1684234849
    >>> hex(_)
    '0x64636261'

    or this (Python 2, in which strings are bytestrings):

    >>> import struct
    >>> # For a little-endian integer
    >>> struct.unpack("<I", "abcd")[0]
    1684234849
    >>> hex(_)
    '0x64636261'
    MRAB, Aug 6, 2012
    #5
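The struct format "<I" is fixed at 4 bytes and the post shows only one direction. For other lengths and for the way back, here is a minimal sketch using int.from_bytes and int.to_bytes (Python 3.2 and later). The helper names str_to_int and int_to_str are made up for illustration, and Latin-1 is assumed as the 8-bit encoding:

```python
import struct


def str_to_int(text, encoding="latin-1", byteorder="big"):
    """Encode text to bytes and read those bytes as one unsigned integer."""
    return int.from_bytes(text.encode(encoding), byteorder)


def int_to_str(value, length, encoding="latin-1", byteorder="big"):
    """Write value as exactly `length` bytes and decode them back to text."""
    return value.to_bytes(length, byteorder).decode(encoding)


n = str_to_int("abcd")  # 0x61626364 with big-endian byte order
assert int_to_str(n, 4) == "abcd"

# The little-endian value from the struct example converts back with struct.pack.
assert struct.pack("<I", 1684234849).decode("latin-1") == "abcd"
```

The length has to be passed in when converting back, because an integer does not remember leading zero bytes.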
  6. On 8/6/2012 1:46 PM Mok-Kong Shen said...
    >
    > If I have a string "abcd" then, with 8-bit encoding of each character,
    > there is a corresponding 32-bit binary integer. How could I best
    > obtain that integer and from that integer backwards again obtain the
    > original string? Thanks in advance.


    It's easy to write one:

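    def str2val(s, _val=0):
        # Fold the characters into a base-256 integer, first character
        # most significant. Assumes every ord() is below 256.
        if len(s) > 1:
            return str2val(s[1:], 256 * _val + ord(s[0]))
        return 256 * _val + ord(s[0])


    def val2str(val, _str=""):
        # Peel off the low byte until what remains fits in one byte.
        # Use >= and integer division so 256 and large values work.
        if val >= 256:
            return val2str(val // 256, _str) + chr(val % 256)
        return _str + chr(val)


    # Single-argument print() runs under both Python 2 and Python 3.
    print(str2val("abcd"))
    print(val2str(str2val("abcd")))
    print(val2str(str2val("good")))
    print(val2str(str2val("longer")))
    print(val2str(str2val("verymuchlonger")))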

    Flavor to taste.

    Emile
    Emile van Sebille, Aug 6, 2012
    #6
  7. On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:

    > If I have a string "abcd" then, with 8-bit encoding of each character,
    > there is a corresponding 32-bit binary integer. How could I best obtain
    > that integer and from that integer backwards again obtain the original
    > string? Thanks in advance.


    First you have to know the encoding, as that will define the integers you
    get. There are many 8-bit encodings, but of course they can't all encode
    arbitrary 4-character strings. Since there are tens of thousands of
    different characters, and an 8-bit encoding can only code for 256 of
    them, there are many strings that an encoding cannot handle.

    For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
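To make both points concrete, here is a small sketch, assuming Python 3. The choice of "é" and of the cp437 code page is mine, for illustration only:

```python
# The same character maps to different byte values in different 8-bit encodings,
# so the integer you get depends on which encoding you pick.
for enc in ("latin-1", "cp437"):
    print(enc, "é".encode(enc)[0])

text = "aπ©d"
try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    # exc.object[exc.start:exc.end] is the part that could not be encoded.
    print("latin-1 cannot encode", repr(exc.object[exc.start:exc.end]))

# Multi-byte encodings cover all of Unicode, at the cost of variable or larger size.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # explicit byte order, so no BOM is prepended
print(len(text), len(utf8), len(utf16))
```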

    Sticking to one-byte encodings: since most of them are compatible with
    ASCII, examples with "abcd" aren't very interesting:

    py> 'abcd'.encode('latin1')
    b'abcd'

    Even though the bytes object b'abcd' is printed as if it were a string,
    it is actually treated as an array of one-byte ints:

    py> b'abcd'[0]
    97

    Here's a more interesting example, using Python 3: it uses at least one
    character (the Greek letter π) which cannot be encoded in Latin1, and two
    which cannot be encoded in ASCII:

    py> "aπ©d".encode('iso-8859-7')
    b'a\xf0\xa9d'

    Most encodings will round-trip successfully:

    py> text = 'aπ©Z!'
    py> data = text.encode('iso-8859-7')
    py> data.decode('iso-8859-7') == text
    True


    (although the ability to round-trip is a property of the encoding itself,
    not of the encoding system).

    Naturally if you encode with one encoding, and then decode with another,
    you are likely to get different strings:

    py> text = 'aπ©Z!'
    py> data = text.encode('iso-8859-7')
    py> data.decode('latin1')
    'að©Z!'
    py> data.decode('iso-8859-14')
    'aŵ©Z!'


    Both the encode and decode methods take an optional argument, errors,
    which specifies the error handling scheme. The default is errors='strict',
    which raises an exception. Others include 'ignore' and 'replace'.

    py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
    b'aZ!'
    py> 'aŵðπ©Z!'.encode('ascii', 'replace')
    b'a????Z!'
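For comparison, a short sketch (Python 3 assumed) of what the default 'strict' handler does with the same string, and why the lossy handlers cannot round-trip:

```python
text = "aŵðπ©Z!"

try:
    text.encode("ascii")  # errors='strict' is the default
except UnicodeEncodeError as exc:
    print("strict raised:", exc.reason)

# 'replace' never raises, but the original characters cannot be recovered.
lossy = text.encode("ascii", "replace")
print(lossy.decode("ascii") == text)
```

If the goal is to get the original string back from the integer, 'strict' is the safer choice, since it fails loudly instead of silently dropping characters.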



    --
    Steven
    Steven D'Aprano, Aug 7, 2012
    #7
  8. On Tuesday, August 7, 2012 at 10:01:05 AM UTC+8, Steven D'Aprano wrote:
    > On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:
    >
    > > If I have a string "abcd" then, with 8-bit encoding of each character,
    > > there is a corresponding 32-bit binary integer. How could I best obtain
    > > that integer and from that integer backwards again obtain the original
    > > string? Thanks in advance.
    >
    > First you have to know the encoding, as that will define the integers you
    > get. There are many 8-bit encodings, but of course they can't all encode
    > arbitrary 4-character strings. Since there are tens of thousands of
    > different characters, and an 8-bit encoding can only code for 256 of
    > them, there are many strings that an encoding cannot handle.
    >
    > For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
    >
    > Sticking to one-byte encodings: since most of them are compatible with
    > ASCII, examples with "abcd" aren't very interesting:
    >
    > py> 'abcd'.encode('latin1')
    > b'abcd'
    >
    > Even though the bytes object b'abcd' is printed as if it were a string,
    > it is actually treated as an array of one-byte ints:
    >
    > py> b'abcd'[0]
    > 97
    >
    > Here's a more interesting example, using Python 3: it uses at least one
    > character (the Greek letter π) which cannot be encoded in Latin1, and two
    > which cannot be encoded in ASCII:
    >
    > py> "aπ©d".encode('iso-8859-7')
    > b'a\xf0\xa9d'
    >
    > Most encodings will round-trip successfully:
    >
    > py> text = 'aπ©Z!'
    > py> data = text.encode('iso-8859-7')
    > py> data.decode('iso-8859-7') == text
    > True
    >
    > (although the ability to round-trip is a property of the encoding itself,
    > not of the encoding system).
    >
    > Naturally if you encode with one encoding, and then decode with another,
    > you are likely to get different strings:
    >
    > py> text = 'aπ©Z!'
    > py> data = text.encode('iso-8859-7')
    > py> data.decode('latin1')
    > 'að©Z!'
    > py> data.decode('iso-8859-14')
    > 'aŵ©Z!'
    >
    > Both the encode and decode methods take an optional argument, errors,
    > which specify the error handling scheme. The default is errors='strict',
    > which raises an exception. Others include 'ignore' and 'replace'.
    >
    > py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
    > b'aZ!'
    > py> 'aŵðπ©Z!'.encode('ascii', 'replace')
    > b'a????Z!'
    >
    > --
    > Steven






    I think a UTF-8 or UTF-16 codec is necessary. Just recall the Microsoft
    encoding codecs of Win98 and NT, which collected taxes all over the world.

    For each character encoding, one should develop a codec to and from
    UTF-8 or UTF-16.

    That way one can convert between any two of the supported character
    sets.
    88888 Dihedral, Aug 7, 2012
    #8
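In Python 3 the pivot described above is the str type itself: decode from one encoding, then encode to another. A minimal sketch, where transcode is a hypothetical helper name rather than a standard-library function:

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes by decoding to str (Unicode) and encoding again."""
    return data.decode(source_encoding).encode(target_encoding)


greek = "aπ©d".encode("iso-8859-7")
as_utf8 = transcode(greek, "iso-8859-7", "utf-8")
back = transcode(as_utf8, "utf-8", "iso-8859-7")
print(greek, as_utf8, back == greek)
```

Going to UTF-8 always succeeds for validly decoded input. Going the other way raises UnicodeEncodeError under the default 'strict' handler if the target encoding lacks a character.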
