[newbie] String to binary conversion

Mok-Kong Shen · Aug 6, 2012

If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.

M. K. Shen

Tobiah · Aug 6, 2012

The binascii module looks like it might have
something for you. I've never used it.

Tobiah

http://docs.python.org/library/binascii.html

Tobiah · Aug 6, 2012

The binascii module looks like it might have
something for you. I've never used it.

Having actually read some of that doc, I see
it's not what you want at all. Sorry.

Mok-Kong Shen · Aug 6, 2012

On 06.08.2012 22:59, Tobiah wrote:

The binascii module looks like it might have
something for you. I've never used it.

Thanks for the hint, but if I don't err, the module binascii doesn't
seem to work. I typed:

import binascii

and a line that's given as example in the document:

crc = binascii.crc32("hello")

but got the following error message:

TypeError: 'str' does not support the buffer interface.

The same error message appeared when I tried the other functions.

M. K. Shen
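
That TypeError is what Python 3 reports when a text string is passed where bytes are expected; the documentation example quoted above appears to have been written for Python 2. A minimal sketch, assuming Python 3 and assuming ASCII as the encoding for this particular string (the encoding choice is an assumption, not something stated in the post):

```python
import binascii

# In Python 3 the binascii functions operate on bytes, not on str.
data = "hello".encode("ascii")   # equivalent to the literal b"hello"
crc = binascii.crc32(data)
print(hex(crc))
```

Passing the literal `b"hello"` directly should work just as well.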

MRAB · Aug 6, 2012

If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.

Try this (Python 3, in which strings are Unicode):

>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd".encode("latin-1"))[0]
1684234849
>>> hex(_)
'0x64636261'

or this (Python 2, in which strings are bytestrings):

>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd")[0]
1684234849
>>> hex(_)
'0x64636261'
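
The examples above cover only the string-to-integer direction. A sketch of the way back, assuming Python 3 and the same little-endian, Latin-1 conventions as MRAB's example (the variable names `n` and `s` are illustrative only):

```python
import struct

# Forward direction, as in the example above.
n = struct.unpack("<I", "abcd".encode("latin-1"))[0]   # 1684234849

# Reverse direction: pack the integer into 4 little-endian bytes,
# then decode those bytes back to text.
s = struct.pack("<I", n).decode("latin-1")
print(s)   # abcd
```

Note that the "I" format is fixed at 4 bytes, so this only fits strings of exactly four 8-bit characters.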

Emile van Sebille · Aug 6, 2012

On 8/6/2012 1:46 PM Mok-Kong Shen said...

If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.

It's easy to write one:

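def str2val(s, _val=0):
    # Fold the characters into one integer, most significant byte first.
    if len(s) > 1: return str2val(s[1:], 256*_val + ord(s[0]))
    return 256*_val + ord(s[0])

def val2str(val, _str=""):
    # Peel off the low byte; >= 256 (not > 256) so that 256 itself recurses.
    if val >= 256: return val2str(val // 256, _str) + chr(val % 256)
    return _str + chr(val)

# print(...) with a single argument works in both Python 2 and Python 3.
print(str2val("abcd"))
print(val2str(str2val("abcd")))
print(val2str(str2val("good")))
print(val2str(str2val("longer")))
print(val2str(str2val("verymuchlonger")))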

Flavor to taste.

Emile
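
For comparison, Python 3.2 and later have built-in integer/bytes conversions that handle strings of any length without recursion. A minimal sketch, assuming Latin-1 as the 8-bit encoding and big-endian byte order (which matches the recursive functions above); the names `str_to_int` and `int_to_str` are hypothetical, not from the thread:

```python
def str_to_int(text, encoding="latin-1"):
    # Encode to bytes, then read the bytes as one big-endian integer.
    return int.from_bytes(text.encode(encoding), "big")

def int_to_str(value, encoding="latin-1"):
    # Work out how many bytes are needed, then convert back.
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, "big").decode(encoding)

print(str_to_int("abcd"))                            # 1633837924
print(int_to_str(str_to_int("verymuchlonger")))      # verymuchlonger
```

One caveat of this sketch: leading NUL characters are lost on the way back, because they contribute nothing to the integer's value.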

Steven D'Aprano · Aug 7, 2012

If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best obtain
that integer and from that integer backwards again obtain the original
string? Thanks in advance.

First you have to know the encoding, as that will define the integers you
get. There are many 8-bit encodings, but of course they can't all encode
arbitrary 4-character strings. Since there are tens of thousands of
different characters, and an 8-bit encoding can only code for 256 of
them, there are many strings that an encoding cannot handle.

For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
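
A short sketch of that limitation, assuming Python 3; the sample string is only an illustration. Latin-1 refuses the Greek letter, while the multi-byte encodings accept it at the cost of more than one byte per character:

```python
text = "aπ©d"

try:
    text.encode("latin-1")
except UnicodeEncodeError as exc:
    # exc.object[exc.start:exc.end] is the part that could not be encoded.
    print("latin-1 cannot encode:", exc.object[exc.start:exc.end])

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
print(len(text), len(utf8), len(utf16))   # 4 6 8
```

So with a multi-byte encoding, a 4-character string no longer maps onto exactly 32 bits.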

Sticking to one-byte encodings: since most of them are compatible with
ASCII, examples with "abcd" aren't very interesting:

py> 'abcd'.encode('latin1')
b'abcd'

Even though the bytes object b'abcd' is printed as if it were a string,
it is actually treated as an array of one-byte ints:

py> b'abcd'[0]
97
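
Because indexing and iterating a bytes object yields ints, the integer the original question asks about can be assembled by hand. A sketch assuming Python 3 and most-significant-byte-first order (the byte order is an assumption; the question does not specify one):

```python
data = "abcd".encode("latin-1")
print(list(data))                 # [97, 98, 99, 100]

# Combine the byte values, most significant byte first.
value = 0
for byte in data:
    value = (value << 8) | byte
print(hex(value))                 # 0x61626364

# bytes() accepts an iterable of ints in range(256).
print(bytes([97, 98, 99, 100]))   # b'abcd'
```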

Here's a more interesting example, using Python 3: it uses at least one
character (the Greek letter π) which cannot be encoded in Latin1, and two
which cannot be encoded in ASCII:

py> "aπ©d".encode('iso-8859-7')
b'a\xf0\xa9d'

Most encodings will round-trip successfully:

py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('iso-8859-7') == text
True

(although the ability to round-trip is a property of the encoding itself,
not of the encoding system).

Naturally if you encode with one encoding, and then decode with another,
you are likely to get different strings:

py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('latin1')
'að©Z!'
py> data.decode('iso-8859-14')
'aŵ©Z!'

Both the encode and decode methods take an optional argument, errors,
which specifies the error handling scheme. The default is errors='strict',
which raises an exception when a character cannot be converted. Others
include 'ignore' and 'replace'.

py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
b'aZ!'
py> 'aŵðπ©Z!'.encode('ascii', 'replace')
b'a????Z!'
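
The same handlers apply in the decoding direction. A minimal sketch, assuming Python 3 and reusing the ISO-8859-7 bytes from the earlier example, decoded (deliberately wrongly) as ASCII:

```python
data = b"a\xf0\xa9d"

try:
    data.decode("ascii")                  # errors="strict" is the default
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

print(data.decode("ascii", "ignore"))     # ad

# 'replace' substitutes U+FFFD (the replacement character) when decoding,
# rather than the '?' used when encoding.
replaced = data.decode("ascii", "replace")
print(replaced == "a\ufffd\ufffdd")       # True
```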

88888 Dihedral · Aug 7, 2012

On Tuesday, August 7, 2012 at 10:01:05 AM UTC+8, Steven D'Aprano wrote:

I think a UTF-8 or UTF-16 codec is necessary; just recall the Microsoft
encoding codecs of Win98 and NT, which taxed users all over the world.

In fact, for each character encoding, please develop a codec to UTF-8
or UTF-16.

That way one can convert between any two of the qualifying
character sets.
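
In Python 3 this pivot is already available, since str is Unicode and every codec converts to and from it. A minimal sketch of converting between two encodings that way; the helper name `transcode` is hypothetical:

```python
def transcode(data, source, target):
    # Hypothetical helper: decode to str (Unicode) first, then re-encode.
    return data.decode(source).encode(target)

greek = b"a\xf0\xa9d"    # "aπ©d" encoded as ISO-8859-7
utf8 = transcode(greek, "iso-8859-7", "utf-8")
print(utf8)              # b'a\xcf\x80\xc2\xa9d'

# Converting back recovers the original bytes.
print(transcode(utf8, "utf-8", "iso-8859-7") == greek)   # True
```

The conversion raises UnicodeEncodeError if the target encoding has no representation for some character, unless an errors handler is passed to encode.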
