If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best obtain
that integer, and then recover the original string from that integer?
Thanks in advance.
First you have to know the encoding, as that will define the integers you
get. There are many 8-bit encodings, but of course none of them can encode
arbitrary 4-character strings. Since Unicode defines well over a hundred
thousand characters, and an 8-bit encoding can only code for 256 of
them, there are many strings that any given encoding cannot handle.
For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
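As a rough illustration of why the "one character, one byte" assumption breaks down with multi-byte encodings, here is a small sketch. The sample string is just an example; 'utf-16-le' is chosen so that no byte-order mark is added to the output.

```python
# Four characters, two of them outside ASCII.
text = "aπ©d"

utf8 = text.encode("utf-8")        # variable width: 1 to 4 bytes per character
utf16 = text.encode("utf-16-le")   # 2 bytes each for these characters, no BOM

# Character count, then byte counts under each encoding.
print(len(text), len(utf8), len(utf16))  # 4 6 8
```

So a 4-character string only maps onto a 32-bit integer when every character fits in a single byte of the chosen encoding.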
Sticking to one-byte encodings: since most of them are compatible with
ASCII, examples with "abcd" aren't very interesting:
py> 'abcd'.encode('latin1')
b'abcd'
Even though the bytes object b'abcd' is printed as if it were a string,
it is actually treated as an array of one-byte ints:
py> b'abcd'[0]
97
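To get the single 32-bit integer the question asks about, one possible approach is int.from_bytes and int.to_bytes. This is only a sketch: the helper names text_to_int and int_to_text are made up for illustration, and big-endian byte order and Latin1 are assumptions (pick whatever byte order and encoding suit your data).

```python
def text_to_int(text, encoding="latin1"):
    """Encode text to bytes, then read those bytes as one big-endian integer."""
    data = text.encode(encoding)
    return int.from_bytes(data, byteorder="big")


def int_to_text(number, length, encoding="latin1"):
    """Reverse of text_to_int; length is the number of bytes to produce."""
    data = number.to_bytes(length, byteorder="big")
    return data.decode(encoding)


n = text_to_int("abcd")
print(hex(n))             # 0x61626364
print(int_to_text(n, 4))  # abcd
```

The length has to be passed back in explicitly, because the integer alone does not remember leading zero bytes.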
Here's a more interesting example, using Python 3: it uses at least one
character (the Greek letter π) which cannot be encoded in Latin1, and two
which cannot be encoded in ASCII:
py> "aπ©d".encode('iso-8859-7')
b'a\xf0\xa9d'
Most encodings will round-trip successfully:
py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('iso-8859-7') == text
True
(although whether a round trip succeeds depends on the particular
encoding and the text, not on the encode/decode machinery).
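If you want to check in advance whether a given string survives a particular encoding, something along these lines might do. The function name round_trips is hypothetical, and treating any UnicodeError as "does not round-trip" is a simplifying assumption.

```python
def round_trips(text, encoding):
    """Return True if text encodes and decodes back to itself."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # Raised when a character has no representation in this encoding.
        return False


print(round_trips("aπ©Z!", "iso-8859-7"))  # True
print(round_trips("aπ©Z!", "latin1"))      # False: no π in Latin1
```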
Naturally if you encode with one encoding, and then decode with another,
you are likely to get different strings:
py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('latin1')
'að©Z!'
py> data.decode('iso-8859-14')
'aŵ©Z!'
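A wrong decode is not always fatal. Latin1 maps every one of the 256 byte values to a character, so text that was mis-decoded as Latin1 can be turned back into the original bytes and decoded again. A small sketch, assuming you know (or can guess) that the real encoding was ISO-8859-7:

```python
text = "aπ©Z!"
data = text.encode("iso-8859-7")

# Decode with the wrong codec: no error, but the text is garbled.
wrong = data.decode("latin1")

# Undo the mistake: re-encode as Latin1 to get the original bytes back,
# then decode with the encoding that was actually used.
recovered = wrong.encode("latin1").decode("iso-8859-7")

print(recovered == text)  # True
```

This trick depends on the wrong codec being lossless over all byte values; it will not work in general for codecs that leave some bytes undefined.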
Both the encode and decode methods take an optional argument, errors,
which specifies the error handling scheme. The default is errors='strict',
which raises an exception when a character cannot be encoded or a byte
cannot be decoded. Others include 'ignore' and 'replace'.
py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
b'aZ!'
py> 'aŵðπ©Z!'.encode('ascii', 'replace')
b'a????Z!'
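For comparison, here is a short sketch of what the default 'strict' handler does, plus two other standard handlers for encoding, 'backslashreplace' and 'xmlcharrefreplace', which keep the information instead of discarding it. The sample string is arbitrary.

```python
text = "aπ©Z!"

# errors='strict' (the default) raises UnicodeEncodeError.
try:
    text.encode("ascii")
    failed_at = None
except UnicodeEncodeError as exc:
    failed_at = exc.start  # index of the first character that failed
print(failed_at)  # 1

# These handlers substitute an ASCII escape for each bad character.
escaped = text.encode("ascii", "backslashreplace")
xml = text.encode("ascii", "xmlcharrefreplace")
print(escaped)  # b'a\\u03c0\\xa9Z!'
print(xml)      # b'a&#960;&#169;Z!'
```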