[newbie] String to binary conversion

Mok-Kong Shen

If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.

M. K. Shen
 
Tobiah

The binascii module looks like it might have
something for you. I've never used it.

Having actually read some of that doc, I see
it's not what you want at all. Sorry.
 
Mok-Kong Shen

On 06.08.2012 22:59, Tobiah wrote:
The binascii module looks like it might have
something for you. I've never used it.

Thanks for the hint, but if I don't err, the module binascii doesn't
seem to work. I typed:

import binascii

and a line that's given as example in the document:

crc = binascii.crc32("hello")

but got the following error message:

TypeError: 'str' does not support the buffer interface.

The same error message appeared when I tried the other functions.

M. K. Shen
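
A note on that TypeError: it is what Python 3 tends to raise when a str is passed where bytes are expected, so the documented example was probably written for Python 2. A minimal sketch, assuming Python 3 and assuming ASCII is an acceptable encoding for the text (that choice is not from the thread):

```python
import binascii

# Python 3: crc32 wants bytes, not str. Either use a bytes literal...
crc_literal = binascii.crc32(b"hello")

# ...or encode the text explicitly. ASCII is an assumption here; any
# encoding that can represent the text would do, but it changes the bytes.
crc_encoded = binascii.crc32("hello".encode("ascii"))

print(crc_literal == crc_encoded)
```

Both calls see the same five bytes, so they should agree. binascii still does not produce the integer asked for in the original question.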
 
MRAB

If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.

Try this (Python 3, in which strings are Unicode):

>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd".encode("latin-1"))[0]
1684234849
>>> hex(_)
'0x64636261'

or this (Python 2, in which strings are bytestrings):

>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd")[0]
1684234849
>>> hex(_)
'0x64636261'
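
The question also asked for the way back. A sketch of the reverse direction with struct.pack, assuming Python 3 and Latin-1 as above, and showing the big-endian variant (">I") alongside the little-endian one for comparison:

```python
import struct

text = "abcd"
data = text.encode("latin-1")              # 4 characters -> 4 bytes

# Same four bytes, read with either byte order.
(little,) = struct.unpack("<I", data)      # 'a' is the least significant byte
(big,) = struct.unpack(">I", data)         # 'a' is the most significant byte

# Reverse direction: pack with the same format, then decode.
from_little = struct.pack("<I", little).decode("latin-1")
from_big = struct.pack(">I", big).decode("latin-1")

print(hex(little), hex(big))   # 0x64636261 0x61626364
print(from_little, from_big)   # abcd abcd
```

The "I" format is exactly 4 bytes, so this only works for 4-character strings in a one-byte encoding.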
 
Emile van Sebille

On 8/6/2012 1:46 PM Mok-Kong Shen said...
If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.

It's easy to write one:

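def str2val(s, _val=0):
    # Big-endian: each character contributes one base-256 digit.
    if len(s) > 1: return str2val(s[1:], 256*_val + ord(s[0]))
    return 256*_val + ord(s[0])


def val2str(val, _str=""):
    # Use >= so that 256 itself is split, and // so that large values
    # are not routed through floating-point division.
    if val >= 256: return val2str(val // 256, _str) + chr(val % 256)
    return _str + chr(val)


print(str2val("abcd"))
print(val2str(str2val("abcd")))
print(val2str(str2val("good")))
print(val2str(str2val("longer")))
print(val2str(str2val("verymuchlonger")))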

Flavor to taste.

Emile
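
In Python 3 the same big-endian conversion can be done without recursion using int.from_bytes and int.to_bytes. This is a sketch, not a replacement for the functions above; the names str_to_int and int_to_str are made up for illustration, and Latin-1 is assumed as the one-byte encoding:

```python
def str_to_int(text, encoding="latin-1"):
    """Interpret the encoded bytes of text as one big-endian integer."""
    return int.from_bytes(text.encode(encoding), "big")


def int_to_str(value, encoding="latin-1"):
    """Inverse of str_to_int; leading NUL characters are not recoverable."""
    length = (value.bit_length() + 7) // 8   # minimum number of bytes
    return value.to_bytes(length, "big").decode(encoding)


n = str_to_int("abcd")
print(n)                                          # 1633837924
print(int_to_str(n))                              # abcd
print(int_to_str(str_to_int("verymuchlonger")))   # verymuchlonger
```

Because the byte count is derived from the integer, a string that starts with "\x00" will come back without that leading character.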
 
Steven D'Aprano

If I have a string "abcd" then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best obtain
that integer and from that integer backwards again obtain the original
string? Thanks in advance.

First you have to know the encoding, as that will define the integers you
get. There are many 8-bit encodings, but of course they can't all encode
arbitrary 4-character strings. Since there are tens of thousands of
different characters, and an 8-bit encoding can only code for 256 of
them, there are many strings that an encoding cannot handle.

For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
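
To make that concrete, here is a small sketch (Python 3 assumed) of a one-byte encoding refusing a string, and of multi-byte encodings accepting it at the cost of more than one byte per character:

```python
text = "aπ©d"

# Latin-1 has no code for π, so a strict encode fails.
try:
    text.encode("latin-1")
    failed_on = None
except UnicodeEncodeError as exc:
    failed_on = exc.object[exc.start:exc.end]   # the offending character(s)

utf8 = text.encode("utf-8")        # π and © take two bytes each
utf16 = text.encode("utf-16-le")   # two bytes per character here, no BOM

print(failed_on)                       # π
print(len(text), len(utf8), len(utf16))   # 4 6 8
```

With a multi-byte encoding, four characters no longer map to exactly 32 bits, which is why the encoding has to be fixed before talking about "the" integer for a string.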

Sticking to one-byte encodings: since most of them are compatible with
ASCII, examples with "abcd" aren't very interesting:

py> 'abcd'.encode('latin1')
b'abcd'

Even though the bytes object b'abcd' is printed as if it were a string,
it is actually treated as an array of one-byte ints:

py> b'abcd'[0]
97
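
A short sketch of what that implies in practice (Python 3 assumed): iterating over bytes gives ints, bytes can be rebuilt from a list of ints, and slicing, unlike indexing, gives bytes back:

```python
data = "abcd".encode("latin-1")

codes = list(data)         # [97, 98, 99, 100]
rebuilt = bytes(codes)     # b'abcd'
first_slice = data[0:1]    # b'a', a bytes object of length 1

print(codes, rebuilt, first_slice)
```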

Here's a more interesting example, using Python 3: it uses at least one
character (the Greek letter π) which cannot be encoded in Latin1, and two
which cannot be encoded in ASCII:

py> "aπ©d".encode('iso-8859-7')
b'a\xf0\xa9d'

Most encodings will round-trip successfully:

py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('iso-8859-7') == text
True


(although the ability to round-trip is a property of the encoding itself,
not of the encoding system).
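
One way to probe this per encoding is to go the other way round, from bytes to text and back. A sketch, with decodes_every_byte being a made-up helper name: Latin-1 assigns a character to every byte value, whereas ASCII rejects anything from 0x80 upward.

```python
def decodes_every_byte(encoding):
    """True if all 256 byte values decode and re-encode unchanged."""
    every_byte = bytes(range(256))
    try:
        return every_byte.decode(encoding).encode(encoding) == every_byte
    except UnicodeError:
        # Raised when the codec has no character for some byte value.
        return False


for name in ("latin-1", "ascii", "iso-8859-7"):
    print(name, decodes_every_byte(name))
```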

Naturally if you encode with one encoding, and then decode with another,
you are likely to get different strings:

py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('latin1')
'að©Z!'
py> data.decode('iso-8859-14')
'aŵ©Z!'


Both the encode and decode methods take an optional argument, errors,
which specifies the error-handling scheme. The default is errors='strict',
which raises an exception when a character cannot be encoded (or a byte
cannot be decoded). Others include 'ignore' and 'replace'.

py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
b'aZ!'
py> 'aŵðπ©Z!'.encode('ascii', 'replace')
b'a????Z!'
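
The same handlers apply when decoding. A sketch of the decode side, assuming Python 3 and reusing the ISO-8859-7 bytes from the earlier example:

```python
data = "aπ©d".encode("iso-8859-7")   # b'a\xf0\xa9d'

# errors='strict' is the default: bytes >= 0x80 are not valid ASCII.
try:
    data.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode("ascii", "ignore")     # undecodable bytes are dropped
replaced = data.decode("ascii", "replace")   # each one becomes U+FFFD

print(strict_failed, repr(ignored), repr(replaced))
```

Note that 'ignore' and 'replace' are lossy in both directions, so the result cannot be converted back to the original.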
 
88888 Dihedral

On Tuesday, 7 August 2012 at 10:01:05 AM UTC+8, Steven D'Aprano wrote:
For those, you need multi-byte encodings like UTF-8, UTF-16, etc.

I think a UTF-8 or UTF-16 codec is necessary; just recall the MS encoding
codecs of Win98 and NT, which collected taxes all over the world.


In fact, for each character encoding, one should develop a codec to
UTF-8 or UTF-16.

That way one can convert between any two of the supported
character sets.
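
In Python 3 this pivot is already built in: str is Unicode, so decoding with the source codec and encoding with the target codec converts between any two supported character sets. A minimal sketch, with transcode being a hypothetical helper name:

```python
def transcode(data, source, target, errors="strict"):
    """Convert bytes from one character encoding to another via str."""
    return data.decode(source).encode(target, errors)


greek = b"a\xf0\xa9d"                            # 'aπ©d' in ISO-8859-7
as_utf8 = transcode(greek, "iso-8859-7", "utf-8")
back = transcode(as_utf8, "utf-8", "iso-8859-7")

print(as_utf8)
print(back == greek)
```

Converting into a one-byte target can still fail for characters the target lacks, in which case the errors argument decides what happens.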
 
