Can I get the 8bit-string representation of any unicode string?

wanghz

Hello, everyone.

I have a problem when I'm processing unicode strings. Is it possible
to get the 8bit-string representation of any unicode string?

Suppose I get a unicode string:
a = u'\xc8\xce\xcf\xcd\xc6\xeb'
then, by
a.encode('latin-1')
I can get the 8bit-string representation of it, that is, the physical
storage format of this string.

But for another kind of unicode string, say:
b = u'\u4efb\u8d24\u9f50'
I have to use:
b.encode('utf-8')
to get the 8bit-string format of it.

Since these unicode strings are given by an external library function,
I don't know which kind a unicode string belongs to before I get it at
runtime. So, I wonder if there is a unified way to get the 8bit-string
representation, say, byte-by-byte, of any unicode string?

Thank you very much.
 
Kent Johnson

> Hello, everyone.
>
> I have a problem when I'm processing unicode strings. Is it possible
> to get the 8bit-string representation of any unicode string?

Yes, if you can be more precise about what you mean by '8bit-string
representation'. Likely candidates are
b.encode('utf-8')
b.encode('utf_16_be')
b.encode('utf_16_le')

Kent
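
To make the difference between these candidates concrete, here is a small sketch that encodes the example string b from the question with each of the three codecs. It is written for Python 3 (which also accepts the u'' prefix); the variable names are only for illustration.

```python
# The example string from the question: three CJK characters.
b = u'\u4efb\u8d24\u9f50'

# Each codec maps the same three code points to a different byte sequence.
utf8 = b.encode('utf-8')          # 3 bytes per character for these code points
utf16_be = b.encode('utf_16_be')  # 2 bytes per character, high byte first
utf16_le = b.encode('utf_16_le')  # 2 bytes per character, low byte first

for name, data in [('utf-8', utf8),
                   ('utf_16_be', utf16_be),
                   ('utf_16_le', utf16_le)]:
    print(name, len(data), repr(data))
```

All three results are valid 8-bit strings for the same text, which is why the question needs a chosen encoding before it has a single answer.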
 
Fredrik Lundh

> I have a problem when I'm processing unicode strings. Is it possible
> to get the 8bit-string representation of any unicode string?
>
> Suppose I get a unicode string:
> a = u'\xc8\xce\xcf\xcd\xc6\xeb'
> then, by
> a.encode('latin-1')
> I can get the 8bit-string representation of it, that is, the physical
> storage format of this string.
>
> But for another kind of unicode string, say:
> b = u'\u4efb\u8d24\u9f50'
> I have to:
> b.encode('utf-8')
> to get the 8bit-string format of it.

latin-1 and utf-8 are two different 8-bit representations (encodings) of
Unicode.

> Since these unicode strings are given by an external library function,
> I don't know which kind a unicode string belongs to before I get it at
> runtime. So, I wonder if there is a unified way to get the 8bit-string
> representation, say, byte-by-byte, of any unicode string?

since the Unicode character set contains 1.1 million code points, and a
single byte can contain 256 different values, it should be fairly obvious
that there's no "8 bit byte by byte" representation of a Unicode string.
you need to decide what 8-bit encoding to use, and stick to that.

</F>
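
As an illustration of the point above, the following sketch (Python 3) shows that latin-1 cannot encode the second example string, while utf-8 handles both. The last step rests on an assumption that is not stated in the thread: the string a looks like GBK-encoded bytes that the external library decoded as latin-1, because reversing that step and decoding as GBK gives the same text as b.

```python
a = u'\xc8\xce\xcf\xcd\xc6\xeb'
b = u'\u4efb\u8d24\u9f50'

# latin-1 only covers code points 0-255, so it cannot encode b.
try:
    b.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# utf-8 can encode every code point, so it works for both strings.
a_utf8 = a.encode('utf-8')
b_utf8 = b.encode('utf-8')

# Assumption: a is really GBK bytes that were mis-decoded as latin-1.
# Undo the latin-1 step to get the raw bytes back, then decode them as GBK.
recovered = a.encode('latin-1').decode('gbk')

print(latin1_ok, recovered == b)
```

If that assumption holds, the real fix is to make the library decode its input with the right codec, rather than to pick an encoding per string at runtime.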
 
wanghz

Thank you all for your replies :)

I may have misunderstood it. I will think about it carefully.

By the way, does Python have an interface like iconv in libc for
C/C++? Or, how can I convert a string from one encoding into
another?


Thank you so much.
 
Fredrik Lundh

wanghz wrote:
> I may have misunderstood it. I will think about it carefully.
>
> By the way, does Python have an interface like iconv in libc for
> C/C++? Or, how can I convert a string from one encoding into
> another?

if b is an 8-bit string containing an encoded unicode string,

u = b.decode(encoding)

or

u = unicode(b, encoding)

gives you a unicode string. to encode the unicode string back to another
byte string, use the encode method.

b = u.encode(encoding)

</F>
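
Putting the two steps together gives something close to what iconv does. The sketch below is written for Python 3, where the 8-bit string type is bytes and the unicode() built-in no longer exists; the helper name transcode and the sample GBK bytes are hypothetical, chosen only for illustration.

```python
def transcode(data, source_encoding, target_encoding):
    """Hypothetical helper: re-encode bytes from one encoding to another."""
    # Step 1: decode the bytes into a unicode string.
    text = data.decode(source_encoding)
    # Step 2: encode the unicode string into the target byte encoding.
    return text.encode(target_encoding)


# Sample input: three CJK characters encoded as GBK.
gbk_bytes = b'\xc8\xce\xcf\xcd\xc6\xeb'
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')
print(repr(utf8_bytes))
```

Both decode and encode raise an error (UnicodeDecodeError or UnicodeEncodeError) by default when the data does not fit the named encoding, so a wrong guess about the source encoding usually fails loudly instead of silently corrupting the text.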
 
