Can I get the 8bit-string representation of any unicode string

Discussion in 'Python' started by wanghz@gmail.com, Feb 12, 2006.

  1. Guest

    Hello, everyone.

    I have a problem when I'm processing unicode strings. Is it possible
    to get the 8bit-string representation of any unicode string?

    Suppose I get a unicode string:
    a = u'\xc8\xce\xcf\xcd\xc6\xeb';
    then, by
    a.encode('latin-1');
    I can get the 8bit-string representation of it, that is, the physical
    storage format of this string.

    But for another kind of unicode string, say:
    b = u'\u4efb\u8d24\u9f50';
    I have to:
    b.encode('utf-8')
    to get the 8bit-string format of it.

    Since these unicode strings are given by an external library function,
    I don't know which kind a unicode string belongs to before I get it at
    runtime. So, I wonder if there is a unified way to get the 8bit-string
    representation, say, byte-by-byte, of any unicode string?

    Thank you very much.
     
    , Feb 12, 2006
    #1

  2. Kent Johnson Guest

    wrote:
    > Hello, everyone.
    >
    > I have a problem when I'm processing unicode strings. Is it possible
    > to get the 8bit-string representation of any unicode string?


    Yes, if you can be more precise about what you mean by '8bit-string
    representation'. Likely candidates are
    b.encode('utf-8')
    b.encode('utf_16_be')
    b.encode('utf_16_le')
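
    To illustrate the candidates above, here is a minimal sketch showing
    that the same string yields different bytes under each encoding. It
    assumes Python 3, where u'...' is an ordinary str and encode() returns
    bytes; the variable names are illustrative only.

```python
# Same Unicode string, three different byte representations.
b = u'\u4efb\u8d24\u9f50'

utf8 = b.encode('utf-8')          # variable width, 3 bytes per character here
utf16_be = b.encode('utf_16_be')  # 2 bytes per character, high byte first
utf16_le = b.encode('utf_16_le')  # 2 bytes per character, low byte first

for name, data in [('utf-8', utf8), ('utf_16_be', utf16_be), ('utf_16_le', utf16_le)]:
    print(name, len(data), data.hex())
```

    None of these is "the" storage format; each is just one agreed way of
    writing the same code points as bytes.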

    Kent

    >
    > Suppose I get a unicode string:
    > a = u'\xc8\xce\xcf\xcd\xc6\xeb';
    > then, by
    > a.encode('latin-1');
    > I can get the 8bit-string representation of it, that is, the physical
    > storage format of this string.
    >
    > But for another kind of unicode string, say:
    > b = u'\u4efb\u8d24\u9f50';
    > I have to:
    > b.encode('utf-8')
    > to get the 8bit-string format of it.
    >
    > Since these unicode strings are given by an external library function,
    > I don't know which kind a unicode string belongs to before I get it at
    > runtime. So, I wonder if there is a unified way to get the 8bit-string
    > representation, say, byte-by-byte, of any unicode string?
    >
    > Thank you very much.
    >
     
    Kent Johnson, Feb 12, 2006
    #2

  3. wrote:

    > I have a problem when I'm processing unicode strings. Is it possible
    > to get the 8bit-string representation of any unicode string?
    >
    > Suppose I get a unicode string:
    > a = u'\xc8\xce\xcf\xcd\xc6\xeb';
    > then, by
    > a.encode('latin-1');
    > I can get the 8bit-string representation of it, that is, the physical
    > storage format of this string.
    >
    > But for another kind of unicode string, say:
    > b = u'\u4efb\u8d24\u9f50';
    > I have to:
    > b.encode('utf-8')
    > to get the 8bit-string format of it.


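    latin-1 and utf-8 are two different 8-bit encodings of Unicode.
    latin-1 can only represent the first 256 code points, one byte each;
    utf-8 can represent every code point, using one to four bytes each.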

    > Since these unicode strings are given by an external library function,
    > I don't know which kind a unicode string belongs to before I get it at
    > runtime. So, I wonder if there is a unified way to get the 8bit-string
    > representation, say, byte-by-byte, of any unicode string?


    since the Unicode character set contains 1.1 million code points, and a
    single byte can contain 256 different values, it should be fairly obvious
    that there's no "8 bit byte by byte" representation of a Unicode string.
    you need to decide what 8-bit encoding to use, and stick to that.
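
    A minimal sketch of the point above, using the two strings from the
    question. It assumes Python 3 (str is Unicode, encode() returns
    bytes); the names latin1_ok, utf8_a and utf8_b are made up for the
    example.

```python
a = u'\xc8\xce\xcf\xcd\xc6\xeb'   # every code point is below U+0100
b = u'\u4efb\u8d24\u9f50'         # code points far above U+00FF

# latin-1 maps code points 0..255 to single bytes, so this works for a.
latin1_a = a.encode('latin-1')

# The same call cannot work for b: there is no byte for U+4EFB.
try:
    b.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# utf-8 can encode both strings, at the cost of a variable width.
utf8_a = a.encode('utf-8')
utf8_b = b.encode('utf-8')

print(latin1_ok, len(latin1_a), len(utf8_a), len(utf8_b))
```

    So if one encoding has to cover whatever the library returns, utf-8
    (or one of the utf-16 variants) is the kind of choice that works for
    any string; latin-1 is not.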

    </F>
     
    Fredrik Lundh, Feb 12, 2006
    #3
  4. Guest

    Thank you all for your replies :)

    I may have misunderstood it. I will think about it carefully.

    By the way, does Python have an interface like iconv in libc for
    C/C++? Or, how can I convert a string from one encoding into
    another?


    Thank you so much.
     
    , Feb 12, 2006
    #4
  5. wrote

    > I may have misunderstood it. I will think about it carefully.
    >
    > By the way, does Python have an interface like iconv in libc for
    > C/C++? Or, how can I convert a string from one encoding into
    > another?


    if b is an 8-bit string containing an encoded unicode string,

    u = b.decode(encoding)

    or

    u = unicode(b, encoding)

    gives you a unicode string. to encode the unicode string back to another
    byte string, use the encode method.

    b = u.encode(encoding)
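
    Put together, the decode-then-encode steps above give a rough
    iconv-style conversion. This is only a sketch: recode() is a
    hypothetical helper, not a standard library function, and it assumes
    Python 3, where the 8-bit string is a bytes object and
    unicode(b, encoding) is spelled str(b, encoding).

```python
def recode(data, src_encoding, dst_encoding):
    """Hypothetical helper: re-encode bytes from one encoding to another."""
    text = data.decode(src_encoding)   # bytes -> Unicode string
    return text.encode(dst_encoding)   # Unicode string -> bytes

latin1_bytes = b'\xc8\xce\xcf\xcd\xc6\xeb'

# Convert latin-1 bytes to utf-8 bytes, then back again.
utf8_bytes = recode(latin1_bytes, 'latin-1', 'utf-8')
round_trip = recode(utf8_bytes, 'utf-8', 'latin-1')

print(utf8_bytes.hex(), round_trip == latin1_bytes)
```

    The conversion raises UnicodeDecodeError or UnicodeEncodeError if the
    data does not fit the source or target encoding.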

    </F>
     
    Fredrik Lundh, Feb 12, 2006
    #5
  6. Guest

    Hi,

    I see. Thank you for your help!


    Regards,
    hongzheng

    Fredrik Lundh wrote:
    > wrote
    >
    > > I may have misunderstood it. I will think about it carefully.
    > >
    > > By the way, does Python have an interface like iconv in libc for
    > > C/C++? Or, how can I convert a string from one encoding into
    > > another?

    >
    > if b is an 8-bit string containing an encoded unicode string,
    >
    > u = b.decode(encoding)
    >
    > or
    >
    > u = unicode(b, encoding)
    >
    > gives you a unicode string. to encode the unicode string back to another
    > byte string, use the encode method.
    >
    > b = u.encode(encoding)
    >
    > </F>
     
    , Feb 12, 2006
    #6
