unicode wrap unicode object?

Discussion in 'Python' started by ygao, Apr 8, 2006.

  1. ygao

    ygao Guest

    >>> import sys
    >>> sys.setdefaultencoding("utf-8")
    >>> s='\xe9\xab\x98' # this is a utf-8 string
    >>> ss=U'\xe9\xab\x98'
    >>> s
    '\xe9\xab\x98'
    >>> ss
    u'\xe9\xab\x98'
    >>>

    How do I get ss from s?
    Is there a way to do this?
    Thanks!
     
    ygao, Apr 8, 2006
    #1

  2. "ygao" wrote:

    > >>> import sys
    > >>> sys.setdefaultencoding("utf-8")


    hmm. what kind of bootleg python is that?

    >>> import sys
    >>> sys.setdefaultencoding("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: 'module' object has no attribute 'setdefaultencoding'

    (you're not supposed to change the default encoding. don't
    do that; it'll only cause problems in the long run).

    > >>> s='\xe9\xab\x98' # this is a utf-8 string
    > >>> ss=U'\xe9\xab\x98'
    > >>> s
    > '\xe9\xab\x98'
    > >>> ss
    > u'\xe9\xab\x98'
    > >>>

    > How do I get ss from s?
    > Is there a way to do this?


    you have UTF-8 *bytes* in a Unicode text string? sounds like
    someone's made a mistake earlier on...

    anyway, iso-8859-1 is, in practice, a null transform that simply
    maps each unicode character in the range U+0000 to U+00FF to the
    byte with the same value:

    >>> s = ss.encode("iso-8859-1")
    >>> s
    '\xe9\xab\x98'
    >>> s.decode("utf-8")
    u'\u9ad8'
    >>> import unicodedata
    >>> unicodedata.name(s.decode("utf-8"))
    'CJK UNIFIED IDEOGRAPH-9AD8'

    but it's probably better to fix the code that puts UTF-8 data in your
    Unicode strings (look for bogus iso-8859-1 conversions)

    </F>
     
    Fredrik Lundh, Apr 8, 2006
    #2
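The repair described in post #2 can be sketched as a small function. This is a minimal sketch written in Python 3 syntax, which is an assumption: the thread itself uses Python 2, where `str` holds bytes and `unicode` holds text. The helper name `repair_utf8_mojibake` is hypothetical and does not come from the thread.

```python
import unicodedata


def repair_utf8_mojibake(text):
    """Recover text whose UTF-8 bytes were wrongly decoded as ISO-8859-1.

    Hypothetical helper: ISO-8859-1 maps code points U+0000..U+00FF
    one-to-one onto bytes 0x00..0xFF, so encoding with it gives back
    the original bytes, which can then be decoded as UTF-8.
    """
    raw = text.encode("iso-8859-1")   # raises UnicodeEncodeError above U+00FF
    return raw.decode("utf-8")        # raises UnicodeDecodeError if not valid UTF-8


broken = "\xe9\xab\x98"               # three code points, like ss in the question
fixed = repair_utf8_mojibake(broken)  # a single code point, U+9AD8
char_name = unicodedata.name(fixed)
```

Both steps raise on data that does not fit the assumption, which is usually preferable to repairing silently. The better fix remains the one given in the post: find the place where the bytes were decoded with the wrong codec.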

  3. ygao

    ygao Guest

    Sorry for my poor English.
    I got a solution from someone else.
    I must use utf-8 for chinese.


    >>> import sys
    >>> reload(sys)
    >>> sys.setdefaultencoding("utf-8")
    >>> s='\xe9\xab\x98' # this is a utf-8 string
    >>> ss=U'\xe9\xab\x98'
    >>> ss1=ss.encode('unicode_escape').decode('string_escape')
    >>> s1=s.decode('unicode_escape')
    >>> s1==ss
    True
    >>> ss1==s
    True
    >>>
     
    ygao, Apr 8, 2006
    #3
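One caveat about the escape-codec round trip in post #3, sketched in Python 3 syntax (an assumption: `string_escape` exists only in Python 2, so only the `unicode_escape` half is shown). For bytes that contain no backslash, `unicode_escape` decodes like ISO-8859-1, which is why the comparison in the post succeeds. It also interprets backslash sequences, so the two codecs are not interchangeable on arbitrary data.

```python
raw = b"\xe9\xab\x98"

# Without backslashes, both codecs map each byte to the code point
# with the same numeric value.
via_escape = raw.decode("unicode_escape")
via_latin1 = raw.decode("iso-8859-1")

# With a backslash in the data, unicode_escape rewrites the sequence,
# while iso-8859-1 leaves every byte alone.
tricky = b"a\\nb"                                 # four bytes: a, backslash, n, b
tricky_escape = tricky.decode("unicode_escape")   # 'a', newline, 'b'
tricky_latin1 = tricky.decode("iso-8859-1")       # all four characters kept
```

Neither variant produces the Chinese character; both only move the same three byte values between the byte type and the text type.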
  5. "ygao" wrote:

    > I must use utf-8 for chinese.


    yeah, but you shouldn't store it in a *Unicode* string. Unicode strings
    are designed to hold things that you've already decoded (that is, your
    chinese text), not the raw UTF-8 bytes.

    if you store the UTF-8 in an ordinary 8-bit string instead, you can use
    the unicode constructor to convert things properly:

    b = "... some utf-8 data ..."

    # turn it into a unicode string
    u = unicode(b, "utf-8")

    # ... do something with it ...

    # turn it back into a utf-8 string
    s = u.encode("utf-8")

    # or use some other encoding
    s = u.encode("big5")

    e.g.

    >>> b = '\xe9\xab\x98'
    >>> u = unicode(b, "utf-8")
    >>> u.encode("utf-8")
    '\xe9\xab\x98'
    >>> u.encode("big5")
    '\xb0\xaa'

    </F>
     
    Fredrik Lundh, Apr 8, 2006
    #5
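The decode-then-encode pattern from post #5 looks like this in Python 3 syntax (an assumption: the thread uses Python 2, where `unicode(b, "utf-8")` plays the role of `b.decode("utf-8")`). The helper name `transcode` is hypothetical.

```python
def transcode(data, source_encoding, target_encoding):
    """Hypothetical helper: decode bytes to text, then encode to another charset."""
    text = data.decode(source_encoding)   # bytes -> text
    return text.encode(target_encoding)   # text -> bytes


utf8_bytes = b"\xe9\xab\x98"
text = utf8_bytes.decode("utf-8")         # the decoded character, U+9AD8
round_trip = text.encode("utf-8")         # back to the original three bytes
big5_bytes = transcode(utf8_bytes, "utf-8", "big5")
```

Keeping encoded data in the byte type and decoding it explicitly at the point where text is needed makes the codec visible at every conversion, so no default encoding has to be changed.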
  6. ygao

    ygao Guest

    thanks for your advice.
     
    ygao, Apr 8, 2006
    #6
  7. ygao wrote:
    > I must use utf-8 for chinese.


    Sure. But please don't do this:

    > >>> import sys
    > >>> reload(sys)
    > >>> sys.setdefaultencoding("utf-8")


    As Fredrik says, you should really avoid changing the
    default encoding.

    > >>> s='\xe9\xab\x98' # this is a utf-8 string
    > >>> ss=U'\xe9\xab\x98'
    > >>> ss1=ss.encode('unicode_escape').decode('string_escape')
    > >>> s1=s.decode('unicode_escape')
    > >>> s1==ss
    > True
    > >>> ss1==s
    > True


    OK. But how about this:

    py> s='\xe9\xab\x98'
    py> ss=u'\u9ad8'
    py> s1=s.decode('utf-8')
    py> s1==ss
    True

    Here, ss is a single character, which uses 3 bytes in UTF-8.
    In your example, ss has three characters, which are not Chinese,
    but European.

    Regards,
    Martin
     
    Martin v. Löwis, Apr 8, 2006
    #7
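The difference pointed out in post #7 can be made visible by listing the code points each string contains. This is a sketch in Python 3 syntax (an assumption, since the thread uses Python 2); the function name `describe` and the `<no name>` placeholder are hypothetical.

```python
import unicodedata

raw = b"\xe9\xab\x98"

as_utf8 = raw.decode("utf-8")          # what the bytes are meant to represent
as_latin1 = raw.decode("iso-8859-1")   # same content as u'\xe9\xab\x98'


def describe(text):
    """Return (code point, name) pairs; characters without a name get a placeholder."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<no name>")) for ch in text]


utf8_report = describe(as_utf8)
latin1_report = describe(as_latin1)
```

`utf8_report` holds one entry for the CJK ideograph, while `latin1_report` holds three entries in the U+0080 to U+00FF range, the last of which is a control character with no Unicode name.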