Unicode Question

Discussion in 'Python' started by David Pratt, Jan 10, 2006.

  1. David Pratt

    David Pratt Guest

    Hi. I am working through some tutorials on unicode and am hoping that
    someone can help explain this for me. I am on mac platform using python
    2.4.1 at the moment. I am experimenting with unicode with the 3/4 symbol.

    I want to prepare strings for db storage that come from a normal Windows
    machine (cp1252), so my understanding is that I need to unicode and encode
    to utf-8 to store them properly. Since data will be used on the web I would not
    have to change my encoding when extracting from the database. This first
    example I believe simulates this with the 3/4 symbol. Here I want to
    store '\xc2\xbe' in my database.

    >>> tq = u'\xbe'
    >>> tq_utf = tq.encode('utf8')
    >>> tq, tq_utf

    (u'\xbe', '\xc2\xbe')

    To unicode with a variable, my understanding is that I can unicode and
    encode at the same time:

    >>> tq = '\xbe'
    >>> tq_utf = unicode(tq, 'utf-8')

    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 0:
    unexpected code byte

    This is not working for me. Can someone explain why. Many thanks.

    Regards,
    David
    David Pratt, Jan 10, 2006
    #1

  2. David Pratt wrote:

    > This is not working for me. Can someone explain why. Many thanks.


    Because '\xbe' isn't UTF-8 for the character you want, '\xc2\xbe' is, as
    you just showed yourself in the code snippet.
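
    A minimal sketch of the same point, written in Python 3 spelling (an
    assumption; the thread uses Python 2.4, where these bytes would be a plain
    str and the call would be unicode(raw, 'utf-8')). The helper name try_utf8
    is hypothetical:

```python
def try_utf8(raw):
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return None

# 0xBE is 0b10111110, a continuation byte; it cannot start a UTF-8 sequence.
lone = try_utf8(b'\xbe')
# 0xC2 0xBE is the two-byte UTF-8 sequence for U+00BE.
pair = try_utf8(b'\xc2\xbe')
print(lone, pair == '\xbe')
```

    The lone byte is rejected, while the two-byte sequence decodes to the
    single character u'\xbe' from the original snippet.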

    --
    Erik Max Francis && && http://www.alcyone.com/max/
    San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
    Where are they?
    -- Enrico Fermi, 1901-1954
    Erik Max Francis, Jan 10, 2006
    #2

  3. David Pratt wrote:
    > I want to prepare strings for db storage that come from a normal Windows
    > machine (cp1252), so my understanding is that I need to unicode and encode
    > to utf-8 to store them properly.


    That also depends on the database. The database must accept
    UTF-8-encoded strings, and must not modify them in any form or way.
    Some databases fail here, and work better if you pass Unicode objects
    to them directly.
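
    As a rough illustration of passing the text object directly, here is a
    sketch that assumes Python 3 and its standard-library sqlite3 module
    (neither is what the thread's Python 2.4 setup provides). The table and
    column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE fractions (symbol TEXT)')
# Pass the text object itself and let the driver handle storage encoding.
conn.execute('INSERT INTO fractions (symbol) VALUES (?)', ('\xbe',))
stored = conn.execute('SELECT symbol FROM fractions').fetchone()[0]
conn.close()

# Encode only at the boundary where bytes are needed, e.g. for a web response.
web_bytes = stored.encode('utf-8')
```

    With this approach the application never handles encoded bytes until the
    final output step, so there is no chance of encoding the same data twice.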

    > Since data will be used on the web I would not
    > have to change my encoding when extracting from the database. This first
    > example I believe simulates this with the 3/4 symbol. Here I want to
    > store '\xc2\xbe' in my database.
    >
    > >>> tq = u'\xbe'


    You can verify that this is really 3/4:

    py> import unicodedata
    py> unicodedata.name(u"\xbe")
    'VULGAR FRACTION THREE QUARTERS'

    > >>> tq_utf = tq.encode('utf8')
    > >>> tq, tq_utf
    > (u'\xbe', '\xc2\xbe')


    So it should be clear now that '\xc2\xbe' is the UTF-8 encoding
    of that character.

    > To unicode with a variable, my understanding is that I can unicode and
    > encode at the same time:


    Not sure what you mean by "same time" (I'm not even sure what
    "I can unicode" means - unicode is not a verb, it's a noun).

    > >>> tq = '\xbe'
    > >>> tq_utf = unicode(tq, 'utf-8')

    > Traceback (most recent call last):
    > File "<stdin>", line 1, in ?
    > UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 0:
    > unexpected code byte
    >
    > This is not working for me. Can someone explain why. Many thanks.


    Of course not. The UTF-8 encoding of the character, as we have seen
    earlier, is '\xc2\xbe'. So you should write

    py> unicode('\xc2\xbe', 'utf-8')
    u'\xbe'

    You mentioned windows-1252 at some point. If you are given windows-1252
    bytes, you can do

    py> unicode('\xbe', 'windows-1252')
    u'\xbe'

    If you are looking for "at the same time", perhaps this is also
    interesting:

    py> unicode('\xbe', 'windows-1252').encode('utf-8')
    '\xc2\xbe'
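
    The same decode-then-encode chain can be wrapped in a small helper. This
    is only a sketch in Python 3 spelling (an assumption; on Python 2.4 the
    unicode(...).encode(...) form above is the equivalent), and the function
    name cp1252_to_utf8 is hypothetical:

```python
def cp1252_to_utf8(raw):
    """Decode Windows-1252 bytes to text, then encode that text as UTF-8."""
    return raw.decode('cp1252').encode('utf-8')

# 0xBE is the 3/4 symbol in Windows-1252.
three_quarters = cp1252_to_utf8(b'\xbe')
# 0x80 is the euro sign in Windows-1252, which lies outside Latin-1.
euro = cp1252_to_utf8(b'\x80')
print(three_quarters, euro)
```

    The euro sign shows why naming the right source codec matters: its
    Windows-1252 byte does not match its Unicode code point, so it becomes a
    three-byte UTF-8 sequence.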

    Regards,
    Martin
    Martin v. Löwis, Jan 10, 2006
    #3
  4. David Pratt

    David Pratt Guest

    Hi Martin. Many thanks for your reply. What I am really after is what the
    following accomplishes:
    >
    > If you are looking for "at the same time", perhaps this is also
    > interesting:
    >
    > py> unicode('\xbe', 'windows-1252').encode('utf-8')
    > '\xc2\xbe'
    >


    Your answer really helped quite a bit to clarify this for me. I am using
    sqlite3 so it is very happy to have utf-8 encoded unicode.

    The examples you provided were the additional help I needed. Thank you.

    Regards,
    David
    David Pratt, Jan 10, 2006
    #4
  5. David Pratt

    David Pratt Guest

    Hi Erik. Thank you for your reply. The advice has helped clarify this
    for me.

    Regards,
    David

    Erik Max Francis wrote:
    > David Pratt wrote:
    >
    >
    >>This is not working for me. Can someone explain why. Many thanks.

    >
    >
    > Because '\xbe' isn't UTF-8 for the character you want, '\xc2\xbe' is, as
    > you just showed yourself in the code snippet.
    >
    David Pratt, Jan 10, 2006
    #5