inserting Unicode character in dictionary - Python

Discussion in 'Python' started by gita ziabari, Oct 17, 2008.

  1. gita ziabari

    gita ziabari Guest

    Hello All,

    The following code does not work for unicode characters:

    keyword = dict()
    kw = 'генских'
    keyword.setdefault(key, []).append (kw)

    It works fine for inserting ASCII characters. Any suggestions?

    Thanks,

    Gita
     
    gita ziabari, Oct 17, 2008
    #1

  2. On Fri, 17 Oct 2008 13:07:38 -0400, gita ziabari wrote:

    > The following code does not work for unicode characters:
    >
    > keyword = dict()
    > kw = 'генских'
    > keyword.setdefault(key, []).append (kw)
    >
    > It works fine for inserting ASCII characters. Any suggestions?


    What do you mean by "does not work"? And you are aware that the above
    snippet doesn't involve any unicode characters!? You have a byte string
    there -- type `str` not `unicode`.
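
To make the distinction concrete, here is a minimal sketch in Python 3 terms, where `bytes` plays the role of the 2.x `str` and `str` plays the role of the 2.x `unicode`. The key name "ru" is an assumption, since the original snippet never defines `key`. The dictionary accepts either kind of value; only the type and the length differ.

```python
# Python 3 sketch: bytes ~ 2.x `str`, str ~ 2.x `unicode`.
keyword = {}
key = "ru"  # hypothetical key; the original snippet leaves `key` undefined

text_kw = "генских"                # text string: 7 code points
byte_kw = text_kw.encode("utf-8")  # byte string: the same word as UTF-8 bytes

# setdefault/append works identically for both kinds of value.
keyword.setdefault(key, []).append(text_kw)
keyword.setdefault(key, []).append(byte_kw)

for value in keyword[key]:
    print(type(value).__name__, len(value))
```

So the dictionary is not the problem; what matters is which of the two types the literal ends up being.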

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Oct 17, 2008
    #2

  3. Joe Strout

    Joe Strout Guest

    On Oct 17, 2008, at 11:24 AM, Marc 'BlackJack' Rintsch wrote:

    >> kw = 'генских'
    >>

    > What do you mean by "does not work"? And you are aware that the above
    > snippet doesn't involve any unicode characters!? You have a byte
    > string there -- type `str` not `unicode`.


    Just checking my understanding here -- are the following all true:

    1. If you had prefixed that literal with a "u", then you'd have Unicode.

    2. Exactly what Unicode you get would be dependent on Python properly
    interpreting the bytes in the source file -- which you can make it do
    by adding something like "-*- coding: utf-8 -*-" in a comment at the
    top of the file.

    3. Without the "u" prefix, you'll have some 8-bit string, whose
    interpretation is... er... here's where I get a bit fuzzy. What if
    your source file is set to utf-8? Do you then have a proper UTF-8
    string, but the problem is that none of the standard Python library
    methods know how to properly interpret UTF-8?

    4. In Python 3.0, this silliness goes away, because all strings are
    Unicode by default.

    Thanks for any answers/corrections,
    - Joe
     
    Joe Strout, Oct 17, 2008
    #3
  4. On Fri, 17 Oct 2008 11:32:36 -0600, Joe Strout wrote:

    > On Oct 17, 2008, at 11:24 AM, Marc 'BlackJack' Rintsch wrote:
    >
    >>> kw = 'генских'
    >>>

    >> What do you mean by "does not work"? And you are aware that the above
    >> snippet doesn't involve any unicode characters!? You have a byte
    >> string there -- type `str` not `unicode`.

    >
    > Just checking my understanding here -- are the following all true:
    >
    > 1. If you had prefixed that literal with a "u", then you'd have Unicode.


    Yes.

    > 2. Exactly what Unicode you get would be dependent on Python properly
    > interpreting the bytes in the source file -- which you can make it do by
    > adding something like "-*- coding: utf-8 -*-" in a comment at the top of
    > the file.


    Yes, assuming the encoding on that comment matches the actual encoding of
    the file.

    > 3. Without the "u" prefix, you'll have some 8-bit string, whose
    > interpretation is... er... here's where I get a bit fuzzy.


    No interpretation at all, just the bunch of bytes that happen to be in
    the source file.

    > What if your source file is set to utf-8? Do you then have a proper
    > UTF-8 string, but the problem is that none of the standard Python
    > library methods know how to properly interpret UTF-8?


    Well, the decode method knows how to decode those bytes into a `unicode`
    object if you call it with 'utf-8' as argument.
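
A small sketch of that decode call, written for Python 3 (where the byte string type is `bytes` and the result is `str`). The sample bytes are an assumption: the UTF-8 encoding of "ген". Note that decoding with a codec that does not match the data tends to produce mojibake rather than an error.

```python
raw = b"\xd0\xb3\xd0\xb5\xd0\xbd"  # UTF-8 bytes for "ген" (sample data)

text = raw.decode("utf-8")     # matching codec: three Cyrillic characters
wrong = raw.decode("latin-1")  # mismatched codec: six unrelated characters

print(len(raw), len(text), len(wrong))
print(text)
```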

    > 4. In Python 3.0, this silliness goes away, because all strings are
    > Unicode by default.


    Yes and no. The problem just shifts because at some point you get into
    similar troubles, just in the other direction. Data enters the program
    as bytes and must leave it as bytes again, so you have to deal with
    encodings at those points.
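
One way to picture this is the "decode on the way in, encode on the way out" pattern. The sketch below is Python 3 and uses in-memory streams as stand-ins for a file or socket, which is purely an assumption for the demo.

```python
import io

# In-memory streams stand in for a file or socket (an assumption for the demo).
incoming = io.BytesIO("  2π \n".encode("utf-8"))
outgoing = io.BytesIO()

text = incoming.read().decode("utf-8")   # bytes -> text at the input boundary
cleaned = text.strip()                   # pure text work, no encoding involved
outgoing.write(cleaned.encode("utf-8"))  # text -> bytes at the output boundary

print(repr(cleaned), outgoing.getvalue())
```

Everything between the two boundaries deals only with text; the encoding name appears exactly twice.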

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Oct 17, 2008
    #4
  5. Joe Strout

    Joe Strout Guest

    Thanks for the answers. That clears things up quite a bit.

    >> What if your source file is set to utf-8? Do you then have a proper
    >> UTF-8 string, but the problem is that none of the standard Python
    >> library methods know how to properly interpret UTF-8?

    >
    > Well, the decode method knows how to decode those bytes into a
    > `unicode` object if you call it with 'utf-8' as argument.


    OK, good to know.

    >> 4. In Python 3.0, this silliness goes away, because all strings are
    >> Unicode by default.

    >
    > Yes and no. The problem just shifts because at some point you get
    > into similar troubles, just in the other direction. Data enters the
    > program as bytes and must leave it as bytes again, so you have to
    > deal with encodings at those points.


    Yes, but that's still much better than having to litter your code with
    'u' prefixes and .decode calls and so on. If I'm using a UTF-8-savvy
    text editor (as we all should be doing in the 21st century!), and type
    "foo = '2π'", I should get a string containing a '2' and a pi
    character, and all the text operations (like counting characters,
    etc.) should Just Work.

    When I read and write files or sockets or whatever, of course I'll
    have to think about what encoding the text should be... but internal
    to my own source code, I shouldn't have to.

    I understand the need for a transition strategy, which is what we have
    in 2.x, and that's working well enough. But I'll be glad when it's
    over. :)

    Cheers,
    - Joe
     
    Joe Strout, Oct 17, 2008
    #5
  6. > 2. Exactly what Unicode you get would be dependent on Python properly
    > interpreting the bytes in the source file -- which you can make it do by
    > adding something like "-*- coding: utf-8 -*-" in a comment at the top of
    > the file.


    That depends on the Python version. Up to (and including) 2.4, the bytes
    on the disk were interpreted as Latin-1 in absence of an encoding
    declaration. In 2.5, non-ASCII bytes in a file without an encoding
    declaration are an error. In 3.x, in absence of an encoding declaration,
    the bytes are interpreted as UTF-8 (giving an error when ill-formed UTF-8
    sequences are encountered).
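
The 3.x rule can be checked from Python itself. The sketch below uses `tokenize.detect_encoding` from the standard library to report which codec would be applied to a piece of source; the helper name `declared_encoding` and the two sample sources are assumptions made for illustration.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode these source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

no_cookie = 'kw = "π"\n'.encode("utf-8")
latin1_cookie = '# -*- coding: latin-1 -*-\nkw = "é"\n'.encode("latin-1")

print(declared_encoding(no_cookie))      # default when nothing is declared
print(declared_encoding(latin1_cookie))  # normalized name of the declared codec

# compile() applies the same rule when handed raw source bytes.
namespace = {}
exec(compile(latin1_cookie, "<demo>", "exec"), namespace)
print(namespace["kw"])
```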

    > 3. Without the "u" prefix, you'll have some 8-bit string, whose
    > interpretation is... er... here's where I get a bit fuzzy. What if your
    > source file is set to utf-8?


    You need to distinguish between the declared encoding, and the intended
    (editor) encoding also. Some editors (like Emacs or IDLE) interpret the
    declaration, others may not. What you see on the display is the editor's
    interpretation; what Python uses is the declared encoding.

    However, Python uses the declared encoding just for Unicode strings.

    > Do you then have a proper UTF-8 string,
    > but the problem is that none of the standard Python library methods know
    > how to properly interpret UTF-8?


    There is (probably) no such thing as a "proper UTF-8 string" (in the
    sense in which you probably mean it). Python doesn't have a data type
    for "UTF-8 string". It only has a data type "byte string". It's up to
    the application whether it gets interpreted in a consistent manner.
    Libraries are (typically) encoding-agnostic, i.e. they work for UTF-8
    encoded strings the same way as for, say, Big-5 encoded strings.
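
A rough illustration of "just bytes", in Python 3 terms (`bytes` standing in for the 2.x byte string). The sample text is an assumption. The same two characters become different byte sequences of different lengths under UTF-8 and Big-5, and nothing in the bytes themselves says which codec produced them.

```python
text = "中文"  # sample text (an assumption for the demo)

as_utf8 = text.encode("utf-8")
as_big5 = text.encode("big5")

# Byte-level operations see only bytes; the lengths differ per encoding.
print(len(text), len(as_utf8), len(as_big5))

# Each byte string decodes back to the text with the codec that produced it.
print(as_utf8.decode("utf-8") == as_big5.decode("big5"))
```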

    > 4. In Python 3.0, this silliness goes away, because all strings are
    > Unicode by default.


    You still need to make sure that the editor's encoding and the declared
    encoding match.

    Regards,
    Martin
     
    Martin v. Löwis, Oct 18, 2008
    #6
  7. gita ziabari

    Guest

    On Oct 17, 2:38 pm, Joe Strout wrote:
    > Thanks for the answers. That clears things up quite a bit.
    >
    > >> What if your source file is set to utf-8? Do you then have a proper
    > >> UTF-8 string, but the problem is that none of the standard Python
    > >> library methods know how to properly interpret UTF-8?

    >
    > > Well, the decode method knows how to decode those bytes into a
    > > `unicode` object if you call it with 'utf-8' as argument.

    >
    > OK, good to know.
    >
    > >> 4. In Python 3.0, this silliness goes away, because all strings are
    > >> Unicode by default.

    >
    > > Yes and no. The problem just shifts because at some point you get
    > > into similar troubles, just in the other direction. Data enters the
    > > program as bytes and must leave it as bytes again, so you have to
    > > deal with encodings at those points.

    >
    > Yes, but that's still much better than having to litter your code with
    > 'u' prefixes and .decode calls and so on. If I'm using a UTF-8-savvy
    > text editor (as we all should be doing in the 21st century!), and type
    > "foo = '2π'", I should get a string containing a '2' and a pi
    > character, and all the text operations (like counting characters,
    > etc.) should Just Work.
    >
    > When I read and write files or sockets or whatever, of course I'll
    > have to think about what encoding the text should be... but internal
    > to my own source code, I shouldn't have to.
    >
    > I understand the need for a transition strategy, which is what we have
    > in 2.x, and that's working well enough. But I'll be glad when it's
    > over. :)
    >
    > Cheers,
    > - Joe


    Thanks for the answers. The following factors should be considered when
    dealing with unicode characters in python:
    1. Declaring # -*- coding: utf-8 -*- at the top of the script
    2. Opening files with the appropriate encoding:
    txt = codecs.open(filename, 'w+', encoding='utf-8')

    My program works fine now. There is no specific way of adding unicode
    characters to lists or dictionaries. The character itself has to be in
    unicode.
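
Put together, the two points might look like the sketch below. It swaps in `io.open` for `codecs.open` (both accept an `encoding` argument; `io.open` is the built-in `open` in Python 3), and the file path and the dictionary key "ru" are assumptions made for the demo.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical file location for the demo.
filename = os.path.join(tempfile.mkdtemp(), "keywords.txt")

# Write text through an explicit encoding.
with io.open(filename, "w", encoding="utf-8") as txt:
    txt.write(u"генских\n")

# Read it back as text and store it in the dictionary as usual.
keyword = {}
with io.open(filename, "r", encoding="utf-8") as txt:
    for line in txt:
        keyword.setdefault(u"ru", []).append(line.strip())

print(keyword)
```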

    Cheers,

    Gita
     
    , Oct 19, 2008
    #7
  8. Joe Strout

    Joe Strout Guest

    On Oct 18, 2008, at 1:20 AM, Martin v. Löwis wrote:

    >> Do you then have a proper UTF-8 string,
    >> but the problem is that none of the standard Python library methods
    >> know how to properly interpret UTF-8?

    >
    > There is (probably) no such thing as a "proper UTF-8 string" (in the
    > sense in which you probably mean it).


    To be clear, I mean a string that is valid UTF-8 (not all strings of
    bytes are, of course).

    > Python doesn't have a data type
    > for "UTF-8 string". It only has a data type "byte string". It's up to
    > the application whether it gets interpreted in a consistent manner.
    > Libraries are (typically) encoding-agnostic, i.e. they work for UTF-8
    > encoded strings the same way as for, say, Big-5 encoded strings.


    Oi -- so if I ask for length, I get the number of bytes, not the
    number of characters. If I slice and dice, I could end up splitting
    characters in half. It is, as you say, just a string of bytes, not a
    string of characters.
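
Both effects are easy to reproduce. The sketch below is in Python 3 terms, with `bytes` standing in for the 2.x byte string and `str` for 2.x `unicode`; the sample text is an assumption.

```python
text = "2π"
raw = text.encode("utf-8")

print(len(text), len(raw))  # characters vs bytes

head = raw[:2]  # cuts the two-byte sequence for "π" in half
try:
    head.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(split_ok, text[:2] == text)
```

Slicing the text string keeps whole characters; slicing the byte string can leave a fragment that is no longer valid UTF-8.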

    >> 4. In Python 3.0, this silliness goes away, because all strings are
    >> Unicode by default.

    >
    > You still need to make sure that the editor's encoding and the
    > declared encoding match.


    Well, if no encoding is declared, it (quite sensibly) assumes
    UTF-8, so for my purposes this boils down to using a UTF-8 editor --
    which I always do anyway. But do I still have to put a "u" before my
    string literals in order to have it treated as characters rather than
    bytes?

    I'm hoping that the answer is "no" -- most string literals in a source
    file are text (which should be Unicode text, these days); a raw byte
    string would be the exceptional case, and I'd be happy to use the "r"
    prefix for those.

    Best,
    - Joe
     
    Joe Strout, Oct 19, 2008
    #8
  9. > Well, if no encoding is declared, it (quite sensibly) assumes UTF-8,
    > so for my purposes this boils down to using a UTF-8 editor -- which I
    > always do anyway. But do I still have to put a "u" before my string
    > literals in order to have it treated as characters rather than bytes?


    Yes.

    > I'm hoping that the answer is "no"


    Then you need to switch to Python 3.0, when it comes out. Its string
    literals denote unicode strings.
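
For reference, a short sketch of how the literal prefixes behave in Python 3. The variable names are only illustrative. An unprefixed literal is text, a byte string is marked with `b`, and `r` only switches off backslash escapes, so it does not make a byte string.

```python
plain = "2π"         # str: text, no prefix needed
data = b"2\xcf\x80"  # bytes: the `b` prefix marks a byte string
pattern = r"\d+"     # `r` means raw (backslashes kept), still text

print(type(plain).__name__, type(data).__name__, type(pattern).__name__)
print(data.decode("utf-8") == plain)
```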

    Regards,
    Martin
     
    Martin v. Löwis, Oct 19, 2008
    #9
