Inserting Unicode characters in a dictionary - Python


gita ziabari

Hello All,

The following code does not work for unicode characters:

keyword = dict()
kw = 'генских'
keyword.setdefault(key, []).append (kw)

It works fine for inserting ASCII characters. Any suggestions?

Thanks,

Gita
 

Marc 'BlackJack' Rintsch

> The following code does not work for unicode characters:
>
> keyword = dict()
> kw = 'генских'
> keyword.setdefault(key, []).append (kw)
>
> It works fine for inserting ASCII characters. Any suggestions?

What do you mean by "does not work"? And you are aware that the above
snippet doesn't involve any unicode characters!? You have a byte string
there -- type `str`, not `unicode`.
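
For illustration, a minimal Python 2 sketch of the difference (the dictionary key 'russian' is made up here, since `key` isn't defined in the original snippet):

# -*- coding: utf-8 -*-
kw_bytes = 'генских'    # type str: the raw UTF-8 bytes from the file
kw_text = u'генских'    # type unicode: seven characters

print type(kw_bytes)    # <type 'str'>
print type(kw_text)     # <type 'unicode'>

# The dict doesn't care either way -- it stores whatever object you give it.
keyword = dict()
keyword.setdefault('russian', []).append(kw_text)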

Ciao,
Marc 'BlackJack' Rintsch
 

Joe Strout

> What do you mean by "does not work"? And you are aware that the above
> snippet doesn't involve any unicode characters!? You have a byte string
> there -- type `str`, not `unicode`.

Just checking my understanding here -- are the following all true:

1. If you had prefixed that literal with a "u", then you'd have Unicode.

2. Exactly what Unicode you get would be dependent on Python properly
interpreting the bytes in the source file -- which you can make it do
by adding something like "-*- coding: utf-8 -*-" in a comment at the
top of the file.

3. Without the "u" prefix, you'll have some 8-bit string, whose
interpretation is... er... here's where I get a bit fuzzy. What if
your source file is set to utf-8? Do you then have a proper UTF-8
string, but the problem is that none of the standard Python library
methods know how to properly interpret UTF-8?

4. In Python 3.0, this silliness goes away, because all strings are
Unicode by default.

Thanks for any answers/corrections,
- Joe
 

Marc 'BlackJack' Rintsch

> Just checking my understanding here -- are the following all true:
>
> 1. If you had prefixed that literal with a "u", then you'd have Unicode.

Yes.

> 2. Exactly what Unicode you get would be dependent on Python properly
> interpreting the bytes in the source file -- which you can make it do by
> adding something like "-*- coding: utf-8 -*-" in a comment at the top of
> the file.

Yes, assuming the encoding named in that comment matches the actual
encoding of the file.

> 3. Without the "u" prefix, you'll have some 8-bit string, whose
> interpretation is... er... here's where I get a bit fuzzy.

No interpretation at all, just the bunch of bytes that happen to be in
the source file.

> What if your source file is set to utf-8? Do you then have a proper
> UTF-8 string, but the problem is that none of the standard Python
> library methods know how to properly interpret UTF-8?

Well, the decode method knows how to decode those bytes into a `unicode`
object if you call it with 'utf-8' as the argument.
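
Concretely, a minimal sketch of that decode step (Python 2; the Cyrillic word is the one from Gita's snippet):

# -*- coding: utf-8 -*-
raw = 'генских'              # bytes exactly as they sit in the source file
text = raw.decode('utf-8')   # now a unicode object

print len(raw)    # 14 -- bytes (each Cyrillic letter takes 2 bytes in UTF-8)
print len(text)   # 7  -- characters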
> 4. In Python 3.0, this silliness goes away, because all strings are
> Unicode by default.

Yes and no. The problem just shifts, because at some point you get into
similar trouble, just in the other direction. Data enters the program
as bytes and must leave it as bytes again, so you have to deal with
encodings at those points.
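
A minimal sketch of that decode-at-the-edges pattern (Python 2; the file names are hypothetical):

# -*- coding: utf-8 -*-
f = open('input.txt', 'rb')
data = f.read().decode('utf-8')    # bytes in -> unicode inside the program
f.close()

result = data.upper()              # work with characters, not bytes

out = open('output.txt', 'wb')
out.write(result.encode('utf-8'))  # unicode -> bytes on the way out
out.close()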

Ciao,
Marc 'BlackJack' Rintsch
 

Joe Strout

Thanks for the answers. That clears things up quite a bit.
> Well, the decode method knows how to decode those bytes into a `unicode`
> object if you call it with 'utf-8' as the argument.

OK, good to know.

> Yes and no. The problem just shifts, because at some point you get into
> similar trouble, just in the other direction. Data enters the program
> as bytes and must leave it as bytes again, so you have to deal with
> encodings at those points.

Yes, but that's still much better than having to litter your code with
'u' prefixes and .decode calls and so on. If I'm using a UTF-8-savvy
text editor (as we all should be doing in the 21st century!), and type
"foo = '2π'", I should get a string containing a '2' and a pi
character, and all the text operations (like counting characters,
etc.) should Just Work.
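
For example, in 2.x (with a utf-8 coding declaration) the 'u' prefix is what decides whether counting works on characters or bytes:

# -*- coding: utf-8 -*-
foo = '2π'      # byte string: len() counts bytes
bar = u'2π'     # unicode string: len() counts characters

print len(foo)  # 3 -- 'π' occupies two bytes in UTF-8
print len(bar)  # 2 -- the '2' and the 'π'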

When I read and write files or sockets or whatever, of course I'll
have to think about what encoding the text should be... but internal
to my own source code, I shouldn't have to.

I understand the need for a transition strategy, which is what we have
in 2.x, and that's working well enough. But I'll be glad when it's
over. :)

Cheers,
- Joe
 

Martin v. Löwis

> 2. Exactly what Unicode you get would be dependent on Python properly
> interpreting the bytes in the source file -- which you can make it do by
> adding something like "-*- coding: utf-8 -*-" in a comment at the top of
> the file.

That depends on the Python version. Up to (and including) 2.4, the bytes
on the disk were interpreted as Latin-1 in the absence of an encoding
declaration. In 2.5, not having an encoding declaration is an error. In
3.x, in the absence of an encoding declaration, the bytes are interpreted
as UTF-8 (giving an error when ill-formed UTF-8 sequences are encountered).
> 3. Without the "u" prefix, you'll have some 8-bit string, whose
> interpretation is... er... here's where I get a bit fuzzy. What if your
> source file is set to utf-8?

You also need to distinguish between the declared encoding and the
editor's encoding. Some editors (like Emacs or IDLE) interpret the
declaration; others may not. What you see on the display is the editor's
interpretation; what Python uses is the declared encoding.

However, Python uses the declared encoding just for Unicode strings.

> Do you then have a proper UTF-8 string, but the problem is that none of
> the standard Python library methods know how to properly interpret UTF-8?

There is (probably) no such thing as a "proper UTF-8 string" (in the
sense in which you probably mean it). Python doesn't have a data type
for "UTF-8 string". It only has a data type "byte string". It's up to
the application whether it gets interpreted in a consistent manner.
Libraries are (typically) encoding-agnostic, i.e. they work for UTF-8
encoded strings the same way as for, say, Big-5 encoded strings.
> 4. In Python 3.0, this silliness goes away, because all strings are
> Unicode by default.

You still need to make sure that the editor's encoding and the declared
encoding match.

Regards,
Martin
 

gitaziabari

Thanks for the answers. The following factors should be considered when
dealing with unicode characters in Python:
1. Declare # -*- coding: utf-8 -*- at the top of the script.
2. Open files with the appropriate encoding:
txt = codecs.open(filename, 'w+', encoding='utf-8')

My program works fine now. There is no special trick to adding unicode
characters to a list or dictionary -- the characters themselves just
have to be unicode.
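
A minimal sketch of that pattern (Python 2; the file name 'keywords.txt' is hypothetical):

# -*- coding: utf-8 -*-
import codecs

txt = codecs.open('keywords.txt', 'w+', encoding='utf-8')
txt.write(u'генских\n')    # pass unicode in; codecs encodes it to UTF-8 bytes
txt.seek(0)
line = txt.readline()      # comes back as a unicode object, already decoded
txt.close()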

Cheers,

Gita
 

Joe Strout

> There is (probably) no such thing as a "proper UTF-8 string" (in the
> sense in which you probably mean it).

To be clear, I mean a string that is valid UTF-8 (not all strings of
bytes are, of course).

> Python doesn't have a data type for "UTF-8 string". It only has a data
> type "byte string". It's up to the application whether it gets
> interpreted in a consistent manner. Libraries are (typically)
> encoding-agnostic, i.e. they work for UTF-8 encoded strings the same
> way as for, say, Big-5 encoded strings.

Oi -- so if I ask for length, I get the number of bytes, not the
number of characters. If I slice and dice, I could end up splitting
characters in half. It is, as you say, just a string of bytes, not a
string of characters.
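
A minimal sketch of that hazard (Python 2, utf-8 source file):

# -*- coding: utf-8 -*-
s = '2π'                   # three bytes: '2' plus the two bytes of 'π'
broken = s[:2]             # keeps '2' and only the first byte of 'π'
try:
    broken.decode('utf-8')
except UnicodeDecodeError:
    print 'sliced a character in half'

u = s.decode('utf-8')
print repr(u[:2])          # u'2\u03c0' -- slicing characters is safe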
> You still need to make sure that the editor's encoding and the declared
> encoding match.

Well, if no encoding is declared, it (quite sensibly) assumes UTF-8, so
for my purposes this boils down to using a UTF-8 editor -- which I
always do anyway. But do I still have to put a "u" before my string
literals in order to have them treated as characters rather than bytes?

I'm hoping that the answer is "no" -- most string literals in a source
file are text (which should be Unicode text, these days); a raw byte
string would be the exceptional case, and I'd be happy to use the "r"
prefix for those.

Best,
- Joe
 

Martin v. Löwis

> Well, if no encoding is declared, it (quite sensibly) assumes UTF-8, so
> for my purposes this boils down to using a UTF-8 editor -- which I
> always do anyway. But do I still have to put a "u" before my string
> literals in order to have them treated as characters rather than bytes?

Yes.

> I'm hoping that the answer is "no"

Then you need to switch to Python 3.0, when it comes out. Its string
literals denote unicode strings.
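
A sketch of how that looks in Python 3, where plain literals are text and bytes need an explicit b prefix:

foo = '2π'                  # type str -- real characters, no 'u' needed
raw = foo.encode('utf-8')   # type bytes: b'2\xcf\x80'

print(len(foo))   # 2 -- characters
print(len(raw))   # 3 -- bytes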

Regards,
Martin
 
