unicode issue


gentlestone

Why doesn't this code work on Python 2.6? Or how can I do this job?

_MAP = {
    # LATIN
    u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A',
    u'Æ': 'AE', u'Ç': 'C',
    u'È': 'E', u'É': 'E', u'Ê': 'E', u'Ë': 'E', u'Ì': 'I', u'Í': 'I',
    u'Î': 'I',
    u'Ï': 'I', u'Ð': 'D', u'Ñ': 'N', u'Ò': 'O', u'Ó': 'O', u'Ô': 'O',
    u'Õ': 'O', u'Ö': 'O',
    u'Ő': 'O', u'Ø': 'O', u'Ù': 'U', u'Ú': 'U', u'Û': 'U', u'Ü': 'U',
    u'Ű': 'U',
    u'Ý': 'Y', u'Þ': 'TH', u'ß': 'ss', u'à': 'a', u'á': 'a', u'â': 'a',
    u'ã': 'a', u'ä': 'a',
    u'å': 'a', u'æ': 'ae', u'ç': 'c', u'è': 'e', u'é': 'e', u'ê': 'e',
    u'ë': 'e',
    u'ì': 'i', u'í': 'i', u'î': 'i', u'ï': 'i', u'ð': 'd', u'ñ': 'n',
    u'ò': 'o', u'ó': 'o',
    u'ô': 'o', u'õ': 'o', u'ö': 'o', u'ő': 'o', u'ø': 'o', u'ù': 'u',
    u'ú': 'u',
    u'û': 'u', u'ü': 'u', u'ű': 'u', u'ý': 'y', u'þ': 'th', u'ÿ': 'y',
    # LATIN_SYMBOLS
    u'©': '(c)',
    # GREEK
    u'α': 'a', u'β': 'b', u'γ': 'g', u'δ': 'd', u'ε': 'e', u'ζ': 'z',
    u'η': 'h', u'θ': '8',
    u'ι': 'i', u'κ': 'k', u'λ': 'l', u'μ': 'm', u'ν': 'n', u'ξ': '3',
    u'ο': 'o', u'π': 'p',
    u'ρ': 'r', u'σ': 's', u'τ': 't', u'υ': 'y', u'φ': 'f', u'χ': 'x',
    u'ψ': 'ps', u'ω': 'w',
    u'ά': 'a', u'έ': 'e', u'ί': 'i', u'ό': 'o', u'ύ': 'y', u'ή': 'h',
    u'ώ': 'w', u'ς': 's',
    u'ϊ': 'i', u'ΰ': 'y', u'ϋ': 'y', u'ΐ': 'i',
    u'Α': 'A', u'Β': 'B', u'Γ': 'G', u'Δ': 'D', u'Ε': 'E', u'Ζ': 'Z',
    u'Η': 'H', u'Θ': '8',
    u'Ι': 'I', u'Κ': 'K', u'Λ': 'L', u'Μ': 'M', u'Ν': 'N', u'Ξ': '3',
    u'Ο': 'O', u'Π': 'P',
    u'Ρ': 'R', u'Σ': 'S', u'Τ': 'T', u'Υ': 'Y', u'Φ': 'F', u'Χ': 'X',
    u'Ψ': 'PS', u'Ω': 'W',
    u'Ά': 'A', u'Έ': 'E', u'Ί': 'I', u'Ό': 'O', u'Ύ': 'Y', u'Ή': 'H',
    u'Ώ': 'W', u'Ϊ': 'I', u'Ϋ': 'Y',
    # TURKISH
    u'ş': 's', u'Ş': 'S', u'ı': 'i', u'İ': 'I', u'ç': 'c', u'Ç': 'C',
    u'ü': 'u', u'Ü': 'U',
    u'ö': 'o', u'Ö': 'O', u'ğ': 'g', u'Ğ': 'G',
    # RUSSIAN
    u'а': 'a', u'б': 'b', u'в': 'v', u'г': 'g', u'д': 'd', u'е': 'e',
    u'ё': 'yo', u'ж': 'zh',
    u'з': 'z', u'и': 'i', u'й': 'j', u'к': 'k', u'л': 'l', u'м': 'm',
    u'н': 'n', u'о': 'o',
    u'п': 'p', u'р': 'r', u'с': 's', u'т': 't', u'у': 'u', u'ф': 'f',
    u'х': 'h', u'ц': 'c',
    u'ч': 'ch', u'ш': 'sh', u'щ': 'sh', u'ъ': '', u'ы': 'y', u'ь': '',
    u'э': 'e', u'ю': 'yu', u'я': 'ya',
    u'А': 'A', u'Б': 'B', u'В': 'V', u'Г': 'G', u'Д': 'D', u'Е': 'E',
    u'Ё': 'Yo', u'Ж': 'Zh',
    u'З': 'Z', u'И': 'I', u'Й': 'J', u'К': 'K', u'Л': 'L', u'М': 'M',
    u'Н': 'N', u'О': 'O',
    u'П': 'P', u'Р': 'R', u'С': 'S', u'Т': 'T', u'У': 'U', u'Ф': 'F',
    u'Х': 'H', u'Ц': 'C',
    u'Ч': 'Ch', u'Ш': 'Sh', u'Щ': 'Sh', u'Ъ': '', u'Ы': 'Y', u'Ь': '',
    u'Э': 'E', u'Ю': 'Yu', u'Я': 'Ya',
    # UKRAINIAN
    u'Є': 'Ye', u'І': 'I', u'Ї': 'Yi', u'Ґ': 'G', u'є': 'ye', u'і': 'i',
    u'ї': 'yi', u'ґ': 'g',
    # CZECH
    u'č': 'c', u'ď': 'd', u'ě': 'e', u'ň': 'n', u'ř': 'r', u'š': 's',
    u'ť': 't', u'ů': 'u',
    u'ž': 'z', u'Č': 'C', u'Ď': 'D', u'Ě': 'E', u'Ň': 'N', u'Ř': 'R',
    u'Š': 'S', u'Ť': 'T', u'Ů': 'U', u'Ž': 'Z',
    # POLISH
    u'ą': 'a', u'ć': 'c', u'ę': 'e', u'ł': 'l', u'ń': 'n', u'ó': 'o',
    u'ś': 's', u'ź': 'z',
    u'ż': 'z', u'Ą': 'A', u'Ć': 'C', u'Ę': 'e', u'Ł': 'L', u'Ń': 'N',
    u'Ó': 'o', u'Ś': 'S',
    u'Ź': 'Z', u'Ż': 'Z',
    # LATVIAN
    u'ā': 'a', u'č': 'c', u'ē': 'e', u'ģ': 'g', u'ī': 'i', u'ķ': 'k',
    u'ļ': 'l', u'ņ': 'n',
    u'š': 's', u'ū': 'u', u'ž': 'z', u'Ā': 'A', u'Č': 'C', u'Ē': 'E',
    u'Ģ': 'G', u'Ī': 'i',
    u'Ķ': 'k', u'Ļ': 'L', u'Ņ': 'N', u'Š': 'S', u'Ū': 'u', u'Ž': 'Z'
}

def downcode(name):
    """
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name
 

Andre Engels

Why doesn't this code work on Python 2.6? Or how can I do this job?

Please be more specific than "it doesn't work":
* What exactly are you doing
* What were you expecting the result of that to be
* What is the actual result?
 

gentlestone

Please be more specific than "it doesn't work":
* What exactly are you doing
* What were you expecting the result of that to be
* What is the actual result?

* What exactly are you doing
replace non-ASCII characters - see the doctest in the docstring

* What were you expecting the result of that to be
see the doctest in the docstring

* What is the actual result?
the actual result is an unchanged name
 

Andre Engels

I get the feeling that the problem is with the Python interactive
mode. It does not have full unicode support, so u"Žabovitá zmiešaná
kaša" is changed to u'\x8eabovit\xe1 zmie\x9aan\xe1 ka\x9aa'. If you
call your code from another program, it might work correctly.
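For illustration, a minimal sketch of that round trip (the cp1252 console code page is an assumption; it reproduces the repr above):

typed = u"Žabovitá zmiešaná kaša".encode('cp1252')   # the bytes the console delivers
print repr(typed.decode('latin-1'))                  # treating them as latin-1 code points
# prints u'\x8eabovit\xe1 zmie\x9aan\xe1 ka\x9aa'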
 

gentlestone

I get the feeling that the problem is with the Python interactive
mode. It does not have full unicode support, so u"Žabovitá zmiešaná
kaša" is changed to u'\x8eabovit\xe1 zmie\x9aan\xe1 ka\x9aa'. If you
call your code from another program, it might work correctly.

thx a lot

I spent 2 days of my life because of this

so it seems doctests are unusable for non-English users in Python
 

gentlestone

thx a lot

I spent 2 days of my life because of this

so it seems doctests are unusable for non-English users in Python

yes, you are right, now it works:

def slugify(name):
    """ u'zabovita-zmiesana-kasa-sro'
    """
    # assumes Django's slugify is available: from django.template import defaultfilters
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return defaultfilters.slugify(name)
 

Dave Angel

gentlestone said:
Why doesn't this code work on Python 2.6? Or how can I do this job?

[snip _MAP]

def downcode(name):
    """
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name

Works for me:

rrr = downcode(u"Žabovitá zmiešaná kaša")
print repr(rrr)
print rrr

prints out:

u'Zabovita zmiesana kasa'
Zabovita zmiesana kasa

I did have to add an encoding declaration as line 2 of the file:

#-*- coding: latin-1 -*-

and I had to convince my editor (Komodo) to save the file in utf-8.

DaveA
 

gentlestone

Works for me:

rrr = downcode(u"Žabovitá zmiešaná kaša")
print repr(rrr)
print rrr

prints out:

u'Zabovita zmiesana kasa'
Zabovita zmiesana kasa

I did have to add an encoding declaration as line 2 of the file:

#-*- coding: latin-1 -*-

and I had to convince my editor (Komodo) to save the file in utf-8.

DaveA

great, thank you all, I changed utf-8 to latin-1 in the header and it
works for me too

how much time I could have saved by just asking in this forum
 

saeed.gnu

I recommend using UTF-8 encoding (especially on GNU/Linux); then write
this in the second line:
#-*- coding: utf-8 -*-
 

Mark Tolonen

Dave Angel said:
Works for me:

rrr = downcode(u"Žabovitá zmiešaná kaša")
print repr(rrr)
print rrr

prints out:

u'Zabovita zmiesana kasa'
Zabovita zmiesana kasa

I did have to add an encoding declaration as line 2 of the file:

#-*- coding: latin-1 -*-

and I had to convince my editor (Komodo) to save the file in utf-8.

Why declare latin-1 and save in utf-8? I'm not sure how you got that to work
because those encodings aren't equivalent. I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "testit.py", line 1
SyntaxError: encoding problem: utf-8

In fact, some of the characters in the above code don't map to latin-1.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0150' in
position 309: ordinal not in range(256)
-Mark
 

Piet van Oostrum

Dave Angel said:
DA> Works for me:
DA> rrr = downcode(u"Žabovitá zmiešaná kaša")
DA> print repr(rrr)
DA> print rrr
DA> prints out:
DA> u'Zabovita zmiesana kasa'
DA> Zabovita zmiesana kasa
DA> I did have to add an encoding declaration as line 2 of the file:
DA> #-*- coding: latin-1 -*-
DA> and I had to convince my editor (Komodo) to save the file in utf-8.

*Seems to work*.
If you save in utf-8 the coding declaration also has to be utf-8.
Besides, many of these characters won't be representable in latin-1.
The reason it worked is that these characters were translated into two-
or more-byte sequences and replace did work with these. But it's
dangerous, as they are then no longer the unicode characters they were
intended to be.
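A one-character sketch of that danger:

raw = u'Ž'.encode('utf-8')        # the two bytes the editor writes to disk: '\xc5\xbd'
literal = raw.decode('latin-1')   # what a latin-1 declaration makes of those bytes
print repr(literal)               # u'\xc5\xbd' -- two characters, no longer u'Ž'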
 

Dave Angel

Piet said:
*Seems to work*.
If you save in utf-8 the coding declaration also has to be utf-8.
Besides, many of these characters won't be representable in latin-1.
The reason it worked is that these characters were translated into two-
or more-byte sequences and replace did work with these. But it's
dangerous, as they are then no longer the unicode characters they were
intended to be.
Thanks for the correction. What I meant by "works for me" is that the
single example in the docstring translated okay. But I do have a lot to
learn about using Unicode in sources, and I want to learn.

So tell me, how were we supposed to guess what encoding the original
message used? I originally had the mailing list message (in Thunderbird
email). When I copied (copy/paste) to Komodo IDE (text editor), it
wouldn't let me save because the file type was ASCII. So I randomly
chose latin-1 as the file type, and it seemed to like it.

At that point I expected and got errors from Python because I had no
coding declaration. I used latin-1, and still had problems, though I
forget what they were. Only when I changed the file encoding type again,
to utf-8, did the errors go away. I agree that they should agree, but I
don't know how to reconcile the copy/paste boundary, the file type
(without BOM, which is another variable), the coding declaration, and
the stdout implicit ASCII encoding. I understand a bunch of it, but not
enough to be able to safely walk through the choices.

Is this all written up in one place, to where an experienced programmer
can make sense of it? I've nibbled at the edges (even wrote a UTF-8
encoder/decoder a dozen years ago).

DaveA
 

Piet van Oostrum

Dave Angel said:
DA> Thanks for the correction. What I meant by "works for me" is that the
DA> single example in the docstring translated okay. But I do have a lot to
DA> learn about using Unicode in sources, and I want to learn.
DA> So tell me, how were we supposed to guess what encoding the original
DA> message used? I originally had the mailing list message (in Thunderbird
DA> email). When I copied (copy/paste) to Komodo IDE (text editor), it wouldn't
DA> let me save because the file type was ASCII. So I randomly chose latin-1
DA> as the file type, and it seemed to like it.

You can see the encoding of the message in its headers. But it is not
important, as the Unicode characters you see are what it is about. You
just copy and paste them in your Python file. The Python file does not
have to use the same encoding as the message from which you pasted. The
editor will do the proper conversion. (If it doesn't throw it away
immediately.) Only for the Python file you must choose an encoding that
can encode all the characters that are in the file. In this case utf-8
is the only reasonable choice, but if there are only latin-1 characters
in the file then of course latin-1 (iso-8859-1) will also be good.

Any decent editor will only allow you to save in an encoding that can
encode all the characters in the file, otherwise you will lose some
characters.

Because Python must also know which encoding you used and this is not in
itself deducible from the file contents, you need the coding
declaration. And it must be the same as the encoding in which the file
is saved, otherwise Python will see something different than you saw in
your editor. Sooner or later this will give you a big headache.
DA> At that point I expected and got errors from Python because I had no coding
DA> declaration. I used latin-1, and still had problems, though I forget what
DA> they were. Only when I changed the file encoding type again, to utf-8, did
DA> the errors go away. I agree that they should agree, but I don't know how to
DA> reconcile the copy/paste boundary, the file type (without BOM, which is
DA> another variable), the coding declaration, and the stdout implicit ASCII
DA> encoding. I understand a bunch of it, but not enough to be able to safely
DA> walk through the choices.
DA> Is this all written up in one place, to where an experienced programmer can
DA> make sense of it? I've nibbled at the edges (even wrote a UTF-8
DA> encoder/decoder a dozen years ago).

I don't know a place. Usually utf-8 is a safe bet but in some cases can
be overkill. And then in your Python input/output (read/write) you may
have to use a different encoding if the programs that you have to
communicate with expect something different.
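For what it's worth, a minimal skeleton along those lines (the iso-8859-2 target below is only an assumed example of a program that expects something different):

# -*- coding: utf-8 -*-
import codecs

NAME = u"Žabovitá zmiešaná kaša"   # read back intact because the declaration
                                   # matches the encoding the file is saved in

out = codecs.open('names.txt', 'w', encoding='iso-8859-2')
out.write(NAME)                    # explicit encoding at the I/O boundary
out.close()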
 

Dave Angel

Piet said:
[snip]

I know what I was missing. The copy/paste must be doing it in pure
Unicode. And the in-memory version of the source text is in Unicode.
So the text editor's encoding affects how that Unicode is encoded into 8
bit bytes for the file (and how it will be reloaded next time). OK,
that seems to make sense.

I know that the clipboard has type tags, but I haven't looked at them in
so long that I forget what they look like. For text, is it just ASCII
and Unicode? Or are there other possible encodings that the source and
sink negotiate?

Thanks for the clear explanation.

DaveA
 

gentlestone

save in utf-8 the coding declaration also has to be utf-8

ok, I understand, but what's the problem? Unfortunately it seems the
Python interactive mode doesn't have Unicode support. It recognizes
the latin-1 encoding only.

So I have 2 options for how to write the doctest:
1. Replace native characters with their encoded representation, like
u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" instead of u"Žabovitá
zmiešaná kaša"
2. Use a latin-1 coding declaration while the file is saved in utf-8

The first is bad because doctest is a great documentation tool and it
is probably the main reason I use Python. And something like
u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" is not the best
documentation style. But the tests work.

The second is bad because the declaration is incorrect, and if I use
it in a Django model declaration, for example, I get bad data in the
application.

So what is the solution? Back to Java? :)
 

Dave Angel

gentlestone said:
[snip]
So what is the solution? Back to Java? :)
Wait -- don't give up yet. Since I'm one of the ones who (partially)
steered you wrong, let me try to help.

The key variable here is how your text editor behaves. Since I've never
taken my (programming) text editor out of ASCII mode before this week,
it took some experimenting (and more importantly a message from Piet on
this thread) to make sense of things. I think I now know how to make my
own editor (Komodo IDE) behave in this environment, and you probably can
do as well or better. In fact, judging from your messages, you probably
are doing much better on the editor front.

When I tried this morning to re-open that test file from yesterday, many
of the characters were all messed up. I was okay as long as the project
was still open, but not today. The editor itself apparently looks to
that encoding declaration when it's deciding how to interpret the bytes
on disk.

So I did the following, using Komodo IDE. I created a new file in the
project. Before saving it, I used
Edit->CurrentFileSettings->Properties->Encoding to set it to UTF-8.
*NOW* I pasted the stuff from your email message. And added the
#-*- coding: utf-8 -*-

as the second line of the file. Notice it's *NOT* latin-1.

At this point I save and run the file, and it seems to work fine.

My guess is that I could set these as default settings in Komodo, if I
were doing UTF-8 very often, and it would become painless. I know I
have certain stuff in my python template, and could add that encoding
line as well.


Anyway, that gets us to the step of running the doctest. The trick here
seems to be that we need to define the docstring as a Unicode docstring
to have it interpreted correctly. Try adding the u in front of the
triple quote as follows:

def downcode(name):
    u"""
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name

Now, if the doctest passes, we seem to be in good shape.

There's another problem, that hopefully somebody else can help with.
That's if doctest needs to report an error. When I deliberately changed
the "expect" string I get an error like the following.

UnicodeEncodeError: 'ascii' codec can't encode character u'\u017d' in
position 150: ordinal not in range(128)

I get a similar error if running the -v option on doctest. (Note that
I do *NOT* get the error when running inside Komodo. And what I've read
implies that the same would be true if running inside IDLE.) The
problem is similar to the one you'd have doing a simple:

print u"\u017d"

I think these are avoided if sys.stdout.encoding (and maybe
sys.stderr.encoding) are set to utf-8. On my system they're set to
None, which says to use "the system default encoding." On my system
that would be ASCII, so I get the error. But perhaps yours is already
something better.
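A quick way to check which situation you are in (just a sketch; the values in the comments are typical, not guaranteed):

import sys
print sys.stdout.encoding        # None on a redirected stream; a console may report 'cp852' or 'UTF-8'
print sys.getdefaultencoding()   # normally 'ascii' on Python 2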

I found links:
http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/
http://wiki.python.org/moin/PrintFails

http://lists.macromates.com/textmate/2008-June/025735.html
which indicate you may want to try:

set LC_CTYPE=en_GB.utf-8 python

at the command prompt before running python. This could be system specific; it didn't work for me on XP.

The workaround that works for me (so far) is:

if __name__ == "__main__":
    import sys, codecs
    sys.stdout = codecs.getwriter('utf8')(sys.stdout)

    print u"Žabovitá zmiešaná kaša"
    import doctest
    doctest.testmod()

The codecs line tells python that stdout should use utf-8. That doesn't make the characters look good on my console, but at least it avoids the errors. I'm guessing that on my system I should use latin1 here instead of utf8. But I don't want to confuse things.


HTH

DaveA
 

Hyuga

Why doesn't this code work on Python 2.6? Or how can I do this job?

[snip _MAP]

def downcode(name):
    """
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name

Though CPython is pretty optimized under the hood for this sort of
single-character replacement, this still seems pretty inefficient
since you're calling replace for every character you want to map. I
think that a better approach might be something like:

def downcode(name):
    return ''.join(_MAP.get(c, c) for c in name)

Or using string.translate:

import string
def downcode(name):
    table = string.maketrans(
        'ÀÁÂÃÄÅ...',
        'AAAAAA...')
    return name.translate(table)
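Since every key in _MAP is a single character, another option (a sketch; unicode.translate in Python 2 takes a dict keyed by code points and, unlike string.maketrans, allows multi-character replacements such as 'AE') is to build the table once and let translate do the loop:

_TABLE = dict((ord(key), unicode(value)) for key, value in _MAP.iteritems())

def downcode(name):
    return name.translate(_TABLE)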
 

Walter Dörwald

[snip]

Or even simpler:

import unicodedata

def downcode(name):
    return unicodedata.normalize("NFD", name)\
        .encode("ascii", "ignore")\
        .decode("ascii")

Servus,
Walter
 

Rami Chowdhury

Why doesn't this code work on Python 2.6? Or how can I do this job?

[snip _MAP]

[snip]

Or even simpler:

import unicodedata

def downcode(name):
    return unicodedata.normalize("NFD", name)\
        .encode("ascii", "ignore")\
        .decode("ascii")

Servus,
Walter

As I understand it, the "ignore" argument to str.encode *removes* the
unencodable characters, rather than replacing them with an ASCII
approximation. Is that correct? If so, wouldn't that rather defeat the
purpose?
 

Peter Otten

Rami said:
[snip]

Or even simpler:

import unicodedata

def downcode(name):
    return unicodedata.normalize("NFD", name)\
        .encode("ascii", "ignore")\
        .decode("ascii")

Servus,
Walter

As I understand it, the "ignore" argument to str.encode *removes* the
unencodable characters, rather than replacing them with an ASCII
approximation. Is that correct? If so, wouldn't that rather defeat the
purpose?

You didn't take the normalization step into consideration. Example:
'A'
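A minimal sketch producing that 'A' (the input character u'Ä' is an assumption):

import unicodedata
decomposed = unicodedata.normalize("NFD", u"Ä")    # u'A' + u'\u0308' (combining diaeresis)
print repr(decomposed.encode("ascii", "ignore"))   # 'A' -- only the combining mark is dropped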
 
