unicode issue



Walter Dörwald

Why doesn't this code work on Python 2.6? Or how else can I get this done?

[snip _MAP]

def downcode(name):
    """
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name

Though CPython is fairly well optimized under the hood for this sort of
single-character replacement, this is still inefficient, since replace()
is called once per mapped character and each call scans the whole string.
A better approach might be something like:

def downcode(name):
    return ''.join(_MAP.get(c, c) for c in name)

Or using string.translate:

import string

def downcode(name):
    table = string.maketrans(
        'ÀÁÂÃÄÅ...',
        'AAAAAA...')
    return name.translate(table)
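One caveat worth noting: in Python 2, string.maketrans only accepts 8-bit byte strings of equal length, so calling it with unicode arguments like these would fail. In Python 3, str.maketrans builds a code-point mapping and handles non-ASCII directly; a minimal sketch (Python 3 syntax, with an illustrative partial table):

```python
# Python 3: str.maketrans maps code points, so it works with non-ASCII
# characters (unlike Python 2's string.maketrans, which only accepts
# 8-bit byte strings).
table = str.maketrans("ÀÁÂÃÄÅ", "AAAAAA")
print("Ångström".translate(table))  # -> Angström (only Å is in the table)
```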

Or even simpler:

import unicodedata

def downcode(name):
    return unicodedata.normalize("NFD", name) \
        .encode("ascii", "ignore") \
        .decode("ascii")

Servus,
Walter

As I understand it, the "ignore" argument to str.encode *removes* the
undecodable characters, rather than replacing them with an ASCII
approximation. Is that correct? If so, wouldn't that rather defeat the
purpose?

Yes, but any accented characters have been split into the base character
and the combining accent via normalize() before, so only the accent gets
removed. Of course non-decomposable characters will be removed
completely, but it would be possible to replace

.encode("ascii", "ignore").decode("ascii")

with something like this:

u"".join(c for c in name if unicodedata.category(c) != "Mn")
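Putting normalize() and the category filter together (shown here in Python 3 syntax, where every str is unicode; under Python 2 the literals would need a u prefix), the whole approach reads:

```python
import unicodedata

def downcode(name):
    # Decompose accented characters into base character + combining mark,
    # then drop only the combining marks (Unicode category "Mn"), keeping
    # non-decomposable characters instead of deleting them outright.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(downcode("Žabovitá zmiešaná kaša"))  # -> Zabovita zmiesana kasa
```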

Servus,
Walter
 

Rami Chowdhury

Yes, but any accented characters have been split into the base character
and the combining accent via normalize() before, so only the accent gets
removed. Of course non-decomposable characters will be removed
completely, but it would be possible to replace

.encode("ascii", "ignore").decode("ascii")

with something like this:

u"".join(c for c in name if unicodedata.category(c) != "Mn")

Servus,
Walter

Thank you for the clarification!
 

Neil Hodgson

Dave Angel:
I know that the clipboard has type tags, but I haven't looked at them in
so long that I forget what they look like. For text, is it just ASCII
and Unicode? Or are there other possible encodings that the source and
sink negotiate?

Normally the clipboard differentiates between Unicode text and
locale-dependent 8-bit text. Depending on the platform, Unicode text may
be in UTF-8 (Linux) or UTF-16 (Windows). The encoding of 8-bit text
strings is not well defined; it is normally assumed to be compatible with
whatever is currently in the document, or with the current user-interface
encoding.
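To make the difference concrete, here is a small sketch (Python 3) of how the same text is represented in the two Unicode encodings a clipboard might carry:

```python
# The same text occupies a different number of bytes depending on the
# clipboard's Unicode encoding: UTF-8 (typical on Linux) vs. UTF-16
# (Windows; shown little-endian without a BOM here).
text = "Žaba"
utf8 = text.encode("utf-8")       # b'\xc5\xbdaba' -- Ž takes two bytes
utf16 = text.encode("utf-16-le")  # b'}\x01a\x00b\x00a\x00' -- two bytes per char
print(len(utf8), len(utf16))      # -> 5 8
```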

Neil
 

gentlestone

Thanks for the useful advice; the suggestions seem very clever.

Thanks to the Django users community, I've got a nice solution for
avoiding unicode problems in doctests:

"""
>>> ...("Šafářová".decode('utf-8'))
<Osoba: Šafářová Ľudmila>
"""

The solution is simply not to use unicode string literals at all.
Instead, create a unicode object by explicitly decoding a bytestring
with the proper codec.
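In Python 3 terms, where the distinction is between bytes and str, the same idea looks like this (a sketch; the byte literal is simply the UTF-8 encoding of the name above):

```python
# Build the text explicitly by decoding a UTF-8 byte string, instead of
# relying on how a string literal in the source file is interpreted.
raw = b'\xc5\xa0af\xc3\xa1\xc5\x99ov\xc3\xa1'  # UTF-8 bytes of "Šafářová"
name = raw.decode('utf-8')
print(name)  # -> Šafářová
```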
 

Gabriel Genellina

_MAP = {
    # LATIN
    u'À': 'A', u'Á': 'A', u'Â': 'A', u'Ã': 'A', u'Ä': 'A', u'Å': 'A',
    u'Æ': 'AE', u'Ç': 'C', [...long table...]
}

def downcode(name):
    """
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name

import unicodedata

def downcode(name):
    return unicodedata.normalize("NFD", name) \
        .encode("ascii", "ignore") \
        .decode("ascii")

This article [1] shows a mixed technique, decomposing characters when such
info is available in the Unicode tables, and also allowing for a custom
mapping when not.

[1] http://effbot.org/zone/unicode-convert.htm
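A rough sketch of that mixed technique (Python 3 syntax; the _FALLBACK table and its entries are illustrative assumptions, not taken from the article): decompose first, drop combining marks, and consult a custom table for characters that have no decomposition in the Unicode tables:

```python
import unicodedata

# Hypothetical fallback table for characters (like Æ or ß) that have no
# canonical decomposition, so NFD alone cannot strip them to ASCII.
_FALLBACK = {"Æ": "AE", "æ": "ae", "ß": "ss", "Ø": "O", "ø": "o"}

def downcode(name):
    out = []
    for c in unicodedata.normalize("NFD", name):
        if unicodedata.category(c) == "Mn":
            continue                      # drop combining accents
        out.append(_FALLBACK.get(c, c))   # custom mapping, else keep as-is
    return "".join(out)

print(downcode("Žofia Ångström æther"))  # -> Zofia Angstrom aether
```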
 
