Looking for UNICODE to ASCII Conversion Example Code


caldwellinva

Hi!

I am looking for an example of a UNICODE to ASCII conversion that will remove diacritics from characters (and leave the base characters, i.e., Klüft to Kluft) as well as handle the conversion of other characters, like große to grosse.

There used to be a program called any2ascii.py (http://www.haypocalc.com/perso/prog/python/any2ascii.py) that worked well, but the link is now broken and I can't seem to locate it.

I have seen the page "Unicode strings to ASCII ...nicely", http://www.peterbe.com/plog/unicode-to-ascii, but am looking for a working example.

Thank you!
 

Steven D'Aprano

Hi!

I am looking for an example of a UNICODE to ASCII conversion
that will remove diacritics from characters (and leave the base
characters, i.e., Klüft to Kluft) as well as handle the conversion of
other characters, like große to grosse.

Seems like a nasty thing to do, akin to stripping the vowels from English
text just because Hebrew didn't write them. But if you insist, there's
always this:

http://code.activestate.com/recipes/251871

although it is nowhere near complete, and it's pretty ugly code too.

Perhaps a cleaner method might be to use a combination of Unicode
normalisation forms and a custom translation table. Here's a basic
version to get you started, written for Python 3:

import unicodedata

# Do this once. It may take a while.
table = {}
for n in range(128, 0x11000):
    # In Python 2, use unichr(n) instead of chr(n).
    expanded = unicodedata.normalize('NFKD', chr(n))
    keep = [c for c in expanded if ord(c) < 128]
    if keep:
        table[n] = ''.join(keep)
    else:
        # None to delete, or use some other replacement string.
        table[n] = None

# Add extra transformations.
# In Python 2, every string needs to be a Unicode string u'xyz'.
table[ord('ß')] = 'ss'
table[ord('\N{LATIN CAPITAL LETTER SHARP S}')] = 'SS'
table[ord('Æ')] = 'AE'
table[ord('æ')] = 'ae'
table[ord('Œ')] = 'OE'
table[ord('œ')] = 'oe'
table[ord('ﬁ')] = 'fi'
table[ord('ﬂ')] = 'fl'
table[ord('ø')] = 'oe'
table[ord('Ð')] = 'D'
table[ord('Þ')] = 'TH'
# etc.

# Say you don't want control characters in your string, you might
# escape them using caret ^C notation:
for i in range(32):
    table[i] = '^%c' % (ord('@') + i)

table[127] = '^?'

# But it's probably best if you leave newlines, tabs etc. alone...
for c in '\n\r\t\f\v':
    del table[ord(c)]

# Add any more transformations you like here. Perhaps you want to
# transliterate Russian and Greek characters to English?
table[whatever] = whatever

# In Python 2, skip this step: unicode.translate accepts the dict of
# ordinals directly, without a maketrans call.
table = str.maketrans(table)



That's a fair chunk of work, but it only needs to be done once, at the
start of your application. Then you call it like this:

cleaned = 'some Unicode string'.translate(table)
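To see the shape of the whole thing end to end, here's a condensed, runnable sketch of the same table-building approach, with just the one extra 'ß' transformation added (all names here are illustrative, not part of any library):

```python
import unicodedata

# Build the NFKD-decomposition table once, as above.
table = {}
for n in range(128, 0x11000):
    keep = [c for c in unicodedata.normalize('NFKD', chr(n)) if ord(c) < 128]
    table[n] = ''.join(keep) or None  # empty result means "delete"

table[ord('ß')] = 'ss'  # one example of a hand-added transformation
trans = str.maketrans(table)

print('Klüft große'.translate(trans))  # Kluft grosse
```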

If you really want to be fancy, you can extract the name of each Unicode
code point (if it has one!) and parse the name. Here's an example:

py> unicodedata.name('ħ')
'LATIN SMALL LETTER H WITH STROKE'
py> unicodedata.lookup('LATIN SMALL LETTER H')
'h'

but I'd only do that after the normalization step, if at all.
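A minimal sketch of that name-parsing idea might look like this. It assumes the 'LETTER X WITH modifier' naming convention, which holds for Latin letters but is not universal, so the helper (a hypothetical one, not a stdlib function) falls back to the original character when a lookup fails:

```python
import unicodedata

def strip_modifiers(ch):
    # Look up the code point's name, drop everything from ' WITH ' on,
    # and look the shortened name back up.
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return ch  # unnamed code point: leave it alone
    base = name.partition(' WITH ')[0]
    try:
        return unicodedata.lookup(base)
    except KeyError:
        return ch  # shortened name isn't a real character

print(strip_modifiers('ħ'))  # h
```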

Too much work for your needs? Well, you can get about 80% of the way in
only a few lines of code:

# unistr is your input Unicode string.
cleaned = unicodedata.normalize('NFKD', unistr)
for before, after in (
        ('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'), ('œ', 'oe'),
        # put any more transformations here...
        ):
    cleaned = cleaned.replace(before, after)

# Use 'ignore' instead of 'replace' if you'd rather silently drop,
# not mark with '?', whatever still can't be encoded.
cleaned = cleaned.encode('ascii', 'replace').decode('ascii')


Another method would be this:

http://effbot.org/zone/unicode-convert.htm


which is focused on European languages. But it might suit your purposes.

There used to be a program called any2ascii.py
(http://www.haypocalc.com/perso/prog/python/any2ascii.py) that worked
well, but the link is now broken and I can't seem to locate it.

I have seen the page "Unicode strings to ASCII ...nicely",
http://www.peterbe.com/plog/unicode-to-ascii, but am looking for a
working example.

He has a working example. How much hand-holding are you looking for?

Quoting from that page:

I'd much rather that a word like "Klüft" is converted to
"Kluft" which will be more human readable and still correct.


The author is wrong. That's like saying that changing the English word
"car" to "cer" is still correct -- it absolutely is not correct, and even
if it were, what is he implying with the quip about "more human
readable"? That Germans and other Europeans aren't human?

If an Italian said:

I'd much rather that a word like "jump" is converted to
"iump" which will be more human readable and still correct.

we'd all agree that he was talking rubbish.

Make no mistake, this sort of simple-minded stripping of accents and
diacritics is an extremely ham-fisted thing to do. To strip out letters
without changing the meaning of the words is, at best, hard to do right,
requiring good knowledge of the linguistic rules of the language
you're translating. And at worst, it's outright impossible. For instance,
in German I believe it is quite acceptable to translate 'ü' to 'ue',
except in names: Herr Müller will probably be quite annoyed if you call
him Herr Mueller, and Herr Mueller will probably be annoyed too, and both
of them will be peeved to be confused with Herr Muller.
 

caldwellinva

Zero/Steven ... thank you for your replies ... they were both very helpful, both in addressing the immediate issue and in getting a better understanding of the context of the conversion. Greatly appreciate your taking the time for such good solutions.
 

Zero Piraeus


Make no mistake, this sort of simple-minded stripping of accents and
diacritics is an extremely ham-fisted thing to do.

I used to live on a street called Calle Colón, so I'm aware of the
dangers of stripping diacritics:

https://es.wikipedia.org/wiki/Colón
https://es.wikipedia.org/wiki/Colon

... although in that particular case, there's a degree of poetic justice
in confusing Cristóbal Colón / Christopher Columbus with the back end of
a digestive tract:

http://theoatmeal.com/comics/columbus_day

Joking aside, there is a legitimate use for asciifying text in this way:
creating unambiguous identifiers.

For example, a miscreant may create the username 'míguel' in order to
pose as another user 'miguel', relying on other users' inattentiveness.
Asciifying is one way of reducing the risk of that.
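One way to sketch that check (a toy `skeleton` helper invented for illustration, not a full confusable-detection scheme like the one in Unicode's security mechanisms):

```python
import unicodedata

def skeleton(username):
    # Reduce a name to a lowercase ASCII skeleton, so that lookalike
    # registrations collide and can be rejected at signup time.
    decomposed = unicodedata.normalize('NFKD', username.casefold())
    return ''.join(c for c in decomposed if ord(c) < 128)

print(skeleton('míguel') == skeleton('miguel'))  # True
```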

-[]z.
 

Roy Smith

Zero Piraeus said:
For example, a miscreant may create the username 'míguel' in order to
pose as another user 'miguel', relying on other users' inattentiveness.
Asciifying is one way of reducing the risk of that.

Determining if two strings are "almost the same" is not easy. If míguel
and miguel are to be considered the same, then why not also consider
michael to be the same? Or, for that matter, mike, mikey, or mick?
There's no easy answer, and what's the right answer for some
applications will be the wrong answer for others.

A reasonable place to start exploring this topic is
https://en.wikipedia.org/wiki/String_metric.
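For instance, Levenshtein distance, one of the metrics on that page, captures the intuition that 'míguel' is a single edit away from 'miguel' while 'michael' is several:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions and substitutions needed
    # to turn a into b, keeping only two rows of the DP matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

print(levenshtein('míguel', 'miguel'))  # 1
```

Whether a distance of 1 means "same user" is exactly the application-specific judgment call described above.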
 

rusi

Determining if two strings are "almost the same" is not easy. If míguel
and miguel are to be considered the same, then why not also consider
michael to be the same? Or, for that matter, mike, mikey, or mick?
There's no easy answer, and what's the right answer for some
applications will be the wrong answer for others.

I did not know till quite recently that Jean and Ivan were just good ol' John.
 

Steven D'Aprano


Make no mistake, this sort of simple-minded stripping of accents and
diacritics is an extremely ham-fisted thing to do.
[...]
Joking aside, there is a legitimate use for asciifying text in this way:
creating unambiguous identifiers.

For example, a miscreant may create the username 'míguel' in order to
pose as another user 'miguel', relying on other users' inattentiveness.
Asciifying is one way of reducing the risk of that.

I'm pretty sure that Oliver and 0liver may not agree. Neither will
Megal33tHaxor and Mega133tHaxor.

It's true that there are *more* opportunities for this sort of
shenanigans with Unicode, so I guess your comment about "reducing" the
risk (rather than eliminating it) is strictly correct. But there are
other (better?) ways to do so, e.g. you could generate an identicon for
the user to act as a visual checksum:

http://en.wikipedia.org/wiki/Identicon


Another reasonable use for accent-stripping is searches. If I'm searching
for music by the Blue Öyster Cult, it would be good to see results for
Blue Oyster Cult as well. And vice versa. (A good search engine should
consider *adding* accents as well as removing them.)

On the other hand, if you name your band ▼□■□■□■, you deserve to wallow
in obscurity :)
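That kind of accent-insensitive search can be sketched by folding both the query and the indexed text down to the same accent-free key (a toy version; note that indexing under a shared folded key also covers the "adding accents" direction):

```python
import unicodedata

def fold(text):
    # Lowercase, decompose, and drop combining marks, so that 'Öyster'
    # and 'oyster' index and search under the same key.
    decomposed = unicodedata.normalize('NFKD', text.casefold())
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(fold('Blue Öyster Cult') == fold('blue oyster cult'))  # True
```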
 

Roy Smith

Another reasonable use for accent-stripping is searches. If I'm searching
for music by the Blue Öyster Cult, it would be good to see results for
Blue Oyster Cult as well.

Tell me about it (I work at Songza; music search is what we do). Accents are easy (Beyoncé, for example). What about NIN (where one of the N's is supposed to be backwards, but I can't figure out how to type that)? And Ke$ha. And "the artist previously known as a glyph which doesn't even exist in Unicode 6.3".
On the other hand, if you name your band ▼□■□■□■, you deserve to wallow
in obscurity :)

Indeed.

So, yesterday, I tracked down an uncaught exception stack in our logs to a user whose username included the Unicode character 'SMILING FACE WITH SUNGLASSES' (U+1F60E). It turns out, that's perfectly fine as a user name, except that in one obscure error code path, we try to str() it during some error processing. If you named your band something which included that character, would you expect it to match a search for the same name but with 'WHITE SMILING FACE' (U+263A) instead?
 

Chris Angelico

So, yesterday, I tracked down an uncaught exception stack in our logs to a user whose username included the unicode character 'SMILING FACE WITH SUNGLASSES' (U+1F60E). It turns out, that's perfectly fine as a user name, except that in one obscure error code path, we try to str() it during some error processing.

How is that a problem? Surely you have to deal with non-ASCII
characters all the time - how is that particular one a problem? I'm
looking at its UTF-8 and UTF-16 representations and not seeing
anything strange, unless it's the \x0e in UTF-16 - but, again, you
must surely have had to deal with
non-ASCII-encoded-whichever-way-you-do-it.

Or are you saying that that particular error code path did NOT handle
non-ASCII characters? If so, that's a strong argument for moving to
Python 3, to get full Unicode support in _all_ branches.

ChrisA
 

Roy Smith

Chris Angelico said:
How is that a problem? Surely you have to deal with non-ASCII
characters all the time - how is that particular one a problem? I'm
looking at its UTF-8 and UTF-16 representations and not seeing
anything strange, unless it's the \x0e in UTF-16 - but, again, you
must surely have had to deal with
non-ASCII-encoded-whichever-way-you-do-it.

Or are you saying that that particular error code path did NOT handle
non-ASCII characters?

Exactly. The fundamental error was caught, and then we raised another
UnicodeEncodeError generating the text of the error message to log!
If so, that's a strong argument for moving to
Python 3, to get full Unicode support in _all_ branches.

Well, yeah. The problem is, my pip requirements file lists 76 modules
(and installing all those results in 144 modules, including the cascaded
dependencies). Until most of those are P3 ready, we can't move.

Heck, I can't even really move off 2.6 because we use Amazon's EMR
service, which is stuck on 2.6.
 

Chris Angelico

Exactly. The fundamental error was caught, and then we raised another
UnicodeEncodeError generating the text of the error message to log!

Ha... oh, that's awkward.
Well, yeah. The problem is, my pip requirements file lists 76 modules
(and installing all those results in 144 modules, including the cascaded
dependencies). Until most of those are P3 ready, we can't move.

It's still a strong argument, just that unavailability of key modules
may be a stronger one :)
Heck, I can't even really move off 2.6 because we use Amazon's EMR
service, which is stuck on 2.6.

Hrm. 2.6 is now in source-only security-only support, and that's about
to end (there's a 2.6.9 in the pipeline, and that's that). It's about
time Amazon moved to 2.7, at least...

ChrisA
 

Roy Smith

Heck, I can't even really move off 2.6 because we use Amazon's EMR
service, which is stuck on 2.6.

Hrm. 2.6 is now in source-only security-only support, and that's about
to end (there's a 2.6.9 in the pipeline, and that's that). It's about
time Amazon moved to 2.7, at least...

Tell that to Amazon.
 

Chris Angelico

Tell that to customers of Amazon's EMR service, who are going to have
rather more leverage with Amazon than non-customers.

Aye. My involvement with Amazon is pretty minimal - I evaluated EC2 a
while ago, and I think maybe the company paid Amazon something like
$20... maybe as much as $100, we had some of the higher-end instances
running for a while. They won't be listening to me. :)

ChrisA
 

Mark Lawrence

Hrm. 2.6 is now in source-only security-only support, and that's about
to end (there's a 2.6.9 in the pipeline, and that's that). It's about
time Amazon moved to 2.7, at least...

Tell that to Amazon.

Dear Amazon,

Please upgrade to Python 3.3 or similar so that users can have better
Unicode support amongst other things.

Love and kisses.

Mark.

--
Roses are red,
Violets are blue,
Most poems rhyme,
But this one doesn't.

Mark Lawrence
 
