Nice unicode -> ascii translation?

crowell · Aug 6, 2006

I'm using the ID3 tag of an mp3 file to query musicbrainz to get their
sort-name for the artist. A simple example is "The Beatles" ->
MusicBrainz -> "Beatles, The". I then want to rename the mp3 file
using this information. However, I would like the filename to contain
only ascii characters, while musicbrainz gives unicode back. So far,
I've got something like:
'Bla Fleck'

However, I'd like to see the more sensible "Bela Fleck" instead of
dropping '\xe9' entirely. I believe this sort of translation can be
done using:

The trick is finding the right XXXX. Has someone attempted this
before, or am I stuck writing my own solution?
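
The code snippets in this post did not survive archiving. As a rough illustration of the behaviour described (an assumption about what was meant, not crowell's original code, and written for Python 3), encoding to ASCII with the 'ignore' error handler silently drops the accented letter, whereas a translation table can substitute a stand-in. The one-entry table below is purely hypothetical:

```python
name = u"B\xe9la Fleck"  # "Béla Fleck"

# 'ignore' silently drops every character that is not ASCII.
dropped = name.encode("ascii", "ignore").decode("ascii")

# A translation table maps code points to replacement strings instead.
# This one-entry table is illustrative only.
table = {0xE9: u"e"}
translated = name.translate(table)

print(dropped)     # Bla Fleck
print(translated)  # Bela Fleck
```

The open question in the thread is how to get a table like this for more than one character.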

Brian Beck · Aug 6, 2006

crowell> The trick is finding the right XXXX. Has someone attempted this
crowell> before, or am I stuck writing my own solution?

You want ASCII, Dammit:
http://www.crummy.com/cgi-bin/msm/map.cgi/ASCII+Dammit

John Machin · Aug 7, 2006

crowell> I'm using the ID3 tag of an mp3 file to query musicbrainz to get their
crowell> sort-name for the artist. A simple example is "The Beatles" ->
crowell> MusicBrainz -> "Beatles, The". I then want to rename the mp3 file
crowell> using this information. However, I would like the filename to contain
crowell> only ascii characters, while musicbrainz gives unicode back. So far,
crowell> I've got something like:
crowell>
crowell> 'Bla Fleck'

Why do you want only ASCII characters? What platform are you running
on?
If it's just a display problem, and the Unicode doesn't stray outside
the first 256 codepoints, you shouldn't have a problem, e.g.:

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
[snip]
IDLE 1.1.3
Béla Fleck

On a *x box, using latin1 should work.

crowell> However, I'd like to see the more sensible "Bela Fleck" instead of
crowell> dropping '\xe9' entirely. I believe this sort of translation can be
crowell> done using:
crowell>
crowell> The trick is finding the right XXXX. Has someone attempted this
crowell> before, or am I stuck writing my own solution?

However, if you really insist on having only ASCII characters, then
you've pretty much got to make up your own translation table. There was
a thread or two on this topic within the last few months. Merely
stripping accents, umlauts, cedillas, etc. off most European
scripts where the basic alphabet is Roman/Latin is easy enough. However,
some scripts use characters which are not Latin letters with detachable
decorations, and you will need 2 characters out for 1 in (e.g. German
eszett, Icelandic thorn (the name of the god with the hammer is shown
in ASCII as Thor, not Por!)). Scripts like Greek and Cyrillic would
need even more work.

HTH,
John
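
A minimal sketch of the kind of hand-made table John describes, assuming Python 3 (where str.translate accepts a dict mapping code points to replacement strings). The table contents and the name asciify are hypothetical and cover only a handful of characters, which is exactly why a real table takes work:

```python
# Hypothetical, deliberately tiny translation table.
TABLE = {
    0x00E9: u"e",   # e with acute accent
    0x00F3: u"o",   # o with acute accent
    0x00FC: u"u",   # u with umlaut
    0x00E7: u"c",   # c with cedilla
    0x00DF: u"ss",  # German eszett: two characters out for one in
    0x00DE: u"Th",  # Icelandic capital thorn
    0x00FE: u"th",  # Icelandic small thorn
}

def asciify(text):
    """Apply the table, then drop anything that is still not ASCII."""
    return text.translate(TABLE).encode("ascii", "ignore").decode("ascii")

print(asciify(u"B\xe9la Fleck"))  # Bela Fleck
print(asciify(u"\xde\xf3r"))      # Thor
print(asciify(u"Stra\xdfe"))      # Strasse
```

Anything missing from the table is still dropped by the final encode step, so the result is only as good as the table.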

Martin v. Löwis · Aug 7, 2006

crowell> The trick is finding the right XXXX. Has someone attempted this
crowell> before, or am I stuck writing my own solution?

In this specific example, there is a different approach, using
the Unicode character database:

def strip_combining(s):
    import unicodedata
    # Expand pre-combined characters into base+combinator
    s1 = unicodedata.normalize("NFD", s)
    r = []
    for c in s1:
        # add all non-combining characters
        if not unicodedata.combining(c):
            r.append(c)
    return u"".join(r)

py> a.strip_combining(u'B\xe9la Fleck')
u'Bela Fleck'

As the accented characters get decomposed into base character
plus combining accent, this strips off all accents in the
string.

Of course, it is still fairly limited. If you have non-Latin
scripts (Greek, Cyrillic, Arabic, Kanji, ...), this approach
fails, and you would need a transliteration database for them.
There is none built into Python, and I couldn't find a
transliteration database that transliterates all Unicode characters
into ASCII, either.

Regards,
Martin
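
To make the limitation concrete, here is a small sketch (written for Python 3; the helper is_ascii and the sample strings are my own additions, not part of Martin's post). It restates strip_combining compactly and checks which results end up as pure ASCII:

```python
import unicodedata

def strip_combining(s):
    # Same idea as the function in the post, written compactly.
    decomposed = unicodedata.normalize("NFD", s)
    return u"".join(c for c in decomposed if not unicodedata.combining(c))

def is_ascii(s):
    # Hypothetical helper: True when every character is below U+0080.
    return all(ord(c) < 128 for c in s)

latin = strip_combining(u"B\xe9la Fleck")                   # accented Latin
greek = strip_combining(u"\u0391\u03b8\u03ae\u03bd\u03b1")  # Greek, one accent
eszett = strip_combining(u"Stra\xdfe")                      # German eszett

print(is_ascii(latin))   # True
print(is_ascii(greek))   # False: the accent is gone, the letters are still Greek
print(is_ascii(eszett))  # False: eszett has no decomposition to strip
```

So the normalization approach handles accented Latin letters, but the output still needs a table or an encode step if the result must be strictly ASCII.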

skip · Aug 7, 2006

crowell> However, I'd like to see the more sensible "Bela Fleck" instead
crowell> of dropping '\xe9' entirely.

Assuming the data are in latin-1 or can be converted to it, try my latscii
codec:

http://orca.mojam.com/~skip/python/latscii.py

Skip
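
The contents of latscii.py are not reproduced in this thread, so the following is not skip's codec. It is a loosely related sketch of a different technique, assuming Python 3: a custom error handler registered with codecs.register_error, so that the fallback happens inside .encode(). The handler name "strip_accents" and the function name are hypothetical:

```python
import codecs
import unicodedata

def strip_accents_handler(exc):
    """Replace an unencodable run with its accent-free ASCII base letters."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    decomposed = unicodedata.normalize("NFD", bad)
    kept = u"".join(c for c in decomposed if ord(c) < 128)
    # Fall back to "?" when nothing ASCII survives (e.g. Greek letters).
    return (kept or u"?", exc.end)

# Register under a made-up name; any unused name would do.
codecs.register_error("strip_accents", strip_accents_handler)

encoded = u"B\xe9la Fleck".encode("ascii", "strip_accents")
print(encoded.decode("ascii"))  # Bela Fleck
```

The appeal of this style is that callers only change the error-handler argument to encode; the drawback is the same as above, since letters without a decomposition still need an explicit table.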
