convert Unicode filenames to good-looking ASCII

coldpizza

Hello,

I need to convert accented Unicode characters in the names of some audio
files to similar-looking ASCII characters. The following code seems to
work on Windows:

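import os
import sys
import glob

EXT = '*.*'

# Unicode pattern, so glob returns unicode file names.
lst_uni = glob.glob(unicode(EXT))

os.system('chcp 437')
# Byte-string pattern, so glob returns byte-string file names.
lst_asci = glob.glob(EXT)
print sys.stdout.encoding

for i in range(len(lst_asci)):
    try:
        os.rename(lst_uni[i], lst_asci[i])
    except Exception as e:
        print e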

On Windows it converts most of the accented characters from the Latin-1
encoding. It does not work on Linux, since it relies on 'chcp'.

The questions are (1) *why* does it work on Windows, and (2) what is
the proper and portable way to convert Unicode characters to
similar-looking plain ASCII characters?

That is, how do I properly do this kind of conversion?
ü > u
é > e
â > a
ä > a
à > a
á > a
ç > c
ê > e
ë > e
è > e

Is there any other way apart from creating my own char replacement
table?
 
Iliya

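Try something like this:

import unicodedata

def remove_accents(text):
    # Decompose each character into a base character plus combining marks.
    nfkd_form = unicodedata.normalize('NFKD', unicode(text))
    # Keep only the characters that are not combining marks.
    return u''.join([c for c in nfkd_form if not unicodedata.combining(c)])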
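
For anyone on Python 3, where every str is already Unicode and the unicode() builtin no longer exists, the same idea might look roughly like the sketch below. This is an adaptation under that assumption rather than Iliya's original code, and the name strip_accents is made up for illustration.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks (accents) removed."""
    # NFKD splits e.g. 'é' into 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("üéâäàáçêëè"))  # prints: ueaaaaceee
```

Note that characters with no Unicode decomposition, such as 'ø' or 'ß', pass through unchanged, so the result is not guaranteed to be pure ASCII.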
 
Peter Otten

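>>> s = u"""ü > u
... é > e
... â > a
... ä > a
... à > a
... á > a
... ç > c
... ê > e
... ë > e
... è > e
... """
>>> from unicodedata import normalize
>>> print normalize("NFKD", s).encode("ascii", "ignore")
u > u
e > e
a > a
a > a
a > a
a > a
c > c
e > e
e > e
e > e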
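
To tie this back to the original goal of renaming files, here is a rough Python 3 sketch that applies the normalize-then-encode approach to every name in a directory. The helper names ascii_name and rename_to_ascii are hypothetical, and skipping a file when the target name already exists is just one possible policy, not something anyone in this thread specified.

```python
import os
import unicodedata

def ascii_name(name):
    """Decompose accented characters, then drop everything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def rename_to_ascii(directory="."):
    """Rename files in `directory` to ASCII names; return (old, new) pairs."""
    renamed = []
    for old in sorted(os.listdir(directory)):
        new = ascii_name(old)
        if not new or new == old:
            continue  # nothing to change, or nothing would be left
        src = os.path.join(directory, old)
        dst = os.path.join(directory, new)
        if os.path.exists(dst):
            continue  # assumption: never overwrite an existing file
        os.rename(src, dst)
        renamed.append((old, new))
    return renamed
```

Because encode("ascii", "ignore") silently drops anything it cannot represent, characters without a decomposition (for example 'ø' or 'ß') simply disappear from the name. If those matter, a small replacement table for the leftovers is probably still needed.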
 
coldpizza

Cool! Thanks to both Iliya and Peter!
