Managing non-ASCII filenames in Python

pdenize

As a test, I created the following filename in Windows:
“Dönåld’s™ Néphêws” deg°.txt
The quotes are non-ASCII, and the name contains many non-English
characters, a long hyphen, etc.

Now in DOS I can do a directory listing and it translates them all to
something close:
"Dönåld'sT Néphêws" deg°.txt

I thought the correct way to do this in Python would be to scan the
directory:
files = os.listdir(os.path.dirname(os.path.realpath(__file__)))

then print the filenames:
for filename in files:
    print filename

but, as expected, the filename is not correct, so I tried to correct
it using the file system's encoding:

print filename.decode(sys.getfilesystemencoding())

But I get
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2014'
in position 6: character maps to <undefined>

All was working well until these characters came along.

I need to be able to write (a representation) to the screen (and I
don't see why I should not get something as good as DOS shows).

Write it to an XML file in UTF-8

and write it to a text file and be able to read it back in.
Again I was surprised that this was also difficult; it appears that
the file also wanted ASCII. Should I open the file in binary mode for
writing (I expect so), and if so, what encoding should I write in?

I have been beating myself up with this for weeks: I get it working,
then come across some other character that causes it all to stop
again.

Please help.
 
Martin v. Löwis

> I thought the correct way to do this in Python would be to scan the
> directory:
> files = os.listdir(os.path.dirname(os.path.realpath(__file__)))
>
> then print the filenames:
> for filename in files:
>     print filename
>
> but, as expected, the filename is not correct, so I tried to correct
> it using the file system's encoding:
>
> print filename.decode(sys.getfilesystemencoding())
>
> But I get
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u2014'
> in position 6: character maps to <undefined>

As a starting point, you shouldn't be using byte-oriented APIs to
access files on Windows; the specific byte-oriented API is os.listdir,
when passed a directory represented as a byte string.

So try:

import os
import sys

dirname = os.path.dirname(os.path.realpath(__file__))
# Decode the byte-string path so that os.listdir returns Unicode names.
dirname = dirname.decode(sys.getfilesystemencoding())
files = os.listdir(dirname)

This should give you the files as Unicode strings.
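
To illustrate the point about byte-oriented versus Unicode calls, here is a small sketch that lists the same directory both ways and reports the type of the names that come back. It is written to run on Python 3, where the two string kinds are called bytes and str (in the Python 2 used in this thread they are str and unicode). The helper name listing_types and the temporary demo directory are made up for the example.

```python
import os
import sys
import tempfile


def listing_types(dirname_text):
    """Return (type of names from a bytes path, type of names from a text path)."""
    dirname_bytes = dirname_text.encode(sys.getfilesystemencoding())
    from_bytes = os.listdir(dirname_bytes)  # byte-oriented call
    from_text = os.listdir(dirname_text)    # Unicode-oriented call
    return type(from_bytes[0]), type(from_text[0])


# Hypothetical demo directory containing a single file.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "example.txt"), "w").close()

bytes_type, text_type = listing_types(demo_dir)
```

The type of the path passed in decides the type of the names returned, which is why decoding dirname first matters.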

> I need to be able to write (a representation) to the screen (and I
> don't see why I should not get something as good as DOS shows).

The command window (it's not really DOS anymore) uses the CP_OEMCP
encoding, which is not available in Python. This does all the
transliteration also, so you would have to write an extension module
if you want to get the same transliteration (or try to get to the
OEMCP encoding through ctypes).
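
As a rough sketch of the ctypes route: the Win32 function GetOEMCP returns the number of the OEM code page, which at least tells you what the console is using. The helper name oem_codec_name is made up, and turning the number into a Python codec name of the form "cp<number>" is an assumption that holds for common code pages such as 437 and 850 but possibly not for all of them. This only identifies the code page; it does not reproduce the console's transliteration.

```python
import sys


def oem_codec_name():
    """Return a guessed codec name for the OEM code page, or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported here because ctypes.windll exists only on Windows
    # Assumption: Python ships a codec named "cp<number>" for this code page.
    return "cp%d" % ctypes.windll.kernel32.GetOEMCP()


codec_name = oem_codec_name()
```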

If you can live with a simpler transliteration, try

print filename.encode(sys.stdout.encoding, "replace")
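
The "replace" idea can be wrapped in a small helper. This is a sketch written to run on Python 3 (the one-liner above is the Python 2 form); the name printable and the sample string are made up for illustration. Characters the target encoding cannot represent come out as "?", so nothing raises UnicodeEncodeError.

```python
import sys


def printable(name, encoding=None):
    """Return name with characters the encoding cannot represent replaced by '?'."""
    # Fall back to ASCII when stdout has no known encoding (e.g. some pipes).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return name.encode(encoding, "replace").decode(encoding)


sample = u"D\u00f6n\u2014ld"  # contains an o-umlaut and the U+2014 long dash

# cp437 has the o-umlaut but not U+2014, so only the dash is replaced.
shown_cp437 = printable(sample, "cp437")

# Safe to print whatever the console encoding is.
print(printable(sample))
```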

> Write it to an XML file in UTF-8
>
> and write it to a text file and be able to read it back in.
> Again I was surprised that this was also difficult; it appears that
> the file also wanted ASCII. Should I open the file in binary mode for
> writing (I expect so), and if so, what encoding should I write in?

You need to tell us how precisely you tried to do this. My guess is:
if you now try again, with the filenames being Unicode strings, it
will work fairly well.
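
For what it's worth, here is one possible way to do both once the name is a Unicode string. This is a sketch, not necessarily what the original poster tried: it assumes UTF-8 for both files, uses io.open (which takes an encoding argument) for the text file and xml.etree.ElementTree for the XML. The file names names.txt and names.xml and the element names are made up.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# The test filename from this thread, written with escapes.
filename = (u"\u201cD\u00f6n\u00e5ld\u2019s\u2122 "
            u"N\u00e9ph\u00eaws\u201d deg\u00b0.txt")

out_dir = tempfile.mkdtemp()
text_path = os.path.join(out_dir, "names.txt")  # hypothetical output files
xml_path = os.path.join(out_dir, "names.xml")

# Text file: io.open encodes on write and decodes on read.
with io.open(text_path, "w", encoding="utf-8") as f:
    f.write(filename + u"\n")
with io.open(text_path, "r", encoding="utf-8") as f:
    read_back = f.readline().rstrip(u"\n")

# XML file: ElementTree does the encoding and writes the declaration.
root = ET.Element("files")
ET.SubElement(root, "file").text = filename
ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)
xml_back = ET.parse(xml_path).getroot().find("file").text
```

In both cases the program handles Unicode strings throughout and lets the library encode at the file boundary, so there is no need to open the file in binary mode and encode by hand.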

Regards,
Martin
 
