File names, character sets and Unicode

Michal Ludvig · Dec 12, 2008

Hi all,

is there any way to determine the charset of the filenames returned
by os.walk()?

The trouble is, if I pass a <type 'str'> argument to os.walk() I get the
filenames as byte strings. Possibly UTF-8 encoded Unicode, who knows.

OTOH, if I pass a <type 'unicode'> argument to os.walk(), all the
filenames I get in the loop are already unicode objects.

However, with some locale settings os.walk() dies with, for example:
Traceback (most recent call last):
  File "tst.py", line 10, in <module>
    for root, dirs, files in filelist:
  File "/usr/lib/python2.5/os.py", line 303, in walk
    for x in walk(path, topdown, onerror):
  File "/usr/lib/python2.5/os.py", line 293, in walk
    if isdir(join(top, name)):
  File "/usr/lib/python2.5/posixpath.py", line 65, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

I can't even skip over these files with 'os.walk(..., onerror=handler)'
because the handler() is never called.

That happens, for instance, when the file names contain non-ASCII
characters and the locale is set to ASCII, but reportedly in some other
cases as well.
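
For reference, the message in the traceback is what the ASCII codec reports when it meets a byte above 0x7f. In Python 2 that decode happens implicitly when a byte string is concatenated with a unicode string, as in the `path += '/' + b` line. Below is a minimal sketch that triggers the same error explicitly. It is written for Python 3, where the byte-string type is called `bytes`; the helper name and the sample file name are made up for illustration.

```python
def ascii_decode_error(raw):
    """Return the codec's error message for raw, or None if raw is pure ASCII."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Hypothetical file name with a single non-ASCII byte at index 1.
print(ascii_decode_error(b"a\xe2.txt"))
# Pure ASCII names decode fine, so no message is returned.
print(ascii_decode_error(b"plain.txt"))
```

The first call reports the same byte and position as the traceback above; the second prints None.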

What's the right and safe way to walk the filesystem and get some
meaningful filenames?

Related question - if the directory name is given on the command line,
what's the right way to preprocess the argument before passing it down
to os.walk()?

For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
* directory is called 'smile☺'
* sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
* after .decode("utf-8") I get u'smile\u263a' (type unicode)

But how should I decode() it when running on a system where $LANG
doesn't end with "UTF-8"? Apparently some locales have non-ASCII default
charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
ISO-8859-5, etc. How do I detect that to get the right charset for decode()?

I tend to have everything internally in Unicode but it's often unclear
how to convert some inputs to Unicode in the first place. What are the
best practices for dealing with these charset issues in Python?

Thanks!

Michal

Marc 'BlackJack' Rintsch · Dec 12, 2008

> is there any way to determine what's the charset of filenames returned
> by os.walk()?

No. Especially on *nix file systems, file names are just strings of
bytes, not characters. It is possible to have file names in different
encodings in the same directory.

> The trouble is, if I pass <type 'str'> argument to os.walk() I get the
> filenames as byte-strings. Possibly UTF-8 encoded Unicode, who knows.

Nobody knows.

> What's the right and safe way to walk the filesystem and get some
> meaningful filenames?

The safe way is to use `str`.
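
A minimal sketch of that approach, assuming Python 3, where the Python 2 `str` type corresponds to `bytes`. The function name `walk_raw` is hypothetical. Passing a byte-string path to os.walk() makes it return byte strings, so no file name is ever decoded during the walk.

```python
import os


def walk_raw(top):
    """Yield the full path of every file under top as an undecoded byte string."""
    # os.fsencode() leaves bytes untouched and encodes text paths to bytes.
    top = os.fsencode(top)
    for root, dirs, files in os.walk(top):
        for name in files:
            # root and name are both bytes here, so join() never mixes types.
            yield os.path.join(root, name)


if __name__ == "__main__":
    for path in walk_raw("."):
        print(repr(path))
```

The byte strings can be handed straight back to open(), os.stat() and similar calls.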

> Related question - if the directory is given name on a command line
> what's the right way to preprocess the argument before passing it down
> to os.walk()?

Pass it as is.

> For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
> * directory is called 'smile☺'
> * sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
> * after .decode("utf-8") I get u'smile\u263a' (type unicode)

> But how should I decode() it when running on a system where $LANG
> doesn't end with "UTF-8"? Apparently some locales have non-ascii default
> charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
> ISO-8859-5, etc. How do I detect that to get the right charset for
> decode()?

You can't. Even if you know the preferred encoding of the system, e.g.
via $LANG, there is no guarantee that all file names are encoded this way.
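
The locale's preferred encoding can be queried with locale.getpreferredencoding(), but as said it is only a guess for any particular name. A hedged sketch (Python 3; `try_decode` is a hypothetical helper) that tries an encoding and reports failure instead of raising:

```python
import locale


def try_decode(raw, encoding=None):
    """Decode raw with encoding; return None if the bytes do not fit it.

    When no encoding is given, fall back to the locale's preferred one,
    which is a guess and not a property of the file name itself.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


print(repr(try_decode(b"smile\xe2\x98\xba", "utf-8")))
print(repr(try_decode(b"\xff", "utf-8")))
```

A None result means the caller still has to keep the raw bytes around; note also that a successful decode does not prove the guess was right, since many byte sequences are valid in several encodings.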

> I tend to have everything internally in Unicode but it's often unclear
> how to convert some inputs to Unicode in the first place. What are the
> best practices for dealing with these charset issues in Python?

I usually use UTF-8 as the default but offer the user ways, e.g. command
line switches, to change that.
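
One way such a switch might look, sketched with argparse. The option name `--encoding` and the function `build_parser` are assumptions for illustration, not something from this thread.

```python
import argparse


def build_parser():
    """Build a parser with a UTF-8 default that the user can override."""
    parser = argparse.ArgumentParser(description="Walk a directory tree.")
    parser.add_argument("directory", help="directory to walk")
    parser.add_argument(
        "--encoding",
        default="utf-8",
        help="encoding assumed for file names (default: %(default)s)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.directory, args.encoding)
```

Running the script with `--encoding big5 somedir` would then override the default for that run.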

If I have to display file names in a GUI I use a decoded version of the
byte string file name, but keep the byte string for operations on the
file.
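
A small sketch of that split, again assuming Python 3 and a hypothetical helper name. The decoded text is used only for display, with undecodable bytes replaced by U+FFFD, while the original bytes stay available for file operations.

```python
def display_name(raw, encoding="utf-8"):
    """Return a text version of raw that is safe to show to a user.

    Bytes that do not fit the encoding become U+FFFD instead of raising,
    so the result may not round-trip back to the original name.
    """
    return raw.decode(encoding, "replace")


# Hypothetical names: one valid UTF-8, one Latin-1 encoded.
names = [b"smile\xe2\x98\xba", b"caf\xe9"]

# Keep each raw name next to its display form; use only the raw one for I/O.
entries = [(raw, display_name(raw)) for raw in names]
for raw, shown in entries:
    print(repr(raw), "->", shown)
```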

Ciao,
Marc 'BlackJack' Rintsch
