File names, character sets and Unicode

Discussion in 'Python' started by Michal Ludvig, Dec 12, 2008.

  1. Hi all,

    is there any way to determine what's the charset of filenames returned
    by os.walk()?

    The trouble is, if I pass <type 'str'> argument to os.walk() I get the
    filenames as byte-strings. Possibly UTF-8 encoded Unicode, who knows.

    OTOH If I pass <type 'unicode'> to os.walk() all the filenames I get in
    the loop are already unicode()d.

    However, with some locale settings os.walk() dies with, for example:
    Traceback (most recent call last):
      File "tst.py", line 10, in <module>
        for root, dirs, files in filelist:
      File "/usr/lib/python2.5/os.py", line 303, in walk
        for x in walk(path, topdown, onerror):
      File "/usr/lib/python2.5/os.py", line 293, in walk
        if isdir(join(top, name)):
      File "/usr/lib/python2.5/posixpath.py", line 65, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

    I can't even skip over these files with 'os.walk(..., onerror=handler)';
    the handler() is never called.

    That happens for instance when the file names have some non-ascii
    characters and locales are set to ascii, but reportedly in some other
    cases as well.

    What's the right and safe way to walk the filesystem and get some
    meaningful filenames?


    Related question - if the directory name is given on the command line,
    what's the right way to preprocess the argument before passing it down
    to os.walk()?

    For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
    * directory is called 'smile☺'
    * sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
    * after .decode("utf-8") I get u'smile\u263a' (type unicode)

    But how should I decode() it when running on a system where $LANG
    doesn't end with "UTF-8"? Apparently some locales have non-ascii default
    charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
    ISO-8859-5, etc. How do I detect that to get the right charset for decode()?

    I tend to have everything internally in Unicode but it's often unclear
    how to convert some inputs to Unicode in the first place. What are the
    best practices for dealing with these charset issues in Python?

    Thanks!

    Michal
    --
    * Amazon S3 backup tool -- http://s3tools.logix.cz/s3cmd
    Michal Ludvig, Dec 12, 2008

  2. On Fri, 12 Dec 2008 23:32:27 +1300, Michal Ludvig wrote:

    > is there any way to determine what's the charset of filenames returned
    > by os.walk()?


    No. Especially on *nix file systems, file names are just strings of
    bytes, not characters. It is possible to have file names in different
    encodings in the same directory.

    > The trouble is, if I pass <type 'str'> argument to os.walk() I get the
    > filenames as byte-strings. Possibly UTF-8 encoded Unicode, who knows.


    Nobody knows. :)

    > What's the right and safe way to walk the filesystem and get some
    > meaningful filenames?


    The safe way is to use `str`.
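
    A minimal sketch of that advice, written for Python 3, where the
    byte-string type discussed in this thread is called `bytes`. Passing a
    bytes path to os.walk() makes it yield bytes names, so nothing is decoded
    during the walk. The helper name `walk_raw` is made up for illustration.

```python
import os


def walk_raw(top):
    """Yield the full path of every file below `top` as a byte string."""
    # os.fsencode() turns a text path into bytes; a bytes path passes through.
    top = os.fsencode(top)
    # A bytes argument makes os.walk() return bytes names, so nothing is decoded.
    for root, dirs, files in os.walk(top):
        for name in files:
            yield os.path.join(root, name)
```

    The yielded paths can be handed straight back to open() or os.stat();
    decoding is only needed when a name has to be shown to a person.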

    > Related question - if the directory name is given on the command line,
    > what's the right way to preprocess the argument before passing it down
    > to os.walk()?


    Pass it as is.

    > For instance with LANG=en_NZ.UTF-8 (i.e. UTF-8 system):
    > * directory is called 'smile☺'
    > * sys.argv[1] will be 'smile\xe2\x98\xba' (type str)
    > * after .decode("utf-8") I get u'smile\u263a' (type unicode)
    >
    > But how should I decode() it when running on a system where $LANG
    > doesn't end with "UTF-8"? Apparently some locales have non-ascii default
    > charsets. For instance zh_TW is BIG5 charset by default, ru_RU is
    > ISO-8859-5, etc. How do I detect that to get the right charset for
    > decode()?


    You can't. Even if you know the preferred encoding of the system, e.g.
    via $LANG, there is no guarantee that all file names are encoded this way.
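
    Since the encoding can only be guessed, one possible approach is to try a
    few candidates and fall back to a lossy decode. This is a sketch in
    Python 3, not a reliable detector: the helper `decode_name` and its
    default candidate list are assumptions made for illustration, and the
    result is meant for display only.

```python
import locale


def decode_name(raw, candidates=None):
    """Best-effort decoding of a byte-string file name for display."""
    if candidates is None:
        # The locale's preferred encoding is only a hint about the names on disk.
        candidates = ["utf-8", locale.getpreferredencoding(False)]
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec: try the next candidate
    # Nothing fit: show U+FFFD for undecodable bytes instead of raising.
    return raw.decode("ascii", errors="replace")
```

    A successful decode does not prove the guess was right: single-byte
    encodings such as Latin-1 accept any byte sequence, so they belong last
    in the candidate list.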

    > I tend to have everything internally in Unicode but it's often unclear
    > how to convert some inputs to Unicode in the first place. What are the
    > best practices for dealing with these charset issues in Python?


    I usually use UTF-8 as the default but offer the user ways, e.g. command
    line switches, to change that.

    If I have to display file names in a GUI I use a decoded version of the
    byte string file name, but keep the byte string for operations on the
    file.
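
    Both habits can be sketched together in Python 3. The `--encoding`
    switch and the helpers `parse_args` and `list_names` are hypothetical
    names chosen for this example, not part of any existing tool.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse a directory argument plus a user-selectable file name encoding."""
    parser = argparse.ArgumentParser()
    parser.add_argument("directory")
    # UTF-8 is only the default guess; the user can override it.
    parser.add_argument("--encoding", default="utf-8")
    return parser.parse_args(argv)


def list_names(directory, encoding):
    """Return (raw, shown) pairs: bytes for file operations, text for display."""
    pairs = []
    for raw in os.listdir(os.fsencode(directory)):
        # errors="replace" keeps the listing going when the guess is wrong.
        shown = raw.decode(encoding, errors="replace")
        pairs.append((raw, shown))
    return pairs
```

    Only `shown` goes to the GUI or terminal; `raw` is what gets joined onto
    the directory path and passed to open() or os.remove().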

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Dec 12, 2008
