os.listdir, gets unicode, returns unicode... USUALLY?!?!?

Discussion in 'Python' started by gabor, Nov 16, 2006.

  1. gabor

    gabor Guest

    hi,

    from the documentation (http://docs.python.org/lib/os-file-dir.html) for
    os.listdir:

    "On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
    will be a list of Unicode objects."

    i'm on Unix. (linux, ubuntu edgy)

    so it seems that it does not always return unicode filenames.

    it seems that it tries to interpret the filenames using the filesystem's
    encoding, and if that fails, it simply returns the filename as byte-string.

    so you get back let's say an array of 21 filenames, from which 3 are
    byte-strings, and the rest unicode strings.

    after digging around, i found this in the source code:

    > #ifdef Py_USING_UNICODE
    >         if (arg_is_unicode) {
    >             PyObject *w;
    >
    >             w = PyUnicode_FromEncodedObject(v,
    >                     Py_FileSystemDefaultEncoding,
    >                     "strict");
    >             if (w != NULL) {
    >                 Py_DECREF(v);
    >                 v = w;
    >             }
    >             else {
    >                 /* fall back to the original byte string, as
    >                    discussed in patch #683592 */
    >                 PyErr_Clear();
    >             }
    >         }
    > #endif


    so if the to-unicode conversion fails, it falls back to the original
    byte-string. i went and read the patch discussion.
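
In Python terms, the C fragment above amounts to roughly the following. This is only a sketch in Python 3 syntax, where `bytes` and `str` stand in for the Python 2 byte-string and unicode types. The helper name `decode_or_keep` and the sample names are made up for illustration, and the encoding is passed in explicitly instead of being taken from the filesystem.

```python
def decode_or_keep(raw_name, encoding="utf-8"):
    """Return decoded text if strict decoding works, else the raw bytes unchanged."""
    try:
        return raw_name.decode(encoding, "strict")
    except UnicodeDecodeError:
        # Same idea as the PyErr_Clear() branch: swallow the error, keep the bytes.
        return raw_name


# Hypothetical directory contents: two decodable names, one Latin-1 name
# that is not valid UTF-8.
raw_names = [b"plain.txt", "\u732b.txt".encode("utf-8"), b"caf\xe9.txt"]
result = [decode_or_keep(name) for name in raw_names]

for entry in result:
    print(type(entry).__name__, repr(entry))
```

The resulting list mixes both types, which is the situation described above.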

    and now i'm not sure what to do.
    i know that:

    1. the documentation is completely wrong. it does not always return
    unicode filenames
    2. it's true that the documentation does not specify what happens if the
    filename is not in the filesystem-encoding, but i simply expected that i
    would get a Unicode exception, as everywhere else. you see, exceptions are
    ok, i can deal with them. but this is just plain wrong. from now on,
    EVERYWHERE where i use os.listdir, i will have to go through all the
    filenames in it, and check if they are unicode-strings or not.

    so basically i'd like to ask here: am i reading something incorrectly?
    or am i using os.listdir the "wrong way"? how do other people deal with
    this?

    p.s: one additional note. if your code expects os.listdir to return
    unicode, that usually means that all your code uses unicode strings.
    which in turn means, that those filenames will somehow later interact
    with unicode strings. which means that that byte-string-filename will
    probably get auto-converted to unicode at a later point, and that
    auto-conversion will VERY probably fail, because the auto-convert only
    happens using 'ascii' as the encoding, and if it was not possible to
    decode the filename inside listdir, it's quite probable that it also
    will not work using 'ascii' as the charset.


    gabor
     
    gabor, Nov 16, 2006
    #1

  2. Terry Reedy

    Terry Reedy Guest

    "gabor" <> wrote in message
    news:edfc7$455cd28b$59ad1aca$...
    > so if the to-unicode-conversion fails, it falls back to the original
    > byte-string. i went and read the patch discussion.
    >
    > and now i'm not sure what to do.
    > i know that:
    >
    > 1. the documentation is completely wrong. it does not always return
    > unicode filenames


    Unless someone says otherwise, report the discrepancy between doc and code
    as a bug on the SF tracker. I have no idea of what the resolution should
    be ;-).

    tjr
     
    Terry Reedy, Nov 16, 2006
    #2

  3. gabor wrote:
    > so basically i'd like to ask here: am i reading something incorrectly?


    You are reading it correctly. This is how it behaves.

    > or am i using os.listdir the "wrong way"? how do other people deal with
    > this?


    You didn't say why the behavior causes a problem for you - you only
    explained what the behavior is.

    Most people use os.listdir in a way like this:

    import os

    for name in os.listdir(path):
        full = os.path.join(path, name)
        attrib = os.stat(full)
        if some_condition:  # placeholder for whatever test applies
            f = open(full)
            ...

    All this code will typically work just fine with the current behavior,
    so people typically don't see any problem.

    Regards,
    Martin
     
    Martin v. Löwis, Nov 16, 2006
    #3
  4. gabor

    gabor Guest

    Martin v. Löwis wrote:
    > gabor wrote:
    >
    >> or am i using os.listdir the "wrong way"? how do other people deal with
    >> this?

    >
    > You didn't say why the behavior causes a problem for you - you only
    > explained what the behavior is.
    >
    > Most people use os.listdir in a way like this:
    >
    > for name in os.listdir(path):
    >     full = os.path.join(path, name)
    >     attrib = os.stat(full)
    >     if some-condition:
    >         f = open(full)
    >         ...
    >
    > All this code will typically work just fine with the current behavior,
    > so people typically don't see any problem.
    >


    i am sorry, but it will not work. actually this is exactly what i did,
    and it did not work. it dies in the os.path.join call, where file_name
    is converted into unicode. and python uses 'ascii' as the charset in
    such cases. but, because listdir already failed to decode the file_name
    with the filesystem-encoding, it usually also fails when tried with 'ascii'.

    example:

    >>> dir_name = u'something'
    >>> unicode_file_name = u'\u732b.txt' # the japanese cat-symbol
    >>> bytestring_file_name = unicode_file_name.encode('utf-8')
    >>>
    >>> import os.path
    >>>
    >>> os.path.join(dir_name,unicode_file_name)
    u'something/\u732b.txt'
    >>>
    >>> os.path.join(dir_name,bytestring_file_name)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/lib/python2.4/posixpath.py", line 65, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 1: ordinal not in range(128)
    >>>
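
The "position 1" in the traceback is consistent with the decode being applied to `'/' + b`, so the offending byte sits right after the separator. The sketch below reproduces the same error details in Python 3 syntax. Note the assumption: Python 3 never converts bytes to text implicitly, so the ascii decode has to be requested by hand here.

```python
# UTF-8 bytes of the same file name used in the session above.
name = "\u732b.txt".encode("utf-8")  # b'\xe7\x8c\xab.txt'
joined = b"/" + name

try:
    # Python 2 did this step implicitly when bytes met unicode in a concatenation.
    joined.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    # Offset of the bad byte, the byte itself, and the codec's reason.
    print(error.start, hex(error.object[error.start]), error.reason)
```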



    gabor
     
    gabor, Nov 16, 2006
    #4
  5. gabor wrote:
    >> All this code will typically work just fine with the current behavior,
    >> so people typically don't see any problem.
    >>

    >
    > i am sorry, but it will not work. actually this is exactly what i did,
    > and it did not work. it dies in the os.path.join call, where file_name
    > is converted into unicode. and python uses 'ascii' as the charset in
    > such cases. but, because listdir already failed to decode the file_name
    > with the filesystem-encoding, it usually also fails when tried with
    > 'ascii'.


    Ah, right. So yes, it will typically fail immediately - just as you
    wanted it to do, anyway; the advantage with this failure is that you
    can also find out what specific file name is causing the problem
    (whereas when listdir failed completely, you could not easily find
    out the cause of the failure).
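
One way to make use of that is to catch the failure per name and record the offending entry. The sketch below is written for Python 3, where mixing text and bytes in `os.path.join` raises `TypeError` instead of `UnicodeDecodeError`, so both are caught. `join_all` is a hypothetical helper, not part of the standard library.

```python
import os.path


def join_all(directory, names):
    """Join each name onto directory, collecting the names that cannot be joined."""
    joined, failures = [], []
    for name in names:
        try:
            joined.append(os.path.join(directory, name))
        except (TypeError, UnicodeError) as exc:
            # Keep the exact name so the caller can report which entry was bad.
            failures.append((name, exc))
    return joined, failures


# A mixed list like the one os.listdir returned in this thread.
joined, failures = join_all("something", ["a.txt", b"\xe7\x8c\xab.txt"])
for bad_name, exc in failures:
    print("cannot join %r: %s" % (bad_name, exc))
```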

    How would you propose listdir should behave?

    Regards,
    Martin
     
    Martin v. Löwis, Nov 16, 2006
    #5
  6. gabor wrote:

    > get an Unicode-exception, as everywhere else. you see, exceptions are
    > ok, i can deal with them.


    > p.s: one additional note. if your code expects os.listdir to return
    > unicode, that usually means that all your code uses unicode strings.
    > which in turn means, that those filenames will somehow later interact
    > with unicode strings. which means that that byte-string-filename will
    > probably get auto-converted to unicode at a later point, and that
    > auto-conversion will VERY probably fail


    it will raise an exception, most likely. didn't you just say that
    exceptions were ok?

    </F>
     
    Fredrik Lundh, Nov 17, 2006
    #6
  7. gabor wrote:
    > hi,
    >
    > from the documentation (http://docs.python.org/lib/os-file-dir.html) for
    > os.listdir:
    >
    > "On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
    > will be a list of Unicode objects."


    Maybe, for each filename, you can test if it is a unicode string, and
    if not, convert it to unicode using the encoding indicated by
    sys.getfilesystemencoding().

    Have a try.

    A+

    Laurent.
     
    Laurent Pointal, Nov 17, 2006
    #7
  8. Laurent Pointal wrote:
    > gabor wrote:
    >> hi,
    >>
    >> from the documentation (http://docs.python.org/lib/os-file-dir.html) for
    >> os.listdir:
    >>
    >> "On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
    >> will be a list of Unicode objects."

    >
    > Maybe, for each filename, you can test if it is an unicode string, and
    > if not, convert it to unicode using the encoding indicated by
    > sys.getfilesystemencoding().
    >
    > Have a try.
    >
    > A+
    >
    > Laurent.


    Strange coincidence, as I was wrestling with this problem only yesterday.

    I found this most illuminating discussion on the topic with
    contributions from Mr Löwis and others:

    http://www.thescripts.com/forum/thread41954.html

    /johan
     
    Johan von Boisman, Nov 17, 2006
    #8
  9. gabor

    gabor Guest

    Laurent Pointal wrote:
    > gabor wrote:
    >> hi,
    >>
    >> from the documentation (http://docs.python.org/lib/os-file-dir.html) for
    >> os.listdir:
    >>
    >> "On Windows NT/2k/XP and Unix, if path is a Unicode object, the result
    >> will be a list of Unicode objects."

    >
    > Maybe, for each filename, you can test if it is an unicode string, and
    > if not, convert it to unicode using the encoding indicated by
    > sys.getfilesystemencoding().
    >

    i don't think it would work, because os.listdir already tried that and
    failed (that's why we got a byte-string and not a unicode string)
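
The objection holds for a strict decode: repeating it with the same encoding fails the same way. A non-strict error handler does always produce text, at the cost of losing the original bytes. The following is a sketch in Python 3 syntax. `to_text` is a hypothetical helper, and the choice of the "replace" handler is an assumption, not something proposed in the thread.

```python
import sys


def to_text(name, encoding=None):
    """Return name as text; undecodable bytes become U+FFFD (lossy)."""
    if isinstance(name, str):
        return name
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    # "replace" never raises, unlike the strict decode listdir already attempted.
    return name.decode(encoding, "replace")


# Encoding passed explicitly so the result does not depend on the local setup.
print(repr(to_text(b"caf\xe9.txt", "utf-8")))
print(repr(to_text("already-text.txt")))
```

The result is only fit for display or logging: the replaced name no longer matches the file on disk, so the original bytes are still needed to open it.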

    gabor
     
    gabor, Nov 17, 2006
    #9
  10. Leo Kislov

    Leo Kislov Guest

    Martin v. Löwis wrote:
    > gabor wrote:
    > >> All this code will typically work just fine with the current behavior,
    > >> so people typically don't see any problem.
    > >>

    > >
    > > i am sorry, but it will not work. actually this is exactly what i did,
    > > and it did not work. it dies in the os.path.join call, where file_name
    > > is converted into unicode. and python uses 'ascii' as the charset in
    > > such cases. but, because listdir already failed to decode the file_name
    > > with the filesystem-encoding, it usually also fails when tried with
    > > 'ascii'.

    >
    > Ah, right. So yes, it will typically fail immediately - just as you
    > wanted it to do, anyway; the advantage with this failure is that you
    > can also find out what specific file name is causing the problem
    > (whereas when listdir failed completely, you could not easily find
    > out the cause of the failure).
    >
    > How would you propose listdir should behave?


    How about returning two lists, first list contains unicode names, the
    second list contains undecodable names:

    files, troublesome = os.listdir(separate_errors=True)

    and make separate_errors=True by default in python 3.0 ?
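
os.listdir has no `separate_errors` parameter, but the proposal can be approximated in user code. The following is a sketch in Python 3 syntax, where passing a bytes path to `os.listdir` returns the raw bytes names. `separate` and `listdir_separated` are hypothetical names chosen for illustration.

```python
import os
import sys


def separate(raw_names, encoding):
    """Split raw bytes names into (decoded text names, undecodable bytes names)."""
    files, troublesome = [], []
    for raw in raw_names:
        try:
            files.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            troublesome.append(raw)
    return files, troublesome


def listdir_separated(path):
    """List path as bytes, then partition using the filesystem encoding."""
    raw_names = os.listdir(os.fsencode(path))
    return separate(raw_names, sys.getfilesystemencoding())


files, troublesome = separate([b"a.txt", b"caf\xe9.txt"], "utf-8")
print(files, troublesome)
```

Keeping the partitioning in a separate pure function means the decoding rule can be tried out without needing undecodable names on a real filesystem.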

    -- Leo
     
    Leo Kislov, Nov 17, 2006
    #10
  11. gabor

    gabor Guest

    Fredrik Lundh wrote:
    > gabor wrote:
    >
    >> get an Unicode-exception, as everywhere else. you see, exceptions are
    >> ok, i can deal with them.

    >
    >> p.s: one additional note. if your code expects os.listdir to return
    >> unicode, that usually means that all your code uses unicode strings.
    >> which in turn means, that those filenames will somehow later interact
    >> with unicode strings. which means that that byte-string-filename will
    >> probably get auto-converted to unicode at a later point, and that
    >> auto-conversion will VERY probably fail

    >
    > it will raise an exception, most likely. didn't you just say that
    > exceptions were ok?


    yes, but it's raised at the wrong place imho :)

    (just to clarify: simply pointing out this behavior in the documentation
    is also one of the possible solutions)

    for me the current behavior seems as if file-reading would work like this:

    a = open('foo.txt')
    data = a.read()
    a.close()

    print data
    >>> TheFileFromWhichYouHaveReadDidNotExistException



    gabor
     
    gabor, Nov 17, 2006
    #11
  12. Leo Kislov wrote:
    > How about returning two lists, first list contains unicode names, the
    > second list contains undecodable names:
    >
    > files, troublesome = os.listdir(separate_errors=True)
    >
    > and make separate_errors=True by default in python 3.0 ?


    That would be quite an incompatible change, no?

    Regards,
    Martin
     
    Martin v. Löwis, Nov 18, 2006
    #12
  13. Martin v. Löwis wrote:

    >> How about returning two lists, first list contains unicode names, the
    >> second list contains undecodable names:
    >>
    >> files, troublesome = os.listdir(separate_errors=True)
    >>
    >> and make separate_errors=True by default in python 3.0 ?

    >
    > That would be quite an incompatible change, no?


    it also violates a fundamental design rule for the standard library.

    </F>
     
    Fredrik Lundh, Nov 18, 2006
    #13
  14. Leo Kislov

    Leo Kislov Guest

    Martin v. Löwis wrote:
    > Leo Kislov wrote:
    > > How about returning two lists, first list contains unicode names, the
    > > second list contains undecodable names:
    > >
    > > files, troublesome = os.listdir(separate_errors=True)
    > >
    > > and make separate_errors=True by default in python 3.0 ?

    >
    > That would be quite an incompatible change, no?


    Yeah, that was an idea dump. Actually it is possible to make this idea
    mostly backward compatible by making os.listdir() return only unicode
    names and os.binlistdir() return only binary directory entries.
    Unfortunately the same trick will not work for getcwd.

    Another idea is to map all 256 bytes to unicode private code points.
    When a file name cannot be fully decoded the undecoded bytes will be
    mapped to specially allocated code points. Unfortunately this idea
    seems to leak if the program later wants to write such a unicode string
    to a file. Python will have to throw an exception since we don't know
    if it is ok to write a broken string to a file. So we are back to square
    one, programs need to deal with filesystem garbage :(
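
Python 3 later adopted a scheme close to this one: the "surrogateescape" error handler (PEP 383) maps each undecodable byte to a lone surrogate code point instead of a private-use one. A short sketch in Python 3 shows both the lossless round trip and the leak described above when such a string is encoded strictly. The sample name is made up for illustration.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# The undecodable byte 0xE9 is smuggled into the text as U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", "surrogateescape")

# A strict encode, as used when writing text to a UTF-8 file, refuses the
# lone surrogate: this is the "leak".
try:
    text.encode("utf-8")
    leaked = False
except UnicodeEncodeError:
    leaked = True

print(repr(text), round_trip == raw, leaked)
```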

    -- Leo
     
    Leo Kislov, Nov 18, 2006
    #14