Re: os.listdir, gets unicode, returns unicode... USUALLY?!?!?

Discussion in 'Python' started by Jean-Paul Calderone, Nov 17, 2006.

  1. On Fri, 17 Nov 2006 00:31:06 +0100, "Martin v. Löwis" wrote:
    >gabor wrote:
    >>> All this code will typically work just fine with the current behavior,
    >>> so people typically don't see any problem.
    >>>

    >>
    >> i am sorry, but it will not work. actually this is exactly what i did,
    >> and it did not work. it dies in the os.path.join call, where file_name
    >> is converted into unicode. and python uses 'ascii' as the charset in
    >> such cases. but, because listdir already failed to decode the file_name
    >> with the filesystem-encoding, it usually also fails when tried with
    >> 'ascii'.

    >
    >Ah, right. So yes, it will typically fail immediately - just as you
    >wanted it to do, anyway; the advantage with this failure is that you
    >can also find out what specific file name is causing the problem
    >(whereas when listdir failed completely, you could not easily find
    > out the cause of the failure).
    >
    >How would you propose listdir should behave?


    Umm, just a wild guess, but how about raising an exception which includes
    the name of the file which could not be decoded?

    Jean-Paul
    Jean-Paul Calderone, Nov 17, 2006
    #1

  2. gabor (Guest)

    Jean-Paul Calderone wrote:
    > On Fri, 17 Nov 2006 00:31:06 +0100, "Martin v. Löwis" wrote:
    >> gabor wrote:
    >>>> All this code will typically work just fine with the current behavior,
    >>>> so people typically don't see any problem.
    >>>>
    >>>
    >>> i am sorry, but it will not work. actually this is exactly what i did,
    >>> and it did not work. it dies in the os.path.join call, where file_name
    >>> is converted into unicode. and python uses 'ascii' as the charset in
    >>> such cases. but, because listdir already failed to decode the file_name
    >>> with the filesystem-encoding, it usually also fails when tried with
    >>> 'ascii'.

    >>
    >> Ah, right. So yes, it will typically fail immediately - just as you
    >> wanted it to do, anyway; the advantage with this failure is that you
    >> can also find out what specific file name is causing the problem
    >> (whereas when listdir failed completely, you could not easily find
    >> out the cause of the failure).
    >>
    >> How would you propose listdir should behave?

    >
    > Umm, just a wild guess, but how about raising an exception which includes
    > the name of the file which could not be decoded?
    >


    i also recommend this approach.

    also, raising an exception goes well with the principle of the least
    surprise imho.

    gabor
    gabor, Nov 17, 2006
    #2

  3. Jean-Paul Calderone wrote:
    >> How would you propose listdir should behave?

    >
    > Umm, just a wild guess, but how about raising an exception which includes
    > the name of the file which could not be decoded?


    There may be multiple of these, of course, but I assume that you want
    it to report the first one it encounters?

    Regards,
    Martin
    Martin v. Löwis, Nov 17, 2006
    #3
  4. gabor wrote:
    > i also recommend this approach.
    >
    > also, raising an exception goes well with the principle of the least
    > surprise imho.


    Are you saying you wouldn't have been surprised if that had been
    the behavior? How would you deal with that exception in your code?

    Regards,
    Martin
    Martin v. Löwis, Nov 17, 2006
    #4
  5. Jean-Paul Calderone wrote:

    >>How would you propose listdir should behave?

    >
    > Umm, just a wild guess, but how about raising an exception which includes
    > the name of the file which could not be decoded?


    Suppose you have a directory with just some files having a name that can't
    be decoded with the file system encoding. So `listdir()` fails at this
    point and raises an exception. How would you get the names then? Even the
    ones that *can* be decoded? This doesn't look very nice:

    path = u'some path'
    try:
        files = os.listdir(path)
    except UnicodeError, e:
        files = os.listdir(path.encode(sys.getfilesystemencoding()))
        # Decode and filter the list "manually" here.

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Nov 17, 2006
    #5
  6. gabor (Guest)

    Marc 'BlackJack' Rintsch wrote:
    > Jean-Paul Calderone wrote:
    >
    >>> How would you propose listdir should behave?

    >> Umm, just a wild guess, but how about raising an exception which includes
    >> the name of the file which could not be decoded?

    >
    > Suppose you have a directory with just some files having a name that can't
    > be decoded with the file system encoding. So `listdir()` fails at this
    > point and raises an exception. How would you get the names then? Even the
    > ones that *can* be decoded? This doesn't look very nice:
    >
    > path = u'some path'
    > try:
    >     files = os.listdir(path)
    > except UnicodeError, e:
    >     files = os.listdir(path.encode(sys.getfilesystemencoding()))
    >     # Decode and filter the list "manually" here.


    i agree that it does not look very nice.

    but does this look nicer? :)

    path = u'some path'
    files = os.listdir(path)

    def check_and_fix_wrong_filename(file):
        if isinstance(file,unicode):
            return file
        else:
            #somehow convert it to unicode, and return it

    files = [check_and_fix_wrong_filename(f) for f in files]

    in other words, your opinion is that the proposed solution is not
    optimal, or that the current behavior is fine?

    gabor
    gabor, Nov 17, 2006
    #6
  7. gabor (Guest)

    Martin v. Löwis wrote:
    > gabor wrote:
    >> i also recommend this approach.
    >>
    >> also, raising an exception goes well with the principle of the least
    >> surprise imho.

    >
    > Are you saying you wouldn't have been surprised if that had been
    > the behavior?



    yes, i would not have been surprised. because it's kind-of expected when
    dealing with input, that malformed input raises an unicode-exception.
    and i would also expect, that if os.listdir completed without raising an
    exception, then the returned data is correct.

    > How would you deal with that exception in your code?


    depends on the application. in the one where it happened i would just
    display an error message, and tell the admins to check the
    filesystem-encoding.

    (in other ones, where it's not critical to get the correct name, i would
    probably just convert the text to unicode using the "replace" behavior)

    what about using flags similar to how unicode() works? strict, ignore,
    replace and maybe keep-as-bytestring.

    like:
    os.listdir(dirname,'strict')

    i know it's not the most elegant, but it would solve most of the
    use-cases imho (at least my use-cases).

    gabor
    gabor, Nov 17, 2006
    #7
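A minimal sketch of the flag idea gabor describes, written for Python 3 (where `bytes` plays the role of the Python 2 byte string). The helper `decode_names`, the wrapper `listdir_with_errors`, and the mode name `'keep'` for keep-as-bytestring are all hypothetical; `os.listdir` has no such parameter. Decoding is split out from listing so that it can be tried on plain byte strings.

```python
import os
import sys

def decode_names(raw_names, encoding, errors="strict"):
    """Decode byte-string file names according to an error mode.

    'strict', 'ignore' and 'replace' are passed to bytes.decode();
    'keep' (a hypothetical mode) leaves undecodable names as bytes.
    """
    result = []
    for raw in raw_names:
        if errors == "keep":
            try:
                result.append(raw.decode(encoding, "strict"))
            except UnicodeDecodeError:
                result.append(raw)  # hand back the original bytes
        else:
            result.append(raw.decode(encoding, errors))
    return result

def listdir_with_errors(dirname, errors="strict"):
    """Hypothetical wrapper: list as bytes, then decode with the chosen mode."""
    raw_names = os.listdir(os.fsencode(dirname))
    return decode_names(raw_names, sys.getfilesystemencoding(), errors)
```

With `'strict'`, the `UnicodeDecodeError` raised by `bytes.decode()` carries the offending name in its `object` attribute, which is close to what Jean-Paul asked for earlier in the thread.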
  8. Leo Kislov (Guest)

    gabor wrote:
    > Martin v. Löwis wrote:
    > > gabor wrote:
    > >> i also recommend this approach.
    > >>
    > >> also, raising an exception goes well with the principle of the least
    > >> surprise imho.

    > >
    > > Are you saying you wouldn't have been surprised if that had been
    > > the behavior?

    >
    >
    > yes, i would not have been surprised. because it's kind-of expected when
    > dealing with input, that malformed input raises an unicode-exception.
    > and i would also expect, that if os.listdir completed without raising an
    > exception, then the returned data is correct.


    The problem is that most programmers just don't want to deal with
    filesystem garbage but they won't be happy if the program breaks
    either.

    > > How would you deal with that exception in your code?

    >
    > depends on the application. in the one where it happened i would just
    > display an error message, and tell the admins to check the
    > filesystem-encoding.
    >
    > (in other ones, where it's not critical to get the correct name, i would
    > probably just convert the text to unicode using the "replace" behavior)
    >
    > what about using flags similar to how unicode() works? strict, ignore,
    > replace and maybe keep-as-bytestring.
    >
    > like:
    > os.listdir(dirname,'strict')


    That's actually an interesting idea. The error handling modes could be:
    'mix' -- current behaviour, 'ignore' -- drop names that cannot be
    decoded, 'separate' -- see my other message.

    -- Leo
    Leo Kislov, Nov 17, 2006
    #8
  9. In <4cefe$455d8f47$59ad1aca$>, gabor wrote:

    > Marc 'BlackJack' Rintsch wrote:
    >> Jean-Paul Calderone wrote:
    >>
    >>>> How would you propose listdir should behave?
    >>> Umm, just a wild guess, but how about raising an exception which includes
    >>> the name of the file which could not be decoded?

    >>
    >> Suppose you have a directory with just some files having a name that can't
    >> be decoded with the file system encoding. So `listdir()` fails at this
    >> point and raises an exception. How would you get the names then? Even the
    >> ones that *can* be decoded? This doesn't look very nice:
    >>
    >> path = u'some path'
    >> try:
    >>     files = os.listdir(path)
    >> except UnicodeError, e:
    >>     files = os.listdir(path.encode(sys.getfilesystemencoding()))
    >>     # Decode and filter the list "manually" here.

    >
    > i agree that it does not look very nice.
    >
    > but does this look nicer? :)
    >
    > path = u'some path'
    > files = os.listdir(path)
    >
    > def check_and_fix_wrong_filename(file):
    >     if isinstance(file,unicode):
    >         return file
    >     else:
    >         #somehow convert it to unicode, and return it
    >
    > files = [check_and_fix_wrong_filename(f) for f in files]


    I think this is very "special" code as you can't use the fixed names to
    open the files anymore unless you guess the encoding correctly. I think
    it's a bit fragile. Wouldn't it be a better solution to convert the
    `path` to the file system encoding for getting the file names. This way
    you can use all the names to process the files.

    > in other words, your opinion is that the proposed solution is not
    > optimal, or that the current behavior is fine?


    I think the current behavior is okay but should be documented.

    Maybe I just didn't have enough use cases yet that needed the names as
    unicode objects and from my Linux file systems experience file names are
    just byte strings with two limitations: no slashes and no zero bytes. :)

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Nov 17, 2006
    #9
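Marc's suggestion of asking for the names in the file system encoding can be sketched as follows in Python 3, where passing a `bytes` path to `os.listdir` returns `bytes` names. The helper names `list_raw` and `read_all` are made up for illustration. Because the names are never decoded, every entry can be joined and opened regardless of its encoding.

```python
import os
import tempfile

def list_raw(path):
    """Return directory entries as bytes, whatever their encoding."""
    return os.listdir(os.fsencode(path))

def read_all(path):
    """Read every regular file in `path`, addressing each by its raw byte name."""
    raw_dir = os.fsencode(path)
    contents = {}
    for raw_name in list_raw(path):
        full = os.path.join(raw_dir, raw_name)  # bytes joined with bytes
        if os.path.isfile(full):
            with open(full, "rb") as f:
                contents[raw_name] = f.read()
    return contents

# Usage: a throwaway directory with one file.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "wb") as f:
        f.write(b"hello")
    print(read_all(d))
```

The cost is the one gabor points out: the names still have to be decoded somewhere before they can be shown to a user.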
  10. gabor wrote:
    > depends on the application. in the one where it happened i would just
    > display an error message, and tell the admins to check the
    > filesystem-encoding.
    >
    > (in other ones, where it's not critical to get the correct name, i would
    > probably just convert the text to unicode using the "replace" behavior)
    >
    > what about using flags similar to how unicode() works? strict, ignore,
    > replace and maybe keep-as-bytestring.
    >
    > like:
    > os.listdir(dirname,'strict')
    >
    > i know it's not the most elegant, but it would solve most of the
    > use-cases imho (at least my use-cases).


    Of course, it's possible to implement this on top of the existing
    listdir operation.

    def failing_listdir(dirname, mode):
        result = os.listdir(dirname)
        if mode != 'strict': return result
        for r in result:
            if isinstance(r, str):
                raise UnicodeDecodeError
        return result

    Regards,
    Martin
    Martin v. Löwis, Nov 18, 2006
    #10
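One detail in the sketch above: `UnicodeDecodeError` cannot be raised without arguments, because its constructor takes the encoding, the offending object, a start and end index, and a reason. Below is a version of the same check in Python 3 terms, where a `bytes` entry stands for a name that could not be decoded. The function name `require_text_names` is hypothetical, and it works on a list of names rather than calling `os.listdir` itself, since Python 3's `os.listdir` does not return a mixed list for a text path.

```python
def require_text_names(names, encoding):
    """Raise on the first name that is still a byte string.

    The exception carries the undecodable name in its `object` attribute.
    """
    for name in names:
        if isinstance(name, bytes):
            raise UnicodeDecodeError(
                encoding, name, 0, len(name), "file name could not be decoded"
            )
    return names
```

A caller can then report the failing name with `exc.object`, which covers the request for an exception that includes the name of the file.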
  11. gabor (Guest)

    Martin v. Löwis wrote:
    > gabor wrote:
    >> depends on the application. in the one where it happened i would just
    >> display an error message, and tell the admins to check the
    >> filesystem-encoding.
    >>
    >> (in other ones, where it's not critical to get the correct name, i would
    >> probably just convert the text to unicode using the "replace" behavior)
    >>
    >> what about using flags similar to how unicode() works? strict, ignore,
    >> replace and maybe keep-as-bytestring.
    >>
    >> like:
    >> os.listdir(dirname,'strict')
    >>
    >> i know it's not the most elegant, but it would solve most of the
    >> use-cases imho (at least my use-cases).

    >
    > Of course, it's possible to implement this on top of the existing
    > listdir operation.
    >
    > def failing_listdir(dirname, mode):
    >     result = os.listdir(dirname)
    >     if mode != 'strict': return result
    >     for r in result:
    >         if isinstance(r, str):
    >             raise UnicodeDecodeError
    >     return result
    >


    yes, sure... but then.. it's possible to implement it also on top of an
    raise-when-error version :)

    so, what do you think, how should this issue be solved?

    currently i see 2 ways:

    1. simply fix the documentation, and state that if the file-name cannot
    be decoded into unicode, then it's returned as byte-string. but that
    also means, that the typical usage of:

    [os.path.join(path,n) for n in os.listdir(path)]

    will not work.

    2. add support for some unicode-decoding flags, like i wrote before

    3. some solution.

    ?

    gabor
    gabor, Nov 19, 2006
    #11
  12. gabor wrote:

    > yes, sure... but then.. it's possible to implement it also on top of an
    > raise-when-error version :)


    not necessarily if raise-when-error means raise-error-in-os-listdir.

    </F>
    Fredrik Lundh, Nov 19, 2006
    #12
  13. gabor (Guest)

    Fredrik Lundh wrote:
    > gabor wrote:
    >
    >> yes, sure... but then.. it's possible to implement it also on top of
    >> an raise-when-error version :)

    >
    > not necessarily if raise-when-error means raise-error-in-os-listdir.
    >


    could you please clarify?

    currently i see 2 approaches how to do it on the raise-when-error version:

    1.
    dirname = u'something'
    try:
        files = os.listdir(dirname)
    except UnicodeError:
        byte_files = os.listdir(dirname.encode('encoding'))
        #do something with it

    2.

    dirname = u'something'
    byte_files = os.listdir(dirname.encode('encoding'))
    for byte_file in byte_files:
        try:
            file = byte_file.decode(sys.getfilesystemencoding())
        except UnicodeError:
            #do something else
        #do something


    the byte-string version of os.listdir remains. so all the other versions
    can be implemented on the top of it. imho the question is:
    which should be the 'default' behavior, offered by the python standard
    library.

    gabor
    gabor, Nov 19, 2006
    #13
  14. gabor wrote:
    > 1. simply fix the documentation, and state that if the file-name cannot
    > be decoded into unicode, then it's returned as byte-string.


    For 2.5, this should be done. Contributions are welcome.

    [...then]
    > [os.path.join(path,n) for n in os.listdir(path)]
    >
    > will not work.
    >
    > 2. add support for some unicode-decoding flags, like i wrote before


    I may have missed something, but did you present a solution that would
    make the case above work?

    > 3. some solution.


    One approach I had been considering is to always make the decoding
    succeed, by using the private-use-area of Unicode to represent bytes
    that don't decode correctly.

    Regards,
    Martin
    Martin v. Löwis, Nov 19, 2006
    #14
  15. gabor (Guest)

    Martin v. Löwis wrote:
    > gabor wrote:
    >> 1. simply fix the documentation, and state that if the file-name cannot
    >> be decoded into unicode, then it's returned as byte-string.

    >
    > For 2.5, this should be done. Contributions are welcome.
    >
    > [...then]
    >> [os.path.join(path,n) for n in os.listdir(path)]
    >>
    >> will not work.
    >>
    >> 2. add support for some unicode-decoding flags, like i wrote before

    >
    > I may have missed something, but did you present a solution that would
    > make the case above work?


    if we use the same decoding flags as binary-string.decode(),
    then we could do:

    [os.path.join(path,n) for n in os.listdir(path,'ignore')]

    or

    [os.path.join(path,n) for n in os.listdir(path,'replace')]

    it's not an elegant solution, but it would solve i think most of the
    problems.


    >
    >> 3. some solution.

    >
    > One approach I had been considering is to always make the decoding
    > succeed, by using the private-use-area of Unicode to represent bytes
    > that don't decode correctly.
    >


    hmm..an interesting idea..

    and what happens with such texts, when they are encoded into let's say
    utf-8? are the in-private-use-area characters ignored?

    gabor
    gabor, Nov 19, 2006
    #15
  16. gabor wrote:
    >> I may have missed something, but did you present a solution that would
    >> make the case above work?

    >
    > if we use the same decoding flags as binary-string.decode(),
    > then we could do:
    >
    > [os.path.join(path,n) for n in os.listdir(path,'ignore')]


    That wouldn't work. The characters in the file name that didn't
    decode would be dropped, so the resulting file names would be
    invalid. Trying to do os.stat() on such a file name would raise
    an exception that the file doesn't exist.

    > [os.path.join(path,n) for n in os.listdir(path,'replace')]


    Likewise. The characters would get replaced with REPLACEMENT
    CHARACTER; passing that to os.stat would give an encoding
    error.

    > it's not an elegant solution, but it would solve i think most of the
    > problems.


    No, it wouldn't. This idea is as bad or worse than just dropping
    these file names from the directory listing.

    >> One approach I had been considering is to always make the decoding
    >> succeed, by using the private-use-area of Unicode to represent bytes
    >> that don't decode correctly.
    >>

    >
    > hmm..an interesting idea..
    >
    > and what happens with such texts, when they are encoded into let's say
    > utf-8? are the in-private-use-area characters ignored?


    UTF-8 supports encoding of all Unicode characters, including the PUA
    blocks.

    py> u"\ue020".encode("utf-8")
    '\xee\x80\xa0'

    Regards,
    Martin
    Martin v. Löwis, Nov 19, 2006
    #16
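Martin's objection to `'ignore'` and `'replace'` can be checked without touching a file system. The short Python 3 sketch below uses an invented byte-string name, `b"caf\xe9.txt"`, which is Latin-1 and not valid UTF-8, and assumes UTF-8 as the file system encoding.

```python
raw = b"caf\xe9.txt"   # a Latin-1 name, not valid UTF-8

ignored = raw.decode("utf-8", "ignore")     # the bad byte is dropped
replaced = raw.decode("utf-8", "replace")   # the bad byte becomes U+FFFD

# Neither result encodes back to the bytes that are on disk,
# so neither could be used to stat or open the original file.
print(ignored.encode("utf-8") == raw)
print(replaced.encode("utf-8") == raw)

# With an encoding that cannot represent U+FFFD, encoding fails outright.
try:
    replaced.encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```

Both comparisons print `False`: the decoded names are fine for display, as gabor notes in his reply, but they no longer identify the file.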
  17. gabor (Guest)

    Martin v. Löwis wrote:
    > gabor wrote:
    >>> I may have missed something, but did you present a solution that would
    >>> make the case above work?

    >> if we use the same decoding flags as binary-string.decode(),
    >> then we could do:
    >>
    >> [os.path.join(path,n) for n in os.listdir(path,'ignore')]

    >
    > That wouldn't work. The characters in the file name that didn't
    > decode would be dropped, so the resulting file names would be
    > invalid. Trying to do os.stat() on such a file name would raise
    > an exception that the file doesn't exist.
    >
    >> [os.path.join(path,n) for n in os.listdir(path,'replace')]

    >
    > Likewise. The characters would get replaced with REPLACEMENT
    > CHARACTER; passing that to os.stat would give an encoding
    > error.
    >
    >> it's not an elegant solution, but it would solve i think most of the
    >> problems.

    >
    > No, it wouldn't. This idea is as bad or worse than just dropping
    > these file names from the directory listing.


    i think that depends on the point of view.
    if you need to do something later with the content of files, then you're
    right.

    but if all you need is to display them for example...

    >
    >>> One approach I had been considering is to always make the decoding
    >>> succeed, by using the private-use-area of Unicode to represent bytes
    >>> that don't decode correctly.
    >>>

    >> hmm..an interesting idea..
    >>
    >> and what happens with such texts, when they are encoded into let's say
    >> utf-8? are the in-private-use-area characters ignored?

    >
    > UTF-8 supports encoding of all Unicode characters, including the PUA
    > blocks.
    >
    > py> u"\ue020".encode("utf-8")
    > '\xee\x80\xa0'


    so basically you'd like to be able to "round-trip"?

    so that:

    listdir returns an array of filenames, the un-representable bytes will
    be represented in the PUA.

    all the other file-handling functions (stat, open, etc..) recognize such
    strings, and handle them correctly.

    ?

    gabor
    gabor, Nov 19, 2006
    #17
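The round trip gabor describes could look roughly like this in Python 3. Everything here is an assumption made for illustration: the mapping of byte 0xNN to U+E0NN, the error-handler name `pua_escape_sketch`, and the helpers `decode_name` and `encode_name`. It is not how Python implements anything, only one way to realise the private-use-area idea.

```python
import codecs

PUA_BASE = 0xE000  # hypothetical choice: byte 0xNN maps to U+E0NN

def _pua_escape(exc):
    """Decode error handler: map each undecodable byte to a PUA code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end

codecs.register_error("pua_escape_sketch", _pua_escape)

def decode_name(raw, encoding="utf-8"):
    """Always succeeds: undecodable bytes become PUA characters."""
    return raw.decode(encoding, "pua_escape_sketch")

def encode_name(name, encoding="utf-8"):
    """Inverse of decode_name: PUA escapes turn back into the raw bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode(encoding)
    return bytes(out)

raw = b"caf\xe9.txt"
name = decode_name(raw)
print(encode_name(name) == raw)  # the undecodable byte survives the round trip
```

The sketch does not escape genuine private-use characters, so a name that really contains U+E0E9 would be encoded back to different bytes. Python 3's `surrogateescape` error handler (PEP 383) later took a similar approach, using lone surrogate code points instead of the private-use area.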
  18. Ross Ridge (Guest)

    Martin v. Löwis wrote:
    > One approach I had been considering is to always make the decoding
    > succeed, by using the private-use-area of Unicode to represent bytes
    > that don't decode correctly.


    That would conflict with private use characters appearing in file
    names.

    Personally, I think os.listdir() should return the file names only in
    Unicode if they're actually stored that way in the underlying file
    system (eg. NTFS), otherwise return them as byte strings. I doubt
    anyone in this thread would like that, though.

    Ross Ridge
    Ross Ridge, Nov 20, 2006
    #18
  19. Ross Ridge wrote:
    > Martin v. Löwis wrote:
    >> One approach I had been considering is to always make the decoding
    >> succeed, by using the private-use-area of Unicode to represent bytes
    >> that don't decode correctly.

    >
    > That would conflict with private use characters appearing in file
    > names.


    Not necessarily: they could get escaped.

    AFAICT, you can have that conflict only if the file system encoding
    is UTF-8: otherwise, there is no way to represent them.

    > Personally, I think os.listdir() should return the file names only in
    > Unicode if they're actually stored that way in the underlying file
    > system (eg. NTFS), otherwise return them as byte strings. I doubt
    > anyone in this thread would like that, though.


    So I assume you would not want to allow to pass Unicode strings
    to open(), stat() etc. either, as the _real_ file system API requires
    byte strings there, as well?

    People would indeed see that as a step backwards. If you don't want
    Unicode strings returned from listdir, don't pass Unicode string
    as the directory name.

    Technically, how do you determine whether the underlying file
    system stores file names "in Unicode"? Does OSX use Unicode
    (it requires path names to be UTF-8)? After all, each and
    every encoding is a Unicode encoding - that was a design
    goal of Unicode.

    Regards,
    Martin
    Martin v. Löwis, Nov 20, 2006
    #19
  20. Ross Ridge (Guest)

    Martin v. Löwis wrote:
    > One approach I had been considering is to always make the decoding
    > succeed, by using the private-use-area of Unicode to represent bytes
    > that don't decode correctly.


    Ross Ridge wrote:
    > That would conflict with private use characters appearing in file
    > names.


    Martin v. Löwis wrote:
    > Not necessarily: they could get escaped.


    How?

    > AFAICT, you can have that conflict only if the file system encoding
    > is UTF-8: otherwise, there is no way to represent them.


    They can also appear in UTF-16 filenames (obviously) and various Far-East
    multi-byte encodings.

    > > Personally, I think os.listdir() should return the file names only in
    > > Unicode if they're actually stored that way in the underlying file
    > > system (eg. NTFS), otherwise return them as byte strings. I doubt
    > > anyone in this thread would like that, though.

    >
    > So I assume you would not want to allow to pass Unicode strings
    > to open(), stat() etc. either, as the _real_ file system API requires
    > byte strings there, as well?


    No, I just expect that if the underlying file system API does accept a
    given byte or Unicode string that I could pass the same string to
    open() and stat(), etc.. and have it work. I have no problem if
    additional strings happen to work because Python converts byte strings
    to Unicode or vice-versa as the API requires.

    Should I assume that since you think that having "os.listdir()" return
    Unicode strings when passed a Unicode directory name is a good idea,
    that you also think that file object methods (eg. readline) should
    return Unicode strings when opened with a Unicode filename?

    > Technically, how do you determine whether the underlying file
    > system stores file names "in Unicode"?


    On Windows you can use GetVolumeInformation(), though it may be more
    practical to assume Unicode or byte strings based on the OS. On Unix
    you'd assume byte strings.

    > Does OSX use Unicode (it requires path names to be UTF-8)?


    HFS+ uses Unicode. I have no idea how you'd figure out the properties
    of a filesystem under OS/X, but then the Python docs suggests this
    os.listdir() Unicode feature doesn't work on Macintosh systems anyways.

    > After all, each and every encoding is a Unicode encoding - that was a design
    > goal of Unicode.


    If it were as simple as that, then yes, there wouldn't be a problem.
    Unfortunately, as this thread has revealed, os.listdir() isn't always
    able to map byte string filenames into Unicode, either because they
    don't use the assumed encoding, don't all use the same encoding or
    don't use any standard encoding. That's the problem here, there's no
    encoding associated with Unix filenames, they're just byte strings. Since
    Python byte strings also have no encoding associated with them they're
    the natural way of representing all valid file names on Unix systems.
    On the other hand, under Windows NT/2K/XP and NTFS or VFAT the natural
    way to represent all valid file names is Unicode strings.

    Ross Ridge
    Ross Ridge, Nov 20, 2006
    #20