PEP 277, Unicode filenames & mbcs encoding, etc.

Discussion in 'Python' started by Edward K. Ream, Oct 21, 2003.

  1. Am I reading pep 277 correctly? On Windows NT/XP, should filenames always
    be converted to Unicode using the mbcs encoding? For example,

    myFile = unicode(__file__, "mbcs", "strict")

    This seems to work, and I'm wondering whether there are any other details to
    consider.

    My experiments with Idle for Python 2.2 indicate that os.path.join doesn't
    work as I expect when one of the args is a Unicode string. Everything
    before the Unicode string gets thrown away. But this is probably moot: pep
    277 implies Python 2.3...

    Am I correct that conversions to Unicode (using "mbcs" on Windows) should be
    done before passing arguments to os.path.join, os.path.split,
    os.path.normpath, etc. ? Presumably os.path functions use the default
    system encoding to convert strings to Unicode, which isn't likely to be
    "mbcs" or anything else useful :)

    Are there any situations where some other encoding should be used instead on
    Windows? What about other platforms? For instance, does Linux allow
    non-ascii file names? If so, what encoding should be specified when
    converting to Unicode? Thanks.

    Edward
    --------------------------------------------------------------------
    Edward K. Ream email:
    Leo: Literate Editor with Outlines
    Leo: http://webpages.charter.net/edreamleo/front.html
    --------------------------------------------------------------------
     
    Edward K. Ream, Oct 21, 2003
    #1
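A minimal sketch of the "convert first, then join" idea from the question above. It assumes a modern Python 3 interpreter, not the 2.2/2.3 discussed in this thread; there, os.path.join raises TypeError when text and byte strings are mixed instead of converting silently. The helper name `join_as_text` and its `encoding` parameter are hypothetical, introduced only for illustration.

```python
import os.path

def join_as_text(parts, encoding):
    """Decode any byte-string components, then join everything as text."""
    text_parts = [
        # "strict" raises UnicodeDecodeError if the bytes do not fit the encoding.
        part.decode(encoding, "strict") if isinstance(part, bytes) else part
        for part in parts
    ]
    return os.path.join(*text_parts)
```

On Windows the caller might pass "mbcs" as the encoding, as in the question; that codec is only available there.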

  2. "Edward K. Ream" <> wrote in message
    news:...
    | Am I reading pep 277 correctly? On Windows NT/XP, should filenames always
    | be converted to Unicode using the mbcs encoding? For example,
    |
    | myFile = unicode(__file__, "mbcs", "strict")

    No and no. You can *still* use regular byte strings. Python will do the
    conversion to Unicode for you using "mbcs" as encoding.

    |
    | This seems to work, and I'm wondering whether there are any other details
    | to consider.
    |
    | My experiments with Idle for Python 2.2 indicate that os.path.join doesn't
    | work as I expect when one of the args is a Unicode string. Everything
    | before the Unicode string gets thrown away. But this is probably moot:
    | pep 277 implies Python 2.3...

    Exactly. Python Unicode file name support has arrived with 2.3.

    |
    ....
    |
    | Are there any situations where some other encoding should be used instead
    | on Windows? What about other platforms? For instance, does Linux allow
    | non-ascii file names?

    You can use "os.path.supports_unicode_filenames" to check...


    HTH

    Vincent Wehren

    | If so, what encoding should be specified when
    | converting to Unicode? Thanks.

    Probably the default encoding, on Linux.

    |
    | Edward
    | --------------------------------------------------------------------
    | Edward K. Ream email:
    | Leo: Literate Editor with Outlines
    | Leo: http://webpages.charter.net/edreamleo/front.html
    | --------------------------------------------------------------------
    |
    |
     
    vincent wehren, Oct 21, 2003
    #2

  3. In article <bn3jfo$df$1.nb.home.nl>,
    "vincent wehren" <> wrote:

    > | Are there any situations where some other encoding should be used instead
    > | on Windows? What about other platforms? For instance, does Linux allow
    > | non-ascii file names?
    >
    > You can use "os.path.supports_unicode_filenames" to check...


    Actually, you can't, see:

    http://python.org/sf/767645

    The only two platforms that currently support unicode filenames properly
    are Windows NT/XP and MacOSX, and for one of them
    os.path.supports_unicode_filenames returns False :(

    Just
     
    Just, Oct 21, 2003
    #3
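For reference, a minimal sketch of the check under discussion. It assumes a modern Python 3 interpreter, not the 2.3 of this thread, and, as the bug report above points out, the flag was not reliable on every platform at the time, so the value is best read as a hint. The helper name `describe_unicode_filename_support` is hypothetical.

```python
import os.path
import sys

def describe_unicode_filename_support():
    """Report what the interpreter claims about Unicode file names."""
    return {
        # A plain boolean attribute of the platform's os.path module.
        "supports_unicode_filenames": os.path.supports_unicode_filenames,
        # Name of the encoding Python uses for file names.
        "filesystem_encoding": sys.getfilesystemencoding(),
    }
```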
  4. "Edward K. Ream" <> writes:

    > Am I reading pep 277 correctly? On Windows NT/XP, should filenames always
    > be converted to Unicode using the mbcs encoding?


    What do you mean by "should"? "Should Python always..." or "Should
    the application always"?

    PEP 277 actually answers neither question. As Vincent explains,
    nothing changes with respect to using byte strings on the API. The
    changes only affect Unicode strings passed to functions expecting file names.

    > For example,
    >
    > myFile = unicode(__file__, "mbcs", "strict")
    >
    > This seems to work


    And it has nothing to do with PEP 277: You are not passing myFile to
    any API function.

    If you mean to use myFile as a file name, then yes: this is intended
    to work. However, using plain __file__ directly should also work.

    > Am I correct that conversions to Unicode (using "mbcs" on Windows) should be
    > done before passing arguments to os.path.join, os.path.split,
    > os.path.normpath, etc. ?


    You should either use only Unicode strings, or only byte strings. The
    functions of os.path are not all affected by the PEP 277
    implementation (although they probably should be).

    > Presumably os.path functions use the default
    > system encoding to convert strings to Unicode, which isn't likely to be
    > "mbcs" or anything else useful :)


    Right. This is actually unfortunate.

    > Are there any situations where some other encoding should be used instead on
    > Windows?


    Yes: when you get data from a cmd.exe window, which normally uses the OEM
    code page rather than the ANSI code page that "mbcs" refers to.

    > What about other platforms? For instance, does Linux allow non-ascii
    > file names?


    Yes, it does.

    > If so, what encoding should be specified when converting to Unicode?


    Nobody knows, but the convention is to use the locale's encoding, as
    returned by locale.getpreferredencoding().

    Regards,
    Martin
     
    Martin v. Löwis, Oct 21, 2003
    #4
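The convention described above (decode with the locale's encoding) can be sketched roughly as follows. This assumes a modern Python 3 interpreter, where byte strings are `bytes`; the helper name `decode_filename` and its optional `encoding` override are hypothetical and only there for illustration.

```python
import locale

def decode_filename(raw_name, encoding=None):
    """Decode a byte-string file name, defaulting to the locale's encoding."""
    if encoding is None:
        # The convention mentioned above: ask the locale which encoding to use.
        encoding = locale.getpreferredencoding()
    # "strict" raises UnicodeDecodeError if the bytes do not fit the encoding.
    return raw_name.decode(encoding, "strict")
```

Since nothing guarantees that a given file name was actually written in the locale's encoding, callers may want to catch UnicodeDecodeError.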
  5. Many thanks, Martin, for these comments. They are so helpful...

    > You should either use only Unicode strings, or only byte strings. The
    > functions of os.path are not all affected by the PEP 277
    > implementation (although they probably should).


    My working assumption is that all strings in my app must be Unicode strings.
    For example, the crashes happening right now trying to support Unicode
    filenames occur when a string is converted to Unicode in situations like:

    if fileName1 == fileName2:

    where one fileName is a unicode string and the other isn't yet. That's why
    I wanted to do:

    myFile = unicode(__file__, "mbcs", "strict")

    The challenge in my app is to make sure the proper encoding is used in the
    more than 30 situations where a filename gets created somehow. Naturally,
    that's not your problem, nor PEP 277's problem either :)

    > > If so, what encoding should be specified when converting to Unicode?

    >
    > Nobody knows, but the convention is to use the locale's encoding, as
    > returned by locale.getpreferredencoding().


    Thanks for this.

    Edward
    --------------------------------------------------------------------
    Edward K. Ream email:
    Leo: Literate Editor with Outlines
    Leo: http://webpages.charter.net/edreamleo/front.html
    --------------------------------------------------------------------
     
    Edward K. Ream, Oct 22, 2003
    #5
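One way to approach the "more than 30 situations" problem is to funnel every file name through a single conversion helper, so that comparisons only ever see text. The following is a rough sketch under that assumption, written for a modern Python 3 interpreter; the names `to_text` and `same_file_name` are hypothetical, and the caller is assumed to know which encoding applies.

```python
def to_text(name, encoding):
    """Return name as text, decoding it only if it is still a byte string."""
    if isinstance(name, bytes):
        # "strict" surfaces a bad encoding choice immediately.
        return name.decode(encoding, "strict")
    return name

def same_file_name(name1, name2, encoding):
    """Compare two file names after bringing both to text."""
    return to_text(name1, encoding) == to_text(name2, encoding)
```

Keeping the decode in one place means the choice of encoding can later be changed without touching each call site.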
  6. "Edward K. Ream" <> writes:

    > if fileName1 == fileName2:
    >
    > where one fileName is a unicode string and the other isn't yet. That's why
    > I wanted to do:
    >
    > myFile = unicode(__file__, "mbcs", "strict")


    Ah, I see. Instead of "mbcs", you should use
    sys.getfilesystemencoding(). This is what Python will use when
    converting the Unicode strings back to byte strings before passing
    them to the system (in case it converts back at all, which it doesn't
    on Windows thanks to PEP 277).

    Regards,
    Martin
     
    Martin v. Löwis, Oct 23, 2003
    #6
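The suggestion above, using sys.getfilesystemencoding() in place of a hard-coded "mbcs", might look like this in a modern Python 3 interpreter. The helper names `to_unicode_filename` and `to_byte_filename` are hypothetical; this is a sketch of the idea, not the code used in Leo.

```python
import sys

# The encoding Python itself uses when it has to convert file names.
FS_ENCODING = sys.getfilesystemencoding()

def to_unicode_filename(raw_name):
    """Decode a byte-string file name with the file system encoding."""
    return raw_name.decode(FS_ENCODING, "strict")

def to_byte_filename(text_name):
    """Encode a text file name back to bytes with the same encoding."""
    return text_name.encode(FS_ENCODING, "strict")
```

Python 3 also ships os.fsencode and os.fsdecode, which perform this conversion with the interpreter's own error handling.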
