pep 277, Unicode filenames & mbcs encoding &c.

Edward K. Ream · Oct 21, 2003

Am I reading pep 277 correctly? On Windows NT/XP, should filenames always
be converted to Unicode using the mbcs encoding? For example,

myFile = unicode(__file__, "mbcs", "strict")

This seems to work, and I'm wondering whether there are any other details to
consider.

My experiments with Idle for Python 2.2 indicate that os.path.join doesn't
work as I expect when one of the args is a Unicode string. Everything
before the Unicode string gets thrown away. But this is probably moot: pep
277 implies Python 2.3...

Am I correct that conversions to Unicode (using "mbcs" on Windows) should be
done before passing arguments to os.path.join, os.path.split,
os.path.normpath, etc. ? Presumably os.path functions use the default
system encoding to convert strings to Unicode, which isn't likely to be
"mbcs" or anything else useful

Are there any situations where some other encoding should be used instead on
Windows? What about other platforms? For instance, does Linux allow
non-ascii file names? If so, what encoding should be specified when
converting to Unicode? Thanks.

Edward

vincent wehren · Oct 21, 2003

| Am I reading pep 277 correctly? On Windows NT/XP, should filenames always
| be converted to Unicode using the mbcs encoding? For example,
|
| myFile = unicode(__file__, "mbcs", "strict")

No and no. You can *still* use regular byte strings. Python will do the
conversion to Unicode for you using "mbcs" as encoding.

|
| This seems to work, and I'm wondering whether there are any other details to
| consider.
|
| My experiments with Idle for Python 2.2 indicate that os.path.join doesn't
| work as I expect when one of the args is a Unicode string. Everything
| before the Unicode string gets thrown away. But this is probably moot: pep
| 277 implies Python 2.3...

Exactly. Python Unicode file name support has arrived with 2.3.

|
....
|
| Are there any situations where some other encoding should be used instead on
| Windows? What about other platforms? For instance, does Linux allow
| non-ascii file names?

You can use "os.path.supports_unicode_filenames" to check...

| If so, what encoding should be specified when
| converting to Unicode? Thanks.

Probably the default encoding, on Linux.

HTH

Vincent Wehren


Just · Oct 21, 2003

vincent wehren said:
| Are there any situations where some other encoding should be used instead on
| Windows? What about other platforms? For instance, does Linux allow
| non-ascii file names?

You can use "os.path.supports_unicode_filenames" to check...

Actually, you can't, see:

http://python.org/sf/767645

The only two platforms that currently support unicode filenames properly
are Windows NT/XP and MacOSX, and for one of them
os.path.supports_unicode_filenames returns False

Just
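
For anyone who wants to see what their own interpreter reports, here is a minimal sketch. It only prints the flag discussed above next to the filesystem encoding; the function name `filename_support_report` is made up for illustration. Given the bug Just links to, the flag is probably best treated as a hint rather than a guarantee.

```python
import os.path
import sys


def filename_support_report():
    """Collect what this interpreter claims about file name handling.

    Hypothetical helper for illustration; it reads two values and
    makes no attempt to verify them against the real filesystem.
    """
    return {
        # Flag mentioned in the thread; may under-report on some platforms.
        "supports_unicode_filenames": os.path.supports_unicode_filenames,
        # Encoding Python uses between Unicode and byte-string file names.
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


report = filename_support_report()
for key, value in sorted(report.items()):
    print("%s: %r" % (key, value))
```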

Martin v. Löwis · Oct 21, 2003

Edward K. Ream said:
Am I reading pep 277 correctly? On Windows NT/XP, should filenames always
be converted to Unicode using the mbcs encoding?

What do you mean with "should"? "Should Python always..." or "Should
the application always"?

PEP 277 actually answers neither question. As Vincent explains,
nothing changes with respect to using byte strings on the API. The
changes only affect Unicode strings passed to functions expecting file names.

For example,

myFile = unicode(__file__, "mbcs", "strict")

This seems to work

And it has nothing to do with PEP 277: You are not passing myFile to
any API function.

If you mean to use myFile as a file name, then yes: this is intended
to work. However, using plain __file__ directly should also work.

Am I correct that conversions to Unicode (using "mbcs" on Windows) should be
done before passing arguments to os.path.join, os.path.split,
os.path.normpath, etc. ?

You should either use only Unicode strings, or only byte strings. The
functions of os.path are not all affected by the PEP 277
implementation (although they probably should).

Presumably os.path functions use the default
system encoding to convert strings to Unicode, which isn't likely to be
"mbcs" or anything else useful

Right. This is actually unfortunate.

Are there any situations where some other encoding should be used instead on
Windows?

If you get data from a cmd.exe Window.

What about other platforms? For instance, does Linux allow non-ascii
file names?

Yes, it does.

If so, what encoding should be specified when converting to Unicode?

Nobody knows, but the convention is to use the locale's encoding, as
returned by locale.getpreferredencoding().

Regards,
Martin
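
A rough sketch of two points from Martin's reply: keep every path component the same string type, and, where no better information exists, decode byte strings with the locale's encoding. The helper names `to_text` and `join_as_text` are hypothetical, and the sketch is written for a current Python, where byte strings are `bytes` and Unicode strings are `str`; in the Python 2.3 discussed here the decoding step would be `unicode(part, encoding)`.

```python
import locale
import os.path


def to_text(part, encoding=None):
    """Decode a byte-string path component; return text unchanged.

    Assumption: byte strings are in the locale's preferred encoding
    unless the caller passes a different one.
    """
    if isinstance(part, bytes):
        return part.decode(encoding or locale.getpreferredencoding(), "strict")
    return part


def join_as_text(*parts):
    """Join path components after converting all of them to text."""
    return os.path.join(*[to_text(p) for p in parts])


# Mixed input types end up as a single text path.
joined = join_as_text(b"projects", "leo", b"notes.txt")
print(joined)
```

Converting at the edge like this means os.path never has to guess an encoding for a mixed call.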

Edward K. Ream · Oct 22, 2003

Many thanks, Martin, for these comments. They are so helpful...

You should either use only Unicode strings, or only byte strings. The
functions of os.path are not all affected by the PEP 277
implementation (although they probably should).

My working assumption is that all strings in my app must be Unicode strings.
For example, the crashes happening right now trying to support Unicode
filenames occur when a string is converted to Unicode in situations like:

if fileName1 == fileName2:

where one fileName is a unicode string and the other isn't yet. That's why
I wanted to do:

myFile = unicode(__file__, "mbcs", "strict")

The challenge in my app is to make sure the proper encoding is used in the
more than 30 situations where a filename gets created somehow. Naturally,
that's not your problem, nor PEP 277's problem either

Nobody knows, but the convention is to use the locale's encoding, as
returned by locale.getpreferredencoding().

Thanks for this.

Edward

Martin v. Löwis · Oct 23, 2003

Edward K. Ream said:
if fileName1 == fileName2:

where one fileName is a unicode string and the other isn't yet. That's why
I wanted to do:

myFile = unicode(__file__, "mbcs", "strict")

Ah, I see. Instead of "mbcs", you should use
sys.getfilesystemencoding(). This is what Python will use when
converting the Unicode strings back to byte strings before passing
them to the system (in case it converts back at all, which it doesn't
on Windows thanks to PEP 277).

Regards,
Martin
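
As a sketch of Martin's suggestion applied to the comparison problem Edward describes: convert every file name to Unicode in one place, using sys.getfilesystemencoding(), and compare only the converted values. The names `filename_to_text` and `same_filename` are invented for this example, and the code targets a current Python (`bytes` and `str`); under Python 2.3 the conversion would read `unicode(name, sys.getfilesystemencoding())`.

```python
import sys


def filename_to_text(name):
    """Return a file name as a Unicode string.

    Byte strings are decoded with the filesystem encoding, which is
    the encoding Python itself uses for file names; text passes through.
    """
    if isinstance(name, bytes):
        return name.decode(sys.getfilesystemencoding(), "strict")
    return name


def same_filename(name1, name2):
    """Compare two file names after converting both to Unicode."""
    return filename_to_text(name1) == filename_to_text(name2)


print(same_filename(b"leo.py", "leo.py"))
print(same_filename(b"leo.py", "leoApp.py"))
```

Routing every place that creates a file name through one helper like this keeps the encoding choice in a single spot instead of 30.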
