PEP 383: Non-decodable Bytes in System Character Interfaces

Martin v. Löwis · Apr 22, 2009

I'm proposing the following PEP for inclusion into Python 3.1.
Please comment.

Regards,
Martin

PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Version: $Revision: 71793 $
Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
Author: Martin v. Löwis <[email protected]>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:

Abstract
========

File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs however allow
passing arbitrary bytes - whether these conform to a certain encoding
or not. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in such a way that allows
recreation of the original byte string.

Rationale
=========

The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data, however, the system
call interfaces make no assumption on the encoding of these data, and
pass them on as-is. With Python 3, character strings use a
Unicode-based internal representation, making it difficult to ignore
the encoding of byte strings in the same way that the C interfaces can
ignore the encoding.

On the other hand, Microsoft Windows NT has correct the original
design limitation of Unix, and made it explicit in its system
interfaces that these data (file names, environment variables, command
line arguments) are indeed character data, by providing a
Unicode-based API (keeping a C-char-based one for backwards
compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one, and a character-oriented one, where the
character-oriented one would be limited to not being able to represent
all data accurately. Unfortunately, for Windows, the situation would
be exactly the opposite: the byte-oriented interface cannot represent
all data; only the character-oriented API can. As a consequence,
libraries and applications that want to support all user data in a
cross-platform manner have to accept mish-mash of bytes and characters
exactly in the way that caused endless troubles for Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.

Specification
=============

On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects.

On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode. If the locale's encoding is UTF-8,
it can represent the full set of Unicode characters, otherwise, only a
subset is representable. In the latter case, using private-use
characters to represent these bytes would be an option. For UTF-8,
doing so would create an ambiguity, as the private-use characters may
regularly occur in the input also.

To convert non-decodable bytes, a new error handler "python-escape" is
introduced, which decodes non-decodable bytes using into a private-use
character U+F01xx, which is believed to not conflict with private-use
characters that currently exist in Python codecs.

The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again.

If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Discussion
==========

While providing a uniform API to non-decodable bytes, this interface
has the limitation that chosen representation only "works" if the data
get converted back to bytes with the python-escape error handler
also. Encoding the data with the locale's encoding and the (default)
strict error handler will raise an exception, encoding them with UTF-8
will produce non-sensical data.

For most applications, we assume that they eventually pass data
received from a system interface back into the same system
interfaces. For example, and application invoking os.listdir() will
likely pass the result strings back into APIs like os.stat() or
open(), which then encodes them back into their original byte
representation. Applications that need to process the original byte
strings can obtain them by encoding the character strings with the
file system encoding, passing "python-escape" as the error handler
name.

Copyright
=========

This document has been placed in the public domain.

v+python · Apr 22, 2009

I'm proposing the following PEP for inclusion into Python 3.1.
Please comment.

Basically the scheme doesn't work. Aside from that, it is very close.

There are tons of encoding schemes that could work... they don't have
to include half-surrogates or bytes. What they have to do, is make
sure that they are uniformly applied to all appropriate strings.

The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.

The assumption in the 2nd Discussion paragraph may hold for a large
percentage of cases, maybe even including some number of 9s, but it is
not guaranteed, and cannot be enforced, therefore there are cases that
could fail. Whether those failure cases are a concern or not is an
open question. Picking a character (I don't find U+F01xx in the
Unicode standard, so I don't know what it is) that is obscure, and
unlikely to be used in "real" file names, might help the heuristic
nature of the encoding and decoding avoid most conflicts, but provides
no guarantee that data puns will not occur in practice. Today's
obscure character is tomorrows commonly used character, perhaps.
Someone not on this list may be happily using that character for their
own nefarious, incompatible purpose.

As I realized in the email-sig, in talking about decoding corrupted
headers, there is only one way to guarantee this... to encode _all_
character sequences, from _all_ interfaces. Basically it requires
reserving an escape character (I'll use ? in these examples -- yes, an
ASCII question mark -- happens to be illegal in Windows filenames so
all the better on that platform, but the specific character doesn't
matter... avoiding / \ and . is probably good, though).

So the rules would be, when obtaining a file name from the bytes OS
interface, that doesn't properly decode according to UTF-8, decode it
by placing a ? at the beginning, then for each decodable UTF-8
sequence, add a Unicode character -- unless the character is ?, in
which case you add two ??, and for each non-decodable byte sequence,
place a ? and two hex digits, or a ? and a half surrogate code, or a ?
and whatever gibberish you like. Two hex digits are fine by me, and
will serve for this discussion.

ALSO, when obtaining a file name from the str OS interfaces, encode it
too... if it contains a ? at the front, it must be replaced by ???
and then any other ? in the name doubled.

Then you have a string that can/must be encoded to be used on either
str or bytes OS interfaces... or any other interfaces that want str or
bytes... but whichever they want, you can do a decode, or determine
that you can't, into that form. The encode and decode functions
should be available for coders to use, that code to external
interfaces, either OS or 3rd party packages, that do not use this
encoding scheme. This encoding scheme would be used throughout all
Python APIs (most of which would need very little change to
accommodate it). However, programs would have to keep track of
whether they were dealing with encoded or unencoded strings, if they
use both types in their program (an example, is hard-coded file names
or file name parts).

The initial ? is not strictly necessary for this scheme to work, but I
think it would be a good flag to the user that this name has been
altered.

This scheme does not depend on assumptions about the use of file
names.

This scheme would be enhanced if the file name APIs returned a subtype
of str for the encoded names, but that should be considered only a
hint, not a requirement.

When encoding file name strings to pass to bytes APIs, the ? followed
by two hex digits would be converted to a byte. Leading ? would be
dropped, and ?? would convert to ?. I don't believe failures are
possible when encoding to bytes.

When encoding file name strings to pass to str APIs, the discovery
of ? followed by two hex digits would raise an exception, the file
name is not acceptable to a str API. However, leading ? would be
dropped, and ?? would convert to ?, and if no ? followed by two hex
digits were found, the file name would be successfully converted for
use on the str API.

Note that not even on Unix/Posix is it particularly easy nor useful to
place a ? into file names from command lines due to shell escapes,
etc. The use of ? in file names also interferes with easy ability to
specifically match them in globs, etc.

Anything short of such an encoding of both types of interfaces, such
that it is known that all python-manipulated filenames will be
encoded, will have data puns that provide a potential for failure in
edge cases.

Note that in this scheme, no file names that are fully Unicode and do
not contain ? characters are altered by the decoding or the encoding
process. That will probably reach quite a few 9s of likelihood that
the scheme will go unnoticed by most people and programs and
filenames. But the scheme will work reliably if implemented correctly
and completely, and will have no edge cases of failure due to not
having data puns.

Python 3.5, bytes, and %-interpolation (aka PEP 461)	10	Feb 24, 2014
HELP , with operating system related program in c.	1	Mar 27, 2023
PEP 3131: Supporting Non-ASCII Identifiers	399	May 13, 2007
Frustrating circular bytes issue	1	Jun 26, 2012
Sudden error: SyntaxError: Non-ASCII character '\xc2' in file	5	Mar 30, 2011
stream bytes	2	Dec 5, 2011
[RFC] PEP 3143: supplementary group list concerns	1	Mar 11, 2012
Flatten an email Message with a non-ASCII body using 8bit CTE	0	Jan 24, 2013

PEP 383: Non-decodable Bytes in System Character Interfaces

Martin v. Löwis

v+python

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads