Martin v. Löwis said:
From the viewpoint of PEP 263, there is absolutely *no*,
and I repeat NO difference between choosing UTF-8 and
choosing windows-1252 as the source encoding.
I don't believe I ever said that PEP 263 said there was
a difference. If I gave you that impression, I will
apologize if you can show me where I did it.
I don't understand why it is desirable to single out
UTF-8 as a source encoding. PEP 263 does no such thing,
except for allowing an additional encoding declaration
for UTF-8 (by means of the UTF-8 signature).
As far as I'm concerned, what PEP 263 says is utterly
irrelevant to the point I'm trying to make.
The only connection PEP 263 has to the entire thread
(at least from my view) is that I wanted to check on
whether phase 2, as described in the PEP, was
scheduled for 2.4. I was under the impression it was
and was puzzled by not seeing it. You said it wouldn't
be in 2.4. Question answered, no further issue on
that point (but see below for an additional puzzlement.)
PEP 263 never says such a thing. Why did you get this impression
after reading it?
I didn't get it from the PEP. I got it from what you said. Your
response seemed to make sense only if you assumed that I
had this totally idiotic idea that we should change everything
to 7-bit ascii. That was not my intention.
Let's go back to square one and see if I can explain my
concern from first principles.
8-bit strings have a builtin assumption that one
byte equals one character. This is something that
is ingrained in the basic fabric of many programming
languages, Python included. It's a basic assumption
in the string module, the string methods and all through
just about everything, and it's something that most
programmers expect, and IMO have every right
to expect.
Now, people violate this assumption all the time,
for a number of reasons, including binary data and
encoded data (including utf-8 encodings)
but they do so deliberately, knowing what they're
doing. These particular exceptions don't negate the
rule.
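A concrete sketch of the invariant (using the later str/bytes split from Python 3, purely for illustration):

```python
# A byte string indexes and counts by byte, not by character, so
# "one byte == one character" holds only for single-byte encodings.
text = "café"                    # four characters

latin1 = text.encode("latin-1")  # single-byte encoding
utf8 = text.encode("utf-8")      # multi-byte encoding

assert len(latin1) == 4          # invariant holds: one byte per character
assert len(utf8) == 5            # invariant broken: 'é' takes two bytes
```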
The problem I have is that if you use utf-8 as the
source encoding, you can suddenly drop multi-byte
characters into an 8-bit string ***BY ACCIDENT***.
This accident is not possible with single byte
encodings, which is why I am emphasizing that I
am only talking about source that is encoded in utf-8.
(I don't know what happens with Far Eastern multi-byte
encodings.)
UTF-8 encoded source has this problem. Source
encoded with single byte encodings does not have
this problem. It's as simple as that. Accordingly
it is not my intention, and has never been my
intention, to change the way 8-bit string literals
are handled when the source program has a
single byte encoding.
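To make the accident concrete, here is a sketch with explicit encode() calls; in a real Python 2 source file the same effect happened silently at compile time, depending only on the file's declared encoding:

```python
# What a literal containing 'é' becomes, depending on the source
# encoding of the file it appears in (simulated with explicit
# encode() calls; in Python 2 this happened silently).
char = "é"

in_latin1_source = char.encode("latin-1")  # b'\xe9'     : one byte
in_utf8_source = char.encode("utf-8")      # b'\xc3\xa9' : two bytes

assert len(in_latin1_source) == 1  # single-byte source: no surprise
assert len(in_utf8_source) == 2    # utf-8 source: the accident

# Worse, naive byte slicing then splits the character in half:
assert in_utf8_source[:1] == b"\xc3"  # not a valid character by itself
```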
We may disagree on whether this is enough of
a problem that it warrants a solution. That's life.
Now, my suggested solution of this problem was
to require that 8-bit string literals in source that was
encoded with UTF-8 be restricted to the 7-bit
ascii subset. The reason is that there are logically
three things that can be done here if we find a
character that is outside of the 7-bit ascii subset.
One is to do the current practice and violate the
one byte == one character invariant, the second
is to use some encoding to convert the non-ascii
characters into a single byte encoding, thus
preserving the one byte == one character invariant.
The third is to prohibit anything that is ambiguous,
which in practice means to restrict 8-bit literals
to the 7-bit ascii subset (plus hex escapes, of course.)
The second possibility raises the question of what
encoding to use, which is why I don't seriously
propose it (although if I understand Hallvard's
position correctly, that's essentially his proposal.)
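The third option can be sketched as a simple source checker. This is a hypothetical helper (find_nonascii_byte_literals is my name, not anything in the stdlib), written against modern Python where byte literals carry an explicit b prefix: flag any byte-string literal containing characters outside the 7-bit ascii subset, while letting hex escapes through, since those are spelled in pure ascii.

```python
import io
import tokenize

def find_nonascii_byte_literals(source):
    """Hypothetical checker for the third option: flag byte-string
    literals containing anything outside the 7-bit ascii subset.
    Hex escapes like \\xe9 are spelled in pure ascii, so they pass."""
    problems = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        prefix = tok.string.lstrip("rR")  # skip any raw-string prefix
        if not prefix.startswith(("b", "B")):
            continue
        if any(ord(ch) > 127 for ch in tok.string):
            problems.append((tok.start[0], tok.string))
    return problems

src = 'a = b"plain"\nb = b"caf\\xe9"\nc = b"caf\u00e9"\n'
print(find_nonascii_byte_literals(src))  # only line 3 is flagged
```

Notably, Python 3 eventually adopted essentially this rule: a bytes literal containing a non-ascii character is a SyntaxError, while \x escapes remain legal.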
*If* you understood that byte string literals can have the full
power of the source encoding, plus hex-escaping, I can't see what
made you think that power did not apply if the source encoding
was UTF-8.
I think I covered that adequately above. It's not that
it doesn't apply, it's that it's unsafe.
Might be. This is precisely the issue that Hallvard is addressing.
I agree there should be a mechanism to check whether all significant
non-ASCII characters are inside Unicode literals.
I think that means we're in substantive agreement (although
I see no reason to restrict comments to 7-bit ascii.)
I personally would prefer a command line switch over a per-file
declaration, but that would be the subject of Hallvard's PEP.
Under no circumstances would I disallow using the full source
encoding in byte strings, even if the source encoding is UTF-8.
I assume here you intended to mean strings, not literals. If
so, we're in agreement - I see absolutely no reason to even
think of suggesting a change to Python's run time string
handling behavior.
Even for UTF-8, you need an encoding declaration (although
the UTF-8 signature is sufficient for that purpose). If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii.
I think I didn't say this clearly. What I intended to get across
is that there isn't any major reason for a source to be utf-8;
other encodings are for the most part satisfactory.
Saying something about the declaration seems to have muddied
the meaning.
The last sentence puzzles me. In 2.3, absent a declaration
(and absent a parameter on the interpreter) Python assumes
that the source is Latin-1, and phase 2 was to change
this to the 7-bit ascii subset (US-Ascii). That was the
original question at the start of this thread. I had assumed
that change was to go into 2.4, your reply made it seem
that it would go into 2.5 (maybe.) This statement makes
it seem that it is the current state in 2.3.
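For reference, PEP 263 recognizes the declaration as a comment on line 1 or 2 of the file matching the pattern coding[:=]\s*([-\w.]+), e.g. "# -*- coding: latin-1 -*-". A simplified sketch of the recognition rule (source_encoding is my own illustrative helper, and the real rule only honors the pattern inside a comment):

```python
import re

# The recognition pattern PEP 263 specifies for the declaration
# comment; it must appear on line 1 or line 2 of the file.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def source_encoding(first_two_lines, default="ascii"):
    """Sketch: return the declared source encoding, or the default.
    (The "ascii" default is the phase-2 behavior discussed above;
    2.3 still defaulted to Latin-1.)"""
    for line in first_two_lines:
        m = CODING_RE.search(line)
        if m:
            return m.group(1)
    return default

print(source_encoding(["# -*- coding: utf-8 -*-", ""]))     # utf-8
print(source_encoding(["#!/usr/bin/env python", "x = 1"]))  # ascii
```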
I very much doubt that, in two ways:
a) Python 3.0 will not happen in any foreseeable future
I probably should let this sleeping dog lie, however,
there is a general expectation that there will be a 3.0
at some point before the heat death of the universe.
I was certainly under that impression, and I've seen
nothing from anyone who I regard as authoritative until
this statement that says otherwise.
b) if it happens, much code will stay the same, or only
require minor changes. I doubt that non-UTF-8 source
encoding will be banned in Python 3.
What do you mean, your entire program? All strings?
Certainly you were. Why not?
Of course, before UTF-8 was an RFC, there were no
editors available, nor would any operating system
support output in UTF-8, so you would need to
organize everything on your own (perhaps it was
simpler on Plan-9 at that time, but I have never
really used Plan-9 - and you might have needed
UTF-1 instead, anyway).
This doesn't make sense in context. I'm not talking
about some misty general UTF-8. I'm talking
about writing Python programs using the CPython
interpreter. Not jython, not IronPython, not some
other programming language.
Specifically, what would the Python 2.2 interpreter
have done if I handed it a program encoded in utf-8?
Was that a legitimate encoding? I don't know whether
it was or not. Clearly it wouldn't have been possible
before the unicode support in 2.0.
John Roth