PEP 263 status check


Martin v. Löwis

John said:
I don't believe I ever said that PEP 263 said there was
a difference. If I gave you that impression, I will
apologize if you can show me where I did it.

In <[email protected]>, titled
" PEP 263 status check", you write

My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error?
[end quote]

So I assumed you were talking all along about how this
is implemented, and how you expected it to be implemented,
and I assumed we agree that the implementation should
match the specification in PEP 263.
As far as I'm concerned, what PEP 263 says is utterly
irrelevant to the point I'm trying to make.

Then I don't know what the point is you are trying to
make. It appears that you are now saying that Python
does not work the way it should work. IOW, you are
proposing that it be changed, right? This sounds like
another PEP.
The only connection PEP 263 has to the entire thread
(at least from my view) is that I wanted to check on
whether phase 2, as described in the PEP, was
scheduled for 2.4. I was under the impression it was
and was puzzled by not seeing it. You said it wouldn't
be in 2.4. Question answered, no further issue on
that point (but see below for an additional puzzlement.)

Ok. A change of subject might have helped.
8-bit strings have a builtin assumption that one
byte equals one character.

Not at all. Some 8-bit strings don't denote characters
at all, and some 8-bit strings, at least in some regions
of the world, deliberately use multi-byte character
encodings. In particular, UTF-8 is such an encoding.
It's a basic assumption
in the string module, the string methods and all through
just about everything, and it's something that most
programmers expect, and IMO have every right
to expect.

Not at all. Most string methods don't assume anything
about characters. Instead, they assume that the building
block of a byte string is a "byte", and operate on those.
Only some methods of the string objects assume that the
bytes denote characters; they typically assume that the
current locale provides the definition of the character
set.
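In today's Python 3 terms (where the 8-bit str of this thread is spelled bytes), the distinction Martin draws can be sketched as follows; this is an illustration, not code from the thread:

```python
# One character, and its two-byte UTF-8 encoding.
text = "ö"
raw = text.encode("utf-8")

print(len(text))    # 1 -- character count
print(len(raw))     # 2 -- byte count
print(raw)          # b'\xc3\xb6'

# Byte-oriented operations work on bytes, not characters:
print(raw[0:1])     # b'\xc3' -- half of the encoded character
```

Slicing the bytes in the middle of a multi-byte sequence is exactly the accident the rest of the thread worries about.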
Now, people violate this assumption all the time,
for a number of reasons, including binary data and
encoded data (including utf-8 encodings)
but they do so deliberately, knowing what they're
doing. These particular exceptions don't negate the
rule.

Not at all. These usages are deliberate, equally-righted
applications of the string type. In Python, the string
type really is meant for binary data (unlike, say, C,
which has issues with NUL bytes).
The problem I have is that if you use utf-8 as the
source encoding, you can suddenly drop multi-byte
characters into an 8-bit string ***BY ACCIDENT***.
Ok.

(I don't know what happens with far Eastern multi-byte
encodings.)

The same issues as UTF-8, plus some additional worse issues.
Now, my suggested solution of this problem was
to require that 8-bit string literals in source that was
encoded with UTF-8 be restricted to the 7-bit
ascii subset.

Ok. I disagree that this is desirable; if you really
want to see that happen, you should write a PEP.
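A minimal sketch of the kind of check being proposed here: scan a UTF-8 source file and flag string literals containing non-ASCII characters. The helper below is hypothetical, not part of PEP 263 or any interpreter, and uses Python 3's tokenize module as a stand-in for the Python 2 tokenizer of this era:

```python
import io
import tokenize

def nonascii_literals(source_bytes, encoding="utf-8"):
    """Return (line_number, literal) pairs for string literals that
    contain non-ASCII characters.  A hypothetical checker sketching
    the proposed restriction, not real interpreter code."""
    text = source_bytes.decode(encoding)
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type == tokenize.STRING and not tok.string.isascii():
            hits.append((tok.start[0], tok.string))
    return hits

source = b'greeting = "Gr\xc3\xbc\xc3\x9f Gott"\nplain = "ok"\n'
print(nonascii_literals(source))   # flags line 1 only
```

Under the proposal, a hit in a plain (non-u'') literal of a UTF-8 file would be a compile-time error rather than a silent multi-byte string.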
The second possibility begs the question of what
encoding to use, which is why I don't seriously
propose it (although if I understand Hallvard's
position correctly, that's essentially his proposal.)

No. He proposes your third alternative (ban non-ASCII
characters in byte string literals), not just for UTF-8,
but for all encodings. Not for all files, though, but
only for selected files.
If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii.
[...]
The last sentence puzzles me. In 2.3, absent a declaration
(and absent a parameter on the interpreter) Python assumes
that the source is Latin-1, and phase 2 was to change
this to the 7-bit ascii subset (US-Ascii). That was the
original question at the start of this thread. I had assumed
that change was to go into 2.4, your reply made it seem
that it would go into 2.5 (maybe.) This statement makes
it seem that it is the current state in 2.3.

With "will assume", I actually meant future tense. Not
being a native speaker, I'm uncertain how to distinguish
this from the conditional form that you apparently understood.
Specifically, what would the Python 2.2 interpreter
have done if I handed it a program encoded in utf-8?
Was that a legitimate encoding?

Yes, the Python interpreter would have processed it.

print "Grüß Gott"

would have sent the greeting to the terminal.
I don't know whether
it was or not. Clearly it wouldn't have been possible
before the unicode support in 2.0.

Why do you think so? The above print statement has worked
since Python 1.0 or so. Before PEP 263, Python was unaware
of source encodings, and would literally copy the bytes
from the source code file into the string object - whether
they were latin-1, UTF-8, or some other encoding. The
only requirement was that the encoding be an ASCII
superset, so that Python could properly detect the end
of the string.
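Martin's last point, that the encoding only needs to be an ASCII superset so the tokenizer can find the closing quote, can be illustrated: in any ASCII-superset encoding (UTF-8 included), the byte 0x22 only ever means '"', so a byte-by-byte scan is safe. A sketch in modern Python, not the actual pre-PEP-263 tokenizer:

```python
def end_of_string_literal(raw, start):
    """Find the closing '"' of a literal opening at raw[start], scanning
    bytes only.  Safe for any ASCII-superset encoding: UTF-8 multi-byte
    sequences use only bytes >= 0x80, so byte 0x22 always means '"'.
    A sketch of the pre-PEP-263 situation, not real tokenizer code."""
    i = start + 1
    while raw[i] != ord('"'):
        i += 1
    return i

line = 'print "Grüß Gott"'.encode("utf-8")
open_q = line.index(b'"')
close_q = end_of_string_literal(line, open_q)
literal = line[open_q + 1:close_q]   # the raw bytes between the quotes
print(literal.decode("utf-8"))       # Grüß Gott
```

The bytes between the quotes were copied into the string object verbatim, which is why the old interpreter could "support" UTF-8 source without ever knowing about it.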

Regards,
Martin
 

Martin v. Löwis

Terry said:
While sympathizing with this notion, I have hitherto opposed it on the
basis that this would lead to code that could only be read by people within
each language group. But, rereading your idea, I realize that this
objection would be overcome by a reader that displayed for each Unicode
char (codepoint?) not its native glyph but a roman transliteration.

I personally consider this objection irrelevant. Yes, it is desirable
that portable libraries use only pronouncable (in English) identifiers.
However, that is no justification for the language to make a policy
decision that all source code in the language needs to use pronouncable
identifiers. Instead, the author of each piece of code needs to make
a decision what kind of identifiers to use. Some people (e.g. children)
don't care a bit if somebody 20km away can read their source code, let
alone somebody 10000km away - those far-away people will never get
to see the code in the first place.

So I doubt there is much need for transliterating source code viewers.
At the same time, it might be a fun project to do.
Some writing systems also have different number digits, which could also be
used natively and transliterated. A Unicode Python could also use a set of
user codepoints as an alternate coding of keywords for almost complete
nativification. I believe the math symbols are pretty universal (but could
be educated if not).

Now, this is different story. To implement this, the Python parser needs
to be changed to contain locale information, and one carefully has to
make an implementation so that the same code will run the same way
independent of the locale in which it is executed. This requires that
information about all locales is included in all installations, which
is expensive to maintain.

In addition, alternate keywords might not help so much, since real
integration into the natural language would also require changing
the order of identifiers and keywords - something that I consider
unimplementable.

Regards,
Martin
 

John Roth

Martin v. Löwis said:
John said:
I don't believe I ever said that PEP 263 said there was
a difference. If I gave you that impression, I will
apologize if you can show me where I did it.

In <[email protected]>, titled
" PEP 263 status check", you write

My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error?
[end quote]

So I assumed you were talking all along about how this
is implemented, and how you expected it to be implemented,
and I assumed we agree that the implementation should
match the specification in PEP 263.

Ah! My assumption was that the code had been
implemented correctly according to the specification,
and that the specification itself leaves a trap for the
unwary in one very significant (although also very narrow) case.
Then I don't know what the point is you are trying to
make. It appears that you are now saying that Python
does not work the way it should work. IOW, you are
proposing that it be changed, right? This sounds like
another PEP.

It could very well be another PEP.
Not at all. Some 8-bit strings don't denote characters
at all, and some 8-bit strings, at least in some regions
of the world, deliberately use multi-byte character
encodings. In particular, UTF-8 is such an encoding.

This is true, but it's also beside the point. Most *programmers*
(other than ones that use single-language multi-byte
encodings) make that assumption. If they didn't there
wouldn't be a problem.

Every tutorial I've ever seen on unicode spends a great
deal of time at the beginning explaining the difference
between bytes, characters, encodings and all that stuff.
If this was common knowledge, why would the authors
bother? They bother simply because it isn't common
knowledge, at least in the sense that it's wired into
developers' common coding intuitions and habits.
Ok. I disagree that this is desirable; if you really
want to see that happen, you should write a PEP.


No. He proposes your third alternative (ban non-ASCII
characters in byte string literals), not just for UTF-8,
but for all encodings. Not for all files, though, but
only for selected files.

Which is what I don't like about it. It adds complexity
to the language and a feature that I don't think is really
necessary (restricting string literals for single-byte encodings.)
The other thing I don't like is that it still leaves the
trap for the unwary which I'm discussing.
If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii.
[...]
The last sentence puzzles me. In 2.3, absent a declaration
(and absent a parameter on the interpreter) Python assumes
that the source is Latin-1, and phase 2 was to change
this to the 7-bit ascii subset (US-Ascii). That was the
original question at the start of this thread. I had assumed
that change was to go into 2.4, your reply made it seem
that it would go into 2.5 (maybe.) This statement makes
it seem that it is the current state in 2.3.

With "will assume", I actually meant future tense. Not
being a native speaker, I'm uncertain how to distinguish
this from the conditional form that you apparently understood.

Ah. I understand now. I understood the final clause as a
form of present tense. To make it a future I'd probably
stick the word 'eventually' or 'in Release 2.5' in there:
"will eventually assume" or "In Release 2.5, Python will assume..."
Yes, the Python interpreter would have processed it.

print "Grüß Gott"

would have sent the greeting to the terminal.

I see your point here. It does round trip successfully.

John Roth
 

Dieter Maurer

Martin v. Löwis said:
...
I personally believe it is too early yet to make lack of
encoding declarations a syntax error. I'd like to
reconsider the issue with Python 2.5.

I hope it will never come...

The declaration is necessary for modules that are distributed
all over the world but superfluous for modules only used locally
(with fixed encoding).

Dieter
 

Hallvard B Furuseth

John said:
Hallvard B Furuseth said:
An addition to Martin's reply:
John said:
To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print "True" twice, and len("ö") is 2.
OTOH, len(u"ö")==1.

(...)
I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.

Then you should also expect a lot of people to move to
another language - one whose designers live in the real
world instead of your Utopian Unicode world.

Rudeness objection to your characterization.

Sorry, I guess that was a bit over the top. I've just gotten so fed up
with bad charset handling, including over-standardization, over the
years. And as you point out, I misunderstood the scope of your
suggestion. But you have been saying that people should always use
Unicode, and things like that.
Please see my response to Martin - I'm talking only,
and I repeat ONLY, about scripts that explicitly
say they are encoded in utf-8. Nothing else. I've
been in this business for close to 40 years, and I'm
quite well aware of backwards compatibility issues
and issues with breaking existing code.

Programmers in general have a very strong, and
let me repeat that, VERY STRONG assumption
that an 8-bit string contains one byte per character
unless there is a good reason to believe otherwise.

Often true in our part of the world. However, another VERY STRONG
assumption is that if we feed the computer a raw character string and
ensure that it doesn't do any fancy charset handling, the program won't
mess with the string and things will Just Work. Well, except that
programs which strip the 8th bit are a problem. Whereas now there is
no telling what a program will do if it gets the idea that it can be
helpful about the character set.

The biggest problem with labeling anything as Unicode may be that it
will have to be converted back before it is output, but the program
often does not know which character set to convert it to. It might not
be running on a system where "the charset" is available in some standard
location. It might not be able to tell from the name of the locale. In
any case, the desired output charset might not be the same as that of
the current locale. So the program (or some module it is using) can
decide to guess, which can give very bad results, or it can fail, which
is no fun either. Or the programmer can set a default charset, even
though he does not know that the user will be using this charset. Or
the program can refuse to run unless the user configures the charset,
which is often nonsense.

The rest of my reply to that grew to a rather large rant with very
little relevance to PEP 263, so I moved it to the end of this message.

Anyway, the fact remains that in quite a number of situations, the
simplest way to do charset handling is to keep various programs firmly
away from charset issues. If a program does not know which charset is
in use, the best way is to not do any charset handling. In the case of
Python strings, that means 'str' literals instead of u'Unicode'
literals. Then the worst that can happen if the program is run with an
unexpected charset/encoding is that the strings built into the program
will not be displayed correctly.
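The "no charset handling" strategy described above can be sketched in modern Python: treat the data as opaque bytes end to end, so an unexpected encoding can at worst display wrongly, but can never raise a decode error. An illustration, not code from the thread:

```python
import io

def passthrough(in_stream, out_stream):
    """Copy data as opaque bytes: the program never decodes it, so
    whatever encoding the user's data is in, the program won't mess
    with it.  A sketch of the charset-agnostic style described above."""
    out_stream.write(in_stream.read())

data = "blåbærgrød\n".encode("latin-1")   # bytes in *some* encoding
src, dst = io.BytesIO(data), io.BytesIO()
passthrough(src, dst)
print(dst.getvalue() == data)   # True: the bytes came through untouched
```

Whether those bytes render correctly on the user's terminal is then the user's problem, which is exactly the division of labour being argued for.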

It would be nice to have a machinery to tag all strings, I/O channels
and so on with their charset/encoding and with what to do if a string
cannot be converted to that encoding, but lacking that (or lacking
knowledge of how to tag some data), no charset handling will remain
better than guesstimate charset handling in some situations.
This assumption is built into various places, including
all of the string methods.

I don't agree with that, but maybe it's a matter of how we view function
and type names, or something.
The current design allows accidental inclusion of
a character that is not in the 7bit ascii subset ***IN
A PROGRAM THAT HAS A UTF-8 CHARACTER
ENCODING DECLARATION*** to break that
assumption without any kind of notice. That in
turn will break all of the assumptions that the string
module and string methods are based on. That in
turn is likely to break lots of existing modules and
cause a lot of debugging time that could be avoided
by proper design.

For programs that think they work with Unicode strings, yes. For
programs that have no charset opinion, quite the opposite is true.
First, there's nothing that's stopping you. All that
my proposal will do is require you to do a one
time conversion of any strings you put in the
program as literals. It doesn't affect any other
strings in any other way at any other time.

It is not a one-time conversion if it's inside a loop or a small
function which is called many times. It would have to be moved out
to a global variable or something, which makes the program a lot more
cumbersome.

Second, any time one has to write more complex expressions to achieve
something, it becomes easier to introduce bugs. In particular when
people's solution will sometimes be to write '\xc3\xb8' instead of 'ø'
and add a comment with the real string. If the comment is wrong, which
happens, the bug may survive for a long time.
I'll withdraw my objection if you can seriously
assure me that working with raw utf-8 in
8-bit character string literals is what most programmers
are going to do most of the time.

Of course it isn't. Nor is working with a lot of other Python features.
I'm not going to accept the very common need
of converting unicode strings to 8-bit strings so
they can be written to disk or stored in a data base
or whatnot (or reversing the conversion for reading.)

That's your choice, of course. It's not mine.
That has nothing to do with the current issue - it's
something that everyone who deals with unicode
needs to do, regardless of the encoding of the
source program.

I'm not even sure which issue is the 'current issue',
if it makes that irrelevant.

========

<rant>
I've been a programmer for about 20 years, and for most of that time
the solution to charset issues in my environment (Tops-20, Unix, no
multi-cultural issues) has been for the user to take care of the
matter.

At first, the computer thought it was using ASCII, we were using
terminals and printers with NS_4551-1 - not that I knew a name for it
- and that was that. (NS_4551-1 is ASCII with [\]{|} replaced with
ÆØÅæøå.) If we wanted to print an ASCII file, there might be a switch
to get an ASCII font, we might have an ASCII printer/terminal, or we
just learned to read ÆØÅ as [\] and vice versa. A C program which
should output a Norwegian string would use [\\] as ÆØÅ - or the other
way around, depending on how one displayed the program.

Then some programs began to become charset-aware, but they "knew" that
we were using ASCII, and began to e.g. label everyone's e-mail
messages with "X-Charset: ASCII" or something. So such labels came in
practice to mean 'any character set'. The solution was to ignore that
label and get on with life. Maybe a program had to be tweaked a bit
to achieve that, but usually not. And it might or might not be
possible to configure a program to label things correctly, but since
everyone ignored the label anyway, who cared?

Then 8-bit character sets and MIME arrived, and the same thing
happened again: 'Content-Type: text/plain; charset=iso-8859-1' came to
mean 'any character set or encoding'. After all, programmers knew
that this was the charset everyone was using if they were not using
ASCII. This time it couldn't even be blamed on poor programmers: If I
remember correctly, MIME says the default character set is ASCII, so
programs _have_ to label 8-bit messages with a charset even if they
have no idea which charset is in use. Programs can make the charset
configurable, of course, but most users didn't know or care about such
things, so that was really no help.

Fortunately, most programs just displayed the raw bytes and ignored
the charset, so it was easy to stay with the old solution of ignoring
charset labels and get on with life. Same with e.g. the X window
system: Parts of it (cut&paste buffers? Don't remember) was defined to
work with latin-1, but NS_4551-1 fonts worked just fine. Of course,
if we pasted æøå from an NS_4551-1 window to a latin-1 window we got
{|}, but that was what we had learned to expect anyway. I don't
remember if one had to do some tweaking to convince X not to get
clever, but I think not.

Locales arrived too, and they might be helpful - except several
implementations were so buggy that programs crashed or misbehaved if
one turned them on. Also, it might or might not be possible to deduce
which character set was in use from the names of the locales. So, on
many machines, ignore them and move on.

Then UTF-8 arrived, and things got messy. We actually begun to need
to deal with different encodings as well as character sets.

UTF-8 texts labeled as iso-8859-1 (these still occur once in a while)
have to be decoded, it's not enough to switch the window's font if the
charset is wrong. Programs expecting UTF-8 would get a parse error on
iso-8859-1 input, it was not enough to change font.

There is a Linux box I'm sometimes doing remote logins to which I
can't figure out how to display non-ASCII characters. It insists that
I'm using UTF-8. My X11 font is latin-1. I can turn off the
locale settings, but then 8-bit characters are not displayed at all.
I'm sure there is some way to fix that, but I haven't bothered to find
out. I didn't need to dig around in manuals to find out that sort of
thing before.

I remember we had 3 LDAPv2 servers running for a while - one with
UTF-8, one with iso-8859-1, and one with T.61, which is the character
set which the LDAPv2 standard actually specified. Unless the third
server used NS_4551-1; I don't remember.

I've mentioned elsewhere that I had to downgrade Perl5.8 to a Unicode-
unaware version when my programs crashed. There was a feature to turn
off Unicode, but it didn't work. It seems to work in later versions.
Maybe it's even bug-free this time. I'm not planning to find out,
since we can't risk that these programs produce wrong output.

And don't get me started on Emacs MULE, a charset solution so poor
that from what I hear even Microsoft began to abandon it a decade
earlier (code pages). For a while the --unibyte helped, but after a
while that got icky. Oh well, most of the MULE bugs seem to be gone
now, after - is it 5 years?

The C language recently got both 8-bit characters and Unicode tokens
and literals (\unnnn). As far as I can tell, what it didn't get was
any provision for compilers and linkers which don't know which
character set is in use and therefore can't know which native
character should be translated to which Unicode character or vice
versa. So my guess is that compilers will just pick a character set
which seems likely if they aren't told. Or use the locale, which may
have nothing at all to do with which character set the program source
code is using. I may be wrong there, though; I only remember some of
the discussions on comp.std.c, I haven't checked the final standard.
</rant>

Of course, there are a lot of good sides to the story too - even locales
got cleaned up a lot, for example. And you'd get a very different story
from people in different environments (e.g. multi-cultural ones) or with
different operating systems even in Norway, but you already know that.
 

Hallvard B Furuseth

I said:
Sorry, I guess that was a bit over the top. I've just gotten so fed
up with bad charset handling, including over-standardization, over the
years. And as you point out, I misunderstood the scope of your
suggestion. But you have been saying that people should always use
Unicode, and things like that.

Sorry again, I seem to have confused you with Peter. I should have
gotten a clue when you said "if you want your source to be utf-8, you
need to accept the consequences". Not exactly the words of a True
Believer in Unicode:)
 

Hallvard B Furuseth

John said:
The problem I have is that if you use utf-8 as the
source encoding, you can suddenly drop multi-byte
characters into an 8-bit string ***BY ACCIDENT***.
(...)
Now, my suggested solution of this problem was
to require that 8-bit string literals in source that was
encoded with UTF-8 be restricted to the 7-bit
ascii subset.

Then shouldn't your solution catch all multibyte encodings, not
just UTF-8?
[Hallvard] proposes your third alternative (ban non-ASCII
characters in byte string literals), not just for UTF-8,
but for all encodings. Not for all files, though, but
only for selected files.

John said:
Which is what I don't like about it. It adds complexity
to the language and a feature that I don't think is really
necessary (restricting string literals for single-byte encodings.)

It's to prevent several errors:

* If the source file has one 'coding:' and the output destination has
another character set/encoding, then the wrong character set will be
output. Python offers two simple solutions to this:
- If the program is charset-aware, it can work with Unicode strings,
and the 8-bit string literal should be a Unicode literal.
- Otherwise, the program can stay away from Unicode and leave the
charset problem to the user.

* A worse case of the above: If the 8-bit output goes to an utf-8
destination, it won't merely give the wrong character, it will have
invalid format. So a program which reads the output may close the
connection it reads from, or fail to display the file at all, or -
if it is not robust - crash. I expect the same applies to other
multibyte encodings, and probably some single-byte encodings too.

* If the program is charset-aware and works with Unicode strings,
the Unicode handling blows up if it is passed an 8-bit str
(example copied from Anders' pychecker feature request):

# -*- coding: latin-1 -*-
x = "blåbærgrød"
unicode(x)
-->
Traceback (most recent call last):
  File "/tmp/u.py", line 3, in ?
    unicode(x)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5
in position 2: ordinal not in range(128)

The problem is that even though the file is tagged with latin-1, the
string x does not inherit that tag. So the Unicode handling doesn't
know which character set, if any, the string contains.
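In Python 3 terms the failure, and the explicit fix, look like this; the bytes carry no encoding tag, so the programmer has to supply one (a sketch, assuming the data really is Latin-1):

```python
x = b"bl\xe5b\xe6rgr\xf8d"   # Latin-1 bytes, but nothing says so

try:
    x.decode("ascii")        # the implicit guess the traceback above shows
except UnicodeDecodeError as err:
    print(err)               # reports the undecodable byte 0xe5

s = x.decode("latin-1")      # the tag, supplied explicitly
print(s)                     # blåbærgrød
```

The 'coding:' declaration tags the source file for the parser; it does not travel with the string object at run time.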
The other thing I don't like is that it still leaves the
trap for the unwary which I'm discussing.

Well, I would like to see a feature like this turned on by default
eventually (both for UTF-8 and other character sets), but for the time
being I'll stick to getting the feature into Python in the first place.

Though I do seem to have been too unambitious. For some reason I was
thinking it would be harder to get a new option into Python than a
per-file declaration.
 

Martin v. Löwis

Dieter said:
I hope it will never come...

Your hope will not be fulfilled. Some version of Python *will*
require that all non-ASCII source code contain an encoding
declaration. Read PEP 263, which has already been accepted.

Regards,
Martin
 
