PEP 263 status check


John Roth

PEP 263 is marked finished in the PEP index, however
I haven't seen the specified Phase 2 in the list of changes
for 2.4 which is when I expected it.

Did phase 2 get cancelled, or is it just not in the
changes document?

John Roth
 

Martin v. Löwis

John said:
PEP 263 is marked finished in the PEP index, however
I haven't seen the specified Phase 2 in the list of changes
for 2.4 which is when I expected it.

Did phase 2 get cancelled, or is it just not in the
changes document?

Neither, nor. Although this hasn't been discussed widely,
I personally believe it is too early yet to make lack of
encoding declarations a syntax error. I'd like to
reconsider the issue with Python 2.5.

OTOH, not many people have commented either way: would you
be outraged if a script that has given you a warning about
missing encoding declarations for some time fails with a
strict SyntaxError in 2.4? Has everybody already corrected
their scripts?

Regards,
Martin
 

Fernando Perez

Martin v. Löwis said:
I personally believe it is too early yet to make lack of
encoding declarations a syntax error. I'd like to

+1

Making this an all-out failure is pretty brutal, IMHO. You could change the
warning message to be more stringent about it soon becoming an error. But if
someone upgrades to 2.4 because of other benefits, and some large third-party
code they rely on (and which is otherwise perfectly fine with 2.4) fails
catastrophically because of these warnings becoming errors, I suspect they
will be very unhappy.

I see the need to nudge people in the right direction, but there's no need to
do it with a 10,000-volt stick :)

Best,

f
 

John Roth

Martin v. Löwis said:
Neither, nor. Although this hasn't been discussed widely,
I personally believe it is too early yet to make lack of
encoding declarations a syntax error. I'd like to
reconsider the issue with Python 2.5.

OTOH, not many people have commented either way: would you
be outraged if a script that has given you a warning about
missing encoding declarations for some time fails with a
strict SyntaxError in 2.4? Has everybody already corrected
their scripts?

Well, I don't particularly have that problem because I don't
have a huge number of scripts and for the ones I do it would be
relatively simple to do a scan and update - or just run them
with the unit tests and see if they break!

In fact, I think that a scan and update program in the tools
directory might be a very good idea - just walk through a
Python library, scan and update everything that doesn't
have a declaration.
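A rough sketch of such a tool in modern Python follows; the regular expression is the one PEP 263 specifies for recognizing a declaration, while the function name, the `*.py` walk, and the UTF-8 default are purely illustrative:

```python
import re
from pathlib import Path

# PEP 263: the declaration must appear on line 1 or 2 and match this pattern.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def add_missing_declarations(root, encoding="utf-8"):
    """Walk a tree and prepend a coding declaration to .py files
    that lack one (an illustrative sketch, not a hardened tool)."""
    for path in Path(root).rglob("*.py"):
        raw = path.read_bytes()
        if any(CODING_RE.match(line) for line in raw.splitlines()[:2]):
            continue  # already declared
        header = ("# -*- coding: %s -*-\n" % encoding).encode("ascii")
        if raw.startswith(b"#!"):  # a shebang line must stay on line 1
            cut = raw.find(b"\n") + 1
            raw = raw[:cut] + header + raw[cut:]
        else:
            raw = header + raw
        path.write_bytes(raw)
```

A real tool would also want to verify that the file actually decodes in the encoding it writes.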

The issue has popped in and out of my awareness a few
times; what brought it up this time was Hallvard's thread.

My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error? The
PEP does not say so. If it isn't, what encoding will
it use to translate from unicode back to an 8-bit
encoding?

Another project for people who care about this
subject: tools. Of the half zillion editors, pretty printers
and so forth out there, how many check for the encoding
line and do the right thing with it? Which ones need to
be updated?

John Roth
 

Vincent Wehren

| > John Roth wrote:
| > > PEP 263 is marked finished in the PEP index, however
| > > I haven't seen the specified Phase 2 in the list of changes
| > > for 2.4 which is when I expected it.
| > >
| > > Did phase 2 get cancelled, or is it just not in the
| > > changes document?
| >
| > Neither, nor. Although this hasn't been discussed widely,
| > I personally believe it is too early yet to make lack of
| > encoding declarations a syntax error. I'd like to
| > reconsider the issue with Python 2.5.
| >
| > OTOH, not many people have commented either way: would you
| > be outraged if a script that has given you a warning about
| > missing encoding declarations for some time fails with a
| > strict SyntaxError in 2.4? Has everybody already corrected
| > their scripts?
|
| Well, I don't particularly have that problem because I don't
| have a huge number of scripts and for the ones I do it would be
| relatively simple to do a scan and update - or just run them
| with the unit tests and see if they break!

Here's another thought: the company I work for uses (embedded) Python as
a scripting language
for their report writer (among other things). Users can add little scripts
to their document templates which are used for printing database data. This
means there are literally hundreds of little Python scripts embedded
within the document templates, which themselves are stored in whatever
database is used as the backend. In such a case, "scan and update" when
upgrading gets a little more complicated ;)

|
| In fact, I think that a scan and update program in the tools
| directory might be a very good idea - just walk through a
| Python library, scan and update everything that doesn't
| have a declaration.
|
| The issue has popped in and out of my awareness a few
| times; what brought it up this time was Hallvard's thread.
|
| My specific question there was how the code handles the
| combination of UTF-8 as the encoding and a non-ascii
| character in an 8-bit string literal. Is this an error? The
| PEP does not say so. If it isn't, what encoding will
| it use to translate from unicode back to an 8-bit
| encoding?

Isn't this covered by:

"Embedding of differently encoded data is not allowed and will
result in a decoding error during compilation of the Python
source code."

--
Vincent Wehren


|
| Another project for people who care about this
| subject: tools. Of the half zillion editors, pretty printers
| and so forth out there, how many check for the encoding
| line and do the right thing with it? Which ones need to
| be updated?
|
| John Roth
| >
| > Regards,
| > Martin
|
|
 

Martin v. Löwis

John said:
In fact, I think that a scan and update program in the tools
directory might be a very good idea - just walk through a
Python library, scan and update everything that doesn't
have a declaration.

Good idea. I'll see whether I can write something before 2.4,
but contributions are definitely welcome.
My specific question there was how the code handles the
combination of UTF-8 as the encoding and a non-ascii
character in an 8-bit string literal. Is this an error? The
PEP does not say so. If it isn't, what encoding will
it use to translate from unicode back to an 8-bit
encoding?

UTF-8 is not in any way special wrt. the PEP. Notice that
UTF-8 is *not* Unicode - it is an encoding of Unicode, just
like ISO-8859-1 or us-ascii (although the latter two only
encode a subset of Unicode). Yes, the byte string literals
will be converted back to an "8-bit encoding", but the 8-bit
encoding will be UTF-8! IOW, byte string literals are always
converted back to the source encoding before execution.
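This round-trip is still visible in today's Python 3, where the compiler applies the encoding declaration when handed raw bytes; a small illustrative sketch (assuming Python 3, where the literal becomes a Unicode str):

```python
# The same character under two different declared source encodings;
# compile() decodes each byte source according to its declaration.
latin = b'# -*- coding: latin-1 -*-\ns = "\xf6"\n'
utf8  = b'# -*- coding: utf-8 -*-\ns = "\xc3\xb6"\n'

ns1, ns2 = {}, {}
exec(compile(latin, "<latin>", "exec"), ns1)
exec(compile(utf8, "<utf8>", "exec"), ns2)

# Both literals denote the same character, U+00F6 (ö).
print(ns1["s"] == ns2["s"] == "\u00f6")  # True
```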
Another project for people who care about this
subject: tools. Of the half zillion editors, pretty printers
and so forth out there, how many check for the encoding
line and do the right thing with it? Which ones need to
be updated?

I know IDLE, Eric, Komodo, and Emacs do support encoding
declarations. I know PythonWin doesn't, although I once
had written patches to add such support. A number of editors
(like notepad.exe) do the right thing only if the document
has the UTF-8 signature.
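Modern Python ships this detection logic in the standard library: tokenize.detect_encoding implements the PEP 263 rules, including the UTF-8 signature, so a tool need not reimplement them. A short sketch:

```python
import io
import tokenize

# A declaration on line 1 or 2 wins ...
declared = b"# -*- coding: iso-8859-1 -*-\nx = 1\n"
enc1, _ = tokenize.detect_encoding(io.BytesIO(declared).readline)
print(enc1)  # iso-8859-1

# ... and a bare UTF-8 signature (BOM) also counts as a declaration.
signed = b"\xef\xbb\xbfx = 1\n"
enc2, _ = tokenize.detect_encoding(io.BytesIO(signed).readline)
print(enc2)  # utf-8-sig
```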

Of course, editors don't necessarily need to actively
support the feature as long as the declared encoding is
the one they use, anyway. They won't display source in
other encodings correctly, but some of them don't have
the notion of multiple encodings, anyway.

Regards,
Martin
 

Martin v. Löwis

Vincent said:
Here's another thought: the company I work for uses (embedded) Python as
a scripting language
for their report writer (among other things). Users can add little scripts
to their document templates which are used for printing database data. This
means there are literally hundreds of little Python scripts embedded
within the document templates, which themselves are stored in whatever
database is used as the backend. In such a case, "scan and update" when
upgrading gets a little more complicated ;)

At the same time, it might also get simpler. If the user interface
to edit these scripts is encoding-aware, and/or the database to store
them in is encoding-aware, an automated tool would not need to guess
what the encoding in the source is.
| My specific question there was how the code handles the
| combination of UTF-8 as the encoding and a non-ascii
| character in an 8-bit string literal. Is this an error? The
| PEP does not say so. If it isn't, what encoding will
| it use to translate from unicode back to an 8-bit
| encoding?

Isn't this covered by:

"Embedding of differently encoded data is not allowed and will
result in a decoding error during compilation of the Python
source code."

No. It is perfectly legal to have non-ASCII data in 8-bit string
literals (aka byte string literals, aka <type 'str'>). Of course,
these non-ASCII data also need to be encoded in UTF-8. Whether UTF-8
is an 8-bit encoding, I don't know - it is more precisely described
as a multibyte encoding. At execution time, the byte string literals
then have the source encoding again, i.e. UTF-8.

Regards,
Martin
 

John Roth

Martin v. Löwis said:
John Roth wrote:

UTF-8 is not in any way special wrt. the PEP.

That's what I thought.
Notice that
UTF-8 is *not* Unicode - it is an encoding of Unicode, just
like ISO-8559-1 or us-ascii (although the latter two only
encode a subset of Unicode).

I disagree, but I think this is a definitional issue.
Yes, the byte string literals
will be converted back to an "8-bit encoding", but the 8-bit
encoding will be UTF-8! IOW, byte string literals are always
converted back to the source encoding before execution.

If I understand you correctly, if I put, say, a mixture of
Cyrillic, Hebrew, Arabic and Greek into a byte string
literal, at run time that character string will contain the
proper unicode at each character position?

Or are you trying to say that the character string will
contain the UTF-8 encoding of these characters; that
is, if I do a subscript, I will get one character of the
multi-byte encoding?

The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.
Regards,
Martin

John Roth
 

Michael Hudson

John Roth said:
If I understand you correctly, if I put, say, a mixture of
Cyrillic, Hebrew, Arabic and Greek into a byte string
literal, at run time that character string will contain the
proper unicode at each character position?

Uh, I seem to be making a habit of labelling things you suggest
impossible :)
Or are you trying to say that the character string will
contain the UTF-8 encoding of these characters; that
is, if I do a subscript, I will get one character of the
multi-byte encoding?

This is what happens, indeed.

Cheers,
mwh
 

Martin v. Löwis

John said:
Or are you trying to say that the character string will
contain the UTF-8 encoding of these characters; that
is, if I do a subscript, I will get one character of the
multi-byte encoding?

Michael is almost right: this is what happens. Except that
what you get, I wouldn't call a "character". Instead, it
is always a single byte - even if that byte is part of
a multi-byte character.

Unfortunately, the things that constitute a byte string
are also called characters in the literature.

To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print two times "True", and len("ö") is 2.
OTOH, len(u"ö")==1.
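In Python 3 terms, where the str/bytes split is explicit, the same demonstration reads as follows (the byte values are unchanged):

```python
s = "\u00f6"             # the character ö
b = s.encode("utf-8")    # its UTF-8 encoding

print(b == b"\xc3\xb6")  # True - two bytes ...
print(b[0:1] == b"\xc3") # True - indexing yields bytes
print(len(b), len(s))    # 2 1  - ... but one character
```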
The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.

What would you expect instead? Do you think your expectation
is implementable?

Regards,
Martin
 

John Roth

Martin v. Löwis said:
John said:
Or are you trying to say that the character string will
contain the UTF-8 encoding of these characters; that
is, if I do a subscript, I will get one character of the
multi-byte encoding?

Michael is almost right: this is what happens. Except that
what you get, I wouldn't call a "character". Instead, it
is always a single byte - even if that byte is part of
a multi-byte character.

Unfortunately, the things that constitute a byte string
are also called characters in the literature.

To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print two times "True", and len("ö") is 2.
OTOH, len(u"ö")==1.
The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.

What would you expect instead? Do you think your expectation
is implementable?

I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.

The reason for this is simply that wanting to put characters
outside of the 7-bit ascii subset into a byte character string
isn't portable. It just pushes the need for a character set
(encoding) declaration down one level of recursion.
There's already a way of doing this: use a unicode string,
so it's not like we need two ways of doing it.
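A check along these lines can be prototyped today with the standard tokenize module. This is only an illustrative sketch (the function name is invented): it flags plain string literals containing non-ASCII characters while exempting u-prefixed literals, as the proposal suggests.

```python
import io
import tokenize

def flag_non_ascii_literals(source_bytes):
    """Return (line, literal) pairs for plain string literals containing
    non-ASCII characters; u-prefixed (Unicode) literals are exempt."""
    hits = []
    for tok in tokenize.tokenize(io.BytesIO(source_bytes).readline):
        if tok.type != tokenize.STRING:
            continue
        text = tok.string
        # Everything before the first quote character is the prefix.
        quote_pos = min(i for i, ch in enumerate(text) if ch in "'\"")
        if "u" in text[:quote_pos].lower():
            continue  # Unicode literal: allowed under the proposal
        if any(ord(ch) > 127 for ch in text):
            hits.append((tok.start[0], text))
    return hits
```

A compiler-enforced version would of course reject the file rather than merely report the offending literals.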

Now I will grant you that there is a need for representing
the utf-8 encoding in a character string, but do we need
to support that in the source text when it's much more
likely that it's a programming mistake?

As far as implementation goes, it should have been done
at the beginning. Prior to 2.3, there was no way of writing
a program using the utf-8 encoding (I think - I might be
wrong on that) so there were no programs out there that
put non-ascii subset characters into byte strings.

Today it's one more forward migration hurdle to jump over.
I don't think it's a particularly large one, but I don't have
any real world data at hand.

John Roth
 

Martin v. Löwis

John said:
I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.

Are we still talking about PEP 263 here? If the entire source
code has to be in the 7-bit ASCII subset, then what is the point
of encoding declarations?

If you were suggesting that anything except Unicode literals
should be in the 7-bit ASCII subset, then this is still
unacceptable: Comments should also be allowed to contain non-ASCII
characters, don't you agree?

If you think that only Unicode literals and comments should be
allowed to contain non-ASCII, I disagree: At some point, I'd
like to propose support for non-ASCII in identifiers. This would
allow people to make identifiers that represent words from their
native language, which is helpful for people who don't speak
English well.

If you think that only Unicode literals, comments, and identifiers
should be allowed non-ASCII: perhaps, but this is out of scope
of PEP 263, which *only* introduces encoding declarations,
and explains what they mean for all current constructs.
The reason for this is simply that wanting to put characters
outside of the 7-bit ascii subset into a byte character string
isn't portable.

Define "is portable". With an encoding declaration, I can move
the source code from one machine to another, open it in an editor,
and have it display correctly. This was not portable without
encoding declarations (likewise for comments); with PEP 263,
such source code became portable.

Also, the run-time behaviour is fully predictable (which it
even was without PEP 263): At run-time, the string will have
exactly the same bytes that it does in the .py file. This
is fully portable.
It just pushes the need for a character set
(encoding) declaration down one level of recursion.

It depends on the program. E.g. if the program was to generate
HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
then the resulting program is absolutely, 100% portable.

For messages directly output to a terminal, portability
might not be important.
There's already a way of doing this: use a unicode string,
so it's not like we need two ways of doing it.

Using a Unicode string might not work, because a library might
crash when confronted with a Unicode string. You are proposing
to break existing applications for no good reason, and with
no simple fix.
Now I will grant you that there is a need for representing
the utf-8 encoding in a character string, but do we need
to support that in the source text when it's much more
likely that it's a programming mistake?

But it isn't! People do put KOI-8R into source code, into
string literals, and it works perfectly fine for them. There
is no reason to arbitrarily break their code.
As far as implementation goes, it should have been done
at the beginning. Prior to 2.3, there was no way of writing
a program using the utf-8 encoding (I think - I might be
wrong on that)

You are wrong. You were always able to put UTF-8 into byte
strings, even at a time where UTF-8 was not yet an RFC
(say, in Python 1.1).
so there were no programs out there that
put non-ascii subset characters into byte strings.

That is just not true. If it were true, there would be no
need to introduce a grace period in the PEP. However,
*many* scripts in the world use non-ASCII in string literals;
it was always possible (although the documentation was
wishy-washy on what it actually meant).
Today it's one more forward migration hurdle to jump over.
I don't think it's a particularly large one, but I don't have
any real world data at hand.

Trust me: the outcry for banning non-ASCII from string literals
would be, by far, louder than the one for a proposed syntax
on decorators. That would break many production systems, CGI
scripts would suddenly stop working, GUIs would crash, etc.

Regards,
Martin
 

John Roth

Martin v. Löwis said:
Are we still talking about PEP 263 here? If the entire source
code has to be in the 7-bit ASCII subset, then what is the point
of encoding declarations?

Martin, I think you misinterpreted what I said at the
beginning. I'm only, and I need to repeat this, ONLY
dealing with the case where the encoding declaration
specifically says that the script is in UTF-8. No other
case.

I'm going to deal with your response point by point,
but I don't think most of this is really relevant. Your
response only makes sense if you missed the point that
I was talking about scripts that explicitly declared their
encoding to be UTF-8, and no other scripts in no
other circumstances.

I didn't mean the entire source was in 7-bit ascii. What
I meant was that if the encoding was utf-8 then the source
for 8-bit string literals must be in 7-bit ascii. Nothing more.
If you were suggesting that anything except Unicode literals
should be in the 7-bit ASCII subset, then this is still
unacceptable: Comments should also be allowed to contain non-ASCII
characters, don't you agree?

Of course.
If you think that only Unicode literals and comments should be
allowed to contain non-ASCII, I disagree: At some point, I'd
like to propose support for non-ASCII in identifiers. This would
allow people to make identifiers that represent words from their
native language, which is helpful for people who don't speak
English well.

Likewise. I never thought otherwise; in fact I'd like to expand
the available operators to include the set operators as well as
the logical operators and the "real" division operator (the one
you learned in grade school - the dash with a dot above and
below the line.)
If you think that only Unicode literals, comments, and identifiers
should be allowed non-ASCII: perhaps, but this is out of scope
of PEP 263, which *only* introduces encoding declarations,
and explains what they mean for all current constructs.


Define "is portable". With an encoding declaration, I can move
the source code from one machine to another, open it in an editor,
and have it display correctly. This was not portable without
encoding declarations (likewise for comments); with PEP 263,
such source code became portable.
Also, the run-time behaviour is fully predictable (which it
even was without PEP 263): At run-time, the string will have
exactly the same bytes that it does in the .py file. This
is fully portable.

It's predictable, but as far as I'm concerned, that's
not only useless behavior, it's counterproductive
behavior. I find it difficult to imagine any case
where the benefit of having normal character
literals accidentally contain utf-8 multi-byte
characters outweighs the pain of having it happen
accidentally, and then figuring out why your program
is giving you weird behavior.

I would grant that there are cases where you
might want this behavior. I am pretty sure they
are in the distinct minority.

It depends on the program. E.g. if the program was to generate
HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
then the resulting program is absolutely, 100% portable.

It's portable, but that's not the normal case. See above.
For messages directly output to a terminal, portability
might not be important.

Portability is less of an issue for me than the likelihood
of making a mistake in coding a literal and then having
to debug unexpected behavior when one byte no longer
equals one character.

Using a Unicode string might not work, because a library might
crash when confronted with a Unicode string. You are proposing
to break existing applications for no good reason, and with
no simple fix.

There's no reason why you have to have a utf-8
encoding declaration. If you want your source to
be utf-8, you need to accept the consequences.
I fully expect Python to support the usual mixture
of encodings until 3.0 at least. At that point, everything
gets to be rewritten anyway.
But it isn't! People do put KOI-8R into source code, into
string literals, and it works perfectly fine for them. There
is no reason to arbitrarily break their code.


You are wrong. You were always able to put UTF-8 into byte
strings, even at a time where UTF-8 was not yet an RFC
(say, in Python 1.1).

Were you able to write your entire program in UTF-8?
I think not.
 

Hallvard B Furuseth

An addition to Martin's reply:

John said:
Martin v. Löwis said:
John Roth wrote:

To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print two times "True", and len("ö") is 2.
OTOH, len(u"ö")==1.
The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.

What would you expect instead? Do you think your expectation
is implementable?

I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.

Then you should also expect a lot of people to move to
another language - one whose designers live in the real
world instead of your Utopian Unicode world.
The reason for this is simply that wanting to put characters
outside of the 7-bit ascii subset into a byte character string
isn't portable.

Unicode isn't portable either.
Try to output a Unicode string to a device (e.g. your terminal)
whose character encoding is not known to the program.
The program will fail, or just output the raw utf-8 string or
something, or just guess some character set the program's author
is fond of.

For that matter, tell me why my programs should spend any time
on converting between UTF-8 and the character set the
application actually works with just because you are fond of
Unicode. That might be a lot more time than just the time spent
parsing the program. Or tell me why I should spell quite normal
text strings with hex escaping or something, if that's what you
mean.

And tell me why I shouldn't be allowed to work easily with raw
UTF-8 strings, if I do use coding:utf-8.
 

John Roth

Hallvard B Furuseth said:
An addition to Martin's reply:

John said:
Martin v. Löwis said:
John Roth wrote:

To be more specific: In an UTF-8 source file, doing

print "ö" == "\xc3\xb6"
print "ö"[0] == "\xc3"

would print two times "True", and len("ö") is 2.
OTOH, len(u"ö")==1.

The point of this is that I don't think that either behavior
is what one would expect. It's also an open invitation
for someone to make an unchecked mistake! I think this
may be Hallvard's underlying issue in the other thread.

What would you expect instead? Do you think your expectation
is implementable?

I'd expect that the compiler would reject anything that
wasn't either in the 7-bit ascii subset, or else defined
with a hex escape.

Then you should also expect a lot of people to move to
another language - one whose designers live in the real
world instead of your Utopian Unicode world.

I object to the rudeness of your characterization.

Please see my response to Martin - I'm talking only,
and I repeat ONLY, about scripts that explicitly
say they are encoded in utf-8. Nothing else. I've
been in this business for close to 40 years, and I'm
quite well aware of backwards compatibility issues
and issues with breaking existing code.

Programmers in general have a very strong, and
let me repeat that, VERY STRONG assumption
that an 8-bit string contains one byte per character
unless there is a good reason to believe otherwise.
This assumption is built into various places, including
all of the string methods.

The current design allows accidental inclusion of
a character that is not in the 7bit ascii subset ***IN
A PROGRAM THAT HAS A UTF-8 CHARACTER
ENCODING DECLARATION*** to break that
assumption without any kind of notice. That in
turn will break all of the assumptions that the string
module and string methods are based on. That in
turn is likely to break lots of existing modules and
cause a lot of debugging time that could be avoided
by proper design.

One of Python's strong points is that it's difficult
to get into trouble unless you deliberately try (then
it's quite easy, fortunately.)

I'm not worried about this causing people to
abandon Python. I'm more worried about the
current situation causing enough grief that people
will decide that utf-8 source code encoding isn't
worth it.
And tell me why I shouldn't be allowed to work easily with raw
UTF-8 strings, if I do use coding:utf-8.

First, there's nothing that's stopping you. All that
my proposal will do is require you to do a one
time conversion of any strings you put in the
program as literals. It doesn't affect any other
strings in any other way at any other time.

I'll withdraw my objection if you can seriously
assure me that working with raw utf-8 in
8-bit character string literals is what most programmers
are going to do most of the time.

I'm not going to accept the very common need
of converting unicode strings to 8-bit strings so
they can be written to disk or stored in a data base
or whatnot (or reversing the conversion for reading.)
That has nothing to do with the current issue - it's
something that everyone who deals with unicode
needs to do, regardless of the encoding of the
source program.

John Roth
 

Martin v. Löwis

John said:
Martin, I think you misinterpreted what I said at the
beginning. I'm only, and I need to repeat this, ONLY
dealing with the case where the encoding declaration
specifically says that the script is in UTF-8. No other
case.

From the viewpoint of PEP 263, there is absolutely *no*,
and I repeat NO difference between choosing UTF-8 and
choosing windows-1252 as the source encoding.
I'm going to deal with your response point by point,
but I don't think most of this is really relevant. Your
response only makes sense if you missed the point that
I was talking about scripts that explicitly declared their
encoding to be UTF-8, and no other scripts in no
other circumstances.

I don't understand why it is desirable to single out
UTF-8 as a source encoding. PEP 263 does no such thing,
except for allowing an additional encoding declaration
for UTF-8 (by means of the UTF-8 signature).
I didn't mean the entire source was in 7-bit ascii. What
I meant was that if the encoding was utf-8 then the source
for 8-bit string literals must be in 7-bit ascii. Nothing more.

PEP 263 never says such a thing. Why did you get this impression
after reading it?

*If* you understood that byte string literals can have the full
power of the source encoding, plus hex-escaping, I can't see what
made you think that power did not apply if the source encoding
was UTF-8.
Likewise. I never thought otherwise; in fact I'd like to expand
the available operators to include the set operators as well as
the logical operators and the "real" division operator (the one
you learned in grade school - the dash with a dot above and
below the line.)

That would be a different PEP, though, and I doubt Guido will be
in favour. However, this is OT for this thread.
It's predictable, but as far as I'm concerned, that's
not only useless behavior, it's counterproductive
behavior. I find it difficult to imagine any case
where the benefit of having normal character
literals accidentally contain utf-8 multi-byte
characters outweighs the pain of having it happen
accidentally, and then figuring out why your program
is giving you weird behavior.

Might be. This is precisely the issue that Hallvard is addressing.
I agree there should be a mechanism to check whether all significant
non-ASCII characters are inside Unicode literals.

I personally would prefer a command line switch over a per-file
declaration, but that would be the subject of Hallvard's PEP.
Under no circumstances would I disallow using the full source
encoding in byte strings, even if the source encoding is UTF-8.
There's no reason why you have to have a utf-8
encoding declaration. If you want your source to
be utf-8, you need to accept the consequences.

Even for UTF-8, you need an encoding declaration (although
the UTF-8 signature is sufficient for that matter). If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii.
I fully expect Python to support the usual mixture
of encodings until 3.0 at least. At that point, everything
gets to be rewritten anyway.

I very much doubt that, in two ways:
a) Python 3.0 will not happen, in any foreseeable future
b) if it happens, much code will stay the same, or only
require minor changes. I doubt that non-UTF-8 source
encoding will be banned in Python 3.
Were you able to write your entire program in UTF-8?
I think not.

What do you mean, your entire program? All strings?
Certainly you were. Why not?

Of course, before UTF-8 was an RFC, there were no
editors available, nor would any operating system
support output in UTF-8, so you would need to
organize everything on your own (perhaps it was
simpler on Plan-9 at that time, but I have never
really used Plan-9 - and you might have needed
UTF-1 instead, anyway).

Regards,
Martin
 

Martin v. Löwis

John said:
I've
been in this business for close to 40 years, and I'm
quite well aware of backwards compatibility issues
and issues with breaking existing code.

Programmers in general have a very strong, and
let me repeat that, VERY STRONG assumption
that an 8-bit string contains one byte per character
unless there is a good reason to believe otherwise.

You clearly come from a Western business. In CJK
languages, people are very aware that characters can
have more than one byte. They consider UTF-8 as just
another multi-byte encoding, and used to consider it
as an encoding that Westerners made to complicate their
lives. That attitude appears to be changing now, but
UTF-8 is not a clear winner in the worlds where we
Westerners would expect it to be a clear winner.
The current design allows accidental inclusion of
a character that is not in the 7bit ascii subset ***IN
A PROGRAM THAT HAS A UTF-8 CHARACTER
ENCODING DECLARATION*** to break that
assumption without any kind of notice.

This is a problem only for the Western world. In the
CJK languages, such programs were broken a long time
ago. I don't think Python needs to be so Americo-centric
as to protect American programmers from programming
mistakes.
That in
turn will break all of the assumptions that the string
module and string methods are based on. That in
turn is likely to break lots of existing modules and
cause a lot of debugging time that could be avoided
by proper design.

Indeed. If the program is currently not broken, why
are you changing the source encoding? If you are
trying to support multiple languages, a properly-
designed application would use gettext instead
of putting non-ASCII into source code.

If you are writing a new application, and you
put non-ASCII into the source, in UTF-8, are you
not testing your application properly?
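The gettext approach mentioned above keeps the source itself ASCII-only. A minimal sketch (the domain name "app" and the "locale" directory are hypothetical; fallback=True degrades to the original strings when no compiled catalog is installed):

```python
import gettext

# Source stays ASCII; translations live in external .mo catalogs under
# locale/<lang>/LC_MESSAGES/app.mo. With fallback=True, a missing
# catalog simply returns the original English strings.
t = gettext.translation("app", localedir="locale",
                        languages=["de"], fallback=True)
_ = t.gettext
print(_("Open file"))   # unchanged until a German catalog supplies it
```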
I'm not worried about this causing people to
abandon Python. I'm more worried about the
current situation causing enough grief that people
will decide that utf-8 source code encoding isn't
worth it.

Again, this is what Hallvard's PEP is for. It
does not apply to UTF-8 only, but I see no reason
why UTF-8 needs to be singled out.
I'll withdraw my objection if you can seriously
assure me that working with raw utf-8 in
8-bit character string literals is what most programmers
are going to do most of the time.

In what time scale? Near time, most people will use
other source encodings. In the medium term, I expect
Unix will switch to UTF-8 throughout, at which point
using UTF-8 byte strings will work on every Unix
system - the scripts, by nature, won't work on non-Unix
systems, anyway. In the long term, I expect all Python
strings will be Unicode strings, unless explicitly
declared as byte strings.

Regards,
Martin
 
T

Terry Reedy

Martin v. Löwis said:
If you think that only Unicode literals and comments should be
allowed to contain non-ASCII, I disagree: At some point, I'd
like to propose support for non-ASCII in identifiers. This would
allow people to make identifiers that represent words from their
native language, which is helpful for people who don't speak
English well.

Off the main topic of this thread, but...

While sympathizing with this notion, I have hitherto opposed it on the
basis that this would lead to code that could only be read by people within
each language group. But, rereading your idea, I realize that this
objection would be overcome by a reader that displayed for each Unicode
char (codepoint?) not its native glyph but a roman transliteration. As far
as I know, such transliterations, more or less standardized, exist at least
for all major alphabets and syllable systems. Indeed, I would find
Japanese code displayed as

for sushi in michiro.readlines():
print fuji(sushi)

clearer than 'English' code using identifiers like Q8zB2_0Ol1!

If the Unicode group does not distribute a master roman transliteration
table at least for alphabetic symbols, I would consider it a lack that
hinders adoption of Unicode.

Some writing systems also have different number digits, which could also be
used natively and transliterated. A Unicode Python could also use a set of
user codepoints as an alternate coding of keywords for almost complete
nativification. I believe the math symbols are pretty universal (but could
be educated if not).
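A rough sketch of the transliteration idea for Latin-script identifiers, using only the standard unicodedata module (this covers accented letters but not CJK, which would need a real transliteration table):

```python
import unicodedata

def romanize(name):
    # NFKD decomposition splits accented letters into a base letter plus
    # combining marks; dropping the marks leaves an ASCII approximation.
    # Letters with no decomposition (and all CJK) pass through unchanged,
    # so this is only a partial answer to the problem discussed above.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(romanize("café"))    # cafe
print(romanize("Løwis"))   # 'ø' has no NFKD decomposition, so it survives
```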

Terry J. Reedy
 
J

John Roth

Martin v. Löwis said:
From the viewpoint of PEP 263, there is absolutely *no*,
and I repeat NO difference between choosing UTF-8 and
choosing windows-1252 as the source encoding.

I don't believe I ever said that PEP 263 said there was
a difference. If I gave you that impression, I will
apologize if you can show me where I did it.


I don't understand why it is desirable to single out
UTF-8 as a source encoding. PEP 263 does no such thing,
except for allowing an addition encoding declaration
for UTF-8 (by means of the UTF-8 signature).

As far as I'm concerned, what PEP 263 says is utterly
irrelevant to the point I'm trying to make.

The only connection PEP 263 has to the entire thread
(at least from my view) is that I wanted to check on
whether phase 2, as described in the PEP, was
scheduled for 2.4. I was under the impression it was
and was puzzled by not seeing it. You said it wouldn't
be in 2.4. Question answered, no further issue on
that point (but see below for an additional puzzlement.)
PEP 263 never says such a thing. Why did you get this impression
after reading it?

I didn't get it from the PEP. I got it from what you said. Your
response seemed to make sense only if you assumed that I
had this totally idiotic idea that we should change everything
to 7-bit ascii. That was not my intention.

Let's go back to square one and see if I can explain my
concern from first principles.

8-bit strings have a builtin assumption that one
byte equals one character. This is something that
is ingrained in the basic fabric of many programming
languages, Python included. It's a basic assumption
in the string module, the string methods and all through
just about everything, and it's something that most
programmers expect, and IMO have every right
to expect.

Now, people violate this assumption all the time,
for a number of reasons, including binary data and
encoded data (including utf-8 encodings)
but they do so deliberately, knowing what they're
doing. These particular exceptions don't negate the
rule.

The problem I have is that if you use utf-8 as the
source encoding, you can suddenly drop multi-byte
characters into an 8-bit string ***BY ACCIDENT***.
This accident is not possible with single byte
encodings, which is why I am emphasizing that I
am only talking about source that is encoded in utf-8.
(I don't know what happens with far Eastern multi-byte
encodings.)

UTF-8 encoded source has this problem. Source
encoded with single byte encodings does not have
this problem. It's as simple as that. Accordingly
it is not my intention, and has never been my
intention, to change the way 8-bit string literals
are handled when the source program has a
single byte encoding.
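The invariant at stake can be shown in one line. In modern Python 3 terms (str vs. bytes, whereas the "8-bit strings" above are Python 2 str), a single non-ASCII character occupies several bytes under UTF-8, so byte length and character length diverge:

```python
# One byte per character holds for ASCII, but not for UTF-8 text:
text = "café"                  # 4 characters
data = text.encode("utf-8")    # 'é' becomes the two bytes 0xC3 0xA9
print(len(text), len(data))    # 4 5
```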

We may disagree on whether this is enough of
a problem that it warrants a solution. That's life.

Now, my suggested solution of this problem was
to require that 8-bit string literals in source that was
encoded with UTF-8 be restricted to the 7-bit
ascii subset. The reason is that there are logically
three things that can be done here if we find a
character that is outside of the 7-bit ascii subset.

One is to do the current practice and violate the
one byte == one character invariant, the second
is to use some encoding to convert the non-ascii
characters into a single byte encoding, thus
preserving the one byte == one character invariant.
The third is to prohibit anything that is ambiguous,
which in practice means to restrict 8-bit literals
to the 7-bit ascii subset (plus hex escapes, of course.)

The second possibility begs the question of what
encoding to use, which is why I don't seriously
propose it (although if I understand Hallvard's
position correctly, that's essentially his proposal.)
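The third option made concrete, again in modern Python 3 bytes-literal notation: the literal itself stays pure ASCII, and the UTF-8 bytes of 'é' are spelled out as hex escapes, so a reader cannot mistake it for one-byte-per-character text:

```python
# ASCII-only literal with explicit hex escapes for the UTF-8 bytes of 'é'.
# The multi-byte nature of the data is visible right in the source.
data = b"caf\xc3\xa9"
print(len(data))               # 5 bytes, explicitly
print(data.decode("utf-8"))    # café
```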
*If* you understood that byte string literals can have the full
power of the source encoding, plus hex-escaping, I can't see what
made you think that power did not apply if the source encoding
was UTF-8.

I think I covered that adequately above. It's not that
it doesn't apply, it's that it's unsafe.
Might be. This is precisely the issue that Hallvard is addressing.
I agree there should be a mechanism to check whether all significant
non-ASCII characters are inside Unicode literals.

I think that means we're in substantive agreement (although
I see no reason to restrict comments to 7-bit ascii.)
I personally would prefer a command line switch over a per-file
declaration, but that would be the subject of Hallvard's PEP.
Under no circumstances would I disallow using the full source
encoding in byte strings, even if the source encoding is UTF-8.

I assume here you intended to mean strings, not literals. If
so, we're in agreement - I see absolutely no reason to even
think of suggesting a change to Python's run time string
handling behavior.
Even for UTF-8, you need an encoding declaration (although
the UTF-8 signature is sufficient for that matter). If
there is no encoding declaration whatsoever, Python will
assume that the source is us-ascii.

I think I didn't say this clearly. What I intended to get across
is that there isn't any major reason for a source to be utf-8;
other encodings are for the most part satisfactory.
Saying something about the declaration seems to have muddied
the meaning.

The last sentence puzzles me. In 2.3, absent a declaration
(and absent a parameter on the interpreter) Python assumes
that the source is Latin-1, and phase 2 was to change
this to the 7-bit ascii subset (US-Ascii). That was the
original question at the start of this thread. I had assumed
that change was to go into 2.4, your reply made it seem
that it would go into 2.5 (maybe.) This statement makes
it seem that it is the current state in 2.3.
I very much doubt that, in two ways:
a) Python 3.0 will not happen, in any foreseeable future

I probably should let this sleeping dog lie, however,
there is a general expectation that there will be a 3.0
at some point before the heat death of the universe.
I was certainly under that impression, and I've seen
nothing from anyone whom I regard as authoritative until
this statement that says otherwise.
b) if it happens, much code will stay the same, or only
require minor changes. I doubt that non-UTF-8 source
encoding will be banned in Python 3.


What do you mean, your entire program? All strings?
Certainly you were. Why not?

Of course, before UTF-8 was an RFC, there were no
editors available, nor would any operating system
support output in UTF-8, so you would need to
organize everything on your own (perhaps it was
simpler on Plan-9 at that time, but I have never
really used Plan-9 - and you might have needed
UTF-1 instead, anyway).

This doesn't make sense in context. I'm not talking
about some misty general UTF-8. I'm talking
about writing Python programs using the c-python
interpreter. Not jython, not IronPython, not some
other programming language.
Specifically, what would the Python 2.2 interpreter
have done if I handed it a program encoded in utf-8?
Was that a legitimate encoding? I don't know whether
it was or not. Clearly it wouldn't have been possible
before the unicode support in 2.0.

John Roth
 
J

John Roth

Martin v. Löwis said:
You clearly come from a Western business. In CJK
languages, people are very aware that characters can
have more than one byte. They consider UTF-8 as just
another multi-byte encoding, and used to consider it
as an encoding that Westerners made to complicate their
lives. That attitude appears to be changing now, but
UTF-8 is not a clear winner in the worlds where we
Westerners would expect it to be a clear winner.

I'm aware of that.
This is a problem only for the Western world. In the
CJK languages, such programs were broken a long time
ago. I don't think Python needs to be so Americo-centric
as to protect American programmers from programming
mistakes.

American != non-East-Asian.

In fact, I would consider American programmers to
be the least prone to making this kind of mistake
simply because all standard characters are included
in the US-Ascii subset. It's much more likely to be
a European (or non-North-American) problem.
Even when writing in English, people's names will
have non-English characters, and they have a
tendency to leak into literals.
(Mexico considers itself to be part of
Central America, for some political reason.)
Indeed. If the program is currently not broken, why
are you changing the source encoding? If you are
trying to support multiple languages, a properly-
designed application would use gettext instead
of putting non-ASCII into source code.

If you are writing a new application, and you
put non-ASCII into the source, in UTF-8, are you
not testing your application properly?


Again, this is what Hallvard's PEP is for. It
does not apply to UTF-8 only, but I see no reason
why UTF-8 needs to be singled out.


In what time scale? Near time, most people will use
other source encodings. In the medium term, I expect
Unix will switch to UTF-8 throughout, at which point
using UTF-8 byte strings will work on every Unix
system - the scripts, by nature, won't work on non-Unix
systems, anyway. In the long term, I expect all Python
strings will be Unicode strings, unless explicitly
declared as byte strings.

I asked Hallvard this question, not you. It makes sense
in the context of the statements of his I was responding to.

Your answer does not make sense. Hallvard's objection
was that he actually wanted to have non-ascii characters
put into byte literals in their utf-8 encoded forms (at least
as I understand it.)

If I thought about it, I could undoubtedly come up with
use cases where I would find this behavior useful. The
presupposition behind my statement was that those
use cases were overwhelmingly less likely than the
standard uses of byte string literals where a utf-8
encoded "character" would be a problem.

John Roth
 
