Why ASCII-only symbols?

Mike Meyer

Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?

 
jepler

I'm not aware of any PEPs on the subject, but google groups turns up some past
threads. Here's one from February 2004:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/d5fcc1c8825a60dc/96856af647ce71d5
I didn't immediately find the message of Guido's that was mentioned at the
start of this thread, though.

Jeff

 
Peter Hansen

Mike said:
Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?

And of equally random curiosity :), what alternative(s) can you suggest
would have been appropriate? (I note that Unicode, for example, dates
from around the time Python was first released. And I can't really
imagine a non-ugly alternative, which probably reveals something bad
about my imagination.)

-Peter
 
Do Re Mi chel La Si Do

Hi !

I agree with you; I would love the ability to call functions named in Unicode.

@-salutations

Michel Claveau
 
Martin v. Löwis

Mike said:
Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?

No PEP yet; I have been meaning to write one for several years now.

The principles would be
- sources must use encoding declarations
- valid identifiers would follow the Unicode consortium guidelines,
in particular: identifiers would be normalized in NFKC (I think),
adjusted in the ASCII range for backward compatibility (i.e.
not introducing any additional ASCII characters as legal identifier
characters)
- __dict__ will contain Unicode keys
- all objects should support Unicode getattr/setattr (potentially
raising AttributeError, of course)
- open issue: what to do on the C API (perhaps nothing, perhaps
allowing UTF-8)
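For reference, Python 3 eventually adopted these principles as PEP 3131, so the NFKC point can be checked with the standard unicodedata module (a sketch against modern Python, not something available at the time of this thread):

```python
# PEP 3131 (Python 3) adopted the Unicode guidelines sketched above:
# identifiers are normalized to NFKC while parsing.
import unicodedata

ligature = '\ufb01le'  # 'file' spelled with U+FB01, the 'fi' ligature
assert unicodedata.normalize('NFKC', ligature) == 'file'

# The compiler applies the same normalization to identifiers, so the
# ligature spelling and the plain spelling name the same variable:
ns = {}
exec('\ufb01le_count = 1', ns)
assert ns['file_count'] == 1
```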

Regards,
Martin
 
Bengt Richter

No PEP yet; I meant to write one for several years now.

The principles would be
- sources must use encoding declarations
- valid identifiers would follow the Unicode consortium guidelines,
in particular: identifiers would be normalized in NFKC (I think),
adjusted in the ASCII range for backward compatibility (i.e.
not introducing any additional ASCII characters as legal identifier
characters)
- __dict__ will contain Unicode keys
- all objects should support Unicode getattr/setattr (potentially
raising AttributeError, of course)
- open issue: what to do on the C API (perhaps nothing, perhaps
allowing UTF-8)

Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent" key/names with different representations
like unicode vs ascii or latin-1 etc.?

Regards,
Bengt Richter
 
Martin v. Löwis

Bengt said:
Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent" key/names with different representations
like unicode vs ascii or latin-1 etc.?

That would require that you know the encoding of a byte string; this
information is not available at run-time.

You could also try all possible encodings to see whether the strings
are equal if you chose the right encoding for each one. This would
be both expensive and unlike numeric equivalence: in numeric
equivalence, you don't give a sequence of bytes all possible
interpretations to find some interpretation in which they are
equivalent, either.

There is one special case, though: when comparing a byte string
and a Unicode string, the system default encoding (i.e. ASCII)
is assumed. This only really works if the default encoding
really *is* ASCII. Otherwise, equal strings might not hash
equal, in which case you wouldn't find them properly in a
dictionary.
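Python 3 eventually resolved this special case by removing it altogether; a minimal sketch of the modern behavior:

```python
# Python 3 removed the implicit ASCII comparison: a byte string and a
# text string never compare equal, whatever the bytes contain.
text = 'abc'
data = b'abc'
assert data != text                   # no implicit decoding is attempted
assert data == text.encode('ascii')   # equality only after explicit encoding
```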

Regards,
Martin
 
Bengt Richter

That would require that you know the encoding of a byte string; this
information is not available at run-time.
Well, what will be assumed about name after the lines

#-*- coding: latin1 -*-
name = 'Martin Löwis'

?
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified? I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings, but we are talking
about future stuff here ;-)
You could also try all possible encodings to see whether the strings
are equal if you chose the right encoding for each one. This would
be both expensive and unlike numeric equivalence: in numeric
equivalence, you don't give a sequence of bytes all possible
interpretations to find some interpretation in which they are
equivalent, either.
Agreed, that would be a mess.
There is one special case, though: when comparing a byte string
and a Unicode string, the system default encoding (i.e. ASCII)
is assumed. This only really works if the default encoding
really *is* ASCII. Otherwise, equal strings might not hash
equal, in which case you wouldn't find them properly in a
dictionary.
Perhaps the str (or future bytes) type could have an encoding attribute
defaulting to None, meaning its instances are treated like current str instances.
Then setting the attribute to some particular encoding, like 'latin-1' (probably
normalized internally and represented as a C pointer slot holding NULL or a
pointer to the appropriate codec), would mark the str byte string as explicitly
encoded, without changing the byte string data or converting to a Unicode
representation. With encoding information explicitly present or absent, keys
could have a normalized hash and comparison, perhaps by default normalizing
encoding-tagged string keys to the platform UTF for dict purposes.

If this were done, I would think the automatic result of

#-*- coding: latin1 -*-
name = 'Martin Löwis'

could be that name.encoding == 'latin-1'

whereas without the encoding cookie, the default encoding assumption
for the program source would be used, and set explicitly to 'ascii'
or whatever it is.

Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.
Ditto for e.g. '%s == %c' % (65, 65)
And
s = u'Martin Löwis'.encode('latin-1')
would get
s.encoding == 'latin-1'
not
s.encoding == None
so that the encoding information could make
print s
mean
print s.decode(s.encoding)
(which would of course re-encode to the output device encoding for output, like
the current print s.decode('latin-1'), instead of failing under the current default
assumption of s.encoding == None, i.e. likely print s.decode('ascii'))

Hm, probably
s.encode(None)
and
s.decode(None)
could mean retrieve the str byte data unchanged as a str string with encoding set to None
in the result either way.

Now when you read a file in binary without specifying any encoding assumption, you
would get a str string with .encoding == None, but you could effectively reinterpret-cast it
to any encoding you like by assigning the encoding attribute. The attribute
could be a property that triggers decode/encode automatically to create data in the
new encoding. Going to or from the None encoding would not change the data bytes, but
changing between explicit encodings would cause a decode/encode.

This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).
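A minimal sketch of this hypothetical scheme (the EncodedStr class and its promotion rule are illustrative inventions for this discussion, not any real Python API):

```python
# A sketch of the hypothetical proposal above: a byte string carrying an
# optional .encoding attribute, promoting to UTF-8 on mixed concatenation.
# (EncodedStr is an illustrative invention, not a real Python type.)
class EncodedStr(bytes):
    def __new__(cls, data, encoding=None):
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        other_enc = getattr(other, 'encoding', None)
        if other_enc == self.encoding:
            # Same encoding: concatenate the bytes, keep the tag.
            return EncodedStr(bytes(self) + bytes(other), self.encoding)
        # Differing encodings: promote both to text, re-encode as UTF-8,
        # and record the chosen encoding in the result.
        left = bytes(self).decode(self.encoding or 'ascii')
        right = bytes(other).decode(other_enc or 'ascii')
        return EncodedStr((left + right).encode('utf-8'), 'utf-8')

s1 = EncodedStr(b'Martin ', 'ascii')
s2 = EncodedStr('Löwis'.encode('latin-1'), 'latin-1')
combined = s1 + s2
assert combined.encoding == 'utf-8'
assert bytes(combined).decode('utf-8') == 'Martin Löwis'
```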

This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)

Regards,
Bengt Richter
 
Martin v. Löwis

Bengt said:
Well, what will be assumed about name after the lines

#-*- coding: latin1 -*-
name = 'Martin Löwis'

?

Are you asking what is assumed about the identifier 'name', or the value
bound to that identifier? Currently, the identifier must be encoded in
latin1 in this source code, and it must only consist of letters, digits,
and the underscore.

The value of name will be a string consisting of the bytes
4d 61 72 74 69 6e 20 4c f6 77 69 73
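Those twelve bytes are exactly the latin-1 encoding of the name, which can be verified in modern Python, where the encode step is explicit:

```python
# The byte values listed above are the latin-1 encoding of the literal:
data = 'Martin Löwis'.encode('latin-1')
assert data == bytes.fromhex('4d 61 72 74 69 6e 20 4c f6 77 69 73')
assert len(data) == 12
```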
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified?

That certainly is the assumption: string literals must be in the
encoding specified in the source encoding, in the source code file
on disk. If they aren't (and cannot be interpreted that way), you
get a syntax error.
I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings,
but we are talking about future stuff here ;-)

Ah, so you want the source encoding to be preserved, say as an attribute
of the string literal. This has been discussed many times, and was
always rejected.

Some people reject it because it is overkill: if you want reliable,
stable representation of characters, you should use Unicode strings.

Others reject it because of semantic difficulties: how would such
strings behave under concatenation, if the encodings are different?
#-*- coding: latin1 -*-
name = 'Martin Löwis'

could be that name.encoding == 'latin-1'

That is not at all intuitive. I would have expected name.encoding
to be 'latin1'.
Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.

What is the source of the chr invocation? If I do chr(param), should I
use the source where param was computed, or the source where the call
to chr occurs? If the latter, how should the interpreter preserve the
encoding of where the call came from?

What about the many other sources of byte strings (like strings read
from a file, or received via a socket)?
This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)

My thinking still is the same. It cannot really work, and it wouldn't do
any good with what little it could do. Just use Unicode strings.

Regards,
Martin
 
Bengt Richter

Are you asking what is assumed about the identifier 'name', or the value
bound to that identifier? Currently, the identifier must be encoded in
latin1 in this source code, and it must only consist of letters, digits,
and the underscore.

The value of name will be a string consisting of the bytes
4d 61 72 74 69 6e 20 4c f6 77 69 73

Which is the latin-1 encoding. Ok, so far so good. We know it's latin-1, but the knowledge
is lost to Python.
That certainly is the assumption: string literals must be in the
encoding specified in the source encoding, in the source code file
on disk. If they aren't (and cannot be interpreted that way), you
get a syntax error.
I meant the "literal-generated string" (the internal str instance compiled
from the latin1-encoded source string literal).
Ah, so you want the source encoding to be preserved, say as an attribute
of the string literal. This has been discussed many times, and was
always rejected.
Not of the string literal per se. That is only one (constant) expression resulting
in a str instance. I want (for the sake of this discussion ;-) the str instance
to have an encoding attribute when it can reliably be inferred, as e.g. when a coding
cookie is specified and the str instance comes from a constant literal string expression.
Some people reject it because it is overkill: if you want reliable,
stable representation of characters, you should use Unicode strings.

Others reject it because of semantic difficulties: how would such
strings behave under concatenation, if the encodings are different?
I mentioned that in parts you snipped (2nd half here):
"""
Now when you read a file in binary without specifying any encoding assumption, you
would get a str string with .encoding==None, but you could effectively reinterpret-cast it
to any encoding you like by assigning the encoding attribute. The attribute
could be a property that causes decode/encode automatically to create data in the
new encoding. Going to or from the None encoding would not change the data bytes, but
changing between explicit encodings would cause a decode/encode.

This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).
"""
That is not at all intuitive. I would have expected name.encoding
to be 'latin1'.
That's pretty dead-pan. Not even a smiley ;-)
What is the source of the chr invocation? If I do chr(param), should I
The source file that the "chr(param)" appears in.
use the source where param was computed, or the source where the call
No; the param is numeric, and has no reasonably inferable encoding. (I don't
propose to have ord pass an encoding on for integers to carry ;-) So ord in another
module with a different source encoding could be the ultimate source, and an encoding
conversion could effectively happen with the integer as intermediary. But that's expected ;-)
to chr occurs? If the latter, how should the interpreter preserve the
encoding of where the call came from?
Not the latter, so not applicable.
What about the many other sources of byte strings (like strings read
from a file, or received via a socket)?
I mentioned that in parts you snipped. See above.
My thinking still is the same. It cannot really work, and it wouldn't do
any good with what little it could do. Just use Unicode strings.
To hear "It cannot really work" causes me agitation, even if I know it's not worth
the effort to pursue it ;-)

Anyway, ok, I'll leave it at that, but I'm not altogether happy with having to write

#-*- coding: latin1 -*-
name = 'Martin Löwis'
print name.decode('latin1')

where I think

#-*- coding: latin1 -*-
name = 'Martin Löwis'
print name

should reasonably produce the same output. Though I grant you

#-*- coding: latin1 -*-
name = u'Martin Löwis'
print name

is not that hard to do. (Please excuse the use of your name, which has a handy non-ascii letter ;-)

Regards,
Bengt Richter
 
Scott David Daniels

Bengt said:
<on tracking the encodings of literal-generated strings>

The big problem you'll hit is figuring out how to use these strings.
Which string ops preserve the encoding? Even the following is
problematic:

#-*- coding: utf-8 -*-
name = 'Martin Löwis'

brokenpart = name[:9]

Because brokenpart is not a correct utf-8 encoding of anything.
The problem is that there is no good way to propagate the
encoding without understanding the purpose of the operations
themselves.
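Modern Python makes the breakage explicit, since the bytes/str split forces the decode to happen somewhere; a check of the example above:

```python
# Slicing UTF-8 bytes at a fixed offset can cut a character in half:
name = 'Martin Löwis'.encode('utf-8')   # 'ö' encodes to two bytes, C3 B6
brokenpart = name[:9]                   # ends between those two bytes
try:
    brokenpart.decode('utf-8')
except UnicodeDecodeError:
    print('brokenpart is not valid UTF-8')
```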

--Scott David Daniels
(e-mail address removed)
 
Martin v. Löwis

Bengt said:
I mentioned that in parts you snipped (2nd half here):

This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).

It remains semantically difficult. There are other alternatives, e.g.
(s1+s2).encoding could become None, instead of using your procedure.

Also, this specification is incomplete: what if either s1.encoding
or s2.encoding is None?

Then, what if recoding to the platform encoding fails? With ASCII
being the default encoding at the moment, it is very likely that
concatenations will fail if there are funny characters in either
string.

If you propose that this should raise an exception, it means that
normal string concatenations will then give you exceptions just
as often as (or even more often than) you get UnicodeErrors
currently. I doubt users would like that.
To hear "It cannot really work" causes me agitation, even if I know it's not worth
the effort to pursue it ;-)

It is certainly implementable, yes. But it will then break a lot of
existing code.
Though I grant you

#-*- coding: latin1 -*-
name = u'Martin Löwis'
print name

is not that hard to do.

This is indeed what you should do. In Python 3, you can omit the u,
as the string type will go away (and be replaced with the Unicode type).
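That is indeed how it turned out: in Python 3 the source encoding defaults to UTF-8 and str is the Unicode type, so the spelling Bengt wanted just works (a check against modern Python):

```python
# Python 3: no coding cookie needed (UTF-8 is the default source
# encoding), and bare string literals are Unicode.
name = 'Martin Löwis'
print(name)
assert len(name) == 12               # twelve characters...
assert len(name.encode('utf-8')) == 13   # ...though UTF-8 needs 13 bytes
```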
(Please excuse the use of your name, which has a handy non-ascii letter ;-)

No problem with that :)

Regards,
Martin
 
