Mike Meyer
Out of random curiosity, is there a PEP/thread/? that explains why
Python symbols are restricted to 7-bit ascii?
<mike
No PEP yet; I have meant to write one for several years now.
The principles would be:
- sources must use encoding declarations
- valid identifiers would follow the Unicode Consortium guidelines,
  in particular: identifiers would be normalized in NFKC (I think),
  adjusted in the ASCII range for backward compatibility (i.e.
  not introducing any additional ASCII characters as legal identifier
  characters)
- __dict__ will contain Unicode keys
- all objects should support Unicode getattr/setattr (potentially
  raising AttributeError, of course)
- open issue: what to do on the C API (perhaps nothing, perhaps
  allowing UTF-8)
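The normalization rule in the list above can be tried out with the standard unicodedata module. This is only a sketch under stated assumptions: it is written for Python 3, normalize_identifier is a hypothetical helper name that does not come from this thread, and it takes NFKC as the chosen form and str.isidentifier() as the validity check.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """Return the NFKC form of a candidate identifier (hypothetical helper)."""
    normalized = unicodedata.normalize("NFKC", name)
    # Assumption: Python 3's own identifier rule stands in for the
    # "Unicode consortium guidelines" mentioned in the list.
    if not normalized.isidentifier():
        raise ValueError(f"not a valid identifier: {name!r}")
    return normalized


# "o" followed by a combining diaeresis, and the precomposed "ö",
# end up as the same identifier after normalization.
decomposed = "Lo\u0308wis"
precomposed = "L\u00f6wis"
same = normalize_identifier(decomposed) == normalize_identifier(precomposed)
print(same)
```

Normalizing before the name is stored means that a __dict__ lookup does not depend on how an editor happened to encode the accented letter.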
Bengt said: Perhaps string equivalence in keys will be treated like numeric equivalence?
I.e., a key/name representation is established by the initial key/name binding, but
values can be retrieved by "equivalent" key/names with different representations
like unicode vs ascii or latin-1 etc.?
That would require that you know the encoding of a byte string; this
information is not available at run-time.
You could also try all possible encodings to see whether the strings
are equal if you chose the right encoding for each one. This would
be both expensive and unlike numeric equivalence: in numeric
equivalence, you don't give a sequence of bytes all possible
interpretations to find some interpretation in which they are
equivalent, either.
Bengt said: Agreed, that would be a mess.
There is one special case, though: when comparing a byte string
and a Unicode string, the system default encoding (i.e. ASCII)
is assumed. This only really works if the default encoding
really *is* ASCII. Otherwise, equal strings might not hash
equal, in which case you wouldn't find them properly in a
dictionary.
Bengt said: Perhaps the str (or future byte) type could have an encoding attribute
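The dictionary concern can be illustrated with a small sketch. It assumes Python 3, where byte strings and text never compare equal and no default encoding is applied implicitly, so it shows the end point of the problem rather than the default-encoding behaviour described above. All variable names are illustrative.

```python
text_key = "L\u00f6wis"
latin1_key = text_key.encode("latin-1")   # one byte for the umlaut
utf8_key = text_key.encode("utf-8")       # two bytes for the umlaut

table = {text_key: "found"}

# Neither byte string matches the text key, whatever its encoding.
latin1_hit = table.get(latin1_key)
utf8_hit = table.get(utf8_key)

# The lookup only succeeds once the bytes are decoded with the
# encoding they were actually written in.
decoded_hit = table.get(latin1_key.decode("latin-1"))

print(latin1_hit, utf8_hit, decoded_hit)
```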
Bengt said: Well, what will be assumed about name after the lines
#-*- coding: latin1 -*-
name = 'Martin Löwis'
?
I know type(name) will be <type 'str'> and in itself contain no encoding information now,
but why shouldn't the default assumption for literal-generated strings be what the coding
cookie specified?
I know the current implementation doesn't keep track of the different
encodings that could reasonably be inferred from the source of the strings,
but we are talking about future stuff here ;-)
#-*- coding: latin1 -*-
name = 'Martin Löwis'
it could be that name.encoding == 'latin-1'
Functions that generate strings, such as chr(), could be assumed to create
a string with the same encoding as the source code for the chr(...) invocation.
This is not a fully developed idea, and there has been discussion on the topic before
(even between us ;-) but I thought another round might bring out your current thinking
on it ;-)
Are you asking what is assumed about the identifier 'name', or the value
bound to that identifier? Currently, the identifier must be encoded in
latin1 in this source code, and it must only consist of letters, digits,
and the underscore.
The value of name will be a string consisting of the bytes
4d 61 72 74 69 6e 20 4c f6 77 69 73
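As a sketch of why those twelve bytes carry no encoding of their own, the same sequence can be decoded in more than one way. This assumes Python 3, where the bytes type plays the role of the byte string discussed here; cp1251 and utf-8 are picked only as contrasting examples.

```python
# The byte values quoted above, as a Python 3 bytes object.
raw = bytes.fromhex("4d 61 72 74 69 6e 20 4c f6 77 69 73")

# Interpreted as latin-1, byte f6 is the letter o with diaeresis.
as_latin1 = raw.decode("latin-1")

# Interpreted as cp1251, the same byte is a Cyrillic letter instead.
as_cp1251 = raw.decode("cp1251")

# Interpreted as UTF-8, the sequence is not valid at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1, as_cp1251, utf8_ok)
```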
That certainly is the assumption: string literals must be in the
encoding specified by the source encoding declaration, in the source
code file on disk. If they aren't (and cannot be interpreted that way),
you get a syntax error.
Bengt said: I meant the "literal-generated string" (internal str instance representation compiled
Ah, so you want the source encoding to be preserved, say as an attribute
of the string literal. This has been discussed many times, and was
always rejected.
Bengt said: Not of the string literal per se. That is only one (constant) expression resulting
Some people reject it because it is overkill: if you want reliable,
stable representation of characters, you should use Unicode strings.
Others reject it because of semantic difficulties: how would such
strings behave under concatenation, if the encodings are different?
That is not at all intuitive. I would have expected name.encoding
to be 'latin1'.
Bengt said: That's pretty dead-pan. Not even a smiley ;-)
What is the source of the chr invocation? If I do chr(param), should I
use the source where param was computed, or the source where the call
to chr occurs? If the latter, how should the interpreter preserve the
encoding of where the call came from?
Bengt said: The source file that the "chr(param)" appears in.
No, the param is numeric, and has no reasonably inferrable encoding. (I don't
not this latter, so not applicable.
What about the many other sources of byte strings (like strings read
from a file, or received via a socket)?
Bengt said: I mentioned that in parts you snipped. See above.
My thinking still is the same. It cannot really work, and it wouldn't do
any good with what little it could do. Just use Unicode strings.
Bengt said: <on tracking the encodings of literal-generated strings>
Bengt said: I mentioned that in parts you snipped (2nd half here):
This could also support s1+s2 to mean generate a concatenated string
that has the same encoding attribute if s1.encoding==s2.encoding and otherwise promotes
each to the platform standard unicode encoding and concatenates those if they
are different (and records the unicode encoding chosen in the result's encoding
attribute).
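The concatenation rule described above could be sketched roughly as follows. Everything here is hypothetical: TaggedBytes is an invented class name, Python 3 is assumed, and UTF-8 stands in for "the platform standard unicode encoding", which the thread does not pin down.

```python
import codecs

# Assumption: UTF-8 plays the role of the platform's Unicode encoding.
FALLBACK_ENCODING = "utf-8"


class TaggedBytes:
    """Hypothetical byte string that remembers the name of its encoding."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data
        # Store the canonical codec name so 'latin1' and 'latin-1' compare equal.
        self.encoding = codecs.lookup(encoding).name

    def text(self) -> str:
        return self.data.decode(self.encoding)

    def __add__(self, other: "TaggedBytes") -> "TaggedBytes":
        if self.encoding == other.encoding:
            # Same encoding: plain byte concatenation keeps the tag.
            return TaggedBytes(self.data + other.data, self.encoding)
        # Different encodings: promote both sides to Unicode, then re-encode
        # and record the encoding chosen for the result.
        combined = self.text() + other.text()
        return TaggedBytes(combined.encode(FALLBACK_ENCODING), FALLBACK_ENCODING)


first = TaggedBytes("Martin ".encode("latin-1"), "latin1")
last = TaggedBytes("L\u00f6wis".encode("latin-1"), "latin-1")
greek = TaggedBytes("\u03b1".encode("utf-8"), "utf-8")

same_encoding = first + last   # stays latin-1
mixed = last + greek           # promoted to the fallback encoding
print(same_encoding.encoding, mixed.encoding)
```

The sketch only covers strings whose encoding is known when they are created; bytes read from a file or a socket would still need a tag from somewhere.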
To hear "It cannot really work" causes me agitation, even if I know it's not worth
the effort to pursue it ;-)
Though I grant you
#-*- coding: latin1 -*-
name = u'Martin Löwis'
print name
is not that hard to do.
(Please excuse the use of your name, which has a handy non-ascii letter ;-)