Allowing non-ASCII identifiers (François Pinard)

Discussion in 'Python' started by Doug Fort, Feb 9, 2004.

  1. Doug Fort

    Doug Fort Guest

    This is an excerpt from a much longer post on the python-dev mailing list.
    I'm responding here, to avoid cluttering up python-dev.

    [François Pinard]
    <snip>
    >Some English readers might not really imagine, but it is a constant
    >misery, having to mangle identifiers while documenting and thinking
    >in languages other than English, merely because the Python notion of
    >letter is limited to the English subset. Granted, keywords and standard
    >library use English, this is Python, and this is not at stake here!
    >However, there is a good part of code in local (or in-house) programs
    >which is thought as our crafted code, and even the linguistic change is
    >useful (to us) for segregating between what comes from the language and
    >what comes from us. The idea is extremely appealing of being able to
    >craft and polish our code (comments, strings, identifiers) to make it as

    >nice as it could get, while thinking in our native, natural language.
    >--
    >François Pinard http://www.iro.umontreal.ca/~pinard

    </snip>

    Monoglot English speakers, like me, might also benefit from reading
    well-crafted Python code with non-English identifiers and comments. I learn
    best by anchoring new ideas in a familiar context.

    One of my (non-programmer) friends is improving his French by working
    through the French versions of the Harry Potter novels.
     
    Doug Fort, Feb 9, 2004
    #1

  2. Paul Prescod

    Paul Prescod Guest

    Doug Fort wrote:

    > [François Pinard]
    > <snip>
    >
    >>Some English readers might not really imagine, but it is a constant
    >>misery, having to mangle identifiers while documenting and thinking
    >>in languages other than English, merely because the Python notion of
    >>letter is limited to the English subset. Granted, keywords and standard
    >>library use English, this is Python, and this is not at stake here!
    >>However, there is a good part of code in local (or in-house) programs
    >>which is thought as our crafted code, and even the linguistic change is
    >>useful (to us) for segregating between what comes from the language and
    >>what comes from us. The idea is extremely appealing of being able to
    >>craft and polish our code (comments, strings, identifiers) to make it as


    I wonder if the proposal would be more palatable if it were restricted
    to 8-bit encodings (what we used to call "code pages"). This is at least
    a first step in the right direction that would help westerners and could
    be made to work even if Python were compiled without Unicode support.
    (it is still possible to compile Python without Unicode isn't it?)

    Paul Prescod
     
    Paul Prescod, Feb 9, 2004
    #2

  3. [Paul Prescod]

    > I wonder if the proposal would be more palatable if it were restricted
    > to 8-bit encodings (what we used to call "code pages"). This is at
    > least a first step in the right direction that would help westerners
    > and could be made to work even if Python were compiled without Unicode
    > support.


    To repeat something I was writing to python-dev earlier today, it
    already works by some kind of accident. A smallish main program
    could do:

    import locale
    locale.setlocale(locale.LC_ALL, '')
    import the_real_application  # placeholder: your actual main module

    to activate your code page, given your environment is already set for
    it. This will activate proper classification of characters in <ctype.c>
    and then, Python seems to behave properly with non-ASCII identifiers
    within the imported application.

    It is an accident because it was not meant this way by Guido, at least
    as far as I know. The trick might break at various places, who knows.
    I did not test it seriously, and do not intend to rely on it, as Guido
    might even choose to consider this as a bug to be corrected.

    The plan rather seems to be to support non-ASCII identifiers widely
    instead of parsimoniously, if Python ever does it, or not at all. The
    decision has not been taken yet, Guido wants a PEP and a discussion
    first.

    In my experience, such discussions are often rough (or at least
    demanding), because people have a lot of emotions on linguistic
    issues, and do not always show the real relations between emotions and
    rationalisations, which sometimes get convoluted.

    > (it is still possible to compile Python without Unicode isn't it?)


    I would guess that Unicode in Python is central if you want codecs to
    work, in particular for all code pages which Python currently supports.

    --
    François Pinard http://www.iro.umontreal.ca/~pinard
     
    François Pinard, Feb 9, 2004
    #3
  4. Re: Allowing non-ASCII identifiers

    Paul Prescod wrote:
    > I wonder if the proposal would be more palatable if it were restricted
    > to 8-bit encodings (what we used to call "code pages"). This is at least
    > a first step in the right direction that would help westerners and could
    > be made to work even if Python were compiled without Unicode support.
    > (it is still possible to compile Python without Unicode isn't it?)


    I doubt that it would matter much to those currently opposed; I know
    that *I* would be opposed to such a strategy: Allowing arbitrary source
    code encoding is no technical challenge whatsoever, and restricting
    it to single-byte encodings is an arbitrary restriction.

    I believe Guido's concern is more along the lines "How do I call a
    function that has a ł in its name, or a Σ?", or, even, "How can I
    find out what the function does, by looking at its name and doc
    string, if that is in Polish or Greek?" The fact that there is
    a single-byte encoding for either character doesn't really help
    here.

    So this is about social issues, coding policies, guidelines, etc -
    not about technical issues.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 9, 2004
    #4
  5. Paul Prescod

    Paul Prescod Guest

    Re: Allowing non-ASCII identifiers

    Martin v. Löwis wrote:

    > Paul Prescod wrote:
    >
    >> I wonder if the proposal would be more palatable if it were restricted
    >> to 8-bit encodings (what we used to call "code pages"). This is at
    >> least a first step in the right direction that would help westerners
    >> and could be made to work even if Python were compiled without Unicode
    >> support. (it is still possible to compile Python without Unicode isn't
    >> it?)

    >
    >
    > I doubt that it would matter much to those currently opposed; I know
    > that *I* would be opposed to such a strategy: Allowing arbitrary source
    > code encoding is no technical challenge whatsoever, and restricting
    > it to single-byte encodings is an arbitrary restriction.


    You are right. Re-reading Guido's complaint I understand what you mean.
    But I have heard the argument in the past that Unicode source files
    would break introspection tools. If that isn't a concern this time
    around then disregard my suggestion.

    Paul Prescod
     
    Paul Prescod, Feb 9, 2004
    #5
  6. John Roth

    John Roth Guest

    Re: Allowing non-ASCII identifiers

    "Paul Prescod" <> wrote in message
    news:...
    Martin v. Löwis wrote:

    > Paul Prescod wrote:
    >
    >> I wonder if the proposal would be more palatable if it were restricted
    >> to 8-bit encodings (what we used to call "code pages"). This is at
    >> least a first step in the right direction that would help westerners
    >> and could be made to work even if Python were compiled without Unicode
    >> support. (it is still possible to compile Python without Unicode isn't
    >> it?)

    >
    >
    > I doubt that it would matter much to those currently opposed; I know
    > that *I* would be opposed to such a strategy: Allowing arbitrary source
    > code encoding is no technical challenge whatsoever, and restricting
    > it to single-byte encodings is an arbitrary restriction.


    You are right. Re-reading Guido's complaint I understand what you mean.
    But I have heard the argument in the past that Unicode source files
    would break introspection tools. If that isn't a concern this time
    around then disregard my suggestion.

    [JR]
    I believe that unicode (actually UTF-8) source code files
    are legitimate if you declare them properly in the encoding
    line. In fact, UTF-8 is the example in the documentation.

    I'm all in favor of going to unicode all the way. I'd like to
    have the proper mathematical symbols for logical and set
    operations, as well as integer divide. They're all there in the
    unicode character set, after all; why should we have to
    settle for archaic character restrictions?

    John Roth
    [/JR]

    Paul Prescod
     
    John Roth, Feb 10, 2004
    #6
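The encoding line John mentions can be sketched as follows, assuming Python 3 (where UTF-8 is already the default, so the declaration is shown only to illustrate the mechanism). The variable name `greeting` and the file name `<demo>` are made up for the example.

```python
import io
import tokenize

# Source bytes as they would sit on disk: an encoding declaration on the
# first line, then a string literal containing a non-ASCII character.
source = "# -*- coding: utf-8 -*-\ngreeting = 'h\u00e9llo'\n".encode("utf-8")

# tokenize.detect_encoding reads the declaration from the first lines.
encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)

# Compiling from bytes lets Python apply the declared encoding itself.
namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)

# ascii() keeps the output printable on any terminal.
print(encoding, ascii(namespace["greeting"]))
```

Note that this only covers string literals and comments; whether identifiers may be non-ASCII is the separate question this thread is about.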
  7. AdSR

    AdSR Guest

    Re: Allowing non-ASCII identifiers

    "John Roth" <> wrote in message news:<>...
    > I'm all in favor of going to unicode all the way. I'd like to
    > have the proper mathematical symbols for logical and set
    > operations, as well as integer divide. They're all there in the
    > unicode character set, after all; why should we have to
    > settle for archaic character restrictions?


    Java allows for Unicode identifiers and I'm yet to see a single source
    file that uses anything but ASCII. Actually, so far I have only seen
    non-ASCII in Polish Logo many years ago, and that was only for
    educational purposes.

    As a non-native English speaker, coming from Polish and Portuguese
    background, I could argue in favor of non-ASCII identifiers, but I'm
    against them. Do we really need those? Even if program output is in
    Polish, all my code is "identified" and commented in English, which I
    think of as a good habit. (With the exception of HTML, where comments
    are closely related to content.)

    I don't have any _really_ solid reasons against Unicode identifiers,
    except for simplicity. It's just the way I feel about programming.

    On a side note, one place where I think non-ASCII really should be
    avoided are domain names, something that is being much debated
    recently.

    AdSR
     
    AdSR, Feb 10, 2004
    #7
  8. Re: Allowing non-ASCII identifiers

    (AdSR) writes:

    > On a side note, one place where I think non-ASCII really should be
    > avoided are domain names, something that is being much debated
    > recently.


    And something Python supports already :)

    Cheers,
    mwh

    --
    Windows XP: Big cow. Stands there, not especially malevolent
    but constantly crapping on your carpet. Eventually you have to
    open a window to let the crap out or you die.
    -- Jim's pedigree of operating systems, asr
     
    Michael Hudson, Feb 10, 2004
    #8
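The support Michael refers to is presumably the built-in "idna" codec. A minimal sketch, written for Python 3; the host name below is invented for illustration.

```python
# "bücher.example" is a made-up host name used only for illustration.
name = "b\u00fccher.example"

# The "idna" codec converts each non-ASCII label to its ASCII-compatible
# ("xn--") form, which is what actually goes over the wire in DNS.
wire = name.encode("idna")
print(wire)

# Decoding restores the original Unicode name.
round_trip = wire.decode("idna")
```

The ASCII form is what resolvers see, so the non-ASCII spelling is purely a presentation layer on top of it.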
  9. Re: Allowing non-ASCII identifiers

    John Roth wrote:
    > ...
    > I believe that unicode (actually UTF-8) source code files
    > are legitimate if you declare them properly in the encoding
    > line. In fact, UTF-8 is the example in the documentation.
    >
    > I'm all in favor of going to unicode all the way. I'd like to
    > have the proper mathematical symbols for logical and set
    > operations, as well as integer divide. They're all there in the
    > unicode character set, after all; why should we have to
    > settle for archaic character restrictions?

    Because some of us use archaic systems and/or fonts which are
    incapable of displaying such symbols. Never mind whether we
    can read them.

    Also, we would have to solve the issue of multiple representations
    for the same identifier (normalized identifiers)? There are four
    equivalent representations:

    (u'\N{Latin small letter e with acute}l'
    u'\N{Latin small letter e with grave}ve')

    (u'\N{Latin small letter e with acute}l'
    u'e\N{Combining grave accent}ve')

    (u'e\N{Combining acute accent}l'
    u'\N{Latin small letter e with grave}ve')

    (u'e\N{Combining acute accent}l'
    u'e\N{Combining grave accent}ve')

    Unicode says we should treat these four identically. Further,
    they each have a distinct hash code, so a dictionary will not
    necessarily even try to compare them to find them equal.


    --
    -Scott David Daniels
     
    Scott David Daniels, Feb 10, 2004
    #9
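The four spellings Scott lists can be checked directly. This is a small sketch in Python 3 syntax (plain `str` instead of `u'...'`); the constant names are only for readability.

```python
import unicodedata

E_ACUTE = "\N{LATIN SMALL LETTER E WITH ACUTE}"
E_GRAVE = "\N{LATIN SMALL LETTER E WITH GRAVE}"
ACUTE = "\N{COMBINING ACUTE ACCENT}"
GRAVE = "\N{COMBINING GRAVE ACCENT}"

# The four equivalent spellings from the post, in the same order.
spellings = [
    E_ACUTE + "l" + E_GRAVE + "ve",
    E_ACUTE + "l" + "e" + GRAVE + "ve",
    "e" + ACUTE + "l" + E_GRAVE + "ve",
    "e" + ACUTE + "l" + "e" + GRAVE + "ve",
]

# As raw strings they are four different values (and four different dict keys).
distinct_raw = set(spellings)

# After normalizing to one form (NFC here), they collapse to a single value.
distinct_nfc = {unicodedata.normalize("NFC", s) for s in spellings}

print(len(distinct_raw), len(distinct_nfc))  # prints: 4 1
```

So a dictionary keyed on the raw strings would indeed hold four separate entries unless names are normalized before lookup.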
  10. Re: Allowing non-ASCII identifiers

    Paul Prescod wrote:

    > You are right. Re-reading Guido's complaint I understand what you mean.
    > But I have heard the argument in the past that Unicode source files
    > would break introspection tools. If that isn't a concern this time
    > around then disregard my suggestion.


    That might be a problem, indeed. OTOH, those tools likely also
    break if you use non-ASCII byte strings for identifiers.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 10, 2004
    #10
  11. Dietrich Epp

    Dietrich Epp Guest

    Re: Allowing non-ASCII identifiers

    On Feb 10, 2004, at 8:59 AM, Scott David Daniels wrote:

    > Also, we would have to solve the issue of multiple representations
    > for the same identifier (normalized identifiers)? There are four
    > equivalent representations:
    >
    > (u'\N{Latin small letter e with acute}l'
    > u'\N{Latin small letter e with grave}ve')
    >
    > (u'\N{Latin small letter e with acute}l'
    > u'e\N{Combining grave accent}ve')
    >
    > (u'e\N{Combining acute accent}l'
    > u'\N{Latin small letter e with grave}ve')
    >
    > (u'e\N{Combining acute accent}l'
    > u'e\N{Combining grave accent}ve')
    >
    > Unicode says we should treat these four identically. Further,
    > they each have a distinct hash code, so a dictionary will not
    > necessarily even try to compare them to find them equal.


    You could require that all identifiers be the canonically decomposed
    Unicode representations encoded into UTF-8. This would mean that no
    matter which string is chosen from the above, the result is always the
    same sequence of characters. This is how many filesystems use unicode,
    i.e., Mac HFS+ works this way (but filesystems usually also require a
    specific version of Unicode for backwards compatibility).

    I personally think that Unicode identifiers would be catastrophic.
    With Unicode on the web, if you can't represent some characters, you
    can't read the web page. With programming, it could mean that you are
    unable to use a particular module, altering the functionality for
    people who can't enter certain codes. There is also the issue of which
    characters to allow, because some characters look like numbers. Is
    unicode 'IV' a number or an identifier? What about a circled 4? What
    about unicode line breaks and paragraph breaks? What about opening and
    closing quote marks? What about right-to-left characters? What about
    ligatures? Non-breaking spaces? Function application?

    I think the assumption some people have is that Unicode will only ever
    be used for things that are like the roman alphabet: adding diacritical
    marks, etc. It sounds like the most worthless extension ever, and the
    only language I think of when I think of special characters is
    Intercal.
     
    Dietrich Epp, Feb 11, 2004
    #11
  12. Re: Allowing non-ASCII identifiers

    Dietrich Epp wrote:

    > You could require that all identifiers be the canonically decomposed
    > Unicode representations encoded into UTF-8. This would mean that no
    > matter which string is chosen from the above, the result is always the
    > same sequence of characters. This is how many filesystems use unicode,
    > i.e., Mac HFS+ works this way (but filesystems usually also require a
    > specific version of Unicode for backwards compatibility).

    There are several "Normal forms" for Unicode letters. You'd need to
    choose one.

    > I personally think that Unicode identifiers would be catastrophic.....

    (lotsa examples, some good, some not-so-good elided)
    I'm reluctant to endorse it because I _know_ I'll see "Why doesn't my
    program work?" accompanied by characters I'm not used to distinguishing.

    > I think the assumption some people have is that Unicode will only ever
    > be used for things that are like the roman alphabet: adding diacritical
    > marks, etc. It sounds like the most worthless extension ever, and the
    > only language I think of when I think of special characters is Intercal.

    And this is why I had to comment. You obviously never dealt with APL.
    I actually used it without an APL type ball, which was painful in the
    extreme. When I give language summaries, my quote for APL is,
    "APL is the only language where you regularly see one programmer walk
    into another's office (well, cube now, but in the day....) and say,
    'I bet you cannot guess what this one-line program does.'"

    --
    -Scott David Daniels
     
    Scott David Daniels, Feb 12, 2004
    #12
  13. Re: Allowing non-ASCII identifiers

    Scott David Daniels wrote:
    > Because some of us use archaic systems and/or fonts which are
    > incapable of displaying such symbols. Never mind whether we
    > can read them.


    Right. However, policy whether to use non-ASCII identifiers
    because of such issues should be with the source code authors,
    not with the language implementation. Being able to use non-ASCII
    identifiers does not mean you *have* to; not being able means
    you *cannot*.

    > Also, we would have to solve the issue of multiple representations
    > for the same identifier (normalized identifiers)?


    I would use NFC, because it has the best chances of being displayed
    properly even on terminals that don't do combining characters.

    For the language itself, the specific choice of normalization form
    is irrelevant - any form would do (but I agree that normalization
    should happen).

    > Unicode says we should treat these four identically. Further,
    > they each have a distinct hash code, so a dictionary will not
    > necessarily even try to compare them to find them equal.


    If identifiers are Unicode-normalized, this is not an issue -
    all copies of the normal form will hash identically.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 12, 2004
    #13
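A rough sketch of Martin's two points, assuming Python 3: a table keyed on the NFC form finds a name however it was typed, and the NFC form of this particular word contains no combining characters. The helper `canonical` and the table contents are hypothetical.

```python
import unicodedata

def canonical(name):
    """Return the NFC form of a name (the form suggested in the post)."""
    return unicodedata.normalize("NFC", name)

composed = "\u00e9l\u00e8ve"        # precomposed letters
decomposed = "e\u0301le\u0300ve"    # base letters plus combining accents

# A symbol table keyed on the normalized name.
table = {canonical(composed): "pupil"}

# Lookup succeeds even when the name arrives in decomposed form.
found = table.get(canonical(decomposed))

# For this word, NFC leaves no combining characters for a terminal to mishandle.
has_combining = any(unicodedata.combining(ch) for ch in canonical(decomposed))

# NFD would work equally well as a key, but yields a longer string.
nfd_length = len(unicodedata.normalize("NFD", composed))
```

For what it's worth, PEP 3131 later specified NFKC rather than NFC for Python 3 identifiers, which fits the remark that the specific form matters less than normalizing at all.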
  14. Re: Allowing non-ASCII identifiers

    Dietrich Epp wrote:
    > You could require that all identifiers be the canonically decomposed
    > Unicode representations encoded into UTF-8.


    That would be unpythonic: non-ASCII identifiers should be represented
    as Unicode objects, not as UTF-8 byte strings.

    > I personally think that Unicode identifiers would be catastrophic. With
    > Unicode on the web, if you can't represent some characters, you can't
    > read the web page. With programming, it could mean that you are unable
    > to use a particular module, altering the functionality for people who
    > can't enter certain codes.


    It is the case that some people would have problems invoking certain
    functions. Why would that be a catastrophe? Authors of Python software
    should make a choice whether they prefer readability of the source code,
    or accessibility to everyone. Depending on the situation, one choice
    or the other may be appropriate. Python should not police that decision
    for the developer.

    > There is also the issue of which characters
    > to allow, because some characters look like numbers.


    Yes. I would go with a list similar to the Java one, except with a
    few obvious restrictions (e.g. disallow currency symbols: Python
    does not allow the DOLLAR SIGN in identifiers, whereas Java does).

    > Is unicode 'IV' a number or an identifier?


    It is certainly *not* a number. I propose to change the syntax of
    identifiers, not of numbers. Whether this specific character Ⅳ is
    an identifier or should give a syntax error is a choice one needs
    to make, certainly. What would be your choice?

    > What about a circled 4? What about unicode
    > line breaks and paragraph breaks? What about opening and closing quote
    > marks? What about right-to-left characters? What about ligatures?
    > Non-breaking spaces? Function application?


    The Unicode consortium gives guidance on all these questions. As I said,
    I would closely follow the Java principles, which were derived from
    the Unicode consortium guidance. Here is my proposal:

    Legal non-ASCII identifiers are what legal non-ASCII
    identifiers are in Java, except that Python may use
    a different version of the Unicode character database.
    Python would share the property that future versions
    allow more characters in identifiers than older versions.

    If you are too lazy to look up the Java definition,
    here is a rough overview:
    An identifier is "JavaLetter JavaLetterOrDigit*"

    JavaLetter is a character of the classes Lu, Ll,
    Lt, Lm, or Lo, or a currency symbol (for Python:
    excluding $), or a connecting punctuation character
    (which is unfortunately underspecified - will
    research the implementation).

    JavaLetterOrDigit is a JavaLetter, or a digit,
    a numeric letter, a combining mark, a non-spacing
    mark, or an ignorable control character.

    I believe this specification allows you to answer your questions
    yourself.

    > I think the assumption some people have is that Unicode will only ever
    > be used for things that are like the roman alphabet: adding diacritical
    > marks, etc. It sounds like the most worthless extension ever, and the
    > only language I think of when I think of special characters is Intercal.


    That is certainly not my assumption. Instead, I expect that this
    extension will primarily be used by developers whose native language
    is Russian, Japanese, Chinese, Korean, or Arabic. At least, I've heard
    developers from these cultures ask for the specific feature in the
    past (I've also heard French and German people ask for the feature,
    but that fits with your expectation).

    Regards,
    Martin
     
    Martin v. Löwis, Feb 12, 2004
    #14
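The rough overview in the post can be turned into a small checker. This is only a sketch of that overview, not the actual Java specification and not what any Python version implements. The mapping of the post's wording to Unicode general categories is an assumption: "digit" as Nd, "numeric letter" as Nl, "combining mark" as Mc, "non-spacing mark" as Mn, "ignorable control character" as Cf, "connecting punctuation" as Pc. All function names are hypothetical.

```python
import unicodedata

# Categories allowed for the first character, per the post's "JavaLetter"
# (letters, currency symbols, connecting punctuation).
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Sc", "Pc"}

# Categories additionally allowed after the first character.
CONTINUE = START | {"Nd", "Nl", "Mc", "Mn", "Cf"}

def is_start(ch):
    # The post excludes the dollar sign for Python.
    return ch != "$" and unicodedata.category(ch) in START

def is_continue(ch):
    return ch != "$" and unicodedata.category(ch) in CONTINUE

def looks_like_identifier(name):
    """Apply the 'JavaLetter JavaLetterOrDigit*' rule from the post."""
    if not name:
        return False
    return is_start(name[0]) and all(is_continue(ch) for ch in name[1:])

# Some of the cases raised earlier in the thread:
roman_four = "\u2163"      # ROMAN NUMERAL FOUR, category Nl
circled_four = "\u2463"    # CIRCLED DIGIT FOUR, category No
print(looks_like_identifier("r\u00e9sum\u00e9"))   # letters only
print(looks_like_identifier(roman_four))            # not allowed as first char
print(looks_like_identifier("x" + roman_four))      # allowed after the first
print(looks_like_identifier("x" + circled_four))    # not allowed anywhere
```

Under these assumptions a lone Roman numeral four is rejected as an identifier but accepted after a leading letter, a circled four is rejected everywhere, and a non-breaking space (category Zs) can never appear inside a name.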
  15. Neil Hodgson

    Neil Hodgson Guest

    Re: Allowing non-ASCII identifiers

    Scott David Daniels:

    > Because some of us use archaic systems and/or fonts which are
    > incapable of displaying such symbols. Never mind whether we
    > can read them.


    For such circumstances, I would like to see hex escape sequences allowed
    in identifiers as in Java. That means that there is a representation of last
    resort that can be used by those using less capable tools. A simple filter
    could translate to and from this format for the extremely rare occasions it
    would be needed.

    Neil
     
    Neil Hodgson, Feb 12, 2004
    #15
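The "simple filter" Neil describes might look roughly like this. It is a naive sketch: the `\uXXXX` notation is borrowed from Java, Python itself does not accept such escapes inside identifiers, and both function names are hypothetical.

```python
import re

def to_ascii_escapes(text):
    """Replace every non-ASCII character with a \\uXXXX or \\UXXXXXXXX escape."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)
        elif code <= 0xFFFF:
            out.append("\\u%04x" % code)
        else:
            out.append("\\U%08x" % code)
    return "".join(out)

# Matches the two escape shapes produced above.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})")

def from_ascii_escapes(text):
    """Turn the escapes back into the characters they stand for."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)

line = "import resum\u00e9"
escaped = to_ascii_escapes(line)
print(escaped)
restored = from_ascii_escapes(escaped)
```

One limitation of so simple a filter: escapes that were already written out inside string literals are indistinguishable from the ones it produced, so the reverse direction would rewrite those as well.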
  16. Joe Mason

    Joe Mason Guest

    Re: Allowing non-ASCII identifiers

    In article <c0f8h6$8e4$01$-online.com>, Martin v. Löwis wrote:
    > It is the case that some people would have problems invoking certain
    > functions. Why would that be a catastrophe?


    Oh, it wouldn't be. Not being catastrophic doesn't make it good.

    > Authors of Python software should make a choice whether they prefer
    > readability of the source code, or accessibility to everyone.


    Yeah, they should, but they won't. They'll go nuts with the cool
    features and not stop to think about the consequences. Those of us
    stuck cleaning up after them will then be hindered by the cool features
    that don't work. History has shown us this.

    If non-ASCII characters are allowed, they'll be used frivolously.
    Somebody will put "et tu, Bruté" in a comment, or start their career
    planning package with "import resumé", and these otherwise working
    programs would break for people without Unicode support.

    > Python should not police that decision for the developer.


    Why not? It polices everything else. Isn't Python still the "only one
    way to do it" language?

    If you were suggesting this for Perl or Ruby, I'd be all in favour (in fact,
    it'd be especially appropriate for Ruby). But in Python it's perfectly
    appropriate to restrict something that many people would find useful in
    favour of simplicity and consistency.

    Joe
     
    Joe Mason, Feb 12, 2004
    #16
  17. Paul Prescod

    Paul Prescod Guest

    Re: Allowing non-ASCII identifiers

    Dietrich Epp wrote:

    >
    > I personally think that Unicode identifiers would be catastrophic.


    This is an overstatement. One of the great things about Python is that
    it borrows from other languages. VB and C# for sure, and I think Java,
    allow non-ASCII identifiers and there was no catastrophe. VB has its
    problems but Unicode identifiers are not a big one.

    I am +0 on this proposal because I really doubt it will cause me big
    problems and at least some foreign language speakers claim it will make
    their lives much easier. If they post to c.l.py asking for help with
    code I can't read I'll tell them I can't read it. If they write
    extension modules I can't use I'll just ask them to put an ASCII API
    alongside their Unicode one (language is likely to be a bigger
    readability problem than encoding anyhow)

    Paul Prescod
     
    Paul Prescod, Feb 12, 2004
    #17
  18. Re: Allowing non-ASCII identifiers

    Joe Mason wrote:
    > If non-ASCII characters are allowed, they'll be used frivolously.
    > Somebody will put "et tu, Bruté" in a comment


    People can (and do) already put their natural language into comments;
    whether or not non-ASCII characters are allowed in identifiers is
    irrelevant for that usage.

    Also, people don't need "Unicode support" to read those comments.
    They just need an editor that can display the character set that
    the people wrote their comments in.

    Assuming you speak the language in which the comments are written,
    you very likely have a text editor which can display them. Or you
    use IDLE.

    >>Python should not police that decision for the developer.

    >
    >
    > Why not? It polices everything else. Isn't Python still the "only one
    > way to do it" language?


    And that wouldn't change: There would be only a single way to do

    import resumé

    Currently, there is no way, which is less than "only one way".

    > If you were suggesting this for Perl or Ruby, I'd be all in favour (in fact,
    > it'd be especially appropriate for Ruby). But in Python it's perfectly
    > appropriate to restrict something that many people would find useful in
    > favour of simplicity and consistency.


    And indeed, using non-ASCII characters in identifiers is simple and
    consistent.

    Regards,
    Martin
     
    Martin v. Löwis, Feb 13, 2004
    #18