PEP 263 status check

Discussion in 'Python' started by John Roth, Aug 5, 2004.

  1. John Roth

    John Roth Guest

    PEP 263 is marked finished in the PEP index, however
    I haven't seen the specified Phase 2 in the list of changes
    for 2.4 which is when I expected it.

    Did phase 2 get cancelled, or is it just not in the
    changes document?

    John Roth
     
    John Roth, Aug 5, 2004
    #1

  2. John Roth wrote:
    > PEP 263 is marked finished in the PEP index, however
    > I haven't seen the specified Phase 2 in the list of changes
    > for 2.4 which is when I expected it.
    >
    > Did phase 2 get cancelled, or is it just not in the
    > changes document?


    Neither, nor. Although this hasn't been discussed widely,
    I personally believe it is too early yet to make lack of
    encoding declarations a syntax error. I'd like to
    reconsider the issue with Python 2.5.

    OTOH, not many people have commented either way: would you
    be outraged if a script that has given you a warning about
    missing encoding declarations for some time fails with a
    strict SyntaxError in 2.4? Has everybody already corrected
    their scripts?

    Regards,
    Martin
     
    "Martin v. Löwis", Aug 5, 2004
    #2

  3. "Martin v. Löwis" wrote:

    > I personally believe it is too early yet to make lack of
    > encoding declarations a syntax error. I'd like to


    +1

    Making this an all-out failure is pretty brutal, IMHO. You could change the
    warning message to be more stringent about it soon becoming an error. But if
    someone upgrades to 2.4 because of other benefits, and some large third-party
    code they rely on (and which is otherwise perfectly fine with 2.4) fails
    catastrophically because of these warnings becoming errors, I suspect they
    will be very unhappy.

    I see the need to nudge people in the right direction, but there's no need to
    do it with a 10,000-volt stick :)

    Best,

    f
     
    Fernando Perez, Aug 5, 2004
    #3
  4. John Roth

    John Roth Guest

    "Martin v. Löwis" <> wrote in message
    news:...
    > John Roth wrote:
    > > PEP 263 is marked finished in the PEP index, however
    > > I haven't seen the specified Phase 2 in the list of changes
    > > for 2.4 which is when I expected it.
    > >
    > > Did phase 2 get cancelled, or is it just not in the
    > > changes document?

    >
    > Neither, nor. Although this hasn't been discussed widely,
    > I personally believe it is too early yet to make lack of
    > encoding declarations a syntax error. I'd like to
    > reconsider the issue with Python 2.5.
    >
    > OTOH, not many people have commented either way: would you
    > be outraged if a script that has given you a warning about
    > missing encoding declarations for some time fails with a
    > strict SyntaxError in 2.4? Has everybody already corrected
    > their scripts?


    Well, I don't particularly have that problem because I don't
    have a huge number of scripts and for the ones I do it would be
    relatively simple to do a scan and update - or just run them
    with the unit tests and see if they break!

    In fact, I think that a scan and update program in the tools
    directory might be a very good idea - just walk through a
    Python library, scan and update everything that doesn't
    have a declaration.

    The issue has popped in and out of my awareness a few
    times; what brought it up this time was Hallvard's thread.

    My specific question there was how the code handles the
    combination of UTF-8 as the encoding and a non-ascii
    character in an 8-bit string literal. Is this an error? The
    PEP does not say so. If it isn't, what encoding will
    it use to translate from unicode back to an 8-bit
    encoding?

    Another project for people who care about this
    subject: tools. Of the half zillion editors, pretty printers
    and so forth out there, how many check for the encoding
    line and do the right thing with it? Which ones need to
    be updated?

    John Roth
    >
    > Regards,
    > Martin
     
    John Roth, Aug 6, 2004
    #4
  5. "John Roth" <> schrieb im Newsbeitrag
    news:...
    |
    | "Martin v. Löwis" <> wrote in message
    | news:...
    | > John Roth wrote:
    | > > PEP 263 is marked finished in the PEP index, however
    | > > I haven't seen the specified Phase 2 in the list of changes
    | > > for 2.4 which is when I expected it.
    | > >
    | > > Did phase 2 get cancelled, or is it just not in the
    | > > changes document?
    | >
    | > Neither, nor. Although this hasn't been discussed widely,
    | > I personally believe it is too early yet to make lack of
    | > encoding declarations a syntax error. I'd like to
    | > reconsider the issue with Python 2.5.
    | >
    | > OTOH, not many people have commented either way: would you
    | > be outraged if a script that has given you a warning about
    | > missing encoding declarations for some time fails with a
    | > strict SyntaxError in 2.4? Has everybody already corrected
    | > their scripts?
    |
    | Well, I don't particularly have that problem because I don't
    | have a huge number of scripts and for the ones I do it would be
    | relatively simple to do a scan and update - or just run them
    | with the unit tests and see if they break!

    Here's another thought: the company I work for uses (embedded) Python as a
    scripting language for their report writer (among other things). Users can
    add little scripts to their document templates which are used for printing
    database data. This means there are literally hundreds of little Python
    scripts embedded within the document templates, which themselves are stored
    in whatever database is used as the backend. In such a case, "scan and
    update" when upgrading gets a little more complicated ;)

    |
    | In fact, I think that a scan and update program in the tools
    | directory might be a very good idea - just walk through a
    | Python library, scan and update everything that doesn't
    | have a declaration.
    |
    | The issue has popped in and out of my awareness a few
    | times; what brought it up this time was Hallvard's thread.
    |
    | My specific question there was how the code handles the
    | combination of UTF-8 as the encoding and a non-ascii
    | character in an 8-bit string literal. Is this an error? The
    | PEP does not say so. If it isn't, what encoding will
    | it use to translate from unicode back to an 8-bit
    | encoding?

    Isn't this covered by:

    "Embedding of differently encoded data is not allowed and will
    result in a decoding error during compilation of the Python
    source code."

    --
    Vincent Wehren


    |
    | Another project for people who care about this
    | subject: tools. Of the half zillion editors, pretty printers
    | and so forth out there, how many check for the encoding
    | line and do the right thing with it? Which ones need to
    | be updated?
    |
    | John Roth
    | >
    | > Regards,
    | > Martin
    |
    |
     
    Vincent Wehren, Aug 6, 2004
    #5
  6. John Roth wrote:
    > In fact, I think that a scan and update program in the tools
    > directory might be a very good idea - just walk through a
    > Python library, scan and update everything that doesn't
    > have a declaration.


    Good idea. I'll see whether I can write something before 2.4,
    but contributions are definitely welcome.
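
    A rough sketch of what such a tool might look like (only an illustration;
    it assumes the files' real encoding is already known - here guessed as
    iso-8859-1 - which is something the tool itself cannot decide):

    import os, re

    # PEP 263: the declaration must appear on line 1 or 2 and match this.
    DECLARATION = re.compile(r"coding[:=]\s*([-\w.]+)")

    def add_declaration(path, encoding="iso-8859-1"):
        f = open(path, "rb")
        lines = f.readlines()
        f.close()
        for line in lines[:2]:
            if DECLARATION.search(line):
                return False        # already declared, leave it alone
        pos = 0
        if lines and lines[0].startswith("#!"):
            pos = 1                 # keep a shebang line first
        lines.insert(pos, "# -*- coding: %s -*-\n" % encoding)
        f = open(path, "wb")
        f.writelines(lines)
        f.close()
        return True

    def walk_library(root):
        # Walk a Python library and update every module lacking a declaration.
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.endswith(".py"):
                    add_declaration(os.path.join(dirpath, name))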

    > My specific question there was how the code handles the
    > combination of UTF-8 as the encoding and a non-ascii
    > character in an 8-bit string literal. Is this an error? The
    > PEP does not say so. If it isn't, what encoding will
    > it use to translate from unicode back to an 8-bit
    > encoding?


    UTF-8 is not in any way special wrt. the PEP. Notice that
    UTF-8 is *not* Unicode - it is an encoding of Unicode, just
    like ISO-8859-1 or us-ascii (although the latter two only
    encode a subset of Unicode). Yes, the byte string literals
    will be converted back to an "8-bit encoding", but the 8-bit
    encoding will be UTF-8! IOW, byte string literals are always
    converted back to the source encoding before execution.

    > Another project for people who care about this
    > subject: tools. Of the half zillion editors, pretty printers
    > and so forth out there, how many check for the encoding
    > line and do the right thing with it? Which ones need to
    > be updated?


    I know IDLE, Eric, Komodo, and Emacs do support encoding
    declarations. I know PythonWin doesn't, although I once
    had written patches to add such support. A number of editors
    (like notepad.exe) do the right thing only if the document
    has the UTF-8 signature.

    Of course, editors don't necessarily need to actively
    support the feature as long as the declared encoding is
    the one they use, anyway. They won't display source in
    other encodings correctly, but some of them don't have
    the notion of multiple encodings, anyway.

    Regards,
    Martin
     
    "Martin v. Löwis", Aug 6, 2004
    #6
  7. Vincent Wehren wrote:
    > Here's another thought: the company I work for uses (embedded) Python as a
    > scripting language for their report writer (among other things). Users can
    > add little scripts to their document templates which are used for printing
    > database data. This means there are literally hundreds of little Python
    > scripts embedded within the document templates, which themselves are stored
    > in whatever database is used as the backend. In such a case, "scan and
    > update" when upgrading gets a little more complicated ;)


    At the same time, it might also get simpler. If the user interface
    to edit these scripts is encoding-aware, and/or the database to store
    them in is encoding-aware, an automated tool would not need to guess
    what the encoding in the source is.

    > | My specific question there was how the code handles the
    > | combination of UTF-8 as the encoding and a non-ascii
    > | character in an 8-bit string literal. Is this an error? The
    > | PEP does not say so. If it isn't, what encoding will
    > | it use to translate from unicode back to an 8-bit
    > | encoding?
    >
    > Isn't this covered by:
    >
    > "Embedding of differently encoded data is not allowed and will
    > result in a decoding error during compilation of the Python
    > source code."


    No. It is perfectly legal to have non-ASCII data in 8-bit string
    literals (aka byte string literals, aka <type 'str'>). Of course,
    these non-ASCII data also need to be encoded in UTF-8. Whether UTF-8
    is an 8-bit encoding, I don't know - it is more precisely described
    as a multibyte encoding. At execution time, the byte string literals
    then have the source encoding again, i.e. UTF-8.

    Regards,
    Martin
     
    "Martin v. Löwis", Aug 6, 2004
    #7
  8. John Roth

    John Roth Guest

    "Martin v. Löwis" <> wrote in message
    news:...
    > John Roth wrote:


    > > My specific question there was how the code handles the
    > > combination of UTF-8 as the encoding and a non-ascii
    > > character in an 8-bit string literal. Is this an error? The
    > > PEP does not say so. If it isn't, what encoding will
    > > it use to translate from unicode back to an 8-bit
    > > encoding?

    >
    > UTF-8 is not in any way special wrt. the PEP.


    That's what I thought.

    > Notice that
    > UTF-8 is *not* Unicode - it is an encoding of Unicode, just
    > like ISO-8859-1 or us-ascii (although the latter two only
    > encode a subset of Unicode).


    I disagree, but I think this is a definitional issue.

    > Yes, the byte string literals
    > will be converted back to an "8-bit encoding", but the 8-bit
    > encoding will be UTF-8! IOW, byte string literals are always
    > converted back to the source encoding before execution.


    If I understand you correctly, if I put, say, a mixture of
    Cyrillic, Hebrew, Arabic and Greek into a byte string
    literal, at run time that character string will contain the
    proper unicode at each character position?

    Or are you trying to say that the character string will
    contain the UTF-8 encoding of these characters; that
    is, if I do a subscript, I will get one character of the
    multi-byte encoding?

    The point of this is that I don't think that either behavior
    is what one would expect. It's also an open invitation
    for someone to make an unchecked mistake! I think this
    may be Hallvard's underlying issue in the other thread.

    > Regards,
    > Martin


    John Roth
     
    John Roth, Aug 6, 2004
    #8
  9. "John Roth" <> writes:

    > If I understand you correctly, if I put, say, a mixture of
    > Cyrillic, Hebrew, Arabic and Greek into a byte string
    > literal, at run time that character string will contain the
    > proper unicode at each character position?


    Uh, I seem to be making a habit of labelling things you suggest
    impossible :)

    > Or are you trying to say that the character string will
    > contain the UTF-8 encoding of these characters; that
    > is, if I do a subscript, I will get one character of the
    > multi-byte encoding?


    This is what happens, indeed.

    Cheers,
    mwh

    --
    This is the fixed point problem again; since all some implementors
    do is implement the compiler and libraries for compiler writing, the
    language becomes good at writing compilers and not much else!
    -- Brian Rogoff, comp.lang.functional
     
    Michael Hudson, Aug 6, 2004
    #9
  10. John Roth wrote:
    > Or are you trying to say that the character string will
    > contain the UTF-8 encoding of these characters; that
    > is, if I do a subscript, I will get one character of the
    > multi-byte encoding?


    Michael is almost right: this is what happens. Except that
    what you get, I wouldn't call a "character". Instead, it
    is always a single byte - even if that byte is part of
    a multi-byte character.

    Unfortunately, the things that constitute a byte string
    are also called characters in the literature.

    To be more specific: In an UTF-8 source file, doing

    print "ö" == "\xc3\xb6"
    print "ö"[0] == "\xc3"

    would print two times "True", and len("ö") is 2.
    OTOH, len(u"ö")==1.
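
    In other words, decoding the byte string with the source encoding gets
    the Unicode string back. A short sketch of the same point (Python 2.x
    semantics, with a utf-8 declaration in effect):

    # -*- coding: utf-8 -*-
    s = "ö"        # a byte string: the two UTF-8 bytes of ö
    u = u"ö"       # a Unicode string: one character
    assert len(s) == 2 and len(u) == 1
    assert s.decode("utf-8") == u
    assert u.encode("utf-8") == s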

    > The point of this is that I don't think that either behavior
    > is what one would expect. It's also an open invitation
    > for someone to make an unchecked mistake! I think this
    > may be Hallvard's underlying issue in the other thread.


    What would you expect instead? Do you think your expectation
    is implementable?

    Regards,
    Martin
     
    "Martin v. Löwis", Aug 6, 2004
    #10
  11. John Roth

    John Roth Guest

    "Martin v. Löwis" <> wrote in message
    news:...
    > John Roth wrote:
    > > Or are you trying to say that the character string will
    > > contain the UTF-8 encoding of these characters; that
    > > is, if I do a subscript, I will get one character of the
    > > multi-byte encoding?

    >
    > Michael is almost right: this is what happens. Except that
    > what you get, I wouldn't call a "character". Instead, it
    > is always a single byte - even if that byte is part of
    > a multi-byte character.
    >
    > Unfortunately, the things that constitute a byte string
    > are also called characters in the literature.
    >
    > To be more specific: In an UTF-8 source file, doing
    >
    > print "ö" == "\xc3\xb6"
    > print "ö"[0] == "\xc3"
    >
    > would print two times "True", and len("ö") is 2.
    > OTOH, len(u"ö")==1.
    >
    > > The point of this is that I don't think that either behavior
    > > is what one would expect. It's also an open invitation
    > > for someone to make an unchecked mistake! I think this
    > > may be Hallvard's underlying issue in the other thread.

    >
    > What would you expect instead? Do you think your expectation
    > is implementable?


    I'd expect that the compiler would reject anything that
    wasn't either in the 7-bit ascii subset, or else defined
    with a hex escape.

    The reason for this is simply that wanting to put characters
    outside of the 7-bit ascii subset into a byte character string
    isn't portable. It just pushes the need for a character set
    (encoding) declaration down one level of recursion.
    There's already a way of doing this: use a unicode string,
    so it's not like we need two ways of doing it.

    Now I will grant you that there is a need for representing
    the utf-8 encoding in a character string, but do we need
    to support that in the source text when it's much more
    likely that it's a programming mistake?

    As far as implementation goes, it should have been done
    at the beginning. Prior to 2.3, there was no way of writing
    a program using the utf-8 encoding (I think - I might be
    wrong on that) so there were no programs out there that
    put non-ascii subset characters into byte strings.

    Today it's one more forward migration hurdle to jump over.
    I don't think it's a particularly large one, but I don't have
    any real world data at hand.

    John Roth
    >
    > Regards,
    > Martin
     
    John Roth, Aug 6, 2004
    #11
  12. John Roth wrote:
    >>What would you expect instead? Do you think your expectation
    >>is implementable?

    >
    >
    > I'd expect that the compiler would reject anything that
    > wasn't either in the 7-bit ascii subset, or else defined
    > with a hex escape.


    Are we still talking about PEP 263 here? If the entire source
    code has to be in the 7-bit ASCII subset, then what is the point
    of encoding declarations?

    If you were suggesting that anything except Unicode literals
    should be in the 7-bit ASCII subset, then this is still
    unacceptable: Comments should also be allowed to contain non-ASCII
    characters, don't you agree?

    If you think that only Unicode literals and comments should be
    allowed to contain non-ASCII, I disagree: At some point, I'd
    like to propose support for non-ASCII in identifiers. This would
    allow people to make identifiers that represent words from their
    native language, which is helpful for people who don't speak
    English well.

    If you think that only Unicode literals, comments, and identifiers
    should be allowed non-ASCII: perhaps, but this is out of scope
    of PEP 263, which *only* introduces encoding declarations,
    and explains what they mean for all current constructs.

    > The reason for this is simply that wanting to put characters
    > outside of the 7-bit ascii subset into a byte character string
    > isn't portable.


    Define "is portable". With an encoding declaration, I can move
    the source code from one machine to another, open it in an editor,
    and have it display correctly. This was not portable without
    encoding declarations (likewise for comments); with PEP 263,
    such source code became portable.

    Also, the run-time behaviour is fully predictable (which it
    even was without PEP 263): At run-time, the string will have
    exactly the same bytes that it does in the .py file. This
    is fully portable.

    > It just pushes the need for a character set
    > (encoding) declaration down one level of recursion.


    It depends on the program. E.g. if the program was to generate
    HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
    then the resulting program is absolutely, 100% portable.
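
    For instance (a sketch of my own, not anyone's production code): if the
    declared source encoding and the emitted charset agree, the byte literal
    is correct wherever the script runs:

    # -*- coding: iso-8859-1 -*-
    body = "<p>Grüße aus München</p>"   # bytes are iso-8859-1, as declared
    page = ('<html><head><meta http-equiv="Content-Type" '
            'content="text/html; charset=iso-8859-1"></head>'
            '<body>%s</body></html>') % body
    open("hello.html", "wb").write(page)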

    For messages directly output to a terminal, portability
    might not be important.

    > There's already a way of doing this: use a unicode string,
    > so it's not like we need two ways of doing it.


    Using a Unicode string might not work, because a library might
    crash when confronted with a Unicode string. You are proposing
    to break existing applications for no good reason, and with
    no simple fix.

    > Now I will grant you that there is a need for representing
    > the utf-8 encoding in a character string, but do we need
    > to support that in the source text when it's much more
    > likely that it's a programming mistake?


    But it isn't! People do put KOI-8R into source code, into
    string literals, and it works perfectly fine for them. There
    is no reason to arbitrarily break their code.

    > As far as implementation goes, it should have been done
    > at the beginning. Prior to 2.3, there was no way of writing
    > a program using the utf-8 encoding (I think - I might be
    > wrong on that)


    You are wrong. You were always able to put UTF-8 into byte
    strings, even at a time when UTF-8 was not yet an RFC
    (say, in Python 1.1).

    > so there were no programs out there that
    > put non-ascii subset characters into byte strings.


    That is just not true. If it were true, there would be no
    need to introduce a grace period in the PEP. However,
    *many* scripts in the world use non-ASCII in string literals;
    it was always possible (although the documentation was
    wishy-washy on what it actually meant).

    > Today it's one more forward migration hurdle to jump over.
    > I don't think it's a particularly large one, but I don't have
    > any real world data at hand.


    Trust me: the outcry for banning non-ASCII from string literals
    would be, by far, louder than the one for a proposed syntax
    on decorators. That would break many production systems, CGI
    scripts would suddenly stop working, GUIs would crash, etc.

    Regards,
    Martin
     
    "Martin v. Löwis", Aug 6, 2004
    #12
  13. John Roth

    John Roth Guest

    "Martin v. Löwis" <> wrote in message
    news:...
    > John Roth wrote:
    > >>What would you expect instead? Do you think your expectation
    > >>is implementable?

    > >
    > >
    > > I'd expect that the compiler would reject anything that
    > > wasn't either in the 7-bit ascii subset, or else defined
    > > with a hex escape.

    >
    > Are we still talking about PEP 263 here? If the entire source
    > code has to be in the 7-bit ASCII subset, then what is the point
    > of encoding declarations?


    Martin, I think you misinterpreted what I said at the
    beginning. I'm only, and I need to repeat this, ONLY
    dealing with the case where the encoding declaration
    specifically says that the script is in UTF-8. No other
    case.

    I'm going to deal with your response point by point,
    but I don't think most of this is really relevant. Your
    response only makes sense if you missed the point that
    I was talking about scripts that explicitly declared their
    encoding to be UTF-8, and no other scripts in no
    other circumstances.

    I didn't mean the entire source was in 7-bit ascii. What
    I meant was that if the encoding was utf-8 then the source
    for 8-bit string literals must be in 7-bit ascii. Nothing more.

    > If you were suggesting that anything except Unicode literals
    > should be in the 7-bit ASCII subset, then this is still
    > unacceptable: Comments should also be allowed to contain non-ASCII
    > characters, don't you agree?


    Of course.

    > If you think that only Unicode literals and comments should be
    > allowed to contain non-ASCII, I disagree: At some point, I'd
    > like to propose support for non-ASCII in identifiers. This would
    > allow people to make identifiers that represent words from their
    > native language, which is helpful for people who don't speak
    > English well.


    Likewise. I never thought otherwise; in fact I'd like to expand
    the available operators to include the set operators as well as
    the logical operators and the "real" division operator (the one
    you learned in grade school - the dash with a dot above and
    below the line.)

    | > If you think that only Unicode literals, comments, and identifiers
    > should be allowed non-ASCII: perhaps, but this is out of scope
    > of PEP 263, which *only* introduces encoding declarations,
    > and explains what they mean for all current constructs.
    >
    > > The reason for this is simply that wanting to put characters
    > > outside of the 7-bit ascii subset into a byte character string
    > > isn't portable.

    >
    > Define "is portable". With an encoding declaration, I can move
    > the source code from one machine to another, open it in an editor,
    > and have it display correctly. This was not portable without
    > encoding declarations (likewise for comments); with PEP 263,
    > such source code became portable.


    > Also, the run-time behaviour is fully predictable (which it
    > even was without PEP 263): At run-time, the string will have
    > exactly the same bytes that it does in the .py file. This
    > is fully portable.


    It's predictable, but as far as I'm concerned, that's
    not only useless behavior, it's counterproductive
    behavior. I find it difficult to imagine any case
    where the benefit of having normal character
    literals accidentally contain utf-8 multi-byte
    characters outweighs the pain of having it happen
    accidentally, and then figuring out why your program
    is giving you weird behavior.

    I would grant that there are cases where you
    might want this behavior. I am pretty sure they
    are in the distinct minority.


    > > It just pushes the need for a character set
    > > (encoding) declaration down one level of recursion.

    >
    > It depends on the program. E.g. if the program was to generate
    > HTML files with an explicit HTTP-Equiv charset=iso-8859-1,
    > then the resulting program is absolutely, 100% portable.


    It's portable, but that's not the normal case. See above.

    > For messages directly output to a terminal, portability
    > might not be important.


    Portability is less of an issue for me than the likelihood
    of making a mistake in coding a literal and then having
    to debug unexpected behavior when one byte no longer
    equals one character.


    > > There's already a way of doing this: use a unicode string,
    > > so it's not like we need two ways of doing it.

    >
    > Using a Unicode string might not work, because a library might
    > crash when confronted with a Unicode string. You are proposing
    > to break existing applications for no good reason, and with
    > no simple fix.


    There's no reason why you have to have a utf-8
    encoding declaration. If you want your source to
    be utf-8, you need to accept the consequences.
    I fully expect Python to support the usual mixture
    of encodings until 3.0 at least. At that point, everything
    gets to be rewritten anyway.

    > > Now I will grant you that there is a need for representing
    > > the utf-8 encoding in a character string, but do we need
    > > to support that in the source text when it's much more
    > > likely that it's a programming mistake?

    >
    > But it isn't! People do put KOI-8R into source code, into
    > string literals, and it works perfectly fine for them. There
    > is no reason to arbitrarily break their code.
    >
    > > As far as implementation goes, it should have been done
    > > at the beginning. Prior to 2.3, there was no way of writing
    > > a program using the utf-8 encoding (I think - I might be
    > > wrong on that)

    >
    > You are wrong. You were always able to put UTF-8 into byte
    > strings, even at a time when UTF-8 was not yet an RFC
    > (say, in Python 1.1).


    Were you able to write your entire program in UTF-8?
    I think not.

    >
    > > so there were no programs out there that
    > > put non-ascii subset characters into byte strings.

    >
    > That is just not true. If it were true, there would be no
    > need to introduce a grace period in the PEP. However,
    > *many* scripts in the world use non-ASCII in string literals;
    > it was always possible (although the documentation was
    > wishy-washy on what it actually meant).
    >
    > > Today it's one more forward migration hurdle to jump over.
    > > I don't think it's a particularly large one, but I don't have
    > > any real world data at hand.

    >
    > Trust me: the outcry for banning non-ASCII from string literals
    > would be, by far, louder than the one for a proposed syntax
    > on decorators. That would break many production systems, CGI
    > scripts would suddenly stop working, GUIs would crash, etc.


    ..



    >
    > Regards,
    > Martin
     
    John Roth, Aug 6, 2004
    #13
  14. An addition to Martin's reply:

    John Roth wrote:
    >"Martin v. Löwis" <> wrote in message
    >news:...
    >>John Roth wrote:
    >>
    >> To be more specific: In an UTF-8 source file, doing
    >>
    >> print "ö" == "\xc3\xb6"
    >> print "ö"[0] == "\xc3"
    >>
    >> would print two times "True", and len("ö") is 2.
    >> OTOH, len(u"ö")==1.
    >>
    >>> The point of this is that I don't think that either behavior
    >>> is what one would expect. It's also an open invitation
    >>> for someone to make an unchecked mistake! I think this
    >>> may be Hallvard's underlying issue in the other thread.

    >>
    >> What would you expect instead? Do you think your expectation
    >> is implementable?

    >
    > I'd expect that the compiler would reject anything that
    > wasn't either in the 7-bit ascii subset, or else defined
    > with a hex escape.


    Then you should also expect a lot of people to move to
    another language - one whose designers live in the real
    world instead of your Utopian Unicode world.

    > The reason for this is simply that wanting to put characters
    > outside of the 7-bit ascii subset into a byte character string
    > isn't portable.


    Unicode isn't portable either.
    Try to output a Unicode string to a device (e.g. your terminal)
    whose character encoding is not known to the program.
    The program will fail, or just output the raw utf-8 string or
    something, or just guess some character set the program's author
    is fond of.

    For that matter, tell me why my programs should spend any time
    on converting between UTF-8 and the character set the
    application actually works with just because you are fond of
    Unicode. That might be a lot more time than just the time spent
    parsing the program. Or tell me why I should spell quite normal
    text strings with hex escaping or something, if that's what you
    mean.

    And tell me why I shouldn't be allowed to work easily with raw
    UTF-8 strings, if I do use coding:utf-8.

    --
    Hallvard
     
    Hallvard B Furuseth, Aug 6, 2004
    #14
  15. John Roth

    John Roth Guest

    "Hallvard B Furuseth" <> wrote in message
    news:...
    > An addition to Martin's reply:
    >
    > John Roth wrote:
    > >"Martin v. Löwis" <> wrote in message
    > >news:...
    > >>John Roth wrote:
    > >>
    > >> To be more specific: In an UTF-8 source file, doing
    > >>
    > >> print "ö" == "\xc3\xb6"
    > >> print "ö"[0] == "\xc3"
    > >>
    > >> would print two times "True", and len("ö") is 2.
    > >> OTOH, len(u"ö")==1.
    > >>
    > >>> The point of this is that I don't think that either behavior
    > >>> is what one would expect. It's also an open invitation
    > >>> for someone to make an unchecked mistake! I think this
    > >>> may be Hallvard's underlying issue in the other thread.
    > >>
    > >> What would you expect instead? Do you think your expectation
    > >> is implementable?

    > >
    > > I'd expect that the compiler would reject anything that
    > > wasn't either in the 7-bit ascii subset, or else defined
    > > with a hex escape.

    >
    > Then you should also expect a lot of people to move to
    > another language - one whose designers live in the real
    > world instead of your Utopian Unicode world.


    Rudeness objection to your characterization.

    Please see my response to Martin - I'm talking only,
    and I repeat ONLY, about scripts that explicitly
    say they are encoded in utf-8. Nothing else. I've
    been in this business for close to 40 years, and I'm
    quite well aware of backwards compatibility issues
    and issues with breaking existing code.

    Programmers in general have a very strong, and
    let me repeat that, VERY STRONG assumption
    that an 8-bit string contains one byte per character
    unless there is a good reason to believe otherwise.
    This assumption is built into various places, including
    all of the string methods.

    The current design allows accidental inclusion of
    a character that is not in the 7bit ascii subset ***IN
    A PROGRAM THAT HAS A UTF-8 CHARACTER
    ENCODING DECLARATION*** to break that
    assumption without any kind of notice. That in
    turn will break all of the assumptions that the string
    module and string methods are based on. That in
    turn is likely to break lots of existing modules and
    cause a lot of debugging time that could be avoided
    by proper design.
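
    To illustrate the kind of silent surprise I mean (my own sketch, Python
    2.x with a utf-8 declaration in effect; the name is just an example):

    # -*- coding: utf-8 -*-
    name = "Løwis"     # meant as five characters, actually six bytes
    print len(name)    # prints 6, not 5
    print name[:2]     # gives 'L\xc3' - the slice cuts the ø in half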

    One of Python's strong points is that it's difficult
    to get into trouble unless you deliberately try (then
    it's quite easy, fortunately.)

    I'm not worried about this causing people to
    abandon Python. I'm more worried about the
    current situation causing enough grief that people
    will decide that utf-8 source code encoding isn't
    worth it.

    > And tell me why I shouldn't be allowed to work easily with raw
    > UTF-8 strings, if I do use coding:utf-8.


    First, there's nothing that's stopping you. All that
    my proposal will do is require you to do a one
    time conversion of any strings you put in the
    program as literals. It doesn't affect any other
    strings in any other way at any other time.

    I'll withdraw my objection if you can seriously
    assure me that working with raw utf-8 in
    8-bit character string literals is what most programmers
    are going to do most of the time.

    I'm not going to accept the very common need
    of converting unicode strings to 8-bit strings so
    they can be written to disk or stored in a data base
    or whatnot (or reversing the conversion for reading.)
    That has nothing to do with the current issue - it's
    something that everyone who deals with unicode
    needs to do, regardless of the encoding of the
    source program.

    John Roth
    >
    > --
    > Hallvard
     
    John Roth, Aug 6, 2004
    #15
  16. John Roth wrote:
    > Martin, I think you misinterpreted what I said at the
    > beginning. I'm only, and I need to repeat this, ONLY
    > dealing with the case where the encoding declaration
    > specifically says that the script is in UTF-8. No other
    > case.


    From the viewpoint of PEP 263, there is absolutely *no*,
    and I repeat NO difference between choosing UTF-8 and
    choosing windows-1252 as the source encoding.

    > I'm going to deal with your response point by point,
    > but I don't think most of this is really relevant. Your
    > response only makes sense if you missed the point that
    > I was talking about scripts that explicitly declared their
    > encoding to be UTF-8, and no other scripts in no
    > other circumstances.


    I don't understand why it is desirable to single out
    UTF-8 as a source encoding. PEP 263 does no such thing,
    except for allowing an additional encoding declaration
    for UTF-8 (by means of the UTF-8 signature).

    > I didn't mean the entire source was in 7-bit ascii. What
    > I meant was that if the encoding was utf-8 then the source
    > for 8-bit string literals must be in 7-bit ascii. Nothing more.


    PEP 263 never says such a thing. Why did you get this impression
    after reading it?

    *If* you understood that byte string literals can have the full
    power of the source encoding, plus hex-escaping, I can't see what
    made you think that power did not apply if the source encoding
    was UTF-8.

    > Likewise. I never thought otherwise; in fact I'd like to expand
    > the available operators to include the set operators as well as
    > the logical operators and the "real" division operator (the one
    > you learned in grade school - the dash with a dot above and
    > below the line.)


    That would be a different PEP, though, and I doubt Guido will be
    in favour. However, this is OT for this thread.

    > It's predictable, but as far as I'm concerned, that's
    > not only useless behavior, it's counterproductive
    > behavior. I find it difficult to imagine any case
    > where the benefit of having normal character
    > literals accidentally contain utf-8 multi-byte
    > characters outweighs the pain of having it happen
    > accidentally, and then figuring out why your program
    > is giving you weird behavior.


    Might be. This is precisely the issue that Hallvard is addressing.
    I agree there should be a mechanism to check whether all significant
    non-ASCII characters are inside Unicode literals.

    I personally would prefer a command line switch over a per-file
    declaration, but that would be the subject of Hallvard's PEP.
    Under no circumstances would I disallow using the full source
    encoding in byte strings, even if the source encoding is UTF-8.
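
    Such a check could even be prototyped outside the interpreter; a rough
    sketch using the tokenize module (an illustration only, not Hallvard's
    actual proposal):

    import tokenize

    def check_byte_literals(path):
        # Warn about plain (non-u) string literals containing non-ASCII bytes.
        f = open(path, "rb")
        try:
            tokens = tokenize.generate_tokens(f.readline)
            for tok_type, tok_str, start, end, line in tokens:
                if tok_type != tokenize.STRING:
                    continue
                if tok_str[:1] in ("u", "U"):
                    continue        # Unicode literal - non-ASCII is fine here
                if [c for c in tok_str if ord(c) > 127]:
                    print "%s:%d: non-ASCII in byte string literal" \
                          % (path, start[0])
        finally:
            f.close()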

    > There's no reason why you have to have a utf-8
    > encoding declaration. If you want your source to
    > be utf-8, you need to accept the consequences.


    Even for UTF-8, you need an encoding declaration (although
    the UTF-8 signature is sufficient for that matter). If
    there is no encoding declaration whatsoever, Python will
    assume that the source is us-ascii.
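
    For reference, the forms PEP 263 accepts are a comment on the first or
    second line that matches coding[:=], for example:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    The Emacs-style decoration is optional; a bare "# coding: utf-8" on one
    of the first two lines works just as well.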

    > I fully expect Python to support the usual mixture
    > of encodings until 3.0 at least. At that point, everything
    > gets to be rewritten anyway.


    I very much doubt that, in two ways:
    a) Python 3.0 will not happen, in any foreseeable future
    b) if it happens, much code will stay the same, or only
    require minor changes. I doubt that non-UTF-8 source
    encoding will be banned in Python 3.

    > Were you able to write your entire program in UTF-8?
    > I think not.


    What do you mean, your entire program? All strings?
    Certainly you were. Why not?

    Of course, before UTF-8 was an RFC, there were no
    editors available, nor would any operating system
    support output in UTF-8, so you would need to
    organize everything on your own (perhaps it was
    simpler on Plan-9 at that time, but I have never
    really used Plan-9 - and you might have needed
    UTF-1 instead, anyway).

    Regards,
    Martin
     
    "Martin v. Löwis", Aug 6, 2004
    #16
  17. John Roth wrote:
    > I've
    > been in this business for close to 40 years, and I'm
    > quite well aware of backwards compatibility issues
    > and issues with breaking existing code.
    >
    > Programmers in general have a very strong, and
    > let me repeat that, VERY STRONG assumption
    > that an 8-bit string contains one byte per character
    > unless there is a good reason to believe otherwise.


    You clearly come from a Western business. In CJK
    languages, people are very aware that characters can
    have more than one byte. They consider UTF-8 as just
    another multi-byte encoding, and used to consider it
    as an encoding that Westerners made to complicate their
    lives. That attitude appears to be changing now, but
    UTF-8 is not a clear winner in the worlds where we
    Westerners would expect it to be a clear winner.

    > The current design allows accidental inclusion of
    > a character that is not in the 7bit ascii subset ***IN
    > A PROGRAM THAT HAS A UTF-8 CHARACTER
    > ENCODING DECLARATION*** to break that
    > assumption without any kind of notice.


    This is a problem only for the Western world. In the
    CJK languages, such programs were broken a long time
    ago. I don't think Python needs to be so Americo-centric
    as to protect American programmers from programming
    mistakes.

    > That in
    > turn will break all of the assumptions that the string
    > module and string methods are based on. That in
    > turn is likely to break lots of existing modules and
    > cause a lot of debugging time that could be avoided
    > by proper design.


    Indeed. If the program is currently not broken, why
    are you changing the source encoding? If you are
    trying to support multiple languages, a properly-
    designed application would use gettext instead
    of putting non-ASCII into source code.

    If you are writing a new application, and you
    put non-ASCII into the source, in UTF-8, are you
    not testing your application properly?

    > I'm not worried about this causing people to
    > abandon Python. I'm more worried about the
    > current situation causing enough grief that people
    > will decide that utf-8 source code encoding isn't
    > worth it.


    Again, this is what Hallvard's PEP is for. It
    does not apply to UTF-8 only, but I see no reason
    why UTF-8 needs to be singled out.

    > I'll withdraw my objection if you can seriously
    > assure me that working with raw utf-8 in
    > 8-bit character string literals is what most programmers
    > are going to do most of the time.


    In what time scale? Near time, most people will use
    other source encodings. In the medium term, I expect
    Unix will switch to UTF-8 throughout, at which point
    using UTF-8 byte strings will work on every Unix
    system - the scripts, by nature, won't work on non-Unix
    systems, anyway. In the long term, I expect all Python
    strings will be Unicode strings, unless explicitly
    declared as byte strings.

    Regards,
    Martin
     
    "Martin v. Löwis", Aug 6, 2004
    #17
  18. Terry Reedy

    Terry Reedy Guest

    "Martin v. Löwis" <> wrote in message
    news:...
    > If you think that only Unicode literals and comments should be
    > allowed to contain non-ASCII, I disagree: At some point, I'd
    > like to propose support for non-ASCII in identifiers. This would
    > allow people to make identifiers that represent words from their
    > native language, which is helpful for people who don't speak
    > English well.


    Off the main topic of this thread, but...

    While sympathizing with this notion, I have hitherto opposed it on the
    basis that this would lead to code that could only be read by people within
    each language group. But, rereading your idea, I realize that this
    objection would be overcome by a reader that displayed for each Unicode
    char (codepoint?) not its native glyph but a roman transliteration. As far
    as I know, such transliterations, more or less standardized, exist at least
    for all major alphabets and syllable systems. Indeed, I would find
    Japanese code displayed as

    for sushi in michiro.readlines():
        print fuji(sushi)

    clearer than 'English' code using identifiers like Q8zB2_0Ol1!

    If the Unicode group does not distribute a master roman transliteration
    table at least for alphabetic symbols, I would consider it a lack that
    hinders adoption of Unicode.

    Some writing systems also have different number digits, which could also be
    used natively and transliterated. A Unicode Python could also use a set of
    user codepoints as an alternate coding of keywords for almost complete
    nativification. I believe the math symbols are pretty universal (but I
    could be educated if not).

    Terry J. Reedy
     
    Terry Reedy, Aug 6, 2004
    #18
  19. John Roth

    John Roth Guest

    "Martin v. Löwis" <> wrote in message
    news:...
    > John Roth wrote:
    > > Martin, I think you misinterpreted what I said at the
    > > beginning. I'm only, and I need to repeat this, ONLY
    > > dealing with the case where the encoding declaration
    > > specifically says that the script is in UTF-8. No other
    > > case.

    >
    > From the viewpoint of PEP 263, there is absolutely *no*,
    > and I repeat NO difference between choosing UTF-8 and
    > choosing windows-1252 as the source encoding.


    I don't believe I ever said that PEP 263 said there was
    a difference. If I gave you that impression, I will
    apologize if you can show me where I did it.



    > > I'm going to deal with your response point by point,
    > > but I don't think most of this is really relevant. Your
    > > response only makes sense if you missed the point that
    > > I was talking about scripts that explicitly declared their
    > > encoding to be UTF-8, and no other scripts in no
    > > other circumstances.

    >
    > I don't understand why it is desirable to single out
    > UTF-8 as a source encoding. PEP 263 does no such thing,
    > except for allowing an additional encoding declaration
    > for UTF-8 (by means of the UTF-8 signature).


    As far as I'm concerned, what PEP 263 says is utterly
    irrelevant to the point I'm trying to make.

    The only connection PEP 263 has to the entire thread
    (at least from my view) is that I wanted to check on
    whether phase 2, as described in the PEP, was
    scheduled for 2.4. I was under the impression it was
    and was puzzled by not seeing it. You said it wouldn't
    be in 2.4. Question answered, no further issue on
    that point (but see below for an additional puzzlement.)

    > > I didn't mean the entire source was in 7-bit ascii. What
    > > I meant was that if the encoding was utf-8 then the source
    > > for 8-bit string literals must be in 7-bit ascii. Nothing more.

    >
    > PEP 263 never says such a thing. Why did you get this impression
    > after reading it?


    I didn't get it from the PEP. I got it from what you said. Your
    response seemed to make sense only if you assumed that I
    had this totally idiotic idea that we should change everything
    to 7-bit ascii. That was not my intention.

    Let's go back to square one and see if I can explain my
    concern from first principles.

    8-bit strings have a builtin assumption that one
    byte equals one character. This is something that
    is ingrained in the basic fabric of many programming
    languages, Python included. It's a basic assumption
    in the string module, the string methods and all through
    just about everything, and it's something that most
    programmers expect, and IMO have every right
    to expect.

    Now, people violate this assumption all the time,
    for a number of reasons, including binary data and
    encoded data (including utf-8 encodings)
    but they do so deliberately, knowing what they're
    doing. These particular exceptions don't negate the
    rule.

    The problem I have is that if you use utf-8 as the
    source encoding, you can suddenly drop multi-byte
    characters into an 8-bit string ***BY ACCIDENT***.
    This accident is not possible with single byte
    encodings, which is why I am emphasizing that I
    am only talking about source that is encoded in utf-8.
    (I don't know what happens with far Eastern multi-byte
    encodings.)

    UTF-8 encoded source has this problem. Source
    encoded with single byte encodings does not have
    this problem. It's as simple as that. Accordingly
    it is not my intention, and has never been my
    intention, to change the way 8-bit string literals
    are handled when the source program has a
    single byte encoding.

    We may disagree on whether this is enough of
    a problem that it warrants a solution. That's life.

    Now, my suggested solution of this problem was
    to require that 8-bit string literals in source that was
    encoded with UTF-8 be restricted to the 7-bit
    ascii subset. The reason is that there are logically
    three things that can be done here if we find a
    character that is outside of the 7-bit ascii subset.

    One is to do the current practice and violate the
    one byte == one character invariant, the second
    is to use some encoding to convert the non-ascii
    characters into a single byte encoding, thus
    preserving the one byte == one character invariant.
    The third is to prohibit anything that is ambiguous,
    which in practice means to restrict 8-bit literals
    to the 7-bit ascii subset (plus hex escapes, of course.)

    The second possibility begs the question of what
    encoding to use, which is why I don't seriously
    propose it (although if I understand Hallvard's
    position correctly, that's essentially his proposal.)

    > *If* you understood that byte string literals can have the full
    > power of the source encoding, plus hex-escaping, I can't see what
    > made you think that power did not apply if the source encoding
    > was UTF-8.


    I think I covered that adequately above. It's not that
    it doesn't apply, it's that it's unsafe.

    > > It's predictable, but as far as I'm concerned, that's
    > > not only useless behavior, it's counterproductive
    > > behavior. I find it difficult to imagine any case
    > > where the benefit of having normal character
    > > literals accidentally contain utf-8 multi-byte
    > > characters outweighs the pain of having it happen
    > > accidentally, and then figuring out why your program
    > > is giving you weird behavior.

    >
    > Might be. This is precisely the issue that Hallvard is addressing.
    > I agree there should be a mechanism to check whether all significant
    > non-ASCII characters are inside Unicode literals.


    I think that means we're in substantive agreement (although
    I see no reason to restrict comments to 7-bit ascii.)

    > I personally would prefer a command line switch over a per-file
    > declaration, but that would be the subject of Hallvard's PEP.
    > Under no circumstances would I disallow using the full source
    > encoding in byte strings, even if the source encoding is UTF-8.


    I assume here you intended to mean strings, not literals. If
    so, we're in agreement - I see absolutely no reason to even
    think of suggesting a change to Python's run time string
    handling behavior.

    > > There's no reason why you have to have a utf-8
    > > encoding declaration. If you want your source to
    > > be utf-8, you need to accept the consequences.

    >
    > Even for UTF-8, you need an encoding declaration (although
    > the UTF-8 signature is sufficient for that matter). If
    > there is no encoding declaration whatsoever, Python will
    > assume that the source is us-ascii.


    I think I didn't say this clearly. What I intended to get across
    is that there isn't any major reason for a source to be utf-8;
    other encodings are for the most part satisfactory.
    Saying something about the declaration seems to have muddied
    the meaning.

    The last sentence puzzles me. In 2.3, absent a declaration
    (and absent a parameter on the interpreter) Python assumes
    that the source is Latin-1, and phase 2 was to change
    this to the 7-bit ascii subset (US-Ascii). That was the
    original question at the start of this thread. I had assumed
    that change was to go into 2.4, your reply made it seem
    that it would go into 2.5 (maybe.) This statement makes
    it seem that it is the current state in 2.3.

    > > I fully expect Python to support the usual mixture
    > > of encodings until 3.0 at least. At that point, everything
    > > gets to be rewritten anyway.

    >
    > I very much doubt that, in two ways:
    > a) Python 3.0 will not happen, in any foreseeable future


    I probably should let this sleeping dog lie; however,
    there is a general expectation that there will be a 3.0
    at some point before the heat death of the universe.
    I was certainly under that impression, and I've seen
    nothing from anyone who I regard as authoritative until
    this statement that says otherwise.

    > b) if it happens, much code will stay the same, or only
    > require minor changes. I doubt that non-UTF-8 source
    > encoding will be banned in Python 3.
    >
    > > Were you able to write your entire program in UTF-8?
    > > I think not.

    >
    > What do you mean, your entire program? All strings?
    > Certainly you were. Why not?
    >
    > Of course, before UTF-8 was an RFC, there were no
    > editors available, nor would any operating system
    > support output in UTF-8, so you would need to
    > organize everything on your own (perhaps it was
    > simpler on Plan-9 at that time, but I have never
    > really used Plan-9 - and you might have needed
    > UTF-1 instead, anyway).


    This doesn't make sense in context. I'm not talking
    about some misty general UTF-8. I'm talking
    about writing Python programs using the c-python
    interpreter. Not jython, not IronPython, not some
    other programming language.
    Specifically, what would the Python 2.2 interpreter
    have done if I handed it a program encoded in utf-8?
    Was that a legitimate encoding? I don't know whether
    it was or not. Clearly it wouldn't have been possible
    before the unicode support in 2.0.

    John Roth

    >
    > Regards,
    > Martin
     
    John Roth, Aug 7, 2004
    #19
  20. John Roth

    John Roth Guest

    "Martin v. Löwis" <> wrote in message
    news:...
    > John Roth wrote:
    > > I've
    > > been in this business for close to 40 years, and I'm
    > > quite well aware of backwards compatibility issues
    > > and issues with breaking existing code.
    > >
    > > Programmers in general have a very strong, and
    > > let me repeat that, VERY STRONG assumption
    > > that an 8-bit string contains one byte per character
    > > unless there is a good reason to believe otherwise.

    >
    > You clearly come from a Western business. In CJK
    > languages, people are very aware that characters can
    > have more than one byte. They consider UTF-8 as just
    > another multi-byte encoding, and used to consider it
    > as an encoding that Westerners made to complicate their
    > lives. That attitude appears to be changing now, but
    > UTF-8 is not a clear winner in the worlds where we
    > Westerners would expect it to be a clear winner.


    I'm aware of that.

    > > The current design allows accidental inclusion of
    > > a character that is not in the 7bit ascii subset ***IN
    > > A PROGRAM THAT HAS A UTF-8 CHARACTER
    > > ENCODING DECLARATION*** to break that
    > > assumption without any kind of notice.

    >
    > This is a problem only for the Western world. In the
    > CJK languages, such programs were broken a long time
    > ago. I don't think Python needs to be so Americo-centric
    > as to protect American programmers from programming
    > mistakes.


    American != non East Asian.

    In fact, I would consider American programmers to
    be the least prone to making this kind of mistake
    simply because all standard characters are included
    in the US-Ascii subset. It's much more likely to be
    a European (or non North American) problem.
    Even when writing in English, people's names will
    have non-English characters, and they have a
    tendency to leak into literals.
    (Mexico considers itself to be part of
    Central America, for some political reason.)

    > > That in
    > > turn will break all of the assumptions that the string
    > > module and string methods are based on. That in
    > > turn is likely to break lots of existing modules and
    > > cause a lot of debugging time that could be avoided
    > > by proper design.

    >
    > Indeed. If the program is currently not broken, why
    > are you changing the source encoding? If you are
    > trying to support multiple languages, a properly-
    > designed application would use gettext instead
    > of putting non-ASCII into source code.
    >
    > If you are writing a new application, and you
    > put non-ASCII into the source, in UTF-8, are you
    > not testing your application properly?
    >
    > > I'm not worried about this causing people to
    > > abandon Python. I'm more worried about the
    > > current situation causing enough grief that people
    > > will decide that utf-8 source code encoding isn't
    > > worth it.

    >
    > Again, this is what Hallvard's PEP is for. It
    > does not apply to UTF-8 only, but I see no reason
    > why UTF-8 needs to be singled out.
    >
    > > I'll withdraw my objection if you can seriously
    > > assure me that working with raw utf-8 in
    > > 8-bit character string literals is what most programmers
    > > are going to do most of the time.

    >
    > In what time scale? Near time, most people will use
    > other source encodings. In the medium term, I expect
    > Unix will switch to UTF-8 throughout, at which point
    > using UTF-8 byte strings will work on every Unix
    > system - the scripts, by nature, won't work on non-Unix
    > systems, anyway. In the long term, I expect all Python
    > strings will be Unicode strings, unless explicitly
    > declared as byte strings.


    I asked Hallvard this question, not you. It makes sense
    in the context of the statements of his I was responding to.

    Your answer does not make sense. Hallvard's objection
    was that he actually wanted to have non-ascii characters
    put into byte literals in their utf-8 encoded forms (at least
    as I understand it.)

    If I thought about it, I could undoubtedly come up with
    use cases where I would find this behavior useful. The
    presupposition behind my statement was that those
    use cases were overwhelmingly less likely than the
    standard uses of byte string literals where a utf-8
    encoded "character" would be a problem.

    John Roth

    >
    > Regards,
    > Martin
     
    John Roth, Aug 7, 2004
    #20
