2to3 chokes on bad character

Discussion in 'Python' started by Frank Millman, Feb 23, 2011.

  1. Hi all

    I don't know if this counts as a bug in 2to3.py, but when I ran it on my
    program directory it crashed, with a traceback but without any indication of
    which file caused the problem.

    Here is the traceback -

    Traceback (most recent call last):
    File "C:\Python32\Tools\Scripts\2to3.py", line 5, in <module>
    sys.exit(main("lib2to3.fixes"))
    File "C:\Python32\lib\lib2to3\main.py", line 172, in main
    options.processes)
    File "C:\Python32\lib\lib2to3\refactor.py", line 700, in refactor
    items, write, doctests_only)
    File "C:\Python32\lib\lib2to3\refactor.py", line 294, in refactor
    self.refactor_dir(dir_or_file, write, doctests_only)
    File "C:\Python32\lib\lib2to3\refactor.py", line 314, in refactor_dir
    self.refactor_file(fullname, write, doctests_only)
    File "C:\Python32\lib\lib2to3\refactor.py", line 741, in refactor_file
    *args, **kwargs)
    File "C:\Python32\lib\lib2to3\refactor.py", line 336, in refactor_file
    input, encoding = self._read_python_source(filename)
    File "C:\Python32\lib\lib2to3\refactor.py", line 332, in
    _read_python_source
    return _from_system_newlines(f.read()), encoding
    File "C:\Python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
    invalid start byte

    On investigation, I found some funny characters in docstrings that I
    copy/pasted from a pdf file.

    Here are the details if they are of any use. Oddly, I found two instances
    where characters 'look like' apostrophes when viewed in my text editor, but
    one of them was accepted by 2to3 and the other caused the crash.

    The one that was accepted consists of three bytes - 226, 128, 153 (as
    reported by python 2.6) or 226, 8364, 8482 (as reported by python3.2).

    The one that crashed consists of a single byte - 146 (python 2.6) or 8217
    (python 3.2).

    The issue is not that 2to3 should handle this correctly, but that it should
    give a more informative error message to the unsuspecting user.
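
For reference, a rough Python 3 way to find the offending file before running 2to3 on a directory. This is only a sketch: `find_undecodable` is a made-up helper name, and it assumes every `.py` file is supposed to be UTF-8, which may not hold if a file declares some other encoding.

```python
import os


def find_undecodable(root, encoding="utf-8"):
    """Return (path, byte_offset, reason) for each .py file under root
    whose raw bytes do not decode with the given encoding."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            # Read raw bytes so that no implicit decoding happens in open().
            with open(path, "rb") as f:
                data = f.read()
            try:
                data.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first undecodable byte.
                problems.append((path, exc.start, exc.reason))
    return problems
```

Calling `find_undecodable("my_program_dir")` (directory name hypothetical) returns one tuple per bad file, so the file name and byte position are known up front.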

    Frank Millman

    BTW I have always waited for 'final releases' before upgrading in the past,
    but this makes me realise the importance of checking out the beta versions -
    I will do so in future.
     
    Frank Millman, Feb 23, 2011
    #1

  2. John Machin Guest

    On Feb 23, 7:47 pm, "Frank Millman" <> wrote:
    > Hi all
    >
    > I don't know if this counts as a bug in 2to3.py, but when I ran it on my
    > program directory it crashed, with a traceback but without any indication of
    > which file caused the problem.
    >

    [traceback snipped]

    > UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
    > invalid start byte
    >
    > On investigation, I found some funny characters in docstrings that I
    > copy/pasted from a pdf file.
    >
    > Here are the details if they are of any use. Oddly, I found two instances
    > where characters 'look like' apostrophes when viewed in my text editor, but
    > one of them was accepted by 2to3 and the other caused the crash.
    >
    > The one that was accepted consists of three bytes - 226, 128, 153 (as
    > reported by python 2.6)


    How did you incite it to report like that? Just use repr(the_3_bytes).
    It'll show up as '\xe2\x80\x99'.

    >>> from unicodedata import name as ucname
    >>> ''.join(chr(i) for i in (226, 128, 153)).decode('utf8')
    u'\u2019'
    >>> ucname(_)
    'RIGHT SINGLE QUOTATION MARK'

    What you have there is the UTF-8 representation of U+2019 RIGHT SINGLE
    QUOTATION MARK. That's OK.
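
The session above is Python 2. A rough Python 3 equivalent, as a sketch (the variable names are only illustrative), builds a `bytes` object and decodes it as a whole:

```python
from unicodedata import name as ucname

# The three byte values reported for the "accepted" apostrophe.
three_bytes = bytes([226, 128, 153])

# Decoding the whole sequence as UTF-8 yields a single character.
char = three_bytes.decode("utf-8")

print(ascii(three_bytes))  # b'\xe2\x80\x99'
print(ascii(char))         # '\u2019'
print(ucname(char))        # RIGHT SINGLE QUOTATION MARK
```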

    > or 226, 8364, 8482 (as reported by python3.2).

    Sorry, but you have instructed Python 3.2 to commit a nonsense:

    >>> [ord(chr(i).decode('cp1252')) for i in (226, 128, 153)]
    [226, 8364, 8482]

    In other words, you have taken that 3-byte sequence, decoded each byte
    separately using cp1252 (aka "the usual suspect") into a meaningless
    Unicode character and printed its ordinal.
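
The same effect can be reproduced on Python 3. This is a sketch of one way to arrive at those numbers; that the file was read as cp1252 text (for example via a text-mode `open()` on a Windows machine) is a guess, not something stated in the thread.

```python
# The UTF-8 encoding of U+2019, misread one byte at a time as cp1252.
utf8_bytes = bytes([226, 128, 153])

per_byte = [ord(bytes([b]).decode("cp1252")) for b in utf8_bytes]
print(per_byte)  # [226, 8364, 8482]

# Decoding the whole sequence as cp1252 gives three unrelated characters
# instead of the one apostrophe that was intended.
mojibake = utf8_bytes.decode("cp1252")
print(ascii(mojibake))  # '\xe2\u20ac\u2122'
```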

    In Python 3, don't use repr(); it has undergone the MHTP
    transformation and become ascii().
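
To make that concrete, a small Python 3 sketch: `ascii()` escapes non-ASCII characters the way `repr()` did in Python 2, while Python 3's `repr()` leaves printable non-ASCII characters as they are.

```python
s = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# ascii() always produces a pure-ASCII string with escapes.
escaped = ascii(s)
print(escaped)  # '\u2019'

# repr() keeps the character itself, so the result is just
# quote + character + quote and may not display on every terminal.
raw_repr = repr(s)
print(len(raw_repr))  # 3
```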

    >
    > The one that crashed consists of a single byte - 146 (python 2.6) or 8217
    > (python 3.2).


    >>> chr(146).decode('cp1252')
    u'\u2019'
    >>> hex(8217)
    '0x2019'
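
A Python 3 sketch of why this byte is fine as cp1252 but fatal as UTF-8: 0x92 has the bit pattern 10xxxxxx, which UTF-8 reserves for continuation bytes, so it cannot begin a character.

```python
bad = bytes([146])  # the single byte 0x92

# As cp1252 it is the same apostrophe, U+2019 (decimal 8217).
decoded = bad.decode("cp1252")
print(ascii(decoded), ord(decoded))  # '\u2019' 8217

# As UTF-8 it is rejected, with the reason seen in the traceback.
reason = None
try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)  # invalid start byte
```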


    > The issue is not that 2to3 should handle this correctly, but that it should
    > give a more informative error message to the unsuspecting user.


    Your Python 2.x code should be TESTED before you poke 2to3 at it. In
    this case just trying to run or import the offending code file would
    have given an informative syntax error (you have declared the .py file
    to be encoded in UTF-8 but it's not).
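
A check along those lines can be sketched on Python 3 with the standard `tokenize.detect_encoding()`. The helper name `check_declared_encoding` is invented for this example, and it only compares the declared (or default) encoding against the raw bytes; it does not compile or import anything.

```python
import tokenize


def check_declared_encoding(path):
    """Return None if the file decodes with its declared encoding,
    otherwise a message naming the file, the byte, and its offset."""
    with open(path, "rb") as f:
        # Reads at most two lines looking for a BOM or coding cookie;
        # falls back to utf-8. May raise SyntaxError for a bad cookie.
        encoding, _first_lines = tokenize.detect_encoding(f.readline)
        f.seek(0)
        data = f.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "%s: declared %s, but byte 0x%02x at offset %d: %s" % (
            path, encoding, data[exc.start], exc.start, exc.reason)
    return None
```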

    > BTW I have always waited for 'final releases' before upgrading in the past,
    > but this makes me realise the importance of checking out the beta versions -
    > I will do so in future.


    I'm willing to bet that the same would happen with Python 3.1, if a
    3.1 to 3.2 upgrade is what you are talking about.
     
    John Machin, Feb 24, 2011
    #2

  3. Peter Otten Guest

    John Machin wrote:

    > On Feb 23, 7:47 pm, "Frank Millman" <> wrote:
    >> Hi all
    >>
    >> I don't know if this counts as a bug in 2to3.py, but when I ran it on my
    >> program directory it crashed, with a traceback but without any indication
    >> of which file caused the problem.
    >>

    > [traceback snipped]
    >
    >> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
    >> invalid start byte
    >>
    >> On investigation, I found some funny characters in docstrings that I
    >> copy/pasted from a pdf file.
    >>
    >> Here are the details if they are of any use. Oddly, I found two instances
    >> where characters 'look like' apostrophes when viewed in my text editor,
    >> but one of them was accepted by 2to3 and the other caused the crash.
    >>
    >> The one that was accepted consists of three bytes - 226, 128, 153 (as
    >> reported by python 2.6)

    >
    > How did you incite it to report like that? Just use repr(the_3_bytes).
    > It'll show up as '\xe2\x80\x99'.
    >
    > >>> from unicodedata import name as ucname
    > >>> ''.join(chr(i) for i in (226, 128, 153)).decode('utf8')
    > u'\u2019'
    > >>> ucname(_)
    > 'RIGHT SINGLE QUOTATION MARK'
    >
    > What you have there is the UTF-8 representation of U+2019 RIGHT SINGLE
    > QUOTATION MARK. That's OK.
    >
    > or 226, 8364, 8482 (as reported by python3.2).
    >
    > Sorry, but you have instructed Python 3.2 to commit a nonsense:
    >
    > >>> [ord(chr(i).decode('cp1252')) for i in (226, 128, 153)]
    > [226, 8364, 8482]
    >
    > In other words, you have taken that 3-byte sequence, decoded each byte
    > separately using cp1252 (aka "the usual suspect") into a meaningless
    > Unicode character and printed its ordinal.
    >
    > In Python 3, don't use repr(); it has undergone the MHTP
    > transformation and become ascii().
    >
    >>
    >> The one that crashed consists of a single byte - 146 (python 2.6) or 8217
    >> (python 3.2).

    >
    > >>> chr(146).decode('cp1252')
    > u'\u2019'
    > >>> hex(8217)
    > '0x2019'
    >
    >
    >> The issue is not that 2to3 should handle this correctly, but that it
    >> should give a more informative error message to the unsuspecting user.

    >
    > Your Python 2.x code should be TESTED before you poke 2to3 at it. In
    > this case just trying to run or import the offending code file would
    > have given an informative syntax error (you have declared the .py file
    > to be encoded in UTF-8 but it's not).


    The problem is that Python 2.x accepts arbitrary bytes in string constants.
    No error message or warning:

    $ python
    Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
    [GCC 4.4.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> with open("tmp.py", "w") as f:  # prepare the broken script
    ...     f.write("# -*- coding: utf-8 -*-\nprint 'bogus char: \x92'\n")
    ...
    >>>

    $ cat tmp.py
    # -*- coding: utf-8 -*-
    print 'bogus char: �'
    $ python2.6 tmp.py
    bogus char: �
    $ 2to3-3.2 tmp.py
    [traceback snipped]
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 43:
    invalid start byte

    In theory 2to3 could be changed to take the same approach as os.listdir(),
    but, as in the OP's example, occurrences of the problem are likely to be
    editing accidents.
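
Reading "the same approach as os.listdir()" as a reference to the `surrogateescape` error handler (PEP 383) is an assumption on my part. Under that assumption, the idea can be sketched in Python 3 like this: undecodable bytes are smuggled through as lone surrogates and restored on encoding, so nothing is lost, but nothing is flagged either.

```python
# One line of the broken script, containing the stray cp1252 byte.
raw = b"print 'bogus char: \x92'\n"

# With surrogateescape, the bad byte becomes the lone surrogate U+DC92
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text[19]))  # '\udc92'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
assert restored == raw
```

The trade-off is the one mentioned above: an editing accident would pass through silently instead of being reported.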
     
    Peter Otten, Feb 24, 2011
    #3
  4. "John Machin" <> wrote:
    On Feb 23, 7:47 pm, "Frank Millman" <> wrote:

    [snip lots of valuable info]

    >> The issue is not that 2to3 should handle this correctly, but that it
    >> should
    >> give a more informative error message to the unsuspecting user.


    >Your Python 2.x code should be TESTED before you poke 2to3 at it. In
    >this case just trying to run or import the offending code file would
    >have given an informative syntax error (you have declared the .py file
    >to be encoded in UTF-8 but it's not).


    Thank you, John - this is the main lesson.

    The file that caused the error has a .py extension, and looks like a python
    file, but it just contains documentation. It has never been executed or
    imported.

    As you say, if I had tried to run it under Python 2 it would have failed
    straight away. In these circumstances, it is unreasonable to expect 2to3 to
    know what to do with it, so it is definitely not a bug.

    >> BTW I have always waited for 'final releases' before upgrading in the
    >> past,
    >> but this makes me realise the importance of checking out the beta
    >> versions -
    >> I will do so in future.


    >I'm willing to bet that the same would happen with Python 3.1, if a
    >3.1 to 3.2 upgrade is what you are talking about


    This is my first look at Python 3, so I am talking about moving from 2.6 to
    3.2. In this case, it turns out that it was not a bug, but still, in future
    I will run some tests when betas are released, just in case I come up with
    something.

    Thanks for your response - it was very useful.

    Frank
     
    Frank Millman, Feb 24, 2011
    #4
  5. "Peter Otten" <> wrote
    > John Machin wrote:
    >
    >>
    >> Your Python 2.x code should be TESTED before you poke 2to3 at it. In
    >> this case just trying to run or import the offending code file would
    >> have given an informative syntax error (you have declared the .py file
    >> to be encoded in UTF-8 but it's not).

    >
    > The problem is that Python 2.x accepts arbitrary bytes in string
    > constants.
    > No error message or warning:
    >


    Thanks, Peter. I saw this after I replied to John, so this somewhat
    invalidates my reply.

    However, John's principle still holds true, and that is the main lesson I
    have taken away from this.

    Frank
     
    Frank Millman, Feb 24, 2011
    #5
  6. Terry Reedy Guest

    On 2/24/2011 8:11 AM, Frank Millman wrote:

    > future I will run some tests when betas are released, just in case I
    > come up with something.


    Please do, perhaps more than once. The test suite coverage is being
    improved but is not 100%. The day *after* 3.2.0 was released, someone
    reported an unpleasant bug, a regression from 3.1.x. If they had tested
    with the last beta or first release candidate, it would have been found
    and fixed. Now it's there until 3.2.1.

    --
    Terry Jan Reedy
     
    Terry Reedy, Feb 24, 2011
    #6
  7. John Machin Guest

    On Feb 25, 12:00 am, Peter Otten <> wrote:
    > John Machin wrote:


    > > Your Python 2.x code should be TESTED before you poke 2to3 at it. In
    > > this case just trying to run or import the offending code file would
    > > have given an informative syntax error (you have declared the .py file
    > > to be encoded in UTF-8 but it's not).

    >
    > The problem is that Python 2.x accepts arbitrary bytes in string constants.


    Ummm ... isn't that a bug? According to section 2.1.4 of the Python
    2.7.1 Language Reference Manual: """The encoding is used for all
    lexical analysis, in particular to find the end of a string, and to
    interpret the contents of Unicode literals. String literals are
    converted to Unicode for syntactical analysis, then converted back to
    their original encoding before interpretation starts ..."""

    How do you reconcile "used for all lexical analysis" and "String
    literals are converted to Unicode for syntactical analysis" with the
    actual (astonishing to me) behaviour?
     
    John Machin, Feb 24, 2011
    #7
  8. Peter Otten Guest

    John Machin wrote:

    > On Feb 25, 12:00 am, Peter Otten <> wrote:
    >> John Machin wrote:

    >
    >> > Your Python 2.x code should be TESTED before you poke 2to3 at it. In
    >> > this case just trying to run or import the offending code file would
    >> > have given an informative syntax error (you have declared the .py file
    >> > to be encoded in UTF-8 but it's not).

    >>
    >> The problem is that Python 2.x accepts arbitrary bytes in string
    >> constants.

    >
    > Ummm ... isn't that a bug? According to section 2.1.4 of the Python
    > 2.7.1 Language Reference Manual: """The encoding is used for all
    > lexical analysis, in particular to find the end of a string, and to
    > interpret the contents of Unicode literals. String literals are
    > converted to Unicode for syntactical analysis, then converted back to
    > their original encoding before interpretation starts ..."""
    >
    > How do you reconcile "used for all lexical analysis" and "String
    > literals are converted to Unicode for syntactical analysis" with the
    > actual (astonishing to me) behaviour?


    You are right, the current behaviour is probably an implementation accident
    stemming from the assumption that

    s.decode("utf-8").encode("utf-8") == s

    always holds. Other encodings (I tried cp1252) produce the expected
    SyntaxError.
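
The round-trip identity can be poked at directly in Python 3. This sketch does not reproduce the Python 2 tokenizer; it only shows that the identity holds for valid input and that a strict decode is what catches the stray byte. `roundtrips` is a name made up for the example.

```python
def roundtrips(data, encoding):
    """True if data decodes with `encoding` and re-encodes to the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # Strict decoding (or encoding) failed, so there is no round trip.
        return False


print(roundtrips(b"\xe2\x80\x99", "utf-8"))  # True
print(roundtrips(b"\x92", "utf-8"))          # False
print(roundtrips(b"\x92", "cp1252"))         # True
```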
     
    Peter Otten, Feb 25, 2011
    #8
