2to3 chokes on bad character

Frank Millman · Feb 23, 2011

Hi all

I don't know if this counts as a bug in 2to3.py, but when I ran it on my
program directory it crashed, with a traceback but without any indication of
which file caused the problem.

Here is the traceback -

Traceback (most recent call last):
  File "C:\Python32\Tools\Scripts\2to3.py", line 5, in <module>
    sys.exit(main("lib2to3.fixes"))
  File "C:\Python32\lib\lib2to3\main.py", line 172, in main
    options.processes)
  File "C:\Python32\lib\lib2to3\refactor.py", line 700, in refactor
    items, write, doctests_only)
  File "C:\Python32\lib\lib2to3\refactor.py", line 294, in refactor
    self.refactor_dir(dir_or_file, write, doctests_only)
  File "C:\Python32\lib\lib2to3\refactor.py", line 314, in refactor_dir
    self.refactor_file(fullname, write, doctests_only)
  File "C:\Python32\lib\lib2to3\refactor.py", line 741, in refactor_file
    *args, **kwargs)
  File "C:\Python32\lib\lib2to3\refactor.py", line 336, in refactor_file
    input, encoding = self._read_python_source(filename)
  File "C:\Python32\lib\lib2to3\refactor.py", line 332, in _read_python_source
    return _from_system_newlines(f.read()), encoding
  File "C:\Python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055: invalid start byte

On investigation, I found some funny characters in docstrings that I
copy/pasted from a pdf file.

Here are the details if they are of any use. Oddly, I found two instances
where characters 'look like' apostrophes when viewed in my text editor, but
one of them was accepted by 2to3 and the other caused the crash.

The one that was accepted consists of three bytes - 226, 128, 153 (as
reported by python 2.6) or 226, 8364, 8482 (as reported by python3.2).

The one that crashed consists of a single byte - 146 (python 2.6) or 8217
(python 3.2).

The issue is not that 2to3 should handle this correctly, but that it should
give a more informative error message to the unsuspecting user.
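
For anyone who hits the same traceback, one rough way to find the offending file is to try decoding every .py file yourself before running 2to3. The following is only a sketch for Python 3: find_undecodable_sources is a made-up helper name, not part of lib2to3, and it assumes that honouring the coding cookie via tokenize.detect_encoding() is close enough to what 2to3 does when it reads a source file.

```python
import io
import os
import tokenize


def find_undecodable_sources(root):
    """Return (path, error) pairs for .py files under root that fail to decode.

    Hypothetical helper, not part of lib2to3.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            try:
                # Honour a coding cookie or BOM; the default is UTF-8.
                encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
                data.decode(encoding)
            except (SyntaxError, UnicodeDecodeError) as exc:
                problems.append((path, exc))
    return problems
```

Each returned pair holds the file name and the exception, so printing them shows both which file is at fault and the byte position reported by the codec.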

Frank Millman

BTW I have always waited for 'final releases' before upgrading in the past,
but this makes me realise the importance of checking out the beta versions -
I will do so in future.

John Machin · Feb 24, 2011

Hi all

I don't know if this counts as a bug in 2to3.py, but when I ran it on my
program directory it crashed, with a traceback but without any indication of
which file caused the problem.

[traceback snipped]

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
invalid start byte

On investigation, I found some funny characters in docstrings that I
copy/pasted from a pdf file.

Here are the details if they are of any use. Oddly, I found two instances
where characters 'look like' apostrophes when viewed in my text editor, but
one of them was accepted by 2to3 and the other caused the crash.

The one that was accepted consists of three bytes - 226, 128, 153 (as
reported by python 2.6)

How did you incite it to report like that? Just use repr(the_3_bytes).
It'll show up as '\xe2\x80\x99'.

What you have there is the UTF-8 representation of U+2019 RIGHT SINGLE
QUOTATION MARK. That's OK.
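
The same check can be made under Python 3. This is a small sketch, not taken from the thread; the variable names are only for illustration.

```python
import unicodedata

# The three bytes Frank reported, as a bytes object.
three_bytes = bytes([226, 128, 153])

# Decode them together as one UTF-8 sequence.
char = three_bytes.decode("utf-8")

print(ascii(char))             # '\u2019'
print(hex(ord(char)))          # 0x2019
print(unicodedata.name(char))  # RIGHT SINGLE QUOTATION MARK
```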

or 226, 8364, 8482 (as reported by python3.2).

Sorry, but you have instructed Python 3.2 to commit a nonsense:

>>> [ord(chr(i).decode('cp1252')) for i in (226, 128, 153)]
[226, 8364, 8482]

In other words, you have taken that 3-byte sequence, decoded each byte
separately using cp1252 (aka "the usual suspect") into a meaningless
Unicode character and printed its ordinal.
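
The expression above is Python 2 syntax. A Python 3 spelling of the same byte-by-byte mistake might look like this (a sketch; the names are illustrative only):

```python
utf8_bytes = (226, 128, 153)

# Wrong: decode each byte on its own as cp1252, then take the code point.
per_byte = [ord(bytes([i]).decode("cp1252")) for i in utf8_bytes]

# Right: decode the whole sequence as UTF-8, giving a single character.
whole = [ord(c) for c in bytes(utf8_bytes).decode("utf-8")]

print(per_byte)  # [226, 8364, 8482]
print(whole)     # [8217]
```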

In Python 3, don't use repr(); use ascii() instead. The Python 2 repr()
has undergone the MHTP transformation and become ascii().

The one that crashed consists of a single byte - 146 (python 2.6) or 8217
(python 3.2).

>>> hex(8217)
'0x2019'
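
So the lone byte 146 (0x92) is the cp1252 encoding of the same U+2019 character, which is why it looks like an apostrophe in an editor yet is rejected as UTF-8. A short Python 3 sketch of that difference (names are illustrative):

```python
lone_byte = bytes([146])  # 0x92

# As cp1252 the byte is a valid character.
as_cp1252 = lone_byte.decode("cp1252")

# As UTF-8 it is a continuation byte with no lead byte, so decoding fails.
utf8_error = None
try:
    lone_byte.decode("utf-8")
except UnicodeDecodeError as exc:
    utf8_error = exc

print(ord(as_cp1252))     # 8217
print(utf8_error.reason)  # invalid start byte
```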

The issue is not that 2to3 should handle this correctly, but that it should
give a more informative error message to the unsuspecting user.

Your Python 2.x code should be TESTED before you poke 2to3 at it. In
this case just trying to run or import the offending code file would
have given an informative syntax error (you have declared the .py file
to be encoded in UTF-8 but it's not).

BTW I have always waited for 'final releases' before upgrading in the past,
but this makes me realise the importance of checking out the beta versions -
I will do so in future.

I'm willing to bet that the same would happen with Python 3.1, if a
3.1 to 3.2 upgrade is what you are talking about.

Peter Otten · Feb 24, 2011

John Machin said:
Your Python 2.x code should be TESTED before you poke 2to3 at it. In
this case just trying to run or import the offending code file would
have given an informative syntax error (you have declared the .py file
to be encoded in UTF-8 but it's not).

The problem is that Python 2.x accepts arbitrary bytes in string constants.
No error message or warning:

$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information..... f.write("# -*- coding: utf-8 -*-\nprint 'bogus char: \x92'\n")
....$ cat tmp.py
# -*- coding: utf-8 -*-
print 'bogus char: ï¿½'
$ python2.6 tmp.py
bogus char: ï¿½
$ 2to3-3.2 tmp.py
[traceback snipped]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 43:
invalid start byte

In theory 2to3 could be changed to take the same approach as os.listdir(),
but, as in the OP's example, occurrences of the problem are likely to be
editing accidents.
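
As a rough illustration of that idea, and assuming "the same approach as os.listdir()" refers to the surrogateescape error handler that Python 3 uses for undecodable file names, the bad byte can be carried through decoding and restored unchanged on encoding. This is only a sketch of the mechanism, not something 2to3 does:

```python
raw = b"# -*- coding: utf-8 -*-\nprint 'bogus char: \x92'\n"

# Undecodable bytes become lone surrogates instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Every other byte is ASCII, so byte offsets and character offsets agree.
smuggled = text[raw.index(b"\x92")]

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

print(ascii(smuggled))  # '\udc92'
print(restored == raw)  # True
```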

Frank Millman · Feb 24, 2011

[snip lots of valuable info]

Your Python 2.x code should be TESTED before you poke 2to3 at it. In
this case just trying to run or import the offending code file would
have given an informative syntax error (you have declared the .py file
to be encoded in UTF-8 but it's not).

Thank you, John - this is the main lesson.

The file that caused the error has a .py extension, and looks like a python
file, but it just contains documentation. It has never been executed or
imported.

As you say, if I had tried to run it under Python 2 it would have failed
straight away. In these circumstances, it is unreasonable to expect 2to3 to
know what to do with it, so it is definitely not a bug.

I'm willing to bet that the same would happen with Python 3.1, if a
3.1 to 3.2 upgrade is what you are talking about.

This is my first look at Python 3, so I am talking about moving from 2.6 to
3.2. In this case, it turns out that it was not a bug, but still, in future
I will run some tests when betas are released, just in case I come up with
something.

Thanks for your response - it was very useful.

Frank

Frank Millman · Feb 24, 2011

Peter Otten said:
The problem is that Python 2.x accepts arbitrary bytes in string
constants.
No error message or warning:

Thanks, Peter. I saw this after I replied to John, so this somewhat
invalidates my reply.

However, John's principle still holds true, and that is the main lesson I
have taken away from this.

Frank

Terry Reedy · Feb 24, 2011

in future I will run some tests when betas are released, just in case I
come up with something.

Please do, perhaps more than once. The test suite coverage is being
improved but is not 100%. The day *after* 3.2.0 was released, someone
reported an unpleasant bug, a regression from 3.1.x. If they had tested
with the last beta or first release candidate, it would have been found
and fixed. Now it's there until 3.2.1.

John Machin · Feb 24, 2011

Peter Otten wrote:

The problem is that Python 2.x accepts arbitrary bytes in string constants.

Ummm ... isn't that a bug? According to section 2.1.4 of the Python
2.7.1 Language Reference Manual: """The encoding is used for all
lexical analysis, in particular to find the end of a string, and to
interpret the contents of Unicode literals. String literals are
converted to Unicode for syntactical analysis, then converted back to
their original encoding before interpretation starts ..."""

How do you reconcile "used for all lexical analysis" and "String
literals are converted to Unicode for syntactical analysis" with the
actual (astonishing to me) behaviour?

Peter Otten · Feb 25, 2011

John said:
Ummm ... isn't that a bug? According to section 2.1.4 of the Python
2.7.1 Language Reference Manual: """The encoding is used for all
lexical analysis, in particular to find the end of a string, and to
interpret the contents of Unicode literals. String literals are
converted to Unicode for syntactical analysis, then converted back to
their original encoding before interpretation starts ..."""

How do you reconcile "used for all lexical analysis" and "String
literals are converted to Unicode for syntactical analysis" with the
actual (astonishing to me) behaviour?

You are right, the current behaviour is probably an implementation accident
stemming from the assumption that

s.decode("utf-8").encode("utf-8") == s

always holds. Other encodings (I tried cp1252) produce the expected
SyntaxError.
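
At the Python level the round-trip identity can only be checked for input that decodes at all. A minimal Python 3 sketch of that check follows; round_trips is a hypothetical helper, and it says nothing about how the 2.x compiler is actually implemented:

```python
def round_trips(data, encoding="utf-8"):
    """Return True if data survives decode followed by encode unchanged.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeDecodeError:
        # Invalid input never reaches the comparison.
        return False


print(round_trips(b"\xe2\x80\x99"))  # True: valid UTF-8 for U+2019
print(round_trips(b"\x92"))          # False: not valid UTF-8
```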
