2to3 chokes on bad character

Frank Millman

Hi all

I don't know if this counts as a bug in 2to3.py, but when I ran it on my
program directory it crashed, with a traceback but without any indication of
which file caused the problem.

Here is the traceback -

Traceback (most recent call last):
File "C:\Python32\Tools\Scripts\2to3.py", line 5, in <module>
sys.exit(main("lib2to3.fixes"))
File "C:\Python32\lib\lib2to3\main.py", line 172, in main
options.processes)
File "C:\Python32\lib\lib2to3\refactor.py", line 700, in refactor
items, write, doctests_only)
File "C:\Python32\lib\lib2to3\refactor.py", line 294, in refactor
self.refactor_dir(dir_or_file, write, doctests_only)
File "C:\Python32\lib\lib2to3\refactor.py", line 314, in refactor_dir
self.refactor_file(fullname, write, doctests_only)
File "C:\Python32\lib\lib2to3\refactor.py", line 741, in refactor_file
*args, **kwargs)
File "C:\Python32\lib\lib2to3\refactor.py", line 336, in refactor_file
input, encoding = self._read_python_source(filename)
File "C:\Python32\lib\lib2to3\refactor.py", line 332, in _read_python_source
return _from_system_newlines(f.read()), encoding
File "C:\Python32\lib\codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
invalid start byte

On investigation, I found some funny characters in docstrings that I
copy/pasted from a pdf file.

Here are the details if they are of any use. Oddly, I found two instances
where characters 'look like' apostrophes when viewed in my text editor, but
one of them was accepted by 2to3 and the other caused the crash.

The one that was accepted consists of three bytes - 226, 128, 153 (as
reported by python 2.6) or 226, 8364, 8482 (as reported by python3.2).

The one that crashed consists of a single byte - 146 (python 2.6) or 8217
(python 3.2).

The issue is not that 2to3 should handle this correctly, but that it should
give a more informative error message to the unsuspecting user.
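A minimal sketch of one way to locate the offending file before running 2to3, assuming Python 3 and that each source file is meant to decode with its declared encoding (UTF-8 when none is declared). The helper name `find_undecodable` is hypothetical and is not part of 2to3 or lib2to3; `tokenize.detect_encoding` is the standard-library function that reads the coding cookie.

```python
import os
import tokenize


def find_undecodable(root):
    """Yield (path, error) for each .py file under root that fails to decode.

    Hypothetical helper for illustration; it is not part of 2to3.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    # Read the coding cookie (or BOM) from the first two lines.
                    encoding, _first_lines = tokenize.detect_encoding(f.readline)
                    f.seek(0)
                    # Strict decoding raises UnicodeDecodeError on a bad byte.
                    f.read().decode(encoding)
            except (UnicodeDecodeError, SyntaxError) as exc:
                # SyntaxError covers an invalid or undecodable cookie line.
                yield path, exc
```

Looping over `find_undecodable("my_program_dir")` and printing each pair would name the file and, for a `UnicodeDecodeError`, the byte offset of the problem.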

Frank Millman

BTW I have always waited for 'final releases' before upgrading in the past,
but this makes me realise the importance of checking out the beta versions -
I will do so in future.
 
John Machin

Hi all

I don't know if this counts as a bug in 2to3.py, but when I ran it on my
program directory it crashed, with a traceback but without any indication of
which file caused the problem.
[traceback snipped]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
invalid start byte

On investigation, I found some funny characters in docstrings that I
copy/pasted from a pdf file.

Here are the details if they are of any use. Oddly, I found two instances
where characters 'look like' apostrophes when viewed in my text editor, but
one of them was accepted by 2to3 and the other caused the crash.

The one that was accepted consists of three bytes - 226, 128, 153 (as
reported by python 2.6)

How did you incite it to report like that? Just use repr(the_3_bytes).
It'll show up as '\xe2\x80\x99'.
>>> import unicodedata
>>> unicodedata.name('\xe2\x80\x99'.decode('utf8'))
'RIGHT SINGLE QUOTATION MARK'

What you have there is the UTF-8 representation of U+2019 RIGHT SINGLE
QUOTATION MARK. That's OK.

or 226, 8364, 8482 (as reported by python3.2).

Sorry, but you have instructed Python 3.2 to commit a nonsense:
>>> [ord(chr(i).decode('cp1252')) for i in (226, 128, 153)]
[226, 8364, 8482]

In other words, you have taken that 3-byte sequence, decoded each byte
separately using cp1252 (aka "the usual suspect") into a meaningless
Unicode character and printed its ordinal.
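The difference between the two decodings can be sketched in Python 3 as follows. This is an illustration only; the variable names are made up for the example, and the byte values are the ones quoted in the thread.

```python
import unicodedata

three_bytes = bytes([226, 128, 153])  # b'\xe2\x80\x99'

# Decoded as one UTF-8 sequence, the three bytes form a single character.
as_utf8 = three_bytes.decode("utf-8")
char_name = unicodedata.name(as_utf8)

# Decoded one byte at a time as cp1252, each byte becomes an unrelated character.
per_byte = [ord(bytes([b]).decode("cp1252")) for b in three_bytes]

# The lone byte 146 (0x92) is not valid UTF-8 by itself, but cp1252 maps it
# to the same curly apostrophe.
lone = bytes([146]).decode("cp1252")

print(ascii(as_utf8), char_name)
print(per_byte)
print(ascii(lone), hex(ord(lone)))
```

So both "apostrophes" are the same character, U+2019; only the three-byte form is valid in a file declared as UTF-8.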

In Python 3, don't use repr(); it has undergone the MHTP
transformation and become ascii().

The one that crashed consists of a single byte - 146 (python 2.6) or 8217
(python 3.2).

>>> hex(8217)
'0x2019'

The issue is not that 2to3 should handle this correctly, but that it should
give a more informative error message to the unsuspecting user.

Your Python 2.x code should be TESTED before you poke 2to3 at it. In
this case just trying to run or import the offending code file would
have given an informative syntax error (you have declared the .py file
to be encoded in UTF-8 but it's not).

BTW I have always waited for 'final releases' before upgrading in the past,
but this makes me realise the importance of checking out the beta versions -
I will do so in future.

I'm willing to bet that the same would happen with Python 3.1, if a
3.1 to 3.2 upgrade is what you are talking about
 
Peter Otten

John said:
Your Python 2.x code should be TESTED before you poke 2to3 at it. In
this case just trying to run or import the offending code file would
have given an informative syntax error (you have declared the .py file
to be encoded in UTF-8 but it's not).

The problem is that Python 2.x accepts arbitrary bytes in string constants.
No error message or warning:

$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("tmp.py", "w") as f:
...     f.write("# -*- coding: utf-8 -*-\nprint 'bogus char: \x92'\n")
...
$ cat tmp.py
# -*- coding: utf-8 -*-
print 'bogus char: �'
$ python2.6 tmp.py
bogus char: �
$ 2to3-3.2 tmp.py
[traceback snipped]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 43:
invalid start byte

In theory 2to3 could be changed to take the same approach as os.listdir(),
but as in the OP's example, occurrences of the problem are likely to be
editing accidents.
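A minimal sketch of what such an approach might look like, assuming "the same approach as os.listdir()" refers to the surrogateescape error handler that Python 3 uses for undecodable file names (PEP 383). This is not something 2to3 does; it only shows that stray bytes can survive a decode/encode round trip.

```python
# The same bytes as tmp.py in the example above.
raw = b"# -*- coding: utf-8 -*-\nprint 'bogus char: \x92'\n"

# With surrogateescape, each undecodable byte becomes a lone surrogate
# in the range U+DC80..U+DCFF instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")
bad_char = text[43]  # offset 43 is where 2to3 reported the bad byte

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")

print(ascii(bad_char))
print(restored == raw)
```

The trade-off is that the bad byte passes through silently, which hides exactly the kind of editing accident described here.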
 
Frank Millman

[snip lots of valuable info]
Your Python 2.x code should be TESTED before you poke 2to3 at it. In
this case just trying to run or import the offending code file would
have given an informative syntax error (you have declared the .py file
to be encoded in UTF-8 but it's not).

Thank you, John - this is the main lesson.

The file that caused the error has a .py extension, and looks like a python
file, but it just contains documentation. It has never been executed or
imported.

As you say, if I had tried to run it under Python 2 it would have failed
straight away. In these circumstances, it is unreasonable to expect 2to3 to
know what to do with it, so it is definitely not a bug.

I'm willing to bet that the same would happen with Python 3.1, if a
3.1 to 3.2 upgrade is what you are talking about

This is my first look at Python 3, so I am talking about moving from 2.6 to
3.2. In this case, it turns out that it was not a bug, but still, in future
I will run some tests when betas are released, just in case I come up with
something.

Thanks for your response - it was very useful.

Frank
 
Frank Millman

Peter Otten said:
The problem is that Python 2.x accepts arbitrary bytes in string
constants.
No error message or warning:

Thanks, Peter. I saw this after I replied to John, so this somewhat
invalidates my reply.

However, John's principle still holds true, and that is the main lesson I
have taken away from this.

Frank
 
Terry Reedy

in future I will run some tests when betas are released, just in case I
come up with something.

Please do, perhaps more than once. The test suite coverage is being
improved but is not 100%. The day *after* 3.2.0 was released, someone
reported an unpleasant bug, a regression from 3.1.x. If they had tested
with the last beta or first release candidate, it would have been found
and fixed. Now it's there until 3.2.1.
 
John Machin

Peter Otten wrote:

The problem is that Python 2.x accepts arbitrary bytes in string constants.

Ummm ... isn't that a bug? According to section 2.1.4 of the Python
2.7.1 Language Reference Manual: """The encoding is used for all
lexical analysis, in particular to find the end of a string, and to
interpret the contents of Unicode literals. String literals are
converted to Unicode for syntactical analysis, then converted back to
their original encoding before interpretation starts ..."""

How do you reconcile "used for all lexical analysis" and "String
literals are converted to Unicode for syntactical analysis" with the
actual (astonishing to me) behaviour?
 
Peter Otten

John said:
Ummm ... isn't that a bug? According to section 2.1.4 of the Python
2.7.1 Language Reference Manual: """The encoding is used for all
lexical analysis, in particular to find the end of a string, and to
interpret the contents of Unicode literals. String literals are
converted to Unicode for syntactical analysis, then converted back to
their original encoding before interpretation starts ..."""

How do you reconcile "used for all lexical analysis" and "String
literals are converted to Unicode for syntactical analysis" with the
actual (astonishing to me) behaviour?

You are right, the current behaviour is probably an implementation accident
stemming from the assumption that

s.decode("utf-8").encode("utf-8") == s

always holds. Other encodings (I tried cp1252) produce the expected
SyntaxError.
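The round-trip identity only holds for byte strings that are valid in the encoding in question. A small Python 3 sketch of that check follows; the function name `roundtrips` is hypothetical, and this does not reproduce what the Python 2 compiler actually does internally.

```python
def roundtrips(data: bytes, encoding: str) -> bool:
    """Return True if data decodes strictly and re-encodes to the same bytes."""
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        # A strict decode rejects the input before any round trip happens.
        return False


print(roundtrips(b"\xe2\x80\x99", "utf-8"))  # valid UTF-8 for U+2019
print(roundtrips(b"\x92", "utf-8"))          # lone continuation byte
print(roundtrips(b"\x92", "cp1252"))         # cp1252 maps 0x92 to U+2019
print(roundtrips(b"\x81", "cp1252"))         # 0x81 is undefined in cp1252
```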
 
