Python 3.1.1 bytes decode with replace bug

Joe · Oct 24, 2009

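The Python 3.1.1 documentation has the following example:
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
unexpected code byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'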

Strict and ignore appear to work as per the documentation, but replace
does not. Instead of replacing the values, it fails:
>>> b'\x80abc'.decode("utf-8", "replace")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "p:\SW64\Python.3.1.1\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 1: character maps to <undefined>

Is this a known bug with 3.1.1?

Terry Reedy · Oct 24, 2009

Joe wrote:

Please provide more information

The Python 3.1.1 documentation has the following example:

Where? I could not find them

Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
unexpected code byte

'abc'

Click to expand...

Strict and Ignore appear to work as per the documentation but replace
does not. Instead of replacing the values it fails:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "p:\SW64\Python.3.1.1\lib\encodings\cp437.py", line 19, in
encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in
position
1: character maps to <undefined>

Which interpreter and system? With Python 3.1 (r31:73574, Jun 26 2009,
20:21:35) [MSC v.1500 32 bit (Intel)] on win32, IDLE, I get
'ï¿½abc'

as per the example.

If this a known bug with 3.1.1?

Do you do a search in the issues list at bugs.python.org?
I did and did not find anything. The discrepancy between doc (if the
example really is from the doc) and behavior (if really 3.1) would be a
bug, but more info is needed.

Terry Jan Reedy

Benjamin Kaplan · Oct 24, 2009

It's not a bug. The problem isn't even the decode statement. Python
successfully creates the unicode string '\ufffdabc' and then tries to
print it to the screen, so it has to convert it to cp437 (your console
encoding), which fails because cp437 has no mapping for U+FFFD. That's
why the traceback mentions the cp437 file and not the utf-8 file.
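
The two steps described above can be reproduced without a console. This is a minimal sketch, assuming the console encoding is cp437 as in the traceback; the variable names are illustrative and not from the thread.

```python
# Step 1: decoding. "replace" substitutes U+FFFD for the invalid byte 0x80.
raw = b"\x80abc"
text = raw.decode("utf-8", "replace")
decode_ok = (text == "\ufffdabc")

# Step 2: roughly what the interactive prompt does when it echoes the result:
# it encodes repr(text) for the console. cp437 is assumed here to match
# the traceback in the thread.
try:
    repr(text).encode("cp437", "strict")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print("decode succeeded:", decode_ok)
print("encode failed at position:", encode_error.start)
```

The failure position is 1 rather than 0 because the repr starts with a quote character, which matches the "position 1" in the original traceback.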

Joe · Oct 24, 2009

Thanks for your response.

Terry Reedy wrote:
Where? I could not find them

http://docs.python.org/3.1/howto/unicode.html#unicode-howto

Scroll about halfway down the page to "The String Type" section.

The example was copied from the second example with the light green
background.

Terry Reedy wrote:
Which interpreter and system?

Python 3.1.1 (r311:74483, Aug 17 2009, 16:45:59) [MSC v.1500 64 bit
(AMD64)] on win32

Windows 7 x64 RTM, Python 3.1.1

Terry Reedy wrote:
Do you do a search in the issues list at bugs.python.org?

Yes, I did not see anything that seemed to apply.

Terry Reedy · Oct 24, 2009

Joe said:
Python 3.1.1 (r311:74483, Aug 17 2009, 16:45:59) [MSC v.1500 64 bit
(AMD64)] on win32

Windows 7 x64 RTM, Python 3.1.1

For the reason BK explained, the important difference is that I ran in
the IDLE shell, which handles screen printing of unicode better ;-)

The important lesson for debugging, which I forgot also in my response,
is to separate creation of a (unicode) string from the printing of such.
You are not the first to get caught on this.
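
One way to apply that lesson, sketched here as an assumption rather than as what was posted, is to create the string first and then inspect it with the built-in ascii(), which escapes every non-ASCII character so the console encoding never matters. The variable names are illustrative.

```python
raw = b"\x80abc"

# Step 1: create the string. Nothing is printed here, so no console
# encoding is involved and any error would come from decode() itself.
decoded = raw.decode("utf-8", "replace")

# Step 2: inspect it through ascii(), which returns a pure-ASCII repr
# that any console can display.
safe_view = ascii(decoded)
print(len(decoded), safe_view)
```

If step 1 raises, the problem is in the decode; if only a plain print(decoded) raises, the problem is in the console encoding.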

IE,

Yes, I did not see anything that seemed to apply.

tjr

Joe · Oct 25, 2009

Terry Reedy said:
For the reason BK explained, the important difference is that I ran in
the IDLE shell, which handles screen printing of unicode better ;-)

Something still does not seem right here to me.

In the example above the bytes were decoded to 'UTF-8' with the
replace option so any characters that were not UTF-8 were replaced and
the resulting string is '\ufffdabc' as BK explained. I understand
that the replace worked.

Now consider this:

Python 3.1.1 (r311:74483, Aug 17 2009, 16:45:59) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '\ufffdabc'
>>> print(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "p:\SW64\Python.3.1.1\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 0: character maps to <undefined>
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

This too fails for the exact same reason (and doesn't involve decode).

In the original example I decoded to UTF-8 and in this example the
default encoding is UTF-8 so why is cp437 being used?

Thanks in advance for your assistance!

Benjamin Kaplan · Oct 25, 2009

Try checking sys.stdout.encoding. Then run the command chcp (not in
the Python interpreter). You'll probably get 437 from both of those.
Just because the default encoding is set to utf-8 doesn't mean the
console's is. Nobody really uses cp437 anymore (it was replaced years
ago by cp1252), but Microsoft is scared to do anything to cmd.exe
because it might break somebody's 20-year-old DOS script.
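
The check suggested above can be sketched as follows. It also prints locale.getpreferredencoding(), which is not mentioned in the thread and is included here only as an extra point of comparison; the dictionary name is illustrative.

```python
import locale
import sys

# The three values can differ from one another. print() uses the stream's
# encoding (sys.stdout.encoding), not sys.getdefaultencoding().
encodings = {
    "sys.getdefaultencoding()": sys.getdefaultencoding(),
    "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
    "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
}

for name, value in encodings.items():
    print(name, "=", value)
```

On a Windows console like the one in this thread, the second line would be expected to report cp437 while the first reports utf-8.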

Dave Angel · Oct 25, 2009

Joe said:
Something still does not seem right here to me.

In the example above the bytes were decoded to 'UTF-8' with the

*nope* you're decoding FROM utf-8 to unicode.

replace option so any characters that were not UTF-8 were replaced and
the resulting string is '\ufffdabc' as BK explained. I understand
that the replace worked.

Benjamin had it right, but you still don't understand what he said.

The problem in your original example, and in the current one, is not in
decode() but in encode(), which is implicitly called by print() when
needed to convert from Unicode to the byte format of the console. Take
your original example:

>>> b'\x80abc'.decode("utf-8", "replace")

The decode() is explicit, and converts *FROM* a utf8 byte string to a
unicode one. But since you're running in the interactive interpreter,
there's an implicit print of the result, which converts unicode into
whatever your default console encoding is. That calls encode() (or one
of its variants, charmap_encode()) on the unicode string. There is no
relationship between the two steps.

In your current example, you're explicitly doing the print(), but you
still have the same implicit encoding to cp437, which gets the equivalent
error. That's the encoding that your Python 3.x is choosing for the
stdout console, based on country-specific Windows settings. In the US,
that console code page is cp437, a superset of ASCII. I don't know how to
override it generically, but I know it's possible to replace stdout with
a wrapper that does your preferred encoding. You probably want to keep
cp437, but change the error handling to ignore. Or if this is a one-time
problem, I suspect you could do the encoding manually, to a byte array,
then print that.
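
Both suggestions can be sketched roughly like this. It is a sketch under stated assumptions, not a tested fix for the original console: make_lenient_writer and fake_console are hypothetical names, and an in-memory io.BytesIO stands in for the real console so the result is visible as bytes.

```python
import io

def make_lenient_writer(binary_stream, encoding="cp437", errors="replace"):
    """Wrap a binary stream so unencodable characters are substituted
    (errors="replace") or dropped (errors="ignore") instead of raising."""
    return io.TextIOWrapper(
        binary_stream, encoding=encoding, errors=errors, newline="\n"
    )

# Option 1: a wrapper around the output stream.
fake_console = io.BytesIO()          # stand-in for the real console
out = make_lenient_writer(fake_console)
print("\ufffdabc", file=out)
out.flush()
wrapped_bytes = fake_console.getvalue()

# Option 2: encode manually with a lenient error handler.
manual_bytes = "\ufffdabc".encode("cp437", "ignore")

print(wrapped_bytes, manual_bytes)
```

For a real console the same wrapper could presumably be built around sys.stdout.buffer and assigned to sys.stdout. Later Python versions also offer sys.stdout.reconfigure(errors="replace") and the PYTHONIOENCODING environment variable, neither of which is discussed in this thread.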

DaveA

Mark Tolonen · Oct 26, 2009

Joe said:
Now consider this:

>>> s = '\ufffdabc'
>>> print(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "p:\SW64\Python.3.1.1\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 0: character maps to <undefined>
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'

Dave Angel said:
I don't know how to override it generically,
but I know it's possible to replace stdout with a wrapper that does your
preferred encoding. You probably want to keep cp437, but change the error
handling to ignore. Or if this is a one-time problem, I suspect you could
do the encoding manually, to a byte array, then print that.

You can also replace the Unicode replacement character U+FFFD with a valid
cp437 character before displaying it:
'?abc'
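
A minimal sketch of that substitution, assuming str.replace() with '?' as the stand-in character (the variable names are illustrative):

```python
decoded = b"\x80abc".decode("utf-8", "replace")

# Swap the replacement character for something cp437 can represent.
printable = decoded.replace("\ufffd", "?")

# Strict encoding now succeeds, so printing to a cp437 console would too.
console_bytes = printable.encode("cp437")
print(printable)
```

The trade-off is that a genuine '?' in the data can no longer be told apart from a substituted one.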

-Mark

Joe · Oct 26, 2009

Thanks Benjamin for solving the mystery of where the cp437 usage was
coming from.

So b'\x80abc'.decode("utf-8", "replace") was working properly, but
when the interactive prompt tried to print the result, it was basically
doing an encode('cp437', 'strict'), which failed because of the
character that is not part of cp437.

It might not be a bad idea to put a note on that documentation page,
because I'm sure others will work through the samples like I did and, if
they are on Windows, run into the same issue.
