What the \xc2\xa0 ?!!

Brian D · Sep 7, 2010

In an HTML page that I'm scraping using urllib2, a \xc2\xa0
bytestring appears.

The page's charset = utf-8, and the Chrome browser I'm using displays
the characters as a space.

The page requires authentication:
https://www.nolaready.info/myalertlog.php

When I try to concatenate strings containing the bytestring, Python
chokes because it refuses to coerce the bytestring into ascii.

wfile.write('|'.join(valueList))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 163: ordinal not in range(128)

In searching for help with this issue, I've learned that the
bytestring *might* represent a non-breaking space.

When I scrape the page using urllib2, however, the characters print
as ┬á in a Windows command prompt (though I wouldn't be surprised if
this is some erroneous attempt by the antiquated command window to
handle something it doesn't understand).
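
A minimal sketch of why the console shows two odd glyphs, assuming the command prompt uses code page 437 (a common default on US Windows installs, but an assumption here). It is written in Python 3 syntax rather than the Python 2 used in this thread. Under that code page each of the two bytes is mapped to its own character (a box-drawing character and an accented a), whereas UTF-8 treats the pair as one character:

```python
# The two bytes the scraper received.
raw = b"\xc2\xa0"

# Decoded with the page's declared charset, the pair is a single character.
as_utf8 = raw.decode("utf-8")

# Decoded with code page 437 (assumed console code page), each byte
# becomes a separate glyph.
as_cp437 = raw.decode("cp437")

# Print code points instead of glyphs so the output is safe on any console.
print([hex(ord(ch)) for ch in as_utf8])
print([hex(ord(ch)) for ch in as_cp437])
```

So the bytes themselves are fine; only the console's interpretation of them differs.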

If I use IDLE to decode just the single byte referenced in the error
message as UTF-8, another error is raised:

Traceback (most recent call last):
  File "<pyshell#72>", line 1, in <module>
    weird = unicode('\xc2', 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data
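
The error above is expected: in UTF-8, 0xc2 is only the first byte of a two-byte sequence, so decoding it alone runs out of data. A small sketch in Python 3 syntax (the thread uses Python 2, where unicode('\xc2', 'utf-8') plays the same role):

```python
# 0xc2 on its own is an incomplete UTF-8 sequence.
lead = b"\xc2"

try:
    lead.decode("utf-8")
    error_reason = None
except UnicodeDecodeError as exc:
    # The decoder reports why it gave up on the truncated input.
    error_reason = exc.reason

# With its continuation byte present, the sequence decodes to one character.
complete = b"\xc2\xa0".decode("utf-8")

print(error_reason)
print(len(complete))
```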

If I decode the full bytestring, I don't get a human-readable string
(I was expecting, perhaps, a non-breaking space):

weird = unicode('\xc2\xa0', 'utf-8')
par = ' - '.join(['This is', weird])
par

u'This is - \xa0'
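
That result is in fact the non-breaking space: \xa0 is simply how the interpreter's repr shows a character that has no visible glyph. One way to check, sketched in Python 3 syntax with the standard unicodedata module:

```python
import unicodedata

# Decode the two bytes as UTF-8, the page's declared charset.
text = b"\xc2\xa0".decode("utf-8")

# Ask the Unicode database what the resulting character is.
name = unicodedata.name(text)

# Same join as in the post; the character is there, just invisible.
par = " - ".join(["This is", text])

print(name)
print(repr(par))
print(text.isspace())
```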

I suspect that the bytestring isn't UTF-8, but what is it? Latin1?
u'This just gets \xc2\xa0'

Or is it a Microsoft bytestring?
u'This just gets \xc2\xa0'

None of these codecs seem to work.
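
A hedged comparison of the three candidate codecs, in Python 3 syntax. Latin-1 and Windows-1252 (assumed here to be what "a Microsoft bytestring" refers to) are single-byte encodings, so they turn each byte into its own character and yield the two-character result \xc2\xa0 seen above. Only UTF-8 yields a single character:

```python
raw = b"\xc2\xa0"

# Decode the same bytes with each candidate codec.
decoded = {}
for codec in ("utf-8", "latin-1", "cp1252"):
    decoded[codec] = raw.decode(codec)

for codec, text in decoded.items():
    # Show the length and code points rather than the glyphs.
    print(codec, len(text), [hex(ord(ch)) for ch in text])
```

The single-byte codecs do not fail, they just produce an extra character (U+00C2) in front of the non-breaking space.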

Back to the original purpose, as I'm scraping the page, I'm storing
the field/value pair in a dictionary with each iteration through table
elements on the page. This is all fine, until a value is found that
contains the offending bytestring. I have attempted to coerce all
value strings into an encoding, but Python doesn't seem to like that
when the string is already Unicode:

valuesDict[fieldString] = unicode(value, 'UTF-8')
TypeError: decoding Unicode is not supported
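
The TypeError arises because some values are already decoded text and cannot be decoded a second time. One possible guard is to decode only values that are still raw bytes. The helper name to_text below is hypothetical, and the sketch uses Python 3 types (bytes and str); in Python 2 the equivalent check would distinguish str from unicode:

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only if it is still raw bytes.

    Hypothetical helper for illustration, not part of any library.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Bytes are decoded; text passes through unchanged.
from_bytes = to_text(b"alert\xc2\xa0sent")
from_text = to_text("alert\u00a0sent")

print(from_bytes == from_text)
```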

The solution I've arrived at is to specify the encoding explicitly
both when reading and when writing value strings.

for k, v in valuesDict.iteritems():
    valuePair = ':'.join([k, v.encode('UTF-8')])
    [snip] ...
    wfile.write('|'.join(valueList))
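
An alternative sketch of the same idea: keep every value as text inside the program and let the file object encode once at the write boundary. It uses io.open with an explicit encoding, and is written in Python 3 syntax. The sample dictionary and the temporary output path are made up for illustration:

```python
import io
import os
import tempfile

# Hypothetical sample data; one value contains a non-breaking space.
values = {"status": "sent\u00a0ok", "id": "42"}

# Build the line entirely from text, with no per-value encode calls.
pairs = [":".join([key, val]) for key, val in sorted(values.items())]
line = "|".join(pairs)

# The file object performs the UTF-8 encoding when writing.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with io.open(path, "w", encoding="utf-8") as wfile:
    wfile.write(line)

# Read back the raw bytes to confirm what landed on disk.
with io.open(path, "rb") as rfile:
    written = rfile.read()

print(written)
```

Encoding in one place avoids mixing encoded and decoded strings in the same join, which is what triggers the implicit ASCII coercion.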

I'm not sure I have a question, but does this sound familiar to any
Unicode experts out there?

How should I handle these odd bytestring values? Am I doing it
correctly, or what could I improve?

Thanks!

John Roth · Sep 7, 2010

Since the page is UTF-8, consult one of the references that describe
how to decode UTF-8. As it turns out, the two bytes decode to Unicode
code point U+00A0 (hex A0), which is indeed a non-breaking space.

This is probably as good as any page: http://en.wikipedia.org/wiki/UTF-8
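
For the curious, the decoding can be worked out by hand from the two-byte UTF-8 layout (110xxxxx 10xxxxxx). This is only a sketch of the arithmetic for this one sequence, not a general decoder:

```python
b1, b2 = 0xC2, 0xA0

# A two-byte sequence starts with bits 110 and continues with bits 10.
is_two_byte_lead = (b1 >> 5) == 0b110
is_continuation = (b2 >> 6) == 0b10

# Keep the 5 payload bits of the lead byte and the 6 of the continuation.
code_point = ((b1 & 0x1F) << 6) | (b2 & 0x3F)

print(is_two_byte_lead, is_continuation)
print(hex(code_point))
```

The result matches what the built-in UTF-8 codec returns for the same bytes.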

John Roth
