HTMLParser and non-ascii html pages

Yaşar Arabacı · Sep 20, 2011

Hi,

I am using a simple subclass of HTMLParser like this:

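from HTMLParser import HTMLParser  # Python 2.7 module name

class LinkCollector(HTMLParser):

    def reset(self):
        # Start each parse with an empty list of collected URLs.
        self.links = []
        HTMLParser.reset(self)

    def handle_starttag(self, tag, attr):
        # Pick the attribute that holds the URL for this kind of tag.
        if tag in ("a", "link"):
            key = "href"
        elif tag in ("img", "script"):
            key = "src"
        else:
            return
        self.links.extend([v for k, v in attr if k == key])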

This gives the following error:

Traceback (most recent call last):
  File "downloader.py", line 209, in <module>
    if __name__ == "__main__": main()
  File "downloader.py", line 201, in main
    link_collect.feed(response)
  File "C:\Python27\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python27\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python27\lib\HTMLParser.py", line 252, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "C:\Python27\lib\HTMLParser.py", line 393, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)

The rest of the code is available as an attachment. Does anyone know how to
solve this?
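
One possible fix, offered as a sketch and not confirmed in this thread: in Python 2, this error usually appears when feed() is given a byte string and an attribute value contains both non-ASCII bytes and an entity such as &amp;. The entity is replaced with a unicode character, and joining it with the raw bytes triggers an implicit ASCII decode. Decoding the response to unicode before calling feed() generally avoids that. The collect_links helper and the "utf-8" default below are assumptions for illustration (0xc3 is a common UTF-8 lead byte, but the real encoding should come from the page's Content-Type header or meta tag).

```python
try:
    from HTMLParser import HTMLParser  # Python 2, as in the traceback
except ImportError:
    from html.parser import HTMLParser  # Python 3 fallback


class LinkCollector(HTMLParser):
    def reset(self):
        self.links = []
        HTMLParser.reset(self)

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            key = "href"
        elif tag in ("img", "script"):
            key = "src"
        else:
            return
        self.links.extend(v for k, v in attrs if k == key)


def collect_links(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: decode the raw response, then parse it.

    The encoding default is an assumption; pass the page's real charset.
    """
    # Decode first so the parser only ever sees unicode text.
    text = raw_bytes.decode(encoding, "replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links
```

With this in place, the call in main() would become something like link_collect.feed(response.decode("utf-8")), assuming response holds the raw bytes of the page.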
