HTMLParser can't read Japanese

Dodo

Here's a small script that reproduces the error,
running on Windows 7 with Python 3.1.

FILE : parseShift.py

import urllib.request as url
from html.parser import HTMLParser

class myParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start of %s tag : %s" % (tag, attrs))


test = myParser()
handle = url.urlretrieve("http://localhost/shift.html")  # (local filename, headers)
handleTemp = open( handle[0] , encoding="Shift-JIS" )  # decodes the bytes to str
test.feed( handleTemp.read() )
handleTemp.close()

FILE : shift.html (encoded Shift-JIS)

<p class="thisisclass (not_in_japanese) reading_this_should_be_ok">Some
random japanese
<p><strong>東方プロジェクト</strong> <a href="#" title="キャプテン・ムラサ">Link</a>

OUTPUT

Start of p tag : [('class', 'thisisclass (not_in_japanese) reading_this_should_be_ok')]
Start of p tag : []
Start of strong tag : []
Traceback (most recent call last):
  File "D:\Dorian\Python\parseShift.py", line 12, in <module>
    test.feed( handleTemp.read() )
  File "C:\Python31\lib\html\parser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python31\lib\html\parser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python31\lib\html\parser.py", line 268, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "D:\Dorian\Python\parseShift.py", line 6, in handle_starttag
    print("Start of %s tag : %s" % (tag, attrs))
  File "C:\Python31\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 44-52: character maps to <undefined>


any help?
Dorian
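
For readers without a local web server, the failure can probably be reproduced without any network access. The sketch below is an assumption-laden stand-in for the script above: it feeds the `<a>` line of shift.html to the parser as an already-decoded string and prints to an in-memory stream whose encoding plays the role of the console. The names `MyParser`, `reproduce` and `HTML` are made up for this illustration.

```python
import io
from html.parser import HTMLParser

# Last line of shift.html, already decoded to a Python str.
HTML = ('<p><strong>東方プロジェクト</strong> '
        '<a href="#" title="キャプテン・ムラサ">Link</a>')


class MyParser(HTMLParser):
    def __init__(self, out):
        super().__init__()
        self.out = out  # text stream standing in for the console

    def handle_starttag(self, tag, attrs):
        print("Start of %s tag : %s" % (tag, attrs), file=self.out)


def reproduce(encoding):
    """Parse HTML while printing to a stream that uses `encoding`.

    Returns None if everything could be written, otherwise the
    UnicodeEncodeError raised by the stream.
    """
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    parser = MyParser(stream)
    try:
        parser.feed(HTML)
        stream.flush()
    except UnicodeEncodeError as exc:
        return exc
    return None


if __name__ == "__main__":
    # Only ASCII is printed here, so this is safe on any console.
    print("utf-8 ok:", reproduce("utf-8") is None)
    print("cp1252 result:", type(reproduce("cp1252")).__name__)
```

If this behaves as expected, parsing succeeds in both cases and only the write to a cp1252 stream fails, which would point at the output side rather than at HTMLParser.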
 
Dodo

Alright, it's only a problem in the Windows cmd console;
in IDLE it works fine.

Is there any workaround?

Dorian

On 13/04/2010 13:40, Dodo wrote:
 
Stefan Behnel

Dodo, 13.04.2010 13:40:
Here's a small script that reproduces the error,
running on Windows 7 with Python 3.1.

FILE : parseShift.py

import urllib.request as url
from html.parser import HTMLParser

class myParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start of %s tag : %s" % (tag, attrs))

Your problem is the last line. Your terminal's encoding (cp1252, according
to the traceback) cannot represent the Japanese text, so print() raises an
exception here.

Either change your terminal encoding to a suitable encoding, or write the
text to an encoded file instead (see the 'encoding' option of the open()
function for that).

Stefan
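
A minimal sketch of the second suggestion, writing the tag information to an encoded file instead of the console. It assumes UTF-8 is an acceptable encoding for the log; `FileLoggingParser`, `log_tags` and the output path are hypothetical names chosen for this example, not part of the original script.

```python
from html.parser import HTMLParser


class FileLoggingParser(HTMLParser):
    """Writes start-tag information to a text stream instead of the console."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def handle_starttag(self, tag, attrs):
        self.out.write("Start of %s tag : %s\n" % (tag, attrs))


def log_tags(html_text, path):
    # The 'encoding' option of open() fixes the file's encoding, so the
    # result no longer depends on the console's code page.
    with open(path, "w", encoding="utf-8") as out:
        parser = FileLoggingParser(out)
        parser.feed(html_text)
        parser.close()
```

Calling `log_tags(handleTemp.read(), "tags.txt")` in place of `test.feed(...)` should leave the Japanese attribute values intact in tags.txt, which can then be opened in any editor that understands UTF-8.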
 
John Nagle

Yes. Try "cmd /u" to get a Unicode console.

The text has already been converted from Shift-JIS to Unicode
(by open() with encoding="Shift-JIS") before HTMLParser sees it,
so the "print" is outputting Unicode.

John Nagle
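
Because the string handed to print() is already Unicode, another possible workaround is to escape whatever the console encoding cannot represent before printing. This is only a sketch: `console_safe` is a hypothetical helper, and it assumes that seeing `\uXXXX` escapes in place of the Japanese characters is acceptable.

```python
import sys


def console_safe(text, encoding=None):
    """Return `text` with characters that `encoding` cannot represent
    replaced by backslash escapes.

    `encoding` defaults to the encoding of sys.stdout, falling back to
    ASCII when stdout does not report one.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    data = text.encode(encoding, errors="backslashreplace")
    return data.decode(encoding)
```

In the original handler this would be used as `print(console_safe("Start of %s tag : %s" % (tag, attrs)))`. On a cp1252 console the title would then show up as escapes such as `\u30e0` rather than raising UnicodeEncodeError, while a console that can encode the text would print it unchanged.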
 
