Once again a unicode question

Nicolas Evrard

Hello,

I'm puzzled by this test, which I ran while trying to convert an HTML
page to plain text. I can pass neither unicode nor str to feed(), so
how can I do this?

..nicoe@smarties:~$ python2.4
..Python 2.4.1c2 (#2, Mar 19 2005, 01:04:19)
..[GCC 3.3.5 (Debian 1:3.3.5-12)] on linux2
..Type "help", "copyright", "credits" or "license" for more information.
..>>> import formatter
..>>> import htmllib
..>>> html2txt = htmllib.HTMLParser(formatter.AbstractFormatter(formatter.DumbWriter()))
..>>> html2txt.feed(u'D\xe9but')
..Traceback (most recent call last):
.. File "<stdin>", line 1, in ?
.. File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
.. self.goahead(0)
.. File "/usr/lib/python2.4/sgmllib.py", line 120, in goahead
.. self.handle_data(rawdata[i:j])
.. File "/usr/lib/python2.4/htmllib.py", line 65, in handle_data
.. self.formatter.add_flowing_data(data)
.. File "/usr/lib/python2.4/formatter.py", line 197, in add_flowing_data
.. self.writer.send_flowing_data(data)
.. File "/usr/lib/python2.4/formatter.py", line 421, in send_flowing_data
.. write(word)
..UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
..>>> html2txt.feed(u'D\xe9but'.encode('latin1'))
..Traceback (most recent call last):
.. File "<stdin>", line 1, in ?
.. File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.. self.rawdata = self.rawdata + data
..UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)
..>>> html2txt.feed('Début')
..Traceback (most recent call last):
.. File "<stdin>", line 1, in ?
.. File "/usr/lib/python2.4/sgmllib.py", line 94, in feed
.. self.rawdata = self.rawdata + data
..UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
..>>>
 
Serge Orlov

Nicolas said:
Hello,

I'm puzzled by this test, which I ran while trying to convert an HTML
page to plain text. I can pass neither unicode nor str to feed(), so
how can I do this?

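It seems the parser is left in a broken state after the first exception:
the unicode string from the failed feed() call is still sitting in its
internal buffer (self.rawdata in the traceback), so every later byte
string is implicitly decoded as ASCII when it is appended to that buffer.
Start over with a fresh parser and feed only binary strings (str) to it.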

Serge.
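
For readers on Python 3, where htmllib and sgmllib no longer exist, the same advice (a fresh parser per document, and only one string type ever reaching feed()) can be sketched with html.parser instead. This is a swapped-in technique, not the code from the session above; TextExtractor and html_to_text are hypothetical names, and the latin-1 default is only an assumption taken from the example in the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text nodes of an HTML document."""

    def __init__(self):
        # convert_charrefs=True turns references such as &eacute; into text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def html_to_text(raw, encoding="latin-1"):
    # Decode bytes up front so the parser only ever sees one string type.
    # The default encoding is an assumption; use the page's real charset.
    if isinstance(raw, bytes):
        raw = raw.decode(encoding)
    parser = TextExtractor()  # fresh parser per document: no stale buffer
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(html_to_text("<p>D\xe9but</p>".encode("latin-1")))
```

Creating the parser inside the function means a failed call cannot leave leftover data behind for the next one, which is the situation the tracebacks above show.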
 
Nicolas Evrard

* Serge Orlov [23:45 26/03/05 CET]:
Seems like the parser is in the broken state after the first exception.
Feed only binary strings to it.

That was it, thank you very much.
 
