encoding in lxml

jasiu85 · Nov 3, 2008

Hey,

I have a problem with character encoding in LXML. Here's how it goes:

I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not. I parse the
document like this:

from lxml.etree import HTML
html_doc = HTML(string_with_document)

Then I retrieve some info from the document with XPath:

xpath_nodes = html_doc.xpath('/html/body/something')

Now I'm guaranteed that the xpath_nodes list contains only one
element, so I read its content:

xpath_nodes[0].text

And I get an exception here. The exception comes from the text
property of an Element object. The problem is that the text contains a
non-UTF-8 character. LXML seems to be using strict decoding and I can't
find a way to make it ignore the error. Is there anything I can do to
retrieve the text without getting an exception?

Regards,

Mike

pjacobi.de · Nov 3, 2008

Hi Mike,

jasiu85 said:
I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not.

There are probably a host of more lightweight solutions, but you can opt
to sanitize incoming HTML with HTML Tidy (Python bindings are available).

It will replace invalid UTF-8 bytes with U+FFFD. It will not
guess a better encoding to use.

If you are sure you don't have HTML sloppiness to correct but only the
occasional wrong byte, even decoding (with fallback) and re-encoding
using the standard codecs will do.
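
A minimal sketch of that decode-and-re-encode idea, assuming the document
arrives as a byte string. The helper names scrub_utf8 and parse_scrubbed
are made up for illustration; errors="replace" is the standard codec
error handler that substitutes U+FFFD for undecodable bytes.

```python
def scrub_utf8(raw: bytes) -> bytes:
    """Return raw as valid UTF-8, replacing undecodable bytes with U+FFFD."""
    # Decode leniently, then encode again so the parser only sees valid UTF-8.
    return raw.decode("utf-8", errors="replace").encode("utf-8")


def parse_scrubbed(raw: bytes):
    """Parse the scrubbed bytes with lxml (hypothetical helper, needs lxml)."""
    from lxml import etree  # imported here so scrub_utf8 works without lxml

    parser = etree.HTMLParser(encoding="utf-8")
    return etree.HTML(scrub_utf8(raw), parser=parser)


# 0xE9 is "e acute" in Latin-1 but an incomplete sequence in UTF-8.
sample = b"<html><body><p>caf\xe9</p></body></html>"
cleaned = scrub_utf8(sample)
```

This loses the original character (it becomes U+FFFD), so it only suits
cases where the occasional damaged character is acceptable.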

Regards,
Peter

Stefan Behnel · Nov 3, 2008

jasiu85 said:
I have a problem with character encoding in LXML. Here's how it goes:

I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not.

You can instantiate your own HTML parser and pass encoding="utf-8". That way,
when it's not UTF-8, you will get an exception at parse time, which allows you
to reparse the document with another encoding (say, ISO-8859-1) to get the
correct content.
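
A rough sketch of the try-UTF-8-then-fall-back idea. Whether the parser
itself raises on bad bytes may depend on the lxml version and on its
recover setting, so this variant checks the bytes in plain Python first
and then hands the chosen encoding to the parser. The helper names
pick_encoding and parse_html are made up for illustration.

```python
def pick_encoding(raw: bytes, candidates=("utf-8", "iso-8859-1")) -> str:
    """Return the first candidate encoding that decodes raw without errors."""
    for name in candidates:
        try:
            raw.decode(name)  # strict decoding, raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings fit the data")


def parse_html(raw: bytes):
    """Parse raw with an explicit encoding (hypothetical helper, needs lxml)."""
    from lxml import etree  # imported here so pick_encoding works without lxml

    parser = etree.HTMLParser(encoding=pick_encoding(raw))
    return etree.HTML(raw, parser=parser)
```

Note that ISO-8859-1 maps every byte to a character, so as the last
candidate it acts as a catch-all and never fails.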

Stefan
