encoding in lxml

Discussion in 'Python' started by jasiu85, Nov 3, 2008.

  1. jasiu85

    jasiu85 Guest

    Hey,

    I have a problem with character encoding in LXML. Here's how it goes:

    I read an HTML document from a third-party site. It is supposed to be
    in UTF-8, but unfortunately from time to time it's not. I parse the
    document like this:

    html_doc = HTML(string_with_document)

    Then I retrieve some info from the document with XPath:

    xpath_nodes = html_doc.xpath('/html/body/something')

    Now I'm guaranteed that the xpath_nodes list contains only one
    element. So I read its content:

    xpath_nodes[0].text

    And I get an exception here. The exception is coming from the text
    property of an Element object. The problem is that the text contains a
    non-utf8 character. LXML seems to be using strict decoding and I can't
    find a way to make it ignore the error. Is there anything I can do to
    retrieve the text without getting an exception?

    Regards,

    Mike
     
    jasiu85, Nov 3, 2008
    #1

  2. jasiu85

    Guest

    Hi Mike,

    > I read an HTML document from a third-party site. It is supposed to be
    > in UTF-8, but unfortunately from time to time it's not.


    There will be a host of more lightweight solutions, but you can opt
    to sanitize incoming HTML with HTML Tidy (Python binding available).

    It will replace invalid UTF-8 bytes with U+FFFD. It will not
    guess a better encoding to use.

    If you are sure you don't have HTML sloppiness to correct but only
    the occasional wrong byte, even decoding (with fallback) and encoding
    using the standard codecs module will do.
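
A minimal sketch of that decode-and-re-encode step, using only the built-in
bytes/str codecs. The function name sanitize_utf8 is made up for this
illustration; the "replace" error handler stands in for the fallback and
turns each undecodable byte sequence into U+FFFD.

```python
def sanitize_utf8(raw: bytes) -> bytes:
    """Return raw as valid UTF-8, replacing undecodable bytes with U+FFFD."""
    # errors="replace" substitutes U+FFFD instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="replace")
    # Re-encoding gives clean UTF-8 bytes that a strict parser can consume.
    return text.encode("utf-8")
```

The sanitized bytes can then be handed to the HTML parser as before. Using
errors="ignore" instead would silently drop the bad bytes rather than mark
them.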

    Regards,
    Peter
     
    , Nov 3, 2008
    #2

  3. jasiu85 wrote:
    > I have a problem with character encoding in LXML. Here's how it goes:
    >
    > I read an HTML document from a third-party site. It is supposed to be
    > in UTF-8, but unfortunately from time to time it's not.


    You can instantiate your own HTML parser and pass encoding="utf-8". That way,
    when it's not UTF-8, you will get an exception at parse time, which allows you
    to reparse the document with another encoding (say, ISO-8859-1) to get the
    correct content.
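
A rough sketch of that try-then-reparse idea. Whether the parser itself
raises on bad bytes may depend on the lxml/libxml2 version, so this variant
checks the bytes with a strict decode up front and only then builds the
parser. The helper names pick_encoding and parse_html and the candidate list
are assumptions made for the example, not part of lxml.

```python
def pick_encoding(raw, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)  # strict decoding: raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("no candidate encoding fits the document")


def parse_html(raw):
    """Parse raw bytes with an HTML parser bound to the chosen encoding."""
    # lxml is third-party; imported here so pick_encoding works without it.
    from lxml import etree

    parser = etree.HTMLParser(encoding=pick_encoding(raw))
    return etree.HTML(raw, parser)
```

Since ISO-8859-1 maps every byte to a character, putting it last means the
fallback always succeeds, though the text is only correct if the document
really was in that encoding.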

    Stefan
     
    Stefan Behnel, Nov 3, 2008
    #3
