Using lxml to screen scrape a site, problem with charset

  • Thread starter Дамјан Георгиевски
  • Start date

Дамјан Георгиевски

So, I'm using lxml to screen scrape a site that uses the cyrillic
alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
...content-type.. charset=..> header, but does have an HTTP header that
specifies the charset... so it is standards-compliant enough.

Now when I run this code:

from lxml import html
doc = html.parse('http://a1.com.mk/')
root = doc.getroot()
title = root.cssselect('head title')[0]
print title.text

the title.text is a unicode string, but it has been wrongly decoded as
latin1 -> unicode

So, is this a deficiency/bug in lxml, or am I doing something wrong?
Also, what are my other options here?


I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if it matters.

--
дамјан ( http://softver.org.mk/damjan/ )

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
 

Tim Arnold

Дамјан Георгиевски said:
So, I'm using lxml to screen scrape a site that uses the cyrillic
alphabet (windows-1251 encoding). [...] the title.text is a unicode
string, but it has been wrongly decoded as latin1 -> unicode.

The way I do that is to open the file with codecs, encoding=cp1251, read it
into a variable, and feed that to the parser.

--Tim
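A minimal, self-contained sketch of Tim's approach (the file name and the title text are assumptions for the example, not from the thread):

```python
import codecs

from lxml import html

# Hypothetical local snapshot of the page, saved to disk in windows-1251.
snapshot = '<html><head><title>\u0414\u0430\u043c\u0458\u0430\u043d</title></head></html>'
with open('a1.html', 'wb') as f:
    f.write(snapshot.encode('cp1251'))

# codecs.open decodes while reading, so lxml receives real unicode text
# and never has to guess (and get wrong) the encoding.
with codecs.open('a1.html', encoding='cp1251') as f:
    doc = html.fromstring(f.read())

title = doc.find('.//title').text  # u'Дамјан'
```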
 

Stefan Behnel

Tim said:
Дамјан Георгиевски said:
[...] the title.text is a unicode string, but it has been wrongly
decoded as latin1 -> unicode.

The way I do that is to open the file with codecs, encoding=cp1251, read it
into a variable, and feed that to the parser.

Yes, if you know the encoding through an external source (especially when
parsing broken HTML), it's best to pass in either a decoded string or a
decoding file-like object, as in

tree = lxml.html.parse( codecs.open(..., encoding='...') )

You can also create a parser with an encoding override:

parser = etree.HTMLParser(encoding='...', **other_options)

Stefan
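Stefan's second option, the parser-level encoding override, can be sketched like this (the raw markup and its title text are made up for the example; the charset is assumed to come from the HTTP header):

```python
from lxml import etree

# Raw bytes as they might arrive over HTTP; the charset is known only
# from the HTTP Content-Type header, not from the markup itself.
raw = '<html><head><title>\u041d\u0430\u0441\u043b\u043e\u0432</title></head></html>'.encode('cp1251')

# Tell the parser the encoding up front instead of pre-decoding the bytes.
parser = etree.HTMLParser(encoding='windows-1251')
root = etree.fromstring(raw, parser)
title = root.find('.//title').text  # u'Наслов'
```

This keeps the fetch-and-parse pipeline working on bytes end to end, which is convenient when the document itself may be broken.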
 
