Using lxml to screen scrape a site, problem with charset

  • Thread starter Дамјан Георгиевски
  • Start date

Дамјан Георгиевски

So, I'm using lxml to screen scrape a site that uses the cyrillic
alphabet (windows-1251 encoding). The site's HTML doesn't have the <META
...content-type.. charset=..> header, but does have an HTTP header that
specifies the charset... so it is standards-compliant enough.

Now when I run this code:

from lxml import html
doc = html.parse('http://a1.com.mk/')
root = doc.getroot()
title = root.cssselect('head title')[0]
print title.text

the title.text is a unicode string, but it has been wrongly decoded as
latin1 -> unicode

So, is this a deficiency/bug in lxml, or am I doing something wrong?
Also, what are my other options here?


I'm running Python 2.6.1 and python-lxml 2.1.4 on Linux, if it matters.

--
дамјан ( http://softver.org.mk/damjan/ )

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
 

Tim Arnold

Дамјан Георгиевски said:
So, I'm using lxml to screen scrape a site that uses the cyrillic
alphabet (windows-1251 encoding). [...] the title.text is a unicode
string, but it has been wrongly decoded as latin1 -> unicode.

The way I do that is to open the file with codecs, encoding=cp1251, read it
into a variable, and feed that to the parser.

--Tim
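A minimal, self-contained sketch of Tim's approach (the file name and the title text are assumptions for the example, not from the thread):

```python
import codecs

from lxml import html

# Hypothetical local snapshot of the page, saved to disk in windows-1251.
snapshot = '<html><head><title>\u0414\u0430\u043c\u0458\u0430\u043d</title></head></html>'
with open('a1.html', 'wb') as f:
    f.write(snapshot.encode('cp1251'))

# codecs.open decodes while reading, so lxml receives real unicode text
# and never has to guess (and get wrong) the encoding.
with codecs.open('a1.html', encoding='cp1251') as f:
    doc = html.fromstring(f.read())

title = doc.find('.//title').text  # u'Дамјан'
```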
 

Stefan Behnel

Tim said:
Дамјан Георгиевски said:
[...] the title.text is a unicode string, but it has been wrongly
decoded as latin1 -> unicode.

The way I do that is to open the file with codecs, encoding=cp1251, read it
into a variable, and feed that to the parser.

Yes, if you know the encoding through an external source (especially when
parsing broken HTML), it's best to pass in either a decoded string or a
decoding file-like object, as in

tree = lxml.html.parse( codecs.open(..., encoding='...') )

You can also create a parser with an encoding override:

parser = etree.HTMLParser(encoding='...', **other_options)

Stefan
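Stefan's second option, the parser-level encoding override, can be sketched like this (the raw markup and its title text are made up for the example; the charset is assumed to come from the HTTP header):

```python
from lxml import etree

# Raw bytes as they might arrive over HTTP; the charset is known only
# from the HTTP Content-Type header, not from the markup itself.
raw = '<html><head><title>\u041d\u0430\u0441\u043b\u043e\u0432</title></head></html>'.encode('cp1251')

# Tell the parser the encoding up front instead of pre-decoding the bytes.
parser = etree.HTMLParser(encoding='windows-1251')
root = etree.fromstring(raw, parser)
title = root.find('.//title').text  # u'Наслов'
```

This keeps the fetch-and-parse pipeline working on bytes end to end, which is convenient when the document itself may be broken.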
 
