Using utidylib, empty string returned in some cases

Boris

Hello

I'm using Debian Linux, Python 2.4.4, and utidylib
(http://utidylib.berlios.de/). I wrote simple functions to get a web page,
convert it from windows-1251 to UTF-8, and then clean the HTML with
utidylib.

Here are the two pages I use to check my program:
http://www.ya.ru/ (in this case everything works OK)
http://www.yellow-pages.ru/rus/nd2/qu5/ru15632 (in this case tidy
returns nothing, just an empty string)


code:

--------------

# coding: utf-8
import urllib2, tidy

def get_page(url):
    # Send a browser-like User-Agent header with the request.
    user_agent = ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; '
                  '.NET CLR 1.1.4322; .NET CLR 2.0.50727)')
    headers = { 'User-Agent' : user_agent }

    # data=None makes this a GET; any non-None data (even {}) makes
    # urllib2 send a POST instead.
    req = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(req)
    page = response.read()

    return page

def convert_1251(page):
    # Decode the windows-1251 bytes, then re-encode them as UTF-8.
    p = page.decode('windows-1251')
    u = p.encode('utf-8')
    return u

def clean_html(page):
    tidy_options = { 'output_xhtml' : 1,
                     'add_xml_decl' : 1,
                     'indent' : 1,
                     'input-encoding' : 'utf8',
                     'output-encoding' : 'utf8',
                     'tidy_mark' : 1,
                     }
    cleaned_page = tidy.parseString(page, **tidy_options)
    return cleaned_page

test_url = 'http://www.yellow-pages.ru/rus/nd2/qu5/ru15632'
#test_url = 'http://www.ya.ru/'

#f = open('yp.html', 'r')
#p = f.read()

print clean_html(convert_1251(get_page(test_url)))
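
One way to narrow down where the output disappears is to run the steps one at a time and see which one first returns nothing. The following is only a rough sketch, written for a current Python 3 interpreter rather than the Python 2.4 used above. The helper name first_empty_stage and the stand-in stages are hypothetical; the real stages would be convert_1251 and a wrapper around clean_html that returns its result as a string.

```python
def first_empty_stage(data, stages):
    """Run (name, function) stages in order and return the name of the
    first stage whose output is empty, or None if none is empty."""
    for name, func in stages:
        data = func(data)
        if not data:
            return name
    return None

# Hypothetical stand-ins for the real steps in the program above.
def convert(page):
    # Same idea as convert_1251: windows-1251 bytes in, UTF-8 bytes out.
    return page.decode('windows-1251').encode('utf-8')

def fake_clean(page):
    # Simulates the failure from the question: always returns nothing.
    return b''

stages = [('convert', convert), ('clean', fake_clean)]

# Three Cyrillic letters encoded as windows-1251 serve as sample input.
culprit = first_empty_stage(b'\xcf\xf0\xe8', stages)
print(culprit)
```

If the conversion step already comes back empty, the problem is in the download or the decoding, not in tidy.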
 
Gabriel Genellina

> I'm using Debian Linux, Python 2.4.4, and utidylib
> (http://utidylib.berlios.de/). I wrote simple functions to get a web page,
> convert it from windows-1251 to UTF-8, and then clean the HTML with
> utidylib.

Why the intermediate conversion? I don't know utidylib, but can't you feed
it the original page, in its original encoding? If the page itself
contains a "meta http-equiv" tag stating its content-type and charset, that
declaration won't be valid anymore once you re-encode the page.
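
To illustrate the point about the meta tag, here is a small sketch, written for a current Python 3 interpreter rather than Python 2.4. It assumes the charset is declared in a meta tag that a simple regular expression can find, which will not hold for every page. The names declared_charset and reencode and the sample page are made up for this example; nothing here is part of utidylib.

```python
import re

# Rough pattern for charset=... inside a <meta ...> tag; not a full HTML parser.
_META_CHARSET = re.compile(
    br'(<meta[^>]*charset=["\']?)([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def declared_charset(page):
    """Return the charset named in the page's meta tag, or None."""
    match = _META_CHARSET.search(page)
    if match is None:
        return None
    return match.group(2).decode('ascii').lower()

def reencode(page, source, target='utf-8'):
    """Transcode the page bytes and rewrite the first meta charset to match."""
    converted = page.decode(source).encode(target)
    return _META_CHARSET.sub(
        lambda m: m.group(1) + target.encode('ascii'), converted, count=1)

# Hypothetical sample page; the body is 'Privet' ('hello') in Cyrillic.
sample = ('<html><head><meta http-equiv="Content-Type" '
          'content="text/html; charset=windows-1251"></head>'
          '<body>\u041f\u0440\u0438\u0432\u0435\u0442</body></html>'
          ).encode('windows-1251')

# Plain transcoding leaves the old declaration in place.
naive = sample.decode('windows-1251').encode('utf-8')
fixed = reencode(sample, 'windows-1251')

print(declared_charset(naive))
print(declared_charset(fixed))
```

After plain transcoding, the bytes are UTF-8 but the page still declares windows-1251, so a parser that trusts the meta tag may misread the text. Rewriting the declaration, or skipping the conversion and passing the original bytes with the matching input encoding, keeps the two consistent.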
 
