Z
Zhang Le
Hello,
I'm writing a little Tkinter application to retrieve news from
various news websites such as http://news.bbc.co.uk/, and display them
in a TK listbox. All I want are news title and url information. Since
each news site has a different layout, I think I need some
template-based techniques to build news extractors for each site,
ignoring information such as table, image, advertise, flash that I'm
not interested in.
So far I have built a simple GUI using Tkinter, a link extractor
using HTMLlib to extract HREFs from web page. But I really have no idea
how to extract news from web site. Is anyone aware of general
techniques for extracting web news? Or can point me to some falimiar
projects.
I have seen some search engines doing this, for
example:http://news.ithaki.net/, but do not know the technique used.
Any tips?
Thanks in advance,
Zhang Le
I'm writing a little Tkinter application to retrieve news from
various news websites such as http://news.bbc.co.uk/, and display them
in a TK listbox. All I want are news title and url information. Since
each news site has a different layout, I think I need some
template-based techniques to build news extractors for each site,
ignoring information such as table, image, advertise, flash that I'm
not interested in.
So far I have built a simple GUI using Tkinter, a link extractor
using HTMLlib to extract HREFs from web page. But I really have no idea
how to extract news from web site. Is anyone aware of general
techniques for extracting web news? Or can point me to some falimiar
projects.
I have seen some search engines doing this, for
example:http://news.ithaki.net/, but do not know the technique used.
Any tips?
Thanks in advance,
Zhang Le