extract news article from web


Zhang Le

Hello,
I'm writing a little Tkinter application to retrieve news from
various news websites such as http://news.bbc.co.uk/ and display it
in a Tk listbox. All I want is the title and URL of each story. Since
each news site has a different layout, I think I need some
template-based technique to build a news extractor for each site,
ignoring things such as tables, images, advertisements, and Flash
that I'm not interested in.

So far I have built a simple GUI using Tkinter and a link extractor
using htmllib to extract HREFs from a web page. But I really have no
idea how to extract news from a web site. Is anyone aware of general
techniques for extracting web news? Or can you point me to some
similar projects?
I have seen some search engines doing this, for example
http://news.ithaki.net/, but I do not know the technique used.
Any tips?

Thanks in advance,

Zhang Le
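[A note for modern readers: htmllib is long gone from Python, but the
HREF-extraction step Zhang describes can be sketched with its stdlib
successor, html.parser. The sample HTML and the LinkExtractor class
below are invented for illustration.]

```python
# Sketch of the HREF-extractor step using the stdlib html.parser module
# (htmllib's modern stand-in); the sample HTML is invented.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (text, href) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []       # finished (text, href) pairs
        self._href = None     # href of the <a> we are inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

p = LinkExtractor()
p.feed('<p><a href="/news/42">A headline</a> and <a href="/other">more</a></p>')
print(p.links)   # [('A headline', '/news/42'), ('more', '/other')]
```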
 

Steve Holden

Zhang said:
[...]
Is anyone aware of general techniques for extracting web news? Or can
you point me to some similar projects?
[...]
Well, for Python-related news I suck stuff from O'Reilly's Meerkat
service using XML-RPC. Once upon a time I used to update
www.holdenweb.com every four hours, but until my current hosting
situation changes I can't be arsed.

However, the code to extract the news is pretty simple. Here's the
whole program, modulo newsreader wrapping. It would be shorter if I
weren't stashing the extracted links in a relational database:

#!/usr/bin/python
#
# mkcheck.py: Get a list of article categories from the O'Reilly Network
# and update the appropriate section database
#
import xmlrpclib
server = xmlrpclib.Server("http://www.oreillynet.com/meerkat/xml-rpc/server.php")

from db import conn, pmark
import mx.DateTime as dt
curs = conn.cursor()

pyitems = server.meerkat.getItems(
    {'search': '/[Pp]ython/', 'num_items': 10, 'descriptions': 100})

sqlinsert = "INSERT INTO PyLink (pylWhen, pylURL, pylDescription) " \
            "VALUES (%s, %s, %s)" % (pmark, pmark, pmark)
for itm in pyitems:
    description = itm['description'] or itm['title']
    if itm['link'] and not ("<" in description):
        curs.execute("""SELECT COUNT(*) FROM PyLink
                        WHERE pylURL=%s""" % pmark, (itm['link'], ))
        newlink = curs.fetchone()[0] == 0
        if newlink:
            print "Adding", itm['link']
            curs.execute(sqlinsert,
                (dt.DateTimeFromTicks(int(dt.now())), itm['link'],
                 description))

conn.commit()
conn.close()

Similar techniques can be used on many other sites, and you will find
that (some) RSS feeds are a fruitful source of news.

regards
Steve
 

Steve Holden

Steve Holden wrote:

[...]
However, the code to extract the news is pretty simple. Here's the whole
program, modulo newsreader wrapping. It would be shorter if I weren't
stashing the extracted links it a relational database:
[...]

I see that, as is so often the case, I only told half the story, and
you will be wondering what the "db" module does. The main answer is
that it adapts the same logic to two different database modules, in an
attempt to build a little portability into the system (which may one
day be open sourced).

The point is that MySQLdb requires a "%s" in queries to mark a
substitutable parameter, whereas mxODBC requires a "?". In order to work
around this difference the db module is imported by anything that uses
the database. This makes it easier to migrate between different database
technologies, though still far from painless, and allows testing by
accessing a MySQL database directly and via ODBC as another option.

Significant strings have been modified to protect the innocent.
--------
#
# db.py: establish a database connection with
# the appropriate parameter style
#
try:
    import MySQLdb as db
    conn = db.connect(host="****", db="****",
                      user="****", passwd="****")
    pmark = "%s"
    print "Using MySQL"
except ImportError:
    import mx.ODBC.Windows as db
    conn = db.connect("****", user="****", password="****")
    pmark = "?"
    print "Using ODBC"
 

Zhang Le

Thanks for the hint. The XML-RPC service is great, but I want some
general techniques to parse news information out of ordinary HTML
pages.

Currently I'm looking at a script-based approach found at:
http://www.namo.com/products/handstory/manual/hsceditor/
Users can write simple templates to extract certain fields from a
web page. Unfortunately, it is not open source, so I cannot look
inside the black box. :-(

Zhang Le
 

Steve Holden

Zhang said:
[...] I want some general techniques to parse news information in the
usual html pages. [...]
That's a very large topic, and not one that I could claim to be expert
on, so let's hope that others will pitch in with their favorite
techniques. Otherwise it's down to providing individual parsers for each
service you want to scan, and maintaining the parsers as each group of
designers modifies their pages.

You might want to look at BeautifulSoup, which is a module for
extracting stuff from (possibly) irregularly-formed HTML.
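[A minimal sketch of that approach, assuming the bs4 package
(BeautifulSoup's modern incarnation) and an invented HTML snippet with
a made-up "headline" class as the one site-specific rule:]

```python
# A sketch of title/URL extraction with BeautifulSoup (the modern bs4
# package); the HTML snippet and class names are invented.
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="headline"><a href="/news/1">First story</a></div>
  <div class="headline"><a href="/news/2">Second story</a></div>
  <div class="advert"><a href="/buy">Buy stuff</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# One site-specific rule: headlines live in <div class="headline">,
# so adverts and other links are ignored automatically.
stories = [(a.get_text(), a["href"])
           for div in soup.find_all("div", class_="headline")
           for a in div.find_all("a", href=True)]
print(stories)   # [('First story', '/news/1'), ('Second story', '/news/2')]
```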

regards
Steve
 

Fuzzyman

If you have a reliably structured page, then you can write a custom
parser. As Steve points out, BeautifulSoup would be a very good place
to start.

This is the problem that RSS was designed to solve. Many news sites
will supply exactly the information you want as an RSS feed. You can
then use the Universal Feed Parser to process the feed.

The module you need for fetching the web pages (in case you didn't
know) is urllib2. There is a great article on fetching web pages in
the current issue of Pyzine. See http://www.pyzine.com :)
Regards,

Fuzzy
http://www.voidspace.org.uk/python/index.shtml
 

Simon Brunning

Zhang Le wrote:
Hello,
I'm writing a little Tkinter application to retrieve news from
various news websites such as http://news.bbc.co.uk/, and display them
in a TK listbox. All I want are news title and url information.

Well, the BBC publishes an RSS feed[1], as do most sites like it. You
can read RSS feeds with Mark Pilgrim's Universal Feed Parser[2].

Granted, you can't read *every* site like this. But I daresay that
*most* news related sites publish feeds of some kind these days. Where
they do, using the feed is a *far* better idea than trying to parse
the HTML.
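[To show why the feed is the better target: the title and link are
already labelled fields in the XML. Feed Parser handles all the messy
real-world cases, but for a well-formed RSS 2.0 document the idea can
be sketched with the stdlib alone; the feed below is invented.]

```python
# What an RSS feed gives you, parsed here with the stdlib instead of
# feedparser; the feed XML is a made-up minimal RSS 2.0 document.
import xml.etree.ElementTree as ET

rss = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Story one</title><link>http://example.org/1</link></item>
  <item><title>Story two</title><link>http://example.org/2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
# Each <item> carries the story title and URL as separate elements,
# so no layout-specific scraping rules are needed.
items = [(item.findtext("title"), item.findtext("link"))
         for item in root.iter("item")]
print(items)
```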

--
Cheers,
Simon B,
(e-mail address removed),
http://www.brunningonline.net/simon/blog/
[1] http://news.bbc.co.uk/2/hi/help/3223484.stm
[2] http://feedparser.org/
 
