extract news article from web

Discussion in 'Python' started by Zhang Le, Dec 22, 2004.

  1. Zhang Le

    Zhang Le Guest

    Hello,
    I'm writing a little Tkinter application to retrieve news from
    various news websites such as http://news.bbc.co.uk/, and display them
    in a Tk listbox. All I want is the title and URL of each news item. Since
    each news site has a different layout, I think I need some
    template-based technique to build a news extractor for each site,
    ignoring things such as tables, images, advertisements and Flash that
    I'm not interested in.

    So far I have built a simple GUI using Tkinter and a link extractor
    using htmllib to extract HREFs from a web page. But I really have no idea
    how to extract news from a web site. Is anyone aware of general
    techniques for extracting web news? Or can anyone point me to some
    similar projects?
    I have seen some search engines doing this, for
    example http://news.ithaki.net/, but I do not know the technique used.
    Any tips?

    Thanks in advance,

    Zhang Le
     
    Zhang Le, Dec 22, 2004
    #1

  2. Zhang Le

    Steve Holden Guest

    Zhang Le wrote:

    > Hello,
    > I'm writing a little Tkinter application to retrieve news from
    > various news websites such as http://news.bbc.co.uk/, and display them
    > in a Tk listbox. All I want is the title and URL of each news item. Since
    > each news site has a different layout, I think I need some
    > template-based technique to build a news extractor for each site,
    > ignoring things such as tables, images, advertisements and Flash that
    > I'm not interested in.
    >
    > So far I have built a simple GUI using Tkinter and a link extractor
    > using htmllib to extract HREFs from a web page. But I really have no idea
    > how to extract news from a web site. Is anyone aware of general
    > techniques for extracting web news? Or can anyone point me to some
    > similar projects?
    > I have seen some search engines doing this, for
    > example http://news.ithaki.net/, but I do not know the technique used.
    > Any tips?
    >
    > Thanks in advance,
    >
    > Zhang Le
    >

    Well, for Python-related news I suck stuff from O'Reilly's Meerkat
    service using XML-RPC. Once upon a time I used to update
    www.holdenweb.com every four hours, but until my current hosting
    situation changes I can't be arsed.

    However, the code to extract the news is pretty simple. Here's the whole
    program, modulo newsreader wrapping. It would be shorter if I weren't
    stashing the extracted links in a relational database:

    #!/usr/bin/python
    #
    # mkcheck.py: Get a list of article categories from the O'Reilly Network
    # and update the appropriate section database
    #
    import xmlrpclib
    server = xmlrpclib.Server(
        "http://www.oreillynet.com/meerkat/xml-rpc/server.php")

    # db supplies the connection and the driver's parameter marker
    from db import conn, pmark
    import mx.DateTime as dt
    curs = conn.cursor()

    # Ask Meerkat for up to ten items that mention Python
    pyitems = server.meerkat.getItems(
        {'search': '/[Pp]ython/', 'num_items': 10, 'descriptions': 100})

    sqlinsert = ("INSERT INTO PyLink (pylWhen, pylURL, pylDescription) "
                 "VALUES(%s, %s, %s)" % (pmark, pmark, pmark))
    for itm in pyitems:
        description = itm['description'] or itm['title']
        # Skip items with no link or with markup in the description
        if itm['link'] and not ("<" in description):
            curs.execute("""SELECT COUNT(*) FROM PyLink
                            WHERE pylURL=%s""" % pmark, (itm['link'], ))
            newlink = curs.fetchone()[0] == 0
            if newlink:
                print "Adding", itm['link']
                curs.execute(sqlinsert,
                             (dt.DateTimeFromTicks(int(dt.now())),
                              itm['link'], description))

    conn.commit()
    conn.close()

    Similar techniques can be used on many other sites, and you will find
    that (some) RSS feeds are a fruitful source of news.
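
For readers on Python 3, where xmlrpclib became xmlrpc.client, the request sent by a call like the one in mkcheck.py can be inspected without touching the network. This is only a rough sketch: it builds and decodes the XML-RPC payload for meerkat.getItems and never contacts a server, and the helper name build_request is made up for illustration.

```python
import xmlrpc.client


def build_request(pattern="/[Pp]ython/", num_items=10, descriptions=100):
    """Return the XML-RPC request body for a meerkat.getItems call.

    build_request is a hypothetical helper; the dictionary keys mirror
    the ones used in mkcheck.py.
    """
    params = ({"search": pattern,
               "num_items": num_items,
               "descriptions": descriptions},)
    return xmlrpc.client.dumps(params, methodname="meerkat.getItems")


payload = build_request()

# Decode the payload again to see what a server would receive.
decoded_params, method = xmlrpc.client.loads(payload)
```

Printing payload shows the raw XML, which can help when a service rejects a request and you need to see exactly what was sent.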

    regards
    Steve
    --
    Steve Holden http://www.holdenweb.com/
    Python Web Programming http://pydish.holdenweb.com/
    Holden Web LLC +1 703 861 4237 +1 800 494 3119
     
    Steve Holden, Dec 22, 2004
    #2

  3. Zhang Le

    Steve Holden Guest

    Steve Holden wrote:

    [...]

    > However, the code to extract the news is pretty simple. Here's the whole
    > program, modulo newsreader wrapping. It would be shorter if I weren't
    > stashing the extracted links in a relational database:
    >

    [...]

    I see that, as is so often the case, I only told half the story, and you
    will be wondering what the "db" module does. The main answer is that it
    adapts the same logic to two different database modules in an attempt to
    build a little portability into the system (which may one day be open
    sourced).

    The point is that MySQLdb requires a "%s" in queries to mark a
    substitutable parameter, whereas mxODBC requires a "?". In order to work
    around this difference the db module is imported by anything that uses
    the database. This makes it easier to migrate between different database
    technologies, though still far from painless, and allows testing by
    accessing a MySQL database directly and via ODBC as another option.

    Significant strings have been modified to protect the innocent.
    --------
    #
    # db.py: establish a database connection with
    # the appropriate parameter style
    #
    try:
        import MySQLdb as db
        conn = db.connect(host="****", db="****",
                          user="****", passwd="****")
        pmark = "%s"
        print "Using MySQL"
    except ImportError:
        import mx.ODBC.Windows as db
        conn = db.connect("****", user="****", password="****")
        pmark = "?"
        print "Using ODBC"
    --------
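
DB-API modules also advertise their marker convention through the module-level paramstyle attribute ("format" for MySQLdb, "qmark" for mxODBC), so the marker could be derived rather than hard-coded. The following is a minimal sketch under that assumption, written for Python 3 and using the standard library's sqlite3 with an in-memory database so that it runs anywhere. The helper marker_for and the sample row are invented for illustration.

```python
import sqlite3


def marker_for(module):
    """Map a DB-API module's paramstyle to a positional marker.

    Hypothetical helper: only the two positional styles discussed in
    the post ("format" -> %s, "qmark" -> ?) are handled.
    """
    markers = {"format": "%s", "qmark": "?"}
    try:
        return markers[module.paramstyle]
    except KeyError:
        raise ValueError("unsupported paramstyle: %r" % module.paramstyle)


pmark = marker_for(sqlite3)  # sqlite3 declares paramstyle "qmark"
conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute(
    "CREATE TABLE PyLink (pylWhen TEXT, pylURL TEXT, pylDescription TEXT)")

# Only the marker is interpolated into the SQL text; the values
# themselves are always passed separately to execute().
sqlinsert = ("INSERT INTO PyLink (pylWhen, pylURL, pylDescription) "
             "VALUES(%s, %s, %s)" % (pmark, pmark, pmark))
curs.execute(sqlinsert, ("2004-12-22", "http://example.com/", "Example item"))

curs.execute("SELECT COUNT(*) FROM PyLink WHERE pylURL=%s" % pmark,
             ("http://example.com/",))
count = curs.fetchone()[0]
conn.commit()
conn.close()
```

The design point is the same as in db.py: calling code builds its SQL from pmark and never needs to know which driver is underneath.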
    regards
    Steve
     
    Steve Holden, Dec 22, 2004
    #3
  4. Zhang Le

    Zhang Le Guest

    Thanks for the hint. The XML-RPC service is great, but I want some
    general techniques to parse news information from ordinary HTML pages.

    Currently I'm looking at a script-based approach found at:
    http://www.namo.com/products/handstory/manual/hsceditor/
    Users can write simple templates to extract certain fields from a
    web page. Unfortunately, it is not open source, so I cannot look
    inside the black box. :-(

    Zhang Le
     
    Zhang Le, Dec 22, 2004
    #4
  5. Zhang Le

    Steve Holden Guest

    Zhang Le wrote:

    > Thanks for the hint. The XML-RPC service is great, but I want some
    > general techniques to parse news information from ordinary HTML pages.
    >
    > Currently I'm looking at a script-based approach found at:
    > http://www.namo.com/products/handstory/manual/hsceditor/
    > Users can write simple templates to extract certain fields from a
    > web page. Unfortunately, it is not open source, so I cannot look
    > inside the black box. :-(
    >
    > Zhang Le
    >

    That's a very large topic, and not one that I could claim to be expert
    on, so let's hope that others will pitch in with their favorite
    techniques. Otherwise it's down to providing individual parsers for each
    service you want to scan, and maintaining the parsers as each group of
    designers modifies their pages.

    You might want to look at BeautifulSoup, which is a module for extracting
    stuff from (possibly) irregularly-formed HTML.
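
To give a feel for what a per-site parser involves, here is a rough sketch in Python 3. It uses the standard library's html.parser rather than BeautifulSoup, so it will be less forgiving of badly broken markup. The class name HeadlineExtractor, the "/story/" prefix and the sample HTML are all invented for illustration; a real site would need its own rule.

```python
from html.parser import HTMLParser


class HeadlineExtractor(HTMLParser):
    """Collect (title, url) pairs from links whose href starts with a prefix.

    The prefix plays the role of the per-site "template". Both the class
    and the prefix used below are hypothetical.
    """

    def __init__(self, href_prefix):
        super().__init__()
        self.href_prefix = href_prefix
        self.headlines = []
        self._href = None   # href of the matching link being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.href_prefix):
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace.
            title = " ".join("".join(self._text).split())
            if title:
                self.headlines.append((title, self._href))
            self._href = None


sample = """
<table><tr><td><img src="ad.gif"><a href="/ads/1">Buy now</a></td></tr></table>
<a href="/story/101">First   headline</a>
<a href="/story/102"><b>Second</b> headline</a>
"""
parser = HeadlineExtractor("/story/")
parser.feed(sample)
parser.close()
headlines = parser.headlines
```

Tables, images and the advertising link are ignored simply because they never match the rule, which is also why the rule has to be revisited whenever the site's designers change the page.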

    regards
    Steve
     
    Steve Holden, Dec 23, 2004
    #5
  6. Zhang Le

    Fuzzyman Guest

    If you have a reliably structured page, then you can write a custom
    parser. As Steve points out, BeautifulSoup would be a very good place
    to start.

    This is the problem that RSS was designed to solve. Many news sites will
    supply exactly the information you want as an RSS feed. You should then
    use Universal Feed Parser to process the feed.

    The module you need for fetching the web pages (in case you didn't know)
    is urllib2. There is a great article on fetching web pages in the
    current issue of pyzine. See http://www.pyzine.com :)
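
In Python 3 the functionality of urllib2 lives in urllib.request. A minimal fetching sketch under that assumption might look like the following; fetch_page and the User-Agent string are made-up names, and the UTF-8 fallback is a guess that will not suit every site.

```python
import urllib.request


def fetch_page(url, timeout=10):
    """Fetch a URL and return its body decoded as text.

    fetch_page is a hypothetical helper. It falls back to UTF-8 when
    the response does not declare a character set.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "news-fetch-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_page("http://news.bbc.co.uk/") would return the page source as a string, ready to hand to whichever parser you choose.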
    Regards,

    Fuzzy
    http://www.voidspace.org.uk/python/index.shtml
     
    Fuzzyman, Dec 23, 2004
    #6
  7. Simon Brunning Guest

    On 22 Dec 2004 09:22:15 -0800, Zhang Le wrote:
    > Hello,
    > I'm writing a little Tkinter application to retrieve news from
    > various news websites such as http://news.bbc.co.uk/, and display them
    > in a Tk listbox. All I want is the title and URL of each news item.


    Well, the BBC publishes an RSS feed[1], as do most sites like it. You
    can read RSS feeds with Mark Pilgrim's Feed Parser[2].

    Granted, you can't read *every* site like this. But I daresay that
    *most* news-related sites publish feeds of some kind these days. Where
    they do, using the feed is a *far* better idea than trying to parse
    the HTML.
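
To show how little work a feed needs compared with HTML, here is a small sketch that pulls titles and links out of an RSS 2.0 document. It deliberately uses the standard library's xml.etree.ElementTree instead of Feed Parser, so it only copes with plain, well-formed RSS 2.0; Feed Parser handles many more formats and broken feeds. The sample feed and the function name rss_headlines are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up feed standing in for what a news site might publish.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://example.com/story/101</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/story/102</link>
    </item>
  </channel>
</rss>
"""


def rss_headlines(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and link:
            pairs.append((title, link))
    return pairs


headlines = rss_headlines(SAMPLE_RSS)
```

Each pair is exactly the title and URL the original question asks for, so the list could be fed straight into a Tkinter listbox.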

    --
    Cheers,
    Simon B.
    http://www.brunningonline.net/simon/blog/
    [1] http://news.bbc.co.uk/2/hi/help/3223484.stm
    [2] http://feedparser.org/
     
    Simon Brunning, Dec 29, 2004
    #7