my first screen scraper

mensanator

## I was looking in my database of movie grosses I regularly copy
## from the Internet Movie Database and noticed I was _only_ 120
## weeks behind in my updates.
##
## Ouch.
##
## Copying a web page, pasting into a text file, running a perl
## script to convert it into a csv file and manually importing it
## into Access isn't so bad when you only have a couple to do at
## a time. Still, it's a labor-intensive process, and 120 isn't
## anything to look forward to.
##
## But I abandoned perl years ago when I took up Python, so I
## can use Python to completely automate the process now.
##
## Just have to figure out how.
##
## There are 3 main tasks: capture the web page, parse the web page
## to extract the data, and insert the data into the database.
##
## But I only know how to do the last step, using the odbc tools
## from win32,

import dbi
import odbc
import re
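
## (that last step, for the record, is just the plain odbc module from
## the old win32all extensions: connect to a DSN, get a cursor, execute.
## a bare-bones sketch of the pattern the bottom of this script uses --
## "IMDB" is my DSN name, and the column values here are placeholders:
##
##     con = odbc.odbc("IMDB")   # ODBC data source pointing at the .mdb
##     cursor = con.cursor()
##     cursor.execute("INSERT INTO imdweeks2 ( Date_reported, Title, "
##                    "Film_Date, Gross_to_Date ) SELECT ?,?,?,?;",
##                    ['11/24/2006', 'Some Title', '2006', '1,234,567'])
##     cursor.close()
##     con.close()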

## so I snoop around comp.lang.python to pick up some
## hints and keywords on how to do the other two tasks.
##
## Documentation on urllib2 was a bit vague, but I got the web page
## after only a couple of mis-steps.

import urllib2

## Unfortunately, HTMLParser remained beyond my grasp (is it
## my imagination or is the quality of the examples in the
## documentation inversely proportional to the subject
## difficulty?)
##
## Luckily, my bag of hints had a reference to Beautiful Soup,
## whose web site proclaims:
## Beautiful Soup is a Python HTML/XML parser
## designed for quick turnaround projects like
## screen-scraping.
## Looks like just what I need, maybe I can figure it out after all.

from BeautifulSoup import BeautifulSoup
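
## a quick warm-up on some made-up HTML (not the real IMDB page), just
## to get a feel for how tag attributes and .contents behave -- roughly
## what an interactive session shows:
##
## >>> soup = BeautifulSoup('<table><tr><td><a href="/x">Movie (2000)</a></td><td>1,000</td></tr></table>')
## >>> soup.tr.contents[0]
## <td><a href="/x">Movie (2000)</a></td>
## >>> soup.tr.contents[0].contents[0].contents
## [u'Movie (2000)']
## >>> soup.tr.contents[1].contents
## [u'1,000']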

target_dates = [['11','24','2006','November']]

con = odbc.odbc("IMDB") # connect to MS-Access database
cursor = con.cursor()

for d in target_dates:
    #
    # build url (with CGI parameters) from list of dates needing updating
    #
    the_year = d[2]
    the_date = '/'.join([d[0],d[1],d[2]])
    print '%10s scraping IMDB:' % (the_date),
    the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',d[1],'&month=',d[3]])
    req = urllib2.Request(url=the_url)
    f = urllib2.urlopen(req)
    www = f.read()
    #
    # ok, page captured. now make a BeautifulSoup object from it
    #
    soup = BeautifulSoup(www)
    #
    # that was easy, much more so than HTMLParser
    #
    # now, _all_ I have to do is figure out how to parse it
    #
    # ouch again. this is a lot harder than it looks in the
    # documentation. I need to get the data from cells of a
    # table nested inside another table and that's hard to
    # extrapolate from the examples showing how to find all
    # the comments on a web page.
    #
    # but this looks promising. if I grab all the table rows
    # (tr tags), each complete nested table is inside a cell
    # of the outer table (whose table tags are lost, but aren't
    # needed, and whose absence makes extracting the nested
    # tables easier (when you do it the stupid way, but hey,
    # it works, so I'm sticking with it))
    #
    tr = soup.tr # table rows
    tr.extract()
    #
    # now, I only want the third nested table. how do I get it?
    # can't seem to get past the first one, should I be using
    # nextSibling or something? <scratches head...>
    #
    # but wait...I don't need the first two tables, so I can
    # simply extract and discard them. and since .extract()
    # CUTS the tables, after two extractions the table I want
    # IS the first one.
    #
    the_table = tr.find('table') # discard
    the_table.extract()
    the_table = tr.find('table') # discard
    the_table.extract()
    the_table = tr.find('table') # weekly gross
    the_table.extract()
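    #
    # (the trick is easier to see with a plain list analogy -- nothing
    # to do with BeautifulSoup itself, just why "extract twice, then
    # the one I want is first" works:
    #
    # >>> tables = ['junk1', 'junk2', 'weekly gross']
    # >>> tables.pop(0)
    # 'junk1'
    # >>> tables.pop(0)
    # 'junk2'
    # >>> tables[0]
    # 'weekly gross'
    #
    # .extract() does to the parse tree what pop(0) does to the list:
    # it cuts the element out, so the next find('table') sees the one
    # that used to be behind it)
    #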
    #
    # of course, the data doesn't start in the first row,
    # there's formatting, header rows, etc. looks like it starts
    # in tr number [3]
    #
    ## >>> the_table.contents[3].td
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since tags always imply the first one, the above
    # is equivalent to
    #
    ## >>> the_table.contents[3].contents[0]
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since the title is the first of three cells, the
    # reporting year is
    #
    ## >>> the_table.contents[3].contents[1]
    ## <td> <a href="/Sections/Years/2001">2001</a> </td>
    #
    # finally, the 3rd cell must contain the gross
    #
    ## >>> the_table.contents[3].contents[2]
    ## <td align="RIGHT"> 259,674,120</td>
    #
    # but the contents of the first two cells are anchor tags.
    # to get the actual title string, I need the contents of the
    # contents. but that's not exactly what I want either,
    # I don't want a list, I need a string. and the string isn't
    # always in the same place in the list
    #
    # summarizing, what I need is
    #
    ## print the_table.contents[3].contents[0].contents[0].contents,
    ## print the_table.contents[3].contents[1].contents[1].contents,
    ## print the_table.contents[3].contents[2].contents
    #
    # and that almost works, just a couple more tweaks and I can
    # shove it into the database

    parsed = []

    for rec in the_table.contents[3:]:
        the_rec_type = type(rec) # some recs are NavigableStrings, skip them
        if str(the_rec_type) == "<type 'instance'>":
            #
            # ok, got a real data row
            #
            TITLE_DATE = rec.contents[0].contents[0].contents # a list inside a tuple
            #
            # and that means we still have to index the contents
            # of the contents of the contents of the contents by
            # adding [0][0] to TITLE_DATE
            #
            YEAR = rec.contents[1].contents[1].contents # ditto
            #
            # this won't go into the database, just used as a filter to grab
            # the records associated with the posting date and discard
            # the others (which should already be in the database)
            #
            GROSS = rec.contents[2].contents # just a list
            #
            # one other minor glitch, that film date is part of the title
            # (which is of no use in the database), so it has to be pulled out
            # and put in a separate field
            #
            temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE[0][0]))
            #
            # which works 99% of the time. unfortunately, IMDB's
            # consistency is somewhat dubious. the date is _supposed_
            # to be at the end of the string, but sometimes it's not.
            # so, usually, there are only 5 groups, but you have to
            # allow for the fact that there may be 6
            #
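            #
            # (hand check with the pattern: the first title is the Grinch
            # row shown earlier, the second is made up just to show the
            # /I case that turns up occasionally:
            #
            # >>> m = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)','How the Grinch Stole Christmas (2000)')
            # >>> m.group(1), m.group(3), m.group(5)
            # ('How the Grinch Stole Christmas', '2000', '')
            # >>> m = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)','Cinderella (1997/I)')
            # >>> m.group(3), m.group(3)[:4]
            # ('1997/I', '1997')
            #
            # so group(1) is the bare title, group(3)[:4] is the film
            # year, and group(5) is whatever trails the date, if anything)
            #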
            try:
                the_title = temp_title.group(1) + temp_title.group(5)
            except:
                the_title = temp_title.group(1)
            the_gross = str(GROSS[0])
            #
            # and for some unexplained reason, dates will occasionally
            # be 2001/I instead of 2001, so we want to discard the trailing
            # crap, if any
            #
            the_film_year = temp_title.group(3)[:4]
            if str(YEAR[0][0])==the_year:
                parsed.append([the_date,the_title,the_film_year,the_gross])

    print '%3d records found ' % (len(parsed)),
    #
    # wow, now just have to insert all the update records directly
    # into the database...into a temporary table, of course. as I said,
    # IMDB consistency is somewhat dubious (such as changing the spelling
    # of the titles), so a QC check will be required inside Access
    #
    if len(parsed)>0:
        print '...inserting into database'
        for p in parsed:
            cursor.execute("""
                INSERT INTO imdweeks2 ( Date_reported, Title, Film_Date, Gross_to_Date )
                SELECT ?,?,?,?;""",p)
    else:
        print '...aborting, no records found'

cursor.close()
con.close()

# and just because it works, doesn't mean it's right.
# but hey, you get what you pay for. I'm _sure_ if I were
# to pay for a subscription to IMDBPro, I wouldn't see
# these errors ;-)

## 5 records found for 11/26/2006... ...inserting into database
##
## strange, only 5 films were in release over Thanksgiving weekend?
##
## of course not, IMDB screwed up (holidays being chronic problems
## for IMDB). for some reason, the others were posted on Friday
## instead of Sunday. at least that can be corrected by scraping
## the page for 11/24.
##
## mis-reported: records found on 11/24
## 11/24/2006 scraping IMDB: 88 records found ...inserting into database

##
## other data is just plain missing (don't they do QA/QC at IMDB?
## how hard is it to figure out that each week has 60-120 records?)
##
## 4 records found for 3/25/2007... ...inserting into database
## records appear to be missing
## 6 records found for 4/8/2007... ...inserting into database
## records appear to be missing
## 17 records found for 4/15/2007... ...inserting into database
## records appear to be missing
## 4 records found for 5/13/2007... ...inserting into database
## records appear to be missing
## 1 records found for 7/15/2007... ...inserting into database
## records appear to be missing
## 10/14/2007 scraping IMDB: 128 records found ...inserting into database
## 10/21/2007 scraping IMDB: 130 records found ...inserting into database
## 10/28/2007 scraping IMDB: 1 records found ...inserting into database
## records appear to be missing
## 11/4/2007 scraping IMDB: 105 records found ...inserting into database
## 11/11/2007 scraping IMDB: 2 records found ...inserting into database
## records appear to be missing
## 11/18/2007 scraping IMDB: 101 records found ...inserting into database

## other errors are legitimate. last week's grosses simply
## haven't been posted yet. and they sometimes are weeks
## behind. bet that doesn't happen with IMDBPro ;-)
## 11/25/2007 scraping IMDB: 0 records found ...aborting, no records found

## so, next update session I start with 11/25

## there, that wasn't so bad. only took about 40 minutes
## to scrape 120 or so web pages. that's one of the things
## I like about Python. even with documentation frustration,
## I've got what I need at the end of the day
 
