I'm pretty new to programming. I've just been studying a few weeks off
and on. I know a little, and I'm learning as I go. Programming is so
much fun! I really wish I had gotten into it years ago. Here's my
question: I have a long-term project in mind, and I want to know if
it's feasible and how difficult it will be.
There's an XML feed for my school that another class designed. It's
just a simple feed that lists each class, its room number, and the
student with the highest GPA. The feed is set up like this (each of
the following lines would also be a link to more information about
the class):
Economics, Room 216, James Faker, 3.4
Social Studies, Room 231, Brain Fictitious, 3.5
etc, etc
The student also has a picture reference that depicts his GPA based on
the number. The picture is basically just a graph. I just want to
write a program that uses the information on this feed.
I want it to reach out to this XML feed, record each instance of the
above format along with the picture reference for the highest-GPA
student, download it locally, and then be able to use that information
in various ways. I figured I'd start by counting each instance; for
example, the above would be 2 instances.
Eventually, I want it to be able to cross-reference data it has
already downloaded and compare GPAs, etc. It would have a GUI and
everything too, but I'm trying to keep it simple right now and just
build onto it as I learn.
> So let's just say this. How do you grab information from the web

Depends on the web page. Haven't tried that, just a simple CGI.

> and then use that in calculations?

The key is some type of structure, be it database records or a list
of lists or whatever -- something you can iterate through, sort, find
the max element of, etc.
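A minimal sketch of that list-of-lists idea in modern Python 3,
assuming the feed lines look exactly like the sample above (the
field order and parsing are guesses, not a real feed format):

```python
# Hypothetical sketch: turn feed lines of the form
# "Class, Room NNN, Student Name, GPA" into a list of lists,
# then iterate, count, and find the max element.
feed_lines = [
    "Economics, Room 216, James Faker, 3.4",
    "Social Studies, Room 231, Brain Fictitious, 3.5",
]

records = []
for line in feed_lines:
    course, room, student, gpa = [f.strip() for f in line.split(",")]
    records.append([course, room, student, float(gpa)])

print("instances:", len(records))       # counting each instance -> 2
top = max(records, key=lambda r: r[3])  # record with the highest GPA
print("top student:", top[2])
```

Once the records are in a structure like this, counting instances,
sorting by GPA, and comparing are all one-liners.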
> How would you implement such a project?

The example below uses BeautifulSoup. I'm posting it not because it
matches your problem, but to give you an idea of the techniques
involved.

> Would you save the information into a text file?

Possibly, but generally no. Text files aren't very useful except as a
data exchange medium.
> Or would you use something else?

Your application lends itself to a database approach. Note that in my
example the database part of the code is disabled -- not everyone has
MS-Access on Windows.

> Should I study up on SQLite?

Yes. The MS-Access code I have can easily be changed to SQLite.
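As a small illustration of that approach, here's a sketch using
Python 3's built-in sqlite3 module (no server or install needed; the
table and column names are invented for illustration):

```python
import sqlite3

# In-memory database for illustration; pass a filename to persist it.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE classes (
                   course  TEXT,
                   room    TEXT,
                   student TEXT,
                   gpa     REAL)""")
rows = [
    ("Economics", "Room 216", "James Faker", 3.4),
    ("Social Studies", "Room 231", "Brain Fictitious", 3.5),
]
cur.executemany("INSERT INTO classes VALUES (?, ?, ?, ?)", rows)
con.commit()

# The database does the iterating/sorting/max-finding for you.
cur.execute("SELECT student, MAX(gpa) FROM classes")
best = cur.fetchone()
con.close()
```

The parameterized `?` placeholders are the same style the disabled
MS-Access code below uses, which is why the switch is easy.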
> Maybe I should study classes.

I don't know about that; I've always gotten along without them.

> I'm just not sure. What would be the most effective technique?

Don't know that either, as I've only done it once, as follows:
## I was looking in my database of movie grosses I regularly copy
## from the Internet Movie Database and noticed I was _only_ 120
## weeks behind in my updates.
##
## Ouch.
##
## Copying a web page, pasting into a text file, running a perl
## script to convert it into a csv file and manually importing it
## into Access isn't so bad when you only have a couple to do at
## a time. Still, it's a labor intensive process and 120 isn't
## anything to look forward to.
##
## But I abandoned perl years ago when I took up Python, so I
## can use Python to completely automate the process now.
##
## Just have to figure out how.
##
## There are 3 main tasks: capture the web page, parse the web page
## to extract the data and insert the data into the database.
##
## But I only know how to do the last step, using the odbc tools
## from win32,
####import dbi
####import odbc
import re
## so I snoop around comp.lang.python to pick up some
## hints and keywords on how to do the other two tasks.
##
## Documentation on urllib2 was a bit vague, but got the web page
## after only a couple mis-steps.
import urllib2
## Unfortunately, HTMLParser remained beyond my grasp (is it
## my imagination, or is the quality of the examples in the
## documentation inversely proportional to the subject
## difficulty?)
##
## Luckily, my bag of hints had a reference to Beautiful Soup,
## whose web site proclaims:
## Beautiful Soup is a Python HTML/XML parser
## designed for quick turnaround projects like
## screen-scraping.
## Looks like just what I need, maybe I can figure it out after all.
from BeautifulSoup import BeautifulSoup
target_dates = [['4','6','2008','April']]
####con = odbc.odbc("IMDB") # connect to MS-Access database
####cursor = con.cursor()
for d in target_dates:
    #
    # build url (with CGI parameters) from list of dates needing
    # updating
    #
    the_year = d[2]
    the_date = '/'.join([d[0],d[1],d[2]])
    print '%10s scraping IMDB:' % (the_date),
    the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',d[1],'&month=',d[3]])
    req = urllib2.Request(url=the_url)
    f = urllib2.urlopen(req)
    www = f.read()
    #
    # ok, page captured. now make a BeautifulSoup object from it
    #
    soup = BeautifulSoup(www)
    #
    # that was easy, much more so than HTMLParser
    #
    # now, _all_ I have to do is figure out how to parse it
    #
    # ouch again. this is a lot harder than it looks in the
    # documentation. I need to get the data from cells of a
    # table nested inside another table, and that's hard to
    # extrapolate from the examples showing how to find all
    # the comments on a web page.
    #
    # but this looks promising. if I grab all the table rows
    # (tr tags), each complete nested table is inside a cell
    # of the outer table (whose table tags are lost, but aren't
    # needed, and whose absence makes extracting the nested
    # tables easier (when you do it the stupid way, but hey,
    # it works, so I'm sticking with it))
    #
    tr = soup.tr                      # table rows
    tr.extract()
    #
    # now, I only want the third nested table. how do I get it?
    # can't seem to get past the first one, should I be using
    # NextSibling or something? <scratches head...>
    #
    # but wait...I don't need the first two tables, so I can
    # simply extract and discard them. and since .extract()
    # CUTS the tables, after two extractions the table I want
    # IS the first one.
    #
    the_table = tr.find('table')      # discard
    the_table.extract()
    the_table = tr.find('table')      # discard
    the_table.extract()
    the_table = tr.find('table')      # weekly gross
    the_table.extract()
    #
    # of course, the data doesn't start in the first row,
    # there's formatting, header rows, etc. looks like it starts
    # in tr number [3]
    #
    ## >>> the_table.contents[3].td
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since tags always imply the first one, the above
    # is equivalent to
    #
    ## >>> the_table.contents[3].contents[0]
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since the title is the first of three cells, the
    # reporting year is
    #
    ## >>> the_table.contents[3].contents[1]
    ## <td> <a href="/Sections/Years/2001">2001</a> </td>
    #
    # finally, the 3rd cell must contain the gross
    #
    ## >>> the_table.contents[3].contents[2]
    ## <td align="RIGHT"> 259,674,120</td>
    #
    # but the contents of the first two cells are anchor tags.
    # to get the actual title string, I need the contents of the
    # contents. but that's not exactly what I want either,
    # I don't want a list, I need a string. and the string isn't
    # always in the same place in the list
    #
    # summarizing, what I need is
    #
    ## print the_table.contents[3].contents[0].contents[0].contents,
    ## print the_table.contents[3].contents[1].contents[1].contents,
    ## print the_table.contents[3].contents[2].contents
    #
    # and that almost works, just a couple more tweaks and I can
    # shove it into the database
    parsed = []
    for rec in the_table.contents[3:]:
        the_rec_type = type(rec)      # some rec are NavigableStrings, skip
        if str(the_rec_type) == "<type 'instance'>":
            #
            # ok, got a real data row
            #
            TITLE_DATE = rec.contents[0].contents[0].contents  # a list inside a tuple
            #
            # and that means we still have to index the contents
            # of the contents of the contents of the contents by
            # adding [0][0] to TITLE_DATE
            #
            YEAR = rec.contents[1].contents[1].contents        # ditto
            #
            # this won't go into the database, just used as a filter to grab
            # the records associated with the posting date and discard
            # the others (which should already be in the database)
            #
            GROSS = rec.contents[2].contents                   # just a list
            #
            # one other minor glitch: the film date is part of the title
            # (which is of no use in the database), so it has to be pulled
            # out and put in a separate field
            #
            # temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE[0][0]))
            temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE))
            #
            # which works 99% of the time. unfortunately, the IMDB
            # consistency is somewhat dubious. the date is _supposed_
            # to be at the end of the string, but sometimes it's not.
            # so, usually, there are only 5 groups, but you have to
            # allow for the fact that there may be 6
            #
            try:
                the_title = temp_title.group(1) + temp_title.group(5)
            except:
                the_title = temp_title.group(1)
            the_gross = str(GROSS[0])
            #
            # and for some unexplained reason, dates will occasionally
            # be 2001/I instead of 2001, so we want to discard the trailing
            # crap, if any
            #
            the_film_year = temp_title.group(3)[:4]
            # if str(YEAR[0][0])==the_year:
            if str(YEAR[0])==the_year:
                parsed.append([the_date,the_title,the_film_year,the_gross])
    print '%3d records found ' % (len(parsed))
    #
    # wow, now just have to insert all the update records directly
    # into the database...into a temporary table, of course. as I said,
    # IMDB consistency is somewhat dubious (such as changing the
    # spelling of the titles), so a QC check will be required inside Access
    #
####if len(parsed)>0:
####    print '...inserting into database'
####    for p in parsed:
####        cursor.execute("""
####INSERT INTO imdweeks2 ( Date_reported, Title, Film_Date, Gross_to_Date )
####SELECT ?,?,?,?;""",p)
####else:
####    print '...aborting, no records found'
####
####cursor.close()
####con.close()
for p in parsed: print p
# and just because it works, doesn't mean it's right.
# but hey, you get what you pay for. I'm _sure_ if I were
# to pay for a subscription to IMDBPro, I wouldn't see
# these errors ;-)
##You should get this:
##
## 4/6/2008 scraping IMDB: 111 records found
##['4/6/2008', "[u'I Am Legend']", '2007', ' 256,386,216']
##['4/6/2008', "[u'National Treasure: Book of Secrets']", '2007', ' 218,701,477']
##['4/6/2008', "[u'Alvin and the Chipmunks']", '2007', ' 216,873,487']
##['4/6/2008', "[u'Juno']", '2007', ' 142,545,706']
##['4/6/2008', "[u'Horton Hears a Who!']", '2008', ' 131,076,768']
##['4/6/2008', "[u'Bucket List, The']", '2007', ' 91,742,612']
##['4/6/2008', "[u'10,000 BC']", '2008', ' 89,349,915']
##['4/6/2008', "[u'Cloverfield']", '2008', ' 80,034,302']
##['4/6/2008', "[u'Jumper']", '2008', ' 78,762,148']
##['4/6/2008', "[u'27 Dresses']", '2008', ' 76,376,607']
##['4/6/2008', "[u'No Country for Old Men']", '2007', ' 74,273,505']
##['4/6/2008', "[u'Vantage Point']", '2008', ' 71,037,105']
##['4/6/2008', "[u'Spiderwick Chronicles, The']", '2008', ' 69,872,230']
##['4/6/2008', '[u"Fool\'s Gold"]', '2008', ' 68,636,484']
##['4/6/2008', "[u'Hannah Montana/Miley Cyrus: Best of Both Worlds Concert Tour']", '2008', ' 65,010,561']
##['4/6/2008', "[u'Step Up 2: The Streets']", '2008', ' 57,389,556']
##['4/6/2008', "[u'Atonement']", '2007', ' 50,921,738']
##['4/6/2008', "[u'21']", '2008', ' 46,770,173']
##['4/6/2008', "[u'College Road Trip']", '2008', ' 40,918,686']
##['4/6/2008', "[u'There Will Be Blood']", '2007', ' 40,133,435']
##['4/6/2008', "[u'Meet the Spartans']", '2008', ' 38,185,300']
##['4/6/2008', "[u'Meet the Browns']", '2008', ' 37,662,502']
##['4/6/2008', "[u'Deep Sea 3D']", '2006', ' 36,141,373']
##['4/6/2008', "[u'Semi-Pro']", '2008', ' 33,289,722']
##['4/6/2008', "[u'Definitely, Maybe']", '2008', ' 31,973,840']
##['4/6/2008', "[u'Eye, The']", '2008', ' 31,397,498']
##['4/6/2008', "[u'Great Debaters, The']", '2007', ' 30,219,326']
##['4/6/2008', "[u'Bank Job, The']", '2008', ' 26,804,821']
##['4/6/2008', "[u'Other Boleyn Girl, The']", '2008', ' 26,051,195']
##['4/6/2008', "[u'Drillbit Taylor']", '2008', ' 25,490,483']
##['4/6/2008', "[u'Magnificent Desolation: Walking on the Moon 3D']", '2005', ' 23,283,158']
##['4/6/2008', "[u'Shutter']", '2008', ' 23,138,277']
##['4/6/2008', "[u'Never Back Down']", '2008', ' 23,080,675']
##['4/6/2008', "[u'Mad Money']", '2008', ' 20,648,442']
##['4/6/2008', "[u'Galapagos']", '1955', ' 17,152,405']
##['4/6/2008', "[u'Superhero Movie']", '2008', ' 16,899,661']
##['4/6/2008', "[u'Wild Safari 3D']", '2005', ' 16,550,933']
##['4/6/2008', "[u'Kite Runner, The']", '2007', ' 15,790,223']
##['4/6/2008', '[u"Nim\'s Island"]', '2008', ' 13,210,579']
##['4/6/2008', "[u'Leatherheads']", '2008', ' 12,682,595']
##['4/6/2008', "[u'Be Kind Rewind']", '2008', ' 11,028,439']
##['4/6/2008', "[u'Doomsday']", '2008', ' 10,955,425']
##['4/6/2008', "[u'Sea Monsters: A Prehistoric Adventure']", '2007', ' 10,745,308']
##['4/6/2008', "[u'Miss Pettigrew Lives for a Day']", '2008', ' 10,534,800']
##['4/6/2008', "[u'Môme, La']", '2007', ' 10,299,782']
##['4/6/2008', "[u'Penelope']", '2006', ' 9,646,154']
##['4/6/2008', "[u'Misma luna, La']", '2007', ' 8,959,462']
##['4/6/2008', "[u'Roving Mars']", '2006', ' 8,463,161']
##['4/6/2008', "[u'Stop-Loss']", '2008', ' 8,170,755']
##['4/6/2008', "[u'Ruins, The']", '2008', ' 8,003,241']
##['4/6/2008', "[u'Bella']", '2006', ' 7,776,080']
##['4/6/2008', "[u'U2 3D']", '2007', ' 7,348,105']
##['4/6/2008', "[u'Orfanato, El']", '2007', ' 7,159,147']
##['4/6/2008', "[u'In Bruges']", '2008', ' 6,831,761']
##['4/6/2008', "[u'Savages, The']", '2007', ' 6,571,599']
##['4/6/2008', "[u'Scaphandre et le papillon, Le']", '2007', ' 5,990,075']
##['4/6/2008', "[u'Run Fatboy Run']", '2007', ' 4,430,583']
##['4/6/2008', "[u'Persepolis']", '2007', ' 4,200,980']
##['4/6/2008', "[u'Charlie Bartlett']", '2007', ' 3,928,412']
##['4/6/2008', "[u'Jodhaa Akbar']", '2008', ' 3,434,629']
##['4/6/2008', "[u'Fälscher, Die']", '2007', ' 2,903,370']
##['4/6/2008', "[u'Bikur Ha-Tizmoret']", '2007', ' 2,459,543']
##['4/6/2008', "[u'Shine a Light']", '2008', ' 1,488,081']
##['4/6/2008', "[u'Race']", '2008', ' 1,327,606']
##['4/6/2008', "[u'Funny Games U.S.']", '2007', ' 1,274,055']
##['4/6/2008', "[u'4 luni, 3 saptamâni si 2 zile']", '2007', ' 1,103,315']
##['4/6/2008', "[u'Married Life']", '2007', ' 1,002,318']
##['4/6/2008', "[u'Diary of the Dead']", '2007', ' 893,192']
##['4/6/2008', "[u'Starting Out in the Evening']", '2007', ' 882,518']
##['4/6/2008', "[u'Dolphins and Whales 3D: Tribes of the Ocean']", '2008', ' 854,304']
##['4/6/2008', "[u'Sukkar banat']", '2007', ' 781,954']
##['4/6/2008', "[u'Bonneville']", '2006', ' 471,679']
##['4/6/2008', "[u'Flawless']", '2007', ' 390,892']
##['4/6/2008', "[u'Paranoid Park']", '2007', ' 387,119']
##['4/6/2008', "[u'Teeth']", '2007', ' 321,732']
##['4/6/2008', "[u'Hammer, The']", '2007', ' 321,579']
##['4/6/2008', "[u'Priceless']", '2008', ' 320,131']
##['4/6/2008', "[u'Steep']", '2007', ' 259,840']
##['4/6/2008', "[u'Honeydripper']", '2007', ' 259,192']
##['4/6/2008', "[u'Snow Angels']", '2007', ' 255,147']
##['4/6/2008', "[u'Taxi to the Dark Side']", '2007', ' 231,743']
##['4/6/2008', "[u'Cheung Gong 7 hou']", '2008', ' 188,067']
##['4/6/2008', "[u'Ne touchez pas la hache']", '2007', ' 184,513']
##['4/6/2008', "[u'Sleepwalking']", '2008', ' 160,715']
##['4/6/2008', "[u'Chicago 10']", '2007', ' 149,456']
##['4/6/2008', "[u'Girls Rock!']", '2007', ' 92,636']
##['4/6/2008', "[u'Beaufort']", '2007', ' 87,339']
##['4/6/2008', "[u'Shelter']", '2007', ' 85,928']
##['4/6/2008', "[u'My Blueberry Nights']", '2007', ' 74,146']
##['4/6/2008', "[u'Témoins, Les']", '2007', ' 71,624']
##['4/6/2008', "[u'Mépris, Le']", '1963', ' 70,761']
##['4/6/2008', "[u'Singing Revolution, The']", '2006', ' 66,482']
##['4/6/2008', "[u'Chop Shop']", '2007', ' 58,858']
##['4/6/2008', '[u"Chansons d\'amour, Les"]', '2007', ' 58,577']
##['4/6/2008', "[u'Praying with Lior']", '2007', ' 57,325']
##['4/6/2008', "[u'Yihe yuan']", '2006', ' 57,155']
##['4/6/2008', "[u'Casa de Alice, A']", '2007', ' 53,700']
##['4/6/2008', "[u'Blindsight']", '2006', ' 51,256']
##['4/6/2008', "[u'Boarding Gate']", '2007', ' 37,107']
##['4/6/2008', "[u'Voyage du ballon rouge, Le']", '2007', ' 35,222']
##['4/6/2008', "[u'Bill']", '2007', ' 35,201']
##['4/6/2008', "[u'Mio fratello è figlio unico']", '2007', ' 34,138']
##['4/6/2008', "[u'Chapter 27']", '2007', ' 32,602']
##['4/6/2008', "[u'Meduzot']", '2007', ' 25,352']
##['4/6/2008', "[u'Shotgun Stories']", '2007', ' 25,346']
##['4/6/2008', "[u'Sconosciuta, La']", '2006', ' 18,569']
##['4/6/2008', "[u'Imaginary Witness: Hollywood and the Holocaust']", '2004', ' 18,475']
##['4/6/2008', "[u'Irina Palm']", '2007', ' 14,214']
##['4/6/2008', "[u'Naissance des pieuvres']", '2007', ' 7,418']
##['4/6/2008', "[u'Four Letter Word, A']", '2007', ' 6,017']
##['4/6/2008', "[u'Tuya de hun shi']", '2006', ' 2,619']
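For comparison, the three main tasks the script's comments identify
(capture the page, parse it, store the rows) can be sketched with just
the modern Python 3 standard library. The HTML and table layout below
are invented stand-ins; a real run would fetch the page with
urllib.request.urlopen instead of using a hard-coded string:

```python
import sqlite3
from html.parser import HTMLParser

# Stand-in for a captured web page (normally: urllib.request.urlopen).
PAGE = """<table>
<tr><td>I Am Legend (2007)</td><td>256,386,216</td></tr>
<tr><td>Juno (2007)</td><td>142,545,706</td></tr>
</table>"""

class CellCollector(HTMLParser):
    """Collect the text of every <td> into self.cells."""
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
    def handle_data(self, data):
        if self._in_td:
            self.cells.append(data.strip())

parser = CellCollector()
parser.feed(PAGE)

# Pair the cells up as (title, gross) rows and store them.
rows = list(zip(parser.cells[0::2], parser.cells[1::2]))
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE grosses (title TEXT, gross TEXT)")
con.executemany("INSERT INTO grosses VALUES (?, ?)", rows)
con.commit()
con.close()
```

BeautifulSoup, as the script shows, saves you writing the parser class
by hand; the overall shape of the job is the same either way.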