how would you...?


Sanoski

I'm pretty new to programming. I've just been studying a few weeks off
and on. I know a little, and I'm learning as I go. Programming is so
much fun! I really wish I had gotten into it years ago, but here's my
question. I have a long-term project in mind, and I want to know if
it's feasible and how difficult it will be.

There's an XML feed for my school that some other class designed. It's
just a simple idea that lists all classes, their room numbers, and the
student with the highest GPA. The feed is set up like this; each one of
the following lines would also be a link to more information about the
class, etc.

Economics, Room 216, James Faker, 3.4
Social Studies, Room 231, Brain Fictitious, 3.5

etc, etc

The student also has a picture reference that depicts his GPA based on
the number. The picture is basically just a graph. I just want to
write a program that uses the information on this feed.

I want it to reach out to this XML feed, record each instance of the
above format along with the picture reference of the highest-GPA
student, download it locally, and then be able to use that information
in various ways. I figured I'll start by counting each instance. For
example, the above would be 2 instances.

Eventually, I want it to be able to cross-reference data you've
already downloaded, and be able to compare GPAs, etc. It would have a
GUI and everything too, but I am trying to keep it simple right now,
and just build onto it as I learn.

So let's just say this. How do you grab information from the web, in
this case a feed, and then use that in calculations? How would you
implement such a project? Would you save the information into a text
file, or would you use something else? Should I study up on SQLite?
Maybe I should study classes. I'm just not sure. What would be the
most effective technique?
 

Mensanator

So let's just say this. How do you grab information from the web,

Depends on the web page.
in this case a feed,

Haven't tried that, just a simple CGI.
and then use that in calculations?

The key is some type of structure, be it database records,
a list of lists, or whatever. Something that you can iterate
through, sort, find the max element of, etc.
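
For instance, a minimal sketch, assuming the feed entries really do
arrive as plain comma-separated lines like the two in the question:

# a minimal sketch; the two lines below are just the example entries
# from the question, hard-coded here instead of fetched from the feed
feed_lines = [
    "Economics, Room 216, James Faker, 3.4",
    "Social Studies, Room 231, Brain Fictitious, 3.5",
]

classes = []                                # the "list of lists"
for line in feed_lines:
    course, room, student, gpa = [s.strip() for s in line.split(',')]
    classes.append([course, room, student, float(gpa)])

print len(classes), 'instances'             # counting: 2 instances
print max(classes, key=lambda c: c[3])      # entry with the highest GPA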
How would you
implement such a project?

The example below uses BeautifulSoup. I'm posting it not
because it matches your problem, but to give you an idea of
the techniques involved.
Would you save the information into a text file?

Possibly, but generally no. Text files aren't very useful
except as a data exchange medium.
Or would you use something else?

Your application lends itself to a database approach.
Note that in my example the database part of the code is disabled;
not everyone has MS-Access on Windows.
Should I study up on SQLite?

Yes. The MS-Access code I have can be easily changed to SQLite.
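
Roughly what the disabled MS-Access block at the bottom of the script
might look like with the sqlite3 module instead (a sketch only; the
'imdb.db' filename and table layout are invented, and 'parsed' is the
list the script below builds):

import sqlite3

con = sqlite3.connect('imdb.db')            # the whole database is just a file
cursor = con.cursor()
cursor.execute("""CREATE TABLE IF NOT EXISTS imdweeks2
                  (Date_reported TEXT, Title TEXT,
                   Film_Date TEXT, Gross_to_Date TEXT)""")
for p in parsed:                            # 'parsed' as built in the script below
    cursor.execute("INSERT INTO imdweeks2 VALUES (?,?,?,?)", p)
con.commit()
cursor.close()
con.close()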
Maybe I should study classes.

I don't know, but I've always gotten along without them.
I'm just not sure. What would be the most effective technique?

Don't know that either as I've only done it once, as follows:

##  I was looking in my database of movie grosses I regularly copy
##  from the Internet Movie Database and noticed I was _only_ 120
##  weeks behind in my updates.
##
##  Ouch.
##
##  Copying a web page, pasting into a text file, running a perl
##  script to convert it into a csv file and manually importing it
##  into Access isn't so bad when you only have a couple to do at
##  a time. Still, it's a labor intensive process and 120 isn't
##  anything to look forward to.
##
##  But I abandoned perl years ago when I took up Python, so I
##  can use Python to completely automate the process now.
##
##  Just have to figure out how.
##
##  There are 3 main tasks: capture the web page, parse the web page
##  to extract the data and insert the data into the database.
##
##  But I only know how to do the last step, using the odbc tools
##  from win32,

####import dbi
####import odbc
import re

##  so I snoop around comp.lang.python to pick up some
##  hints and keywords on how to do the other two tasks.
##
##  Documentation on urllib2 was a bit vague, but got the web page
##  after only a couple of mis-steps.

import urllib2

##  Unfortunately, HTMLParser remained beyond my grasp (is it
##  my imagination or is the quality of the examples in the
##  documentation inversely proportional to the subject
##  difficulty?)
##
##  Luckily, my bag of hints had a reference to Beautiful Soup,
##  whose web site proclaims:
##      Beautiful Soup is a Python HTML/XML parser
##      designed for quick turnaround projects like
##      screen-scraping.
##  Looks like just what I need, maybe I can figure it out after all.

from BeautifulSoup import BeautifulSoup

target_dates = [['4','6','2008','April']]

####con = odbc.odbc("IMDB")  # connect to MS-Access database
####cursor = con.cursor()

for d in target_dates:
  #
  # build url (with CGI parameters) from list of dates needing updating
  #
  the_year = d[2]
  the_date = '/'.join([d[0],d[1],d[2]])
  print '%10s scraping IMDB:' % (the_date),
  the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',d[1],'&month=',d[3]])
  req = urllib2.Request(url=the_url)
  f = urllib2.urlopen(req)
  www = f.read()
  #
  # ok, page captured. now make a BeautifulSoup object from it
  #
  soup = BeautifulSoup(www)
  #
  # that was easy, much more so than HTMLParser
  #
  # now, _all_ I have to do is figure out how to parse it
  #
  # ouch again. this is a lot harder than it looks in the
  # documentation. I need to get the data from cells of a
  # table nested inside another table and that's hard to
  # extrapolate from the examples showing how to find all
  # the comments on a web page.
  #
  # but this looks promising. if I grab all the table rows
  # (tr tags), each complete nested table is inside a cell
  # of the outer table (whose table tags are lost, but aren't
  # needed and whose absence makes extracting the nested
  # tables easier (when you do it the stupid way, but hey,
  # it works, so I'm sticking with it))
  #
  tr = soup.tr                          # table rows
  tr.extract()
  #
  # now, I only want the third nested table. how do I get it?
  # can't seem to get past the first one, should I be using
  # NextSibling or something? <scratches head...>
  #
  # but wait...I don't need the first two tables, so I can
  # simply extract and discard them. and since .extract()
  # CUTS the tables, after two extractions the table I want
  # IS the first one.
  #
  the_table = tr.find('table')          # discard
  the_table.extract()
  the_table = tr.find('table')          # discard
  the_table.extract()
  the_table = tr.find('table')          # weekly gross
  the_table.extract()
  #
  # of course, the data doesn't start in the first row,
  # there's formatting, header rows, etc. looks like it starts
  # in tr number [3]
  #
  ##  >>> the_table.contents[3].td
  ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
  #
  # and since tags always imply the first one, the above
  # is equivalent to
  #
  ##  >>> the_table.contents[3].contents[0]
  ##  <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
  #
  # and since the title is the first of three cells, the
  # reporting year is
  #
  ##  >>> the_table.contents[3].contents[1]
  ##  <td> <a href="/Sections/Years/2001">2001</a> </td>
  #
  # finally, the 3rd cell must contain the gross
  #
  ##  >>> the_table.contents[3].contents[2]
  ##  <td align="RIGHT"> 259,674,120</td>
  #
  # but the contents of the first two cells are anchor tags.
  # to get the actual title string, I need the contents of the
  # contents. but that's not exactly what I want either,
  # I don't want a list, I need a string. and the string isn't
  # always in the same place in the list
  #
  # summarizing, what I need is
  #
  ##  print the_table.contents[3].contents[0].contents[0].contents,
  ##  print the_table.contents[3].contents[1].contents[1].contents,
  ##  print the_table.contents[3].contents[2].contents
  #
  # and that almost works, just a couple more tweaks and I can
  # shove it into the database

  parsed = []

  for rec in the_table.contents[3:]:
    the_rec_type = type(rec)                      # some recs are NavigableStrings, skip
    if str(the_rec_type) == "<type 'instance'>":
      #
      # ok, got a real data row
      #
      TITLE_DATE = rec.contents[0].contents[0].contents   # a list inside a tuple
      #
      # and that means we still have to index the contents
      # of the contents of the contents of the contents by
      # adding [0][0] to TITLE_DATE
      #
      YEAR = rec.contents[1].contents[1].contents         # ditto
      #
      # this won't go into the database, just used as a filter to grab
      # the records associated with the posting date and discard
      # the others (which should already be in the database)
      #
      GROSS = rec.contents[2].contents                    # just a list
      #
      # one other minor glitch, that film date is part of the title
      # (which is of no use in the database, so it has to be pulled out
      # and put in a separate field)
      #
#      temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE[0][0]))
      temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\))(.*)',str(TITLE_DATE))
      #
      # which works 99% of the time. unfortunately, the IMDB
      # consistency is somewhat dubious. the date is _supposed_
      # to be at the end of the string, but sometimes it's not.
      # so, usually, there are only 5 groups, but you have to
      # allow for the fact that there may be 6
      #
      try:
        the_title = temp_title.group(1) + temp_title.group(5)
      except:
        the_title = temp_title.group(1)
      the_gross = str(GROSS[0])
      #
      # and for some unexplained reason, dates will occasionally
      # be 2001/I instead of 2001, so we want to discard the trailing
      # crap, if any
      #
      the_film_year = temp_title.group(3)[:4]
#      if str(YEAR[0][0])==the_year:
      if str(YEAR[0])==the_year:
        parsed.append([the_date,the_title,the_film_year,the_gross])

  print '%3d records found ' % (len(parsed))
  #
  # wow, now just have to insert all the update records directly
  # into the database...into a temporary table, of course. as I said,
  # IMDB consistency is somewhat dubious (such as changing the spelling
  # of the titles), so a QC check will be required inside Access
  #
####  if len(parsed)>0:
####    print '...inserting into database'
####    for p in parsed:
####      cursor.execute("""
####INSERT INTO imdweeks2 ( Date_reported, Title, Film_Date, Gross_to_Date )
####SELECT ?,?,?,?;""",p)
####  else:
####    print '...aborting, no records found'
####
####cursor.close()
####con.close()

for p in parsed: print p

# and just because it works, doesn't mean it's right.
# but hey, you get what you pay for. I'm _sure_ if I were
# to pay for a subscription to IMDBPro, I wouldn't see
# these errors ;-)



##You should get this:
##
## 4/6/2008 scraping IMDB: 111 records found
##['4/6/2008', "[u'I Am Legend']", '2007', ' 256,386,216']
##['4/6/2008', "[u'National Treasure: Book of Secrets']", '2007', '
218,701,477']
##['4/6/2008', "[u'Alvin and the Chipmunks']", '2007', ' 216,873,487']
##['4/6/2008', "[u'Juno']", '2007', ' 142,545,706']
##['4/6/2008', "[u'Horton Hears a Who!']", '2008', ' 131,076,768']
##['4/6/2008', "[u'Bucket List, The']", '2007', ' 91,742,612']
##['4/6/2008', "[u'10,000 BC']", '2008', ' 89,349,915']
##['4/6/2008', "[u'Cloverfield']", '2008', ' 80,034,302']
##['4/6/2008', "[u'Jumper']", '2008', ' 78,762,148']
##['4/6/2008', "[u'27 Dresses']", '2008', ' 76,376,607']
##['4/6/2008', "[u'No Country for Old Men']", '2007', ' 74,273,505']
##['4/6/2008', "[u'Vantage Point']", '2008', ' 71,037,105']
##['4/6/2008', "[u'Spiderwick Chronicles, The']", '2008', '
69,872,230']
##['4/6/2008', '[u"Fool\'s Gold"]', '2008', ' 68,636,484']
##['4/6/2008', "[u'Hannah Montana/Miley Cyrus: Best of Both Worlds
Concert Tour']", '2008', ' 65,010,561']
##['4/6/2008', "[u'Step Up 2: The Streets']", '2008', ' 57,389,556']
##['4/6/2008', "[u'Atonement']", '2007', ' 50,921,738']
##['4/6/2008', "[u'21']", '2008', ' 46,770,173']
##['4/6/2008', "[u'College Road Trip']", '2008', ' 40,918,686']
##['4/6/2008', "[u'There Will Be Blood']", '2007', ' 40,133,435']
##['4/6/2008', "[u'Meet the Spartans']", '2008', ' 38,185,300']
##['4/6/2008', "[u'Meet the Browns']", '2008', ' 37,662,502']
##['4/6/2008', "[u'Deep Sea 3D']", '2006', ' 36,141,373']
##['4/6/2008', "[u'Semi-Pro']", '2008', ' 33,289,722']
##['4/6/2008', "[u'Definitely, Maybe']", '2008', ' 31,973,840']
##['4/6/2008', "[u'Eye, The']", '2008', ' 31,397,498']
##['4/6/2008', "[u'Great Debaters, The']", '2007', ' 30,219,326']
##['4/6/2008', "[u'Bank Job, The']", '2008', ' 26,804,821']
##['4/6/2008', "[u'Other Boleyn Girl, The']", '2008', ' 26,051,195']
##['4/6/2008', "[u'Drillbit Taylor']", '2008', ' 25,490,483']
##['4/6/2008', "[u'Magnificent Desolation: Walking on the Moon 3D']",
'2005', ' 23,283,158']
##['4/6/2008', "[u'Shutter']", '2008', ' 23,138,277']
##['4/6/2008', "[u'Never Back Down']", '2008', ' 23,080,675']
##['4/6/2008', "[u'Mad Money']", '2008', ' 20,648,442']
##['4/6/2008', "[u'Galapagos']", '1955', ' 17,152,405']
##['4/6/2008', "[u'Superhero Movie']", '2008', ' 16,899,661']
##['4/6/2008', "[u'Wild Safari 3D']", '2005', ' 16,550,933']
##['4/6/2008', "[u'Kite Runner, The']", '2007', ' 15,790,223']
##['4/6/2008', '[u"Nim\'s Island"]', '2008', ' 13,210,579']
##['4/6/2008', "[u'Leatherheads']", '2008', ' 12,682,595']
##['4/6/2008', "[u'Be Kind Rewind']", '2008', ' 11,028,439']
##['4/6/2008', "[u'Doomsday']", '2008', ' 10,955,425']
##['4/6/2008', "[u'Sea Monsters: A Prehistoric Adventure']", '2007', '
10,745,308']
##['4/6/2008', "[u'Miss Pettigrew Lives for a Day']", '2008', '
10,534,800']
##['4/6/2008', "[u'Môme, La']", '2007', ' 10,299,782']
##['4/6/2008', "[u'Penelope']", '2006', ' 9,646,154']
##['4/6/2008', "[u'Misma luna, La']", '2007', ' 8,959,462']
##['4/6/2008', "[u'Roving Mars']", '2006', ' 8,463,161']
##['4/6/2008', "[u'Stop-Loss']", '2008', ' 8,170,755']
##['4/6/2008', "[u'Ruins, The']", '2008', ' 8,003,241']
##['4/6/2008', "[u'Bella']", '2006', ' 7,776,080']
##['4/6/2008', "[u'U2 3D']", '2007', ' 7,348,105']
##['4/6/2008', "[u'Orfanato, El']", '2007', ' 7,159,147']
##['4/6/2008', "[u'In Bruges']", '2008', ' 6,831,761']
##['4/6/2008', "[u'Savages, The']", '2007', ' 6,571,599']
##['4/6/2008', "[u'Scaphandre et le papillon, Le']", '2007', '
5,990,075']
##['4/6/2008', "[u'Run Fatboy Run']", '2007', ' 4,430,583']
##['4/6/2008', "[u'Persepolis']", '2007', ' 4,200,980']
##['4/6/2008', "[u'Charlie Bartlett']", '2007', ' 3,928,412']
##['4/6/2008', "[u'Jodhaa Akbar']", '2008', ' 3,434,629']
##['4/6/2008', "[u'Fälscher, Die']", '2007', ' 2,903,370']
##['4/6/2008', "[u'Bikur Ha-Tizmoret']", '2007', ' 2,459,543']
##['4/6/2008', "[u'Shine a Light']", '2008', ' 1,488,081']
##['4/6/2008', "[u'Race']", '2008', ' 1,327,606']
##['4/6/2008', "[u'Funny Games U.S.']", '2007', ' 1,274,055']
##['4/6/2008', "[u'4 luni, 3 saptamâni si 2 zile']", '2007', '
1,103,315']
##['4/6/2008', "[u'Married Life']", '2007', ' 1,002,318']
##['4/6/2008', "[u'Diary of the Dead']", '2007', ' 893,192']
##['4/6/2008', "[u'Starting Out in the Evening']", '2007', ' 882,518']
##['4/6/2008', "[u'Dolphins and Whales 3D: Tribes of the Ocean']",
'2008', ' 854,304']
##['4/6/2008', "[u'Sukkar banat']", '2007', ' 781,954']
##['4/6/2008', "[u'Bonneville']", '2006', ' 471,679']
##['4/6/2008', "[u'Flawless']", '2007', ' 390,892']
##['4/6/2008', "[u'Paranoid Park']", '2007', ' 387,119']
##['4/6/2008', "[u'Teeth']", '2007', ' 321,732']
##['4/6/2008', "[u'Hammer, The']", '2007', ' 321,579']
##['4/6/2008', "[u'Priceless']", '2008', ' 320,131']
##['4/6/2008', "[u'Steep']", '2007', ' 259,840']
##['4/6/2008', "[u'Honeydripper']", '2007', ' 259,192']
##['4/6/2008', "[u'Snow Angels']", '2007', ' 255,147']
##['4/6/2008', "[u'Taxi to the Dark Side']", '2007', ' 231,743']
##['4/6/2008', "[u'Cheung Gong 7 hou']", '2008', ' 188,067']
##['4/6/2008', "[u'Ne touchez pas la hache']", '2007', ' 184,513']
##['4/6/2008', "[u'Sleepwalking']", '2008', ' 160,715']
##['4/6/2008', "[u'Chicago 10']", '2007', ' 149,456']
##['4/6/2008', "[u'Girls Rock!']", '2007', ' 92,636']
##['4/6/2008', "[u'Beaufort']", '2007', ' 87,339']
##['4/6/2008', "[u'Shelter']", '2007', ' 85,928']
##['4/6/2008', "[u'My Blueberry Nights']", '2007', ' 74,146']
##['4/6/2008', "[u'Témoins, Les']", '2007', ' 71,624']
##['4/6/2008', "[u'Mépris, Le']", '1963', ' 70,761']
##['4/6/2008', "[u'Singing Revolution, The']", '2006', ' 66,482']
##['4/6/2008', "[u'Chop Shop']", '2007', ' 58,858']
##['4/6/2008', '[u"Chansons d\'amour, Les"]', '2007', ' 58,577']
##['4/6/2008', "[u'Praying with Lior']", '2007', ' 57,325']
##['4/6/2008', "[u'Yihe yuan']", '2006', ' 57,155']
##['4/6/2008', "[u'Casa de Alice, A']", '2007', ' 53,700']
##['4/6/2008', "[u'Blindsight']", '2006', ' 51,256']
##['4/6/2008', "[u'Boarding Gate']", '2007', ' 37,107']
##['4/6/2008', "[u'Voyage du ballon rouge, Le']", '2007', ' 35,222']
##['4/6/2008', "[u'Bill']", '2007', ' 35,201']
##['4/6/2008', "[u'Mio fratello è figlio unico']", '2007', '
34,138']
##['4/6/2008', "[u'Chapter 27']", '2007', ' 32,602']
##['4/6/2008', "[u'Meduzot']", '2007', ' 25,352']
##['4/6/2008', "[u'Shotgun Stories']", '2007', ' 25,346']
##['4/6/2008', "[u'Sconosciuta, La']", '2006', ' 18,569']
##['4/6/2008', "[u'Imaginary Witness: Hollywood and the Holocaust']",
'2004', ' 18,475']
##['4/6/2008', "[u'Irina Palm']", '2007', ' 14,214']
##['4/6/2008', "[u'Naissance des pieuvres']", '2007', ' 7,418']
##['4/6/2008', "[u'Four Letter Word, A']", '2007', ' 6,017']
##['4/6/2008', "[u'Tuya de hun shi']", '2006', ' 2,619']
 

Sanoski

The reason I ask about text files is the need to save the data
locally, and have it stored in a way where backups can easily be made.
Then if your computer crashes and you lose everything, but you have
the data files it uses backed up, you can just download the program,
extract the backed up data to a specific directory, and then it works
exactly the way it did before you lost it. I suppose a SQLite database
might solve this, but I'm not sure. I'm just getting started, and I
don't know too much about it yet.

I'm also still not sure how to download and associate the pictures
that each entry has for it. The main thing for me now is getting
started. It needs to get information from the web. In this case, it's
a simple XML feed. The one thing that seems like it would make it
easier is that every post to the feed is very consistent. Each header
starts with the letter A, which stands for Alpike Tech, followed by
the name of the class, the room number, the leading student, and his
GPA. All that is one line of text. But it's also a link to more
information. For example:

A Economics, 312, John Carbroil, 4.0

That's one whole post to the feed. Like I say, it's very simple and
consistent. Which should make this easier.

Eventually I want it to follow that link and grab information from
there too, but I'll worry about that later. Technically, if I figure
this first part out, that problem should take care of itself.





 

Mensanator

The reason I ask about text files is the need to save the data
locally, and have it stored in a way where backups can easily
be made.

Sure, you can always do that if you want. But if your target
is SQLite or MS-Access, those are files also, so they can be
backed up as easily as text files.
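
For example, a backup can be as simple as copying the file (a sketch;
the 'school.db' filename is invented):

import shutil

# the whole SQLite database lives in one file, so a backup is just a copy
shutil.copy('school.db', 'school_backup.db')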
Then if your computer crashes and you lose everything, but
you have the data files it uses backed up, you can just
download the program, extract the backed up data to a
specific directory, and then it works exactly the way it
did before you lost it. I suppose a SQLite database might
solve this, but I'm not sure.

It will. Remember, once in a database, you have value-added
features like filtering, sorting, etc. that you would have
to do yourself if you simply read in text files.
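
A rough illustration of that (the table layout and data here are
invented, not from the posts): the filtering and sorting are one SQL
statement instead of code you write yourself.

import sqlite3

con = sqlite3.connect('school.db')
cur = con.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS classes
               (course TEXT, room TEXT, student TEXT, gpa REAL)""")
cur.executemany("INSERT INTO classes VALUES (?,?,?,?)",
                [('Economics', 'Room 216', 'James Faker', 3.4),
                 ('Social Studies', 'Room 231', 'Brain Fictitious', 3.5)])
con.commit()

# filtering (WHERE) and sorting (ORDER BY) happen inside the database
for row in cur.execute("""SELECT course, student, gpa FROM classes
                          WHERE gpa >= 3.5 ORDER BY gpa DESC"""):
    print row
con.close()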
I'm just getting started, and I
don't know too much about it yet.

Trust me, a database is the way to go.
My preference is MS-Access, because I need it for work.
It is a great tool for learning databases because its
visual interface can make you productive BEFORE you learn
SQL.
I'm also still not sure how to download and associate the pictures
that each entry has for it.

See example at end of post.
The main thing for me now is getting
started. It needs to get information from the web. In this case,
it's a simple XML feed.

BeautifulSoup also has an XML parser. Go to their
web page and read the documentation.
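
For instance, a sketch only (the real feed's URL and tag names aren't
shown in this thread, so the ones below are invented); BeautifulSoup 3
exposes its XML parser as BeautifulStoneSoup:

import urllib2
from BeautifulSoup import BeautifulStoneSoup

# made-up URL standing in for the school's feed
xml = urllib2.urlopen('http://www.example.edu/gpa_feed.xml').read()
soup = BeautifulStoneSoup(xml)

for entry in soup.findAll('entry'):     # 'entry' is a guess at the feed's tag name
    print entry.string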
The one thing that seems that would
make it easier is every post to the feed is very consistent.
Each header starts with the letter A, which stands for Alpike
Tech, followed by the name of the class, the room number, the
leading student, and his GPA. All that is one line of text.
But it's also a link to more information. For example:

A Economics, 312, John Carbroil, 4.0
That's one whole post to the feed. Like I say, it's very
simple and consistent. Which should make this easier.

That's what you want for parsing: how to separate
a composite set of data. Simple cases can sometimes be
done with split(), complex ones with regular expressions.
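
A sketch of the split() route for a line that regular (assuming the
format really is that consistent):

line = "A Economics, 312, John Carbroil, 4.0"

header, room, student, gpa = [s.strip() for s in line.split(',')]
school, course = header.split(' ', 1)   # 'A' (Alpike Tech) and 'Economics'

print course, room, student, float(gpa)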
Eventually I want it to follow that link and grab information
from there too, but I'll worry about that later. Technically,
if I figure this first part out, that problem should take
care of itself.


A sample picture scraper:

from BeautifulSoup import BeautifulSoup
import urllib2
import urllib

#
# start by scraping the web page
#
the_url="http://members.aol.com/mensanator/OHIO/TheCobs.htm"
req = urllib2.Request(url=the_url)
f = urllib2.urlopen(req)
www = f.read()
soup = BeautifulSoup(www)
print soup.prettify()

#
# a simple page with pictures
#
##<html>
## <head>
## <title>
## Ohio - The Cobs!
## </title>
## </head>
## <body>
## <h1>
## Ohio Vacation Pictures - The Cobs!
## </h1>
## <hr />
## <img src="AUT_2784.JPG" />
## <br />
## WTF?
## <p>
## <img src="AUT_2764.JPG" />
## <br />
## This is surreal.
## </p>
## <p>
## <img src="AUT_2765.JPG" />
## <br />
## Six foot tall corn cobs made of concrete.
## </p>
## <p>
## <img src="AUT_2766.JPG" />
## <br />
## 109 of them, laid out like a modern Stonehenge.
## </p>
## <p>
## <img src="AUT_2769.JPG" />
## <br />
## With it's own Druid worshippers.
## </p>
## <p>
## <img src="AUT_2781.JPG" />
## <br />
## Cue the
## <i>
## Also Sprach Zarathustra
## </i>
## soundtrack.
## </p>
## <p>
## <img src="100_0887.JPG" />
## <br />
## Air & Space Museums are a dime a dozen.
## <br />
## But there's only
## <b>
## one
## </b>
## Cobs!
## </p>
## <p>
## </p>
## </body>
##</html>

#
# parse the page to find all the pictures (image tags)
#
the_pics = soup.findAll('img')

for i in the_pics:
  print i

##<img src="AUT_2784.JPG" />
##<img src="AUT_2764.JPG" />
##<img src="AUT_2765.JPG" />
##<img src="AUT_2766.JPG" />
##<img src="AUT_2769.JPG" />
##<img src="AUT_2781.JPG" />
##<img src="100_0887.JPG" />

#
# the pictures have no path, so they must be in the
# same directory as the web page
#
the_jpg_path="http://members.aol.com/mensanator/OHIO/"

#
# now with urllib, copy the picture files to the local
# hard drive, renaming with a sequence id at the same time
#
for i,j in enumerate(the_pics):
  p = the_jpg_path + j['src']
  q = 'C:\\scrape\\' + 'pic' + str(i).zfill(4) + '.jpg'
  urllib.urlretrieve(p,q)

#
# and here's the captured files
#
## C:\>dir scrape
## Volume in drive C has no label.
## Volume Serial Number is D019-C60D
##
## Directory of C:\scrape
##
## 05/17/2008 07:06 PM <DIR> .
## 05/17/2008 07:06 PM <DIR> ..
## 05/17/2008 07:05 PM 69,877 pic0000.jpg
## 05/17/2008 07:05 PM 71,776 pic0001.jpg
## 05/17/2008 07:05 PM 70,958 pic0002.jpg
## 05/17/2008 07:05 PM 69,261 pic0003.jpg
## 05/17/2008 07:05 PM 70,653 pic0004.jpg
## 05/17/2008 07:05 PM 70,564 pic0005.jpg
## 05/17/2008 07:05 PM 113,356 pic0006.jpg
## 7 File(s) 536,445 bytes
## 2 Dir(s) 27,823,570,944 bytes free
 

inhahe



People usually say BeautifulSoup for getting stuff from the web. I think I
tried it once and had some problem and gave up. But since this is XML, I
think all you'd need is xml.dom.minidom or xml.etree.ElementTree; I'm not
sure which is easier. See doc\python.chm in the Python directory to study
up on those. To grab the webpage to begin with you'd use urllib2. That
takes around one line of code. I wouldn't save it to a text file, because
text files aren't that good for random access, or for storing images. I'd
save it in a database. There are other database modules than SQLite, but
SQLite is a good one for simple projects like this, where you're just going
to be running one instance of the program at a time. SQLite is fast, and
it's the only one that doesn't require a separate database engine to be
installed and running.

Classes are just a way of organizing code (and a little more, but they
don't have a lot to do with saving stuff to a file).

I'm not clear on whether the GPA is available as text and an image, or just
an image. If it's just available as an image you're going to want to use
PIL (the Python Imaging Library). Btw, use float() to convert a textual GPA
to a number.

You'll have to learn some basics of the SQL language (that applies to any
database). Or maybe not with SQLObject or SQLAlchemy, but I don't know how
easy those are to learn. Or, if you don't want to learn SQL, you could use
a text file with fixed-length fields and perhaps references to individual
filenames that store the pictures, and I could tell you how to do that. But
a database is a lot more flexible, and you wouldn't need to learn much SQL
for the same purposes.

Btw, I used SQLite version 2 before and it didn't allow me to return query
results as dictionaries (i.e., indexable by field name), just lists of
values, except by using some strange code I found somewhere. But a list is
also easy to use. And if version 3 doesn't do it either and you want the
code, I have it.
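
A rough sketch tying those pieces together: urllib2 to fetch, ElementTree
to parse, sqlite3 to store. The feed URL and tag names below are invented,
since the real feed isn't shown; sqlite3.Row is the stock way to get query
results indexable by field name.

import urllib2
import sqlite3
import xml.etree.ElementTree as ET

# made-up URL standing in for the school's feed
xml_text = urllib2.urlopen('http://www.example.edu/gpa_feed.xml').read()
root = ET.fromstring(xml_text)

con = sqlite3.connect('school.db')
con.row_factory = sqlite3.Row               # rows become indexable by field name
cur = con.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS classes
               (course TEXT, room TEXT, student TEXT, gpa REAL)""")

for entry in root.findall('class'):         # 'class' is a guess at the entry tag
    cur.execute("INSERT INTO classes VALUES (?,?,?,?)",
                (entry.findtext('name'), entry.findtext('room'),
                 entry.findtext('student'), float(entry.findtext('gpa'))))
con.commit()

top = cur.execute("SELECT * FROM classes ORDER BY gpa DESC").fetchone()
print top['student'], top['gpa']            # field-name access via sqlite3.Row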
 

Martin Sand Christensen

inhahe> Btw, use float() to convert a textual GPA to a number.

It would be much better to use Decimal() instead of float(). A GPA of
3.6000000000000001 probably doesn't make much sense; this problem
doesn't arise when using the Decimal type.
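
For example (the long repr shows up on older Python 2.x builds):

from decimal import Decimal

print repr(float('3.6'))                 # 3.6000000000000001 on older 2.x builds
print Decimal('3.6')                     # 3.6, stored exactly as written
print Decimal('3.6') + Decimal('0.1')    # 3.7, no binary rounding surprise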

Martin
 
