Output of HTML parsing

Jackie

Hi, all,

I want to get the information of the professors (name,title) from the
following link:

"http://www.economics.utoronto.ca/index.php/index/person/faculty/"

Ideally, I'd like an output file in which each line holds one professor's
name and title. To write it, I use the csv module.

The following is my program:


--------------- Program ----------------------------------------------------

import urllib,re,csv

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()

namePattern = re.compile(r'class="name">(.*)</a>')
titlePattern = re.compile(r'</a>,&nbsp;(.*)\s*</td>')

name = namePattern.findall(htmlSource)
title_temp = titlePattern.findall(htmlSource)
title =[]
for item in title_temp:
    # Suppress the spaces between 'title' and </td>
    item_new=" ".join(item.split())
    title.extend([item_new])

output =[]
for i in range(len(name)):
    # Generate a list of [name, title]
    output.insert(i,[name[i],title[i]])

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(output) #output CSV file

-------------- End of Program ----------------------------------------------

My questions are:

1. The code above assumes that each Prof has a title. If any one of them
does not, the names and titles will be mismatched. How can I allow the
title to be empty?

2. Is there an easier way to get the data I want than using lists?

3. Should I close the opened CSV file ("professor.csv")? How do I close
it?

Thanks!

Jackie
 
Sebastian Wiesner

Jackie said:
1. The code above assumes that each Prof has a title. If any one of them
does not, the names and titles will be mismatched. How can I allow the
title to be empty?

2. Is there an easier way to get the data I want than using lists?

Use BeautifulSoup.
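
A rough sketch of what that could look like, assuming Python 3 and the bs4 package (both newer than this thread). The markup below is made up to match the regular expressions in the question, so the real page may need different selectors. Taking the name and the title from the same table cell means a missing title simply becomes an empty string instead of shifting the two lists against each other:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the regexes in the question;
# the second professor deliberately has no title.
html = """
<table>
<tr><td><a class="name" href="#">Alice Adams</a>,&nbsp;Professor   </td></tr>
<tr><td><a class="name" href="#">Bob Brown</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for link in soup.find_all("a", class_="name"):
    # Collapse runs of whitespace (including the non-breaking space).
    name = " ".join(link.get_text().split())
    cell_text = " ".join(link.parent.get_text().split())
    # Whatever follows the name in the same cell is the title, if any.
    title = cell_text[len(name):].lstrip(", ")
    rows.append([name, title])

print(rows)
```

Each entry of rows is one [name, title] pair, ready to hand to csv.writer(...).writerows().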

Jackie said:
3. Should I close the opened CSV file ("professor.csv")? How do I close
it?

Assign the file object to a separate name (e.g. stream) and then invoke its
close method after writing all csv data to it.
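
A minimal sketch of that advice, assuming Python 3, where the csv module wants the file opened in text mode with newline="" rather than "wb". The sample rows and the temporary path are made up for illustration. The second half shows the with statement, which calls close() for you:

```python
import csv
import os
import tempfile

# Hypothetical sample rows; the real ones would come from the parsed page.
rows = [["Prof. A", "Professor"], ["Prof. B", ""]]

# Written to the temp directory here so the sketch does not clobber real files.
path = os.path.join(tempfile.gettempdir(), "professor.csv")

# Explicit close: keep a name for the file object, then close it.
stream = open(path, "w", newline="")
writer = csv.writer(stream)
writer.writerows(rows)
stream.close()

# Alternative: the with statement closes the file even if writing raises.
with open(path, "w", newline="") as stream:
    csv.writer(stream).writerows(rows)
```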

--
Freedom is always the freedom of dissenters.
(Rosa Luxemburg)

 
Stefan Behnel

Jackie said:
I want to get the information of the professors (name,title) from the
following link:

"http://www.economics.utoronto.ca/index.php/index/person/faculty/"

That's even XHTML, no need to go through BeautifulSoup. Use lxml instead.

http://codespeak.net/lxml

Jackie said:
Ideally, I'd like to have an output file where each line is one Prof,
including his name and title. In practice, I use the CSV module.

import csv
import lxml.etree as et

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
tree = et.parse(url)

# untested
get_name_text = et.XPath('normalize-space(td[a/@class="name"])')
name_list = []
for name_row in tree.xpath('//tr[td/a/@class = "name"]'):
    # Pad with empty strings so every row has exactly three fields
    name_list.append(
        tuple(get_name_text(name_row).split(",", 3) + ["","",""])[:3] )

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(name_list) #output CSV file

I guess it has a "close()" function?

Stefan
 
Jackie

Stefan Behnel said:
import lxml.etree as et
url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
tree = et.parse(url)

Stefan,

Thank you. But when I tried to run the above part, the following
message showed up:

Traceback (most recent call last):
  File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
    tree = et.parse(url)
  File "etree.pyx", line 1845, in etree.parse
  File "parser.pxi", line 928, in etree._parseDocument
  File "parser.pxi", line 932, in etree._parseDocumentFromURL
  File "parser.pxi", line 849, in etree._parseDocFromFile
  File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
  File "parser.pxi", line 631, in etree._handleParseResult
  File "parser.pxi", line 602, in etree._raiseParseError
etree.XMLSyntaxError: line 2845: Premature end of data in tag html line 8

Could you please tell me what went wrong?

Thank you

Jackie
 
Stefan Behnel

Jackie said:
Thank you. But when I tried to run the above part, the following
message showed up:

Traceback (most recent call last):
  File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
    tree = et.parse(url)
  File "etree.pyx", line 1845, in etree.parse
  File "parser.pxi", line 928, in etree._parseDocument
  File "parser.pxi", line 932, in etree._parseDocumentFromURL
  File "parser.pxi", line 849, in etree._parseDocFromFile
  File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
  File "parser.pxi", line 631, in etree._handleParseResult
  File "parser.pxi", line 602, in etree._raiseParseError
etree.XMLSyntaxError: line 2845: Premature end of data in tag html line 8

Could you please tell me what went wrong?

Ah, ok, then the page is not actually XHTML, but broken HTML. Use this idiom
instead:

parser = et.HTMLParser()
tree = et.parse(url, parser)
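
To see the difference without hitting the network, here is a small sketch, assuming Python 3 and a recent lxml. The markup is invented: it leaves the table, body and html tags unclosed, roughly the kind of damage the traceback above complains about, and the second row has no title. The variable names are only for illustration:

```python
from io import StringIO
import lxml.etree as et

# Hypothetical broken page: <table>, <body> and <html> are never closed.
broken = """<html><body><table>
<tr><td><a class="name" href="#">Alice Adams</a>,&nbsp;Professor  </td></tr>
<tr><td><a class="name" href="#">Bob Brown</a></td></tr>
"""

# The default XML parser rejects this input.
try:
    et.parse(StringIO(broken))
    xml_ok = True
except et.XMLSyntaxError:
    xml_ok = False

# The HTML parser recovers and still builds a tree.
parser = et.HTMLParser()
tree = et.parse(StringIO(broken), parser)

get_text = et.XPath("normalize-space(.)")
rows = []
for cell in tree.xpath('//td[a/@class="name"]'):
    # Split at the first comma; a missing title yields an empty string.
    name, _, title = get_text(cell).partition(",")
    # strip() also removes the non-breaking space that &nbsp; turns into.
    rows.append([name.strip(), title.strip()])

print(xml_ok, rows)
```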

Stefan
 