Output of HTML parsing

Discussion in 'Python' started by Jackie, Jun 15, 2007.

  1. Jackie

    Jackie Guest

    Hi, all,

    I want to get the information of the professors (name,title) from the
    following link:

    "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

    Ideally, I'd like to have an output file where each line is one Prof,
    including his name and title. In practice, I use the csv module.

    The following is my program:


    --------------- Program ----------------------------------------------------

    import urllib, re, csv

    url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()

    namePattern = re.compile(r'class="name">(.*)</a>')
    titlePattern = re.compile(r'</a>,&nbsp;(.*)\s*</td>')

    name = namePattern.findall(htmlSource)
    title_temp = titlePattern.findall(htmlSource)
    title = []
    for item in title_temp:
        item_new = " ".join(item.split())  # Suppress the spaces between 'title' and </td>
        title.append(item_new)

    output = []
    for i in range(len(name)):
        output.append([name[i], title[i]])  # Generate a list of [name, title]

    writer = csv.writer(open("professor.csv", "wb"))
    writer.writerows(output)  # Output CSV file

    -------------- End of Program ----------------------------------------------

    My questions are:

    1. The code above assumes that each Prof has a title. If any one of them
    does not, the names and titles will be mismatched. How can I program it so
    that the title is allowed to be empty?

    2. Is there an easier way to get the data I want other than using lists?

    3. Should I close the opened CSV file ("professor.csv")? How do I close
    it?

    Thanks!

    Jackie
     
    Jackie, Jun 15, 2007
    #1

  2. [ Jackie <> ]
    > 1. The code above assumes that each Prof has a title. If any one of them
    > does not, the names and titles will be mismatched. How can I program it so
    > that the title is allowed to be empty?
    >
    > 2. Is there an easier way to get the data I want other than using lists?


    Use BeautifulSoup.
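
    A minimal sketch of that approach (this assumes the BeautifulSoup package;
    the markup below is a made-up stand-in for the faculty page, and the
    empty-title case from question 1 is handled explicitly):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking the structure the poster's regexes target.
html = """
<table>
  <tr><td><a class="name">Smith, John</a>,&nbsp;Professor</td></tr>
  <tr><td><a class="name">Doe, Jane</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for link in soup.find_all("a", class_="name"):
    name = link.get_text(strip=True)
    # Whatever text follows the </a> in the same cell is the title;
    # it may be absent, in which case we record an empty string.
    title = link.next_sibling or ""
    title = str(title).replace("\xa0", " ").lstrip(", ").strip()
    rows.append([name, title])
```

    Because each row is handled as a unit, a professor without a title simply
    gets an empty title field instead of shifting the whole list.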

    > 3. Should I close the opened CSV file ("professor.csv")? How do I close
    > it?


    Assign the file object to a separate name (e.g. stream) and then invoke its
    close method after writing all csv data to it.
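
    For example (a sketch of that pattern; the rows are placeholder data, and
    the open() mode shown is the Python 3 spelling of the original "wb"):

```python
import csv

# Placeholder rows; in the real script these come from the parsed page.
rows = [["Smith, John", "Professor"], ["Doe, Jane", ""]]

stream = open("professor.csv", "w", newline="")  # on Python 2: open(..., "wb")
writer = csv.writer(stream)
writer.writerows(rows)
stream.close()  # flushes buffered output and releases the file handle
```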

    --
    Freedom is always the freedom of dissenters.
    (Rosa Luxemburg)

     
    Sebastian Wiesner, Jun 15, 2007
    #2

  3. Jackie wrote:
    > I want to get the information of the professors (name,title) from the
    > following link:
    >
    > "http://www.economics.utoronto.ca/index.php/index/person/faculty/"


    That's even XHTML, no need to go through BeautifulSoup. Use lxml instead.

    http://codespeak.net/lxml


    > Ideally, I'd like to have an output file where each line is one Prof,
    > including his name and title. In practice, I use the csv module.
    > ----------------------------------------------------
    >
    > import urllib, re, csv
    >
    > url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
    >
    > sock = urllib.urlopen(url)
    > htmlSource = sock.read()
    > sock.close()


    import lxml.etree as et
    url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
    tree = et.parse(url)

    > namePattern = re.compile(r'class="name">(.*)</a>')
    > titlePattern = re.compile(r'</a>,&nbsp;(.*)\s*</td>')
    >
    > name = namePattern.findall(htmlSource)
    > title_temp = titlePattern.findall(htmlSource)
    > title = []
    > for item in title_temp:
    >     item_new = " ".join(item.split())  # Suppress the spaces between 'title' and </td>
    >     title.append(item_new)
    >
    >
    > output = []
    > for i in range(len(name)):
    >     output.append([name[i], title[i]])  # Generate a list of [name, title]


    # untested
    get_name_text = et.XPath('normalize-space(td[a/@class="name"])')
    name_list = []
    for name_row in tree.xpath('//tr[td/a/@class = "name"]'):
        name_list.append(
            tuple(get_name_text(name_row).split(",", 3) + ["", "", ""])[:3])


    > writer = csv.writer(open("professor.csv", "wb"))
    > writer.writerows(output) #output CSV file


    writer = csv.writer(open("professor.csv", "wb"))
    writer.writerows(name_list) #output CSV file
    > -------------- End of Program
    > ----------------------------------------------
    >
    > 3. Should I close the opened CSV file ("professor.csv")? How do I close
    > it?


    I guess it has a "close()" function?

    Stefan
     
    Stefan Behnel, Jun 15, 2007
    #3
  4. Jackie

    Jackie Guest

    On Jun 15, 2:01 pm, Stefan Behnel <> wrote:
    > Jackie wrote:


    > import lxml.etree as et
    > url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
    > tree = et.parse(url)
    >


    > Stefan

    Thank you. But when I tried to run the above part, the following
    message showed up:

    Traceback (most recent call last):
      File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
        tree = et.parse(url)
      File "etree.pyx", line 1845, in etree.parse
      File "parser.pxi", line 928, in etree._parseDocument
      File "parser.pxi", line 932, in etree._parseDocumentFromURL
      File "parser.pxi", line 849, in etree._parseDocFromFile
      File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
      File "parser.pxi", line 631, in etree._handleParseResult
      File "parser.pxi", line 602, in etree._raiseParseError
    etree.XMLSyntaxError: line 2845: Premature end of data in tag html, line 8

    Could you please tell me what went wrong?

    Thank you

    Jackie
     
    Jackie, Jun 19, 2007
    #4
  5. Jackie wrote:
    > On Jun 15, 2:01 pm, Stefan Behnel <> wrote:
    >> Jackie wrote:

    >
    >> import lxml.etree as et
    >> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
    >> tree = et.parse(url)
    >>

    >
    >> Stefan

    >
    > Thank you. But when I tried to run the above part, the following
    > message showed up:
    >
    > Traceback (most recent call last):
    >   File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
    >     tree = et.parse(url)
    >   File "etree.pyx", line 1845, in etree.parse
    >   File "parser.pxi", line 928, in etree._parseDocument
    >   File "parser.pxi", line 932, in etree._parseDocumentFromURL
    >   File "parser.pxi", line 849, in etree._parseDocFromFile
    >   File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
    >   File "parser.pxi", line 631, in etree._handleParseResult
    >   File "parser.pxi", line 602, in etree._raiseParseError
    > etree.XMLSyntaxError: line 2845: Premature end of data in tag html, line 8
    >
    > Could you please tell me what went wrong?


    Ah, ok, then the page is not actually XHTML, but broken HTML. Use this idiom
    instead:

    parser = et.HTMLParser()
    tree = et.parse(url, parser)
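
    A sketch of that idiom applied to an in-memory string instead of the live
    URL (the markup below is invented for illustration; it deliberately leaves
    tags unclosed, which is the kind of input that makes plain et.parse()
    raise XMLSyntaxError):

```python
import lxml.etree as et
from io import BytesIO

# Broken, non-XHTML markup: the <html>, <body>, <table>, etc. are never closed.
broken_html = b"""
<html><body>
<table>
<tr><td><a class="name">Smith, John</a>,&nbsp;Professor</td></tr>
"""

parser = et.HTMLParser()  # libxml2's forgiving HTML parser; recovers from errors
tree = et.parse(BytesIO(broken_html), parser)

# The recovered tree can then be queried as usual.
names = tree.xpath('//a[@class="name"]/text()')
```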

    Stefan
     
    Stefan Behnel, Jun 19, 2007
    #5
