Output of HTML parsing

Discussion in 'Python' started by Jackie, Jun 15, 2007.

  1. Jackie

    Jackie Guest

    Hi, all,

    I want to get the information of the professors (name,title) from the
    following link:

    "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

    Ideally, I'd like to have an output file where each line is one Prof,
    including his name and title. In practice, I use the csv module.

    The following is my program:


    --------------- Program ----------------------------------------------------

    import urllib, re, csv

    url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()

    namePattern = re.compile(r'class="name">(.*)</a>')
    titlePattern = re.compile(r'</a>,&nbsp;(.*)\s*</td>')

    name = namePattern.findall(htmlSource)
    title_temp = titlePattern.findall(htmlSource)
    title = []
    for item in title_temp:
        item_new = " ".join(item.split())  # Suppress the spaces between 'title' and </td>
        title.append(item_new)

    output = []
    for i in range(len(name)):
        output.append([name[i], title[i]])  # Generate a list of [name, title]

    writer = csv.writer(open("professor.csv", "wb"))
    writer.writerows(output)  # Output CSV file

    -------------- End of Program ----------------------------------------------

    My questions are:

    1. The code above assumes that each Prof has a title. If any one of them
    does not, the names and titles will be mismatched. How can I program it so
    that the title is allowed to be empty?

    2. Is there an easier way to get the data I want other than using lists?

    3. Should I close the opened CSV file ("professor.csv")? How do I close
    it?

    Thanks!

    Jackie
     
    Jackie, Jun 15, 2007
    #1

  2. [ Jackie <> ]
    > 1. The code above assumes that each Prof has a title. If any one of them
    > does not, the names and titles will be mismatched. How can I program it so
    > that the title is allowed to be empty?
    >
    > 2. Is there an easier way to get the data I want other than using lists?


    Use BeautifulSoup.
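
    A minimal sketch of that approach (this assumes the BeautifulSoup package;
    the markup below is a made-up stand-in for the faculty page, and the
    empty-title case from question 1 is handled explicitly):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking the structure the poster's regexes target.
html = """
<table>
  <tr><td><a class="name">Smith, John</a>,&nbsp;Professor</td></tr>
  <tr><td><a class="name">Doe, Jane</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for link in soup.find_all("a", class_="name"):
    name = link.get_text(strip=True)
    # Whatever text follows the </a> in the same cell is the title;
    # it may be absent, in which case we record an empty string.
    title = link.next_sibling or ""
    title = str(title).replace("\xa0", " ").lstrip(", ").strip()
    rows.append([name, title])
```

    Because each row is handled as a unit, a professor without a title simply
    gets an empty title field instead of shifting the whole list.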

    > 3. Should I close the opened CSV file ("professor.csv")? How do I close
    > it?


    Assign the file object to a separate name (e.g. stream) and then invoke its
    close method after writing all csv data to it.
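
    For example (a sketch of that pattern; the rows are placeholder data, and
    the open() mode shown is the Python 3 spelling of the original "wb"):

```python
import csv

# Placeholder rows; in the real script these come from the parsed page.
rows = [["Smith, John", "Professor"], ["Doe, Jane", ""]]

stream = open("professor.csv", "w", newline="")  # on Python 2: open(..., "wb")
writer = csv.writer(stream)
writer.writerows(rows)
stream.close()  # flushes buffered output and releases the file handle
```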

    --
    Freedom is always the freedom of dissenters.
    (Rosa Luxemburg)

     
    Sebastian Wiesner, Jun 15, 2007
    #2

  3. Jackie wrote:
    > I want to get the information of the professors (name,title) from the
    > following link:
    >
    > "http://www.economics.utoronto.ca/index.php/index/person/faculty/"


    That's even XHTML, no need to go through BeautifulSoup. Use lxml instead.

    http://codespeak.net/lxml


    > Ideally, I'd like to have an output file where each line is one Prof,
    > including his name and title. In practice, I use the csv module.
    > ----------------------------------------------------
    >
    > import urllib, re, csv
    >
    > url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
    >
    > sock = urllib.urlopen(url)
    > htmlSource = sock.read()
    > sock.close()


    import lxml.etree as et
    url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
    tree = et.parse(url)

    > namePattern = re.compile(r'class="name">(.*)</a>')
    > titlePattern = re.compile(r'</a>,&nbsp;(.*)\s*</td>')
    >
    > name = namePattern.findall(htmlSource)
    > title_temp = titlePattern.findall(htmlSource)
    > title = []
    > for item in title_temp:
    >     item_new = " ".join(item.split())  # Suppress the spaces between 'title' and </td>
    >     title.append(item_new)
    >
    >
    > output = []
    > for i in range(len(name)):
    >     output.append([name[i], title[i]])  # Generate a list of [name, title]


    # untested
    get_name_text = et.XPath('normalize-space(td[a/@class="name"])')
    name_list = []
    for name_row in tree.xpath('//tr[td/a/@class = "name"]'):
        name_list.append(
            tuple(get_name_text(name_row).split(",", 3) + ["", "", ""])[:3])


    > writer = csv.writer(open("professor.csv", "wb"))
    > writer.writerows(output) #output CSV file


    writer = csv.writer(open("professor.csv", "wb"))
    writer.writerows(name_list) #output CSV file
    > -------------- End of Program
    > ----------------------------------------------
    >
    > 3. Should I close the opened CSV file ("professor.csv")? How do I close
    > it?


    I guess it has a "close()" function?

    Stefan
     
    Stefan Behnel, Jun 15, 2007
    #3
  4. Jackie

    Jackie Guest

    On Jun 15, 2:01 pm, Stefan Behnel <> wrote:
    > Jackie wrote:


    > import lxml.etree as et
    > url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
    > tree = et.parse(url)
    >


    > Stefan

    Thank you. But when I tried to run the above part, the following
    message showed up:

    Traceback (most recent call last):
      File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
        tree = et.parse(url)
      File "etree.pyx", line 1845, in etree.parse
      File "parser.pxi", line 928, in etree._parseDocument
      File "parser.pxi", line 932, in etree._parseDocumentFromURL
      File "parser.pxi", line 849, in etree._parseDocFromFile
      File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
      File "parser.pxi", line 631, in etree._handleParseResult
      File "parser.pxi", line 602, in etree._raiseParseError
    etree.XMLSyntaxError: line 2845: Premature end of data in tag html, line 8

    Could you please tell me what went wrong?

    Thank you

    Jackie
     
    Jackie, Jun 19, 2007
    #4
  5. Jackie wrote:
    > On Jun 15, 2:01 pm, Stefan Behnel <> wrote:
    >> Jackie wrote:

    >
    >> import lxml.etree as et
    >> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
    >> tree = et.parse(url)
    >>

    >
    >> Stefan

    >
    > Thank you. But when I tried to run the above part, the following
    > message showed up:
    >
    > Traceback (most recent call last):
    >   File "D:\TS\Python\workspace\eco_department\lxml_ver.py", line 3, in <module>
    >     tree = et.parse(url)
    >   File "etree.pyx", line 1845, in etree.parse
    >   File "parser.pxi", line 928, in etree._parseDocument
    >   File "parser.pxi", line 932, in etree._parseDocumentFromURL
    >   File "parser.pxi", line 849, in etree._parseDocFromFile
    >   File "parser.pxi", line 557, in etree._BaseParser._parseDocFromFile
    >   File "parser.pxi", line 631, in etree._handleParseResult
    >   File "parser.pxi", line 602, in etree._raiseParseError
    > etree.XMLSyntaxError: line 2845: Premature end of data in tag html, line 8
    >
    > Could you please tell me what went wrong?


    Ah, ok, then the page is not actually XHTML, but broken HTML. Use this idiom
    instead:

    parser = et.HTMLParser()
    tree = et.parse(url, parser)
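
    A sketch of that idiom applied to an in-memory string instead of the live
    URL (the markup below is invented for illustration; it deliberately leaves
    tags unclosed, which is the kind of input that makes plain et.parse()
    raise XMLSyntaxError):

```python
import lxml.etree as et
from io import BytesIO

# Broken, non-XHTML markup: the <html>, <body>, <table>, etc. are never closed.
broken_html = b"""
<html><body>
<table>
<tr><td><a class="name">Smith, John</a>,&nbsp;Professor</td></tr>
"""

parser = et.HTMLParser()  # libxml2's forgiving HTML parser; recovers from errors
tree = et.parse(BytesIO(broken_html), parser)

# The recovered tree can then be queried as usual.
names = tree.xpath('//a[@class="name"]/text()')
```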

    Stefan
     
    Stefan Behnel, Jun 19, 2007
    #5
