python - firefox dom/xpath question/issue

Discussion in 'Python' started by bruce, Aug 25, 2008.

  1. bruce

    bruce Guest

    Hi.

    Got a test web page, that basically has two "<html" tags in it. Examining
    the page via Firefox/Dom Inspector, I can create a test xpath query
    "/html/body/form" which gets the target form for the test.

    The issue comes when I examine the page's source html. It looks like:
    <html>
    <body>
    </body>
    </html>

    <html>
    <body>
    ..
    ..
    ..
    </body>
    </html>

    I've simplified things a bit... but basically, the 1st "html/body" is empty,
    with the 2nd containing the data/nodes I need.

    Using xpath("/html/body/form"), the app returns nothing or crashes. I've
    also tried something like xpath("/html[position()=0]"), with no luck. It's
    as if xpath only looks at the 1st html that it sees in a given page. I
    can't find anything in the xpath docs to work around this. I'm using
    libxml2dom with Python 2.5.1.

    Any thoughts/comments...

    If I comment out the 1st html section, things work as they should. The test
    code is below...

    thanks

    ------------------------------------------
    #!/usr/bin/python
    #
    # test.py
    #
    # scrapes/extracts the basic data for the college
    #
    #
    # the app gets/stores
    # name
    # url
    # address (street/city/state
    # phone
    #
    ######################################################################3
    #test python script
    import re
    import libxml2dom
    import urllib
    import urllib2
    import sys, string
    from mechanize import Browser
    import mechanize
    #import tidy
    import os.path
    import cookielib
    from libxml2dom import Node
    from libxml2dom import NodeList
    import subprocess
    import time

    ########################
    #
    # Parse pricegrabber.com
    ########################
    ##cj = "p"
    ##COOKIEFILE = 'cookies.lwp'
    #cookielib = 1


    urlopen = urllib2.urlopen
    Request = urllib2.Request
    br = Browser()
    br2 = Browser()

    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values1 = {'name' : 'Michael Foord',
               'location' : 'Northampton',
               'language' : 'Python' }
    headers = { 'User-Agent' : user_agent }

    url="http://schedule.psu.edu/"
    #=======================================


    if __name__ == "__main__":
        # main app

        txdata = None

        #----------------------------

        ##br.set_cookiejar(cj)
        br.set_handle_redirect(True)
        br.set_handle_referer(True)
        br.set_handle_robots(False)
        br.addheaders = [('User-Agent', 'Firefox')]

        print "url =",url
        #br.open(url)
        ##cj.save(COOKIEFILE) # resave cookies

        #res = br.response() # this is a copy of response
        #s = res.read()
        #print "slen=",len(s)
        tfile = open("/college/psu1.dat")
        s = tfile.read()
        print s


        # s contains HTML not XML text
        d=[]
        d = libxml2dom.parseString(s, html=1)
        print "d",d

        name_=[]
        len_=0

        br.open(url)
        ##cj.save(COOKIEFILE) # resave cookies

        #res = br.response() # this is a copy of response
        #s = res.read()
        print "slen=",len(s)

        # s contains HTML not XML text
        #d=[]
        #d = libxml2dom.parseString(s, html=1)
        #print "d",d

        #name_ = d.xpath("//form")
        name_ = d.xpath("/html/body/form")
        len_ = len(name_)
        print "len=",len_

        print "name1",name_
        print "len",len(name_)
        #print "sdlfs"
        sys.exit()
        # else:
        # print "err in form_ID"


        print "here..."
     
    bruce, Aug 25, 2008
    #1

  2. bruce wrote:
    > Hi.
    >
    > Got a test web page, that basically has two "<html" tags in it. Examining
    > the page via Firefox/Dom Inspector, I can create a test xpath query
    > "/html/body/form" which gets the target form for the test.
    >
    > The issue comes when I examine the page's source html. It looks like:
    > <html>
    > <body>
    > </body>
    > </html>
    >
    > <html>
    > <body>
    > .
    > .
    > .
    > </body>
    > </html>
    >
    > I've simplified things a bit... but basically, the 1st "html/body" is empty,
    > with the 2nd containing the data/nodes I need.


    If that's your document, it is invalid XML - XML only allows *one* root.
    Thus the parser's failure isn't too surprising.

    Try wrapping the whole document under an arbitrary root tag, and include
    that as the first part of the xpath. See if that helps.
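
A minimal sketch of that wrapping idea, using the standard library's xml.etree.ElementTree rather than libxml2dom (a swapped-in parser, chosen only so the example is self-contained). The sample markup, the `wrapper` tag name and the `wrap_in_root` helper are hypothetical, and the sample is assumed to be well-formed; real-world HTML often is not, in which case a strict XML parser would reject it even after wrapping. Note that XPath positions start at 1, so the second html element is `html[2]`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page: two top-level <html> elements,
# the first empty, the second holding the form we want.
page = """
<html>
<body>
</body>
</html>

<html>
<body>
<form name="target"></form>
</body>
</html>
"""


def wrap_in_root(markup, root_tag="wrapper"):
    """Enclose the markup in a single arbitrary root element."""
    return "<%s>%s</%s>" % (root_tag, markup, root_tag)


# fromstring() returns the wrapper element itself, so the paths below
# are relative to it and start directly at its html children.
doc = ET.fromstring(wrap_in_root(page))

# Matches forms under any of the html elements.
forms = doc.findall("html/body/form")

# Positions are 1-based: html[2] is the second html element.
second_only = doc.findall("html[2]/body/form")

print(len(forms))
print(forms[0].get("name"))
```

With a full XPath engine the same query would be written with the wrapper as the first step, for example "/wrapper/html[2]/body/form".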

    Diez
     
    Diez B. Roggisch, Aug 25, 2008
    #2