python - firefox dom/xpath question/issue

bruce

Hi.

I've got a test web page that basically has two "<html>" tags in it. Examining
the page via Firefox/DOM Inspector, I can create a test XPath query,
"/html/body/form", which gets the target form for the test.

The issue comes when I examine the page's source html. It looks like:
<html>
<body>
</body>
</html>

<html>
<body>
..
..
..
</body>
</html>

I've simplified things a bit... but basically, the 1st "html/body" is empty,
with the 2nd containing the data/nodes I need.

When I use xpath("/html/body/form"), the app either returns nothing or crashes. I've
also tried something like xpath("/html[position()=0]"), with no
luck. It's as if xpath only looks at the first html element it sees in a given
page. I can't find any xpath docs that describe a workaround. I'm using
libxml2dom with Python 2.5.1.

Any thoughts/comments...

If I comment out the 1st html section, things work as they should. The test
code is below...

thanks

------------------------------------------
#!/usr/bin/python
#
# test.py
#
# scrapes/extracts the basic data for the college
#
#
# the app gets/stores
# name
# url
# address (street/city/state
# phone
#
######################################################################3
#test python script
import re
import libxml2dom
import urllib
import urllib2
import sys, string
from mechanize import Browser
import mechanize
#import tidy
import os.path
import cookielib
from libxml2dom import Node
from libxml2dom import NodeList
import subprocess
import time

########################
#
# Parse pricegrabber.com
########################
##cj = "p"
##COOKIEFILE = 'cookies.lwp'
#cookielib = 1


urlopen = urllib2.urlopen
Request = urllib2.Request
br = Browser()
br2 = Browser()

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values1 = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

url="http://schedule.psu.edu/"
#=======================================


if __name__ == "__main__":
    # main app

    txdata = None

    #----------------------------

    ##br.set_cookiejar(cj)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.addheaders = [('User-Agent', 'Firefox')]

    print "url =",url
    #br.open(url)
    ##cj.save(COOKIEFILE) # resave cookies

    #res = br.response() # this is a copy of response
    #s = res.read()
    #print "slen=",len(s)
    tfile = open("/college/psu1.dat")
    s = tfile.read()
    print s


    # s contains HTML not XML text
    d=[]
    d = libxml2dom.parseString(s, html=1)
    print "d",d

    name_=[]
    len_=0

    br.open(url)
    ##cj.save(COOKIEFILE) # resave cookies

    #res = br.response() # this is a copy of response
    #s = res.read()
    print "slen=",len(s)

    # s contains HTML not XML text
    #d=[]
    #d = libxml2dom.parseString(s, html=1)
    #print "d",d

    #name_ = d.xpath("//form")
    name_ = d.xpath("/html/body/form")
    len_ = len(name_)
    print "len=",len_

    print "name1",name_
    print "len",len(name_)
    #print "sdlfs"
    sys.exit()
    # else:
    # print "err in form_ID"


    print "here..."
 
Diez B. Roggisch

bruce said:
Hi.

Got a test web page, that basically has two "<html" tags in it. Examining
the page via Firefox/Dom Inspector, I can create a test xpath query
"/html/body/form" which gets the target form for the test.

The issue comes when I examine the page's source html. It looks like:
<html>
<body>
</body>
</html>

<html>
<body>
.
.
.
</body>
</html>

I've simplified things a bit... but basically, the 1st "html/body" is empty,
with the 2nd containing the data/nodes I need.

If that's your document, it is invalid XML - XML only allows *one* root.
Thus the parser's failure isn't too surprising.

Try to wrap the whole document under an arbitrary root tag, and include
that as the first part of the xpath. See if that helps.
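
A minimal sketch of the wrapping idea, not tested against libxml2dom. It uses the standard library's xml.etree.ElementTree as a stand-in, which only accepts well-formed markup, so real-world HTML may still need a forgiving parser. The PAGE string, the "wrapper" tag name and the wrap_and_parse helper are made up for illustration. Note also that XPath positions start at 1, so the second html element is html[2]; position()=0 never matches anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page source: two sibling <html> documents,
# the first one empty and the second one holding the form.
PAGE = """
<html>
<body>
</body>
</html>

<html>
<body>
<form name="search"><input name="q"/></form>
</body>
</html>
"""


def wrap_and_parse(source, wrapper="wrapper"):
    """Put `source` under one arbitrary root element and parse the result."""
    wrapped = "<%s>%s</%s>" % (wrapper, source, wrapper)
    return ET.fromstring(wrapped)


root = wrap_and_parse(PAGE)

# Paths are relative to the wrapper element, which is now the single root.
forms = root.findall("html/body/form")             # forms under any html child
first_only = root.findall("html[1]/body/form")     # first html: no forms
second_only = root.findall("html[2]/body/form")    # second html: the target form

print(len(forms), forms[0].get("name"))
```

If this carries over to a real XPath engine, the absolute form of the query would start with the wrapper tag, along the lines of "/wrapper/html[2]/body/form".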

Diez
 
