python mechanize/libxml2dom question


bruce

hi...

i've got the following situation, with the following test url:
"http://schedule.psu.edu/soc/fall/Alloz/a-c/acctg.html#".

i can generate a list of the tables i want for the courses on the page.
however, when i try to create the xpath query, and plug it into the xpath
within python, i'm missing something. if i have a parent xpath query, that
generates a list of results/nodes... how can i then use the individual
parent node, and trigger off of it, to get further information.

i tried using the following chunk of code with no luck.

# s is the html from the course file
d = libxml2dom.parseString(s, html=1)

# at this point, we should have a valid "d" representation
print "sdddd=",s

aa=libxml2dom.toString(d)
print "hereeeeee \n\n\n"
print "aa",aa
#sys.exit()

# **** course names

cpath='//table[position()>0]/descendant::td[position()=2][@width="85%"]/../td[1]/font/a[2]/text()'

cpath_=[]
cpath_=d.xpath(cpath)

print "len=",len(cpath_)
if len(cpath_)>0:

    for cpath in cpath_:
        #get the coursename info
        cname=cpath.toString()
        print "cpath=",cpath
        print "cname=",cname
        rr="./../../../../../../following-sibling::table//tr[position()>1]"

        rr=cpath.xpath()
        print "rrlen=",len(rr)
        print rr[0].toString()
        sys.exit()


i'm assuming that there's a libxml2node method that will do what i need that
i'm missing...

pointers/comments would be helpful here...

thanks!
 

Stefan Behnel

bruce said:
i've got the following situation, with the following test url:
"http://schedule.psu.edu/soc/fall/Alloz/a-c/acctg.html#".

i can generate a list of the tables i want for the courses on the page.
however, when i try to create the xpath query, and plug it into the xpath
within python, i'm missing something. if i have a parent xpath query, that
generates a list of results/nodes... how can i then use the individual
parent node, and trigger off of it, to get further information.
[code example stripped]

You should really use lxml. It has callable XPath objects that feel like
Python functions, and its Element objects have a getparent() method that gets
you to the parent of the node. Plus, text strings that you get back from an
XPath evaluation also have a getparent() method that returns the Element
object that holds the text. I think that's what you were looking for.
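
As a rough illustration of the lxml features mentioned above, here is a minimal sketch. The markup in PAGE is made up for the example (it is not the real schedule page), and the names find_course_text, link, cell and row are only illustrative.

```python
from lxml import etree, html

# Hypothetical markup, loosely modelled on a course heading table.
PAGE = """
<html><body>
<table>
  <tr>
    <td width="15%"><font><a name="a1"></a><a href="#">ACCTG 211</a></font></td>
    <td width="85%">Financial and Managerial Accounting</td>
  </tr>
</table>
</body></html>
"""

doc = html.fromstring(PAGE)

# A compiled XPath object is called like a function on any node.
find_course_text = etree.XPath("//td[1]/font/a[2]/text()")
texts = find_course_text(doc)

# Text results remember the element that holds them.
link = texts[0].getparent()           # the <a> element containing the text
cell = link.getparent().getparent()   # <font>, then the enclosing <td>
row = cell.getparent()                # the <tr> holding both cells
```

From row (or any other element reached this way) a further relative query can be issued with row.xpath("...").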

Stefan
 

Paul Boddie

i've got the following situation, with the following test url:
"http://schedule.psu.edu/soc/fall/Alloz/a-c/acctg.html#".

i can generate a list of the tables i want for the courses on the page.
however, when i try to create the xpath query, and plug it into the xpath
within python, i'm missing something. if i have a parent xpath query, that
generates a list of results/nodes... how can i then use the individual
parent node, and trigger off of it, to get further information.

You can always use the parentNode property on the nodes you get as
results from the XPath query, but I guess what you want to do is to
"rewind" and issue queries relative to some ancestor of the result
nodes.
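
To show what "rewinding" with parentNode looks like, here is a small sketch. It uses the standard library's xml.dom.minidom rather than libxml2dom, on the assumption that both expose the same DOM-style parentNode property; the fragment and the ancestor() helper are hypothetical.

```python
from xml.dom import minidom

# Hypothetical, well-formed fragment; minidom needs XML, not tag soup.
doc = minidom.parseString(
    "<table><tr><td><font><a>x</a><a>ACCTG 211</a></font></td></tr></table>"
)

def ancestor(node, name):
    """Follow parentNode links until an element called `name` is found."""
    node = node.parentNode
    while node is not None and node.nodeName != name:
        node = node.parentNode
    return node  # None if no such ancestor exists

# Start from the text node inside the second link, as the XPath result would.
text_node = doc.getElementsByTagName("a")[1].firstChild
table = ancestor(text_node, "table")
```

Walking up by name like this avoids counting "../" steps by hand, which is easy to get wrong when text nodes are the starting point.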

[...]
# **** course names

cpath='//table[position()>0]/descendant::td[position()=2][@width="85%"]/../td[1]/font/a[2]/text()'

This obviously gets you right down to the hyperlink text within a part
of the table. However, it may be easier to break this query up in
order to get a more manageable overview of the process. My
understanding of the above query is that it can first be rewritten as
the following:

cpath = "//table//td[position()=2 and @width='85%']/../td[1]/font/a[2]/text()"

Or even this:

cpath = "//table[.//td[position()=2 and @width='85%']]//td[1]/font/a[2]/text()"

But what you could do is to obtain the important tables first:

tables = d.xpath("//table[.//td[position()=2 and @width='85%']]")

Here, we use the bracketed term to ensure that the table is the right
one, but we don't actually descend inside the table.

You could, from this, get the name by doing a query from each of these
tables:

for table in tables:
    cnames = table.xpath(".//td[1]/font/a[2]/text()") # list of text nodes

You might want to consider a slightly safer approach when getting the
text:

    cnames = table.xpath(".//td[1]/font/a[2]") # list of nodes, should be one
    name = cnames[0].textContent # all the text from the link

When looking for the details, you can then write your query relative
to these tables, rather than having to figure out the location of the
details from the text nodes you've just extracted.

    details = table.xpath("following-sibling::table[1]") # list of max 1 node

i'm assuming that there's a libxml2node method that will do what i need that
i'm missing...

You should be able to issue XPath queries from any node. There have
been issues with libxml2dom and attribute nodes obtained from XPath,
but these were fixed in recent changesets.
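
Putting the table-first steps together, a sketch of the whole flow might look like the following. It uses lxml as a stand-in, since its xpath() method on elements accepts the same relative expressions; with libxml2dom the calls should be analogous, but that is an assumption. PAGE is a made-up layout (a heading table followed by a table of sections), and courses is just an illustrative container.

```python
from lxml import html

# Hypothetical layout: a heading table followed by a table of sections.
PAGE = """
<html><body>
<table>
  <tr>
    <td width="15%"><font><a name="c1"></a><a href="#">ACCTG 211</a></font></td>
    <td width="85%">Financial and Managerial Accounting</td>
  </tr>
</table>
<table>
  <tr><th>Section</th><th>Time</th></tr>
  <tr><td>001</td><td>MWF 9:05</td></tr>
  <tr><td>002</td><td>TR 1:00</td></tr>
</table>
</body></html>
"""

doc = html.fromstring(PAGE)
courses = {}

# Step 1: select the heading tables without descending into them.
for table in doc.xpath("//table[.//td[position()=2 and @width='85%']]"):
    # Step 2: query relative to the table for the course name link.
    links = table.xpath(".//td[1]/font/a[2]")
    if not links:
        continue  # not a course heading after all
    name = links[0].text_content().strip()
    # Step 3: move sideways to the next table and skip its header row.
    rows = table.xpath("following-sibling::table[1]//tr[position()>1]")
    courses[name] = [
        [cell.text_content().strip() for cell in row.xpath("./td")]
        for row in rows
    ]
```

Because every query after the first starts from table, there is no need to climb back up from a text node with a chain of "../" steps.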

Paul
 
