bruce
hi...
update to an ongoing issue i've been having regarding html/Browser and
selecting forms.
i've created a basic test app, and created a stripped down page of html. the
html has a single form.
i get the following error:
fname = main   <<<< the app can find the form name from the XPath...
Traceback (most recent call last):
  File "./axess.py", line 90, in ?
    br.select_form(name = "main")   <<<<< app is dying!!!
  File "build/bdist.linux-i686/egg/mechanize/_mechanize.py", line 354, in select_form
mechanize._mechanize.BrowserStateError: not viewing HTML
any thoughts/ideas/comments will be useful!!
thanks
-bruce
test code
---------------------------
import re
import libxml2dom
import urllib
import urllib2
import sys, string
#import numarray
import httplib
from mechanize import Browser, RobustFactory
import mechanize
from BeautifulSoup import *
########################
#
# Parsing App Information
########################
# datafile
tfile = open("stanford.dat", 'wr+')
cj = mechanize.CookieJar()
br = Browser()
if __name__ == "__main__":
    # main app
    #----------------------------
    # start trying to get the stanford pages
    cj = mechanize.CookieJar()
    # br = Browser(factory=RobustFactory())
    br = Browser()
    fh = open('axess1.dat')
    s = fh.read()
    fh.close()
    br.open("file:///home/test/axess1.dat")
    # br.open(s)
    print "foo"
    # particular cookiejar)
    br.set_cookiejar(cj)
    response = br.response() # this is a copy of response
    fnamepath = "/html/body[@class='PSPAGE']/form[1]/attribute::name"
    s = response.read()
    print response.read()
    d = libxml2dom.parseString(s, html=1)
    ff = d.xpath(fnamepath)
    fname = ff[0].nodeValue
    print "fname = ",fname
    br.select_form(name = "main")
    print "ssssss"
    sys.exit()
test html
---------------------------
<html lang='en'>
<head>
<title>View Schedule of Classes</title>
</head>
<body class='PSPAGE' >
<br>
<form name="main" method="post"
action="/servlets/iclientservlet/a2k_prd/?ICType=Panel&Menu=SA_LEARNER_SERVICES&Market=GBL&PanelGroupName=CLASS_SEARCH"
autocomplete="off" id="main">
</form>
</body>
</html>
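One possible explanation, offered as an assumption rather than a confirmed diagnosis: the page is opened from a file: URL whose name ends in .dat, so nothing tells mechanize that the content is HTML. mechanize appears to decide whether it is "viewing HTML" from the Content-Type header or, failing that, from the URL's extension. The standard-library check below only illustrates how the two file names are classified by name alone; axess1.html is a hypothetical renamed copy of the test file, and guessed_type is a helper made up for this sketch.

```python
import mimetypes

def guessed_type(filename):
    """Return the MIME type the standard library guesses from the file name alone."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type

# axess1.dat is the name used in the test script; axess1.html is a
# hypothetical renamed copy of the same file.
dat_type = guessed_type("axess1.dat")
html_type = guessed_type("axess1.html")

print("axess1.dat  -> %s" % dat_type)
print("axess1.html -> %s" % html_type)
```

If that is what is happening, renaming the file to axess1.html and opening that URL instead would be a quick experiment.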
hi john...
this is in regards to the web/parsing/factory/beautifulsoup....
to reiterate, i have python 2.4, mechanize, browser, BeautifulSoup installed.
i have the latest mech from svn.
i'm getting the same err as reported by john t. the code/err follows.. (i
can resend the test html if you need)
any thoughts/pointers/etc would be helpful...
thanks
-bruce
test code
#! /usr/bin/env python
#test python script
import re
import libxml2dom
import urllib
import urllib2
import sys, string
#import numarray
import httplib
from mechanize import Browser, RobustFactory
import mechanize
import BeautifulSoup
########################
#
# Parsing App Information
########################
# datafile
tfile = open("stanford.dat", 'wr+')
cj = mechanize.CookieJar()
br = Browser()
if __name__ == "__main__":
    # main app
    #----------------------------
    # start trying to get the stanford pages
    cj = mechanize.CookieJar()
    br = Browser(factory=RobustFactory())
    fh = open('axess.dat')
    s = fh.read()
    fh.close()
    br.open("file:///home/test/axess.dat")
.
.
.
.
err/output
Traceback (most recent call last):
  File "./axess.py", line 45, in ?
    br.open("file:///home/test/axess.dat")
  File "build/bdist.linux-i686/egg/mechanize/_mechanize.py", line 130, in open
  File "build/bdist.linux-i686/egg/mechanize/_mechanize.py", line 170, in _mech_open
  File "build/bdist.linux-i686/egg/mechanize/_mechanize.py", line 213, in set_response
  File "build/bdist.linux-i686/egg/mechanize/_html.py", line 577, in set_response
  File "build/bdist.linux-i686/egg/mechanize/_html.py", line 316, in __init__
  File "/usr/lib/python2.4/site-packages/BeautifulSoup.py", line 1326, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/lib/python2.4/site-packages/BeautifulSoup.py", line 973, in __init__
    self._feed()
  File "/usr/lib/python2.4/site-packages/BeautifulSoup.py", line 987, in _feed
    smartQuotesTo=self.smartQuotesTo)
  File "/usr/lib/python2.4/site-packages/BeautifulSoup.py", line 1580, in __init__
    u = self._convertFrom(proposedEncoding)
  File "/usr/lib/python2.4/site-packages/BeautifulSoup.py", line 1614, in _convertFrom
    proposed = self.find_codec(proposed)
  File "/usr/lib/python2.4/site-packages/BeautifulSoup.py", line 1731, in find_codec
    return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
  File "/usr/lib/python2.4/site-packages/BeautifulSoup.py", line 1740, in _codec
    codecs.lookup(charset)
TypeError: lookup() argument 1 must be string, not bool
is this where i've seen references to integrating BeautifulSoup in the web
browsing app?
-bruce
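The last frame of that traceback can be reproduced on its own with the standard library. That suggests, as an assumption since the caller's arguments are not visible in the traceback, that a boolean reached BeautifulSoup where an encoding name was expected. A minimal sketch, written for a current Python rather than 2.4, with lookup_error as a helper made up for illustration:

```python
import codecs

def lookup_error(charset):
    """Try to resolve a codec name; return the exception raised, or None on success."""
    try:
        codecs.lookup(charset)
    except (TypeError, LookupError) as exc:
        return exc
    return None

# A real encoding name resolves without error.
ok = lookup_error("utf-8")

# A boolean is rejected because it is not a string at all.
bad = lookup_error(False)

print(repr(ok))
print(type(bad).__name__)
```

This only shows what kind of value triggers the error; it does not show which mechanize or BeautifulSoup call passed it.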
-----Original Message-----
From: (e-mail address removed) On Behalf Of John J Lee
Sent: Monday, July 10, 2006 2:29 AM
To: (e-mail address removed)
Cc: (e-mail address removed)
Subject: RE: [wwwsearch-general] ClientForm request re ParseErrors
[...]sgmllib.SGMLParseError: expected name token at '<! Others/0/WIN; Too'
partial html
-----------------------------------
</table>
<br />
<FORM NAME='main' METHOD=POST
Action="/servlets/iclientservlet/a2k_prd/?ICType=Panel&Menu=SA_LEARNER_SERVICES&Market=GBL&PanelGroupName=CLASS_SEARCH" autocomplete=off>
<INPUT TYPE=hidden NAME=ICType VALUE=Panel>
<INPUT TYPE=hidden NAME=ICElementNum VALUE="0">
<INPUT TYPE=hidden NAME=ICStateNum VALUE="1">
You don't include the HTML mentioned in the exception message ('<!
Others/0/WIN; Too') in the part of the HTML that you quote, but that
snippet is enough to see what's wrong, and lets you find exactly where in
the HTML the problem lies. Comments in HTML start with '<!--' and end
with '-->'. The comment sgmllib is complaining about is missing the '--'.
You can work around bad HTML using the .set_data() method on response
objects and the .set_response() method on Browser. Call the latter before
you call any other methods that would require parsing the HTML.
r = br.response()  # a copy of the current response
r.set_data(clean_html(r.get_data()))  # get_data() is a method of the response
br.set_response(r)
You must write clean_html yourself (though you may use an external tool to
do so, of course).
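A minimal sketch of what such a clean_html might look like, assuming the only problem is markup declarations that open with '<!' but are neither proper comments nor a DOCTYPE. The regular expression and the sample string are illustrative assumptions, not taken from the actual page, and simply dropping the declaration is one choice among several.

```python
import re

# Matches '<!' followed by anything other than '--' (a real comment) or
# 'DOCTYPE', up to the next '>'. This pattern is an assumption about what
# the broken markup looks like.
_BAD_DECLARATION = re.compile(r"<!(?!--|DOCTYPE)[^>]*>", re.IGNORECASE)

def clean_html(html):
    """Return html with malformed '<! ...>' declarations removed."""
    return _BAD_DECLARATION.sub("", html)

# Made-up sample containing one malformed declaration and one valid comment.
sample = "<html><! Others/0/WIN; Too ><!-- fine --><body></body></html>"
cleaned = clean_html(sample)
print(cleaned)
```

The cleaned string would then be handed to r.set_data() as in the snippet above.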
Alternatively, use a more robust parser, e.g.
br = mechanize.Browser(factory=mechanize.RobustFactory())
(you may also integrate another parser of your choice with mechanize, with
more effort)
John