Olivier Favre-Simon said:
I would like to know what is available for scripting browsers from
Python.
[...]
ClientForm
http://wwwsearch.sourceforge.net/ClientForm/
I use it to automate POSTing entire image directories to hosts such as
imagevenue.com, imagehigh.com, etc.
This doesn't actually address what the OP wanted: it's not a browser.
Yep. Didn't read with sufficient care. He really wants scripting, not
web scraping.
Nested forms?? Good grief. Can you point me at a real life example of
such HTML? Can probably fix the parser to work around this.
What I mean is: the parser does not detect a missing </form>, so it
thinks that there are nested forms and raises a ParseError.
Browsers have an easier time spotting mismatched form tags, because they
can use the matching table or div tags around the form to infer that it
is closed (DOM approach).
That is not easy with a SAX-ish approach like HTMLParser.
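
One workaround that fits an event-driven parser is to treat a new <form>
start tag as an implicit close of any form still open, instead of raising.
The following is only a minimal sketch of that idea, not how ClientForm
works: it assumes Python 3's html.parser module, and the names FormTracker,
parse_forms and implied_closes are made up for illustration.

```python
from html.parser import HTMLParser


class FormTracker(HTMLParser):
    """Collect forms; a new <form> implicitly closes a still-open one."""

    def __init__(self):
        super().__init__()
        self.forms = []          # one dict per form: attrs and control names
        self._current = None     # the form currently open, if any
        self.implied_closes = 0  # how many missing </form> tags were assumed

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            if self._current is not None:
                # Missing </form>: assume the previous form ended here
                # rather than reporting nested forms.
                self.implied_closes += 1
            self._current = {"attrs": dict(attrs), "controls": []}
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            self._current["controls"].append(dict(attrs).get("name"))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def parse_forms(html):
    """Hypothetical helper: feed the markup and return the finished parser."""
    parser = FormTracker()
    parser.feed(html)
    parser.close()
    return parser
```

This heuristic can attach a control to the wrong form when the unclosed
form is followed by stray controls, which is the case where a browser's
DOM view of the surrounding table or div tags does better.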
I don't mean nested forms should be supported; they are crap (is this even
legal code?)
Titus Brown says he's trying to fix sgmllib (to some extent, at least).
Also, you can always feed stuff through mxTidy.
I'd like to have a reimplementation of ClientForm on top of something
like BeautifulSoup...
John
Taken separately, ClientForm, HTMLParser and SGMLParser each work
well.
But it would be cool if competent people in the HTML parsing domain joined
up and defined a base parser interface, the same way smart guys did with
WSGI for webservers.
Then libs like ClientForm would not raise, say, an AttributeError when some
custom parser class does not implement a given attribute.
Adding an otherwise unused attribute to a parser just in case it will one
day interoperate with ClientForm sounds silly. And what if ClientForm
changes its attributes?
No really, whatever the chosen codebase, a common parser interface would
be great.
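
To make the idea concrete, here is a rough sketch of what such a shared
interface could look like in Python 3 using typing.Protocol. Nothing here
is an existing standard or part of ClientForm: the FormParser name, its
three methods and check_parser are all hypothetical, chosen only to show
how a library could check a custom parser up front instead of failing
later with an AttributeError.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class FormParser(Protocol):
    """Hypothetical minimal interface a form library could require."""

    def feed(self, data: str) -> None: ...

    def close(self) -> None: ...

    def get_forms(self) -> List[dict]: ...


def check_parser(parser) -> None:
    """Fail early with a clear message instead of a late AttributeError."""
    if not isinstance(parser, FormParser):
        required = ("feed", "close", "get_forms")
        missing = [name for name in required if not hasattr(parser, name)]
        raise TypeError("parser lacks required methods: " + ", ".join(missing))
```

A custom parser class would then only need to provide those three methods;
it would not have to inherit from anything or carry attributes that exist
solely for one client library.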