HTML DOM parser?

Paul Rubin · Jul 31, 2003

Is there an HTML DOM parser available for Python? Preferably one that
does a reasonable job with the crappy HTML out there on real web
pages, that doesn't get upset about unterminated tables and stuff like
that. Many extra points if it understands Javascript. Application is
a screen scraping web robot. Thanks.

Walter Dörwald · Jul 31, 2003

adfgvx said:
Try tidy. There are two Python wrappers: mxtidy and utidy. The latter is
more recent and uses the new tidylib. BUT it will only correct a bad HTML
page and transform it to XML or XHTML output, which you then load as a DOM
with another parser. Personally I use pyRXP.

Where can we get utidy?

Bye,
Walter Dörwald
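
The two-step pipeline in the quoted advice (tidy first, then load the cleaned output as a DOM with a separate parser) might look roughly like the sketch below. It covers only the second step, and it uses the standard library's xml.dom.minidom instead of pyRXP. The TIDIED string is a hand-written stand-in for what a tidy wrapper might emit, not real tidy output, and links() is a hypothetical helper name.

```python
from xml.dom import minidom

# Hypothetical stand-in for tidy's XHTML output (an assumption, not real output).
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<table><tr><td><a href="/a">first</a></td><td><a href="/b">second</a></td></tr></table>
</body>
</html>"""


def links(xhtml):
    """Parse well-formed XHTML and return (href, text) pairs for every <a>."""
    doc = minidom.parseString(xhtml)
    result = []
    for a in doc.getElementsByTagName("a"):
        # Collect only the direct text children of the anchor.
        text = "".join(n.data for n in a.childNodes if n.nodeType == n.TEXT_NODE)
        result.append((a.getAttribute("href"), text))
    return result


if __name__ == "__main__":
    print(links(TIDIED))
```

minidom is a strict XML parser, so this only works once the markup is already well-formed, which is exactly why the tidy pass has to come first.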

John J. Lee · Aug 1, 2003

Paul Rubin said:
Is there an HTML DOM parser available for Python? Preferably one that
does a reasonable job with the crappy HTML out there on real web
pages, that doesn't get upset about unterminated tables and stuff like
that. Many extra points if it understands Javascript. Application is
a screen scraping web robot. Thanks.

glork. I just started working on this myself.

Email me if you'd like the code, such as it is. I've wrapped the
Mozilla JS interpreter but am currently stuck on a segfault, so I
could certainly do with a collaborator.

I'm using utidylib and 4DOM (latter from PyXML).

Mind you, if you actually want to get a job done <wink>, for a
quick-but-bulky (and somewhat closed) solution, try PyKDE (KHTML /
KJS) or IE automation (MSHTML / JScript). Mozilla + XPCOM also, but I
think it requires rebuilding Mozilla to get PyXPCOM support. There's
also httpunit (in Java, useable from Jython).

John

Gilles Lenfant · Aug 1, 2003


Paul Rubin said:
Is there an HTML DOM parser available for Python? Preferably one that
does a reasonable job with the crappy HTML out there on real web
pages, that doesn't get upset about unterminated tables and stuff like
that. Many extra points if it understands Javascript. Application is
a screen scraping web robot. Thanks.

Windoze IE5(+) + Win32All Python package only:

Use IE as a COM object, browse to the file or URL, then get its DOM root.
But any JavaScript found in that page is executed at page load and may fool
your app.

--Gilles
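
A rough sketch of that approach, assuming a Windows machine with the pywin32 (Win32All) package installed and the InternetExplorer.Application COM object available. The function name open_in_ie and its dispatch parameter are hypothetical; dispatch is only there so the polling logic can be exercised without Windows. Navigate, Busy, ReadyState, Document and Quit are members of the IE automation object as I understand it.

```python
import time

READYSTATE_COMPLETE = 4  # value IE reports once the page has finished loading


def open_in_ie(url, dispatch=None, timeout=30.0, poll=0.1):
    """Drive IE over COM and return (browser, document) once loading finishes."""
    if dispatch is None:
        # pywin32 (formerly Win32All); only importable on Windows.
        from win32com.client import Dispatch as dispatch
    ie = dispatch("InternetExplorer.Application")
    ie.Visible = False
    ie.Navigate(url)
    deadline = time.time() + timeout
    # Poll until the browser reports that the page is fully loaded.
    while ie.Busy or ie.ReadyState != READYSTATE_COMPLETE:
        if time.time() > deadline:
            ie.Quit()
            raise TimeoutError("page did not finish loading: %s" % url)
        time.sleep(poll)
    return ie, ie.Document
```

The returned document is the DOM root after the page's own scripts have already run, so it is subject to the caveat above. Call Quit() on the browser object when done, or stray IE processes pile up.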

Paul Rubin · Aug 2, 2003

Here is a quick example of using automation with IE
# This is a sample of automating IE using Python.

Thanks, I should have said I'm running under gnu/linux and I was
hoping for a standalone solution (some of the ones suggested sound
worth looking into). Even connecting up Python to Mozilla sounds
awfully heavyweight.
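
For the standalone case, one lightweight option is to build a crude tree with nothing but the standard library. The sketch below uses html.parser from current Python (not anything available under that name in 2003). Node, TreeBuilder and parse are hypothetical names, the result is not a W3C DOM, and no JavaScript is run.

```python
from html.parser import HTMLParser

# Elements that never have content or an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # Node objects and plain text strings

    def find_all(self, tag):
        """Yield every descendant element with the given tag name."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


if __name__ == "__main__":
    root = parse("<table><tr><td>one<td>two</table><p>after")
    print([td.children for td in root.find_all("td")])
```

It does not choke on unterminated tables, but it also does not apply HTML's implied end tags: in the example the second td ends up nested inside the first one instead of beside it. A real browser-grade parser does a lot more repair work than this.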

John J. Lee · Aug 2, 2003

Paul Rubin said:
Thanks, I should have said I'm running under gnu/linux and I was
hoping for a standalone solution (some of the ones suggested sound
worth looking into). Even connecting up Python to Mozilla sounds
awfully heavyweight.

PyKDE is less hassle, I think. It's certainly heavyweight, though.
Probably more lightweight still is HttpUnit on Jython. I haven't used
either, but I have compiled PyKDE recently, and didn't run into
problems (but if you're unlucky, you may have to compile Qt, KDE, sip
and PyQt first!).

I seem to have got a basic JavaScript wrapper working now (I'm using
libjs from Mozilla's standalone spidermonkey distribution), bound 4DOM
to it, and extracted & executed the script from a web page. Quite a
lot more to do, though (browser-like interface of some sort,
javascript: scheme URLs, implement window object, wiring up event
attributes to the JS interpreter, getting the DOM actually working
properly, understanding what document.write does, trying to connect
the DOM to my Python HTML form and HTTP cookies interfaces...).
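
For illustration, the "extract the script from a web page" step, plus collecting the event attributes that would later need wiring up, could be sketched as below. This is not the 4DOM/spidermonkey code described above; it uses html.parser from current Python, and ScriptExtractor and extract are hypothetical names. Nothing is executed; the script source is only collected.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.scripts = []    # (src attribute or None, inline source) per <script>
        self.handlers = []   # (tag, event attribute, code) per on* attribute
        self._src = None
        self._chunks = None  # None while outside a <script> element

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name.startswith("on") and value:
                self.handlers.append((tag, name, value))
        if tag == "script":
            self._src = dict(attrs).get("src")
            self._chunks = []

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._chunks is not None:
            self.scripts.append((self._src, "".join(self._chunks)))
            self._chunks = None


def extract(html):
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.scripts, parser.handlers
```

The scripts come back in document order, which matters because anything calling document.write has to run at the point where it appears in the page.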

Anybody happen to know where JavaScript's document.some_form is
documented? Official W3C DOM has document.forms, but real browser
DOMs apparently have forms directly on the document object.
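
On the Python side of a binding, that browser behaviour (a named form reachable as document.some_form as well as through document.forms) could be mimicked along these lines. Form and Document here are hypothetical toy classes, not 4DOM's, and the lookup rule is my assumption about what browsers do rather than something taken from a spec.

```python
class Form:
    def __init__(self, name, action=""):
        self.name = name
        self.action = action


class Document:
    def __init__(self, forms):
        self.forms = list(forms)  # W3C-style collection: document.forms

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real
        # attributes such as "forms" always win over form names.
        for form in self.__dict__.get("forms", ()):
            if form.name == name:
                return form
        raise AttributeError(name)


if __name__ == "__main__":
    doc = Document([Form("login", "/login"), Form("search", "/q")])
    print(doc.search.action)
```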

John

calfdog · Aug 4, 2003

John J. Lee said:
Anybody happen to know where JavaScript's document.some_form is
documented? Official W3C DOM has document.forms, but real browser
DOMs apparently have forms directly on the document object.

Try here: http://msdn.microsoft.com/library/d...hor/dhtml/reference/dhtml_reference_entry.asp
