Any equivalent to Ruby's 'hpricot' html/xpath/css selector package?

Kenneth McDonald · Dec 28, 2008

Ruby has a package called 'hpricot' which can perform limited xpath
queries, and CSS selector queries. However, what makes it really
useful is that it does a good job of handling the "broken" html that
is so commonly found on the web. Does Python have anything similar,
i.e. something that will not only do XPath queries, but will do so on
imperfect HTML? (A good HTML neatener would also be fine, of course,
as I could then pass the result to a Python XPath package.)

And, what are people's favorite Python XPath solutions?

Thanks,
Ken McDonald

Bruno Desthuilliers · Dec 29, 2008

Kenneth said:
Ruby has a package called 'hpricot' which can perform limited xpath
queries,

ElementTree? (It's in the stdlib now.)

Kenneth said:
and CSS selector queries.

PyQuery?
http://pypi.python.org/pypi/pyquery

Kenneth said:
However, what makes it really useful
is that it does a good job of handling the "broken" html that is so
commonly found on the web.

BeautifulSoup?
http://pypi.python.org/pypi/BeautifulSoup/3.0.7a

possibly with ElementSoup?
http://pypi.python.org/pypi/ElementSoup/rev452
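
For reference, a minimal sketch of the ElementTree suggestion, assuming well-formed markup. The sample document and variable names are made up for illustration. ElementTree supports only a subset of XPath, and its parser rejects markup that is not well-formed, so on its own it does not cover the "broken" html part of the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not from the thread).
XHTML = """<html><body>
<div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
<p class="intro">Hello</p>
</body></html>"""

root = ET.fromstring(XHTML)

# ElementTree's limited XPath: descendant search plus attribute predicates.
links = [a.get("href") for a in root.findall(".//div[@id='nav']/a")]
intro = root.find(".//p[@class='intro']").text

# Markup with mismatched tags is rejected rather than repaired.
try:
    ET.fromstring("<p>unclosed <b>tag</p>")
    rejected = False
except ET.ParseError:
    rejected = True
```

Here `links` collects the two `href` values and `rejected` ends up true, which is the gap that a tolerant HTML parser has to fill.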

Mark Thomas · Dec 29, 2008

Kenneth said:
Ruby has a package called 'hpricot' which can perform limited xpath
queries, and CSS selector queries. However, what makes it really
useful is that it does a good job of handling the "broken" html that
is so commonly found on the web. Does Python have anything similar,
i.e. something that will not only do XPath queries, but will do so on
imperfect HTML?

Hpricot is a fine package but I prefer Nokogiri (see
http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html)
because it is based on libxml2 and therefore is faster, conforms to
the full XPath 1.0 spec, works on imperfect HTML, and exposes the
Hpricot API.

In Python, the equivalent is lxml (http://codespeak.net/lxml/), which
is similarly based on libxml2, very fast, XPath-1.0 conformant, and
exposes the now-standard ElementTree API.

The main difference is that lxml doesn't have CSS selector syntax, but
IMHO that's a gimmick when you have a full XPath 1.0 engine at your
disposal.

-- Mark.

Stefan Behnel · Dec 30, 2008

Mark said:
The main difference is that lxml doesn't have CSS selector syntax

Feel free to read the docs:

http://codespeak.net/lxml/cssselect.html

Stefan

Stefan Behnel · Dec 30, 2008

Bruno said:
BeautifulSoup?
http://pypi.python.org/pypi/BeautifulSoup/3.0.7a

possibly with ElementSoup?
http://pypi.python.org/pypi/ElementSoup/rev452

It's actually debatable if BS is any better than lxml/libxml2 when parsing
broken HTML, as lxml tends to tidy things up pretty well. The only major
difference is in encoding detection, for which you can also use a separate
tool like chardet:

http://chardet.feedparser.org/

Stefan

Stefan Behnel · Dec 30, 2008

Kenneth said:
Ruby has a package called 'hpricot' which can perform limited xpath
queries, and CSS selector queries. However, what makes it really useful
is that it does a good job of handling the "broken" html that is so
commonly found on the web. Does Python have anything similar, i.e.
something that will not only do XPath queries, but will do so on
imperfect HTML?

lxml.html is your friend.

http://codespeak.net/lxml/lxmlhtml.html

Stefan
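
A rough sketch of what that can look like in practice, assuming lxml is installed. The sample markup and variable names are invented for illustration; `document_fromstring`, `xpath`, and `text_content` are part of lxml.html.

```python
from lxml import html

# Deliberately sloppy markup (hypothetical sample): no <html> or <head>,
# unquoted attribute values, unclosed <p> and <li> elements, and no
# closing </body>.
BROKEN = """<body>
<p class=intro>Hello, <b>world</b>
<p>See <a href=/docs>the docs</a>
<ul><li>one<li>two</ul>
"""

# The parser repairs the markup into a complete document tree.
doc = html.document_fromstring(BROKEN)

# Full XPath 1.0 queries then work on the repaired tree.
hrefs = doc.xpath("//a/@href")
intro = doc.xpath("//p[@class='intro']")[0].text_content().strip()
items = [li.text_content().strip() for li in doc.xpath("//li")]
```

The same tree can also be queried with CSS selectors as described in the cssselect documentation linked elsewhere in this thread.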

Mark Thomas · Dec 30, 2008

Stefan said:
Feel free to read the docs:

http://codespeak.net/lxml/cssselect.html

Don't know how I missed that...

So lxml is pretty much an exact equivalent to what Ruby has to offer
(Hpricot or Nokogiri). Nice.
