Any equivalent to Ruby's 'hpricot' html/xpath/css selector package?

Discussion in 'Python' started by Kenneth McDonald, Dec 28, 2008.

  1. Ruby has a package called 'hpricot' which can perform limited xpath
    queries, and CSS selector queries. However, what makes it really
    useful is that it does a good job of handling the "broken" html that
    is so commonly found on the web. Does Python have anything similar,
    i.e. something that will not only do XPath queries, but will do so on
    imperfect HTML? (A good HTML neatener would also be fine, of course,
    as I could then pass the result to a Python XPath package.)

    And, what are people's favorite Python XPath solutions?

    Thanks,
    Ken McDonald
    Kenneth McDonald, Dec 28, 2008
    #1

  2. Kenneth McDonald wrote:
    > Ruby has a package called 'hpricot' which can perform limited xpath
    > queries,


    ElementTree? (It's in the stdlib now.)

    > and CSS selector queries.


    PyQuery?
    http://pypi.python.org/pypi/pyquery

    > However, what makes it really useful
    > is that it does a good job of handling the "broken" html that is so
    > commonly found on the web.


    BeautifulSoup?
    http://pypi.python.org/pypi/BeautifulSoup/3.0.7a

    Possibly with ElementSoup?
    http://pypi.python.org/pypi/ElementSoup/rev452
    Bruno Desthuilliers, Dec 29, 2008
    #2
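
A minimal sketch of the ElementTree suggestion above, assuming well-formed input. The standard-library xml.etree.ElementTree supports only a limited XPath subset (paths and simple attribute predicates) and rejects broken markup, which is why the reply pairs it with a tolerant parser. The sample markup and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the thread).
SAMPLE = """
<html>
  <body>
    <div class="post"><a href="/first">First</a></div>
    <div class="post"><a href="/second">Second</a></div>
    <div class="ad"><a href="/buy">Buy</a></div>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)

# Limited XPath: descendant search plus a simple attribute predicate.
post_links = root.findall(".//div[@class='post']/a")
hrefs = [a.get("href") for a in post_links]
print(hrefs)

# Broken markup is not repaired; the strict XML parser raises ParseError.
broken_rejected = False
try:
    ET.fromstring("<p>unclosed <b>tag</p>")
except ET.ParseError:
    broken_rejected = True
print(broken_rejected)
```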

  3. On Dec 28, 6:22 pm, Kenneth McDonald wrote:
    > Ruby has a package called 'hpricot' which can perform limited xpath  
    > queries, and CSS selector queries. However, what makes it really  
    > useful is that it does a good job of handling the "broken" html that  
    > is so commonly found on the web. Does Python have anything similar,  
    > i.e. something that will not only do XPath queries, but will do so on  
    > imperfect HTML?


    Hpricot is a fine package but I prefer Nokogiri (see
    http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html)
    because it is based on libxml2 and therefore is faster, conforms to
    the full XPath 1.0 spec, works on imperfect HTML, and exposes the
    Hpricot API.

    In Python, the equivalent is lxml (http://codespeak.net/lxml/), which
    is similarly based on libxml2, very fast, XPath 1.0 conformant, and
    exposes the now-standard ElementTree API.

    The main difference is that lxml doesn't have CSS selector syntax, but
    IMHO that's a gimmick when you have a full XPath 1.0 engine at your
    disposal.

    -- Mark.
    Mark Thomas, Dec 29, 2008
    #3
  4. Stefan Behnel, Dec 30, 2008
    #4
  5. Bruno Desthuilliers wrote:
    >> However, what makes it really useful is that it does a good job of
    >> handling the "broken" html that is so commonly found on the web.
    >
    > BeautifulSoup?
    > http://pypi.python.org/pypi/BeautifulSoup/3.0.7a
    >
    > Possibly with ElementSoup?
    > http://pypi.python.org/pypi/ElementSoup/rev452


    It's actually debatable whether BeautifulSoup is any better than
    lxml/libxml2 at parsing broken HTML, as lxml tends to tidy things up
    pretty well. The only major difference is in encoding detection, for
    which you can also use a separate tool like chardet:

    http://chardet.feedparser.org/

    Stefan
    Stefan Behnel, Dec 30, 2008
    #5
  6. Kenneth McDonald wrote:
    > Ruby has a package called 'hpricot' which can perform limited xpath
    > queries, and CSS selector queries. However, what makes it really useful
    > is that it does a good job of handling the "broken" html that is so
    > commonly found on the web. Does Python have anything similar, i.e.
    > something that will not only do XPath queries, but will do so on
    > imperfect HTML?


    lxml.html is your friend.

    http://codespeak.net/lxml/lxmlhtml.html

    Stefan
    Stefan Behnel, Dec 30, 2008
    #6
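
A minimal sketch of the lxml.html suggestion above, assuming the third-party lxml package is installed. The broken sample markup (unquoted attributes, unclosed li, p, body and html tags) and the variable names are hypothetical, and the exact shape of the repaired tree can depend on the underlying libxml2 version.

```python
from lxml import html

# Hypothetical broken input: unquoted attributes and several unclosed tags.
BROKEN = (
    "<html><body><ul class=nav>"
    "<li><a href=/one>one</a>"
    "<li><a href=/two>two</a>"
    "</ul><p>unclosed paragraph"
)

# lxml.html.fromstring() parses with libxml2's tolerant HTML parser
# instead of rejecting the malformed document.
doc = html.fromstring(BROKEN)

# XPath 1.0 expressions then run against the repaired tree.
hrefs = doc.xpath("//ul[@class='nav']/li/a/@href")
labels = [a.text for a in doc.xpath("//ul[@class='nav']/li/a")]
paragraph = doc.xpath("string(//p)")
print(hrefs, labels, paragraph)
```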
  7. On Dec 30, 8:20 am, Stefan Behnel wrote:
    > Mark Thomas wrote:
    > > The main difference is that lxml doesn't have CSS selector syntax
    >
    > Feel free to read the docs:
    >
    > http://codespeak.net/lxml/cssselect.html


    Don't know how I missed that...

    So lxml is pretty much an exact equivalent to what Ruby has to offer
    (Hpricot or Nokogiri). Nice.
    Mark Thomas, Dec 30, 2008
    #7