Hpricot/Rubyful Soup comparison

Wes Gamble · Nov 21, 2006

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

Thanks,
Wes

lrlebron · Nov 22, 2006

I've used both Hpricot and Rubyful Soup to parse the Google News page
and found Hpricot to be much faster.

Luis

Peter Szinek · Nov 22, 2006

Wes said:
Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and

I did not do any benchmarks, but I am scraping a lot of relatively big
pages on a daily basis and I can tell you, RubyfulSoup is magnitudes
slower than HPricot. I am absolutely sure about this. I am doing things
with HPricot which should be extremely slow (e.g. traversing the whole
tree and doing expensive operations on all Hpricot::Elements) yet
HPricot is surprisingly fast. Rubyful is nowhere near.

b) preserves the original HTML better.
Hmm, this I don't know, but I guess the term 'preserves HTML better'
should be defined first with some metric (deviation from the HTML
standard?). There are a lot of HTML pages so badly formed that even a
human would come up with multiple solutions for their correction.
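
One rough way to put a number on 'preserves HTML better' is to compare the tokens of the original page with the tokens of whatever the parser writes back out. The sketch below is only an illustration of that idea: the names `tokenize`, `lcs_length` and `preservation_score` are made up for this example, the tokenizer is a naive regex, and no parser library is involved. You would feed it the serialized output of whichever parser you are testing.

```ruby
# Split markup into tag tokens and text tokens; whitespace-only text is dropped.
# This is a deliberately naive regex tokenizer, not a real HTML lexer.
def tokenize(html)
  html.scan(/<[^>]*>|[^<]+/).map(&:strip).reject(&:empty?)
end

# Length of the longest common subsequence of two token arrays (row-by-row DP).
def lcs_length(a, b)
  prev = Array.new(b.size + 1, 0)
  a.each do |x|
    curr = [0]
    b.each_with_index do |y, j|
      curr << (x == y ? prev[j] + 1 : [prev[j + 1], curr[j]].max)
    end
    prev = curr
  end
  prev.last
end

# 1.0 means the output kept every original token, in order, and added none.
# Inserted, dropped or rewritten tokens pull the score towards 0.0.
def preservation_score(original, output)
  a = tokenize(original)
  b = tokenize(output)
  return 1.0 if a.empty? && b.empty?
  lcs_length(a, b).to_f / [a.size, b.size].max
end

broken = "<div><p>one<p>two</div>"
fixed  = "<div><p>one</p><p>two</p></div>"
puts preservation_score(broken, broken) # output identical to input
puts preservation_score(broken, fixed)  # two inserted close tags lower the score
```

A score like this says nothing about whether the result renders the same in a browser; it only measures how much of the original token stream survived the round trip.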

I think the only real-life quality measure is to process your pages with
both of them and see which one yields better results. I have not played
much with RubyfulSoup, but I am writing quite a serious screen scraping
framework based on Hpricot, and so far I have had no real problems - and
I am doing all kinds of weird things.

Cheers,
Peter

__
http://www.rubyrailways.com

Luciano Ramalho · Nov 22, 2006

I did not do any benchmarks, but I am scraping a lot of relatively big
pages on a daily basis and I can tell you, RubyfulSoup is magnitudes
slower than HPricot.

HPricot is partially written in C, so it should be faster than a
pure-Ruby lib like RubyfulSoup.

Also, RubyfulSoup aims to be very resilient to malformed markup, so it
must resort to heuristics that have a performance cost. I don't know
how HPricot handles HTML or XML with really serious flaws like tags
that open but never close and so on, but in my experience RubyfulSoup
has managed to deal amazingly well with such problems. If you need to
parse low quality markup, the performance penalty of RubyfulSoup may
be well worth the price.

Cheers,

Luciano

Peter Szinek · Nov 22, 2006

Luciano said:
HPricot is partially written in C, so it should be faster than a
pure-Ruby lib like RubyfulSoup.

True.

Also, RubyfulSoup aims to be very resilient to malformed markup,

So is HPricot. HPricot is not just an HTML parser which can parse
(relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
whether HPricot's 'somehow' is better or worse than RubyfulSoup's, but
it is a fact that HPricot handles malformed pages very well.

so it must resort to heuristics that have a performance cost. I don't know
how HPricot handles HTML or XML with really serious flaws like tags
that open but never close and so on,

That particular case is handled absolutely fine. Maybe we need a list of
serious problems so we can see how Hpricot and RubyfulSoup each handle
them. From what I have seen, HPricot has not had problems with any page...

has managed to deal amazingly well with such problems. If you need to
parse low quality markup, the performance penalty of RubyfulSoup may
be well worth the price.

I am still not sure what the added benefits of RubyfulSoup parsing
over HPricot are (although I am not claiming that there are none) - I would
like to see a real serious comparison to decide this...
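
A serious comparison could start from a small timing harness like the one below. This is a sketch under assumptions, not a benchmark anyone in this thread ran: `compare_parsers` is a made-up helper, the two "parsers" are regex stand-ins so the code runs without either gem, and the generated page is synthetic. With the gems installed, the hash entries would be replaced by calls into each library, as hinted in the comments.

```ruby
require "benchmark"

# Time each named parser over the same page. `parsers` maps a label to any
# callable that takes an HTML string. Returns { label => elapsed seconds }.
def compare_parsers(html, parsers, runs: 20)
  parsers.each_with_object({}) do |(name, parser), timings|
    # Each run gets a fresh copy so a parser that mutates its input
    # cannot affect later runs or the other parsers.
    timings[name] = Benchmark.realtime { runs.times { parser.call(html.dup) } }
  end
end

# Stand-in "parsers" so the sketch runs on a bare Ruby install.
parsers = {
  "scan tags"  => ->(html) { html.scan(/<[^>]*>/) },
  "split on <" => ->(html) { html.split("<") }
}
# With the real libraries loaded, the entries would look something like:
#   "hpricot"      => ->(html) { Hpricot(html) }
#   "rubyful_soup" => ->(html) { BeautifulSoup.new(html) }

# Synthetic page; a real test should use the pages you actually scrape.
page = "<html><body>" + "<p>row <b>bold</b></p>" * 500 + "</body></html>"

compare_parsers(page, parsers).each do |name, seconds|
  puts format("%-12s %.4fs", name, seconds)
end
```

Timing the same set of real pages through both libraries, and then diffing their output, would cover both halves of the original question.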

Peter

__
http://www.rubyrailways.com

Gregory Seidman · Nov 22, 2006

[...]
} > Also, RubyfulSoup aims to be very resilient to malformed markup,
} So is HPricot. HPricot is not just an HTML parser which can parse
} (relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
} whether HPricot's 'somehow' is better or worse than RubyfulSoup's, but
} it is a fact that HPricot is handling malformed pages very well.
}
} > so it must resort to heuristics that have a performance cost. I don't
} > know how HPricot handles HTML or XML with really serious flaws like
} > tags that open but never close and so on,
} This concretely is absolutely OK. Maybe we would need a list of serious
} problems and see how Hpricot vs RubyfulSoup is handling them. From what
} I have seen, HPricot did not have any problems with any page...

HPricot even keeps track of when tags are (incorrectly) closed by a
different close tag. This can allow you to track down issues in broken HTML
if that's your intent, but since I am mostly using HPricot for sanitization
I just set the close tags to nil so the output closes with the correct tag.
I do find it a little annoying that HPricot will always produce an
open/close pair even if the input was self-closing (e.g. <foo />) unless
the tag is known to be an empty tag by HPricot (see
Hpricot::ElementContent).

} > has managed to deal amazingly well with such problems. If you need to
} > parse low quality markup, the performance penalty of RubyfulSoup may
} > be well worth the price.
} I am still not sure what are the added benefits of RubyfulSoup parsing
} over HPricot (although I am not claiming that there are none) - I would
} like to see a real serious comparison to decide this...

I haven't tried RubyfulSoup, but HPricot suits my needs nicely. I am
delighted by its reliance on a bare minimum of HPricot-specific objects. It
doesn't try to behave like a real DOM, which means that it can use arrays
for child lists and ordinary references for parent nodes and hashes for
attributes, all read/write. It is possible to perform significant
transformations with minimal difficulty.

} Peter
--Greg

Bob Hutchison · Nov 22, 2006

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

I switched from Rubyful Soup to Hpricot a while ago. The reason was
performance on 1000-2000 character html chunks -- I didn't do a
benchmark because there just was no need to... Hpricot is *a lot*
faster.

I have no idea which preserves html better; I'm only using them to
find specific bits of the html (e.g. links, images, a few other
things). I do not use either to transform the input html; I *always*
keep the input as it was. In all cases I have html in a string that I
give to the parser. I do know that with Rubyful Soup it was
absolutely necessary to dup the string first, or you were liable to
have changes made to the input string.
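
The dup precaution can be sketched as follows. `mutating_parse` here is a hypothetical stand-in for any parser that edits its argument in place; it is not Rubyful Soup's actual code, only an illustration of the failure mode and two ways to guard against it.

```ruby
# Hypothetical stand-in for a parser that edits its argument in place.
def mutating_parse(html)
  html.gsub!(/\s+/, " ") # destructive: rewrites the caller's string
  html.scan(/<[^>]*>/)
end

original = "<p>one</p>\n\n<p>two</p>"

# Option 1: hand the parser a copy, so the original stays untouched.
tags = mutating_parse(original.dup)

# Option 2: freeze the original so an in-place edit raises instead of
# silently changing it. Ruby 2.5+ raises FrozenError, which is a
# RuntimeError subclass, so rescuing RuntimeError covers older Rubies too.
original.freeze
begin
  mutating_parse(original)
rescue RuntimeError => e
  warn "parser tried to modify its input: #{e.message}"
end
```

Freezing is handy while testing a new library, since it turns a silent change into a loud error.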

Cheers,
Bob


----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>

Luciano Ramalho · Nov 22, 2006

So is HPricot. HPricot is not just an HTML parser which can parse
(relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
whether HPricot's 'somehow' is better or worse than RubyfulSoup's, but
it is a fact that HPricot handles malformed pages very well.

Thanks for the input, Peter. From your opinion and others', it seems
HPricot is the best option. Coming from Python, I was used to
BeautifulSoup, from which RubyfulSoup derived, and I was very happy
with it. But if we can have the same benefits with better performance,
then it's a no-brainer!

Cheers,

Luciano

Wes Gamble · Nov 22, 2006

Thanks for all of the comments.

I was pretty sure that Hpricot was faster since it is partially written
in C, but it's nice to hear a resounding "YES" on that topic.

My concern about "preserving original markup" has to do with this
application I'm writing, which grabs a page and then tries to display
it. When RubyfulSoup encountered bad HTML, it could parse it OK,
but it always attempted to fix it when I went to write out the parse
tree, which can cause problems when you try to redisplay the HTML.

Some malformed HTML is handled fine by browsers, so I'd like to preserve
the original HTML regardless of its quality. If Hpricot will not only
parse my HTML quickly, but also not fix the HTML on the way out (dumping
the parse tree), that would be ideal.

Again, thanks for all of the discussion - it's quite helpful.

Wes

_why · Nov 22, 2006

My concern about "preserving original markup" has to do with this
application I'm writing, which grabs a page and then tries to display
it. When RubyfulSoup would encounter bad HTML, it could parse it ok,
but it always attempts to fix it when I went to write the parse tree.
Which can cause problems when you try to redisplay the HTML.

I totally agree with you regarding preserving the original markup. In fact,
the latest Hpricot code (in subversion) has two methods for output:

* `to_html` which outputs fully closed tags and strips out bogus end tags.
* `to_original_html` which outputs the original document (as close as it can)
with your modifications made.

So, for example, I use the `to_original_html` method in MouseHole, which is a
scriptable personal HTTP proxy (sort of like greasemonkey). Some pages (like
Boing Boing, for instance) completely break if you try to fix up the HTML. But
this new method can successfully remove stuff and alter stuff without turning the
whole page upside-down.

_why

_why · Nov 22, 2006

I do find it a little annoying that HPricot will always produce an
open/close pair even if the input was self-closing (e.g. <foo />) unless
the tag is known to be an empty tag by HPricot (see
Hpricot::ElementContent).

Mmm. Okay, good point. So if a tag comes in as self-closing, keep it that way?
I think that's reasonable.

_why

Wes Gamble · Nov 22, 2006

_why said:
I totally agree with you regarding preserving the original markup. In fact,
the latest Hpricot code (in subversion) has two methods for output:

* `to_html` which outputs fully closed tags and strips out bogus end tags.
* `to_original_html` which outputs the original document (as close as it can)
with your modifications made.

sweet.

Wes Gamble · Nov 22, 2006

Wes said:
sweet.

Actually, I'm kind of hoping that I can make mods. to the parse tree,
but that no "unnecessary fixing" of bad HTML occurs.

So I'm wondering: does modifying the parse tree at all and then
outputting it imply that all of the malformed HTML will be
fixed/modified in some way, or not?

Thanks,
Wes

_why · Nov 22, 2006

Actually, I'm kind of hoping that I can make mods. to the parse tree,
but that no "unnecessary fixing" of bad HTML occurs.

So I'm wondering does modifying the parse tree at all and then
outputting it imply that all of the malformed HTML will be
fixed/modified in some way or not?

With `to_original_html`, no malformed HTML is fixed.
<div>Paragraph oneParagraph two with some tags in it 

With `to_html`, Hpricot will line up all the tags.

_why

Giles Bowkett · Nov 22, 2006

I have, in late August, and at that time, we found that Rubyful Soup
was ten times slower than Hpricot and Mechanize.

Henry Maddocks · Nov 23, 2006

I recently wrote a scraper in rubyfulsoup and then rewrote it in
hpricot. The hpricot version was MUCH faster, had less code and is
easier to understand. I was a bit dubious of hpricot initially
because of the 'strange syntax' but I am definitely sold now.

As for correctness, I can't comment.

_why · Nov 25, 2006

In terms of preserving the original HTML, I found the libxml2 and Hpricot
parsers to be fairly even, with both doing a pretty good job of fixing up
broken HTML.

Thanks, Ross, that was great. Libxml2 has HTML fixup stuff? That's
sensational. Are the bindings pretty stable?

_why

Ross Bamford · Nov 25, 2006

Thanks, Ross, that was great. Libxml2 has HTML fixup stuff? That's
sensational. Are the bindings pretty stable?

Surely does: http://xmlsoft.org/html/libxml-HTMLparser.html . It's a new
addition to the bindings (still in CVS) but it's really 'just another
parser' and uses the same (reasonably well tested) parser context / tree
bindings as the regular XML parsers.
