Hpricot/Rubyful Soup comparison

Wes Gamble

Has anyone done a head to head comparison of Hpricot and Rubyful Soup
(both HTML parsers)?

If so, would you be willing to comment on which one a) is faster for an
average sized HTML page and b) preserves the original HTML better.

Thanks,
Wes
 
lrlebron

I've used both Hpricot and Rubyful Soup to parse the Google News page
and found Hpricot to be much faster.

Luis
 
Peter Szinek

Wes said:
> Has anyone done a head to head comparison of Hpricot and Rubyful Soup
> (both HTML parsers)?
>
> If so, would you be willing to comment on which one a) is faster for an
> average sized HTML page and

I did not do any benchmarks, but I am scraping a lot of relatively big
pages on a daily basis and I can tell you, RubyfulSoup is orders of
magnitude slower than Hpricot. I am absolutely sure about this. I am doing
things with Hpricot which should be extremely slow (e.g. traversing the
whole tree and doing expensive operations on all Hpricot::Elements) yet
Hpricot is surprisingly fast. Rubyful is nowhere near.

> b) preserves the original HTML better.

Hmm, this I don't know, but I guess the term 'preserves HTML better'
should be defined first with some metrics or something (deviation from
the HTML standard?). There are a lot of HTML pages so badly formed
that even a human would come up with multiple solutions for their
correction.

I think the only real-life quality meter is to process your pages with
both of them and see which one yields better results. I did not play too
much with RubyfulSoup but I am writing a quite serious screen scraping
framework based on Hpricot, and so far I have had no real problems - and I
am doing all kinds of weird things.
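
For anyone who wants to run both parsers over their own pages, here is a minimal timing sketch using Ruby's standard `benchmark` library. The helper name `time_parsers`, the placeholder callables and the sample page are all made up for illustration; to compare the real libraries you would replace the placeholders with actual parser calls (for instance `->(html) { Hpricot(html) }`) and load your own pages.

```ruby
require 'benchmark'

# Times each parser over the same pages and returns { label => seconds }.
# `parsers` maps a label to any callable that accepts an HTML string.
def time_parsers(parsers, pages, runs: 5)
  parsers.each_with_object({}) do |(name, parse), timings|
    timings[name] = Benchmark.realtime do
      # Each call gets its own copy so no parser can alter the shared input.
      runs.times { pages.each { |html| parse.call(html.dup) } }
    end
  end
end

# Placeholder callables so the sketch runs on its own (assumption: these
# stand in for real parsers and do no real parsing).
PLACEHOLDER_PARSERS = {
  'tag_scan'   => ->(html) { html.scan(/<[^>]+>/) },
  'char_split' => ->(html) { html.chars }
}

SAMPLE_PAGES = ['<div><p class="new">Paragraph one<p>Paragraph two</div>'] * 50

time_parsers(PLACEHOLDER_PARSERS, SAMPLE_PAGES).each do |name, seconds|
  puts format('%-12s %.4fs', name, seconds)
end
```

Timing alone says nothing about output quality, so the parsed results still need to be inspected by hand on the pages that matter to you.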

Cheers,
Peter

__
http://www.rubyrailways.com
 
Luciano Ramalho

> I did not do any benchmarks, but I am scraping a lot of relatively big
> pages on a daily basis and I can tell you, RubyfulSoup is orders of
> magnitude slower than Hpricot.

Hpricot is partially written in C, so it should be faster than a
pure-Ruby lib like RubyfulSoup.

Also, RubyfulSoup aims to be very resilient to malformed markup, so it
must resort to heuristics that have a performance cost. I don't know
how Hpricot handles HTML or XML with really serious flaws like tags
that open but never close and so on, but in my experience RubyfulSoup
has managed to deal amazingly well with such problems. If you need to
parse low quality markup, the performance penalty of RubyfulSoup may
be a price well worth paying.

Cheers,

Luciano
 
Peter Szinek

Luciano said:
> Hpricot is partially written in C, so it should be faster than a
> pure-Ruby lib like RubyfulSoup.

True.

> Also, RubyfulSoup aims to be very resilient to malformed markup,

So is Hpricot. Hpricot is not just an HTML parser which can parse
(relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
whether Hpricot's 'somehow' is better or worse than RubyfulSoup's, but
it is a fact that Hpricot handles malformed pages very well.

> so it
> must resort to heuristics that have a performance cost. I don't know
> how Hpricot handles HTML or XML with really serious flaws like tags
> that open but never close and so on,

This case specifically is absolutely OK. Maybe we would need a list of
serious problems and see how Hpricot and RubyfulSoup handle each of them.
From what I have seen, Hpricot did not have any problems with any page...

> has managed to deal amazingly well with such problems. If you need to
> parse low quality markup, the performance penalty of RubyfulSoup may
> be well worth the price.

I am still not sure what the added benefits of RubyfulSoup parsing are
over Hpricot (although I am not claiming that there are none) - I would
like to see a real serious comparison to decide this...

Peter

__
http://www.rubyrailways.com
 
Gregory Seidman

[...]
} > Also, RubyfulSoup aims to be very resilient to malformed markup,
} So is Hpricot. Hpricot is not just an HTML parser which can parse
} (relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
} whether Hpricot's 'somehow' is better or worse than RubyfulSoup's, but
} it is a fact that Hpricot handles malformed pages very well.
}
} > so it must resort to heuristics that have a performance cost. I don't
} > know how Hpricot handles HTML or XML with really serious flaws like
} > tags that open but never close and so on,
} This case specifically is absolutely OK. Maybe we would need a list of
} serious problems and see how Hpricot and RubyfulSoup handle each of them.
} From what I have seen, Hpricot did not have any problems with any page...

HPricot even keeps track of when tags are (incorrectly) closed by a
different close tag. This can allow you to track down issues in broken HTML
if that's your intent, but since I am mostly using HPricot for sanitization
I just set the close tags to nil so the output closes with the correct tag.
I do find it a little annoying that HPricot will always produce an
open/close pair even if the input was self-closing (e.g. <foo />) unless
the tag is known to be an empty tag by HPricot (see
Hpricot::ElementContent).

} > has managed to deal amazingly well with such problems. If you need to
} > parse low quality markup, the performance penalty of RubyfulSoup may
} > be well worth the price.
} I am still not sure what are the added benefits of RubyfulSoup parsing
} over HPricot (although I am not claiming that there are none) - I would
} like to see a real serious comparison to decide this...

I haven't tried RubyfulSoup, but HPricot suits my needs nicely. I am
delighted by its reliance on a bare minimum of HPricot-specific objects. It
doesn't try to behave like a real DOM, which means that it can use arrays
for child lists and ordinary references for parent nodes and hashes for
attributes, all read/write. It is possible to perform significant
transformations with minimal difficulty.
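
To illustrate the design being described (plain arrays for children, plain references for parents, plain hashes for attributes), here is a small sketch in ordinary Ruby. The `Node` struct and its `add_child` helper are hypothetical and are not Hpricot's classes or API; the point is only that ordinary Array and Hash methods are enough to rearrange such a tree.

```ruby
# Hypothetical node type for illustration only; NOT Hpricot's own classes.
Node = Struct.new(:name, :attributes, :children, :parent) do
  # Appends a child and points its parent reference back at this node.
  def add_child(node)
    node.parent = self
    children << node
    node
  end
end

div  = Node.new('div', { 'class' => 'post' }, [])
para = div.add_child(Node.new('p', {}, []))
div.add_child(Node.new('script', {}, []))

# Transformations are just Array and Hash operations.
div.children.reject! { |child| child.name == 'script' }  # drop a subtree
para.attributes['class'] = 'new'                          # write an attribute
para.parent.attributes.delete('class')                    # edit via the parent

puts div.children.map(&:name).inspect
puts para.attributes.inspect
```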

} Peter
--Greg
 
Bob Hutchison

> Has anyone done a head to head comparison of Hpricot and Rubyful Soup
> (both HTML parsers)?
>
> If so, would you be willing to comment on which one a) is faster for an
> average sized HTML page and b) preserves the original HTML better.

I switched from Rubyful Soup to Hpricot a while ago. The reason was
performance on 1000-2000 character HTML chunks -- I didn't do a
benchmark because there just was no need to... Hpricot is *a lot*
faster.

I have no idea which preserves HTML better. I'm only using them to
find specific bits of the HTML (e.g. links, images, a few other
things). I do not use either to transform the input HTML; I *always*
keep the input as it was. In all cases I have HTML in a string that I
give to the parser. I do know that with Rubyful Soup it was
absolutely necessary to dup the string first, or you were liable to
have changes made to the input string.
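
The dup precaution can be shown with a small sketch. The `mutating_parse` method below is a made-up stand-in that edits its argument in place; it is not Rubyful Soup's code and only mimics the kind of side effect described above.

```ruby
# Hypothetical stand-in for a parser that normalizes its input in place.
# It is NOT Rubyful Soup's real code; it only mimics the side effect.
def mutating_parse(html)
  html.gsub!(/<br>/i, '<br />')   # gsub! edits the caller's String object
  html.scan(/<[^>]+>/)            # return the tags that were found
end

original  = '<p>one<br>two</p>'
safe_copy = original.dup          # the parser only ever sees this copy

mutating_parse(safe_copy)

puts original    # the original string is left as it was
puts safe_copy   # the copy carries the in-place changes
```

Calling `freeze` on the original is another option if you would rather get an error than a silent change.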

Cheers,
Bob

----
Bob Hutchison -- blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. -- <http://www.recursive.ca/>
Raconteur -- <http://www.raconteur.info/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>
 
Luciano Ramalho

> So is Hpricot. Hpricot is not just an HTML parser which can parse
> (relatively) valid HTML - it can parse any HTML 'somehow'. We can argue
> whether Hpricot's 'somehow' is better or worse than RubyfulSoup's, but
> it is a fact that Hpricot handles malformed pages very well.

Thanks for the input, Peter. From your opinion and others', it seems
Hpricot is the best option. Coming from Python, I was used to
BeautifulSoup, from which RubyfulSoup derived, and I was very happy
with it. But if we can have the same benefits with better performance,
then it's a no-brainer!

Cheers,

Luciano
 
Wes Gamble

Thanks for all of the comments.

I was pretty sure that Hpricot was faster since it is partially written
in C, but it's nice to hear a resounding "YES" on that topic.

My concern about "preserving original markup" has to do with this
application I'm writing, which grabs a page and then tries to display
it. When RubyfulSoup encountered bad HTML, it could parse it OK,
but it always attempted to fix it when I went to write out the parse tree,
which can cause problems when you try to redisplay the HTML.

Some malformed HTML is handled fine by browsers, so I'd like to preserve
the original HTML regardless of its quality. If Hpricot will not only
parse my HTML quickly, but also not fix the HTML on the way out (dumping
the parse tree), that would be ideal.

Again, thanks for all of the discussion - it's quite helpful.

Wes
 
_why

> My concern about "preserving original markup" has to do with this
> application I'm writing, which grabs a page and then tries to display
> it. When RubyfulSoup encountered bad HTML, it could parse it OK,
> but it always attempted to fix it when I went to write out the parse tree,
> which can cause problems when you try to redisplay the HTML.

I totally agree with you regarding preserving the original markup. In fact,
the latest Hpricot code (in subversion) has two methods for output:

* `to_html` which outputs fully closed tags and strips out bogus end tags.
* `to_original_html` which outputs the original document (as close as it can)
with your modifications made.

So, for example, I use the `to_original_html` method in MouseHole, which is a
scriptable personal HTTP proxy (sort of like greasemonkey). Some pages (like
Boing Boing, for instance) completely break if you try to fix up the HTML. But
this new method can successfully remove stuff and alter stuff without turning the
whole page upside-down.

_why
 
_why

> I do find it a little annoying that HPricot will always produce an
> open/close pair even if the input was self-closing (e.g. <foo />) unless
> the tag is known to be an empty tag by HPricot (see
> Hpricot::ElementContent).

Mmm. Okay, good point. So if a tag comes in as self-closing, keep it that way?
I think that's reasonable.

_why
 
Wes Gamble

_why said:
> I totally agree with you regarding preserving the original markup. In fact,
> the latest Hpricot code (in subversion) has two methods for output:
>
> * `to_html` which outputs fully closed tags and strips out bogus end tags.
> * `to_original_html` which outputs the original document (as close as it can)
> with your modifications made.

sweet.
 
Wes Gamble

Actually, I'm kind of hoping that I can make mods. to the parse tree,
but that no "unnecessary fixing" of bad HTML occurs.

So I'm wondering does modifying the parse tree at all and then
outputting it imply that all of the malformed HTML will be
fixed/modified in some way or not?

Thanks,
Wes
 
_why

> Actually, I'm kind of hoping that I can make mods. to the parse tree,
> but that no "unnecessary fixing" of bad HTML occurs.
>
> So I'm wondering does modifying the parse tree at all and then
> outputting it imply that all of the malformed HTML will be
> fixed/modified in some way or not?

With `to_original_html`, no malformed HTML is fixed.
<div><p class="new">Paragraph one<p class="new">Paragraph two <b>with <i>some</b> tags in it <b etc.=></p>

With `to_html`, Hpricot will line up all the tags.

_why
 
Giles Bowkett

I did, in late August. At that time we found that Rubyful Soup
was ten times slower than Hpricot and Mechanize.
 
Henry Maddocks

I recently wrote a scraper in Rubyful Soup and then rewrote it in
Hpricot. The Hpricot version was MUCH faster, had less code and is
easier to understand. I was a bit dubious of Hpricot initially
because of the 'strange syntax' but I am definitely sold now.

As for correctness, I can't comment.
 
_why

> In terms of preserving the original HTML, I found the libxml2 and Hpricot
> parsers to be fairly even, with both doing a pretty good job of fixing up
> broken HTML.

Thanks, Ross, that was great. Libxml2 has HTML fixup stuff? That's
sensational. Are the bindings pretty stable?

_why
 
