hpricot or nokogiri?

goodieboy

Jan 9, 2009

#1

OK, was completely sold on Hpricot and now am having my doubts. I
can't seem to get to any of the docs (the site is down). Is it still
being developed? Who are the developers? I love the API and really am
hoping to use it...

So then I tried out Nokogiri and it works well. The bug that Hpricot
had (re-naming a node only names the open-tag) is not present in
Nokogiri. Great! But it's built on libxml, which I don't know much
about. It seems a little more heavyweight than Hpricot. I also heard
that the main developer for libxml doesn't have much time to devote to
the project.

Any advice for me, folks?

Matt

Ryan Davis

Jan 9, 2009

#2

hpricot drops the ball in a lot of ways and is much more heavyweight
than nokogiri. Parsing an 8 meg itunes xml file takes over a gig in
hpricot (according to my students) and nokogiri zipped right through it.

The libxml developer doesn't need to devote much time to the project
(assuming you mean libxml itself, not nokogiri). It is a very mature
library. On the other hand, hpricot has had a lot of open bugs for a
long time and they've not been touched one way or another. I find
Aaron Patterson very responsive to my bug reports (but I'm biased,
he's just down the street--look at the bug tracker on rubyforge for
less biased data).

Aaron Patterson

Jan 9, 2009

#3

Hi Matt,

Yes, Nokogiri is built on top of the libxml2 project from Gnome.
libxml2 is actively developed and well supported since it is the XML
parser used by the Gnome project:

http://xmlsoft.org/

If you find bugs, we have a

* mailing list: http://rubyforge.org/mailman/listinfo/nokogiri-talk
* IRC Channel on freenode: #nokogiri
* Ticketing system: http://nokogiri.lighthouseapp.com/projects/19607-nokogiri/overview
* RDoc: http://nokogiri.rubyforge.org/nokogiri/

I've switched my projects from Hpricot to Nokogiri, and I'm quite happy.

matt mitchell

Jan 11, 2009

#4

This is great, thank you. It definitely helps clear things up a bit. So
it's not just me... Hpricot has a few bugs that have been around for a
while. That's too bad.

OK, for a quick Nokogiri question... is it possible to ask a node if
it responds to a certain xpath? Something like:

matching = nodes.select{|n| n.is_findable_by('[@class=plant]') }

Thanks,
Matt

Lance Bradley

Feb 12, 2009

#5

I've been going through a similar situation with my current project. I
was initially using Hpricot, and was very frustrated by the lack of
documentation and some of the lingering bugs. I've now switched to
nokogiri and have been very impressed with it.

I'm now running into some of the robustness issues that you face when
processing data from the open web, like Dan alluded to. I'm using
Nokogiri's SAX implementation, and I've run into some problems with
handling HTML entities, whether they are preserved or decoded into UTF-8.
In both cases, Nokogiri will quit calling my start and end element
handlers, but continue to call my character handler, after an entity is
seen. Specifically, I've noticed this behavior when it sees &nbsp; and
&hellip;. Has anyone else experienced this and have any advice to share?
I appreciate it!
-lance

(here's my code)

class Nokogiri::XML::SAX::Document
  attr_accessor :rhtml
  def initialize
    @rhtml = ""
    @keep_text = true
    @keep_elements = %w{ br p img ul ol title li div table head body meta base blockquote }
  end

  def start_element name, attrs = []
    puts "start element called: " + name
    if @keep_elements.include?(name)
      puts "keeping: #{name}"
      @rhtml << "<#{name}>\n"
    end
    if ['script', 'style'].include? name
      @keep_text = false
    end
  end

  def characters text
    #@rhtml << @coder.decode( text ) if @keep_text
    @rhtml << text if @keep_text
    puts text
  end

  def end_element name
    puts "end element called: " + name
    if @keep_elements.include?(name)
      @rhtml << "</#{name}>\n"
    end
    if ['script', 'style'].include? name
      @keep_text = true
    end
  end

end

html = open(ARGV[0], 'r').collect { |l| l }.join

#coder = HTMLEntities.new
#html = coder.decode(html)

Tidy.path = '/usr/lib/libtidy-0.99.so.0'
xml = Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xml = true
  #tidy.options.char_encoding = 'utf8'
  tidy.options.preserve_entities = true
  xml = tidy.clean(html)
end

doc = Nokogiri::XML::SAX::Document.new
parser = Nokogiri::XML::SAX::Parser.new(doc)

parser.parse(xml)

puts "doc:"
puts doc.rhtml.gsub(/\n+/, "\n")

Trans

Feb 12, 2009

#6

Note that there are also the libxml ruby bindings.

http://libxml.rubyforge.org

T.
