hpricot or nokogiri?

goodieboy

OK, was completely sold on Hpricot and now am having my doubts. I
can't seem to get to any of the docs (the site is down). Is it still
being developed? Who are the developers? I love the API and really am
hoping to use it...

So then I tried out Nokogiri and it works well. The bug that Hpricot
had (re-naming a node only names the open-tag) is not present in
Nokogiri. Great! But it's built on libxml, which I don't know much
about. It seems a little more heavyweight than Hpricot. I also heard
that the main developer for libxml doesn't have much time to devote to
the project.

Any advice for me, folks?

Matt
 
Ryan Davis

hpricot drops the ball in a lot of ways and is much more heavyweight
than nokogiri. Parsing an 8 meg itunes xml file takes over a gig in
hpricot (according to my students) and nokogiri zipped right through it.

The libxml developer doesn't need to devote much time to the project
(assuming you mean libxml itself, not nokogiri). It is a very mature
library. On the other hand, hpricot has had a lot of open bugs for a
long time and they've not been touched one way or another. I find
Aaron Patterson very responsive to my bug reports (but I'm biased,
he's just down the street--look at the bug tracker on rubyforge for
less biased data).
 
Aaron Patterson

Hi Matt,

Yes, Nokogiri is built on top of the libxml2 project from Gnome.
libxml2 is actively developed and well supported since it is the XML
parser used by the Gnome project:

http://xmlsoft.org/

If you find bugs, we have a

* mailing list: http://rubyforge.org/mailman/listinfo/nokogiri-talk
* IRC Channel on freenode: #nokogiri
* Ticketing system:
http://nokogiri.lighthouseapp.com/projects/19607-nokogiri/overview
* RDoc: http://nokogiri.rubyforge.org/nokogiri/

I've switched my projects from Hpricot to Nokogiri, and I'm quite happy.
 
matt mitchell

This is great, thank you. Definitely helps clear things up a bit. So
it's not just me... Hpricot has a few bugs that have been around for a
while. That's too bad :(

OK, for a quick Nokogiri question... is it possible to ask a node if
it responds to a certain xpath? Something like:

matching = nodes.select{|n| n.is_findable_by('[@class=plant]') }

Thanks,
Matt
 
Lance Bradley

I've been going through a similar situation with my current project. I
was initially using Hpricot, and was very frustrated by the lack of
documentation and some of the lingering bugs. I've now switched to
nokogiri and have been very impressed with it.

I'm now running into some of the robustness issues that you face when
you process data from the open web, like Dan alluded to. I'm using
nokogiri's sax implementation, and I've run into some problems with
handling html entities, whether they are preserved or decoded into utf-8.
In both cases, nokogiri will quit calling my start and end element
handlers, but continue to call my character handler after an entity is
seen. Specifically, I've noticed this behavior when it sees &nbsp; and
…. Has anyone else experienced this and have any advice to share?
I appreciate it!
-lance

(here's my code)

require 'rubygems'
require 'nokogiri'
require 'tidy'

# Reopens Nokogiri's SAX document class so that it collects a
# stripped-down copy of the markup in @rhtml.
class Nokogiri::XML::SAX::Document
  attr_accessor :rhtml

  def initialize
    @rhtml = ""
    @keep_text = true
    @keep_elements = %w{ br p img ul ol title li div table head body
                         meta base blockquote }
  end

  def start_element name, attrs = []
    puts "start element called: " + name
    if @keep_elements.include?(name)
      puts "keeping: #{name}"
      @rhtml << "<#{name}>\n"
    end
    # Stop collecting text while inside script/style elements.
    if ['script', 'style'].include? name
      @keep_text = false
    end
  end

  def characters text
    #@rhtml << @coder.decode( text ) if @keep_text
    @rhtml << text if @keep_text
    puts text
  end

  def end_element name
    puts "end element called: " + name
    if @keep_elements.include?(name)
      @rhtml << "</#{name}>\n"
    end
    # Resume collecting text once the script/style element closes.
    if ['script', 'style'].include? name
      @keep_text = true
    end
  end

end

html = open(ARGV[0], 'r').collect { |l| l }.join

#coder = HTMLEntities.new
#html = coder.decode(html)

# Clean the input with Tidy and emit XML before handing it to the parser.
Tidy.path = '/usr/lib/libtidy-0.99.so.0'
xml = Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xml = true
  #tidy.options.char_encoding = 'utf8'
  tidy.options.preserve_entities = true
  xml = tidy.clean(html)
end

doc = Nokogiri::XML::SAX::Document.new
parser = Nokogiri::XML::SAX::Parser.new(doc)

parser.parse(xml)

puts "doc:"
puts doc.rhtml.gsub(/\n+/, "\n")
 
Trans

Note that there are also the libxml ruby bindings.

http://libxml.rubyforge.org

T.
 
