HTML DOM

Victor Tanvuia

Hi,

I'm trying to build an HTML page indexer in Ruby and I'd like to be able
to use DOM and/or XPath on a document. The application is currently
using REXML, but that seems to be a bit too strict and any deviation
from XML causes the engine to throw an error and quit.

Is there a way to make REXML more permissive or is there another library
that does HTML DOM and XPath?
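
For readers who want to reproduce the behaviour described above, here is a minimal sketch. The sample markup and the helper name `parse_strict` are made up for illustration; the only assumption is that REXML is available, as it is in a standard Ruby install.

```ruby
require "rexml/document"

# Typical "tag soup": <br> is never closed, which is fine in HTML but not in XML.
html = "<html><body><p>Hello<br>world</p></body></html>"

# Hypothetical helper: returns the parsed document, or nil if REXML rejects it.
def parse_strict(markup)
  REXML::Document.new(markup)
rescue REXML::ParseException => e
  # REXML refuses anything that is not well-formed XML.
  warn "REXML rejected the input: #{e.message.lines.first.to_s.strip}"
  nil
end

doc = parse_strict(html)
puts doc.nil? ? "not well-formed XML" : "parsed"
```

Closing the tag as `<br/>` makes the same input acceptable, which is why pages written as XHTML go through while ordinary HTML does not.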
 
Robert Klemme

> I'm trying to build an HTML page indexer in Ruby and I'd like to be able
> to use DOM and/or XPath on a document. The application is currently
> using REXML, but that seems to be a bit too strict and any deviation
> from XML causes the engine to throw an error and quit.
>
> Is there a way to make REXML more permissive

No.

> or is there another library
> that does HTML DOM and XPath?

Nokogiri and Hpricot seem to be the most popular.

Cheers

robert
 
Skye Shaw!@#$

> Hi,
>
> I'm trying to build an HTML page indexer in Ruby and I'd like to be able
> to use DOM and/or XPath on a document. The application is currently
> using REXML

Yes, REXML can be awkward if you're used to using the DOM. IMHO.

> Is there a way to make REXML more permissive or is there another library

There are libxml bindings for Ruby, but I recall that library missing
getElementsByTagName and getElementById. Though it does have a method
to query the DOM via XPath.

Have you tried using REXML's SAX2 parser? I think it would be better
suited for your problem.

-Skye
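
For reference, a rough sketch of what the SAX2 suggestion could look like. The sample markup and the helper name `collect_links` are invented for illustration. Note that this is still an XML parser, so it assumes well-formed input and should not be expected to cope with sloppy HTML any better than `REXML::Document` does.

```ruby
require "rexml/document"
require "rexml/parsers/sax2parser"

# Well-formed markup only: the SAX2 parser is still an XML parser.
xhtml = '<html><body><a href="/one">1</a><p><a href="/two">2</a></p></body></html>'

# Hypothetical helper: collect the href of every <a> element as it streams past.
def collect_links(markup)
  links = []
  parser = REXML::Parsers::SAX2Parser.new(markup)
  # Fire the block for every <a> start tag; attributes arrive as a Hash.
  parser.listen(:start_element, ["a"]) do |_uri, _localname, _qname, attributes|
    links << attributes["href"]
  end
  parser.parse
  links
end

p collect_links(xhtml)
```

The event-driven style avoids building a tree in memory, which can matter for an indexer that only needs a few attributes from large documents.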
 
Victor Tanvuia

Thanks for the help. I've decided to go for Hpricot and it works rather
well now. Don't know why but for some reason I was reluctant to go for
that. Anyway it's great... I love it. It feels like jQuery :)
 
Robert Klemme

> Yes, REXML can be awkward if you're used to using the DOM. IMHO.

Why do you say that? REXML provides an XML DOM in similar ways to other
XML libs. You can even use XPath queries.

> There are libxml bindings for Ruby, but I recall that library missing
> getElementsByTagName and getElementById. Though it does have a method
> to query the DOM via XPath.

libxml won't help as Victor is not processing XML.

> Have you tried using REXML's SAX2 parser? I think it would be better
> suited for your problem.

No, his problem is that he used an XML tool to process HTML. While many
web pages are valid XML, not all are, owing to the history of browser
development. Thus it's better to use a tool suited to the job, i.e. one
capable of parsing HTML which is not valid XML.

Kind regards

robert
 
Mark Thomas

> libxml won't help as Victor is not processing XML.

Whoa... and right after you recommended the libxml-based Nokogiri.

I have been using libxml2 (in various forms) for years to parse HTML.
I find it to be the best HTML parser out there. It's also completely
XPath 1.0 compliant--my XPaths tend to break in Hpricot.

Both libxml-ruby and Nokogiri have similar functionality. I like the
Nokogiri API a little better.

-- Mark.
 
Robert Klemme

> Whoa... and right after you recommended the libxml-based Nokogiri.

:-} Sorry, I did not know that Nokogiri was based on libxml. Thanks to
you and Aaron for the update! Skye seemed to suggest XML tools only,
which are clearly not suited for the job. I'll shut up now.

Kind regards

robert
 
Skye Shaw!@#$

> Why do you say that? REXML provides an XML DOM in similar ways to other
> XML libs. You can even use XPath queries.

Not sure what you mean by similar. Similar in that there is a tree of
elements that can be manipulated, but not similar to anything called
DOM.

In REXML, an Element is a REXML::Element, which is a REXML::Parent,
which is a REXML::Child (huh?), which includes REXML::Node.
There is no NodeList, createTextNode(), getElementById(), etc...

To get an element by its ID, I'd have to say something like:

    my_document.root.elements.each("//*[@id='crap']") do |element|
      # do something with crap
    end

I would have liked to be able to use the DOM when using REXML;
unfortunately REXML doesn't really support it.
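
As an illustration of the workaround, a DOM-style lookup can be approximated with a small wrapper around REXML's XPath support. The helper name `get_element_by_id` is hypothetical and not part of REXML, and the sketch assumes the id contains no single quote.

```ruby
require "rexml/document"

# Hypothetical helper; REXML itself ships nothing named get_element_by_id.
# Assumes the id contains no single quote, since it is interpolated into XPath.
def get_element_by_id(doc, id)
  REXML::XPath.first(doc, "//*[@id='#{id}']")
end

doc = REXML::Document.new(
  '<html><body><div id="crap">text</div><div id="other"/></body></html>'
)

element = get_element_by_id(doc, "crap")
puts element.name
puts element.text
```

This only papers over the missing method; the input still has to be well-formed XML for REXML to build the tree in the first place.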

> libxml won't help as Victor is not processing XML.

That should be fine.

> No, his problem is that he used an XML tool to process HTML.

You're right. He should never have been using REXML.

-Skye