Parsing HTML into a tree

Eli Bendersky

Hello,

I have an HTML document that I need to parse into a tree. I was
looking for a suitable module to use, and found two:

1) Hpricot - it looks like a nice and fast HTML parser, but its
interface appears to be unsuitable for parsing an HTML file into a
tree. Is this possible, and am I missing something?

2) htree - it looks closer to what I need, but its documentation is
very poor (almost nonexistent). When I 'pp' a parsed HTree document I
see a representation, but how can I actually traverse the tree? At the
moment I'm using reflection to "disassemble" the htree tree structure.
There must be a better way! Can someone please provide an example of
how to recursively print out the tree, telling for each node what kind
of node it is?

Are there other options?

Thanks in advance
Eli
 
Ganesh Gunasegaran

Hi Eli,

I am not sure exactly what you are trying to do, but regarding this:

> Can someone please provide an example of
> how to recursively print out the tree, telling for each node what kind
> of node it is?


gg.html (sample file)
~~~~~~~~~~~~~~~~
<html>
<head>
<title>Test Doc</title>
</head>
<body>
<div id="main">
Main div Content
<div id="sub">
Sub div content
<span> Test span content</span>
</div>
</div>
</body>
</html>


gg.rb
~~~~
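require "htree"

# Parse the HTML read from standard input into an HTree document.
tree = HTree.parse(STDIN)

# Visit every element in the tree, printing its name and its text content.
tree.traverse_all_element do |e|
  puts e.name
  puts e.extract_text
end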

Output
~~~~~~
imayam:~/work/temp/htree-0.6/test gg$ ruby gg.rb < gg.html
{http://www.w3.org/1999/xhtml}html
Test Doc
Main div Content
Sub div content
Test span content
{http://www.w3.org/1999/xhtml}head
Test Doc
{http://www.w3.org/1999/xhtml}title
Test Doc
{http://www.w3.org/1999/xhtml}body
Main div Content
Sub div content
Test span content
{http://www.w3.org/1999/xhtml}div
Main div Content
Sub div content
Test span content
{http://www.w3.org/1999/xhtml}div
Sub div content
Test span content
{http://www.w3.org/1999/xhtml}span
Test span content

You can also traverse using 'traverse_some_element', 'each_child',
'traverse_text', etc. You can also convert the HTree document to REXML
and traverse that instead. Some context on what you are actually trying
to achieve would probably help.
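
For the REXML route, a recursive walk that reports the kind of each
node might look roughly like the sketch below. It is only a sketch
under some assumptions: it starts from a REXML::Document you already
have (the HTree-to-REXML conversion step is not shown), and here the
document is built by parsing a shortened, well-formed version of
gg.html directly with REXML. The helper name describe_tree is made up
for this example.

```ruby
require "rexml/document"

# Hypothetical helper (not part of htree or REXML): walks a REXML tree and
# returns one indented line per node, naming the kind of node it found.
def describe_tree(node, depth = 0, lines = [])
  indent = "  " * depth
  case node
  when REXML::Document # checked first: Document is a subclass of Element
    lines << "#{indent}Document"
  when REXML::Element
    lines << "#{indent}Element: #{node.name}"
  when REXML::Text
    text = node.value.strip
    # Whitespace-only text nodes (indentation, newlines) are skipped.
    lines << "#{indent}Text: #{text.inspect}" unless text.empty?
  when REXML::Comment
    lines << "#{indent}Comment: #{node.string.strip.inspect}"
  else
    lines << "#{indent}#{node.class}"
  end
  # Only parent nodes (documents, elements) have children to recurse into.
  node.each { |child| describe_tree(child, depth + 1, lines) } if node.parent?
  lines
end

# Shortened version of gg.html; assumed well-formed so REXML can parse it.
sample = <<~HTML
  <html>
  <body>
  <div id="main">
  Main div Content
  <span> Test span content</span>
  </div>
  </body>
  </html>
HTML

puts describe_tree(REXML::Document.new(sample))
```

Returning the lines instead of printing inside the helper keeps the
walk separate from the output, so the same method can feed a log or a
test as easily as the terminal.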

Cheers,
Ganesh Gunasegaran
 
