Parsing HTML into a tree

Discussion in 'Ruby' started by Eli Bendersky, Apr 25, 2007.

  1. Hello,

    I have an HTML document which I need to parse into a tree. I was
    looking for a suitable module to use, and found two:

    1) Hpricot: while it looks like a nice and fast HTML parser, its
    interface appears completely unsuitable for parsing an HTML
    file into a tree. Is this possible, and am I missing something?

    2) htree: it looks closer to what I need, but its documentation is very
    poor (almost nonexistent). When I 'pp' a parsed HTree document I see a
    representation, but how can I actually traverse the tree? At the
    moment I'm using reflection to "disassemble" the htree tree structure.
    There must be a better way! Can someone please provide an example of
    how to recursively print out the tree, telling for each node what kind
    of node it is?

    Are there other options?

    Thanks in advance
    Eli
     
    Eli Bendersky, Apr 25, 2007
    #1

  2. Hi Eli,

    I am not sure exactly what you are trying to do, but regarding:

    > Can someone please provide an example of
    > how to recursively print out the tree, telling for each node what kind
    > of node it is?



    gg.html (sample file)
    ~~~~~~~~~~~~~~~~
    <html>
    <head>
    <title>Test Doc</title>
    </head>
    <body>
    <div id="main">
    Main div Content
    <div id="sub">
    Sub div content
    <span> Test span content</span>
    </div>
    </div>
    </body>
    </html>


    gg.rb
    ~~~~
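    require "htree"

    # Parse the HTML read from standard input into an HTree document.
    tree = HTree.parse(STDIN)

    # Visit every element in the tree, printing its name and its text content.
    tree.traverse_all_element do |e|
      puts e.name
      puts e.extract_text
    end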

    Output
    ~~~~~~
    imayam:~/work/temp/htree-0.6/test gg$ ruby gg.rb < gg.html
    {http://www.w3.org/1999/xhtml}html
    Test Doc
    Main div Content
    Sub div content
    Test span content
    {http://www.w3.org/1999/xhtml}head
    Test Doc
    {http://www.w3.org/1999/xhtml}title
    Test Doc
    {http://www.w3.org/1999/xhtml}body
    Main div Content
    Sub div content
    Test span content
    {http://www.w3.org/1999/xhtml}div
    Main div Content
    Sub div content
    Test span content
    {http://www.w3.org/1999/xhtml}div
    Sub div content
    Test span content
    {http://www.w3.org/1999/xhtml}span
    Test span content

    You can also traverse using 'traverse_some_element', 'each_child',
    'traverse_text', etc. You can also convert the HTree document to REXML
    and traverse that instead. Some context on what you are actually
    trying to achieve would probably be helpful.
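
For the REXML route mentioned above, a recursive walk that reports the kind of each node might look roughly like the sketch below. It is only an illustration under some assumptions: the sample document is well-formed, so it is fed to REXML directly rather than converted from HTree first, and the names SAMPLE and describe_tree are made up for this example.

```ruby
require "rexml/document"

# A well-formed copy of the gg.html sample, embedded so the sketch is self-contained.
SAMPLE = <<~HTML
  <html>
  <head><title>Test Doc</title></head>
  <body>
  <div id="main">Main div Content
  <div id="sub">Sub div content<span> Test span content</span></div>
  </div>
  </body>
  </html>
HTML

# Hypothetical helper: walks a REXML node recursively and collects one
# indented line per node, stating what kind of node it is.
def describe_tree(node, depth = 0, lines = [])
  indent = "  " * depth
  case node
  when REXML::Element
    lines << "#{indent}element: #{node.name}"
    # Recurse into child nodes (elements, text, comments, ...).
    node.children.each { |child| describe_tree(child, depth + 1, lines) }
  when REXML::Text
    text = node.value.strip
    # Skip whitespace-only text between tags.
    lines << "#{indent}text: #{text}" unless text.empty?
  else
    # Any other node kind (comment, instruction, ...) is reported by class name.
    lines << "#{indent}#{node.class.name}"
  end
  lines
end

doc = REXML::Document.new(SAMPLE)
puts describe_tree(doc.root)
```

Real-world HTML is often not well-formed, in which case REXML will refuse to parse it; that is where parsing with HTree first and converting the result would come in.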

    Cheers,
    Ganesh Gunasegaran

     
    Ganesh Gunasegaran, Apr 25, 2007
    #2
