Chris Gehlker
I'm trying to use Hpricot to clean up the text in a big site full of
old-style HTML. I'm just trying to do things like replacing literal
quote characters with <q> and </q>. I'm hampered by the fact that my
understanding of the HTML DOM comes from reading one web site
yesterday and I don't know any JavaScript. Nonetheless, it seems that
Hpricot should be able to easily give me all the text in the <body>
element of each page because it has a traverse_text() method. The
problem seems to be that if I apply it to a whole page, I also get the
text in the <head> element, and all the methods for selecting seem to
return an element, not a tree.
There is a get_subnode method but it doesn't seem to work as expected.
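The quote replacement itself can be sketched in plain Ruby, independent of Hpricot. This is only a minimal sketch: it assumes straight double quotes that come in balanced pairs within a single piece of text, and the helper name quotes_to_q is hypothetical, not part of any library.

```ruby
# Hypothetical helper (not an Hpricot method): wrap each balanced pair of
# straight double quotes in <q>...</q>.
# Assumption: quotes are paired within the one string passed in; a trailing
# unpaired quote is left untouched.
def quotes_to_q(text)
  text.gsub(/"([^"]*)"/) { "<q>#{$1}</q>" }
end

puts quotes_to_q('He said "hello" and left.')
```

Something like this would presumably be applied to the content of each text node under <body> only, for example by selecting the body element first and walking the text from there rather than from the whole document; that part is untested here.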
Thanks in advance for any help.