help with htree

Ara.T.Howard · Oct 10, 2005

i'm trying to use the htree library for a relatively simple task in the hope that
its parser will outlive one i roll on my own - it certainly seems robust as
it is... in any case i'm trying to do the following:

- scan a doc for any elements that have a certain attribute
- iff the element with that attribute is of a certain class (b, i, pre, etc)
  then i'll want to replace the text of that element in-place
- finally, a new doc will be produced

any hints appreciated. i've got a hand parsed version running now - but
really would like to use htree.
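
For readers following along, the three steps can be sketched without any parser at all. This is only an illustration under assumptions: `Node`, `rewrite`, `to_markup` and the attribute name `mark` are made up for the example and are not part of htree.

```ruby
# Hypothetical stand-in for a parsed document; these are NOT htree's classes.
# Text nodes are plain Strings, elements are Node structs.
Node = Struct.new(:name, :attrs, :children)

TARGET_TAGS = %w[b i pre].freeze

# Returns a new tree; the input tree is left untouched.
def rewrite(node, attr_name, new_text)
  return node if node.is_a?(String)
  if node.attrs.key?(attr_name) && TARGET_TAGS.include?(node.name)
    # matching element: keep name and attributes, swap in the new text
    return Node.new(node.name, node.attrs.dup, [new_text])
  end
  Node.new(node.name, node.attrs.dup,
           node.children.map { |child| rewrite(child, attr_name, new_text) })
end

# Serialize the tree back to markup (no escaping; enough for a sketch).
def to_markup(node)
  return node if node.is_a?(String)
  attrs = node.attrs.map { |k, v| %( #{k}="#{v}") }.join
  inner = node.children.map { |c| to_markup(c) }.join
  "<#{node.name}#{attrs}>#{inner}</#{node.name}>"
end

doc = Node.new("p", {}, [
  "see ",
  Node.new("b", { "mark" => "x" }, ["old"]),
  Node.new("span", { "mark" => "x" }, ["untouched"])
])

puts to_markup(rewrite(doc, "mark", "new"))
# <p>see <b mark="x">new</b><span mark="x">untouched</span></p>
```

The `span` carries the attribute but is not in the tag list, so it passes through unchanged.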

thanks.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells amoung the causes of death
| Like a lamp standing in a strong breeze. --Nagarjuna
===============================================================================

Ara.T.Howard · Oct 10, 2005

on a related note, does this seem right to people:

irb(main):019:0> HTree("<hr/>").display_html ''
=> "<hr\n>"

notice it ate the closing slash. i know that's bad form, but i wonder if this
is intended or not...

-a

Gavin Kistner · Oct 11, 2005

> i'm trying to use the htree library for a relatively simple task in the hope that
> its parser will outlive one i roll on my own - it certainly seems robust as
> it is... in any case i'm trying to do the following
>
> - scan a doc for any elements that have a certain attribute
> - iff the element with that attribute is of a certain class (b, i, pre, etc)
>   then i'll want to replace the text of that element in-place
> - finally, a new doc will be produced

Not answering your question: why not use regexp?

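mydoc.gsub(
  # \2 refers back to the tag name captured by (b|i|pre)
  /(myattrname\s*=\s*(?:'#{value}'|"#{value}")[^>]+>[^<]*<(b|i|pre)[^>]*>).*?<\/\2>/,
  # put the closing tag back, since the match consumed it
  "\\1#{new_text}</\\2>"
)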

(untested)

Notable flaw - will fail if the same element is nested within itself.
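
The nesting flaw is easy to reproduce. The sketch below reduces the pattern above to a bare tag with a backreference; the sample strings are made up for the demonstration.

```ruby
# Non-greedy match from an opening tag to the first matching closing tag.
pattern = /<(b|i|pre)>.*?<\/\1>/m

flat   = "<b>old</b> rest"
nested = "<b>outer <b>inner</b> tail</b> rest"

# Block form of sub so $1 (the captured tag name) is available.
flat_result   = flat.sub(pattern)   { "<#{$1}>new</#{$1}>" }
nested_result = nested.sub(pattern) { "<#{$1}>new</#{$1}>" }

puts flat_result    # <b>new</b> rest
puts nested_result  # <b>new</b> tail</b> rest
```

In the nested case the match stops at the inner `</b>`, so the outer closing tag is left dangling in the output.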

Ara.T.Howard · Oct 11, 2005

> Not answering your question: why not use regexp?
>
> mydoc.gsub(
>   /(myattrname\s*=\s*(?:'#{value}'|"#{value}")[^>]+>[^<]*<(b|i|pre)[^>]*>).*?<\/\2>/,
>   "\\1#{new_text}</\\2>"
> )

i already have a working version. i just wanted to learn htree. it's very
robust and context sensitive.

> (untested)
>
> Notable flaw - will fail if the same element is nested within itself.

a big flaw for parsing hierarchical text no? ;-) my solution doesn't handle
nesting either - though it could quite easily : it's a mini/dumb parser.

cheers.

-a

Gavin Kistner · Oct 11, 2005

> a big flaw for parsing hierarchical text no? ;-)

Certainly, if your input is likely to have such. It would be illegal
for a pre to appear inside a pre tag, and simply unlikely (in my
world) to find nested <b> or <i> tags.

I'm a huge fan of clean structured markup. However (precisely because
it is so clean) I often find my screen-scraping activities can be
achieved more quickly by using a simple "look everywhere" regexp on a
structured document, compared to loading the document into REXML
(I've not used/seen htree) and using DOM/XPath to find the
appropriate node. There are gobs of cases where this isn't the case,
but it has been for a lot of simple cases.
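
For comparison, here is roughly what the DOM/XPath route looks like with REXML. It is a sketch under assumptions: the attribute name `mark` and the sample markup are made up, and REXML only accepts well-formed XML, so real-world HTML may not parse at all.

```ruby
require 'rexml/document'

xml = '<doc><b mark="x">old</b><i>keep</i><span mark="x">skip</span></doc>'
doc = REXML::Document.new(xml)

# Visit every element that carries the (made-up) "mark" attribute.
REXML::XPath.each(doc, '//*[@mark]') do |el|
  next unless %w[b i pre].include?(el.name)
  el.text = 'new'   # replaces the element's first text node
end

out = doc.to_s
puts out
```

The tag-name filter could also be folded into the XPath expression; it is kept in Ruby here so the two conditions stay visible.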

Anyhow - glad you have a current working solution, and good luck with
htree.

Bob Hutchison · Oct 11, 2005

Hi,

Thanks for the pointer to HTree.

> on a related note, does this seem right to people:
>
> irb(main):019:0> HTree("<hr/>").display_html ''
> => "<hr\n>"

HTML doesn't have an element <hr/> it has <hr>, like the old tag.
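
To make the distinction concrete: in HTML an empty element such as hr is written as a bare start tag, while the trailing slash is XML syntax. The sketch below shows how a serializer might choose between the two. It is not htree's code; `start_tag`, the `mode` symbols, and the short element list are assumptions for the example.

```ruby
# A few of HTML's empty elements (not a complete list).
VOID_ELEMENTS = %w[br hr img input link meta].freeze

# Hypothetical helper: emit the start tag for an attribute-less element.
def start_tag(name, mode)
  return "<#{name}>" unless VOID_ELEMENTS.include?(name)
  mode == :xhtml ? "<#{name} />" : "<#{name}>"
end

puts start_tag("hr", :html)   # <hr>
puts start_tag("hr", :xhtml)  # <hr />
```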

> notice it ate the closing slash. i know that's bad form, but i wonder if this
> is intended or not...

Give this code a shot, I think this is what you want assuming I read
you correctly. Had to dump yaml a few times to figure out what was
going on

Cheers,
Bob

require 'htree'

# Rebuild an element; if it carries a "replace" attribute, the attribute's
# value becomes the new element name and the attribute itself is dropped.
def handle_element(elem)
  new_element_name = elem.element_name().to_s
  new_attributes = Hash.new().merge(elem.attributes)
  elem.attributes.each{ | name, value |
    if "replace" == name.to_s then
      new_element_name = value.to_s
      new_attributes.delete(name)
      break
    end
  }

  if (elem.kind_of?(HTree::Text))
    new_elem = elem.to_s
  elsif (elem.kind_of?(HTree::Elem))
    # copy text children as strings and recurse into child elements
    children = []
    elem.children.each(){ | child |
      if (child.kind_of?(HTree::Text)) then
        children << child.to_s
      elsif (child.kind_of?(HTree::Elem))
        children << handle_element(child)
      end
    }
    new_elem = HTree::Elem.new(new_element_name, new_attributes, children)
  end
  return new_elem
end

# Build a new document from the rebuilt top-level elements.
def handle_root(root)
  children = []
  root.children.each(){ | elem |
    children << handle_element(elem) if (elem.kind_of?(HTree::Elem))
  }
  HTree::Doc.new(children)
end

html = %Q{

hello there, how are you?

}

tree = HTree(html)
new_tree = handle_root(tree)

puts tree.display_html("")
puts new_tree.display_html("")
__EOF__
