help with htree

Ara.T.Howard

i'm trying to use the htree library for a relatively simple task in the hope
that its parser will outlive one i roll on my own - it certainly seems robust
as it is... in any case i'm trying to do the following

- scan a doc for any elements that have a certain attribute
- iff the element with that attribute is of a certain class (b, i, pre, etc)
then i'll want to replace the text of that element in-place
- finally, a new doc will be produced

any hints appreciated. i've got a hand parsed version running now - but
really would like to use htree.

thanks.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells among the causes of death
| Like a lamp standing in a strong breeze. --Nagarjuna
===============================================================================
 
Ara.T.Howard

> i'm trying to use the htree library for a relatively simple task in the hope
> that its parser will outlive one i roll on my own - it certainly seems robust
> as it is... in any case i'm trying to do the following
>
> - scan a doc for any elements that have a certain attribute
> - iff the element with that attribute is of a certain class (b, i, pre, etc)
>   then i'll want to replace the text of that element in-place
> - finally, a new doc will be produced
>
> any hints appreciated. i've got a hand parsed version running now - but
> really would like to use htree.

on a related note, does this seem right to people:

irb(main):019:0> HTree("<hr/>").display_html ''
=> "<hr\n>"

notice it ate the closing slash. i know that's bad form, but i wonder if this
is intended or not...

-a
 
Gavin Kistner

> i'm trying to use the htree library for a relatively simple task in the hope
> that its parser will outlive one i roll on my own - it certainly seems robust
> as it is... in any case i'm trying to do the following
>
> - scan a doc for any elements that have a certain attribute
> - iff the element with that attribute is of a certain class (b, i, pre, etc)
>   then i'll want to replace the text of that element in-place
> - finally, a new doc will be produced

Not answering your question: why not use regexp?

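# \1 = everything up to and including the opening <b>/<i>/<pre> tag,
# \2 = that tag's name, reused to find (and re-emit) the closing tag.
mydoc.gsub(
  /(myattrname\s*=\s*(?:'#{value}'|"#{value}")[^>]*>[^<]*<(b|i|pre)[^>]*>).*?<\/\2>/,
  "\\1#{new_text}</\\2>"
)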

(untested)

Notable flaw - will fail if the same element is nested within itself.
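
To make the nesting flaw concrete, here is a small self-contained sketch in plain Ruby (no htree). The helper name `replace_text` and the sample markup are made up for illustration; it is a simplified variant of the regexp idea above, not the exact expression from the post:

```ruby
# Hypothetical helper: replace the content of <b>, <i> or <pre> elements
# using a non-greedy regexp. The backreference \1 forces the closing tag
# to carry the same name as the opening tag.
def replace_text(doc, new_text)
  doc.gsub(/<(b|i|pre)\b([^>]*)>.*?<\/\1>/m) { "<#{$1}#{$2}>#{new_text}</#{$1}>" }
end

flat   = "<p><b>old</b> text</p>"
nested = "<p><b>outer <b>inner</b> tail</b></p>"

puts replace_text(flat, "NEW")    # prints <p><b>NEW</b> text</p>
puts replace_text(nested, "NEW")  # prints <p><b>NEW</b> tail</b></p>
```

In the nested case the non-greedy `.*?` stops at the first `</b>`, so the outer element is cut short and a stray closing tag is left behind. A real parser avoids this because it tracks element depth.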
 
Ara.T.Howard

> > i'm trying to use the htree library for a relatively simple task in the
> > hope that its parser will outlive one i roll on my own - it certainly seems
> > robust as it is... in any case i'm trying to do the following
> >
> > - scan a doc for any elements that have a certain attribute
> > - iff the element with that attribute is of a certain class (b, i, pre, etc)
> >   then i'll want to replace the text of that element in-place
> > - finally, a new doc will be produced
>
> Not answering your question: why not use regexp?
>
> mydoc.gsub(
> /(myattrname\s*=\s*(?:'#{value}'|"#{value}")[^>]+>[^<]*<(b|i|pre)[^>]*>).*?<\/\\2>/,
> "\\1#{new_text}"
> )

i already have a working version. i just wanted to learn htree. it's very
robust and context sensitive.

> (untested)
>
> Notable flaw - will fail if the same element is nested within itself.

a big flaw for parsing hierarchical text no? ;-) my solution doesn't handle
nesting either - though it could quite easily : it's a mini/dumb parser.

cheers.

-a
 
Gavin Kistner

> a big flaw for parsing hierarchical text no? ;-)

Certainly, if your input is likely to have such. It would be illegal
for a pre to appear inside a pre tag, and simply unlikely (in my
world) to find nested <b> or <i> tags.

I'm a huge fan of clean structured markup. However (precisely because
it is so clean) I often find my screen-scraping activities can be
achieved more quickly by using a simple "look everywhere" regexp on a
structured document, compared to loading the document into REXML
(I've not used/seen htree) and using DOM/XPath to find the
appropriate node. There are gobs of situations where the regexp approach
isn't quicker, but it has been for a lot of simple cases.
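
For comparison, the DOM/XPath route mentioned above might look roughly like this with REXML. This is only a sketch under assumptions: the input must be well-formed XML, and the attribute name `mark`, the helper `replace_marked_text`, and the sample markup are invented for illustration:

```ruby
require 'rexml/document'

# Tag names whose text should be replaced (taken from the thread's example).
TARGET_TAGS = %w[b i pre].freeze

# Hypothetical helper: for every element carrying the attribute `attr_name`,
# replace its first text node with `new_text`, but only if the element is
# one of TARGET_TAGS. Nested child elements are left alone.
def replace_marked_text(xml, attr_name, new_text)
  doc = REXML::Document.new(xml)
  REXML::XPath.each(doc, "//*[@#{attr_name}]") do |el|
    next unless TARGET_TAGS.include?(el.name)
    el.text = new_text
  end
  doc.to_s
end

xml = "<p><i mark='x'>hello</i> there, <u mark='x'>how</u> are you?</p>"
puts replace_marked_text(xml, "mark", "HI")
```

Here the `<i>` element gets new text while the `<u>` element, which has the attribute but is not in the tag list, is left untouched. It is more code than a one-line regexp, which is the trade-off being described.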

Anyhow - glad you have a current working solution, and good luck with
htree.
 
Bob Hutchison

Hi,

Thanks for the pointer to HTree.

> on a related note, does this seem right to people:
>
> irb(main):019:0> HTree("<hr/>").display_html ''
> => "<hr\n>"

HTML doesn't have an element <hr/>; it has <hr>, written without a closing
tag, like the old <p> tag.

> notice it ate the closing slash. i know that's bad form, but i wonder if this
> is intended or not...
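
If XHTML-style output is wanted anyway, one possible workaround is to post-process the serialized string. The following is a rough sketch, not part of htree; the helper name `xhtml_void_tags` is made up, and the tag list is only a subset of the HTML elements that take no closing tag:

```ruby
# A subset of the HTML elements that have no content and no closing tag.
VOID_TAGS = %w[area base br col hr img input link meta param].freeze

# Hypothetical helper: rewrite tags such as "<hr>" or "<hr\n>" as "<hr />",
# keeping any attributes. Tags already ending in "/>" are normalized too.
def xhtml_void_tags(html)
  html.gsub(/<(#{VOID_TAGS.join('|')})\b([^>]*?)\s*\/?>/) { "<#{$1}#{$2} />" }
end

puts xhtml_void_tags("<hr\n>")
puts xhtml_void_tags("<p>a<br>b</p>")
```

Being regexp-based, this shares the usual caveats (for example, it does not know about comments or CDATA), so treat it as a convenience rather than a robust serializer.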

Give this code a shot, I think this is what you want assuming I read
you correctly. Had to dump yaml a few times to figure out what was
going on :)

Cheers,
Bob

require 'htree'

# Rebuild an element: if it has a "replace" attribute, rename the element to
# that attribute's value and drop the attribute. Children are copied, with
# child elements handled recursively.
def handle_element(elem)
  new_element_name = elem.element_name().to_s
  new_attributes = Hash.new().merge(elem.attributes)
  elem.attributes.each{ | name, value |
    if "replace" == name.to_s then
      new_element_name = value.to_s
      new_attributes.delete(name)
      break
    end
  }

  if (elem.kind_of?(HTree::Text))
    new_elem = elem.to_s
  elsif (elem.kind_of?(HTree::Elem))
    children = []
    elem.children.each(){ | child |
      if (child.kind_of?(HTree::Text)) then
        children << child.to_s
      elsif (child.kind_of?(HTree::Elem))
        children << handle_element(child)
      end
    }
    new_elem = HTree::Elem.new(new_element_name, new_attributes, children)
  end
  return new_elem
end

# Build a new document from the rewritten top-level elements.
def handle_root(root)
  children = []
  root.children.each(){ | elem |
    children << handle_element(elem) if (elem.kind_of?(HTree::Elem))
  }
  HTree::Doc.new(children)
end

html = %Q{
<p>
<i replace='em'>hello</i> there, <b replace='strong'>how</b> are you?
</p>
}

tree = HTree(html)
new_tree = handle_root(tree)

# Print the original tree, then the rewritten one.
puts tree.display_html("")
puts new_tree.display_html("")
__END__
 
