REXML feature request: XPath.match.text & better text documentation


Dan Kohn

Sean, et al., thanks for a great piece of software in REXML. I would
appreciate it if you would consider adding text and texts methods to
XPath and Elements.

I believe the following shows why it would be useful, but please let me
know if this isn't clear enough.

require "rexml/document"
include REXML
string = <<EOF
<html>
<td class="t4"><a href="javascript:lu('OZ')">OZ</a>
0204 F Class
<a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
ICN</a> to <a
href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
LAX</a></td>
<tr>
<td class="t4"><font color="white">UNITED</font></td>
<td colspan="4" align="right">
<strong>48,164</strong></td>
</tr>
<tr>
<td class="t4"><font color="white">Star
Alliance</font></td>
<td colspan="4" align="right">
<strong>49,072</strong></td>
</tr>
</html>
EOF
# gsub (not gsub!) because gsub! returns nil when nothing is replaced
doc = Document.new string.gsub(/\s+|&nbsp;/," ")

#This works fine:
actsumarray = Array.new
XPath.each( doc,
"//td[@colspan='4']/child::*") { |cell|
actsumarray << cell.text.to_s }
puts actsumarray # 48,164 & 49,072

# But either of these would be much more convenient:
# actsumarray = XPath.match.text( doc, "//td[@colspan='4']/child::*")
# actsumarray = doc.elements.text.to_a( "//td[@colspan='4']/child::*")

# Converting to text is also pretty confusing.
# You might consider adding a method like
# remove_tag (which should be enhanced to support
# multiple tags). I suspect others would find it useful.

def remove_tag( element, tag)
  # Removes every <tag> element but keeps its text, which becomes
  # text inside the parent of the now removed tag.
  # element is a REXML element or document, not an array.
  while (node = element.elements["//#{tag}"])
    # to_s guards against tags with no text, where text returns nil
    node.replace_with( Text.new( node.text.to_s.strip))
  end
end

# These sorts of examples would be great for the documentation
# to show how much the results can vary.
cell = doc.elements["//td[@class='t4']"]
puts cell #[ugly HTML]
puts cell.text.to_s # 0204 F Class
puts cell.texts.join # 0204 F Class to
remove_tag( cell, "a")
puts cell #<td class='t4'>OZ 0204 F Class ICN to LAX</td>
puts cell.text.to_s #OZ
puts cell.texts.join #OZ 0204 F Class ICN to LAX



- dan
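
Until something like the requested method exists, the same result can be had in one call by mapping over XPath.match. The following is a minimal sketch, not part of REXML: xpath_texts is a hypothetical helper name, and the small table document is made up for illustration.

```ruby
require "rexml/document"

# Hypothetical helper (not a REXML method): returns the text of every
# element matched by an XPath expression, as an array of strings.
def xpath_texts(node, path)
  # to_s turns the nil returned for elements without text into ""
  REXML::XPath.match(node, path).map { |element| element.text.to_s }
end

# Illustrative document, reduced from the kind of table in the post.
doc = REXML::Document.new(<<XML)
<table>
<tr><td colspan="4"><strong>48,164</strong></td></tr>
<tr><td colspan="4"><strong>49,072</strong></td></tr>
</table>
XML

totals = xpath_texts(doc, "//td[@colspan='4']/child::*")
puts totals
```

Keeping this as a plain method rather than reopening XPath avoids colliding with anything REXML may add later.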
 

Gavin Kistner

doc = Document.new string.gsub!(/\s+|&nbsp;/," ")

One aside - you might like to know about:

doc = Document.new( string, :ignore_whitespace_nodes => :all )
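
A small sketch of what that option changes, using a made-up two-item document. Note that it drops text nodes that contain only whitespace; it does not collapse whitespace inside other text the way the gsub does.

```ruby
require "rexml/document"

# Illustrative input: the indentation creates whitespace-only text nodes.
xml = "<root>\n  <item>a</item>\n  <item>b</item>\n</root>"

plain   = REXML::Document.new(xml)
trimmed = REXML::Document.new(xml, :ignore_whitespace_nodes => :all)

# Without the option, the whitespace between elements shows up as
# REXML::Text children of <root>.
puts plain.root.children.map { |child| child.class }.inspect

# With the option, only the two <item> elements remain.
puts trimmed.root.children.map { |child| child.class }.inspect
```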
 

Gavin Kistner

Sean, et al, thanks for a great piece of software in REXML. I would
appreciate if you would consider adding the text and texts method to
XPath and Elements.

Does this help you?

require 'rexml/document'
include REXML

d = Document.new <<ENDXML
<root>
<foo>Raw text</foo>
<foo>Raw text2</foo>
<foo>AA <bar>Nested Text</bar>ZZ</foo>
</root>
ENDXML

p XPath.match( d, '//foo//text()' ).collect{ |textnode|
textnode.value
}
#=> ["Raw text", "Raw text2", "AA ", "Nested Text", "ZZ"]

class REXML::Element
  # Concatenates the values of all descendant text nodes
  def inner_text
    REXML::XPath.match( self, './/text()' ).collect{ |t| t.value }.join( '' )
  end
end

p XPath.match( d, '//foo' ).collect{ |foo|
foo.inner_text
}
#=> ["Raw text", "Raw text2", "AA Nested TextZZ"]
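
For the documentation side of the request, here is a short side-by-side sketch of the three ways of reading text discussed in this thread, run against the nested foo element from the example above. The variable names are only illustrative.

```ruby
require "rexml/document"

foo = REXML::Document.new("<foo>AA <bar>Nested Text</bar>ZZ</foo>").root

# Element#text: only the first text node that is a direct child.
first_text = foo.text

# Element#texts: all direct-child text nodes, skipping nested elements.
child_texts = foo.texts.map { |t| t.value }

# XPath text(): every descendant text node, in document order.
all_text = REXML::XPath.match(foo, ".//text()").map { |t| t.value }.join

p first_text   # "AA "
p child_texts  # ["AA ", "ZZ"]
p all_text     # "AA Nested TextZZ"
```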
 
