REXML feature request: XPath.match.text & better text documentation

Dan Kohn · Sep 15, 2005

Sean, et al, thanks for a great piece of software in REXML. I would
appreciate it if you would consider adding text and texts methods to
XPath and Elements.

I believe the following shows why it would be useful, but please let me
know if this isn't clear enough.

require "rexml/document"
include REXML
string = <<EOF
<html>
<td class="t4"><a href="javascript:lu('OZ')">OZ</a>
0204 F Class
<a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
ICN</a> to <a
href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
LAX</a></td>
<tr>
<td class="t4">UNITED</td>
<td colspan="4" align="right">
48,164</td>
</tr>
<tr>
<td class="t4">Star
Alliance</td>
<td colspan="4" align="right">
49,072</td>
</tr>
</html>
EOF
doc = Document.new string.gsub!(/\s+| /," ")

#This works fine:
actsumarray = Array.new
XPath.each( doc, "//td[@colspan='4']/child::*" ) { |cell|
  actsumarray << cell.text.to_s
}
puts actsumarray # 48,164 & 49,072

# But either of these would be much more convenient:
# actsumarray = XPath.match.text( doc, "//td[@colspan='4']/child::*")
# actsumarray = doc.elements.text.to_a( "//td[@colspan='4']/child::*")

# Converting to text is also pretty confusing.
# You might consider adding a method like
# remove_tag (which should be enhanced to support
# multiple tags). I suspect others would find it useful.

def remove_tag( element, tag )
  # Removes each tag found under element but leaves the text inside
  # the tag as text inside the parent of the now removed tag.
  # ".//" keeps the search below element; "//" would search the whole
  # document.
  while ( found = element.elements[".//#{tag}"] )
    # to_s guards against tags that contain no text at all
    found.replace_with( Text.new( found.text.to_s.strip ) )
  end
end

# These sorts of examples would be great for the documentation
# to show how much the results can vary.
cell = doc.elements["//td[@class='t4']"]
puts cell              # [ugly HTML]
puts cell.text.to_s    # 0204 F Class
puts cell.texts.join   # 0204 F Class to
remove_tag( cell, "a" )
puts cell              # <td class='t4'>OZ 0204 F Class ICN to LAX</td>
puts cell.text.to_s    # OZ
puts cell.texts.join   # OZ 0204 F Class ICN to LAX

- dan
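
The multi-tag enhancement mentioned in the post could look roughly like the sketch below. The name remove_tags is hypothetical and not part of REXML; the sketch also assumes each removed tag holds only plain text, since Element#text returns just the first text node.

```ruby
require "rexml/document"

# Hypothetical multi-tag variant of remove_tag (not part of REXML).
# Replaces every matching descendant of element with a text node
# holding that tag's first text node.
def remove_tags(element, *tags)
  tags.each do |tag|
    while (found = element.elements[".//#{tag}"])
      found.replace_with(REXML::Text.new(found.text.to_s))
    end
  end
  element
end

doc = REXML::Document.new("<td><a>OZ</a> 0204 <b>F</b> Class</td>")
remove_tags(doc.root, "a", "b")
flattened = doc.root.texts.join
puts flattened # OZ 0204 F Class
```

Unlike the original, this version does not strip the replaced text, so spacing inside the removed tags is preserved.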

Gavin Kistner · Sep 15, 2005

doc = Document.new string.gsub!(/\s+| /," ")

One aside - you might like to know about:

doc = Document.new( string, :ignore_whitespace_nodes => :all )
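
A small sketch of what that option changes, using a made-up three-line document rather than the HTML from the original post: by default the whitespace between elements is kept as text nodes, and with the option those whitespace-only nodes are dropped at parse time.

```ruby
require "rexml/document"

# Illustrative input only; not the document from the thread.
xml = "<root>\n  <a>1</a>\n  <b>2</b>\n</root>"

# Default parse: whitespace between elements becomes text nodes.
plain = REXML::Document.new(xml)
# With the option, whitespace-only text nodes are skipped.
compact = REXML::Document.new(xml, :ignore_whitespace_nodes => :all)

plain_count   = plain.root.children.size
compact_count = compact.root.children.size
puts "default: #{plain_count} child nodes"
puts "ignore_whitespace_nodes: #{compact_count} child nodes"
```

Note that the option only drops nodes that are entirely whitespace; it does not collapse runs of whitespace inside real text, which is what the gsub in the original post does.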

Gavin Kistner · Sep 15, 2005

Sean, et al, thanks for a great piece of software in REXML. I would
appreciate if you would consider adding the text and texts method to
XPath and Elements.

Does this help you?

require 'rexml/document'
include REXML

d = Document.new <<ENDXML
<root>
<foo>Raw text</foo>
<foo>Raw text2</foo>
<foo>AA <bar>Nested Text</bar>ZZ</foo>
</root>
ENDXML

p XPath.match( d, '//foo//text()' ).collect{ |textnode|
  textnode.value
}
#=> ["Raw text", "Raw text2", "AA ", "Nested Text", "ZZ"]

class REXML::Element
  # Joins the value of every descendant text node, in document order
  def inner_text
    REXML::XPath.match( self, './/text()' ).collect{ |t| t.value }.join( '' )
  end
end

p XPath.match( d, '//foo' ).collect{ |foo|
  foo.inner_text
}
#=> ["Raw text", "Raw text2", "AA Nested TextZZ"]
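
The one-call form requested at the top of the thread can be approximated without patching REXML. The following is only a sketch: match_texts and deep_text are hypothetical helper names, not REXML methods, and the text is gathered by walking child nodes rather than by a second XPath query.

```ruby
require "rexml/document"

# Hypothetical helper: concatenates all text below a node by walking
# its children depth-first, so the result follows document order.
def deep_text(node)
  case node
  when REXML::Text   then node.value
  when REXML::Parent then node.children.map { |child| deep_text(child) }.join
  else ""            # comments, processing instructions, etc.
  end
end

# Hypothetical helper: one string per node matched by path.
def match_texts(node, path)
  REXML::XPath.match(node, path).map { |match| deep_text(match) }
end

doc = REXML::Document.new(<<XML)
<root>
<foo>Raw text</foo>
<foo>AA <bar>Nested Text</bar>ZZ</foo>
</root>
XML

texts = match_texts(doc, "//foo")
p texts #=> ["Raw text", "AA Nested TextZZ"]
```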
