Pulling text from elements with REXML

Paul Willis · Mar 19, 2007

Hi

I am using REXML to pull text from a NewsML document.

require 'rexml/document'
include REXML
file = File.new("Main_News.xml")
doc = Document.new(file)
root = doc.root
puts root.elements["NewsItem/NewsComponent/NewsComponent[1]/NewsComponent/ContentItem/DataContent/nitf/body/body.head/hedline/hl1"]

Gives me...

<hl1>Blueprint to cut emissions unveiled</hl1>

Is there an easy way (i.e. something in REXML) to pull just the text
without the containers <hl1> and </hl1>?

Paul

Peter Szinek · Mar 19, 2007

Paul said:
Hi

I am using REXML to pull text from a NewsML document.

require 'rexml/document'
include REXML
file = File.new("Main_News.xml")
doc = Document.new(file)
root = doc.root
puts root.elements["NewsItem/NewsComponent/NewsComponent[1]/NewsComponent/ContentItem/DataContent/nitf/body/body.head/hedline/hl1"]

Gives me...

<hl1>Blueprint to cut emissions unveiled</hl1>

Is there an easy way (ie something in REXML) to pull just the text
without the containers <hl1> and </hl1>.

If I understood correctly, you need the text content of the node rather
than the whole node. This can be accomplished with:

some_element.text

so you could do something like

root.elements[...your stuff_here...].to_a.each {|e| puts e.text}

HTH,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby
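
For anyone reading later, here is a minimal sketch of the difference between printing the element and printing its text. The short XML string and the helper name headline_text are made up for illustration; they are not taken from the NewsML file.

```ruby
require 'rexml/document'

# Returns the text content of the first element matching `path`
# (relative to the document root), or nil if nothing matches.
def headline_text(xml, path)
  element = REXML::Document.new(xml).root.elements[path]
  element && element.text
end

# Hypothetical, much shorter stand-in for the NewsML document.
xml = '<nitf><body><hl1>Blueprint to cut emissions unveiled</hl1></body></nitf>'

puts REXML::Document.new(xml).root.elements['body/hl1'] # the element, tags included
puts headline_text(xml, 'body/hl1')                     # only the text
```

The nil check is there because elements[] returns nil when the path matches nothing, and calling .text on nil would raise.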

Paul Willis · Mar 19, 2007

Peter said:
If I understood correctly, you need the text content of the node rather
than the whole node. This can be accomplished with:

some_element.text

You did understand correctly, .text on the end was all I needed.

Cheers

Paul

Phrogz · Mar 19, 2007

root.elements["NewsItem/NewsComponent/NewsComponent[1]/NewsComponent/ContentItem/DataContent/nitf/body/body.head/hedline/hl1"]

Gives me...

<hl1>Blueprint to cut emissions unveiled</hl1>

Is there an easy way (ie something in REXML) to pull just the text
without the containers <hl1> and </hl1>.

require 'rexml/document'
doc = REXML::Document.new("<root><kid>hello world</kid></root>")
p REXML::XPath.first( doc, '/root/kid/text()' )
#=> "hello world"

Phrogz · Mar 19, 2007

root.elements["NewsItem/NewsComponent/NewsComponent[1]/NewsComponent/ContentItem/DataContent/nitf/body/body.head/hedline/hl1"]

Gives me...

<hl1>Blueprint to cut emissions unveiled</hl1>

Is there an easy way (ie something in REXML) to pull just the text
without the containers <hl1> and </hl1>.

require 'rexml/document'
doc = REXML::Document.new("<root><kid>hello world</kid></root>")
p REXML::XPath.first( doc, '/root/kid/text()' )
#=> "hello world"

Also, depending on your needs:

include REXML
doc = Document.new("<root><kid>hello</kid><kid>world</kid></root>")
p XPath.match( doc, '/root/kid/text()' )
#=> ["hello", "world"]
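
The two calls above can be wrapped up roughly like this. It is only a sketch: the helper names first_text and all_texts are invented here, and the results are converted with to_s so the caller gets plain Strings.

```ruby
require 'rexml/document'

# First node matching `path`, as a String, or nil when nothing matches.
def first_text(xml, path)
  node = REXML::XPath.first(REXML::Document.new(xml), path)
  node && node.to_s
end

# Every node matching `path`, as an Array of Strings (empty when nothing matches).
def all_texts(xml, path)
  REXML::XPath.match(REXML::Document.new(xml), path).map { |node| node.to_s }
end

xml = '<root><kid>hello</kid><kid>world</kid></root>'
p first_text(xml, '/root/kid/text()')
p all_texts(xml, '/root/kid/text()')
```

XPath.first is handy when exactly one match is expected; XPath.match always returns an Array, which avoids a nil check when looping.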

Keith Fahlgren · Mar 19, 2007

Hey,

Two notes:
1. I always suggest the REXML::XPath methods over the others for
people who grok XPath.
2. A REXML::XPath.* ... text() match will return a REXML::Text node,
which may _not_ be what you want:

$ irb --simple-prompt foo.rb
=> REXML::Text

Just something to be aware of (use .to_s if you want a string, as usual).

HTH,
Keith
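
Keith's point can be checked with a short script. This is a sketch with a made-up sample document; the helper name first_text_node is hypothetical. It also shows that to_s keeps entities escaped while value unescapes them.

```ruby
require 'rexml/document'

# Returns whatever node the XPath selects first (here a REXML::Text), or nil.
def first_text_node(xml, path)
  REXML::XPath.first(REXML::Document.new(xml), path)
end

node = first_text_node('<root><kid>fish &amp; chips</kid></root>', '/root/kid/text()')

puts node.class  # the node is not a String
puts node.to_s   # escaped form, as it appears in the XML source
puts node.value  # unescaped form, usually what a program wants
```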

Paul Willis · Mar 22, 2007

require 'rexml/document'

doc = REXML::Document.new("<root><kid>hello world</kid></root>")
p REXML::XPath.first( doc, '/root/kid/text()' )
#=> "hello world"

Thanks for that, I'm now using REXML::XPath with a combination of .first
and .match to pull the element text out.

One more thing, given an XML document...

<root><kid stuff="some-other-text">hello world</kid></root>

What would be the path to the attribute 'stuff', so that it returns
'some-other-text'?

Paul

Phrogz · Mar 22, 2007

One more thing, given an XML document...

<root><kid stuff="some-other-text">hello world</kid></root>

What would be the path to the attribute 'stuff' and return
'some-other-text'?

require 'rexml/document'
include REXML
doc = Document.new( <<ENDDOC )
<root>
<kid stuff="some-other-text">hello world</kid>
<kid class="best" stuff="gibbles">hello world</kid>
</root>
ENDDOC

att = XPath.first( doc, '//kid/@stuff' )
p att, att.class, att.value
#=> stuff='some-other-text'
#=> REXML::Attribute
#=> "some-other-text"

p XPath.first( doc, '//kid[@class="best"]/@stuff' ).value
#=> "gibbles"

I don't know what the XPath syntax is to select the value of an
attribute directly. I'd be interested to know if someone else knows it.
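
This does not answer the pure-XPath question, but one way to get a plain String is to select the element and read the attribute through Element#attributes. A sketch, with the helper name attribute_value invented for illustration:

```ruby
require 'rexml/document'

# Selects the first element matching `path` and returns the value of its
# attribute `name` as a String, or nil if the element or attribute is missing.
def attribute_value(xml, path, name)
  element = REXML::XPath.first(REXML::Document.new(xml), path)
  element && element.attributes[name]
end

xml = <<ENDDOC
<root>
<kid stuff="some-other-text">hello world</kid>
<kid class="best" stuff="gibbles">hello world</kid>
</root>
ENDDOC

p attribute_value(xml, '//kid', 'stuff')
p attribute_value(xml, '//kid[@class="best"]', 'stuff')
```

Here the XPath stops at the element, so the result of attributes[] is already a String rather than a REXML::Attribute.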

Paul Willis · Mar 22, 2007

Gavin said:
att = XPath.first( doc, '//kid/@stuff' )

I don't know what the XPath syntax is to select the value of an
attribute directly. I'd be interested to know if someone else knows it.

Cheers, it was the kid/@stuff I needed...

puts XPath.first( doc, '/root/kid/@stuff' )

#=> some-other-text

Paul

Phrogz · Mar 22, 2007

Cheers, it was the kid/@stuff I needed...

puts XPath.first( doc, '/root/kid/@stuff' )

#=> some-other-text

Nice, I didn't realize that REXML::Attribute had such different output
for #inspect versus #to_s. It's nice, then, that you don't need to
call .value in this particular case. Just be aware that without
the .value call you still have an Attribute instance that can just be
treated as a string in some areas:

att = XPath.first( doc, '//kid/@stuff' )

puts att
#=> some-other-text

puts att.value + '-more'
#=> some-other-text-more

puts att + "-more"
#=> tmp.rb:17: undefined method `+' for stuff='some-other-text':REXML::Attribute (NoMethodError)
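
A small sketch of the safe pattern: convert the Attribute to a String before doing String operations on it. The helper name attribute_with_suffix is made up, and it assumes the XPath selects an attribute node.

```ruby
require 'rexml/document'

# Looks up an attribute node by XPath and appends `suffix` to its value.
# Attribute#value returns a String, so String#+ works on it.
# Returns nil when the path matches nothing.
def attribute_with_suffix(xml, path, suffix)
  att = REXML::XPath.first(REXML::Document.new(xml), path)
  return nil if att.nil?
  att.value + suffix
end

xml = '<root><kid stuff="some-other-text">hello world</kid></root>'
puts attribute_with_suffix(xml, '//kid/@stuff', '-more')
```

String interpolation such as "#{att}-more" should also work, since interpolation calls to_s on the Attribute.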
