REXML XPath not iterating in Source Order


gavin

I'm using REXML with XPath queries to parse a (horrific, ugly, nasty)
XHTML document. Unfortunately, in many cases the nodes are not visited
in the order in which they appear in the document, which scrambles
paragraphs (and worse).


For example, following is a piece of Ruby code, the snippet of the
XHTML it's operating on, and the debug output. (I've trimmed the debug
output slightly to be clear.) Note that the lines are not in the
correct order:

RUBY:
puts "Looking inside #{tds[1].to_s.inspect}"
tds[1].elements.each( ".//*[@class='pCellBody']" ){ |line|
  puts "Found line #{line.to_s.inspect}"
}


XHTML:
<td><a name="wp1140867"> </a><div class="pCellBody">
Specifies whether or not the event will bubble up the object model
hierarchy.
</div>
<a name="wp1140875"> </a><div class="pCellBody">
0 = Bubbling enabled. This is the default setting.
</div>
<a name="wp1140876"> </a><div class="pCellBody">
1 = Bubbling disabled.
</div>
<a name="wp1140871"> </a><div class="pCellBody">
For more information about event bubbling, see <a
href="Anark_Studio_Help-14-25.html">Working with mouse events and event
bubbling</a>.
</div>
</td>

OUTPUT:
Looking inside "<td><a name='wp1140867'> </a><div
class='pCellBody'>\nSpecifies whether ...</td>"
Found line "<div class='pCellBody'>\nSpecifies whether ... \n</div>"
Found line "<div class='pCellBody'>\nFor more information ...</div>"
Found line "<div class='pCellBody'>\n0 = Bubbling enabled. ...
\n</div>"
Found line "<div class='pCellBody'>\n1 = Bubbling disabled. \n</div>"



Unfortunately, I can't reproduce this problem in a simple test case.
For example, the following code behaves as expected:

require 'rexml/document'
doc = REXML::Document.new( <<-ENDXML
<root>
<p class="foo">Line 1</p>
<p class="bar">Line 2</p>
<p class="foo"><b>Line 3</b></p>
<p class="bar">Line <b>4</b></p>
<div class="subsection">
<p class="bar"><em><b>Line 5</b></em></p>
<p class="foo"><em>Line <b>6</b></em></p>
<div class="bar">Line 7</div>
</div>
</root>
ENDXML
)

doc.elements.each( ".//*[@class='foo' or @class='bar']" ) do |el|
  puts el.to_s
end

doc.elements.each( "//b" ) do |el|
  puts el.to_s
end

div = REXML::XPath.first( doc, "//*[@class='subsection']" )
div.elements.each( './*[@class="foo" or @class="bar"]' ) do |el|
  puts el.to_s
end


OUTPUT:
<p class='foo'>Line 1</p>
<p class='bar'>Line 2</p>
<p class='foo'><b>Line 3</b></p>
<p class='bar'>Line <b>4</b></p>
<p class='bar'><em><b>Line 5</b></em></p>
<p class='foo'><em>Line <b>6</b></em></p>
<div class='bar'>Line 7</div>
<b>Line 3</b>
<b>4</b>
<b>Line 5</b>
<b>6</b>
<p class='bar'><em><b>Line 5</b></em></p>
<p class='foo'><em>Line <b>6</b></em></p>
<div class='bar'>Line 7</div>


Has anyone experienced this not-in-order behavior before? Any ideas how
to prevent it?
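
One possible workaround, offered as a sketch rather than a confirmed fix: skip XPath for this one step and walk the child nodes by hand, since REXML keeps each parent's children in an array in source order. The helper name elements_with_class is hypothetical (it is not part of REXML), and the sample document is a shortened copy of the XHTML above.

```ruby
require 'rexml/document'

# Hypothetical helper, not part of REXML: collects every descendant
# element whose class attribute equals class_name, using a depth-first,
# pre-order walk so results come back in document order.
def elements_with_class(node, class_name, found = [])
  node.each_child do |child|
    # Skip text nodes, comments, and anything else that is not an element.
    next unless child.is_a?(REXML::Element)
    found << child if child.attributes['class'] == class_name
    elements_with_class(child, class_name, found)
  end
  found
end

# Shortened stand-in for the real table cell.
doc = REXML::Document.new(<<-ENDXML)
<td><a name="wp1140867"> </a><div class="pCellBody">
Specifies whether or not the event will bubble up.
</div>
<a name="wp1140875"> </a><div class="pCellBody">
0 = Bubbling enabled. This is the default setting.
</div>
<a name="wp1140876"> </a><div class="pCellBody">
1 = Bubbling disabled.
</div>
<a name="wp1140871"> </a><div class="pCellBody">
For more information, see <a href="Anark_Studio_Help-14-25.html">event
bubbling</a>.
</div>
</td>
ENDXML

td = doc.root
lines = elements_with_class(td, 'pCellBody')
lines.each do |line|
  puts "Found line #{line.to_s.inspect}"
end
```

In the original code, tds[1] would take the place of td. This only handles an exact match on one attribute, so it is no substitute for XPath in general, but it avoids depending on the order in which the XPath engine returns nodes.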
 
