REXML XPath not iterating in Source Order

gavin · Feb 9, 2005

I'm using REXML with XPath queries to parse a (horrific, ugly, nasty)
XHTML document. Unfortunately, in many cases the nodes are not
traversed in the order they appear in the document, which shuffles
paragraphs (and worse).

For example, following is a piece of Ruby code, the snippet of the
XHTML it's operating on, and the debug output. (I've trimmed the debug
output slightly to be clear.) Note that the lines are not in the
correct order:

RUBY:
puts "Looking inside #{tds[1].to_s.inspect}"
tds[1].elements.each( ".//*[@class='pCellBody']" ){ |line|
  puts "Found line #{line.to_s.inspect}"
}

XHTML:
<td><a name="wp1140867"> </a><div class="pCellBody">
Specifies whether or not the event will bubble up the object model
hierarchy.
</div>
<a name="wp1140875"> </a><div class="pCellBody">
0 = Bubbling enabled. This is the default setting.
</div>
<a name="wp1140876"> </a><div class="pCellBody">
1 = Bubbling disabled.
</div>
<a name="wp1140871"> </a><div class="pCellBody">
For more information about event bubbling, see <a
href="Anark_Studio_Help-14-25.html">Working with mouse events and event
bubbling</a>.
</div>
</td>

OUTPUT:
Looking inside "<td><a name='wp1140867'> </a><div
class='pCellBody'>\nSpecifies whether ...</td>"
Found line "<div class='pCellBody'>\nSpecifies whether ... \n</div>"
Found line "<div class='pCellBody'>\nFor more information ...</div>"
Found line "<div class='pCellBody'>\n0 = Bubbling enabled. ...
\n</div>"
Found line "<div class='pCellBody'>\n1 = Bubbling disabled. \n</div>"

Unfortunately, I can't reproduce this problem in a simple test case.
For example, the following code behaves as expected:

require 'rexml/document'
doc = REXML::Document.new( <<-ENDXML
<root>
<p class="foo">Line 1</p>
<p class="bar">Line 2</p>
<p class="foo"><b>Line 3</b></p>
<p class="bar">Line <b>4</b></p>
<div class="subsection">
<p class="bar"><em><b>Line 5</b></em></p>
<p class="foo"><em>Line <b>6</b></em></p>
<div class="bar">Line 7</div>
</div>
</root>
ENDXML
)

doc.elements.each( ".//*[@class='foo' or @class='bar']" ) do |el|
  puts el.to_s
end

doc.elements.each( "//b" ) do |el|
  puts el.to_s
end

div = REXML::XPath.first( doc, "//*[@class='subsection']" )
div.elements.each( './*[@class="foo" or @class="bar"]' ) do |el|
  puts el.to_s
end

OUTPUT:
<p class='foo'>Line 1</p>
<p class='bar'>Line 2</p>
<p class='foo'><b>Line 3</b></p>
<p class='bar'>Line <b>4</b></p>
<p class='bar'><em><b>Line 5</b></em></p>
<p class='foo'><em>Line <b>6</b></em></p>
<div class='bar'>Line 7</div>
<b>Line 3</b>
<b>4</b>
<b>Line 5</b>
<b>6</b>
<p class='bar'><em><b>Line 5</b></em></p>
<p class='foo'><em>Line <b>6</b></em></p>
<div class='bar'>Line 7</div>

Has anyone experienced this not-in-order behavior before? Any ideas how
to prevent it?
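
One possible workaround, sketched under an assumption I haven't confirmed: that the parsed tree itself is in source order and only the ordering of the XPath result set is off. In that case the XPath step can be skipped and the children walked directly, filtering on the class attribute by hand. The helper name descendants_in_source_order is made up for illustration, and the XHTML below is a cut-down stand-in for the real table cell.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): depth-first walk that yields
# every descendant element in the order it appears in the source.
def descendants_in_source_order(parent, &block)
  parent.children.each do |node|
    # Skip text, comments, etc.; only elements can carry a class attribute.
    next unless node.is_a?(REXML::Element)
    block.call(node)
    descendants_in_source_order(node, &block)
  end
end

# Simplified stand-in for the real cell, same shape as the snippet above.
xhtml = <<-ENDXML
<td><a name="wp1"> </a><div class="pCellBody">first</div>
<a name="wp2"> </a><div class="pCellBody">second</div>
<a name="wp3"> </a><div class="pCellBody">third <a href="x.html">link</a></div>
</td>
ENDXML

td = REXML::Document.new(xhtml).root

# Collect matches without going through XPath at all.
lines = []
descendants_in_source_order(td) do |el|
  lines << el if el.attributes['class'] == 'pCellBody'
end

lines.each { |line| puts "Found line #{line.to_s.inspect}" }
```

This loses the convenience of XPath predicates, so it only fits simple attribute tests like the one here. It also doesn't explain why the XPath version misorders nodes in the first place.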
