Parse and modify an XML file with REXML

jeffnyman

Greetings all.

When processing XML, is there a way to check what the previous and what
the next "rows" are?

That probably makes no sense without context, so here is an example. I
need to find things in the XML based on rules. For example, one rule
might be "find the first 203 that comes after 202." Another rule is
"Find the first 203 that comes before 16." So say I have this:

<variable value="202">
<variable value="203">
<variable value="203">
<variable value="203">
<variable value="203">
<variable value="203">
<variable value="16">

I have to be able to find that the element after 202 is 203. (As
opposed to a situation where a 202 appeared, but the next element was
not 203.) I then have to determine that the element after a given 203
is 16. Then I have to change the value attribute of the first and last
203 elements. So the XML, after applying the rules, would look like
this:

<variable value="202">
<variable value="203First">
<variable value="203">
<variable value="203">
<variable value="203">
<variable value="203Last">
<variable value="16">

The 202 and 16 are essentially bracketers of data, in this case. There
can be many such groups in the XML that look like this.

I know how to parse through XML using XPath or using a stream listener.
I have read the tutorial that comes with REXML. But what I'm not sure
how to do is check for the conditions like I described above. One
thought was I could read the XML into an array because then I get an
enforced "line numbering" with the indexing. So I could check
currentLine - 1 and currentLine + 1. I'm not sure if that is a smart
approach, however.
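
For what it's worth, the array idea can be sketched in a few lines. This is only an illustration under some assumptions: the <data> root element and the self-closing tags are added so the sample parses, and the "First"/"Last" suffixes follow the example above. The index into the array of elements plays the role of the "line number".

```ruby
require 'rexml/document'

# Hypothetical sample modelled on the example above. The <data> wrapper
# and the self-closing tags are assumptions so the snippet parses.
xml = <<~XML
  <data>
    <variable value="202"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="16"/>
  </data>
XML

doc = REXML::Document.new(xml)

# Child elements come back in document order, so the array index
# behaves like an enforced "line number".
rows = doc.root.elements.to_a('variable')

rows.each_with_index do |row, i|
  next unless row.attributes['value'] == '203'

  prev_row = i.zero? ? nil : rows[i - 1]  # avoid wrapping around to rows[-1]
  next_row = rows[i + 1]                  # nil past the end of the array

  if prev_row && prev_row.attributes['value'] == '202'
    row.attributes['value'] = '203First'
  elsif next_row && next_row.attributes['value'] == '16'
    row.attributes['value'] = '203Last'
  end
end

values = rows.map { |row| row.attributes['value'] }
puts values.inspect
# → ["202", "203First", "203", "203Last", "16"]
```

Note that in this sketch a lone 203 sitting directly between a 202 and a 16 is only marked as First; whether that is the right rule depends on the data.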

Has anyone done something similar in their work?

- Jeff
 
Peter Szinek

Jeff Nyman said:
Greetings all.

When processing XML, is there a way to check what the previous and what
the next "rows" are?

I don't know REXML that much (and using Hpricot anyway ;-) but standard
XPath axes ( following-sibling, preceding-sibling ) won't help? The
previous node in this case would be self::previous-sibling[1] etc.
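
A minimal sketch of the sibling axes in REXML, for illustration only. XPath 1.0 names these axes preceding-sibling and following-sibling and separates the axis from the node test with a double colon, so the nearest previous element is written preceding-sibling::*[1]. The <data> wrapper and the sample values are assumptions.

```ruby
require 'rexml/document'

# Hypothetical sample; the <data> wrapper is an assumption.
doc = REXML::Document.new(<<~XML)
  <data>
    <variable value="202"/>
    <variable value="203"/>
    <variable value="16"/>
  </data>
XML

middle = REXML::XPath.first(doc, '//variable[@value="203"]')

# axis::node-test[1] selects the nearest element along that axis.
before = REXML::XPath.first(middle, 'preceding-sibling::*[1]')
after  = REXML::XPath.first(middle, 'following-sibling::*[1]')

puts before.attributes['value']  # → 202
puts after.attributes['value']   # → 16

# REXML elements also expose the same neighbours without XPath.
dom_before = middle.previous_element
dom_after  = middle.next_element
```

The previous_element and next_element calls skip the whitespace text nodes between tags, which plain previous_sibling and next_sibling would return.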

HTH,

Peter
http://www.rubyrailways.com
 
Pete

Jeff Nyman said:
Greetings all.

When processing XML, is there a way to check what the previous and what
the next "rows" are?

That probably makes no sense without context, so here is an example. I
need to find things in the XML based on rules. For example, one rule
might be "find the first 203 that comes after 202." Another rule is
"Find the first 203 that comes before 16." So say I have this:

<variable value="202">
<variable value="203">
<variable value="203">
<variable value="203">
<variable value="203">
<variable value="203">
<variable value="16">

I have to be able to find that the element after 202 is 203. (As
opposed to a situation where a 202 appeared, but the next element was
not 203.) I then have to determine that the element after a given 203
is 16. Then I have to change the value attribute of the first and last
203 elements. [.....]

Has anyone done something similar in their work?

- Jeff

I've just been playing with a project that looks like it might have
some similarities. I acquired an app that creates an XML representation
of a midifile, and I wanted to add useful info to the XML to help the
human reader (and maybe allow other postprocessing). In particular,
a 'note' in a midifile is begun with a NoteOn event, and ends sometime
later when a corresponding NoteOff appears. I wanted to add an attribute
to each NoteOn element that gave its actual duration. Other elements
that had added attributes could (otherwise) be output again immediately,
but the NoteOns would have to be held until the NoteOff was read, and
as order is important that meant other events might have to wait, too.

(Of course I'm using stream parsing here. The XML-ized midifile can
get pretty long, and I don't like the idea of keeping an entire DOM
tree around. I'm kind of more at home with streams, anyway.)

Essentially I make a list of the elements waiting to be output. Each
object in the list has a 'complete' flag that is set immediately for
most tags, except for NoteOn, which is set complete when the NoteOff
arrives and the duration can be calculated. When the first element
in the list becomes complete, all finished items at the head of the list
are output.

To keep track of the reading end of things I have Element Handler
objects that can maintain knowledge of the current state (which in the
case of NoteOn/Offs means a fairly large array of references, but for
your purposes would just be the value of the previous 'variable').
I actually wrote an extension to REXML for this that I think is quite
useful, and will publish -- soon, I hope. I don't think that would
be needed for your job, though; a simple 'tag_start' handler (from
REXML::StreamListener) that recognized tag 'variable' should be
adequate.

You'd then just have to note, when you got a '203' whether the
previous was '202' and modify it if so. If not, you'd hold on to
it until the next 'variable'; if that was '16', you'd modify it
and output it, otherwise you'd just output it. You wouldn't even
need a list if there were never any intervening elements.
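
Roughly, the hold-one-element idea might look like the sketch below. It is not the extension mentioned above, just a plain REXML::StreamListener under some assumptions: the class name BracketListener, the finish method, the <data> wrapper and the "First"/"Last" suffixes are all made up for illustration, and the listener collects values in an array instead of writing XML back out.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: holds back one plain 203 until the next
# <variable> shows what follows it.
class BracketListener
  include REXML::StreamListener

  attr_reader :output

  def initialize
    @output = []     # values in the order they would be written out
    @previous = nil  # original value of the previous <variable>
    @held = nil      # a 203 still waiting to learn what comes next
  end

  # attrs arrives as a Hash of attribute name => value.
  def tag_start(name, attrs)
    return unless name == 'variable'

    value = attrs['value']
    # Whatever follows a held 203 decides how that 203 is written out.
    flush(value == '16' ? '203Last' : '203') if @held

    if value == '203' && @previous == '202'
      @output << '203First'
    elsif value == '203'
      @held = value
    else
      @output << value
    end
    @previous = value
  end

  # Call after parsing so a trailing 203 is not lost.
  def finish
    flush('203') if @held
  end

  private

  def flush(value)
    @output << value
    @held = nil
  end
end

xml = <<~XML
  <data>
    <variable value="202"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="16"/>
  </data>
XML

listener = BracketListener.new
REXML::Document.parse_stream(xml, listener)
listener.finish
puts listener.output.inspect
# → ["202", "203First", "203", "203Last", "16"]
```

Because only one element is ever held, no list is needed, which matches the case where no other elements sit between the variables.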

Oof! Sorry, that got rather long-winded, and I don't know if it made
any sense, but I hope it's useful.

-- Pete --
 
Jeff Nyman

Peter Szinek said:
Greetings all.

When processing XML, is there a way to check what the previous and what
the next "rows" are?

I don't know REXML that much (and using Hpricot anyway ;-) but standard
XPath axes ( following-sibling, preceding-sibling ) won't help? The
previous node in this case would be self::previous-sibling[1] etc.

Thanks for the suggestion. This sounds like it might work. I did not see
this in the REXML documentation initially but I see generally how these work
in concept. In practice, it does not seem to work for me.

I have my XML like this (greatly pared down):

<perflog>
<module>
<perfpoints>
<variable name="202G_OrdAdd">
<variable name="203G_OrdUpdate">
....
</perfpoints>
</module>
</perflog>

I tried this:

<code>
require 'rexml/document'
include REXML

xml = Document.new(File.read("test.xml"))

events = XPath.match(xml,
  '/perflog/module/perfpoints/variable[@name="203G_OrdUpdate"]')

events.each do |event|
  # Attempt at "the previous sibling is 202G_OrdAdd"; this prints nothing.
  puts XPath.match(event,
    '[self:preceding-sibling[1](@name, "202G_OrdAdd")]')
end
</code>

In the events iterator, I also tried the following variation:

puts XPath.match(event, 'self:preceding-sibling[1](@name, "202G_OrdAdd")')

I also tried replacing the 'self' with the full node path (i.e.,
"//perflog/module/perfpoints/variable").

I should note I don't get an error when I run the above. I simply get
nothing, so my guess is that I'm using preceding-sibling wrong. I'm guessing
it never feels it found the condition I'm indicating it should be finding.

I did find that I can do this:

puts XPath.match(event, '[self:preceding-sibling::variable[1](@name,
"202G_OrdAdd")]')

(Note the "::variable[1]" addition.) Some documentation I found suggests
that this should count backwards and reference the closest preceding
variable sibling. That does seem to work -- to an extent, but I get
everything returned. Meaning I get this in my results:

<variable name = "203G_OrdUpdate">
<variable name = "202G_OrdAdd">

.... but then I get all the other 203's in my XML listed as well. What I'm
trying to do is just return the one 203 that has a preceding sibling that
has the attribute name 202G_OrdAdd.

I'm getting closer, though. Thank you for the suggestion, as this does seem
to be the road I need to be on.

- Jeff
 
Ken

Actually, none of this will work. You can't do what you're trying to do
because preceding-sibling will look at all the preceding siblings. So you'll
find your first 203 gets reported correctly as being "after" 202. But all of
the other 203's in your XML will also say they are after 202 -- because they
are!
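
The difference between "some preceding sibling is 202" and "the nearest preceding sibling is 202" can be checked with a small sketch. The sample data, the <data> wrapper and the variable names are assumptions for illustration. The first expression accepts any earlier sibling, so every 203 qualifies; the second adds a [1] position predicate, which on a reverse axis such as preceding-sibling should select only the closest sibling.

```ruby
require 'rexml/document'

# Hypothetical sample; the <data> wrapper and self-closing tags are assumptions.
doc = REXML::Document.new(<<~XML)
  <data>
    <variable value="202"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="16"/>
  </data>
XML

# Any preceding sibling may be the 202.
any_before = REXML::XPath.match(
  doc, '//variable[@value="203"][preceding-sibling::variable[@value="202"]]'
)

# Only the nearest preceding sibling ([1]) may be the 202.
nearest_before = REXML::XPath.match(
  doc, '//variable[@value="203"][preceding-sibling::variable[1][@value="202"]]'
)

puts "without [1]: #{any_before.size} match(es)"
puts "with [1]:    #{nearest_before.size} match(es)"
```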

If you put a yield statement in your events.each iterator, you'll see what I
mean. It will report the first 203 correctly. The loop will break at that
point because yield will tell you that you have no block. But the point is
when you take out yield, you'll see that your output is all the 203's.

The issue is that you're trying to do two predicates at the same time. That
can work (just have two bracketed groups), but not with how you are trying
to do it in this case. I'd recommend just treating the XML file like a
regular old text file and parse it line by line with regular expressions.
Don't even use an XML parser.
 
Jeff Nyman

Ken said:
If you put a yield statement in your events.each iterator, you'll see what
I mean. It will report the first 203 correctly. The loop will break at
that point because yield will tell you that you have no block. But the
point is when you take out yield, you'll see that your output is all the
203's.

Hmmm. But, you know, you gave me an idea and it does appear to work, at
least when I get out of using my event iterator. Check this out.

If I use this:

XPath.first(xml,
'//variable[@name="203G_OrdUpdate"][following-sibling::variable[1][@name="16G_OrdAdd"]]')

I do get the 203 that appears just before the 16G_OrdAdd. (There are 30
203's in the file and I can tell it's grabbing the right one because each
has a unique count attribute.)

Similarly, I can do this:

XPath.first(xml,
'//variable[@name="203G_OrdUpdate"][preceding-sibling::variable[1][@name="202G_OrdAdd"]]')

That, in turn gets me the first 203 after my 202.

If I change my "first" to "match" then everything comes up just as I want.
So I think my use of the events iterator was throwing me off in terms of
getting my results. It looks like I don't really need to do that. Is the
iterator what you were referring to in terms of this not being workable?
(The "yield" thing kind of threw me off.)
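
For completeness, here is a rough sketch of how the two expressions might be used to change the attributes afterwards. The sample document, the constant names and the "First"/"Last" suffixes are assumptions for illustration, not the real perflog data. Both match lists are collected before anything is renamed, since changing a name would affect later matches.

```ruby
require 'rexml/document'

# Hypothetical, pared-down sample; the count attribute mirrors the post.
doc = REXML::Document.new(<<~XML)
  <perfpoints>
    <variable name="202G_OrdAdd" count="1"/>
    <variable name="203G_OrdUpdate" count="2"/>
    <variable name="203G_OrdUpdate" count="3"/>
    <variable name="203G_OrdUpdate" count="4"/>
    <variable name="16G_OrdAdd" count="5"/>
  </perfpoints>
XML

# The same two expressions as above, split across lines for readability.
FIRST_203 = '//variable[@name="203G_OrdUpdate"]' \
            '[preceding-sibling::variable[1][@name="202G_OrdAdd"]]'
LAST_203  = '//variable[@name="203G_OrdUpdate"]' \
            '[following-sibling::variable[1][@name="16G_OrdAdd"]]'

# Collect both sets first, then modify.
firsts = REXML::XPath.match(doc, FIRST_203)
lasts  = REXML::XPath.match(doc, LAST_203)

firsts.each { |el| el.attributes['name'] = '203G_OrdUpdateFirst' }
lasts.each  { |el| el.attributes['name'] = '203G_OrdUpdateLast' }

# Print the modified document.
doc.write($stdout)
puts
```

Using match rather than first means every 202 ... 16 group in the file gets handled in one pass.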

- Jeff
 
