Parse and modify an XML file with REXML

Discussion in 'Ruby' started by jeffnyman@gmail.com, Oct 5, 2006.

  1. Guest

    Greetings all.

    When processing XML, is there a way to check what the previous and what
    the next "rows" are?

    That probably makes no sense without context, so here is an example. I
    need to find things in the XML based on rules. For example, one rule
    might be "find the first 203 that comes after 202." Another rule is
    "Find the first 203 that comes before 16." So say I have this:

    <variable value="202">
    <variable value="203">
    <variable value="203">
    <variable value="203">
    <variable value="203">
    <variable value="203">
    <variable value="16">

    I have to be able to find that the element after 202 is 203. (As
    opposed to a situation where a 202 appeared, but the next element was
    not 203.) I then have to determine that the element after a given 203
    is 16. Then I have to change the value attribute of the first and last
    203 elements. So the XML, after applying the rules, would look like
    this:

    <variable value="202">
    <variable value="203First">
    <variable value="203">
    <variable value="203">
    <variable value="203">
    <variable value="203Last">
    <variable value="16">

    The 202 and 16 are essentially bracketers of data, in this case. There
    can be many such groups in the XML that look like this.

    I know how to parse through XML using XPath or using a stream listener.
    I have read the tutorial that comes with REXML. But what I'm not sure
    how to do is check for the conditions like I described above. One
    thought was I could read the XML into an array because then I get an
    enforced "line numbering" with the indexing. So I could check
    currentLine - 1 and currentLine + 1. I'm not sure if that is a smart
    approach, however.
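
One way to picture the array idea, as a minimal sketch: it assumes well-formed input where each variable element is self-closed and all of them share one parent (the perfpoints wrapper and the sample values are only illustrative). The original values are snapshotted first so that renaming one element does not affect the checks on its neighbours.

```ruby
require "rexml/document"

# Made-up sample; elements are self-closed so the document is well-formed.
doc = REXML::Document.new(<<~XML)
  <perfpoints>
    <variable value="202"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="16"/>
  </perfpoints>
XML

variables = doc.root.get_elements("variable")              # Array in document order
values    = variables.map { |el| el.attributes["value"] }  # snapshot of the originals

variables.each_with_index do |el, i|
  next unless values[i] == "203"

  if i > 0 && values[i - 1] == "202"
    el.attributes["value"] = "203First"   # previous "row" was 202
  elsif values[i + 1] == "16"
    el.attributes["value"] = "203Last"    # next "row" is 16
  end
end

result = variables.map { |el| el.attributes["value"] }
puts result.inspect
```

In this sketch a lone 203 sitting directly between a 202 and a 16 would be tagged 203First only; adjust the branch order if that case needs different handling.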

    Has anyone done something similar in their work?

    - Jeff
     
    , Oct 5, 2006
    #1

  2. Tomasz Wegrzanowski, Oct 5, 2006
    #2

  3. Peter Szinek Guest

    wrote:
    > Greetings all.
    >
    > When processing XML, is there a way to check what the previous and what
    > the next "rows" are?


    I don't know REXML that much (and using Hpricot anyway ;-) but standard
    XPath axes ( following-sibling, preceding-sibling ) won't help? The
    previous node in this case would be self::previous-sibling[1] etc.
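
For reference, XPath names these axes preceding-sibling and following-sibling (there is no previous-sibling axis), and REXML elements also expose previous_element and next_element. A small sketch on made-up sample data, not taken from the original poster's file:

```ruby
require "rexml/document"

# Illustrative data only; the surrounding 1 and 99 are there to show
# that position [1] picks the nearest sibling, not the first in the file.
doc = REXML::Document.new(<<~XML)
  <perfpoints>
    <variable value="1"/>
    <variable value="202"/>
    <variable value="203"/>
    <variable value="16"/>
    <variable value="99"/>
  </perfpoints>
XML

current = REXML::XPath.first(doc, '//variable[@value="203"]')

# On a sibling axis, [1] means the closest sibling in that direction.
before = REXML::XPath.first(current, "preceding-sibling::variable[1]")
after  = REXML::XPath.first(current, "following-sibling::variable[1]")

puts before.attributes["value"]
puts after.attributes["value"]

# The same neighbours without XPath; whitespace text nodes are skipped.
puts current.previous_element.attributes["value"]
puts current.next_element.attributes["value"]
```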

    HTH,

    Peter
    http://www.rubyrailways.com
     
    Peter Szinek, Oct 5, 2006
    #3
  4. Pete Guest

    In article <>,
    <> wrote:
    >Greetings all.
    >
    >When processing XML, is there a way to check what the previous and what
    >the next "rows" are?
    >
    >That probably makes no sense without context, so here is an example. I
    >need to find things in the XML based on rules. For example, one rule
    >might be "find the first 203 that comes after 202." Another rule is
    >"Find the first 203 that comes before 16." So say I have this:
    >
    ><variable value="202">
    ><variable value="203">
    ><variable value="203">
    ><variable value="203">
    ><variable value="203">
    ><variable value="203">
    ><variable value="16">
    >
    >I have to be able to find that the element after 202 is 203. (As
    >opposed to a situation where a 202 appeared, but the next element was
    >not 203.) I then have to determine that the element after a given 203
    >is 16. Then I have to change the value attribute of the first and last
    >203 elements. [.....]
    >
    >Has anyone done something similar in their work?
    >
    >- Jeff


    I've just been playing with a project that looks like it might have
    some similarities. I acquired an app that creates an XML representation
    of a midifile, and I wanted to add useful info to the XML to help the
    human reader (and maybe allow other postprocessing). In particular,
    a 'note' in a midifile is begun with a NoteOn event, and ends sometime
    later when a corresponding NoteOff appears. I wanted to add an attribute
    to each NoteOn element that gave its actual duration. Other elements
    that had added attributes could (otherwise) be output again immediately,
    but the NoteOns would have to be held until the NoteOff was read, and
    as order is important that meant other events might have to wait, too.

    (Of course I'm using stream parsing here. The XML-ized midifile can
    get pretty long, and I don't like the idea of keeping an entire DOM
    tree around. I'm kind of more at home with streams, anyway.)

    Essentially I make a list of the elements waiting to be output. Each
    object in the list has a 'complete' flag that is set immediately for
    most tags, except for NoteOn, which is set complete when the NoteOff
    arrives and the duration can be calculated. When the first element
    in the list becomes complete, all finished items at the head of the list
    are output.

    To keep track of the reading end of things I have Element Handler
    objects that can maintain knowledge of the current state (which in the
    case of NoteOn/Offs means a fairly large array of references, but for
    your purposes would just be the value of the previous 'variable').
    I actually wrote an extension to REXML for this that I think is quite
    useful, and will publish -- soon, I hope. I don't think that would
    be needed for your job, though; a simple 'tag_start' handler (from
    REXML::StreamListener) that recognized tag 'variable' should be
    adequate.

    You'd then just have to note, when you got a '203' whether the
    previous was '202' and modify it if so. If not, you'd hold on to
    it until the next 'variable'; if that was '16', you'd modify it
    and output it, otherwise you'd just output it. You wouldn't even
    need a list if there were never any intervening elements.
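
The hold-one-element idea described above can be sketched roughly as follows. This is not the poster's extension; the VariableMarker class, its finish method, and the sample data are hypothetical, and the "output" is simply collected into an array. It assumes the attribute is called value, as in the original question.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: remembers the previous <variable> and holds back
# a plain 203 until the next <variable> shows whether it was the last one.
class VariableMarker
  include REXML::StreamListener

  attr_reader :values

  def initialize
    @values   = []   # rewritten values, in document order
    @previous = nil  # original value of the previous <variable>
    @held     = nil  # a 203 still waiting for its successor
  end

  def tag_start(name, attrs)
    return unless name == "variable"
    value = attrs["value"]

    # The successor of the held 203 is now known, so release it.
    if @held
      @values << (value == "16" ? "203Last" : @held)
      @held = nil
    end

    if value == "203" && @previous == "202"
      @values << "203First"
    elsif value == "203"
      @held = value
    else
      @values << value
    end
    @previous = value
  end

  # Flush a trailing 203 that was never followed by another <variable>.
  def finish
    @values << @held if @held
    @held = nil
    self
  end
end

xml = <<~XML
  <perfpoints>
    <variable value="202"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="203"/>
    <variable value="16"/>
  </perfpoints>
XML

listener = VariableMarker.new
REXML::Document.parse_stream(xml, listener)
listener.finish
puts listener.values.inspect
```

Only one element is ever held back, so memory use stays flat regardless of file size, which is the main attraction of the stream approach.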

    Oof! Sorry, that got rather long-winded, and I don't know if it made
    any sense, but I hope it's useful.

    -- Pete --

    --
    ============================================================================
    The address in the header is a Spam Bucket -- don't bother replying to it...
    (If you do need to email, replace the account name with my true name.)
     
    Pete, Oct 5, 2006
    #4
  5. Jeff Nyman Guest

    "Peter Szinek" <> wrote in message
    news:...
    > wrote:
    >> Greetings all.
    >>
    >> When processing XML, is there a way to check what the previous and what
    >> the next "rows" are?

    >
    > I don't know REXML that much (and using Hpricot anyway ;-) but standard
    > XPath axes ( following-sibling, preceding-sibling ) won't help? The
    > previous node in this case would be self::previous-sibling[1] etc.


    Thanks for the suggestion. This sounds like it might work. I did not see
    this in the REXML documentation initially but I see generally how these work
    in concept. In practice, it does not seem to work for me.

    I have my XML like this (greatly pared down):

    <perflog>
    <module>
    <perfpoints>
    <variable name="202G_OrdAdd">
    <variable name="203G_OrdUpdate">
    ....
    </perfpoints>
    </module>
    </perflog>

    I tried this:

    <code>
    xml = Document.new(File.open("test.xml"))

    events = XPath.match(xml,
    '/perflog/module/perfpoints/variable[@name="203G_OrdUpdate"]'
    )

    events.each do |event|
    puts XPath.match(event, '[self:preceding-sibling[1](@name,
    "202G_OrdAdd")]')
    end
    </code>

    In the events iterator, I also tried the following variation:

    puts XPath.match(event, 'self:preceding-sibling[1](@name, "202G_OrdAdd")')

    I also tried replacing the 'self' with the full node path (i.e.,
    "//perflog/module/perfpoints/variable").

    I should note I don't get an error when I run the above. I simply get
    nothing, so my guess is that I'm using preceding-sibling wrong. I'm guessing
    it never feels it found the condition I'm indicating it should be finding.

    I did find that I can do this:

    puts XPath.match(event, '[self:preceding-sibling::variable[1](@name,
    "202G_OrdAdd")]')

    (Note the "::variable[1]" addition.) Some documentation I found suggests
    that this should count backwards and reference the closest preceding
    variable sibling. That does seem to work -- to an extent, but I get
    everything returned. Meaning I get this in my results:

    <variable name = "203G_OrdUpdate">
    <variable name = "202G_OrdAdd">

    .... but then I get all the other 203's in my XML listed as well. What I'm
    trying to do is just return the one 203 that has a preceding sibling that
    has the attribute name 202G_OrdAdd.

    I'm getting closer, though. Thank you for the suggestion, as this does seem
    to be the road I need to be on.

    - Jeff
     
    Jeff Nyman, Oct 6, 2006
    #5
  6. Ken Guest

    Actually, none of this will work. You can't do what you're trying to do
    because preceding-sibling will look at all the preceding siblings. So you'll
    find your first 203 gets reported correctly as being "after" 202. But all of
    the other 203's in your XML will also say they are after 202 -- because they
    are!

    If you put a yield statement in your events.each iterator, you'll see what I
    mean. It will report the first 203 correctly. The loop will break at that
    point because yield will tell you that you have no block. But the point is
    when you take out yield, you'll see that your output is all the 203's.

    The issue is that you're trying to do two predicates at the same time. That
    can work (just have two bracketed groups), but not with how you are trying
    to do it in this case. I'd recommend just treating the XML file like a
    regular old text file and parse it line by line with regular expressions.
    Don't even use an XML parser.
     
    Ken, Oct 6, 2006
    #6
  7. Jeff Nyman Guest

    "Ken" <> wrote:

    > If you put a yield statement in your events.each iterator, you'll see what
    > I mean. It will report the first 203 correctly. The loop will break at
    > that point because yield will tell you that you have no block. But the
    > point is when you take out yield, you'll see that your output is all the
    > 203's.


    Hmmm. But, you know, you gave me an idea and it does appear to work, at
    least when I get out of using my event iterator. Check this out.

    If I use this:

    XPath.first(xml,
    '//variable[@name="203G_OrdUpdate"][following-sibling::variable[1][@name="16G_OrdAdd"]]')

    I do get the 203 that appears just before the 16G_OrdAdd. (There are 30
    203's in the file and I can tell it's grabbing the right one because each
    has a unique count attribute.)

    Similarly, I can do this:

    XPath.first(xml,
    '//variable[@name="203G_OrdUpdate"][preceding-sibling::variable[1][@name="202G_OrdAdd"]]')

    That, in turn gets me the first 203 after my 202.

    If I change my "first" to "match" then everything comes up just as I want.
    So I think my use of the events iterator was throwing me off in terms of
    getting my results. It looks like I don't really need to do that. Is the
    iterator what you were referring to in terms of this not being workable?
    (The "yield" thing kind of threw me off.)
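
Putting the two expressions together, a hedged end-to-end sketch might look like this. The sample document, the count values, and the replacement names 203First and 203Last are made up for illustration; the real file uses longer names and has many more elements.

```ruby
require "rexml/document"

# Illustrative, well-formed stand-in for the pared-down perflog file.
doc = REXML::Document.new(<<~XML)
  <perflog><module><perfpoints>
    <variable name="202G_OrdAdd" count="1"/>
    <variable name="203G_OrdUpdate" count="2"/>
    <variable name="203G_OrdUpdate" count="3"/>
    <variable name="203G_OrdUpdate" count="4"/>
    <variable name="16G_OrdAdd" count="5"/>
  </perfpoints></module></perflog>
XML

first_path = '//variable[@name="203G_OrdUpdate"]' \
             '[preceding-sibling::variable[1][@name="202G_OrdAdd"]]'
last_path  = '//variable[@name="203G_OrdUpdate"]' \
             '[following-sibling::variable[1][@name="16G_OrdAdd"]]'

# Run both queries before changing anything, so that renaming an element
# cannot stop it from matching the second expression.
firsts = REXML::XPath.match(doc, first_path)
lasts  = REXML::XPath.match(doc, last_path)

firsts.each { |el| el.attributes["name"] = "203First" }
lasts.each  { |el| el.attributes["name"] = "203Last" }

names = REXML::XPath.match(doc, "//variable").map { |el| el.attributes["name"] }
puts names.inspect
```

Using match rather than first means every 202 ... 16 group in the file is handled in one pass, which fits the note that there can be many such groups.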

    - Jeff
     
    Jeff Nyman, Oct 6, 2006
    #7
