One more way to parse XML...

Discussion in 'Ruby' started by pete@jwgibbs.cchem.berkeley.edu, Oct 21, 2006.

  1. Guest

    I thought I'd put this out to see if there's any interest.

    Recently I wanted to do some XML reading with Ruby, but looking
    (not deeply) at REXML and other packages like xmlcodec, I couldn't
    find anything that seemed to fit the way I thought about things,
    so I [of course... :)-)] wrote a little wrapper to REXML that
    fit me a little better.

    First, I wanted to have a 'stream' parser, rather than reading the
    whole tree into memory and then working on it. The documents I was
    interested in (XML representation of midifiles) are not very deep,
    but can get very lengthy, and the processing I wanted to do was
    mostly sequential.

    However, the stream parsers I've seen -- including REXML::StreamListener --
    simply pass the pieces of the document in turn to the app, without any
    real notion of the tree, so the app has to keep track of all that itself.

    In other languages, protocols, and situations (starting with IFF on
    the Amiga, I guess!) I've had success with what then would have been
    a "table driven" scheme. Now, it's more a "linked object" approach:
    each node of the Document Model gets a node that specifies what is to
    be done when an element that it represents is encountered, and also has
    a list (the 'table') of the possible subordinate nodes. You create
    and develop these nodes before reading the document, then with a call to
    the stream reader all the appropriate actions get taken as needed.

    With Ruby it's a snap to create such a node structure. I just formalised
    it a bit and provided an 'XMLStreamListener' class to extend REXML's
    basic version, which keeps track of the node structure and dispatches
    to the appropriate current one. The 'XMLSpec' nodes themselves have
    methods to handle start, end, and empty tags and of course enclosed text.
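A minimal sketch of the node scheme described above, using only REXML's stock StreamListener. The names SpecNode and TreeListener, the keyword arguments, and the sample document are assumptions made for illustration; they are not the XMLSpec / XMLStreamListener API from the module itself.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical node type: one per element kind in the document model.
class SpecNode
  attr_reader :name, :children

  def initialize(name, on_start: nil, on_text: nil, on_end: nil)
    @name = name
    @children = {} # the 'table' of permitted subordinate nodes
    @on_start = on_start
    @on_text = on_text
    @on_end = on_end
  end

  # Register a child node; returns self so calls can be chained.
  def add(child)
    @children[child.name] = child
    self
  end

  def start(attrs)
    @on_start&.call(attrs)
  end

  def text(data)
    @on_text&.call(data)
  end

  def finish
    @on_end&.call
  end
end

# Hypothetical listener: keeps a stack of the nodes matching the open elements.
class TreeListener
  include REXML::StreamListener

  def initialize(root)
    @stack = [root] # root is a synthetic node above the document element
  end

  def tag_start(name, attrs)
    parent = @stack.last
    node = parent && parent.children[name]
    node.start(attrs) if node
    @stack.push(node) # nil marks an element that no node claims
  end

  def text(data)
    node = @stack.last
    node.text(data) if node && !data.strip.empty?
  end

  def tag_end(_name)
    node = @stack.pop
    node.finish if node
  end
end

# Made-up document, loosely in the spirit of a MIDI file rendered as XML.
pitches = []
note  = SpecNode.new('note', on_start: ->(attrs) { pitches << attrs['pitch'].to_i })
track = SpecNode.new('track').add(note)
song  = SpecNode.new('song').add(track)
root  = SpecNode.new(nil).add(song)

xml = '<song><track><note pitch="60"/><note pitch="64"/></track><note pitch="1"/></song>'
REXML::Document.parse_stream(xml, TreeListener.new(root))
p pitches # => [60, 64]
```

The stray <note> directly under <song> is ignored because the song node's table has no entry for it, which is the tree awareness a bare stream listener lacks.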

    So if anyone is interested in digging deeper, I've provided a web
    page (with the module, example use, and downloadable archives) at

    http://jwgibbs.cchem.berkeley.edu/pete/xmlstreamin/

    Cheers,
    -- Pete --
    , Oct 21, 2006
    #1

  2. Ross Bamford Guest

    On Sat, 2006-10-21 at 16:15 +0900,
    wrote:
    > I thought I'd put this out to see if there's any interest.


    This looks pretty cool. It has echoes of the Jakarta Commons Digester,
    of which I made a Ruby port a while back (http://digestr.rubyforge.org),
    though using libxml-ruby rather than REXML.

    I quite like the digester model and have found it very handy for dealing
    with certain types of XML (mostly XML from the Java world I guess).

    --
    Ross Bamford -
    Ross Bamford, Oct 21, 2006
    #2

  3. Guest

    In article <>,
    Ross Bamford <> wrote:
    >On Sat, 2006-10-21 at 16:15 +0900,
    >wrote:
    >> I thought I'd put this out to see if there's any interest.

    >
    >This looks pretty cool. It has echoes of the Jakarta Commons Digester,
    >of which I made a Ruby port a while back (http://digestr.rubyforge.org),
    >though using libxml-ruby rather than REXML.


    Hmm, yes. I hadn't come across the "digester" before, but there
    do seem to be parallel trains of thought there. (I looked through the
    Jakarta version rather than yours -- finding a magazine article to
    read is more comfortable than chugging through documentation!)
    Looks nice (and much more extensive than mine, of course).

    The main difference (in philosophy) seems to be that the digester
    describes the tree with complete absolute paths for each node, where
    my scheme has each node only knowing about its immediate descendants.
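For contrast, a rough sketch of the absolute-path style: handlers are keyed by the full path from the document root. This is not the Digester or digestr API; PathListener, the rule table, and the sample document are made up for illustration, and only REXML's StreamListener is used.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: looks up a handler by the complete path of the
# element that has just been opened.
class PathListener
  include REXML::StreamListener

  def initialize(rules)
    @rules = rules # e.g. { 'song/track/note' => callable }
    @path = []     # names of the currently open elements, root first
  end

  def tag_start(name, attrs)
    @path.push(name)
    rule = @rules[@path.join('/')]
    rule.call(attrs) if rule
  end

  def tag_end(_name)
    @path.pop
  end
end

seen = []
rules = { 'song/track/note' => ->(attrs) { seen << attrs['pitch'] } }

xml = '<song><track><note pitch="60"/></track><note pitch="1"/></song>'
REXML::Document.parse_stream(xml, PathListener.new(rules))
p seen # => ["60"]
```

With paths, renaming or moving an element means touching every rule whose path passes through it; with per-node child tables only the parent's table changes.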

    >I quite like the digester model and have found it very handy for dealing
    >with certain types of XML (mostly XML from the Java world I guess).


    The article I read did seem to be oriented to building a tree (of Beans)
    in memory (so aren't we sort of back to DOM?) but I gather that you can
    provide other custom methods to do other kinds of processing.

    Thanks,
    -- Pete --
    , Oct 21, 2006
    #3
  4. Ross Bamford Guest

    On Sun, 2006-10-22 at 04:30 +0900,
    wrote:
    > In article <>,
    > Ross Bamford <> wrote:
    > >On Sat, 2006-10-21 at 16:15 +0900,
    > >wrote:
    > >> I thought I'd put this out to see if there's any interest.

    > >
    > >This looks pretty cool. It has echoes of the Jakarta Commons Digester,
    > >of which I made a Ruby port a while back (http://digestr.rubyforge.org),
    > >though using libxml-ruby rather than REXML.

    >
    > Hmm, yes. I hadn't come across the "digester" before, but there
    > do seem to be parallel trains of thought there. (I looked through the
    > Jakarta version rather than yours -- finding a magazine article to
    > read is more comfortable than chugging through documentation!)
    > Looks nice (and much more extensive than mine, of course).
    >


    Yes, it is pretty useful in some cases. The Ruby version is rather
    trimmed down by the standards of the Java one, partly because Ruby gets
    more done with less code, and partly because I didn't need everything
    when I made the port :)

    > The main difference (in philosophy) seems to be that the digester
    > describes the tree with complete absolute paths for each node, where
    > my scheme has each node only knowing about its immediate descendants.
    >


    Ahh, I see. That's an interesting strategy (certainly would be easier to
    get on with, esp under refactoring which can be a nightmare). I'll have
    to have a closer look at your code.

    > >I quite like the digester model and have found it very handy for dealing
    > >with certain types of XML (mostly XML from the Java world I guess).

    >
    > The article I read did seem to be oriented to building a tree (of Beans)
    > in memory (so aren't we sort of back to DOM?) but I gather that you can
    > provide other custom methods to do other kinds of processing.
    >


    Yes, most of the standard rules are geared towards building DOM-like
    trees, though instead of a tree representing the XML they allow an
    arbitrary tree of objects to be built based on the XML, with rules to
    take object attribute values from XML attributes, tag bodies, and so on.

    You can just plug in your own rule implementations to do pretty much
    what you like - they're basically just SAX handlers (at least in the
    Ruby implementation, IIRC there's a bit more abstraction in the Java
    original).

    Cheers,
    --
    Ross Bamford -
    Ross Bamford, Oct 22, 2006
    #4
  5. Tomasz Wegrzanowski Guest

    On 10/21/06,
    <> wrote:
    >
    > I thought I'd put this out to see if there's any interest.
    >
    > Recently I wanted to do some XML reading with Ruby, but looking
    > (not deeply) at REXML and other packages like xmlcodec, I couldn't
    > find anything that seemed to fit the way I thought about things,
    > so I [of course... :)-)] wrote a little wrapper to REXML that
    > fit me a little better.
    >
    > First, I wanted to have a 'stream' parser, rather than reading the
    > whole tree into memory and then working on it. The documents I was
    > interested in (XML representation of midifiles) are not very deep,
    > but can get very lengthy, and the processing I wanted to do was
    > mostly sequential.

    [...]

    Did you take a look at magic/xml [ http://zabor.org/jrpg/magic_xml/ ] ?

    magic/xml provides a very nice interface for xml streams.
    It doesn't need any callbacks, subclassing or any ugly things,
    all you need is a single block.

    The block gets incomplete nodes.
    It can call node.complete! to read whole subtree there,
    or simply let the processing move to the first child.

    For example, to process a Wikipedia database dump, which looks like this:

    <doc>
      <page>
        <title>Foo</title>
        <id>435</id>
      </page>
      <page>
        <title>Bar</title>
        <id>754</id>
      </page>
    </doc>

    and extract titles and ids, you can use this code:

    XML.parse_as_twigs(STDIN) {|node|
      next unless node.name == :page
      node.complete! # Read all children of <page>...</page> node
      t = node[:@title] # :@title is a child
      i = node[:@id] # :@id is another child
      print "#{i}: #{t}\n"
    }

    The block first gets called with XML node <doc></doc>.
    As the block does not call complete! on it, the block is next called with <page></page>.
    complete! fills the node: <page><title>Foo</title><id>435</id></page>
    Then we can use all convenient tree-based methods.
    As children of <page> were already read, the next node is <page></page>,
    which is completed to <page><title>Bar</title><id>754</id></page> and so on.
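For comparison, the same extraction can be approximated with REXML's StreamListener alone. This is only a stand-in for the idea of handling one completed <page> at a time, not how magic/xml is implemented; PageCollector is a hypothetical helper, and it only handles text in direct children of <page>.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical helper: gathers the text of each direct child of <page>
# into a Hash and hands that Hash to the block when </page> is reached.
class PageCollector
  include REXML::StreamListener

  def initialize(&block)
    @block = block
    @page = nil  # a Hash while inside <page>, nil otherwise
    @field = nil # name of the child element currently open, if any
  end

  def tag_start(name, _attrs)
    if name == 'page'
      @page = {}
    elsif @page
      @field = name
      @page[@field] = String.new
    end
  end

  def text(data)
    @page[@field] << data if @page && @field
  end

  def tag_end(name)
    if name == 'page'
      @block.call(@page)
      @page = nil
    elsif name == @field
      @field = nil
    end
  end
end

xml = <<~XML
  <doc>
  <page>
  <title>Foo</title>
  <id>435</id>
  </page>
  <page>
  <title>Bar</title>
  <id>754</id>
  </page>
  </doc>
XML

lines = []
listener = PageCollector.new { |page| lines << "#{page['id']}: #{page['title']}" }
REXML::Document.parse_stream(xml, listener)
puts lines
# prints:
# 435: Foo
# 754: Bar
```

Only one page's worth of data is held at a time, so memory use stays flat however long the dump is.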

    There is a tutorial [ http://zabor.org/jrpg/magic_xml/tutorial.html ]
    and a collection of solutions to the W3C XQuery use cases
    [ http://zabor.org/jrpg/magic_xml/xquery_use_cases.html ]

    --
    Tomasz Wegrzanowski [ http://t-a-w.blogspot.com/ ]
    Tomasz Wegrzanowski, Oct 23, 2006
    #5
  6. Guest

    In article <>,
    Tomasz Wegrzanowski <> wrote:
    >
    >Did you take a look at magic/xml [ http://zabor.org/jrpg/magic_xml/ ] ?

    Yes, I took a brief look, but it didn't immediately seem to be what
    I wanted.
    >
    >magic/xml provides a very nice interface for xml streams.
    >It doesn't need any callbacks, subclassing or any ugly things,
    >all you need is a single block.

    I'm sure it would do the job, but I guess again it's a matter of
    'philosophy'. I knew pretty much what I wanted to do and, because
    I've used that approach before (in C++ for instance), it was actually
    easier for me to write a mechanism to do it than try to figure out
    somebody else's way of thinking about things... :)-))

    A gentle [though perhaps a little blunt :)-/] suggestion: include
    a README in your package! One reason I didn't probe very deeply
    was that I couldn't find any "easy road in". RDOC is fine for checking
    up on the details of an API, but in my experience it's almost useless
    as a 'road map'. (This has been a problem for me with Ruby generally.
    The "User's Guide" is nice, but there's a big gap between that and
    the "Documentation". I suppose I shouldn't be so stingy, and buy the
    "Pickaxe" or something. :)-/)

    Cheers,
    -- Pete --
    , Oct 24, 2006
    #6
