magic/xml library for easy XML processing

Tomasz Wegrzanowski · Aug 4, 2006

Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl's XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

The code and (rather incomplete) documentation
are here -> http://zabor.org/taw/magic_xml/

A few examples, so you can quickly see whether you're interested or not.

Parse the Atom feed for my blog and print post titles and URLs:

doc = XML.from_url "http://t-a-w.blogspot.com/atom.xml"
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

Get my del.icio.us posts about magic/xml and format them
as an XHTML list (for magic/xml's website):

deli_passwd = File.read("/home/taw/.delipasswd").chomp
url = "http://taw:#{deli_passwd}@del.icio.us/api/posts/recent?tag=taw+blog+magicxml"
XML.from_url(url).children(:post).reverse.each_with_index {|p,i|
  print XML.li("#{i+1}. ", XML.a({:href => p[:href]}, p[:description]))
}

Extract articles and IDs from a Wikipedia dump. It keeps only
small fragments in memory, but provides all convenient access
methods (works like XML::Twig, but with much nicer interface):

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete!
  t = node.children(:title)[0].contents
  i = node.children(:id)[0].contents
  print "#{i}: #{t}\n"
}

More about stream processing with magic/xml at
http://t-a-w.blogspot.com/2006/08/xml-stream-processing-with-magicxml.html
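
For readers who want to see the streaming idea without magic/xml installed, here is a minimal sketch that does roughly the same extraction with REXML's stream parser, which the post names as the current source of parse events. PageListener and each_page are hypothetical names made up for this illustration and are not part of magic/xml; the sketch also assumes each <page> carries its <title> and <id> as direct children, as the example above implies.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "stringio"

# Hypothetical listener: remembers only the <title> and <id> text of the
# <page> currently being read, then hands them to a block and forgets them.
class PageListener
  include REXML::StreamListener

  def initialize(&on_page)
    @on_page = on_page
    @path = []     # names of the elements that are currently open
    @fields = nil  # text collected for the current <page>, nil outside one
  end

  def tag_start(name, _attrs)
    @path.push(name)
    @fields = Hash.new { |h, k| h[k] = +"" } if name == "page"
  end

  def text(data)
    # Only direct children of <page> count, so <revision><id> is skipped.
    return unless @fields && @path[-2] == "page"
    key = @path.last
    @fields[key] << data if key == "title" || key == "id"
  end

  def tag_end(name)
    @path.pop
    return unless name == "page"
    @on_page.call(@fields["id"].strip, @fields["title"].strip)
    @fields = nil
  end
end

# Streams the document from any IO and yields (id, title) once per page.
def each_page(io, &block)
  REXML::Document.parse_stream(io, PageListener.new(&block))
end

SAMPLE_DUMP = <<~DUMP
  <mediawiki>
    <page><title>Ruby</title><id>42</id><revision><id>7</id></revision></page>
    <page><title>XML</title><id>43</id></page>
  </mediawiki>
DUMP

each_page(StringIO.new(SAMPLE_DUMP)) { |id, title| print "#{id}: #{title}\n" }
```

The listener has to track the open-element path and the per-page state by hand, which is the bookkeeping a twig-style interface is meant to hide.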

The most important thing to do would be to find cases
where other libraries are more expressive than magic/xml
and fix these cases if possible. As I don't know half
of the other libraries, and you certainly do, I need your help here.

And I guess I should also add XPath, port to a faster XML parser (currently
using REXML to get a stream of XML parse events), and add
some interface for accessing fancy XML features like
processing instructions to get it out of alpha.

Adam Keys · Aug 5, 2006

Tomasz Wegrzanowski said:
Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl's XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

This looks really promising. You almost lost me though. I suspected
from the mention of XML::Twig that magic/xml might handle Really Huge
XML (TM) more gracefully than our current options and it looks like
it does. Point being, you should mention it handles Really Huge XML
gracefully up front, as that is something I find myself inflicting
ugly hacks upon myself to achieve today.

Tomasz Wegrzanowski said:
And I guess I should also add XPath, port to a faster XML parser (currently
using REXML to get a stream of XML parse events), and add
some interface for accessing fancy XML features like
processing instructions to get it out of alpha

Yes, please, XPath! It could be I'm the only one who likes XPath,
but I find it a great way to pluck data out of XML.

This library looks really good. I'm going to keep it in mind for all
my "some idiot sent me a 1 GB XML file that is a bunch of smaller
files concatenated together" needs.

Tomasz Wegrzanowski · Aug 5, 2006

Adam Keys said:
This looks really promising. You almost lost me though. I suspected
from the mention of XML::Twig that magic/xml might handle Really Huge
XML (TM) more gracefully than our current options and it looks like
it does. Point being, you should mention it handles Really Huge XML
gracefully up front, as that is something I find myself inflicting
ugly hacks upon myself to achieve today.
[...]

This library looks really good. I'm going to keep it in mind for all
my "some idiot sent me a 1 GB XML file that is a bunch of smaller
files concatenated together" needs.

Handling huge XML files is just a bonus. The main reason the library exists
is its sheer expressive power.

I tried to recode W3C's XQuery Use Cases
(http://www.w3.org/TR/xquery-use-cases/)
in magic/xml to see how it compares with XQuery on XQuery's terms,
and they're very close. For the use cases I have translated so far, the results are
(character counts, with whitespace merged and a few other transformations that
make the comparison more meaningful):

Problem XMP 1: Ruby 187 (114%), XQuery: 164
Problem XMP 2: Ruby 132 (100%), XQuery: 132
Problem XMP 3: Ruby 115 (103%), XQuery: 112
Problem XMP 4: Ruby 400 (101%), XQuery: 398
Problem XMP 5: Ruby 367 (124%), XQuery: 296
Problem XMP 6: Ruby 220 (104%), XQuery: 211
Problem XMP 7: Ruby 232 (135%), XQuery: 172
Problem XMP 8: Ruby 150 (88%), XQuery: 170
Problem XMP 9: Ruby 157 (129%), XQuery: 122
Problem XMP 10: Ruby 298 (142%), XQuery: 210
Problem XMP 11: Ruby 295 (136%), XQuery: 217
Problem XMP 12: Ruby 457 (118%), XQuery: 387
Problem Tree 1: Ruby 166 (61%), XQuery: 270
Problem Tree 2: Ruby 118 (109%), XQuery: 108
Problem Tree 3: Ruby 133 (101%), XQuery: 132
Problem Tree 4: Ruby 75 (93%), XQuery: 81
Problem Tree 5: Ruby 168 (104%), XQuery: 161
Problem Tree 6: Ruby 255 (69%), XQuery: 369
Total: Ruby 3925 (106%), XQuery: 3712
Median ratio: 104%

I don't think any other Ruby library for XML can get anywhere
close to such results. And efficient processing of large XML files?
That's just a small freebie :-D
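
The percentages, total and median in the table above can be re-derived from the raw character counts. A minimal sketch, with the counts copied from the table; the constant and method names are made up for this illustration, and the median is taken over the unrounded Ruby/XQuery ratios, which is an assumption about how the original figure was computed:

```ruby
# [problem, Ruby characters, XQuery characters], copied from the table above.
COUNTS = [
  ["XMP 1", 187, 164], ["XMP 2", 132, 132], ["XMP 3", 115, 112],
  ["XMP 4", 400, 398], ["XMP 5", 367, 296], ["XMP 6", 220, 211],
  ["XMP 7", 232, 172], ["XMP 8", 150, 170], ["XMP 9", 157, 122],
  ["XMP 10", 298, 210], ["XMP 11", 295, 217], ["XMP 12", 457, 387],
  ["Tree 1", 166, 270], ["Tree 2", 118, 108], ["Tree 3", 133, 132],
  ["Tree 4", 75, 81], ["Tree 5", 168, 161], ["Tree 6", 255, 369]
].freeze

# Ruby size as a percentage of XQuery size, rounded to a whole percent.
def percent(ruby, xquery)
  (100.0 * ruby / xquery).round
end

# Median of a list of numbers (mean of the two middle values if even-sized).
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

TOTAL_RUBY   = COUNTS.sum { |_name, ruby, _xquery| ruby }
TOTAL_XQUERY = COUNTS.sum { |_name, _ruby, xquery| xquery }
RATIOS       = COUNTS.map { |_name, ruby, xquery| ruby.to_f / xquery }

COUNTS.each do |name, ruby, xquery|
  puts "Problem #{name}: Ruby #{ruby} (#{percent(ruby, xquery)}%), XQuery: #{xquery}"
end
puts "Total: Ruby #{TOTAL_RUBY} (#{percent(TOTAL_RUBY, TOTAL_XQUERY)}%), XQuery: #{TOTAL_XQUERY}"
puts "Median ratio: #{(median(RATIOS) * 100).round}%"
```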

Zed Shaw · Aug 5, 2006

Tomasz Wegrzanowski said:
Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl's XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

doc = XML.from_url "http://t-a-w.blogspot.com/atom.xml"
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

I'd like to have it work like Hpricot as well:

(doc/:entry/:link).each {|c|
  ...
}

Especially considering you overload [] to get attributes, why not / to
get children?

Trans · Aug 6, 2006

Zed said:
Hello,

Some of you may be interested in a new library for XML processing.
It is inspired by languages designed just for XML-processing like CDuce
and to some extent by Perl's XML::Twig. Basically easy things
should be easy, and everything should be integrated very tightly
with Ruby.

doc = XML.from_url "http://t-a-w.blogspot.com/atom.xml"
doc.children(:entry).children(:link) {|c|
print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

I'd like to have it work like Hpricot as well:

(doc/:entry/:link).each {|c|
  ...
}

Especially considering you overload [] to get attributes, why not / to
get children?

http://cherry.rubyforge.org
http://rubyforge.org/projects/cherry/

T.

Tomasz Wegrzanowski · Aug 6, 2006

Zed Shaw said:
doc = XML.from_url "http://t-a-w.blogspot.com/atom.xml"
doc.children(:entry).children(:link) {|c|
  print "#{c[:title]}\n#{c[:href]}\n\n" if c[:rel] == "alternate"
}

I'd like to have it work like Hpricot as well:

(doc/:entry/:link).each {|c|
  ...
}

Especially considering you overload [] to get attributes, why not / to
get children?

Basically because there are three reasonable things to do with node[:foo]:
* return attribute :foo
* return the first child with tag :foo
* return the list of children with tag :foo
Ruby is not Perl, so we cannot have both 2 and 3 folded into one,
and doing only the second or only the third doesn't sound that convincing ;-)

Another issue is that I'd have to overload Array#/ to get
(doc/:entry/:link) working,
and that would have a much higher mental cost than adding a long-named
method like #children to it. Or I could use something other than an Array
for sequences of XML nodes (Hpricot does so with Hpricot::Elements),
but that wouldn't be nice. I'll look at it again after I have all the W3C
XQuery Use Cases recoded.
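
To make the trade-off concrete, here is a minimal sketch of the second option: a dedicated sequence class that carries the / operator, so Array itself stays untouched. Node and NodeList are toy classes invented for this illustration; they are not magic/xml's or Hpricot's real classes, and the sample feed is made-up data.

```ruby
# Toy XML node: a tag name, an attribute hash and a list of child nodes.
class Node
  attr_reader :name

  def initialize(name, attrs = {}, kids = [])
    @name = name
    @attrs = attrs
    @kids = kids
  end

  # node[:href] returns an attribute, as in the examples in this thread.
  def [](key)
    @attrs[key]
  end

  # Direct children with the given tag, wrapped in a NodeList.
  def children(tag)
    NodeList.new(@kids.select { |kid| kid.name == tag })
  end

  def /(tag)
    children(tag)
  end
end

# Sequence of nodes that supports / without overloading Array#/.
class NodeList
  include Enumerable

  def initialize(nodes)
    @nodes = nodes
  end

  def each(&block)
    @nodes.each(&block)
  end

  # Children of every node in the list, flattened into one NodeList.
  def children(tag)
    NodeList.new(@nodes.flat_map { |node| node.children(tag).to_a })
  end

  def /(tag)
    children(tag)
  end
end

# Made-up sample data shaped like the Atom feed example.
def sample_feed
  Node.new(:feed, {}, [
    Node.new(:entry, {}, [
      Node.new(:link, { rel: "alternate", href: "http://example.org/1" }),
      Node.new(:link, { rel: "self", href: "http://example.org/1.xml" })
    ]),
    Node.new(:entry, {}, [
      Node.new(:link, { rel: "alternate", href: "http://example.org/2" })
    ])
  ])
end

doc = sample_feed
(doc / :entry / :link).each { |c| puts c[:href] if c[:rel] == "alternate" }
```

The cost shows up as soon as a NodeList meets code that expects a real Array: every Array method that should keep returning a node sequence has to be wrapped by hand, which is presumably part of why this "wouldn't be nice".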
