Perl + SAX2 = slow?

Jesse Thompson

Greetings, fellow XML folk.

I've just gotten started making SAX filters in Perl. I was hoping to
build an XML templating engine this way, but the performance of
XML::SAX::Expat and XML::SAX::Writer *appears* to be unthinkably bad.

This code:

XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
))->parse_uri("test.xml");

takes 1 second to parse a 5 kilobyte piece of XML on my machine. Being
a 500 MHz machine, that's 10 kilobytes per gigahertz second.

Is this in any way normal? I was hoping to be able to process XML
about a hundred times this fast, maybe 1 MB per gigahertz second, or
about a thousand clock cycles per byte of consumed XML. I think that
sounds reasonable for the bare parsing/writing of XML in a zippy
language like Perl.. so I have to assume I am doing something very
very wrong in my setup :(
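
For reference, the arithmetic behind those figures can be sketched in a few lines of Perl. This is only an illustration: the function names are made up for this example, and it assumes decimal units (1 KB = 1000 bytes, 1 GHz = 1e9 cycles per second).

```perl
use strict;
use warnings;

# Normalised throughput: bytes processed per second per GHz of clock speed.
# (Hypothetical helper, named for this example only.)
sub bytes_per_ghz_second {
    my ($bytes, $seconds, $clock_ghz) = @_;
    return $bytes / ($seconds * $clock_ghz);
}

# Clock cycles spent per byte at a given normalised throughput.
sub cycles_per_byte {
    my ($bytes_per_ghz_s) = @_;
    return 1e9 / $bytes_per_ghz_s;
}

# Measured: 5 KB parsed in 1 second on a 500 MHz machine.
my $measured = bytes_per_ghz_second(5_000, 1, 0.5);

# Hoped for: 1 MB per gigahertz second.
my $target = bytes_per_ghz_second(1_000_000, 1, 1);

printf "measured: %d KB per GHz-second, %d cycles per byte\n",
    $measured / 1_000, cycles_per_byte($measured);
printf "target:   %d KB per GHz-second, %d cycles per byte\n",
    $target / 1_000, cycles_per_byte($target);
```

With these inputs the measured rate works out to 10 KB per gigahertz second (100,000 cycles per byte) against a target of 1,000 cycles per byte, which is the hundredfold gap described above.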

So am I somehow getting the PurePerl parser instead of Expat? I'm
asking for Expat by name and die($parser) gives me
"XML::SAX::Expat=HASH(0x83b0090)".

Further, I simply cannot believe these results are typical. SAX was
invented to handle multi-megabyte documents that DOM can't fit
in-memory, but at these rates that would mean it would take a dual
4 GHz Xeon server twenty minutes just to parse a 100 megabyte XML
document and write it back out to disk unaltered. What happens when
you want to plug in a pipeline of any merit? I'm not really sure how
fast DOM is, but big servers can have 3 gigabytes of RAM (100 MB * 30x
for DOM memory bloat), and I know my web browser reads XHTML and
builds DOM trees out of it at better than 10 KB per gigahertz second.
So my results must, must be flawed somehow.

Does anyone know what could be going wrong, or how fast that code
snippet should be parsing XML? If I'm waiting around and then only
transforming nested XML nodes that match certain criteria (custom tag
names, attribute names, or maybe even just a custom namespace), sort
of like a templating engine (replace fake template data with data
pulled from a DB, or today's date for instance), would that mean there
is an XML solution more efficient for my goals than SAX? Twig maybe,
or Essex? (I can't find much to read about Essex.)

I was hoping to become an XML evangelist because I love everything
about it (and even understand namespaces and encodings ;) but these
results kind of made it feel like my bubble had burst. Everyone I know
keeps molesting their XML projects with Regex, which just seems to me
so much like towing an inoperative car around with a team of horses.

Any insight will be appreciated.

- - Jesse Thompson
Lightsecond Technologies
http://www.lightsecond.com/
 
Jürgen Kahrs

Jesse said:
Further, I simply cannot believe these results are typical. SAX was
invented to handle multi-megabyte documents that DOM can't fit

Reading XML files of several GigaBytes length
with a SAX-implementation (Expat) took me some
minutes. I have tested this while integrating
Expat into GNU Awk.

Jesse said:
I was hoping to become an XML evangelist because I love everything
about it (and even understand namespaces and encodings ;) but these
results kind of made it feel like my bubble had burst. Everyone I know
keeps molesting their XML projects with Regex, which just seems to me
so much like towing an inoperative car around with a team of horses.

I cannot help you with Perl, but maybe xmlgawk
can help you. This is GNU Awk extended (experimentally)
with Expat.
 
Jesse Thompson

Jürgen Kahrs said:
Reading XML files of several GigaBytes length
with a SAX-implementation (Expat) took me some
minutes. I have tested this while integrating
Expat into GNU Awk.

Yeah, well, several gigabytes in several minutes (let's say 1 gigabyte
per minute) on a 1 GHz machine would be 16 MB/GHz*s. That is so
fast I wouldn't know what to do with myself: one thousand six hundred
times faster than my results are showing (10 KB/GHz*s). Break me off a
piece, would you? :)

Jürgen Kahrs said:
I cannot help you with Perl, but maybe xmlgawk
can help you. This is GNU Awk extended (experimentally)
with Expat.

As interesting as that sounds I don't know anything about Awk.. But if
it's fast in awk it must also be fast in Perl. I simply have to
believe my results are atypical for Perl::SAX.

- - Jesse
 
Jesse Thompson

I am a bit surprised to hear about something built upon
SAX that is slow. Maybe this comparison helps you:

http://xmlbench.sourceforge.net/results/benchmark/index.html

Well, since you just quoted half the XML parsing benchmarks I've seen
on the internet, I'll go ahead and post the other half ;)
http://www.xml.com/pub/a/Benchmark/article.html

On that page you can notice something like a 30x speed difference
between C expat and Perl Expat. You're quoting a C/Java benchmark.
It's still not enough to explain my results though, Perl Expat (in
what looks like SAX1) is still shown to process 32 times faster than
my demonstration in SAX2. Perl::Expat vs. Perl::SAX::Expat *could*
explain that difference, but I don't buy it. :(

Just to recap, all I have is this:
XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
))->parse_uri("test.xml");

That is the simplest possible command: "eat this file with expat and
write it back out again". There is no application built on top of it
yet, that's just the raw drivers turning my CPU into a radiator. :)

- - Jesse
 
Malcolm Dew-Jones

Jesse Thompson wrote:
: Greetings, fellow XML folk.

: I've just gotten started making SAX filters in Perl. I was hoping to
: build an XML templating engine this way, but the performance of
: XML::SAX::Expat and XML::SAX::Writer *appears* to be unthinkably bad.

: This code:

: XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
: ))->parse_uri("test.xml");

: takes 1 second to parse a 5 kilobyte piece of XML on my machine. Being
: a 500 MHz machine, that's 10 kilobytes per gigahertz second.

: Is this in any way normal?

Might I suggest you ask this on

comp.lang.perl.modules

Various people hang out there who might have some definitive
answers, or useful suggestions.

$0.02
 
Bjoern Hoehrmann

* Jesse Thompson wrote in comp.text.xml:
This code:

XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
))->parse_uri("test.xml");

takes 1 second to parse a 5 kilobyte piece of XML on my machine. Being
a 500 MHz machine, that's 10 kilobytes per gigahertz second.

That does not say much. Did you eliminate all overhead from this test?
Like loading Perl, all the modules, initializing the parser, etc? If
not, that would explain a lot. Also note that a 10kb document is a bad
test case if you want to know how it performs with several MBs of data.
Further note that XML::SAX::Expat is a bad choice if you desire good
performance, what you are doing here is

Expat -> XML::Parser::Expat -> XML::Parser -> XML::SAX::Expat

along with other modules like XML::SAX::Base in the chain which is quite
some overhead. A better choice would be to use XML::SAX::ExpatXS which
omits XML::Parser::Expat and XML::Parser and other parts from the chain,
and is thus much faster. Other modules like XML::LibXML::SAX might give
even better performance. Another problem with your test is that you
generate XML in the chain through XML::SAX::Writer and then through
STDOUT, both of which might significantly slow down processing speed.
In other words, there might be many reasons why this might show poor
performance.

Using http://lists.w3.org/Archives/Public/www-archive/2004Mar/0169.html
as input your script takes 96 seconds on a 1066 MHz Mobile Celeron FWIW.
That's about 77KB per second. With no handler 40 seconds, 184KB/s. With
XML::LibXML::SAX and no handler 18 seconds, 409KB/s. And with the direct
Expat wrapper XML::SAX::ExpatXS and no handler 9 seconds, 819KB/s. Each
for a single run though, so the results are not all that meaningful. For
better results see `perldoc Benchmark`.
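
A comparison along those lines might look like the sketch below. It is only an illustration under some assumptions: `test.xml` is a placeholder file name, `available_parsers` is a helper made up for this example, and the three driver classes are the ones named in this thread (any that are not installed are skipped). The modules are loaded before timing starts, so start-up cost stays out of the measurement, and each run parses with no handler attached.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Return those classes from the argument list that can be loaded here.
# (Hypothetical helper, named for this example only.)
sub available_parsers {
    my @loaded;
    for my $class (@_) {
        (my $path = "$class.pm") =~ s{::}{/}g;
        push @loaded, $class if eval { require $path; 1 };
    }
    return @loaded;
}

my $file = $ARGV[0] // 'test.xml';    # placeholder input file

# Load the modules up front so that load time is not part of the timing.
my @parsers = available_parsers(
    qw(XML::SAX::Expat XML::SAX::ExpatXS XML::LibXML::SAX)
);

# One closure per driver; each call parses the file with no handler.
my %runs = map {
    my $class = $_;
    ( $class => sub { $class->new->parse_uri($file) } );
} @parsers;

if ( %runs && -e $file ) {
    # A negative count asks for at least that many CPU seconds per candidate.
    cmpthese( -5, \%runs );
}
else {
    print "Nothing to benchmark: need $file and at least one SAX driver\n";
}
```

To include the writer in the measurement, the closures could pass a `Handler` to `new` in the same way as the one-liner under discussion.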
 
Jürgen Kahrs

Bjoern said:
Using http://lists.w3.org/Archives/Public/www-archive/2004Mar/0169.html
as input your script takes 96 seconds on a 1066 MHz Mobile Celeron FWIW.
That's about 77KB per second. With no handler 40 seconds, 184KB/s. With
XML::LibXML::SAX and no handler 18 seconds, 409KB/s. And with the direct
Expat wrapper XML::SAX::ExpatXS and no handler 9 seconds, 819KB/s. Each
for a single run though, so the results are not all that meaningful. For
better results see `perldoc Benchmark`.

This is interesting. For comparison: On a 1200 MHz AMD Duron,
xmlgawk parses between 4000 and 5000 KB/s.
 
Bjoern Hoehrmann

* Jürgen Kahrs wrote in comp.text.xml:
This is interesting. For comparison: On a 1200 MHz AMD Duron,
xmlgawk parses between 4000 and 5000 KB/s.

As I wrote, these results don't tell you much. If you have no handler
the processor might be optimized to ignore all data and just evaluate
the document for well-formedness; others might not as it is uncommon to
have no start_element handler, for example. SGML::Parser::OpenSP, a soon
to be released SGML/XML processor based on OpenSP, is optimized like that
and for 100 iterations for the document generated via

`get http://www.w3.org/TR/REC-xml | tidy -utf8 -n --doctype omit`

with no handler versus with handlers for all events that don't do
anything,

            Rate OpenSP1 OpenSP2
OpenSP1   1.34/s      --    -87%
OpenSP2   10.4/s    677%      --

which just shows that creating Perl data structures is quite expensive.
XML::Parser has similar optimizations,

                 Rate XML::Parser1 XML::Parser2
XML::Parser1   7.25/s           --         -81%
XML::Parser2   39.0/s         438%           --

and both compared

                 Rate OpenSP2 XML::Parser2 OpenSP1 XML::Parser1
OpenSP2        1.35/s      --         -82%    -87%         -97%
XML::Parser2   7.34/s    445%           --    -31%         -81%
OpenSP1        10.6/s    685%          44%      --         -73%
XML::Parser1   39.0/s   2795%         431%    269%           --

XML::SAX::Expat is not optimized like that in any way and does many more
things than what XML::Parser1 would do. And as I wrote, the input is
highly relevant, too; using the 7.2 MB example above (which is quite
different in terms of markup/pcdata, etc.) I get

               s/iter OpenSP1 XML::Parser1
OpenSP1          3.89      --         -82%
XML::Parser1    0.681    470%           --

which is quite different from the 269% before. So here you'd get a rate
of about 10 MB/s using XML::Parser with throw-away processing versus the
about 2 MB/s of SGML::Parser::OpenSP.
 
Jürgen Kahrs

Bjoern said:
As I wrote, these results don't tell you much. If you have no handler
the processor might be optimized to ignore all data and just evaluate
the document for well-formedness; others might not as it is uncommon to
have no start_element handler, for example. SGML::Parser::OpenSP, a soon

Good argument, of course. But I really made sure that
all handlers were active. xmlgawk is stupid enough to
have all handlers active all the time. I also made sure
that the XML file was really parsed.

Bjoern said:
things than what XML::Parser1 would do. And as I wrote, the input is
highly relevant, too; using the 7.2 MB example above (which is quite
different in terms of markup/pcdata, etc.) I get

               s/iter OpenSP1 XML::Parser1
OpenSP1          3.89      --         -82%
XML::Parser1    0.681    470%           --

which is quite different from the 269% before. So here you'd get a rate
of about 10 MB/s using XML::Parser with throw-away processing versus the
about 2 MB/s of SGML::Parser::OpenSP.

Interesting, I just downloaded the file and xmlgawk (based
on Expat) parses around 2MB/s on a Pentium with 550 MHz;
which is not much different from 4MB/s with a Duron 1200 MHz.
 
William Park

Jürgen Kahrs said:
Good argument, of course. But I really made sure that
all handlers were active. xmlgawk is stupid enough to
have all handlers active all the time. I also made sure
that the XML file was really parsed.


Interesting, I just downloaded the file and xmlgawk (based
on Expat) parses around 2MB/s on a Pentium with 550 MHz;
which is not much different from 4MB/s with a Duron 1200 MHz.

On my P3/800, the 7.5MB file (W3C-Member-Validity.xml) takes
- 6 sec for Bash + Expat --> 1.2 MB/s
- 2.3 sec for Gawk + Expat --> 3.2 MB/s
which is in agreement with your data.
 
Jesse Thompson

Thank you Jürgen and Bjoern, that has all been very enlightening :)

So the way Bjoern puts it, Expat is quite fast, and thanks to some
streamlining ExpatXS is faster still (it bumped me up to 33 KB/GHz*s;
libXML appears incompatible with my Glib2.2 system at the moment), but
XML::SAX::Writer is very, very slow.

The reason I'm keeping that in is because it /is/ a constant in my
operations.. in that I'm trying to set up a mechanism where I read an
XML file, I have a flexible chain of filters that transform the file
in the SAX stream, and then it gets written back out again. I don't
know if there's an obvious way to factor out the need for Writer in a
scenario like this. Passing an event to a C-module has to be faster
than writing my own writing routines in Perl to avoid the SAX faucet.

But if Writer is the slow part I guess I'll want to research a faster
Writer? Also I have to wonder about SAX in general for my goals. I
don't think SAX supports any prefilters. My filters will be looking
for XML tags with certain names or attributes before they begin doing
their jobs, so if there was a prefilter, like my filter said "skip me
unless tagname =~ /^abc/ (or) tagname eq 'abc' (or)
defined(attribute->{'{mynamespace}process'})" that might make things
much quicker.
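
To make the idea concrete, here is a rough sketch of such a test written as a plain Perl predicate over the element hash that a Perl SAX2 driver passes to start_element. The namespace URI and the name `wants_element` are made up for this example. Note that this would only save the filter's own work: the parser still has to build and deliver every event, so it is not a true prefilter.

```perl
use strict;
use warnings;

# Hypothetical namespace for template markup; not taken from this thread.
my $TEMPLATE_NS = 'http://example.com/ns/template';

# Perl SAX2 keys the Attributes hash by "{NamespaceURI}LocalName".
my $PROCESS_ATTR = '{' . $TEMPLATE_NS . '}process';

# True if this start_element hash is one the filter should act on.
sub wants_element {
    my ($element) = @_;
    return 1 if $element->{LocalName} =~ /^abc/;
    return 1 if exists $element->{Attributes}{$PROCESS_ATTR};
    return 0;
}

# Hand-built element hashes standing in for real SAX events.
my @events = (
    { LocalName => 'abc-date', Attributes => {} },
    { LocalName => 'p',        Attributes => {} },
    { LocalName => 'div',
      Attributes => { $PROCESS_ATTR => { Value => 'yes' } } },
);

for my $element (@events) {
    printf "%-8s %s\n", $element->{LocalName},
        wants_element($element) ? 'process' : 'pass through';
}
```

Inside a real filter, start_element would call this first and forward the event to the next handler unchanged whenever it returns false.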

Otherwise, since nearly all of my files will be less than 200k, maybe
I should start looking at DOM.

Jürgen: your xmlgawk project sounds very very interesting. I took a
look at some Gawk tutorials and it looks like a capable tool for very
many applications, more so with XML support. Is there anywhere I could
snag it from? Google seems only to know of some of your discussions
with William over the project :)

Thank you again!

- - Jesse
 
Jürgen Kahrs

Jesse said:
Jürgen: your xmlgawk project sounds very very interesting. I took a
look at some Gawk tutorials and it looks like a capable tool for very
many applications, more so with XML support. Is there anywhere I could
snag it from? Google seems only to know of some of your discussions
with William over the project :)

There is very little documentation about xmlgawk currently.
A collection of pointers can be found in my
posting to comp.lang.awk on 2004-08-13:

http://groups.google.de/groups?hl=d...groups?hl=de&lr=&ie=UTF-8&group=comp.lang.awk
 
