Perl + SAX2 = slow?

Jesse Thompson · Sep 13, 2004

Greetings, fellow XML folk.

I've just gotten started making SAX filters in Perl. I was hoping to
build an XML templating engine this way, but the performance of
XML::SAX::Expat and XML::SAX::Writer *appears* to be unthinkably bad.

This code:

XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
))->parse_uri("test.xml");

takes 1 second to parse a 5 kilobyte piece of XML on my machine. Being
a 500 MHz machine, that's 10 kilobytes per gigahertz-second.

Is this in any way normal? I was hoping to be able to process XML
about a hundred times this fast, maybe 1 MB per gigahertz-second, or
about a thousand clock cycles per byte of consumed XML. I think that
sounds reasonable for the bare parsing/writing of XML in a zippy
language like Perl, so I have to assume I am doing something very,
very wrong in my setup.
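The normalised figures above are plain arithmetic. As a minimal sketch (the helper names `kb_per_ghz_second` and `cycles_per_byte` are made up for illustration, and 1 KB is assumed to be 1024 bytes), the measured rate and the hoped-for cycle budget can be checked like this:

```perl
use strict;
use warnings;

# Hypothetical helper: throughput normalised by clock speed, in
# kilobytes per gigahertz-second (1 KB = 1024 bytes assumed).
sub kb_per_ghz_second {
    my ($bytes, $seconds, $mhz) = @_;
    my $ghz_seconds = $seconds * $mhz / 1000;
    return ($bytes / 1024) / $ghz_seconds;
}

# Clock cycles spent per input byte at a given normalised rate.
sub cycles_per_byte {
    my ($kb_per_ghz_s) = @_;
    return 1e9 / ($kb_per_ghz_s * 1024);
}

my $measured = kb_per_ghz_second(5 * 1024, 1, 500);  # 5 KB in 1 s at 500 MHz
my $target   = 1024;                                 # 1 MB per GHz-second

printf "measured: %.0f KB/GHz-s, %.0f cycles per byte\n",
    $measured, cycles_per_byte($measured);
printf "target:   %.0f KB/GHz-s, %.0f cycles per byte\n",
    $target, cycles_per_byte($target);
```

Under those assumptions the measured run spends on the order of 100,000 cycles per byte, against the roughly 1,000 cycles per byte being hoped for.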

So am I somehow getting the PurePerl parser instead of Expat? I'm
asking for Expat by name and die($parser) gives me
"XML::SAX::Expat=HASH(0x83b0090)".

Further, I simply cannot believe these results are typical. SAX was
invented to handle multi-megabyte documents that DOM can't fit
in memory, but at these rates it would take a dual
4 GHz Xeon server twenty minutes just to parse a 100 megabyte XML
document and write it back out to disk unaltered. What happens when
you want to plug in a pipeline of any merit? I'm not really sure how
fast DOM is, but big servers can have 3 gigabytes of RAM (100 MB * 30x
for DOM memory bloat), and I know my web browser reads XHTML and
builds DOM trees out of it at better than 10 KB per gigahertz-second.
So my results must, must be flawed somehow.

Does anyone know what could be going wrong, or how fast that code
snippet should be parsing XML? If I'm waiting around and then only
transforming nested XML nodes that match certain criteria (custom tag
names, attribute names, or maybe even just a custom namespace), sort
of like a templating engine (replace fake template data with data
pulled from a DB, or today's date for instance), would that mean there
is an XML solution more efficient for my goals than SAX? Twig maybe,
or Essex? (I can't find much to read about Essex.)

I was hoping to become an XML evangelist because I love everything
about it (and even understand namespaces and encodings), but these
results kind of made it feel like my bubble had burst. Everyone I know
keeps molesting their XML projects with Regex, which just seems to me
so much like towing an inoperative car around with a team of horses.

Any insight will be appreciated.

-- Jesse Thompson
Lightsecond Technologies
http://www.lightsecond.com/

Jürgen Kahrs · Sep 13, 2004

Jesse said:
Further, I simply cannot believe these results are typical. SAX was
invented to handle multi-megabyte documents that DOM can't fit

Reading XML files several gigabytes long
with a SAX implementation (Expat) took me a few
minutes. I have tested this while integrating
Expat into GNU Awk.

I was hoping to become an XML evangelist because I love everything
about it (and even understand namespaces and encodings), but these
results kind of made it feel like my bubble had burst. Everyone I know
keeps molesting their XML projects with Regex, which just seems to me
so much like towing an inoperative car around with a team of horses.

I cannot help you with Perl, but maybe xmlgawk
can help you. This is GNU Awk extended (experimentally)
with Expat.

Jesse Thompson · Sep 13, 2004

Jürgen Kahrs said:
Reading XML files several gigabytes long
with a SAX implementation (Expat) took me a few
minutes. I have tested this while integrating
Expat into GNU Awk.

Yeah, well, several gigabytes in several minutes (let's say 1 gigabyte
per minute) on a 1 GHz machine would be 16 MB/GHz*s. That is so
fast I wouldn't know what to do with myself: one thousand six hundred
times faster than my results are showing (10 KB/GHz*s). Break me off a
piece, would you?

I cannot help you with Perl, but maybe xmlgawk
can help you. This is GNU Awk extended (experimentally)
with Expat.

As interesting as that sounds, I don't know anything about Awk. But if
it's fast in awk it must also be fast in Perl. I simply have to
believe my results are atypical for Perl::SAX.

-- Jesse

Jürgen Kahrs · Sep 13, 2004

Jesse said:
As interesting as that sounds, I don't know anything about Awk. But if
it's fast in awk it must also be fast in Perl. I simply have to
believe my results are atypical for Perl::SAX.

I am a bit surprised to hear about something built upon
SAX that is slow. Maybe this comparison helps you:

http://xmlbench.sourceforge.net/results/benchmark/index.html

Jesse Thompson · Sep 14, 2004

Jürgen Kahrs said:
I am a bit surprised to hear about something built upon
SAX that is slow. Maybe this comparison helps you:

http://xmlbench.sourceforge.net/results/benchmark/index.html

Well, since you just quoted half the XML parsing benchmarks I've seen
on the internet, I'll go ahead and post the other half:

http://www.xml.com/pub/a/Benchmark/article.html

On that page you can notice something like a 30x speed difference
between C expat and Perl Expat. You're quoting a C/Java benchmark.
It's still not enough to explain my results, though: Perl Expat (in
what looks like SAX1) is still shown to process 32 times faster than
my demonstration in SAX2. Perl::Expat vs. Perl::SAX::Expat *could*
explain that difference, but I don't buy it.

Just to recap, all I have is this:
XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
))->parse_uri("test.xml");

That is the simplest possible command: "eat this file with expat and
write it back out again". There is no application built on top of it
yet, that's just the raw drivers turning my CPU into a radiator.

-- Jesse

Malcolm Dew-Jones · Sep 14, 2004

Jesse Thompson wrote:
: Greetings, fellow XML folk.

: I've just gotten started making SAX filters in Perl. I was hoping to
: build an XML templating engine this way, but the performance of
: XML::SAX::Expat and XML::SAX::Writer *appears* to be unthinkably bad.

: This code:

: XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
: ))->parse_uri("test.xml");

: takes 1 second to parse a 5 kilobyte piece of XML on my machine. Being
: a 500 MHz machine, that's 10 kilobytes per gigahertz-second.

: Is this in any way normal?

Might I suggest you ask this on

comp.lang.perl.modules

Various people hang out there who might have some definitive
answers, or useful suggestions.

$0.02

Jürgen Kahrs · Sep 14, 2004

Jesse said:
Well, since you just quoted half the XML parsing benchmarks I've seen
on the internet, I'll go ahead and post the other half
http://www.xml.com/pub/a/Benchmark/article.html

Thanks for the link. Interesting.

Bjoern Hoehrmann · Sep 14, 2004

* Jesse Thompson wrote in comp.text.xml:

This code:

XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
))->parse_uri("test.xml");

takes 1 second to parse a 5 kilobyte piece of XML on my machine. Being
a 500 MHz machine, that's 10 kilobytes per gigahertz-second.

That does not say much. Did you eliminate all overhead from this test?
Like loading Perl, all the modules, initializing the parser, etc.? If
not, that would explain a lot. Also note that a 10 KB document is a bad
test case if you want to know how it performs with several MBs of data.
Further note that XML::SAX::Expat is a bad choice if you desire good
performance; what you are doing here is

Expat -> XML::Parser::Expat -> XML::Parser -> XML::SAX::Expat

along with other modules like XML::SAX::Base in the chain, which is quite
some overhead. A better choice would be to use XML::SAX::ExpatXS, which
omits XML::Parser::Expat, XML::Parser and other parts from the chain,
and is thus much faster. Other modules like XML::LibXML::SAX might give
even better performance. Another problem with your test is that you
generate XML in the chain through XML::SAX::Writer and then through
STDOUT, both of which might significantly slow down processing speed.
In other words, there might be many reasons why this might show poor
performance.

Using http://lists.w3.org/Archives/Public/www-archive/2004Mar/0169.html
as input your script takes 96 seconds on a 1066 MHz Mobile Celeron FWIW.
That's about 77KB per second. With no handler 40 seconds, 184KB/s. With
XML::LibXML::SAX and no handler 18 seconds, 409KB/s. And with the direct
Expat wrapper XML::SAX::ExpatXS and no handler 9 seconds, 819KB/s. Each
for a single run though, so the results are not all that meaningful. For
better results see `perldoc Benchmark`.
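A repeated-run comparison along the lines of `perldoc Benchmark` might look like the sketch below. It is only an illustration under stated assumptions: the input path defaults to the `test.xml` used earlier in the thread, the driver list is simply the three modules named above, `try_load` is a made-up helper, and any driver that is not installed is skipped. Modules are loaded before timing starts, so module loading is not counted, although parser construction still is.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Input file: first command-line argument, else the thread's test.xml.
my $file = @ARGV ? $ARGV[0] : 'test.xml';

# SAX drivers mentioned in the thread; missing ones are skipped.
my @drivers = qw(XML::SAX::Expat XML::SAX::ExpatXS XML::LibXML::SAX);

# Hypothetical helper: load a module by name at run time, 1 on success.
sub try_load {
    my ($class) = @_;
    (my $path = "$class.pm") =~ s{::}{/}g;
    return eval { require $path; 1 } ? 1 : 0;
}

my %tests;
for my $class (grep { try_load($_) } @drivers) {
    # No Handler is passed, matching the "no handler" runs quoted above.
    $tests{$class} = sub { $class->new->parse_uri($file) };
}

if (%tests and -r $file) {
    cmpthese(-3, \%tests);   # each driver runs for at least 3 CPU seconds
} else {
    print "No installed driver or no readable $file; nothing to time.\n";
}
```

To include the cost of writing the output, a handler such as the XML::SAX::Writer from the original one-liner could be passed to `new` in the same way as in that one-liner.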

Jürgen Kahrs · Sep 14, 2004

Bjoern said:
Using http://lists.w3.org/Archives/Public/www-archive/2004Mar/0169.html
as input your script takes 96 seconds on a 1066 MHz Mobile Celeron FWIW.
That's about 77KB per second. With no handler 40 seconds, 184KB/s. With
XML::LibXML::SAX and no handler 18 seconds, 409KB/s. And with the direct
Expat wrapper XML::SAX::ExpatXS and no handler 9 seconds, 819KB/s. Each
for a single run though, so the results are not all that meaningful. For
better results see `perldoc Benchmark`.

This is interesting. For comparison: On a 1200 MHz AMD Duron,
xmlgawk parses between 4000 and 5000 KB/s.

Bjoern Hoehrmann · Sep 14, 2004

* Jürgen Kahrs wrote in comp.text.xml:

This is interesting. For comparison: On a 1200 MHz AMD Duron,
xmlgawk parses between 4000 and 5000 KB/s.

As I wrote, these results don't tell you much. If you have no handler
the processor might be optimized to ignore all data and just evaluate
the document for well-formedness; others might not as it is uncommon to
have no start_element handler, for example. SGML::Parser::OpenSP, a soon
to be released SGML/XML processor based on OpenSP, is optimized like that
and for 100 iterations for the document generated via

`get http://www.w3.org/TR/REC-xml | tidy -utf8 -n --doctype omit`

with no handler versus with handlers for all events that don't do
anything,

          Rate OpenSP1 OpenSP2
OpenSP1 1.34/s      --    -87%
OpenSP2 10.4/s    677%      --

which just shows that creating Perl data structures is quite expensive.
XML::Parser has similar optimizations,

               Rate XML::Parser1 XML::Parser2
XML::Parser1 7.25/s           --         -81%
XML::Parser2 39.0/s         438%           --

and both compared

               Rate OpenSP2 XML::Parser2 OpenSP1 XML::Parser1
OpenSP2      1.35/s      --         -82%    -87%         -97%
XML::Parser2 7.34/s    445%           --    -31%         -81%
OpenSP1      10.6/s    685%          44%      --         -73%
XML::Parser1 39.0/s   2795%         431%    269%           --

XML::SAX::Expat is not optimized like that in any way and does many more
things than what XML::Parser1 would do. And as I wrote, the input is
highly relevant, too; using the 7.2 MB example above (which is quite
different in terms of markup/pcdata, etc.) I get

             s/iter OpenSP1 XML::Parser1
OpenSP1        3.89      --         -82%
XML::Parser1  0.681    470%           --

which is quite different from the 269% before. So here you'd get a rate
of about 10 MB/s using XML::Parser with throw-away processing versus the
about 2 MB/s of SGML::Parser::OpenSP.

Jürgen Kahrs · Sep 14, 2004

Bjoern said:
As I wrote, these results don't tell you much. If you have no handler
the processor might be optimized to ignore all data and just evaluate
the document for well-formedness; others might not as it is uncommon to
have no start_element handler, for example. SGML::Parser::OpenSP, a soon

Good argument, of course. But I really made sure that
all handlers were active. xmlgawk is stupid enough to
have all handlers active all the time. I also made sure
that the XML file was really parsed.

things than what XML::Parser1 would do. And as I wrote, the input is
highly relevant, too; using the 7.2 MB example above (which is quite
different in terms of markup/pcdata, etc.) I get

             s/iter OpenSP1 XML::Parser1
OpenSP1        3.89      --         -82%
XML::Parser1  0.681    470%           --

which is quite different from the 269% before. So here you'd get a rate
of about 10 MB/s using XML::Parser with throw-away processing versus the
about 2 MB/s of SGML::Parser::OpenSP.

Interesting, I just downloaded the file and xmlgawk (based
on Expat) parses around 2MB/s on a Pentium with 550 MHz;
which is not much different from 4MB/s with a Duron 1200 MHz.

William Park · Sep 14, 2004

Jürgen Kahrs said:
Good argument, of course. But I really made sure that
all handlers were active. xmlgawk is stupid enough to
have all handlers active all the time. I also made sure
that the XML file was really parsed.

Interesting, I just downloaded the file and xmlgawk (based
on Expat) parses around 2MB/s on a Pentium with 550 MHz;
which is not much different from 4MB/s with a Duron 1200 MHz.

On my P3/800, the 7.5 MB file (W3C-Member-Validity.xml) takes
- 6 sec for Bash + Expat --> 1.2 MB/s
- 2.3 sec for Gawk + Expat --> 3.2 MB/s
which is in agreement with your data.

Jesse Thompson · Sep 15, 2004

Thank you Jürgen and Bjoern, that has all been very enlightening.

So the way Bjoern puts it, Expat is quite fast, and due to some
streamlining ExpatXS is faster (it bumped me up to 33 KB/GHz*s; libXML
appears incompatible with my Glib2.2 system at the moment), but
XML::SAX::Writer is very, very slow.

The reason I'm keeping that in is because it /is/ a constant in my
operations, in that I'm trying to set up a mechanism where I read an
XML file, I have a flexible chain of filters that transform the file
in the SAX stream, and then it gets written back out again. I don't
know if there's an obvious way to factor out the need for Writer in a
scenario like this. Passing an event to a C module has to be faster
than writing my own writing routines in Perl to avoid the SAX faucet.

But if Writer is the slow part, I guess I'll want to research a faster
Writer? Also I have to wonder about SAX in general for my goals. I
don't think SAX supports any prefilters. My filters will be looking
for XML tags with certain names or attributes before they begin doing
their jobs, so if there was a prefilter, like my filter said "skip me
unless tagname =~ /^abc/ (or) tagname eq 'abc' (or)
defined(attribute->{'{mynamespace}process'})", that might make things
much quicker.
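The "skip me unless..." test can at least be written as a cheap predicate that a filter runs first in its start_element callback. This is only a sketch: `wants_element` and the namespace value are hypothetical, and it assumes the Perl SAX2 convention of element hashes with a Name key and Attributes keyed by "{NamespaceURI}LocalName". It does not stop the parser from delivering every event; it only keeps the filter's own work to a minimum for elements it does not care about.

```perl
use strict;
use warnings;

# Hypothetical namespace URI, echoing '{mynamespace}process' above.
my $NS = 'mynamespace';

# Cheap check on a SAX2-style element hash. Returns 1 when the filter
# should act on the element, 0 when it should just pass it along.
sub wants_element {
    my ($el) = @_;
    return 1 if $el->{Name} =~ /^abc/;    # also covers Name eq 'abc'
    return 1 if defined $el->{Attributes}{ '{' . $NS . '}process' };
    return 0;
}

# Hand-built sample events standing in for what a parser would send.
my @events = (
    { Name => 'abc:date', Attributes => {} },
    { Name => 'p',
      Attributes => { '{mynamespace}process' => { Value => 'yes' } } },
    { Name => 'div', Attributes => {} },
);

for my $el (@events) {
    printf "%-8s %s\n", $el->{Name},
        wants_element($el) ? 'process' : 'pass through';
}
```

In a filter derived from XML::SAX::Base, the pass-through branch would normally forward the event unchanged with `$self->SUPER::start_element($el)`.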

Otherwise, since nearly all of my files will be less than 200k, maybe
I should start looking at DOM.

Jürgen: your xmlgawk project sounds very, very interesting. I took a
look at some Gawk tutorials and it looks like a capable tool for very
many applications, more so with XML support. Is there anywhere I could
snag it from? Google seems only to know of some of your discussions
with William over the project.

Thank you again!

-- Jesse

Jürgen Kahrs · Sep 15, 2004

Jesse said:
Jürgen: your xmlgawk project sounds very very interesting. I took a
look at some Gawk tutorials and it looks like a capable tool for very
many applications, more so with XML support. Is there anywhere I could
snag it from? Google seems only to know of some of your discussions
with William over the project

There is very little documentation about xmlgawk currently.
A collection of pointers can be found in my
posting to comp.lang.awk on 2004-08-13:

http://groups.google.de/groups?hl=d...groups?hl=de&lr=&ie=UTF-8&group=comp.lang.awk
