Jesse Thompson
Greetings, fellow XML folk.
I've just gotten started making SAX filters in Perl. I was hoping to
build an XML templating engine this way, but the performance of
XML::SAX::Expat and XML::SAX::Writer *appears* to be unthinkably bad.
This code:
XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
))->parse_uri("test.xml");
takes 1 second to parse a 5 kilobyte piece of XML on my machine. On a
500 MHz CPU, that's 10 kilobytes per gigahertz-second.
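One way to check whether that second is really spent parsing, rather than on one-time start-up work, is to time several runs and compare the first against the rest. This is only a sketch: `time_runs` is a hypothetical helper name, and the stand-in workload is there so the script runs without the XML modules installed.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Run $code $runs times and return the wall-clock seconds of each run.
sub time_runs {
    my ($code, $runs) = @_;
    my @seconds;
    for (1 .. $runs) {
        my $start = [gettimeofday];
        $code->();
        push @seconds, tv_interval($start);
    }
    return @seconds;
}

# Stand-in workload (an assumption, so the sketch runs anywhere).
my $workload = sub { my $n = 0; $n += $_ for 1 .. 10_000; return $n };

# To time the real thing, load XML::SAX::Expat and XML::SAX::Writer
# and use the snippet from this post as the workload instead:
# my $workload = sub {
#     XML::SAX::Expat->new(Handler => XML::SAX::Writer->new( Output => '>-'
#     ))->parse_uri("test.xml");
# };

my @times = time_runs($workload, 5);
printf "run %d: %.4f s\n", $_ + 1, $times[$_] for 0 .. $#times;
```

If the first run is much slower than the later ones, most of the cost is probably set-up rather than per-byte parsing work.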
Is this in any way normal? I was hoping to be able to process XML
about a hundred times this fast, maybe 1 MB per gigahertz-second, or
about a thousand clock cycles per byte of consumed XML. I think that
sounds reasonable for the bare parsing/writing of XML in a zippy
language like Perl, so I have to assume I am doing something very,
very wrong in my setup.
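For anyone checking my numbers, the arithmetic can be sketched as follows. The function names are made up for illustration, and I am assuming decimal units (1 KB = 1000 bytes, 1 GHz = 1e9 cycles per second).

```perl
use strict;
use warnings;

# Throughput normalized by clock speed: kilobytes per gigahertz-second.
sub kb_per_ghz_second {
    my ($kilobytes, $seconds, $ghz) = @_;
    return $kilobytes / ($seconds * $ghz);
}

# Clock cycles spent per byte at a given normalized throughput.
# Assumes 1 KB = 1000 bytes and 1 GHz = 1e9 cycles per second.
sub cycles_per_byte {
    my ($kb_per_ghz_s) = @_;
    return 1e9 / ($kb_per_ghz_s * 1000);
}

my $measured = kb_per_ghz_second(5, 1, 0.5);   # 5 KB in 1 s at 500 MHz
my $target   = $measured * 100;                # "a hundred times this fast"

printf "measured: %d KB per GHz-second, %d cycles per byte\n",
    $measured, cycles_per_byte($measured);
printf "target:   %d KB per GHz-second, %d cycles per byte\n",
    $target, cycles_per_byte($target);
```

So the measured rate works out to about 100,000 cycles per byte, against the thousand I was hoping for.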
So am I somehow getting the PurePerl parser instead of Expat? I'm
asking for Expat by name and die($parser) gives me
"XML::SAX::Expat=HASH(0x83b0090)".
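Beyond printing the object, a rough way to see which back ends actually got loaded is to look in %INC. This sketch assumes that XML::SAX::Expat sits on top of XML::Parser (the Expat binding), so XML::Parser should show as loaded and XML::SAX::PurePerl should not. `loaded_backends` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Report which of the relevant modules are currently loaded, via %INC.
sub loaded_backends {
    my %loaded;
    for my $module (qw(XML::SAX::Expat XML::Parser XML::SAX::PurePerl)) {
        (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
        $loaded{$module} = $INC{$file} ? 1 : 0;
    }
    return \%loaded;
}

# Try to build the parser; the eval keeps the script alive if the
# module is not installed.
my $parser = eval { require XML::SAX::Expat; XML::SAX::Expat->new };
print "parser class: ",
    ($parser ? ref $parser : "XML::SAX::Expat failed to load"), "\n";

my $loaded = loaded_backends();
printf "%-20s %s\n", $_, ($loaded->{$_} ? "loaded" : "not loaded")
    for sort keys %$loaded;
```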
Further, I simply cannot believe these results are typical. SAX was
invented to handle multi-megabyte documents that DOM can't fit
in-memory, but at these rates that would mean it would take a dual
4 GHz Xeon server twenty minutes just to parse a 100 megabyte XML
document and write it back out to disk unaltered. What happens when
you want to plug in a pipeline of any merit? I'm not really sure how
fast DOM is, but big servers can have 3 gigabytes of RAM (100 MB * 30x
for DOM memory bloat), and I know my web browser reads XHTML and
builds DOM trees out of it at better than 10 KB per gigahertz-second.
So my results must, must be flawed somehow.
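The twenty-minute and 3-gigabyte figures come from a back-of-the-envelope estimate like the one below. It assumes decimal units, and (optimistically) that both CPUs can be kept fully busy, so 2 x 4 GHz counts as 8 GHz. `seconds_to_process` is a name made up for this sketch.

```perl
use strict;
use warnings;

# Seconds needed to process a document at a given normalized rate.
# Assumes 1 MB = 1000 KB and that all of $total_ghz is usable.
sub seconds_to_process {
    my ($megabytes, $kb_per_ghz_s, $total_ghz) = @_;
    return ($megabytes * 1000) / ($kb_per_ghz_s * $total_ghz);
}

my $seconds = seconds_to_process(100, 10, 2 * 4);   # 100 MB, 10 KB/GHz-s
printf "parse time: %d s (about %.1f minutes)\n", $seconds, $seconds / 60;

# DOM estimate: 100 MB document times a guessed 30x memory overhead.
my $dom_gigabytes = 100 * 30 / 1000;
printf "DOM memory: about %d GB\n", $dom_gigabytes;
```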
Does anyone know what could be going wrong, or how fast that code
snippet should be parsing XML? If I'm only passing most of the document
through and transforming the nested XML nodes that match certain
criteria (custom tag names, attribute names, or maybe even just a
custom namespace), sort of like a templating engine (replace fake
template data with data pulled from a DB, or today's date, for
instance), would that mean there is an XML solution more efficient for
my goals than SAX? Twig maybe, or Essex? (I can't find much to read
about Essex.)
I was hoping to become an XML evangelist because I love everything
about it (and even understand namespaces and encodings), but these
results kind of made it feel like my bubble had burst. Everyone I know
keeps molesting their XML projects with Regex, which just seems to me
so much like towing an inoperative car around with a team of horses.
Any insight will be appreciated.
-- Jesse Thompson
Lightsecond Technologies
http://www.lightsecond.com/