Slowness of SAX

Sigfried · Nov 12, 2008

Hi, using a Java profiler, I've realized that SAX is consuming too much
time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX takes about 50 % of the time!

Do you know a faster XML API?

Roedy Green · Nov 12, 2008

Do you know faster XML API ?

XML/SAX is inherently a high-overhead format, best used for small
files. Consider converting your file to something else, e.g. a
DataInputStream or a serialised stream, so you pay the overhead only
once.

See http://mindprod.com/jgloss/xml.html
for alternative processing techniques.
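
As a rough illustration of the "convert once" idea, here is a minimal sketch that stores doubles in binary with DataOutputStream and reads them back with DataInputStream. The class name BinaryDoubles and the length-prefixed layout are assumptions made for this example; nothing in the thread specifies a file layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

class BinaryDoubles {
    /** Writes a length prefix followed by the raw 8-byte form of each double. */
    static byte[] write(double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads the array back; no text parsing is involved. */
    static double[] read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            double[] values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
            return values;
        }
    }

    public static void main(String[] args) throws IOException {
        double[] original = {1.5, -2.25, Math.PI};
        double[] copy = read(write(original));
        System.out.println(Arrays.equals(original, copy));
    }
}
```

With real files you would wrap FileInputStream/FileOutputStream in buffered streams instead of the in-memory byte arrays used here.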

--
Roedy Green Canadian Mind Products
http://mindprod.com
Your old road is
Rapidly agin'.
Please get out of the new one
If you can't lend your hand
For the times they are a-changin'.

Lew · Nov 12, 2008

bugbear said:
Sigfried said:

Hi, using a java profiler, i've [sic] realized that SAX is consuming too
much time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX take about 50 % of the time !!

If all you're doing is parsing, what would you expect?
Indeed.

Give us more context.

I found SAX to be extremely fast, arguably the fastest (or tied for fastest) XML
parsing in Java. Back in 1999 we were able to parse a million rather large
documents in about three hours over a 10MB/s Ethernet connection using Java
1.2 on the hardware extant in those days using SAX, and it was very
parsimonious of memory. Parsers and JVMs (and hardware) have improved
considerably since then.

As bugbear points out, 50% of the time parsing is quite reasonable if at least
50% of the work to do is parsing, and if three-quarters of the work is parsing
you're money ahead.

Sigfried · Nov 12, 2008

bugbear wrote:

If all you're doing is parsing, what would you expect?

Give us more context.

I've tried the JDK 1.6 StAX implementation, which is 10 % faster, but the
DTD is ignored... So I guess StAX speed is about the same as SAX. I was
hoping to push XML parsing down to about 30 %.
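
For readers who have not used StAX, a minimal sketch of the pull-style loop follows. The element name "v" and the class name StaxSum are made up for the example; the calls are the standard javax.xml.stream cursor API, and no claim is made here about how its speed compares with SAX.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxSum {
    /** Sums the text of every <v> element (a hypothetical element name). */
    static double sum(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        double total = 0;
        try {
            // The application pulls events; the parser does not call back.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "v".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    total += Double.parseDouble(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sum("<values><v>1.5</v><v>2.25</v></values>"));
    }
}
```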

Tom Anderson · Nov 12, 2008

Hi, using a java profiler, i've realized that SAX is consuming too much time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX take about 50 % of the time !!

Which startElement and endElement methods are these? I assume not the ones
in the ContentHandler, right?

Do you know faster XML API ?

http://www.itu.int/rec/T-REC-X.891-200505-I/en
http://java.sun.com/developer/technicalArticles/xml/fastinfoset/

Although that's probably not what you meant.

But seriously, XML isn't fast. Never has been, never will be. If you need
fast, don't use XML. Fast XML parsing is like semi racing: even if you
win, you're still retarded.

tom

Tom Anderson · Nov 12, 2008

http://www.itu.int/rec/T-REC-X.891-200505-I/en
http://java.sun.com/developer/technicalArticles/xml/fastinfoset/

Although that's probably not what you meant.

But seriously, XML isn't fast. Never has been, never will be. If you need
fast, don't use XML. Fast XML parsing is like semi racing: even if you win,
you're still retarded.

Although you could try this:

http://piccolo.sourceforge.net/

tom

Arne Vajhøj · Nov 13, 2008

Roedy said:
XML/SAX is inherently a high-overhead format, best used for small
files.

No - SAX is the XML parser for huge files.

For small files DOM and XPath are much easier.

Arne

Arne Vajhøj · Nov 13, 2008

Sigfried said:
Hi, using a java profiler, i've realized that SAX is consuming too much
time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX take about 50 % of the time !!

Do you know faster XML API ?

SAX is usually the fastest XML parser.

And I can not see why you are surprised that the XML parser
uses most of the CPU time when doing XML parsing.

Arne

Daniel Pitts · Nov 13, 2008

Sigfried said:
Hi, using a java profiler, i've realized that SAX is consuming too much
time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX take about 50 % of the time !!

Do you know faster XML API ?

SAX uses callbacks. startElement/endElement probably call some code
that processes the result. It is *that* code which is taking up CPU
time; you should look at what is under that part of the call stack.
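
One way to check this is to parse the same input with a handler that does almost nothing and compare the timing with the real handler. The sketch below is only an illustration: CallbackCost, CountingHandler and the generated test document are invented for the example, and a single nanoTime measurement is a crude number, not a proper benchmark.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class CallbackCost {
    /** Handler that does almost nothing: it only counts start tags. */
    static class CountingHandler extends DefaultHandler {
        long elements;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    /** Parses the bytes with the given handler and returns elapsed nanoseconds. */
    static long timeParse(byte[] xml, DefaultHandler handler) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        long start = System.nanoTime();
        parser.parse(new ByteArrayInputStream(xml), handler);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        // Build a synthetic document of doubles, similar in spirit to the data described.
        StringBuilder sb = new StringBuilder("<values>");
        for (int i = 0; i < 100_000; i++) {
            sb.append("<v>").append(i * 0.5).append("</v>");
        }
        sb.append("</values>");
        byte[] xml = sb.toString().getBytes(StandardCharsets.UTF_8);

        CountingHandler baseline = new CountingHandler();
        long nanos = timeParse(xml, baseline);
        System.out.println(baseline.elements + " elements, baseline parse took "
                + (nanos / 1_000_000) + " ms");
    }
}
```

If the real handler takes much longer than this baseline on the same input, the extra time is being spent in the application code under the callbacks rather than in the parser.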

Lew · Nov 13, 2008

Arne said:
No - SAX is the XML parser for huge files.

For small files DOM and XPath is much easier.

Quite so. The advantage of SAX over DOM is that it is quite fast, very easy
on memory requirements and suitable for single-pass processing of XML
documents. Its disadvantage is that it does not keep an in-memory
representation of the XML document for repeated processing.

Arne Vajhøj · Nov 13, 2008

Lew said:
Quite so. The advantage of SAX over DOM is that it is quite fast, very
easy on memory requirements and suitable for single-pass processing of
XML documents. Its disadvantage is that it does not keep an in-memory
representation of the XML document for repeated processing.

Plus compared to XPath you need to write a lot of code to do some
advanced searching.

Arne

Mike Schilling · Nov 13, 2008

Lew said:
Quite so. The advantage of SAX over DOM is that it is quite fast,
very easy on memory requirements and suitable for single-pass
processing of XML documents. Its disadvantage is that it does not
keep an in-memory representation of the XML document for repeated
processing.

However, if you want to create an in-memory representation of a subset
of a huge document, SAX is the way to build it. In fact, making SAX
callbacks create a DOM (optionally filtering out part of the
document's content) is a pretty trivial exercise.
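
A minimal sketch of that idea, assuming a very simple filter rule (keep every element with a given name, plus everything inside it). SubsetDomBuilder, the "subset" wrapper element and the name-based filter are all invented for the example; namespaces are ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SubsetDomBuilder extends DefaultHandler {
    private final String wanted; // element name to keep; a made-up filter rule
    private final Document doc;
    private Node current;        // node that new children are appended to
    private int depth;           // > 0 while inside a kept element

    SubsetDomBuilder(String wanted) throws ParserConfigurationException {
        this.wanted = wanted;
        this.doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        this.current = doc.appendChild(doc.createElement("subset"));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0 || qName.equals(wanted)) {
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(e);
            current = e;
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            current = current.getParentNode();
            depth--;
        }
    }

    /** Parses the XML and returns a DOM holding only the wanted elements. */
    static Document build(String xml, String wanted) throws Exception {
        SubsetDomBuilder handler = new SubsetDomBuilder(wanted);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.doc;
    }

    public static void main(String[] args) throws Exception {
        Document d = build("<log><entry>a</entry><other>z</other><entry>b</entry></log>", "entry");
        System.out.println(d.getElementsByTagName("entry").getLength());
    }
}
```

Only the kept elements are ever materialised, so memory use follows the size of the subset rather than the size of the whole document.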

Lew · Nov 13, 2008

Arne said:
Plus compared to XPath you need to write a lot of code to do some
advanced searching.

That isn't the point of SAX. SAX lets you import XML-encoded information
directly into an in-memory structure. That is the "lot" of code you need to
write, though it is not necessarily all that much. Once you have your object
model built, there shouldn't be a need for "advanced searching"; you just
use the objects that you built directly.

If there is a need for advanced searching, then perhaps SAX is the wrong choice.
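
To make "import directly into an in-memory structure" concrete, here is a minimal sketch. The Point class and the point/x/y element names are hypothetical, chosen only to show the shape of such a handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PointLoader {
    /** Hypothetical domain object, invented for this sketch. */
    static final class Point {
        final double x;
        final double y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    static class PointHandler extends DefaultHandler {
        final List<Point> points = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private double x;
        private double y;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per text node
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            switch (qName) {
                case "x": x = Double.parseDouble(text.toString().trim()); break;
                case "y": y = Double.parseDouble(text.toString().trim()); break;
                case "point": points.add(new Point(x, y)); break;
                default: break;
            }
        }
    }

    /** Parses the XML straight into a list of Point objects. */
    static List<Point> load(String xml) throws Exception {
        PointHandler handler = new PointHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.points;
    }

    public static void main(String[] args) throws Exception {
        List<Point> pts = load("<points><point><x>1.0</x><y>2.0</y></point></points>");
        System.out.println(pts.size());
    }
}
```

After load() returns, the application works with the list of Point objects and never touches XML again.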

Arne Vajhøj · Nov 13, 2008

Lew said:
That isn't the point of SAX. SAX lets you import XML-encoded
information directly into an in-memory structure - that being the "lot"
of code you need to write but not really necessarily all that much.
Once you have your object model built, there shouldn't be a need for
"advanced searching", you just directly use the objects that you built.

If there is a need for advanced searching, then perhaps SAX is the wrong
choice.

The last is my point.

Doing //sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()
in SAX would require a lot more code than just a selectSingleNode
call.
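
For comparison, this is roughly what the one-call version looks like with the JDK's own javax.xml.xpath API. selectSingleNode is the name used by other XML libraries; XPath.evaluate is used here as the nearest JAXP equivalent. The class name XPathPick is made up for the sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathPick {
    static final String QUERY =
            "//sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()";

    /** Builds a DOM for the whole document, then evaluates the query against it. */
    static String pick(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Returns the string value of the first match, or "" if nothing matches.
        return xpath.evaluate(QUERY, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><sometag><someothertag>"
                + "<athirdtag someattr='foobar'/><afourthtag>hit</afourthtag>"
                + "</someothertag></sometag></root>";
        System.out.println(pick(xml));
    }
}
```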

Arne

Lew · Nov 13, 2008

The last is my point.

Doing
//sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()
in SAX would require a lot more code than just a selectSingleNode
call.

But that wouldn't even be SAX - it's an entirely different universe. I know
that's your point, but it leaves me confused. If you use SAX, there wouldn't
even be a need to search - everything would already be right where you could
find it. The whole question of searching would never even come up.

That is one of the advantages of SAX over DOM. With DOM, you have this huge
memory structure that you have to search with XPath expressions that are hard
to figure out and run really slowly. With SAX you read things right into an
object model where you don't have to look for things, and you can access them
directly. Searching is irrelevant.

Sigfried · Nov 13, 2008

Tom Anderson wrote:

Which startElement and endElement methods are these? I assume not the
ones in the ContentHandler, right?

http://www.itu.int/rec/T-REC-X.891-200505-I/en
http://java.sun.com/developer/technicalArticles/xml/fastinfoset/

Although that's probably not what you meant.

Your articles did convince me to use a binary format instead of a text
format. But Fast Infoset is still close to XML. Since my XML is mostly
Double.toString / parseDouble, I guess using Java serialization would be
a better (and bigger) step.
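
A minimal sketch of that step, assuming the data can be held as a plain double[] (the thread does not say how the values are actually structured). SerializedDoubles is a made-up class name; the streams are the standard java.io ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;

class SerializedDoubles {
    /** Serializes the array in binary form; no Double.toString is involved. */
    static byte[] serialize(double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(values);
        }
        return bytes.toByteArray();
    }

    /** Restores the array; no Double.parseDouble is involved. */
    static double[] deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (double[]) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[] original = {0.1, 2.5, -7.75};
        double[] copy = deserialize(serialize(original));
        System.out.println(Arrays.equals(original, copy));
    }
}
```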

But seriously, XML isn't fast. Never has been, never will be. If you
need fast, don't use XML. Fast XML parsing is like semi racing: even if
you win, you're still retarded.

lol, I knew that one about arguing on the internet.

Arne Vajhøj · Nov 16, 2008

Lew said:
The last is my point.

Doing
//sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()
in SAX would require a lot more code than just a selectSingleNode
call.

But that wouldn't even be SAX - it's an entirely different universe. I
know that's your point, but it leaves me confused. If you use SAX,
there wouldn't even be a need to search - everything would already be
right where you could find it. The whole question of searching would
never even come up.

That is one of the advantages of SAX over DOM. With DOM, you have this
huge memory structure that you have to search with XPath expressions
that are hard to figure out and run really slowly. With SAX you read
things right into an object model where you don't have to look for
things, and you can access them directly. Searching is irrelevant.

Not necessarily.

You can use SAX to just pick a small subset of the XML as well.

And have a need to code that "pick".

Arne

Robert Klemme · Nov 17, 2008

Lew said:

If there is a need for advanced searching, then perhaps SAX is the
wrong choice.

The last is my point.

Doing
//sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()
in SAX would require a lot more code than just a selectSingleNode
call.

But that wouldn't even be SAX - it's an entirely different universe.
I know that's your point, but it leaves me confused. If you use SAX,
there wouldn't even be a need to search - everything would already be
right where you could find it. The whole question of searching would
never even come up.

That is one of the advantages of SAX over DOM. With DOM, you have
this huge memory structure that you have to search with XPath
expressions that are hard to figure out and run really slowly. With
SAX you read things right into an object model where you don't have to
look for things, and you can access them directly. Searching is
irrelevant.

Not necessarily.

You can use SAX to just pick a small subset of the XML as well.

And have a need to code that "pick".

I fully agree with Lew: if you have to do XPath-like searching on the
subset you picked, you chose completely the wrong data structure for your
SAX processing.

If you meant that the subset picking should be done with XPath, then you
have a generic mechanism for which DOM is probably a better choice. If
your searching requirements are not as broad, you can easily create your
own simplified searching with SAX - and it's still more efficient for
this than DOM.
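
One possible shape for such simplified searching, sketched under the assumption that the "search" is just a fixed element path such as /catalog/item/price. PathPicker and the path syntax are invented for the example; it handles no predicates, attributes or namespaces, and it collects only text that sits directly inside the target element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PathPicker extends DefaultHandler {
    private final String targetPath;                         // e.g. "/catalog/item/price"
    private final StringBuilder path = new StringBuilder();  // path of the current element
    private final StringBuilder text = new StringBuilder();
    final List<String> matches = new ArrayList<>();

    PathPicker(String targetPath) {
        this.targetPath = targetPath;
    }

    private boolean atTarget() {
        return targetPath.contentEquals(path);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.append('/').append(qName);
        if (atTarget()) {
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (atTarget()) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (atTarget()) {
            matches.add(text.toString());
        }
        // Drop "/qName" from the end of the current path.
        path.setLength(path.length() - qName.length() - 1);
    }

    /** Returns the text of every element found at the given path. */
    static List<String> pick(String xml, String targetPath) throws Exception {
        PathPicker handler = new PathPicker(targetPath);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.matches;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog><item><price>1.5</price></item><item><price>2</price></item></catalog>";
        System.out.println(pick(xml, "/catalog/item/price"));
    }
}
```

Anything beyond this kind of fixed-path matching quickly turns into writing a partial XPath engine, which is where DOM plus XPath becomes the easier choice.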

robert

Arne Vajhøj · Nov 19, 2008

Robert said:
If you meant that the subset picking should be done with XPath then you
have a generic mechanism for which DOM is probably a better choice.

That was approx. my point.

Arne
