Slowness of SAX

Sigfried

Hi, using a java profiler, i've realized that SAX is consuming too much
time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX takes about 50% of the time!

Do you know a faster XML API?
 
Roedy Green

Do you know faster XML API ?

XML/SAX is inherently a high-overhead format, best used for small
files. Consider converting your file to something else, e.g. a
DataInputStream or a serialised stream, so you pay the overhead only
once.

See http://mindprod.com/jgloss/xml.html
for alternative processing techniques.

--
Roedy Green Canadian Mind Products
http://mindprod.com
Your old road is
Rapidly agin'.
Please get out of the new one
If you can't lend your hand
For the times they are a-changin'.
 
Lew

bugbear said:
Sigfried said:
Hi, using a java profiler, i've [sic] realized that SAX is consuming too
much time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX take about 50 % of the time !!

If all you're doing is parsing, what would you expect?
Indeed.

Give us more context.

I found SAX to be extremely fast, arguably the fastest (or tied for fastest)
way to parse XML in Java. Back in 1999 we were able to parse a million rather
large documents in about three hours over a 10MB/s Ethernet connection using
Java 1.2 and SAX on the hardware extant in those days, and it was very
parsimonious of memory. Parsers and JVMs (and hardware) have improved
considerably since then.

As bugbear points out, 50% of the time parsing is quite reasonable if at least
50% of the work to do is parsing, and if three-quarters of the work is parsing
you're money ahead.
 
Sigfried

bugbear wrote:
If all you're doing is parsing, what would you expect?

Give us more context.

I've tried the JDK 1.6 StAX implementation, which is 10% faster, but the
DTD is ignored... So I guess StAX speed is about the same as SAX. I was
hoping to get XML parsing down to 30% of the time.
 
Tom Anderson

Hi, using a java profiler, i've realized that SAX is consuming too much time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX take about 50 % of the time !!

Which startElement and endElement methods are these? I assume not the ones
in the ContentHandler, right?

Do you know a faster XML API?

http://www.itu.int/rec/T-REC-X.891-200505-I/en
http://java.sun.com/developer/technicalArticles/xml/fastinfoset/

Although that's probably not what you meant.

But seriously, XML isn't fast. Never has been, never will be. If you need
fast, don't use XML. Fast XML parsing is like semi racing: even if you
win, you're still retarded.

tom
 
Arne Vajhøj

Roedy said:
XML/SAX is inherently a high-overhead format, best used for small
files.

No - SAX is the XML parser for huge files.

For small files DOM and XPath are much easier.

Arne
 
Arne Vajhøj

Sigfried said:
Hi, using a java profiler, i've realized that SAX is consuming too much
time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX take about 50 % of the time !!

Do you know faster XML API ?

SAX is usually the fastest XML parser.

And I cannot see why you are surprised that the XML parser
uses most of the CPU time when doing XML parsing.

Arne
 
Daniel Pitts

Sigfried said:
Hi, using a java profiler, i've realized that SAX is consuming too much
time:
- endElement + startElement 40 %
- *.read 7 %
- a few <= 1%

So SAX take about 50 % of the time !!

Do you know a faster XML API?

SAX uses callbacks. startElement/endElement probably call some code
that processes the result. It is *that* code which is taking up CPU
time; you should look at what is under that part of the call stack.
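
To illustrate the point about callbacks, here is a minimal sketch of a SAX handler. The class and method names (CallbackCost, CountingHandler, countElements) are made up for this example and are not from the original poster's code. Whatever the application does inside startElement/endElement runs inside the parser's call stack, so a profiler will tend to attribute that time to those methods.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CallbackCost {
    /** Hypothetical handler: the counters stand in for real per-element work. */
    static class CountingHandler extends DefaultHandler {
        int starts = 0;
        int ends = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            starts++; // application code runs here, called by the parser
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            ends++; // likewise: this time shows up under the parser in a profile
        }
    }

    /** Parses the given XML text and returns {startElement calls, endElement calls}. */
    static int[] countElements(String xml) throws Exception {
        CountingHandler handler = new CountingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return new int[] { handler.starts, handler.ends };
    }

    public static void main(String[] args) throws Exception {
        int[] counts = countElements("<a><b/><b>text</b></a>");
        System.out.println(counts[0] + " starts, " + counts[1] + " ends");
    }
}
```

If the handler methods in the profile are your own, expanding them in the profiler's call tree should show whether the time is spent in the parser or in the code the callbacks invoke.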
 
Lew

Arne said:
No - SAX is the XML parser for huge files.

For small files DOM and XPath are much easier.

Quite so. The advantage of SAX over DOM is that it is quite fast, very easy
on memory requirements and suitable for single-pass processing of XML
documents. Its disadvantage is that it does not keep an in-memory
representation of the XML document for repeated processing.
 
Arne Vajhøj

Lew said:
Quite so. The advantage of SAX over DOM is that it is quite fast, very
easy on memory requirements and suitable for single-pass processing of
XML documents. Its disadvantage is that it does not keep an in-memory
representation of the XML document for repeated processing.

Plus compared to XPath you need to write a lot of code to do some
advanced searching.

Arne
 
Mike Schilling

Lew said:
Quite so. The advantage of SAX over DOM is that it is quite fast,
very easy on memory requirements and suitable for single-pass
processing of XML documents. Its disadvantage is that it does not
keep an in-memory representation of the XML document for repeated
processing.

However, if you want to create an in-memory representation of a subset
of a huge document, SAX is the way to build it. In fact, making SAX
callbacks create a DOM (optionally filtering out part of the
document's content) is a pretty trivial exercise.
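
As a rough sketch of that idea (the names SaxToDom, FilteringBuilder and build are invented for this example, and it ignores namespaces, comments and processing instructions), SAX callbacks can build a DOM while leaving out every element with a given name, together with its subtree:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxToDom {
    /** Hypothetical handler: copies SAX events into a DOM, skipping one element name. */
    static class FilteringBuilder extends DefaultHandler {
        private final Document doc;
        private final String skip;
        private final Deque<Node> stack = new ArrayDeque<>();
        private int skipDepth = 0; // > 0 while inside a skipped subtree

        FilteringBuilder(Document doc, String skip) {
            this.doc = doc;
            this.skip = skip;
            stack.push(doc); // the document node is the parent of the root element
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (skipDepth > 0 || qName.equals(skip)) {
                skipDepth++;
                return;
            }
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            stack.peek().appendChild(e);
            stack.push(e);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (skipDepth > 0) {
                skipDepth--;
                return;
            }
            stack.pop();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only attach text below an element, never directly to the document node.
            if (skipDepth == 0 && stack.size() > 1) {
                stack.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }
    }

    /** Builds a DOM from the XML text, omitting elements named skip and their content. */
    static Document build(String xml, String skip) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new FilteringBuilder(doc, skip));
        return doc;
    }
}
```

Memory use then depends on the size of the retained subset rather than on the size of the whole document.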
 
Lew

Arne said:
Plus compared to XPath you need to write a lot of code to do some
advanced searching.

That isn't the point of SAX. SAX lets you import XML-encoded information
directly into an in-memory structure - that being the "lot" of code you need
to write but not really necessarily all that much. Once you have your object
model built, there shouldn't be a need for "advanced searching", you just
directly use the objects that you built.

If there is a need for advanced searching, then perhaps SAX is the wrong choice.
 
Arne Vajhøj

Lew said:
That isn't the point of SAX. SAX lets you import XML-encoded
information directly into an in-memory structure - that being the "lot"
of code you need to write but not really necessarily all that much.
Once you have your object model built, there shouldn't be a need for
"advanced searching", you just directly use the objects that you built.

If there is a need for advanced searching, then perhaps SAX is the wrong
choice.

The last is my point.

Doing //sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()
in SAX would require a lot more code than just a selectSingleNode
call.
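
For comparison, a minimal sketch of the XPath side using the standard javax.xml.xpath API (selectSingleNode itself belongs to other libraries; the class name XPathLookup and the method find are made up for this example):

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathLookup {
    /** Returns the text of the first node matched by the expression, or "" if none. */
    static String find(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(
                "//sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()",
                new InputSource(new StringReader(xml)));
    }
}
```

The whole search is one expression; the SAX equivalent would have to track the element nesting and the attribute test by hand.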

Arne
 
Lew

The last is my point.

Doing
//sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()
in SAX would require a lot more code than just a selectSingleNode
call.

But that wouldn't even be SAX - it's an entirely different universe. I know
that's your point, but it leaves me confused. If you use SAX, there wouldn't
even be a need to search - everything would already be right where you could
find it. The whole question of searching would never even come up.

That is one of the advantages of SAX over DOM. With DOM, you have this huge
memory structure that you have to search with XPath expressions that are hard
to figure out and run really slowly. With SAX you read things right into an
object model where you don't have to look for things, and you can access them
directly. Searching is irrelevant.
 
Sigfried

Tom Anderson wrote:
Which startElement and endElement methods are these? I assume not the
ones in the ContentHandler, right?


http://www.itu.int/rec/T-REC-X.891-200505-I/en
http://java.sun.com/developer/technicalArticles/xml/fastinfoset/

Although that's probably not what you meant.

Your articles did convince me to use a binary format instead of a text
format. But fastinfoset is still close to XML. Since my XML is mostly
Double.toString / parseDouble, I guess using Java serialization would be
a better (and bigger) step.
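
A minimal sketch of what a binary format for doubles could look like, using DataOutputStream/DataInputStream rather than full object serialization. The class name DoubleStore and the length-prefixed layout are assumptions made for this example, not something from the thread:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class DoubleStore {
    /** Writes a count followed by the raw 8-byte value of each double. */
    static byte[] write(double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads back what write() produced; no text parsing is involved. */
    static double[] read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            double[] values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
            return values;
        }
    }
}
```

For a real file, the byte-array streams would be replaced by buffered file streams; the read/write calls stay the same.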

But seriously, XML isn't fast. Never has been, never will be. If you
need fast, don't use XML. Fast XML parsing is like semi racing: even if
you win, you're still retarded.

lol, I only knew that one as being about arguing on the internet.
 
Arne Vajhøj

Lew said:
The last is my point.

Doing
//sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()
in SAX would require a lot more code than just a selectSingleNode
call.

But that wouldn't even be SAX - it's an entirely different universe. I
know that's your point, but it leaves me confused. If you use SAX,
there wouldn't even be a need to search - everything would already be
right where you could find it. The whole question of searching would
never even come up.

That is one of the advantages of SAX over DOM. With DOM, you have this
huge memory structure that you have to search with XPath expressions
that are hard to figure out and run really slowly. With SAX you read
things right into an object model where you don't have to look for
things, and you can access them directly. Searching is irrelevant.

Not necessarily.

You can use SAX to just pick a small subset of the XML as well.

And have a need to code that "pick".

Arne
 
Robert Klemme

Lew said:
Lew said:
If there is a need for advanced searching, then perhaps SAX is the
wrong choice.
The last is my point.

Doing
//sometag/someothertag[athirdtag/@someattr='foobar']/afourthtag/text()
in SAX would require a lot more code than just a selectSingleNode
call.

But that wouldn't even be SAX - it's an entirely different universe.
I know that's your point, but it leaves me confused. If you use SAX,
there wouldn't even be a need to search - everything would already be
right where you could find it. The whole question of searching would
never even come up.

That is one of the advantages of SAX over DOM. With DOM, you have
this huge memory structure that you have to search with XPath
expressions that are hard to figure out and run really slowly. With
SAX you read things right into an object model where you don't have to
look for things, and you can access them directly. Searching is
irrelevant.

Not necessarily.

You can use SAX to just pick a small subset of the XML as well.

And have a need to code that "pick".

I fully agree with Lew: if you have to do XPath like searching on your
subset you picked the completely wrong data structure for your SAX
processing.

If you meant that the subset picking should be done with XPath then you
have a generic mechanism for which DOM is probably a better choice. If
your searching requirements are not as broad you can easily create your
own simplified searching with SAX - and it's still more efficient for
this than DOM.
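
A rough sketch of such simplified searching (the names SaxPathPicker and pick are invented here, and the matching rule, "element path ends with a given suffix", is just one possible simplification): the handler keeps the current element path as a string and collects the text of every element whose path ends with the requested suffix.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxPathPicker {
    /** Returns the text of each element whose path ends with suffix, e.g. "/item/price". */
    static List<String> pick(String xml, String suffix) throws Exception {
        List<String> hits = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder path = new StringBuilder();
            private StringBuilder text; // non-null only while inside a matching element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                path.append('/').append(qName);
                if (path.toString().endsWith(suffix)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length); // includes text of nested children
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (text != null && path.toString().endsWith(suffix)) {
                    hits.add(text.toString());
                    text = null;
                }
                path.setLength(path.lastIndexOf("/")); // pop the last path segment
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return hits;
    }
}
```

This handles only plain paths; attribute predicates or matches nested inside other matches would need more state, which is where a general XPath engine starts to pay off.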

robert
 
Arne Vajhøj

Robert said:
If you meant that the subset picking should be done with XPath then you
have a generic mechanism for which DOM is probably a better choice.

That was approx. my point.

Arne
 
