XPath, DOM and multithreading

FrenKy

Hi *,
can someone please suggest a thread-safe DOM implementation with XPath
support for reading XML files?

Or, if someone has a good source of hints on how to make a DOM
implementation thread-safe...

Thanks in advance!
 
Lew

FrenKy said:
Hi *,
can someone please suggest thread safe DOM implementation with support
for XPath for reading XML files?

Or if someone has a good source for hints how to make some DOM
implementation thread safe...

'synchronized' keyword.

What part do you want to be thread safe, parsing the XML document or
accessing the DOM that results?

The former is almost certainly not practicable. The second boils down
to what you do to make any object model thread safe.
 
FrenKy

Lew said:
What part do you want to be thread safe, parsing the XML document or
accessing the DOM that results?

The former is almost certainly not practicable. The second boils down
to what you do to make any object model thread safe.

I'm building the DOM in a single thread and then reading it in
several threads (usually not more than 20, depending on the number of CPUs;
I sometimes run it on machines with 100+ CPUs).
But sometimes (seldom) I get a NullPointerException in the most unexpected
locations during read operations... But _always_ after I've already built
the XML. The threads are started after the XML is built. So I figured I'm doing
something wrong with multithreading and sync. The same application run in
single-threaded mode does not throw a NullPointerException.

But I guess from your answer I should take another look at how I'm
accessing the XML DOM...
That is why I asked if there is some thread-safe implementation for
reading (guess I missed this part in my first post) XML DOM elements.
 
Lew

FrenKy said:
I'm building the DOM in a single thread and then I'm reading it in
several threads (usually not more than 20, depending on number of CPUs,
e.g. I'm running it sometimes on 100+ CPU machines).

Please provide an SSCCE.
<http://sscce.org/>

Without seeing what you're actually doing it is hard to know what you're doing
wrong.

FrenKy said:
But sometimes (seldom) I get NullPointerException on most unexpected
locations during read operations... But _always_ when I've already built
XML. Threads are started after xml file is built. So I figured I'm doing
something wrong with multithreading and sync. Same application run in
single threading mode does not throw NullPointerException.

Your description does not indicate what is wrong, but if the reading threads
are started from the same thread that built the DOM and only after the DOM is
completely constructed, and if nothing is changing the DOM, then threading is
not the problem.

FrenKy said:
But I guess from your answer I should take another look at how I'm doing
access to xml dom...
That is why I've asked if there is some thread safe implementation for
reading (guess I missed this part in my first post) xml DOM elements.

Once again, "[it] boils down to what you do to make any object model thread safe."

You didn't even describe your problem, much less show it, in your first post.
Now you've described your problem without showing it. With only a vague,
indirect description of your problem, one can only provide a vague, indirect
solution to it.

If it were a threading issue, then one would have to conclude that something
is altering your object model without proper synchronization after the threads
have started, or that they are started from a thread that does not have proper
synchronization with the thread that built the model. Otherwise it's not a
threading issue.
 
Daniel Pitts

FrenKy said:
Hi *,
can someone please suggest thread safe DOM implementation with support
for XPath for reading XML files?

Or if someone has a good source for hints how to make some DOM
implementation thread safe...

Thanks in advance!

XPath itself isn't multithread safe.

Are you sure you need multithreading for your use-case? If you have
something that is that performance intensive, perhaps a different
approach is called for (StAX/SAX based parsing of the XML file, Building
a domain object graph instead of a DOM, etc...)
 
Tom Anderson

Daniel Pitts said:
XPath itself isn't multithread safe.

XPath is a language - threadsafety is not a property it can have or lack.
I presume what you mean is that the XPath implementation is not threadsafe
- but since we don't know what the implementation in use here is, that's
an interesting statement. I imagine you mean one or more of (a) the XPath
implementation is not required to be threadsafe, so one shouldn't build
software that requires it to be, (b) you (Daniel) know or strongly suspect
which implementation is in use, and know it isn't thread safe, (c) there
are no XPath implementations which are threadsafe, or (d) it is impossible
for there to be an XPath implementation which is threadsafe. Could you
elaborate?

If you don't have any concrete information about the threadsafety of your
XPath implementation, it might be worth doing some basic stuff to ward off
threading bugs. Make sure that there are memory barriers between the last
write to the DOM tree by any thread and the reads that all the worker
threads are doing. One way to do this would be for the workers to queue up
by calling await() on a CountDownLatch set up with a count of 1, which the
parser thread then releases by calling countDown() on the latch. If you do
that and still get problems, then you know that the XPath implementation
is mutating the heap even when doing read-only operations, at which point
it's probably safe to conclude that XPath isn't going to cut it for you.
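
A minimal sketch of the latch idea above, assuming the JDK's built-in JAXP parser. The class name `LatchGatedReaders` and its `run` helper are made up for illustration. The extra tree walk before `countDown()` is an added assumption, not part of the suggestion: some DOM implementations build nodes lazily, and one full pass in the builder thread should leave nothing to materialize while the workers read.

```java
import java.io.StringReader;
import java.util.concurrent.CountDownLatch;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LatchGatedReaders {

    // Counts element nodes by following child and sibling links only.
    static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count += countElements(c);
        }
        return count;
    }

    // Starts the workers first, builds the DOM, then opens the latch.
    static int[] run(String xml, int workers) {
        CountDownLatch built = new CountDownLatch(1);
        Document[] holder = new Document[1]; // plain array: visibility comes from the latch
        int[] results = new int[workers];
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            final int slot = i;
            threads[i] = new Thread(() -> {
                try {
                    built.await(); // blocks until the builder calls countDown()
                    Document doc = holder[0];
                    results[slot] = (doc == null) ? -1 : countElements(doc);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            countElements(doc); // assumption: one full walk materializes lazily built nodes
            holder[0] = doc;    // last write before the latch opens
        } catch (Exception e) {
            throw new IllegalStateException("could not build the DOM", e);
        } finally {
            built.countDown(); // release the workers even if parsing failed
        }
        for (Thread t : threads) {
            try {
                t.join(); // join() also makes each worker's result visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return results;
    }
}
```

Calling `LatchGatedReaders.run(xml, 20)` returns one element count per worker. If the counts ever disagree, or a worker dies with an exception, the DOM reads themselves are touching shared state.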

Daniel Pitts said:
Are you sure you need multithreading for your use-case? If you have
something that is that performance intensive, perhaps a different
approach is called for

Presumably, if he's throwing >100 CPUs at it, it's because doing it
singlethreaded would take too long.

But ...

Daniel Pitts said:
(StAX/SAX based parsing of the XML file, building
a domain object graph instead of a DOM, etc...)

This sounds like a good idea to me. A problem big enough to need >100 CPUs
working on it is big enough to be worth expressing in an efficient form -
I believe DOM implementations are generally deeply inefficient internally.
Lots of linked lists and other pessimicity. Your own model could be more
efficient, and also threadsafe (which after all is not hard to achieve for
read-only data).
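
As a rough illustration of that approach, here is a sketch that streams the file with StAX and keeps only immutable objects. The document shape (`item` elements with `id` and `name` attributes) and the class names `Item` and `ItemCatalog` are hypothetical, since the thread never shows the real XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Immutable value object: final fields, no setters.
final class Item {
    final String id;
    final String name;

    Item(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

final class ItemCatalog {

    // Reads every <item id=".." name=".."/> element (hypothetical schema)
    // into an unmodifiable list of immutable Item objects.
    static List<Item> parse(String xml) {
        List<Item> items = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    // A null namespace URI means "do not check the namespace".
                    items.add(new Item(reader.getAttributeValue(null, "id"),
                                       reader.getAttributeValue(null, "name")));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return Collections.unmodifiableList(items);
    }
}
```

Nothing in the result can change after `parse` returns, so worker threads can read it without locking, provided the list reference is handed over safely (for example, by building it before the threads are started).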

tom
 
FrenKy

Thanks to all, guys. I have some thinking to do.

If I have more questions, I'll try to post them in SSCCE form ;)
 
Mike Schilling

FrenKy said:
I'm building the DOM in a single thread and then I'm reading it in
several threads (usually not more than 20, depending on number of
CPUs, e.g. I'm running it sometimes on 100+ CPU machines).
But sometimes (seldom) I get NullPointer exception on most unexpected
locations during read operations... But _always_ when I've already
built XML. Threads are started after xml file is built. So I figured
I'm doing something wrong with multithreading and sync. Same
application ran in single threading mode does not throw
NullPointerException.

Are you using Xerces for XPath? It builds another representation of the DOM
(called a DTM) on which to run the XPath expressions, and it builds it
incrementally. Thus, even if the DOM is fully built, and thus safe to
traverse, running XPath on it in two threads can result in exceptions. You
should be OK if you

1. Don't access a DOM until it's fully built.
2. Synchronize all use of XPath.
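
Point 2 could look roughly like this with the standard `javax.xml.xpath` API. The wrapper class `SynchronizedXPathReader` is invented for illustration; it sends every evaluation through a single lock, so the worker threads only run in parallel outside of XPath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class SynchronizedXPathReader {
    private final Document doc;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    // Point 1: the DOM is fully built here, before any reader can see it.
    SynchronizedXPathReader(String xml) {
        this.doc = parse(xml);
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Point 2: only one thread at a time may run an XPath query on this DOM.
    synchronized String evaluate(String expression) {
        try {
            return xpath.evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }
}
```

Share one `SynchronizedXPathReader` between the worker threads and call `evaluate` from any of them. The price is that XPath-heavy work is serialized, which may matter on a machine with many CPUs.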
 
Joshua Cranmer

Mike Schilling said:
Are you using Xerces for XPath? It builds another representation of the DOM
(called a DTM) on which to run the XPath expressions, and it builds it
incrementally. Thus, even if the DOM is fully built, and thus safe to
traverse, running XPath on it in two threads can result in exceptions. You
should be OK if you

1. Don't access a DOM until it's fully built.
2. Synchronize all use of XPath.

I don't know how the Xerces API works, but the DTM representation should
be local to the XPath object; if so, then creating a new object per
thread should do the trick.
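
One way to get an object per thread is a `ThreadLocal`, sketched here with the standard `javax.xml.xpath` API. The helper class `PerThreadXPath` is hypothetical. Whether a private `XPath` object is enough depends on where the implementation keeps its per-document state, and on whether the DOM itself tolerates concurrent reads; this sketch does not settle either question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class PerThreadXPath {
    // Each thread lazily creates, and then keeps, its own XPath instance.
    private static final ThreadLocal<XPath> XPATH =
            ThreadLocal.withInitial(() -> XPathFactory.newInstance().newXPath());

    // Returns the calling thread's private XPath object.
    static XPath current() {
        return XPATH.get();
    }

    // Evaluates the expression with the calling thread's XPath object.
    static String evaluate(String expression, Node context) {
        try {
            return current().evaluate(expression, context);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    // Convenience parser for trying the sketch out.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

Worker threads would call `PerThreadXPath.evaluate(expr, doc)` instead of sharing one `XPath` object. If exceptions still show up with this in place, the shared state is somewhere else and the fully synchronized approach is the safer fallback.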
 
Mike Schilling

Joshua said:
I don't know how the Xerces API works, but the DTM representation
should be local to the XPath object; if so, then creating a new
object per thread should do the trick.

It's actually local to the XPathContext object. That allows more choices,
like having multiple contexts, which would create multiple DTMs per DOM.
The best choice depends on your usage pattern: how many DOMs, how long
they're active for, how many threads each DOM is used in, etc.
 
