stale objects in collections

T

Timo Nentwig

Hi!

I'm not entirely sure whether the Set needs to be synchronized. I think
yes, but would like to ask people here anyway, pseudo-code:

class Test{
final Set set = Collections.synchronizedSet(new HashSet()):

class MyThread extends Thread{
void run(){
while(foo) {
// set is either written to read from never both
set.put(someObject):
}
}
}

public main(){
Thread t = new Thread[10];
for( int i = 0; i < t.length... )
(t = new MyThread()).start();

for( int i = 0; i < t.length... )
t.join();

// this thread may (not) see stale objects in the collection
// without synchronization (?)
dump(set);
}
}

Thanks a lot
Timo
 
E

Eric Sosman

Timo Nentwig wrote On 08/21/06 13:38,:
Hi!

I'm not entirely sure whether the Set needs to be synchronized. I think
yes, but would like to ask people here anyway, pseudo-code:

class Test{
final Set set = Collections.synchronizedSet(new HashSet()):

class MyThread extends Thread{
void run(){
while(foo) {
// set is either written to read from never both
set.put(someObject):
}
}
}

public main(){
Thread t = new Thread[10];
for( int i = 0; i < t.length... )
(t = new MyThread()).start();

for( int i = 0; i < t.length... )
t.join();

// this thread may (not) see stale objects in the collection
// without synchronization (?)
dump(set);
}
}


Hard to be sure of your intent, because the sample code
isn't really Java but a sort of Java-ish patois. But if
there's only supposed to be one Set shared by the whole bunch
of MyThreads, then yes: The Set needs synchronization because
multiple threads are calling its methods "simultaneously."
The synchronizedSet() wrapper provides all the synchronization
you need at the level of individual method calls, but you need
additional protection if you want to make a sequence of method
calls "atomic:"

// WRONG
if (set.isEmpty()) {
//
// "set" can change here
//
set.add("Elvis");
}

// RIGHT
synchronized(set) {
if (set.isEmpty()) {
set.add("Elvis");
}
}

There's no problem with staleness at the end of main()
because [1] all the competing threads have been joined and
thus can no longer interfere with the Set, and [2] the join()
call itself is a synchronization point for the purposes of
things like memory visibility.
 
T

Timo Nentwig

Eric said:
additional protection if you want to make a sequence of method
calls "atomic:"

// WRONG
if (set.isEmpty()) {
//
// "set" can change here
//
set.add("Elvis");
}

// RIGHT
synchronized(set) {
if (set.isEmpty()) {
set.add("Elvis");
}
}

Why isEmpty()? Why should I start a bunch of threads and only one can
actually write to the collections? We are proably talking at
cross-purposes here...

What I acutally do is starting multiple threads running the same SQL
statement (with different parameters) against a database and storing
the result in a Set. So, no duplicates, no Set.get(), no nothing.
Simply run query and put result in a Set. So, as I understand your
comment regarding join() below, I wouldn't need to synchronize the Set
(?).
thus can no longer interfere with the Set, and [2] the join()
call itself is a synchronization point for the purposes of
things like memory visibility.
 
E

Eric Sosman

Timo Nentwig wrote On 08/21/06 15:15,:
Why isEmpty()? Why should I start a bunch of threads and only one can
actually write to the collections? We are proably talking at
cross-purposes here...

It was just an illustration of a situation where the
synchronization provided by synchronizedSet() is not enough.
In the WRONG example, isEmpty() is atomic and add() is atomic,
but the combination is not; the set is unlocked in between
the two operations, and the state of the world can change in
that interval. The RIGHT example avoids this by synchronizing
during the entire compound operation, leaving no "window of
opportunity" for interference. The synchronized(set) {...}
is necessary in RIGHT, even though the set's methods are
already synchronized by the synchronizedSet() machinery.

I'll admit the isEmpty()/add() example is perhaps not
too realistic, but that's not the essential point.
What I acutally do is starting multiple threads running the same SQL
statement (with different parameters) against a database and storing
the result in a Set. So, no duplicates, no Set.get(), no nothing.
Simply run query and put result in a Set. So, as I understand your
comment regarding join() below, I wouldn't need to synchronize the Set
(?).

You need to synchronize the different threads' accesses to
the set, or chaos will ensue. After you've joined all those
threads they are no longer running, hence they can no longer
be mucking around with the set. So no: Once the interfering
threads are out of the picture, no synchronization is needed.

Have you considered doing things just a little differently?
Why not let each thread put its results in its own private set,
and then combine them when the threads are all finished? The
threads don't need to synchronize their accesses to the sets
(because each thread manipulates only its own set, and doesn't
interfere with any other thread's set). And after the worker
threads have been joined, the main thread can access their sets
without synchronization because (as before) the threads are no
longer around to create interference. Rough sketch:

class MyThread extends Thread {
private Set mySet = new HashSet();
public void run() {
// fill mySet with results
}
public Set getSet() {
return mySet;
}
}

class Driver {
public static void main(String[] unused) {
MyThread[] t = new MyThread[10];
for (int i = 0; i < t.length; ++i) {
t = new MyThread();
t.start();
}
Set bigSet = new HashSet();
for (int i = 0; i < t.length; ++i) {
t.join();
bigSet.addAll(t.getSet());
t = null; // optional
}
// bigSet now holds combined results
}
}
 
T

Timo Nentwig

Eric said:
Timo Nentwig wrote On 08/21/06 15:15,:
In the WRONG example, isEmpty() is atomic and add() is atomic,
but the combination is not; the set is unlocked in between

That's clear...
You need to synchronize the different threads' accesses to
the set, or chaos will ensue. After you've joined all those

Why? So, the answer to my initial question whether a collection can
contain stale
data if multiple threads write to a non-synchronized collection is:
yes?
Have you considered doing things just a little differently?
Why not let each thread put its results in its own private set,
and then combine them when the threads are all finished? The

Well, sounds good despite I don't like copying data twice of not
neccessary. Anyway I want to *understand* why the collections must be
synchronized (as I thought).
 
T

Timo Nentwig

Timo said:
Why? So, the answer to my initial question whether a collection can

According to Concurrency in Practice immutable objects are always
threads safe. So, a immutable collection can be accessed (read) by
multi threads.

Now I wonder whether the same applies for multi threads writing (and
not reading) to a collection. If Thread.join() does sync IMO the answer
is yes (?).
 
C

Chris Uppal

Timo said:
Why? So, the answer to my initial question whether a collection can
contain stale
data if multiple threads write to a non-synchronized collection is:
yes?

The problem is not "stale data", but that the Set implementation may /break/ if
you modify it from two threads at the same time without synchronisation. There
are other issues too, about visibility of modifications made by one thread to
code running on another thread.

The Set may (in fact will) depend on certain invariants being maintained in its
internal representation of the data. If two or more threads modify its state
simultaneously, then there is a very real risk that those invariants will be
broken, and so the Set will stop working: go into an infinite loop, return
incorrect result, or start throwing exceptions.

So there are three levels at which synchronisation may be needed.

1) At the level of individual accesses to the Set -- it's own methods. This is
non-optional because the Set will break if you don't protect it somehow.

2) At the level of ensuring that changes made by one thread are visible by
others. This is also non-optional according to the JLS, but may not actually
be necessary on some specific JVM/hardware combinations. Again,
synchronisation at the level of individual Set methods is sufficient.

3) At the level of the logical meaning of the data contained in the Set. It's
hard to invent a plausible example of this using Sets (it's trivially easy with
other Collections), but say your application expected that the Set always
contained an even number of elements -- there is no way you can ensure that
just by synchronising the Set's own methods. You have to synchronise at a
coarser level, so that two elements are always added to the Set in a single
synchronised block.

-- chris
 
E

Eric Sosman

Timo said:
According to Concurrency in Practice immutable objects are always
threads safe. So, a immutable collection can be accessed (read) by
multi threads.

Right.
Now I wonder whether the same applies for multi threads writing (and
not reading) to a collection. If Thread.join() does sync IMO the answer
is yes (?).

"Immutable" is the combination of "mutable" (meaning "changeable"
or "capable of change") with a negating prefix. When we say that
something is "immutable" we mean that it is "not mutable," that is,
that it does not or cannot change.

Adding elements to a collection changes the collection; removing
elements from a collection also changes the collection. Something
that changes is not "immutable," so if you are adding and removing
things the the /Concurrency in Practice/ statement does not apply.

Perhaps you think the operation of adding an element to a
collection is in some sense "write-only," and that this makes it
safe to use without synchronization? (I'm imagining an argument
something like: "The trouble with a mutable object is that different
threads can see the object's internals while they are in an inconsistent
state. But since `write-only' operations don't report the state, it
doesn't matter.") Is that anything like what you're thinking?

The argument is wrong. Trivially, the add() method for a Set
does in fact report some of the Set's state, namely, whether the
element was already present in the Set before the add(). More
deeply, the add() method must inspect and modify the Set's internals
to insert the new element. The details of just what those internals
look like depend on what kind of Set you're using, but they will
all involve reading the state (or parts of it), making some decisions
and computations, and storing new values in state variables. It may
seem "from the outside" that add() is write-only (if you decide to
ignore the returned value), but in fact it cannot be so. (Consider:
even `++x' is not write-only, because it must retrieve the previous
value in order to calculate the new one.)

You need synchronization whenever a mutable object is (or can
be) manipulated by multiple threads "simultaneously."

You can skip the synchronization if only one thread at a time
can access the mutable object. A very common special case of this
is when only one thread has a reference to the object; you need no
synchronization because no other thread can interfere. Another case
arises in your scenario: There's an initial phase during which many
threads pound on the collection, and a final phase when only one
thread works with it. You need synchronization in the first phase
(to keep all those threads from trampling on each other), but you
do not need it in the second (because there's only one thread).

You do not need synchronization for immutable objects, even if
many threads access them "simultaneously."
 
T

Timo Nentwig

Eric said:
Perhaps you think the operation of adding an element to a
collection is in some sense "write-only," and that this makes it
safe to use without synchronization? (I'm imagining an argument
something like: "The trouble with a mutable object is that different
threads can see the object's internals while they are in an inconsistent
state. But since `write-only' operations don't report the state, it
doesn't matter.") Is that anything like what you're thinking?
Yes.

The argument is wrong. Trivially, the add() method for a Set

Okay, I think (hope :) I got it now. Thanks for your and Chris'
comprehensive explanation.
 
P

Patricia Shanahan

Timo said:
That's clear...


Why? So, the answer to my initial question whether a collection can
contain stale
data if multiple threads write to a non-synchronized collection is:
yes?


Well, sounds good despite I don't like copying data twice of not
neccessary. Anyway I want to *understand* why the collections must be
synchronized (as I thought).

You may be able to avoid the second copy, depending on how the result
set is used. For example, if it is only accessed through an iterator,
you could write your own iterator that goes through each per-thread set
in turn.

However, I would only do things like that after trying the simple
one-set approach, with synchronization, and finding a real bottleneck.

Patricia
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,053
Latest member
BrodieSola

Latest Threads

Top