Request For Comments: exception safe ConditionVariable#wait

B

Bill Kelly

Hi everyone,

I'd recently run into some trouble trying to use timeout
with ConditionVariable#wait. (Then I tried some custom
solutions, which turned out to be unsafe themselves and
led me down that "unexpected deadlock" wild goose chase.)

I'd like to propose making ConditionVariable#wait itself
exception safe, so that using it with timeout should be
as simple as:

begin
timeout(timeout_secs) { condition_var.wait(mutex) }
rescue Timeout::Error
end

This is the original #wait from thread.rb (1.8.2preview2)

def wait(mutex)
begin
mutex.exclusive_unlock do
@waiters.push(Thread.current)
Thread.stop
end
ensure
mutex.lock
end
end

It has two problems that I know of. The most serious
is that I believe an exception can occur between the
"begin" and the "mutex.exclusive_unlock" line, such
that, when the "ensure" tries to reacquire the lock,
the mutex had not been unlocked yet, and the thread
hangs trying to get a lock on a mutex it already owns.
(I can get it to hang at the mutex.lock, and I presume
that's the reason.)

The second is that Thread.current is not removed from
@waiters when an exception occurs. This became a
problem in conjunction with timeout, because the
@waiters array was filling up with hundreds of
Thread.current references over time.

My attempt to address these two issues looks like:

def wait(mutex)
unlocked = false
begin
mutex.exclusive_unlock do
unlocked = true
@waiters.push(Thread.current)
Thread.stop
end
rescue # Timeout::Error
@waiters.delete Thread.current # Q: is Array#delete an atomic operation?
raise
ensure
mutex.lock if unlocked
end
end

Does this seem reasonable? Have I made any obvious errors?
Is this something that would best be proposed as an RCR?

I was wondering whether Array#delete is an atomic operation?
I would not want competing threads to both be trying to
delete from the same array simultaneously, if that's not safe.
(Do I need to put Thread.exclusive around it?)

Also - Have I done that rescue/raise properly, to re-raise
the exception? I'm asking because in testing with this
modified #wait, I have to hit ^C many times to interrupt
my process (both win32 and linux). It's as though the
Interrupt exception is being gobbled, which is not what I
would want.

... Hmm... I see that Timeout::Error's ancestors
include Interrupt, SignalException ... I am intentionally
catching Timeout::Error when i call #wait, like:

begin
timeout(timeout_secs) { condition_var.wait(mutex) }
rescue Timeout::Error
end

...but this shouldn't catch Interrupt exceptions
should it?


Anyway any comments/criticism/suggestions/feedback would be
most welcome . . . Is RCR the correct process for a proposal
like this?


Thanks,

Regards,

Bill
 
P

Paul Brannan

IMO, this isn't so much an exception-safety issue as it is a
thread-safety issue. To write exception-safe code, it is necessary to
guarantee that certain code will never raise an exception. For example,
suppose that I have implemented my own string class:

class MyString
def initialize(characters, encoding)
@characters = characters.dup
@encoding = encoding
end

attr_reader :characters, :encoding
end

now suppose I want to write MyString#replace. I can easily do this if I
can guarantee that swap() never raises an exception:

class MyString
def replace(other)
copy = other.dup
swap(copy)
end

def swap(other)
tmp_characters = other.characters
tmp_encoding = other.encoding
other.characters = self.characters
other.encoding = self.encoding
self.characters = tmp_characters
# better not get an exception here!
self.encoding = tmp_encoding
end
end

Now if I write:

s1 = MyString.new([?a, ?b, ?c], :ASCII)
s2 = MyString.new([?1, ?2, ?3], :ASCII)
timeout(x) do
s.replace(s2)
end

I'm in trouble, because suddenly I can get a TimeoutError from inside
MyString#swap.

Your solution for ConditionVariable#wait works for that one case, but
it's not generally applicable to other cases like this one. Instead,
for each case, I either have to set Thread.critical or I have to write
code to rollback any changes I make after an exception anywhere I
non-atomically change the state of an object. What a pain (and a
performance bottleneck!)

I'm not sure what a good general solution the problem is, but consider
this:

- a keyword (or method) is added to Ruby that indicates a method (or
block) never raises an exception
- if Thread#raise is called while the thread is executing that method
(or block), then set a flag and wait for the block to exit before
actually raising the exception
- If the user still wants to abort execution of that block, the user
can call Thread#force_raise or Thread#kill.

So now MyString#swap looks like this:

def swap(other)
no_raise do
tmp_characters = other.characters
tmp_encoding = other.encoding
other.characters = self.characters
other.encoding = self.encoding
self.characters = tmp_characters
# better not get an exception here!
self.encoding = tmp_encoding
end
end

A similar solution can be applied to your revised version of
ConditionVariable#wait (consider what happens if you time out twice,
e.g.):

def foo(timeout_secs)
begin
timeout(timeout_secs) { @condition_var.wait(@mutex) }
rescue Timeout::Error
end
end

timeout(5.01) do
foo(5)
end

Paul
 
B

Bill Kelly

Hi Paul,
IMO, this isn't so much an exception-safety issue as it is a
thread-safety issue.

With timeout, I'm thinking it's sort of a combination of
exceptions and threads. Timeout can raise an exception at
"any point" within the timeout block; but it uses a secondary
thread to perform the raise. Which means, things like
Thread.exclusive { } can be used to protect blocks of code
against the exception that would be raised by the timeout thread.

I hadn't thought about the more general exception safety
issues you bring up:
To write exception-safe code, it is necessary to
guarantee that certain code will never raise an exception. For example,
suppose that I have implemented my own string class:

class MyString
def initialize(characters, encoding)
@characters = characters.dup
@encoding = encoding
end

attr_reader :characters, :encoding
end

now suppose I want to write MyString#replace. I can easily do this if I
can guarantee that swap() never raises an exception:

class MyString
def replace(other)
copy = other.dup
swap(copy)
end

def swap(other)
tmp_characters = other.characters
tmp_encoding = other.encoding
other.characters = self.characters
other.encoding = self.encoding
self.characters = tmp_characters
# better not get an exception here!
self.encoding = tmp_encoding
end
end

Now if I write:

s1 = MyString.new([?a, ?b, ?c], :ASCII)
s2 = MyString.new([?1, ?2, ?3], :ASCII)
timeout(x) do
s.replace(s2)
end

I'm in trouble, because suddenly I can get a TimeoutError from inside
MyString#swap.

Your solution for ConditionVariable#wait works for that one case, but
it's not generally applicable to other cases like this one. Instead,
for each case, I either have to set Thread.critical or I have to write
code to rollback any changes I make after an exception anywhere I
non-atomically change the state of an object. What a pain (and a
performance bottleneck!)

I agree - but this seems to be in the nature of timeout
itself. My proposed changes to ConditionVariable#wait were
indeed intended to only ensure that that one method, #wait,
would function properly when used inside a timeout block.
(Because I happen to be in need of a #wait that I can time
out on. :)

I see what you mean now (exception safety vs. thread
safety.) I guess the email subject should more properly
be, "timeout-safe ConditionVariable#wait". :)

My sense is that currently, one has to be very careful
where one uses a timeout. I don't use timeout very often,
so a lack of a general solution isn't necessarly giving me
too much grief. But I do need to be sure that whatever I
do use timeout with, is safe. (At least that's my current
mindset.)
I'm not sure what a good general solution the problem is, but consider
this:

- a keyword (or method) is added to Ruby that indicates a method (or
block) never raises an exception
- if Thread#raise is called while the thread is executing that method
(or block), then set a flag and wait for the block to exit before
actually raising the exception
- If the user still wants to abort execution of that block, the user
can call Thread#force_raise or Thread#kill.

So now MyString#swap looks like this:

def swap(other)
no_raise do
tmp_characters = other.characters
tmp_encoding = other.encoding
other.characters = self.characters
other.encoding = self.encoding
self.characters = tmp_characters
# better not get an exception here!
self.encoding = tmp_encoding
end
end

Interesting. Kind of like an interrupt mask for exceptions. :)

If you only wanted to protect against timeout, and not
exceptions in general, I think (as you've already
observed) you could:

def swap(other)
Thread.exclusive do
tmp_characters = other.characters
tmp_encoding = other.encoding
other.characters = self.characters
other.encoding = self.encoding
self.characters = tmp_characters
# better not get an exception here!
self.encoding = tmp_encoding
end
end
A similar solution can be applied to your revised version of
ConditionVariable#wait (consider what happens if you time out twice,
e.g.):

def foo(timeout_secs)
begin
timeout(timeout_secs) { @condition_var.wait(@mutex) }
rescue Timeout::Error
end
end

timeout(5.01) do
foo(5)
end

I'm sorry, I'm not sure I'm following. Does the nested
timeout here violate any safeguards inside the revised
ConditionVariable#wait implementation? I realize the
rescue Timeout::Error there is not sufficient to
squelch timeout exceptions from any nested/outer timeout
block. Your outer timeout would need to rescue its own
exceptions as well. . . . Is that what you were referring
to? Or am I not understanding your example?


Thanks for your feedback !!


Regards,

Bill
 
P

Paul Brannan

I'm sorry, I'm not sure I'm following. Does the nested
timeout here violate any safeguards inside the revised
ConditionVariable#wait implementation? I realize the
rescue Timeout::Error there is not sufficient to
squelch timeout exceptions from any nested/outer timeout
block. Your outer timeout would need to rescue its own
exceptions as well. . . . Is that what you were referring
to? Or am I not understanding your example?

Your revised ConditionVariable#wait:

def wait(mutex)
unlocked = false
begin
mutex.exclusive_unlock do
unlocked = true
@waiters.push(Thread.current)
Thread.stop
end
rescue # Timeout::Error
@waiters.delete Thread.current # Q: is Array#delete an atomic operation?
raise
ensure
mutex.lock if unlocked
end
end

What happens if you get a timeout first inside the exclusive_unlock
block, then inside the rescue (before the delete occurs)?

(I believe that this is a pretty rare race condition, but is still
possible).

Paul
 
B

Bill Kelly

Hi Paul,
Your revised ConditionVariable#wait:

def wait(mutex)
unlocked = false
begin
mutex.exclusive_unlock do
unlocked = true
@waiters.push(Thread.current)
Thread.stop
end
rescue # Timeout::Error
@waiters.delete Thread.current # Q: is Array#delete an atomic operation?
raise
ensure
mutex.lock if unlocked
end
end

What happens if you get a timeout first inside the exclusive_unlock
block, then inside the rescue (before the delete occurs)?

(I believe that this is a pretty rare race condition, but is still
possible).

Wow, nice catch! Thanks - !

Hmm... Would that mean it's possible for

ensure
# -------> %exception here%
mutex.lock if unlocked
end

potentially an exception firing once we're in the
ensure block as well? Or is ensure implemented to
have the properties of your no_raise style block,
I wonder?

Well, the reason I had proposed the timeout-safe
wait, or wanted to implement it using timeout,
is because I thought it would be easier than the
signal-based timed_wait I had also implemented.

I wrote what I believed to be a thread-safe (and
hopefully exception safe) timed_wait using signals,
to sidestep the "timeout" exception throwing issue.
But it came out to a disconcerting 36 lines...

01: class ConditionVariable
02:
03: def timed_wait(mutex, timeout_secs)
04: waiter_th = Thread.current
05: alarm_th = nil
06: unlocked = false
07: begin
08: awake = false
09:
10: alarm_th = Thread.start do
11: sleep timeout_secs
12: Thread.exclusive do
13: unless awake
14: awake = true
15: @waiters.delete waiter_th
16: waiter_th.wakeup
17: end
18: end
19: end
20:
21: Thread.exclusive do
22: unless awake
23: mutex.exclusive_unlock do
24: unlocked = true
25: @waiters.push(Thread.current)
26: begin
27: Thread.stop
28: ensure
29: awake = true
30: end
31: end
32: end
33: end
34: ensure
35: mutex.lock if unlocked
36: alarm_th.kill if alarm_th and alarm_th.alive?
37: end
38: end
39:
40: end

I say disconcerting because the code in thread.rb is
so beautifully simple. (Although, as we have seen,
it is not necessarily 100% well-protected, on the other
hand.) But this routine above seemed more complicated
than I'd like. [Note - I didn't care if it worked with
timeout, since it doesn't use timeout.]

. .

Wow your observation has really got me curious now:
Can the exception raised by timeout occur anywhere
in:
- a rescue block?
- an ensure block?

That would seem to make it really hard to guard
against...


Thanks for your thoughts !


Regards,

Bill
 
P

Paul Brannan

Wow, nice catch! Thanks - !

Hmm... Would that mean it's possible for

ensure
# -------> %exception here%
mutex.lock if unlocked
end

potentially an exception firing once we're in the
ensure block as well? Or is ensure implemented to
have the properties of your no_raise style block,
I wonder?

Actually the exception has the potential to occur when calling the
Mutex#lock method, if Ruby decides to switch threads just at the point
of the call (so it's not really between the ensure and the next
statement).
Wow your observation has really got me curious now:
Can the exception raised by timeout occur anywhere
in:
- a rescue block?
- an ensure block?

I'm not an expert on the Ruby interpreter, but if I understand eval.c
correctly, it can occur any time Ruby can switch threads, which
includes but is not limited to:
- each time a node is evaluated
- each time (just before) a method is called
- each time an extension waits for IO using rb_thread_select
That would seem to make it really hard to guard
against...

I agree. I've never really liked the thread-based timeout for exactly
that reason (that and my C++ extensions don't always play nicely with
Ruby threads, since they keep data on the stack). Instead I use an
event loop, and let the event loop handle my timeouts; this guarantees
that timeout exceptions are only raised from well-defined points in the
code.

Paul
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,059
Latest member
cryptoseoagencies

Latest Threads

Top