[BUG?] apparent hang (win32 and Linux), have stack trace

B

Bill Kelly

------=_NextPart_000_0393_01C48AEB.490BE340
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

Hi Ruby interals gurus !

I'm hoping anyone can tell me, based on the attached stack
trace (from CVS 1.8.2preview2, [i386-mswin32]), whether this
is indicitave of a bug in Ruby.

I have a fairly small script now, which hangs on both
win32 (1.8.2preview2) and linux (1.8.0). It takes anywhere
from a few minutes to a few hours to lock up. (You can
just start it at let it run, no interaction required.)

On Linux, I am able to ^C the process and get a ruby
stack trace. On Windows ^C doesn't abort (common enough
when stuck in Windows IO routines), but I was able to
attach the debugger to the process and get a stack trace.

The script is at:
http://bwk.homeip.net/ftp/buffered-io-test4a.rb

The ruby stack trace I get when interrupting the hung
process on Linux, is always at or immediately following
the Thread.stop in ConditionVariable#wait:

/usr/local/lib/ruby/1.8/thread.rb:192:in `stop': Interrupt from /usr/local/lib/ruby/1.8/thread.rb:192:in `wait'
from /usr/local/lib/ruby/1.8/thread.rb:190:in `exclusive_unlock'
from /usr/local/lib/ruby/1.8/thread.rb:140:in `exclusive'
from /usr/local/lib/ruby/1.8/thread.rb:140:in `exclusive_unlock'
from /usr/local/lib/ruby/1.8/thread.rb:190:in `wait'
from buffered-io-test4a.rb:59:in `global_wait'
from buffered-io-test4a.rb:58:in `synchronize'
from buffered-io-test4a.rb:58:in `global_wait'
from buffered-io-test4a.rb:87
from buffered-io-test4a.rb:72:in `loop'
from buffered-io-test4a.rb:88

The windows stack trace (attached) always looks about the
same each time, too. (rb_thread_stop -> rb_thread_schedule
-> rb_w32_select).


Thanks for your help !!


Regards,

Bill


------=_NextPart_000_0393_01C48AEB.490BE340
Content-Type: text/plain;
name="hang-1.8.2preview2-stacktrace.txt"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
filename="hang-1.8.2preview2-stacktrace.txt"


"The process appears to be deadlocked (or is not running any user-mode =
code)."

[[note: I believe gc_mark_children() to be showing up erroneously in =
the trace.]]

NTDLL.DLL!77f82ca0() =20
msafd.dll!74fd3bac() =20
msvcrt-ruby18.dll!gc_mark_children(unsigned long ptr=3D1, int =
lev=3D37778976) Line 753 + 0xc C
ws2_32.dll!750312f5() =20
msvcrt-ruby18.dll!rb_w32_select(int nfds=3D1, fd_set * rd=3D0x02407620, =
fd_set * wr=3D0x02407750, fd_set * ex=3D0x02407870, timeval * =
timeout=3D0x00000000) Line 1870 + 0x1e C
1865: RUBY_CRITICAL({
1866: r =3D select(nfds, rd, wr, ex, timeout);
1867: if (r =3D=3D SOCKET_ERROR) {
1868: errno =3D map_errno(WSAGetLastError());
1869: }
1870: });
msvcrt-ruby18.dll!rb_thread_schedule() Line 10190 + 0x2b C
10189:
10190: n =3D select(max+1, &readfds, &writefds, &exceptfds, =
delay_ptr);
10191: if (n < 0) {
msvcrt-ruby18.dll!rb_thread_stop() Line 10882 C
10880: curr_thread->status =3D THREAD_STOPPED;
10881: rb_thread_schedule();
10882: curr_thread->status =3D last_status;
msvcrt-ruby18.dll!call_cfunc() Line 5405 + 0x7 C
msvcrt-ruby18.dll!rb_call0() Line 5536 + 0x1c C
msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44588960, unsigned long =
recv=3D44588984, unsigned long mid=3D5041, int argc=3D0, const unsigned =
long * argv=3D0x00000000, int scope=3D0) Line 5757 + 0x28 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44517416, RNode * =
n=3D0x02a717c8) Line 3257 + 0xb4 C
msvcrt-ruby18.dll!rb_yield_0(unsigned long val=3D6, unsigned long =
self=3D44517416, unsigned long klass=3D0, int flags=3D0, int avalue=3D0) =
Line 4745 + 0xd C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
n=3D0x02a71ff0) Line 3053 + 0x16 C
msvcrt-ruby18.dll!rb_yield_0(unsigned long val=3D6, unsigned long =
self=3D44514224, unsigned long klass=3D0, int flags=3D0, int avalue=3D0) =
Line 4745 + 0xd C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44588984, RNode * =
n=3D0x02a74360) Line 3053 + 0x16 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44588984, RNode * =
n=3D0x02a74300) Line 3201 + 0x10 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44588984, RNode * =
n=3D0x02a741c8) Line 3113 + 0x10 C
msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44588960, unsigned long =
recv=3D44588984, unsigned long mid=3D11041, int argc=3D0, const unsigned =
long * argv=3D0x00000000, int scope=3D0) Line 5757 + 0x28 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
n=3D0x02a723b0) Line 3257 + 0xb4 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
n=3D0x02a71fa8) Line 2977 + 0x10 C
msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44529080, unsigned long =
recv=3D44514224, unsigned long mid=3D11121, int argc=3D0, const unsigned =
long * argv=3D0x00000000, int scope=3D0) Line 5757 + 0x28 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44517416, RNode * =
n=3D0x02a71900) Line 3257 + 0xb4 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44517416, RNode * =
n=3D0x02a71768) Line 2977 + 0x10 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44517416, RNode * =
n=3D0x02a71648) Line 3113 + 0x10 C
msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44528408, unsigned long =
recv=3D44517416, unsigned long mid=3D8545, int argc=3D1, const unsigned =
long * argv=3D0x0240c008, int scope=3D0) Line 5757 + 0x28 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
n=3D0x02a75ec0) Line 3257 + 0xb4 C
msvcrt-ruby18.dll!rb_yield_0(unsigned long val=3D6, unsigned long =
self=3D44601944, unsigned long klass=3D0, int flags=3D0, int avalue=3D0) =
Line 4745 + 0xd C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
n=3D0x02a72698) Line 3053 + 0x16 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
n=3D0x02a725c0) Line 3113 + 0x10 C
msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44529080, unsigned long =
recv=3D44514224, unsigned long mid=3D9929, int argc=3D0, const unsigned =
long * argv=3D0x00000000, int scope=3D0) Line 5757 + 0x28 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
n=3D0x02a75f68) Line 3257 + 0xb4 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
n=3D0x02a75e60) Line 2977 + 0x10 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
n=3D0x02a75cb0) Line 3113 + 0x10 C
msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44607896, unsigned long =
recv=3D44601944, unsigned long mid=3D9945, int argc=3D1, const unsigned =
long * argv=3D0x0240e93c, int scope=3D1) Line 5757 + 0x28 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
n=3D0x02a74e88) Line 3272 + 0x9f C
msvcrt-ruby18.dll!rb_yield_0(unsigned long val=3D6, unsigned long =
self=3D44601944, unsigned long klass=3D0, int flags=3D0, int avalue=3D0) =
Line 4745 + 0xd C
msvcrt-ruby18.dll!rb_f_loop() Line 4889 + 0xf C
msvcrt-ruby18.dll!call_cfunc() Line 5405 + 0x7 C
msvcrt-ruby18.dll!rb_call0() Line 5536 + 0x1c C
msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44607752, unsigned long =
recv=3D44601944, unsigned long mid=3D3929, int argc=3D0, const unsigned =
long * argv=3D0x00000000, int scope=3D1) Line 5757 + 0x28 C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
n=3D0x02a74e58) Line 3272 + 0x9f C
msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
n=3D0x02a77528) Line 2977 + 0x10 C
msvcrt-ruby18.dll!eval_node() Line 1287 + 0xd C
msvcrt-ruby18.dll!ruby_exec() Line 1456 + 0x12 C
msvcrt-ruby18.dll!ruby_run() Line 1477 + 0x5 C
ruby.exe!main() Line 52 C
ruby.exe!mainCRTStartup() Line 398 + 0xe C
KERNEL32.DLL!7c581af6() =20


------=_NextPart_000_0393_01C48AEB.490BE340--
 
B

Bill Kelly

Hi again,

I wondered if I could check an assumption or two here...
I have a slightly tweaked version of the prior script,
which for whatever reason seems to hang even more quickly.
(Seconds to minutes....)
http://bwk.homeip.net/ftp/buffered-io-test4b.rb

I've recompiled ruby without optimization so I can see
what's going on in the debugger more easily.

One assumption I'd like to check, is whether a
(delay == DELAY_INFTY) select should *ever* occur in
rb_thread_schedule while my script is running.

I am thinking this should not occur, because my main
thread *only* ever sleeps in the following situation:

begin
alarm_th = Thread.start {
sleep timeout_secs
$global_signal.broadcast # %%WAS: .signal
}
$global_mutex.synchronize {
print "{"
$global_signal.wait($global_mutex)
print "}"
}
ensure
alarm_th.kill if alarm_th and alarm_th.alive?
end

...that is, my main thread only goes to sleep when
another thread has specifically been created with
a non-INFTY delay (sleep timeout_secs) in order to
wake my main thread up. So, I'd think that there
should be no way to get a (delay == DELAY_INFTY) in
rb_thread_schedule.

...UNLESS... . . . The other assumption I'd like to
check:

My main thread also reads and writes to a socket:

if select(nil, [$cl_sock], nil, 0)
print "W"
sent = $cl_sock.send(" #$frame ", 0)
end

if select([$cl_sock], nil, nil, 0)
print "R"
dat = $cl_sock.recv(65536)
end

... my assumption here is that send/recv WILL NOT
block, because select just told me the socket is
readable or writable. (No other thread is touching
this socket.)


According to what I'm seeing in the debugger, what
is apparently happening is that I eventually get
a (delay == DELAY_INFTY) select, and it waits
forever.

Also - it would appear this hanging condition only
occurs when the main thread is in the:

print "{"
-> $global_signal.wait($global_mutex)
print "}"

...and never in the send/recv part of the code.

So it seems a (delay == DELAY_INFTY) is happening
there, when I think it shouldn't.... Am I wrong?


Thanks for your help,

Regards,

Bill
 
B

Bill Kelly

Hi yet again, :)

I think my script is simple enough to post now...
After running it for a few minutes, I get:

w{-}{}{}S+w{-}{}{}S+w{-}{}{}S+w{-}{
deadlock 0x2a74888: sleep:- - c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195
deadlock 0x2a85628: sleep:- (main) - c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195
c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195:in `stop': Thread(0x2a85628): deadlock (fatal)
from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195:in `wait'
from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:193:in `exclusive_unlock'
from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:142:in `exclusive'
from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:142:in `exclusive_unlock'
from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:193:in `wait'
from buffered-io-test4d.rb:46:in `global_wait'
from buffered-io-test4d.rb:44:in `synchronize'
from buffered-io-test4d.rb:44:in `global_wait'
from buffered-io-test4d.rb:66
from buffered-io-test4d.rb:60:in `loop'
from buffered-io-test4d.rb:60


I believe this script is simple enough that based on
inspection we should be able to say there is no way
it could deadlock.

I'll intersperse some comments to try to support this
assertion:

01:
02: require 'socket'
03: require 'thread'
04: require 'timeout'
05:
06: $frame = 0
07:
08: $global_mutex = Mutex.new
09: $global_signal = ConditionVariable.new
10:
11: $background_mutex = Mutex.new
12: $background_signal = ConditionVariable.new
13:
14: $sock = TCPSocket.new("curmudgeon", 27999)

# "curmudgeon:27999" is a server that will merely
# send back a prompt then wait for us to respond.
# We are not going to respond - so after we've read
# the initial prompt, we expect line 18 to wait
# forever in the select. This is by intention, as
# it seems to be important to have a thread doing
# a time-unlimited select to get these problems to
# occur.
#
# (Note that we can replace line 14 with:
# $server = TCPServer.new(12345)
# $sock = TCPSocket.new("localhost", 12345)
# and get similar erroneous behavior. But for
# whatever reason the localhost version exhibits
# the "hanging" behavior, while the "curmudgeon:27999"
# version prefers to deadlock.)

# So - once background_read has read the initial
# data it receives from the server, we expect it
# to just wait forever on line 18.

15:
16: def background_read
17: loop do
18: if select([$sock], nil, nil, nil)
19: print "r"
20: dat = $bg_rd_sock.recv(65536)
21: $global_signal.signal
22: end
23: end
24: end
25:

# background_write is the only code owning the
# $background_mutex... so there should never be
# any deadlock contention there. So all we do
# is repeatedly wait for the main thread to
# signal us. Then print chars to stdout and go
# back to sleep.

26: def background_write
27: loop do
28: $background_mutex.synchronize {
29: print "-"
30: $background_signal.wait($background_mutex)
31: print "+"
32: }
33:
34: print "w"
35: end
36: end
37:

# global_wait creates a secondary alarm thread to
# awaken the primary thread doing the wait after
# the timeout elapses. Line 44 is the only code
# owning the $global_mutex, so there should not be
# any deadlock contention there.
# Line 46 is where the deadlocks and hangs always
# occur.

38: def global_wait(timeout_secs)
39: begin
40: alarm_th = Thread.start {
41: sleep timeout_secs
42: $global_signal.broadcast # %%WAS: .signal
43: }
44: $global_mutex.synchronize {
45: print "{"
46: $global_signal.wait($global_mutex)
47: print "}"
48: }
49: ensure
50: alarm_th.kill if alarm_th and alarm_th.alive?
51: end
52: end
53:
54:
55: bg_read_th = Thread.new { background_read }
56: bg_write_th = Thread.new { background_write }
57:
58: $stdout.sync = true
59:

# Our main loop does nothing but periodically signal
# the background_write thread, and sleep in global_wait.

60: loop do
61: $frame += 1
62: if ($frame % 3) == 0
63: print "S"
64: $background_signal.signal
65: end
66: global_wait(0.003)
67: end
68:



This script will either deadlock, or hang, on both
Windows (1.8.2preview2) and Linux (1.8.0).

I believe it is simple enough to claim that by inspection
it should never deadlock or hang. (Unless I am just
being dense.)

I will begin delving into rb_thread_shedule to try to
determine what is going awry. Any tips from the
experts would be most welcome. :)


Regards,

Bill
 
B

Bill Kelly

15:
16: def background_read
17: loop do
18: if select([$sock], nil, nil, nil)
19: print "r"
20: dat = $bg_rd_sock.recv(65536)
21: $global_signal.signal
22: end
23: end
24: end
25:

Whoops ---- line 20 has a typo. That thread was
exiting anyway. <blush> .... However, that just
makes the script simpler... I've removed the
background_read thread entirely, and its associated
socket. I'm still getting the deadlock...

Hopefully have this figured out soon.


Regards,

Bill
 
N

nobu.nokada

Hi,

At Fri, 27 Aug 2004 05:18:27 +0900,
Bill Kelly wrote in [ruby-talk:110584]:
This script will either deadlock, or hang, on both
Windows (1.8.2preview2) and Linux (1.8.0).

1.9 doesn't seem to deadlock, I'll investigate it more.
 
B

Bill Kelly

At Fri, 27 Aug 2004 05:18:27 +0900,
Bill Kelly wrote in [ruby-talk:110584]:
This script will either deadlock, or hang, on both
Windows (1.8.2preview2) and Linux (1.8.0).

1.9 doesn't seem to deadlock, I'll investigate it more.

Hi Nobu - Thanks!

I'm just writing another email right now, the problem
seems to be in my global_wait function.. =(


Thanks for investigating !!! Sorry to waste your time...


Regards,

Bill
 
B

Bill Kelly

Hello once again,
Hopefully have this figured out soon.

Er. Well I was hoping all this wouldn't turn out to
be ENTIRELY in my code... <rueful grin>

It appears the problem is with my wait/timeout code:

def global_wait(timeout_secs)
begin
alarm_th = Thread.start {
sleep timeout_secs
$global_signal.broadcast
}
$global_mutex.synchronize {
############## if alarm_th's broadcast happens
############## before we get to the wait here,
############## then, obviously, we may never wake up
$global_signal.wait($global_mutex)
}
ensure
alarm_th.kill if alarm_th and alarm_th.alive?
end
end


As a workaround, I've changed it to:

def global_wait(timeout_secs)
begin
awake = false
alarm_th = Thread.start {
sleep timeout_secs
while not awake
$global_signal.broadcast
sleep(0.001) # WAS: Thread.pass
end
}
$global_mutex.synchronize {
$global_signal.wait($global_mutex)
awake = true
}
ensure
alarm_th.kill if alarm_th and alarm_th.alive?
end
end

...So the alarm thread sticks around until it's
*sure* the primary thread has awakened.

I guess it's pretty cheesy, with the sleep(0.001)
and all... But as long as it's foolproof, it will
work for my purposes.

I'm now wondering if I should add any begin/rescue
in the alarm thread, in case #broadcast could ever
raise an exception.... This code will be in a
server that I want to run indefinitely, with no
problems... :)


Thanks everyone who followed the thread; sorry if
I've wasted your time . . .



Regards,

Bill
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,756
Messages
2,569,535
Members
45,008
Latest member
obedient dusk

Latest Threads

Top