[BUG?] apparent hang (win32 and Linux), have stack trace

Discussion in 'Ruby' started by Bill Kelly, Aug 26, 2004.

  1. Bill Kelly

    Bill Kelly Guest

    ------=_NextPart_000_0393_01C48AEB.490BE340
    Content-Type: text/plain;
    charset="iso-8859-1"
    Content-Transfer-Encoding: 7bit

    Hi Ruby interals gurus !

    I'm hoping anyone can tell me, based on the attached stack
    trace (from CVS 1.8.2preview2, [i386-mswin32]), whether this
    is indicitave of a bug in Ruby.

    I have a fairly small script now, which hangs on both
    win32 (1.8.2preview2) and linux (1.8.0). It takes anywhere
    from a few minutes to a few hours to lock up. (You can
    just start it at let it run, no interaction required.)

    On Linux, I am able to ^C the process and get a ruby
    stack trace. On Windows ^C doesn't abort (common enough
    when stuck in Windows IO routines), but I was able to
    attach the debugger to the process and get a stack trace.

    The script is at:
    http://bwk.homeip.net/ftp/buffered-io-test4a.rb

    The ruby stack trace I get when interrupting the hung
    process on Linux, is always at or immediately following
    the Thread.stop in ConditionVariable#wait:

    /usr/local/lib/ruby/1.8/thread.rb:192:in `stop': Interrupt from /usr/local/lib/ruby/1.8/thread.rb:192:in `wait'
    from /usr/local/lib/ruby/1.8/thread.rb:190:in `exclusive_unlock'
    from /usr/local/lib/ruby/1.8/thread.rb:140:in `exclusive'
    from /usr/local/lib/ruby/1.8/thread.rb:140:in `exclusive_unlock'
    from /usr/local/lib/ruby/1.8/thread.rb:190:in `wait'
    from buffered-io-test4a.rb:59:in `global_wait'
    from buffered-io-test4a.rb:58:in `synchronize'
    from buffered-io-test4a.rb:58:in `global_wait'
    from buffered-io-test4a.rb:87
    from buffered-io-test4a.rb:72:in `loop'
    from buffered-io-test4a.rb:88

    The windows stack trace (attached) always looks about the
    same each time, too. (rb_thread_stop -> rb_thread_schedule
    -> rb_w32_select).


    Thanks for your help !!


    Regards,

    Bill


    ------=_NextPart_000_0393_01C48AEB.490BE340
    Content-Type: text/plain;
    name="hang-1.8.2preview2-stacktrace.txt"
    Content-Transfer-Encoding: quoted-printable
    Content-Disposition: attachment;
    filename="hang-1.8.2preview2-stacktrace.txt"


    "The process appears to be deadlocked (or is not running any user-mode =
    code)."

    [[note: I believe gc_mark_children() to be showing up erroneously in =
    the trace.]]

    NTDLL.DLL!77f82ca0() =20
    msafd.dll!74fd3bac() =20
    msvcrt-ruby18.dll!gc_mark_children(unsigned long ptr=3D1, int =
    lev=3D37778976) Line 753 + 0xc C
    ws2_32.dll!750312f5() =20
    msvcrt-ruby18.dll!rb_w32_select(int nfds=3D1, fd_set * rd=3D0x02407620, =
    fd_set * wr=3D0x02407750, fd_set * ex=3D0x02407870, timeval * =
    timeout=3D0x00000000) Line 1870 + 0x1e C
    1865: RUBY_CRITICAL({
    1866: r =3D select(nfds, rd, wr, ex, timeout);
    1867: if (r =3D=3D SOCKET_ERROR) {
    1868: errno =3D map_errno(WSAGetLastError());
    1869: }
    > 1870: });

    msvcrt-ruby18.dll!rb_thread_schedule() Line 10190 + 0x2b C
    10189:
    >10190: n =3D select(max+1, &readfds, &writefds, &exceptfds, =

    delay_ptr);
    10191: if (n < 0) {
    msvcrt-ruby18.dll!rb_thread_stop() Line 10882 C
    10880: curr_thread->status =3D THREAD_STOPPED;
    >10881: rb_thread_schedule();

    10882: curr_thread->status =3D last_status;
    msvcrt-ruby18.dll!call_cfunc() Line 5405 + 0x7 C
    msvcrt-ruby18.dll!rb_call0() Line 5536 + 0x1c C
    msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44588960, unsigned long =
    recv=3D44588984, unsigned long mid=3D5041, int argc=3D0, const unsigned =
    long * argv=3D0x00000000, int scope=3D0) Line 5757 + 0x28 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44517416, RNode * =
    n=3D0x02a717c8) Line 3257 + 0xb4 C
    msvcrt-ruby18.dll!rb_yield_0(unsigned long val=3D6, unsigned long =
    self=3D44517416, unsigned long klass=3D0, int flags=3D0, int avalue=3D0) =
    Line 4745 + 0xd C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
    n=3D0x02a71ff0) Line 3053 + 0x16 C
    msvcrt-ruby18.dll!rb_yield_0(unsigned long val=3D6, unsigned long =
    self=3D44514224, unsigned long klass=3D0, int flags=3D0, int avalue=3D0) =
    Line 4745 + 0xd C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44588984, RNode * =
    n=3D0x02a74360) Line 3053 + 0x16 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44588984, RNode * =
    n=3D0x02a74300) Line 3201 + 0x10 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44588984, RNode * =
    n=3D0x02a741c8) Line 3113 + 0x10 C
    msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
    msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44588960, unsigned long =
    recv=3D44588984, unsigned long mid=3D11041, int argc=3D0, const unsigned =
    long * argv=3D0x00000000, int scope=3D0) Line 5757 + 0x28 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
    n=3D0x02a723b0) Line 3257 + 0xb4 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
    n=3D0x02a71fa8) Line 2977 + 0x10 C
    msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
    msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44529080, unsigned long =
    recv=3D44514224, unsigned long mid=3D11121, int argc=3D0, const unsigned =
    long * argv=3D0x00000000, int scope=3D0) Line 5757 + 0x28 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44517416, RNode * =
    n=3D0x02a71900) Line 3257 + 0xb4 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44517416, RNode * =
    n=3D0x02a71768) Line 2977 + 0x10 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44517416, RNode * =
    n=3D0x02a71648) Line 3113 + 0x10 C
    msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
    msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44528408, unsigned long =
    recv=3D44517416, unsigned long mid=3D8545, int argc=3D1, const unsigned =
    long * argv=3D0x0240c008, int scope=3D0) Line 5757 + 0x28 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
    n=3D0x02a75ec0) Line 3257 + 0xb4 C
    msvcrt-ruby18.dll!rb_yield_0(unsigned long val=3D6, unsigned long =
    self=3D44601944, unsigned long klass=3D0, int flags=3D0, int avalue=3D0) =
    Line 4745 + 0xd C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
    n=3D0x02a72698) Line 3053 + 0x16 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44514224, RNode * =
    n=3D0x02a725c0) Line 3113 + 0x10 C
    msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
    msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44529080, unsigned long =
    recv=3D44514224, unsigned long mid=3D9929, int argc=3D0, const unsigned =
    long * argv=3D0x00000000, int scope=3D0) Line 5757 + 0x28 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
    n=3D0x02a75f68) Line 3257 + 0xb4 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
    n=3D0x02a75e60) Line 2977 + 0x10 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
    n=3D0x02a75cb0) Line 3113 + 0x10 C
    msvcrt-ruby18.dll!rb_call0() Line 5664 + 0xd C
    msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44607896, unsigned long =
    recv=3D44601944, unsigned long mid=3D9945, int argc=3D1, const unsigned =
    long * argv=3D0x0240e93c, int scope=3D1) Line 5757 + 0x28 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
    n=3D0x02a74e88) Line 3272 + 0x9f C
    msvcrt-ruby18.dll!rb_yield_0(unsigned long val=3D6, unsigned long =
    self=3D44601944, unsigned long klass=3D0, int flags=3D0, int avalue=3D0) =
    Line 4745 + 0xd C
    msvcrt-ruby18.dll!rb_f_loop() Line 4889 + 0xf C
    msvcrt-ruby18.dll!call_cfunc() Line 5405 + 0x7 C
    msvcrt-ruby18.dll!rb_call0() Line 5536 + 0x1c C
    msvcrt-ruby18.dll!rb_call(unsigned long klass=3D44607752, unsigned long =
    recv=3D44601944, unsigned long mid=3D3929, int argc=3D0, const unsigned =
    long * argv=3D0x00000000, int scope=3D1) Line 5757 + 0x28 C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
    n=3D0x02a74e58) Line 3272 + 0x9f C
    msvcrt-ruby18.dll!rb_eval(unsigned long self=3D44601944, RNode * =
    n=3D0x02a77528) Line 2977 + 0x10 C
    msvcrt-ruby18.dll!eval_node() Line 1287 + 0xd C
    msvcrt-ruby18.dll!ruby_exec() Line 1456 + 0x12 C
    msvcrt-ruby18.dll!ruby_run() Line 1477 + 0x5 C
    ruby.exe!main() Line 52 C
    ruby.exe!mainCRTStartup() Line 398 + 0xe C
    KERNEL32.DLL!7c581af6() =20


    ------=_NextPart_000_0393_01C48AEB.490BE340--
     
    Bill Kelly, Aug 26, 2004
    #1
    1. Advertising

  2. Bill Kelly

    Bill Kelly Guest

    Hi again,

    I wondered if I could check an assumption or two here...
    I have a slightly tweaked version of the prior script,
    which for whatever reason seems to hang even more quickly.
    (Seconds to minutes....)
    http://bwk.homeip.net/ftp/buffered-io-test4b.rb

    I've recompiled ruby without optimization so I can see
    what's going on in the debugger more easily.

    One assumption I'd like to check, is whether a
    (delay == DELAY_INFTY) select should *ever* occur in
    rb_thread_schedule while my script is running.

    I am thinking this should not occur, because my main
    thread *only* ever sleeps in the following situation:

    begin
    alarm_th = Thread.start {
    sleep timeout_secs
    $global_signal.broadcast # %%WAS: .signal
    }
    $global_mutex.synchronize {
    print "{"
    $global_signal.wait($global_mutex)
    print "}"
    }
    ensure
    alarm_th.kill if alarm_th and alarm_th.alive?
    end

    ...that is, my main thread only goes to sleep when
    another thread has specifically been created with
    a non-INFTY delay (sleep timeout_secs) in order to
    wake my main thread up. So, I'd think that there
    should be no way to get a (delay == DELAY_INFTY) in
    rb_thread_schedule.

    ...UNLESS... . . . The other assumption I'd like to
    check:

    My main thread also reads and writes to a socket:

    if select(nil, [$cl_sock], nil, 0)
    print "W"
    sent = $cl_sock.send(" #$frame ", 0)
    end

    if select([$cl_sock], nil, nil, 0)
    print "R"
    dat = $cl_sock.recv(65536)
    end

    ... my assumption here is that send/recv WILL NOT
    block, because select just told me the socket is
    readable or writable. (No other thread is touching
    this socket.)


    According to what I'm seeing in the debugger, what
    is apparently happening is that I eventually get
    a (delay == DELAY_INFTY) select, and it waits
    forever.

    Also - it would appear this hanging condition only
    occurs when the main thread is in the:

    print "{"
    -> $global_signal.wait($global_mutex)
    print "}"

    ...and never in the send/recv part of the code.

    So it seems a (delay == DELAY_INFTY) is happening
    there, when I think it shouldn't.... Am I wrong?


    Thanks for your help,

    Regards,

    Bill
     
    Bill Kelly, Aug 26, 2004
    #2
    1. Advertising

  3. Bill Kelly

    Bill Kelly Guest

    Re: [BUG?] erroneous deadlock? (was: apparent hang...)

    Hi yet again, :)

    I think my script is simple enough to post now...
    After running it for a few minutes, I get:

    w{-}{}{}S+w{-}{}{}S+w{-}{}{}S+w{-}{
    deadlock 0x2a74888: sleep:- - c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195
    deadlock 0x2a85628: sleep:- (main) - c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195
    c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195:in `stop': Thread(0x2a85628): deadlock (fatal)
    from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195:in `wait'
    from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:193:in `exclusive_unlock'
    from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:142:in `exclusive'
    from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:142:in `exclusive_unlock'
    from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:193:in `wait'
    from buffered-io-test4d.rb:46:in `global_wait'
    from buffered-io-test4d.rb:44:in `synchronize'
    from buffered-io-test4d.rb:44:in `global_wait'
    from buffered-io-test4d.rb:66
    from buffered-io-test4d.rb:60:in `loop'
    from buffered-io-test4d.rb:60


    I believe this script is simple enough that based on
    inspection we should be able to say there is no way
    it could deadlock.

    I'll intersperse some comments to try to support this
    assertion:

    01:
    02: require 'socket'
    03: require 'thread'
    04: require 'timeout'
    05:
    06: $frame = 0
    07:
    08: $global_mutex = Mutex.new
    09: $global_signal = ConditionVariable.new
    10:
    11: $background_mutex = Mutex.new
    12: $background_signal = ConditionVariable.new
    13:
    14: $sock = TCPSocket.new("curmudgeon", 27999)

    # "curmudgeon:27999" is a server that will merely
    # send back a prompt then wait for us to respond.
    # We are not going to respond - so after we've read
    # the initial prompt, we expect line 18 to wait
    # forever in the select. This is by intention, as
    # it seems to be important to have a thread doing
    # a time-unlimited select to get these problems to
    # occur.
    #
    # (Note that we can replace line 14 with:
    # $server = TCPServer.new(12345)
    # $sock = TCPSocket.new("localhost", 12345)
    # and get similar erroneous behavior. But for
    # whatever reason the localhost version exhibits
    # the "hanging" behavior, while the "curmudgeon:27999"
    # version prefers to deadlock.)

    # So - once background_read has read the initial
    # data it receives from the server, we expect it
    # to just wait forever on line 18.

    15:
    16: def background_read
    17: loop do
    18: if select([$sock], nil, nil, nil)
    19: print "r"
    20: dat = $bg_rd_sock.recv(65536)
    21: $global_signal.signal
    22: end
    23: end
    24: end
    25:

    # background_write is the only code owning the
    # $background_mutex... so there should never be
    # any deadlock contention there. So all we do
    # is repeatedly wait for the main thread to
    # signal us. Then print chars to stdout and go
    # back to sleep.

    26: def background_write
    27: loop do
    28: $background_mutex.synchronize {
    29: print "-"
    30: $background_signal.wait($background_mutex)
    31: print "+"
    32: }
    33:
    34: print "w"
    35: end
    36: end
    37:

    # global_wait creates a secondary alarm thread to
    # awaken the primary thread doing the wait after
    # the timeout elapses. Line 44 is the only code
    # owning the $global_mutex, so there should not be
    # any deadlock contention there.
    # Line 46 is where the deadlocks and hangs always
    # occur.

    38: def global_wait(timeout_secs)
    39: begin
    40: alarm_th = Thread.start {
    41: sleep timeout_secs
    42: $global_signal.broadcast # %%WAS: .signal
    43: }
    44: $global_mutex.synchronize {
    45: print "{"
    46: $global_signal.wait($global_mutex)
    47: print "}"
    48: }
    49: ensure
    50: alarm_th.kill if alarm_th and alarm_th.alive?
    51: end
    52: end
    53:
    54:
    55: bg_read_th = Thread.new { background_read }
    56: bg_write_th = Thread.new { background_write }
    57:
    58: $stdout.sync = true
    59:

    # Our main loop does nothing but periodically signal
    # the background_write thread, and sleep in global_wait.

    60: loop do
    61: $frame += 1
    62: if ($frame % 3) == 0
    63: print "S"
    64: $background_signal.signal
    65: end
    66: global_wait(0.003)
    67: end
    68:



    This script will either deadlock, or hang, on both
    Windows (1.8.2preview2) and Linux (1.8.0).

    I believe it is simple enough to claim that by inspection
    it should never deadlock or hang. (Unless I am just
    being dense.)

    I will begin delving into rb_thread_shedule to try to
    determine what is going awry. Any tips from the
    experts would be most welcome. :)


    Regards,

    Bill
     
    Bill Kelly, Aug 26, 2004
    #3
  4. Bill Kelly

    Bill Kelly Guest

    Re: [BUG?] erroneous deadlock? (was: apparent hang...)


    > 15:
    > 16: def background_read
    > 17: loop do
    > 18: if select([$sock], nil, nil, nil)
    > 19: print "r"
    > 20: dat = $bg_rd_sock.recv(65536)
    > 21: $global_signal.signal
    > 22: end
    > 23: end
    > 24: end
    > 25:


    Whoops ---- line 20 has a typo. That thread was
    exiting anyway. <blush> .... However, that just
    makes the script simpler... I've removed the
    background_read thread entirely, and its associated
    socket. I'm still getting the deadlock...

    Hopefully have this figured out soon.


    Regards,

    Bill
     
    Bill Kelly, Aug 27, 2004
    #4
  5. Bill Kelly

    Guest

    Re: [BUG?] erroneous deadlock? (was: apparent hang...)

    Hi,

    At Fri, 27 Aug 2004 05:18:27 +0900,
    Bill Kelly wrote in [ruby-talk:110584]:
    > This script will either deadlock, or hang, on both
    > Windows (1.8.2preview2) and Linux (1.8.0).


    1.9 doesn't seem to deadlock, I'll investigate it more.

    --
    Nobu Nakada
     
    , Aug 27, 2004
    #5
  6. Bill Kelly

    Bill Kelly Guest

    Re: [BUG?] erroneous deadlock? (was: apparent hang...)

    > At Fri, 27 Aug 2004 05:18:27 +0900,
    > Bill Kelly wrote in [ruby-talk:110584]:
    > > This script will either deadlock, or hang, on both
    > > Windows (1.8.2preview2) and Linux (1.8.0).

    >
    > 1.9 doesn't seem to deadlock, I'll investigate it more.


    Hi Nobu - Thanks!

    I'm just writing another email right now, the problem
    seems to be in my global_wait function.. =(


    Thanks for investigating !!! Sorry to waste your time...


    Regards,

    Bill
     
    Bill Kelly, Aug 27, 2004
    #6
  7. Bill Kelly

    Bill Kelly Guest

    Re: [BUG?] erroneous deadlock? (was: apparent hang...)

    Hello once again,

    > Hopefully have this figured out soon.


    Er. Well I was hoping all this wouldn't turn out to
    be ENTIRELY in my code... <rueful grin>

    It appears the problem is with my wait/timeout code:

    def global_wait(timeout_secs)
    begin
    alarm_th = Thread.start {
    sleep timeout_secs
    $global_signal.broadcast
    }
    $global_mutex.synchronize {
    ############## if alarm_th's broadcast happens
    ############## before we get to the wait here,
    ############## then, obviously, we may never wake up
    $global_signal.wait($global_mutex)
    }
    ensure
    alarm_th.kill if alarm_th and alarm_th.alive?
    end
    end


    As a workaround, I've changed it to:

    def global_wait(timeout_secs)
    begin
    awake = false
    alarm_th = Thread.start {
    sleep timeout_secs
    while not awake
    $global_signal.broadcast
    sleep(0.001) # WAS: Thread.pass
    end
    }
    $global_mutex.synchronize {
    $global_signal.wait($global_mutex)
    awake = true
    }
    ensure
    alarm_th.kill if alarm_th and alarm_th.alive?
    end
    end

    ...So the alarm thread sticks around until it's
    *sure* the primary thread has awakened.

    I guess it's pretty cheesy, with the sleep(0.001)
    and all... But as long as it's foolproof, it will
    work for my purposes.

    I'm now wondering if I should add any begin/rescue
    in the alarm thread, in case #broadcast could ever
    raise an exception.... This code will be in a
    server that I want to run indefinitely, with no
    problems... :)


    Thanks everyone who followed the thread; sorry if
    I've wasted your time . . .



    Regards,

    Bill
     
    Bill Kelly, Aug 27, 2004
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Integer Software
    Replies:
    3
    Views:
    752
    Tejal Joshi \(MSFT\)
    Apr 19, 2004
  2. Bengt Richter
    Replies:
    3
    Views:
    305
    Steve Holden
    Jan 19, 2005
  3. Replies:
    4
    Views:
    411
    Branimir Maksimovic
    Oct 20, 2005
  4. Saad
    Replies:
    1
    Views:
    563
    Victor Bazarov
    Aug 27, 2009
  5. Casey Hawthorne
    Replies:
    3
    Views:
    1,158
    Flash Gordon
    Nov 1, 2009
Loading...

Share This Page