DRb Mysterious Stops

Darrin Thompson

I'm running a fairly complicated build and test system with DRb over
Ruby 1.8.6. It involves 12 Linux machines running several different
distro versions and one Windows machine.

Lately I've been having problems where once in a while the machines
involved in this system just stop communicating, and I can't figure out
why. On occasion I can work around the problem by changing the order or
the frequency of the operations. When it occurs is more or less random.

The only thing I can think of is that this all started when I added SUSE
9.3 and 9.4 machines to this system.

The other possibility is that now I have 12 Linux machines and a Windows
machine all more or less arbitrarily talking with each other, so there
might be a slowly increasing probability of a deadlock that I'm suddenly
noticing because it's more likely with more machines.

I'm sitting here thinking of exotic ways TCP could be misconfigured out
of the box on SUSE 9. But deep in my soul I'm sure it's some stupid code
I wrote.

Anyway, the idea here is that a Windows machine sends messages to
several Linux machines and the Linux machines send back log messages and
occasionally a series of messages that represent the contents of a file.

If anyone has insight, I'd appreciate it. I'm running out of good ideas
here.
 
Joel VanderWerf

It might help to add

Thread.abort_on_exception = true

in case a drb thread is dying silently. (DRb might be smarter than that,
though.)
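
For context, here is a minimal sketch in plain Ruby (no DRb involved) of the situation this setting is meant to expose. With the default of false, an exception that kills a background thread does not interrupt the main thread; it only surfaces when something joins the thread. The worker and the variable names are illustrative assumptions, not code from the build system.

```ruby
# Thread.abort_on_exception is left at its default (false) here.
worker = Thread.new do
  # Newer Rubies print a report to stderr when a thread dies; silence it so
  # the sketch mimics the silent failure discussed above.
  if Thread.current.respond_to?(:report_on_exception=)
    Thread.current.report_on_exception = false
  end
  raise ArgumentError, "worker failed"
end

# The main thread keeps running while the worker dies.
sleep 0.01 while worker.alive?

# A status of nil means the thread was terminated by an exception.
status_after_failure = worker.status

# The exception only reaches the caller when the thread is joined.
join_error = begin
  worker.join
  nil
rescue ArgumentError => e
  e
end

puts "status: #{status_after_failure.inspect}, join raised: #{join_error.class}"
```

Setting Thread.abort_on_exception = true before starting the threads makes such a failure raise in the main thread instead of waiting for a join.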
 
Darrin Thompson

Joel said:
It might help to add

Thread.abort_on_exception = true

in case a drb thread is dying silently. (DRb might be smarter than that,
though.)

Tried that. No processes died or left any traces in my logs.

What I did get was more consistent [bad] behavior, at least for today.

It seems that any time my Windows machine calls a method on my hosted
SUSE 9 64-bit machine, _and_ that method returns a large value, the
conversation somehow gets "stuck". "Large" might be an array of 8000
lines of text from a file.

I watched the conversation with Wireshark and saw that in one of my
failures, right before everything hung, there were 15 TCP dup ACKs.

That's all I've got so far. Any more help?
 
Martin Boese

Darrin said:
That's all I've got so far. Any more help?

Hi,
I had the same problem a while ago: my program simply stopped
communicating with the remote and I couldn't figure out why either. At
first I restarted the program periodically via cron, and later I rewrote
it to send just UDP messages. I only needed to signal another process on
another host, so this was a good option for me. The program was running
(and stopping) on FreeBSD.
If you have a lot of data to send, I'd consider using XML-RPC or SOAP
instead of DRb.

Martin
 
Robert Klemme

2009/8/27 Martin Boese said:
I had the same problem a while ago: my program simply stopped
communicating with the remote and I couldn't figure out why either. At
first I restarted the program periodically via cron, and later I rewrote
it to send just UDP messages. I only needed to signal another process on
another host, so this was a good option for me. The program was running
(and stopping) on FreeBSD.
If you have a lot of data to send, I'd consider using XML-RPC or SOAP
instead of DRb.

Alternatively, implement file transfer on top of DRb, which could be as
simple as remotely iterating through the file in chunks. That would avoid
issues with arbitrarily large DRb method arguments or return values.
Although I have to say that my expectation would be that arbitrarily
large Strings should not cause issues with DRb - that would sound like
a bug to me.
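
The chunked approach could look roughly like the sketch below. The class name ChunkedFileReader, the method read_chunk, the helper fetch_file, and the 16 KB chunk size are all hypothetical choices for illustration. The demonstration calls the object directly so that it runs without a network; in the real system the reader would be exported with DRb.start_service and called through a DRbObject.

```ruby
require 'tempfile'

# Hypothetical front object. Each call returns a bounded amount of data, so
# no single reply grows with the size of the file.
class ChunkedFileReader
  # Returns up to `length` bytes starting at `offset`, or nil at end of file.
  def read_chunk(path, offset, length)
    File.open(path, 'rb') do |f|
      f.seek(offset)
      f.read(length)
    end
  end
end

# Caller side: pull the file piece by piece instead of in one large reply.
def fetch_file(reader, path, chunk_size = 16 * 1024)
  data = String.new          # empty binary string
  offset = 0
  while (chunk = reader.read_chunk(path, offset, chunk_size))
    data << chunk
    offset += chunk.bytesize
  end
  data
end

# Local demonstration with a temporary file of 8000 short records.
sample = Tempfile.new('chunk-demo')
sample.binmode
sample.write('line of text' * 8000)
sample.flush

copy = fetch_file(ChunkedFileReader.new, sample.path)
puts copy.bytesize
```

Reopening the file on every call keeps the server side stateless at the cost of some extra system calls; a stateful variant that keeps the file handle open is equally possible.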

Kind regards

robert


PS: I don't believe in IP misconfiguration either. :)
 
Darrin Thompson

Robert said:
Although I have to say that my expectation would be that arbitrarily
large Strings should not cause issues with DRb - that would sound like
a bug to me.

So trolling through the drb code I came across this:

def load(soc)  # :nodoc:
  begin
    sz = soc.read(4)        # sizeof (N)
  rescue
    raise(DRbConnError, $!.message, $!.backtrace)
  end
  raise(DRbConnError, 'connection closed') if sz.nil?
  raise(DRbConnError, 'premature header') if sz.size < 4
  sz = sz.unpack('N')[0]
  raise(DRbConnError, "too large packet #{sz}") if @load_limit < sz
  begin
    str = soc.read(sz)
  rescue
    raise(DRbConnError, $!.message, $!.backtrace)
  end
  raise(DRbConnError, 'connection closed') if str.nil?
  raise(DRbConnError, 'premature marshal format(can\'t read)') if str.size < sz
  Thread.exclusive do
    begin
      save = Thread.current[:drb_untaint]
      Thread.current[:drb_untaint] = []
      Marshal::load(str)
    rescue NameError, ArgumentError
      DRbUnknown.new($!, str)
    ensure
      Thread.current[:drb_untaint].each do |x|
        x.untaint
      end
      Thread.current[:drb_untaint] = save
    end
  end
end

Is it possible that the Thread.exclusive bit could deadlock on a Windows
machine?
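
Stripped of the taint bookkeeping, the method quoted above reads a length-prefixed frame: a 4-byte big-endian size followed by that many bytes of Marshal data. The sketch below reproduces only that framing against an in-memory StringIO. The names dump_frame, load_frame, and LOAD_LIMIT, the limit value, and the use of IOError in place of DRbConnError are assumptions made for illustration; this is not DRb's own code.

```ruby
require 'stringio'

LOAD_LIMIT = 25 * 1024 * 1024   # arbitrary limit chosen for this sketch

# Writer side: 4-byte big-endian length, then the marshalled object.
def dump_frame(obj)
  payload = Marshal.dump(obj)
  [payload.bytesize].pack('N') + payload
end

# Reader side: mirrors the checks in the quoted method.
def load_frame(io, load_limit = LOAD_LIMIT)
  header = io.read(4)
  raise IOError, 'connection closed' if header.nil?
  raise IOError, 'premature header' if header.bytesize < 4
  size = header.unpack('N')[0]
  raise IOError, "too large packet #{size}" if load_limit < size
  payload = io.read(size)
  raise IOError, 'connection closed' if payload.nil?
  raise IOError, 'premature marshal format' if payload.bytesize < size
  Marshal.load(payload)
end

frame = dump_frame('a' * 44_230)
roundtrip = load_frame(StringIO.new(frame))

# A frame cut short at end of stream is detected as an error.
truncated_error = begin
  load_frame(StringIO.new(frame[0, 100]))
  nil
rescue IOError => e
  e
end

puts roundtrip.length
puts truncated_error.message
```

On a blocking socket, a read with a length waits until that many bytes have arrived or the stream ends. If the TCP stream stalls partway through a large payload without closing, a reader built like this simply waits, which matches a hang rather than an exception.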
 
Michal Suchanek

2009/8/26 Darrin Thompson said:
I watched the conversation with wireshark and I saw that on one of my
failures, right before everything hung, there were 15 tcp dup acks.

That many dup ACKs are somewhat suspicious, but by themselves they
should not cause a lockup. If there is a network problem, the connection
should break eventually. However, some dumps of the part of the
conversation that causes excessive packet duplication might be useful.
Can you replicate the packet duplication with something simple like
an scp file transfer?

Are some of the earlier machines also 64-bit?

I am not sure how 32-bit vs 64-bit integers work with marshalling. It
should work, but perhaps some testing to ensure it really works well
would be a good idea.

Thanks

Michal
 
Darrin Thompson

Michal said:
That many dup ACKs are somewhat suspicious, but by themselves they
should not cause a lockup. If there is a network problem, the connection
should break eventually. However, some dumps of the part of the
conversation that causes excessive packet duplication might be useful.
Can you replicate the packet duplication with something simple like
an scp file transfer?

I can replicate it with a tiny tiny drb pair of programs.

On my SLES9 machine:
# cat test.rb
require 'drb'

class Echo
  def ping(length)
    return 'a' * length
  end
end

echo = Echo.new

DRb.start_service("druby://0.0.0.0:9000", echo)
DRb.thread.join

On my Windows machine:
require 'drb'

echo = DRb::DRbObject.new_with_uri("druby://172.31.192.159:9000")
response = echo.ping(ARGV[0].to_i)
puts response.length

When the program succeeds, it prints the number given. When it fails,
it hangs until I kill it. I'm finding that short values, like 1024,
always succeed. With 44230 as the arg, some runs succeed and some fail.

I have saved packet traces of a successful run at 1024 and a failed run
at 1 MB. The traces are a few K and 70+K respectively. I can provide
them privately.

Michal said:
Are some of the earlier machines also 64-bit?

Yes.

Michal said:
I am not sure how 32-bit vs 64-bit integers work with marshalling. It
should work, but perhaps some testing to ensure it really works well
would be a good idea.

A lot of other 32/64-bit conversations with other machines are working
fine, so I'm reluctant to go there.
 
SEKI Masatoshi

Darrin said:
I can replicate it with a tiny tiny drb pair of programs.

On my SLES9 machine:
# cat test.rb
require 'drb'

class Echo
  def ping(length)
    return 'a' * length
  end
end

echo = Echo.new

DRb.start_service("druby://0.0.0.0:9000", echo)
DRb.thread.join

DRb.start_service("druby://0.0.0.0:9000", echo, {:load_limit => 2**31}) ?
 
Darrin Thompson

Darrin said:
I can replicate it with a tiny tiny drb pair of programs.

And I think I've just ruled out Ruby and DRb as the culprits here.

I ran a test like this:

ssh root@ipofbadmachine cat /dev/urandom | hexdump -C

I run that from Windows XP in Cygwin and it hangs after a few seconds.
Against other machines it runs for as long as I let it.

Sorry for the noise.
 
Robert Klemme

2009/8/31 Darrin Thompson said:
And I think I've just ruled out ruby and DRb as the culprits here.

I ran a test like this:

ssh root@ipofbadmachine cat /dev/urandom | hexdump -C

I run that from Windows XP in Cygwin and it hangs after a few seconds.
Against other machines it runs for as long as I let it.

Sorry for the noise.

No problem. Btw, netcat is another tool for your network debugging
toolbox which might help for these network transmission tests:

http://de.wikipedia.org/wiki/Netcat
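
In the same spirit as a netcat transfer, a raw TCP test can also be written in a few lines of Ruby, which keeps DRb and marshalling out of the picture entirely. The sketch below is an assumption-laden illustration: the helper names serve_bytes and count_bytes, the 4096-byte block size, and the 1 MiB total are arbitrary, and both ends run on the loopback address so that it is self-contained. For a real test the sender would run on the suspect machine and the receiver on the Windows machine.

```ruby
require 'socket'

# Sender: accepts one connection, writes `total` bytes, then closes.
def serve_bytes(server, total, block = 4096)
  Thread.new do
    client = server.accept
    sent = 0
    while sent < total
      n = [block, total - sent].min
      client.write('a' * n)
      sent += n
    end
    client.close
  end
end

# Receiver: counts bytes until the sender closes the connection.
def count_bytes(host, port)
  received = 0
  TCPSocket.open(host, port) do |sock|
    while (chunk = sock.read(4096))
      received += chunk.bytesize
    end
  end
  received
end

server = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS pick a free port
port = server.addr[1]

sender = serve_bytes(server, 1_048_576)
received = count_bytes('127.0.0.1', port)
sender.join
server.close

puts received
```

If the receiver stalls partway through against one particular host while the same script completes against others, that points at the network path or TCP stack rather than at the application code.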

Kind regards

robert
 
