FCGI not responding to signals

Jamis Buck

Rails applications that use FCGI have been observing some strange
behavior. I have a hypothesis regarding the cause, but I'd like some
feedback as to whether it is a reasonable hypothesis, and any
solutions/workarounds that people might have.

Sometimes (and some apps experience this more frequently than others)
a FCGI process that is not currently handling a request will fail to
respond to a signal (specifically USR1 or HUP) until a request is
received. This is problematic when updating an application, because
you typically want to gracefully terminate all existing FCGI
processes and start up some new ones pointing at your updated code.
But some (or many) of the processes don't respond until a request is
received, meaning the user can get anything from a stale version of
your app, to a 500 error, depending on how well-behaved (or ill-
behaved) the FCGI process is.

Currently, Rails uses a "nudge" approach (EXTREMELY hacky) to handle
this. When an application is restarted, you send _n_ requests to the
application with the assumption that those requests will be
sufficient to trigger the sleeping processes and let them gracefully
terminate. The problem is, it doesn't work very well, especially in
the case of Apache- or Lighttpd-managed FCGI processes. And even
independently-managed FCGI processes will sometimes croak with this
approach.
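
For readers unfamiliar with the nudge idea, here is a minimal sketch of it. This is not the actual Rails code; the `nudge` helper, its `fetch:` parameter, and the URL in the usage note are hypothetical names chosen for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: after signalling the FCGI processes, send `count`
# throwaway requests in the hope that each idle process receives one,
# wakes up, and notices its pending signal.
# `fetch` is injectable so the logic can be exercised without a web server.
def nudge(url, count, fetch: ->(uri) { Net::HTTP.get_response(uri).code })
  uri = URI.parse(url)
  Array.new(count) do
    begin
      fetch.call(uri)
    rescue StandardError => e
      # A dying process may drop the connection; record it and keep nudging.
      "error: #{e.class}"
    end
  end
end
```

Calling something like `nudge('http://localhost/dispatch.fcgi', 5)` returns one status string per request. The sketch also shows why the approach is fragile: nothing guarantees that the web server routes each of the requests to a different sleeping process.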

My hypothesis regarding the cause of the unresponsiveness is this
(and please feel free to gently debunk it--I'm not ashamed to admit
that I'm in somewhat over my head here): the processes in question
are stuck on some IO-bound process (like listening on a socket), and
Ruby is blocking until that finishes. This prevents Ruby from
invoking the signal handler callback until the IO finishes. Sounds
reasonable? If not, any other ideas what might be causing it?

And even more importantly, is there a sane way to work around (or
better yet, _fix_) this problem? It's a rather nasty stumbling block
to automated application deployment.
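
If the hypothesis holds, one possible workaround in a pure-Ruby accept loop is to never block indefinitely: wait with a timeout and re-check a flag set by the signal handler. The sketch below assumes a loop you control yourself; it does not apply to a C extension that blocks inside its own accept call. The names `$shutdown_requested` and `wait_for_request` are made up for this example.

```ruby
# Flag set by the signal handler; the handler itself does nothing else.
$shutdown_requested = false
trap('USR1') { $shutdown_requested = true }

# Wait until `io` is readable, but wake up every `poll_interval` seconds
# to ask the block whether we should stop. Returns `io` when a request is
# ready, or nil when the block says to stop.
def wait_for_request(io, poll_interval)
  until yield
    ready, = IO.select([io], nil, nil, poll_interval)
    return io if ready
  end
  nil
end
```

A caller would use it as `wait_for_request(listener, 1.0) { $shutdown_requested }` and exit cleanly on `nil`. The cost is a periodic wake-up in every idle process, so the interval is a trade-off between responsiveness and idle load.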

Thanks for any help,

Jamis
 
Ara.T.Howard

> Rails applications that use FCGI have been observing some strange behavior.
> I have a hypothesis regarding the cause, but I'd like some feedback as to
> whether it is a reasonable hypothesis, and any solutions/workarounds that
> people might have.

on which platforms?

> Sometimes (and some apps experience this more frequently than others) a FCGI
> process that is not currently handling a request will fail to respond to a
> signal (specifically USR1 or HUP) until a request is received.

just to clarify - a fcgi process is __always__ handling a request. for
instance, if i run this code as a fcgi process:

[ahoward@localhost html]$ cat ./env.fcgi
#! /usr/local/bin/ruby
require 'fcgi'
loaded, pid = Time::now, Process::pid
FCGI.each_cgi do |cgi|
  env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
  content = <<-html
    LOADED @ #{ loaded } <br>\n
    PID @ #{ pid } <br>\n
    <hr><hr>
    #{ env }
  html
  cgi.out{ content }
end

[ahoward@localhost html]$ links -dump http://localhost/env.fcgi |grep PID
PID @ 12568

and then check that process

[root@localhost ahoward]# strace -p 12568
Process 12568 attached - interrupt to quit
select(1, [0], NULL, NULL, NULL ...

i see it's waiting for a request and blocked in select to io multiplex.
checking os_unix.c in the fcgi lib source we see

void OS_ShutdownPending()
{
    shutdownPending = TRUE;
}

static void OS_Sigusr1Handler(int signo)
{
    OS_ShutdownPending();
}

...

int OS_Accept(int listen_sock, int fail_on_intr, const char *webServerAddrs)
{
    int socket = -1;
    union {
        struct sockaddr_un un;
        struct sockaddr_in in;
    } sa;

    for (;;) {
        if (AcquireLock(listen_sock, fail_on_intr))
            return -1;

        for (;;) {
            do {
#ifdef HAVE_SOCKLEN
                socklen_t len = sizeof(sa);
#else
                int len = sizeof(sa);
#endif
                if (shutdownPending) break;
                /* There's a window here */

                socket = accept(listen_sock, (struct sockaddr *)&sa, &len);
            } while (socket < 0
                     && errno == EINTR
                     && ! fail_on_intr
                     && ! shutdownPending);


...
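
The "There's a window here" comment appears to mark a race: as I read the excerpt, a signal that lands after the `shutdownPending` check but before `accept` blocks sets the flag, yet the process still sleeps until a connection arrives. A common way to close that kind of window is the self-pipe trick, sketched here in Ruby. This is an illustration of the general technique, not something the fcgi library does; `$wake_reader`, `$wake_writer` and `wait_for_event` are hypothetical names.

```ruby
# Self-pipe sketch: the handler writes one byte into a pipe, and the main
# loop selects on the listening IO and the pipe together. A signal that
# arrives just before the select still leaves a readable byte behind.
$wake_reader, $wake_writer = IO.pipe

trap('USR1') do
  begin
    $wake_writer.write_nonblock('.')
  rescue IO::WaitWritable, Errno::EPIPE
    # Pipe full or closed: a wake-up is already pending, nothing more to do.
  end
end

# Blocks until either a signal was received or `listener` is readable.
# Returns :shutdown or :request accordingly (shutdown wins if both are ready).
def wait_for_event(listener)
  ready, = IO.select([listener, $wake_reader])
  ready.include?($wake_reader) ? :shutdown : :request
end
```

Because the wake-up is stored in the pipe rather than in a flag that is only checked at certain points, there is no gap between "check" and "block" for the signal to fall into.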


so it seems that the signal handler sets a global flag which is checked at
appropriate times. we can send a signal to the process and see what happens:

[root@localhost html]# kill -HUP 12568

and, back in our strace window we see:

--- SIGHUP (Hangup) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {0x80a4884, [], SA_RESTART}, 8) = 0
exit_group(1) = ?
Process 12568 detached

looks fine - so it does, in fact, receive and handle the signal asap. but
wait a minute.... it exited with 1 for failure. checking the apache logs we
see :

[root@localhost ahoward]# tail -2 /var/log/httpd/error_log
[Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" (pid 12568) terminated by calling exit with status '1'
[Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" restarted (pid 12614)

seems __ok__. but let's do it a few times:

[root@localhost html]# echo `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`
12614
[root@localhost html]# kill -HUP `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`
[root@localhost html]# kill -HUP `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`

now we check the logs:

[root@localhost ahoward]# tail -2 /var/log/httpd/error_log
[Thu Sep 15 10:15:34 2005] [error] [client 127.0.0.1] FastCGI: incomplete headers (0 bytes) received from server "/var/www/html/env.fcgi"
[Thu Sep 15 10:15:34 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds

so now the bloody thing won't run for ten minutes! the apache process manager
is preventing rapid startup/shutdown by buggy fcgi processes and this makes sense
since thousands of them could hose a system.
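
To make the back-off behaviour concrete, here is a toy model of what the log messages describe. It is not mod_fastcgi's implementation and the real logic may well differ; the class name `RestartThrottle` and its parameters are invented, and the defaults (30 seconds, 3 attempts, 600 seconds) are simply the numbers that appear in the log line above.

```ruby
# Toy model of a process manager that backs off when a child keeps dying
# shortly after being started.
class RestartThrottle
  def initialize(min_uptime: 30, max_failures: 3, backoff: 600)
    @min_uptime = min_uptime     # seconds a child must survive to count as healthy
    @max_failures = max_failures # consecutive short-lived runs tolerated
    @backoff = backoff           # restart delay once the limit is reached
    @failures = 0
  end

  # Given how long the child ran before exiting, return the number of
  # seconds to wait before starting it again.
  def delay_after_exit(uptime)
    if uptime >= @min_uptime
      @failures = 0
      return 0
    end
    @failures += 1
    @failures >= @max_failures ? @backoff : 0
  end
end
```

In this model the exit status plays no role at all, only the uptime does, which would match the observation that rapid restarts get penalised regardless of how cleanly the process exits.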

but, let's assume we sometimes want to shutdown nicely and know what we are
doing. we run this:

[ahoward@localhost html]$ cat env2.fcgi
#! /usr/local/bin/ruby
require 'fcgi'
trap('USR2'){ exit 0 }
loaded, pid = Time::now, Process::pid
FCGI.each_cgi do |cgi|
  env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
  content = <<-html
    LOADED @ #{ loaded } <br>\n
    PID @ #{ pid } <br>\n
    <hr><hr>
    #{ env }
  html
  cgi.out{ content }
end

[ahoward@localhost html]$ lynx -dump http://localhost/env2.fcgi |grep PID
PID @ 12690


note that this one exits, doing no cleanup, immediately with success if it gets
USR2. let's test it out:

[root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`
[root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`
[root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`

checking the log

[root@localhost ahoward]# tail -2 /var/log/httpd/error_log
[Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12865) terminated by calling exit with status '0'
[Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12877)
[Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12877) terminated by calling exit with status '0'
[Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12883)
[Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12883) terminated by calling exit with status '0'
[Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds

so this is better - by exiting with zero we at least got a few restarts out
of it - the process manager thought this was ok and just logged it.
however, restarting too rapidly caused us to be backed off into oblivion.
there are config options to control this, but consider setting them to NOT
backoff - a typo in a script would cause a loop in the webserver where it just
tried over and over to restart the app. a bunch of these could easily bring a
system to its knees. so i'm thinking that 'fixing' this problem would create
a far worse one with system crashing implications.

so i'm not sure what to do, but adding a signal handler that exits with success
may be a start in the right direction. this would allow nice restarts so long
as you didn't do them too quickly. if you are doing them too quickly you
really shouldn't be hitting the fcgi page anyhow so maybe this is good enough.

so... all that is totally *nix/apache specific and i'd imagine none of it would
work in windows. but maybe it's a start ;-)

please let me know if you end up learning more - i'll apply anything i find to
my acgi package since all the same things apply there.

cheers.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells among the causes of death
| Like a lamp standing in a strong breeze. --Nagarjuna
===============================================================================
 
Ara.T.Howard

On Thu, 15 Sep 2005, Jamis Buck wrote:

<snip trouble restarting fcgi apps>

how about a completely different approach to restarting - restart without
exiting using the 'exec' system call. this won't return an exit status to the
fcgi pm and, i think, therefore won't cause trouble:

[ahoward@localhost html]$ cat reloadable.fcgi
#! /usr/local/bin/ruby
require 'fcgi'
loaded, pid = Time::now, Process::pid

FCGI.each_cgi do |cgi|
  env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
  content = <<-html
    command_line : #{ $command_line } <br>
    loaded : #{ loaded } <br>
    pid : #{ pid } <br>
    <hr><hr>
    #{ env }
  html
  cgi.out{ content }
end

BEGIN {
  require 'rbconfig'
  $config = ::Config::CONFIG
  $ruby = File::join($config['bindir'], $config['ruby_install_name']) + $config['EXEEXT']
  $this = $0
  $command_line = [$ruby, $this, ARGV].flatten.join(' ')
  trap('USR2'){ exec $command_line }
}


[ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
loaded : Thu Sep 15 12:46:36 MDT 2005
pid : 16018


[ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
loaded : Thu Sep 15 12:46:36 MDT 2005
pid : 16018


so we are running in fastcgi mode, the process has been loaded only once. force a restart:


[ahoward@localhost html]$ sudo kill -USR2 16018


[ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
loaded : Thu Sep 15 12:47:33 MDT 2005
pid : 16018

and it works!


[ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
loaded : Thu Sep 15 12:47:33 MDT 2005
pid : 16018

and sticks.


[ahoward@localhost html]$ sudo kill -USR2 16018


[ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
loaded : Thu Sep 15 12:47:43 MDT 2005
pid : 16018


[ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
loaded : Thu Sep 15 12:47:43 MDT 2005
pid : 16018


and works again.


checking the log


[ahoward@localhost html]$ sudo tail -3 /var/log/httpd/error_log
[Thu Sep 15 12:47:25 2005] [warn] (32)Broken pipe: FastCGI: write() to PM failed (ignore if a restart or shutdown is pending)
[Thu Sep 15 12:47:40 2005] [warn] (32)Broken pipe: FastCGI: write() to PM failed (ignore if a restart or shutdown is pending)
[Thu Sep 15 12:47:49 2005] [warn] (32)Broken pipe: FastCGI: write() to PM failed (ignore if a restart or shutdown is pending)

so i guess i'll ignore it.
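
For reference, the same re-exec idea can be written a little more defensively. The sketch below is a variation on the script above, not a replacement tested under a FastCGI process manager; `reexec_argv` and `install_reexec_handler` are hypothetical helper names. It relies on `RbConfig.ruby`, which current Ruby versions provide for the interpreter path, and passes the arguments to `exec` as a list so that no shell is involved and paths containing spaces survive.

```ruby
require 'rbconfig'

# Build the argument vector needed to re-run a script with the interpreter
# that is currently executing.
def reexec_argv(script = $0, args = ARGV)
  [RbConfig.ruby, script, *args]
end

# Install a handler that replaces the running process image in place.
# The pid stays the same, so the parent never sees an exit status.
def install_reexec_handler(signal = 'USR2')
  argv = reexec_argv # capture now, before anything mutates $0 or ARGV
  trap(signal) { exec(*argv) }
end
```

As with the original, anything the process holds open at the time of the signal is carried across or dropped according to the usual `exec` rules, so an in-flight request would be lost.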

-a
 
Minero Aoki

Hi,

In mail "FCGI not responding to signals"
Jamis Buck said:

> Sometimes (and some apps experience this more frequently than others)
> a FCGI process that is not currently handling a request will fail to
> respond to a signal (specifically USR1 or HUP) until a request is
> received. This is problematic when updating an application, because

> And even more importantly, is there a sane way to work around (or
> better yet, _fix_) this problem? It's a rather nasty stumbling block
> to automated application deployment.

If you are using the pure-Ruby fcgi.rb, try the latest CVS version.
I have been running FastCGI processes for some months, and I have
not experienced any signal problems with the latest one. You can
get it from my repository:

$ cvs -d :pserver:[email protected]:/src co bitchannel

fcgi.rb is in lib/.

Regards,
Minero Aoki
 
