server bootstrapping upon connection (WARNING: LONG)

F

Fortepianissimo

Here is the situation: I want my server started up upon connection.
When the first connection comes in, the server is not running. The
client realizes the fact, and then starts up the server and tries to
connect again. This of course all happens on the same machine (local
connection only).

The connections can come in as fast as 30+/sec, so the server is
threaded (using SocketServer.ThreadingTCPServer). Further, the server
initialization is threaded into two: one is to do the real, lengthy
setup, the other is start lisenting ASAP.

The problem: I need to prevent multiple copies of the server being
started. I did this by using file locking (fcntl.lockf()). However,
not every time the code successfully prevented the server being
started up more than one time. Here is the relevant code:

---- CLIENT CODE SNIPPET STARTS ----
from fooConsts import *

import socket,fcntl,os,sys,time,logging

FOO_CLIENT_DEBUG=1

POLL_SERVER_INTVL=0.5

if FOO_CLIENT_DEBUG:
myPID=os.getpid()
logger = logging.getLogger('FOO Client')
hdlr = logging.FileHandler(TMP_PATH+'fooClient.log')
formatter = logging.Formatter('%(asctime)s %(message)s')
hdlr.setFormatter(formatter)
logger.addHandler(hdlr)
logger.setLevel(logging.INFO)

def log (info):
logger.info('%d: %s'%(myPID,info))


class FooClient:
def __init__ (self, startServer=True):
"""Connects to FooServer if it exists, otherwise starts it and
connects to it"""
self.connected=True
if self.connect(): return
elif not startServer:
if FOO_CLIENT_DEBUG: log('connection failed 1')
self.connected=False
return

# try to obtain the right to start a server; if we can't,
someone else
# must be doing it - try to reconnect
try:
if FOO_CLIENT_DEBUG: log('try to get right to start
server')
serverStartLock=open(TMP_PATH+'serverStart.lock','w')
fcntl.lockf(serverStartLock.fileno(),fcntl.LOCK_EX|fcntl.LOCK_NB)
except:
if FOO_CLIENT_DEBUG: log('someone else is doing it; wait
2')
while 1:
time.sleep(POLL_SERVER_INTVL)
if self.connect(): return

# safe to start a server and connect to it
if FOO_CLIENT_DEBUG: log('start server')
exitCode=os.system('python -OO "%sfooServer.py" &'%FOO_PATH)
if exitCode==0:
if FOO_CLIENT_DEBUG: log('server is being started; wait
3')
while 1:
time.sleep(POLL_SERVER_INTVL)
if self.connect(): return
else:
if FOO_CLIENT_DEBUG: log('sever bootstrapping failed')
self.connected=False
raise "Cannot start FOOServer"

def connect (self):
"""Attempts to connect to FOOServer from PORT_MIN to
PORT_MAX"""
if FOO_CLIENT_DEBUG: log('connection attempt')
port=PORT_MIN
while port<=PORT_MAX:
try:
self.socket=socket.socket(socket.AF_INET,
socket.SOCK_STREAM)
self.socket.connect(('',port))
break
except:
self.socket.close()
port+=1

if port>PORT_MAX:
if FOO_CLIENT_DEBUG: log('connection failed 2')
return False
if FOO_CLIENT_DEBUG: log('connection succeeded at port
%d'%port)
return True
....

FooClient()
---- CLIENT CODE SNIPPET ENDS ----

From the log (when problem occurred) I see even *AFTER* the server was
started and accepted connections (several connections came and went
happily), a connection would come in and hit the "connection failed 1"
log line. This shouldn't have happened as the default value of
startServer for FooClient.__init__() is True. In the very same
incident, FooServer was started twice, but in the log there's no trace
of two "start server" lines, but only the "connection failed 1" line
in the very end of the log.

I realize this is a rather complex question to ask for help, but I'm
really at wits end. Feel free to skip the detailed description, and
just throw me some general tips and hints. Thank you VERY MUCH!
 
R

ralf

Here is the situation: I want my server started up upon connection.
When the first connection comes in, the server is not running. The
client realizes the fact, and then starts up the server and tries to
connect again. This of course all happens on the same machine (local
connection only).

The connections can come in as fast as 30+/sec, so the server is
threaded (using SocketServer.ThreadingTCPServer). Further, the server
initialization is threaded into two: one is to do the real, lengthy
setup, the other is start lisenting ASAP.

The problem: I need to prevent multiple copies of the server being
started. I did this by using file locking (fcntl.lockf()). However,
not every time the code successfully prevented the server being
started up more than one time. Here is the relevant code:

When using fcntl.lockf different FooClient instances in the same
process will be able lock the file and start another server. You could
either use fcntl.flock to prevent that or use some global flag.
Also be sure to keep the file open by keeping a reference to it
(i.e. self.serverStartLock=open(...)).
For debugging purposes, remove that 'if self.connect(): return' and
I think you'll see much more servers being started.
---- CLIENT CODE SNIPPET STARTS ----
from fooConsts import *

import socket,fcntl,os,sys,time,logging

FOO_CLIENT_DEBUG=1

POLL_SERVER_INTVL=0.5

if FOO_CLIENT_DEBUG:
myPID=os.getpid()
logger = logging.getLogger('FOO Client')
hdlr = logging.FileHandler(TMP_PATH+'fooClient.log')
formatter = logging.Formatter('%(asctime)s %(message)s')
hdlr.setFormatter(formatter)
logger.addHandler(hdlr)
logger.setLevel(logging.INFO)

def log (info):
logger.info('%d: %s'%(myPID,info))


class FooClient:
def __init__ (self, startServer=True):
"""Connects to FooServer if it exists, otherwise starts it and
connects to it"""
self.connected=True
if self.connect(): return
elif not startServer:
if FOO_CLIENT_DEBUG: log('connection failed 1')
self.connected=False
return

# try to obtain the right to start a server; if we can't,
someone else
# must be doing it - try to reconnect
try:
if FOO_CLIENT_DEBUG: log('try to get right to start
server')
serverStartLock=open(TMP_PATH+'serverStart.lock','w')
fcntl.lockf(serverStartLock.fileno(),fcntl.LOCK_EX|fcntl.LOCK_NB)
except:
if FOO_CLIENT_DEBUG: log('someone else is doing it; wait
2')
while 1:
time.sleep(POLL_SERVER_INTVL)
if self.connect(): return

# safe to start a server and connect to it
if FOO_CLIENT_DEBUG: log('start server')
exitCode=os.system('python -OO "%sfooServer.py" &'%FOO_PATH)
if exitCode==0:
if FOO_CLIENT_DEBUG: log('server is being started; wait
3')
while 1:
time.sleep(POLL_SERVER_INTVL)
if self.connect(): return
else:
if FOO_CLIENT_DEBUG: log('sever bootstrapping failed')
self.connected=False
raise "Cannot start FOOServer"

def connect (self):
"""Attempts to connect to FOOServer from PORT_MIN to
PORT_MAX"""
if FOO_CLIENT_DEBUG: log('connection attempt')
port=PORT_MIN
while port<=PORT_MAX:
try:
self.socket=socket.socket(socket.AF_INET,
socket.SOCK_STREAM)
self.socket.connect(('',port))
break
except:
self.socket.close()
port+=1

if port>PORT_MAX:
if FOO_CLIENT_DEBUG: log('connection failed 2')
return False
if FOO_CLIENT_DEBUG: log('connection succeeded at port
%d'%port)
return True
...

FooClient()
---- CLIENT CODE SNIPPET ENDS ----

From the log (when problem occurred) I see even *AFTER* the server was
started and accepted connections (several connections came and went
happily), a connection would come in and hit the "connection failed 1"
log line. This shouldn't have happened as the default value of
startServer for FooClient.__init__() is True. In the very same

Well, maybe too many connection attempts are pending...
 
B

Benjamin Han

When using fcntl.lockf different FooClient instances in the same
process will be able lock the file and start another server. You could
either use fcntl.flock to prevent that or use some global flag.

Hm... I didn't know there's a difference between flock() and lockf(),
and I didn't get much info from the document either. Could you explain
a bit on why lockf() would not lock the file?

Actually I wrote a tiny script just to test if lockf() does what it
claims to do:

--- CODE STARTS ---
#!/usr/bin/env python

import os,fcntl,sys

print "* about to open flock.txt"
f=open('flock.txt','w')
print "* opened the file"
fcntl.lockf(f.fileno(),fcntl.LOCK_EX|fcntl.LOCK_NB)
print "* obtained the lock, enter your line below:"
l=sys.stdin.readline()
f.truncate()
f.write(l)
f.flush()
sys.stdin.readline()
f.close()

--- CODE ENDS ---

It seems it does lock the file? (Mac OS X 10.3.2).
Also be sure to keep the file open by keeping a reference to it
(i.e. self.serverStartLock=open(...)). For debugging purposes, remove
that 'if self.connect(): return' and
I think you'll see much more servers being started.

--- CODE SNIPPET STARTS ---
class FooClient:
def __init__ (self, startServer=True):
"""Connects to FooServer if it exists, otherwise starts it and
connects to it"""
self.connected=True
if self.connect(): return
elif not startServer:
if FOO_CLIENT_DEBUG: log('connection failed 1')
self.connected=False
return
...
--- CODE SNIPPET ENDS ---

Well in that case every connection will try to start a server. Good
point on keeping a reference to the lock file though - I'll add to it
and see what happens.
Well, maybe too many connection attempts are pending...

I failed to see why this should affect the default value of the
argument... if startServer is True (default), that log line should have
never been reached.

Thanks!

Ben
 
R

ralf

Benjamin Han said:
Hm... I didn't know there's a difference between flock() and lockf(),
and I didn't get much info from the document either. Could you explain
a bit on why lockf() would not lock the file?

Well, it might lock the file multiple times in the same
process. That's the problem:
----
import fcntl

def getlockfile():
return open('serverStart.lock', 'w')

def getlock(f):
try:
lock(f.fileno(), fcntl.LOCK_EX|fcntl.LOCK_NB)
except:
return False
return True

def doit():
f1 = getlockfile()
f2 = getlockfile()
print getlock(f1), getlock(f2)

lock = fcntl.lockf
doit()

lock = fcntl.flock
doit()
---

Output is:
True True
True False

Actually I wrote a tiny script just to test if lockf() does what it
claims to do:

--- CODE STARTS ---
#!/usr/bin/env python

import os,fcntl,sys

print "* about to open flock.txt"
f=open('flock.txt','w')
print "* opened the file"
fcntl.lockf(f.fileno(),fcntl.LOCK_EX|fcntl.LOCK_NB)
print "* obtained the lock, enter your line below:"
l=sys.stdin.readline()
f.truncate()
f.write(l)
f.flush()
sys.stdin.readline()
f.close()

--- CODE ENDS ---

It seems it does lock the file? (Mac OS X 10.3.2).


--- CODE SNIPPET STARTS ---
class FooClient:
def __init__ (self, startServer=True):
"""Connects to FooServer if it exists, otherwise starts it and
connects to it"""
self.connected=True
if self.connect(): return
elif not startServer:
if FOO_CLIENT_DEBUG: log('connection failed 1')
self.connected=False
return
..
--- CODE SNIPPET ENDS ---

Well in that case every connection will try to start a server. Good
point on keeping a reference to the lock file though - I'll add to it
and see what happens.


I failed to see why this should affect the default value of the
argument... if startServer is True (default), that log line should
have never been reached.

Well, then I suppose you're passing an false argument to
FooClient.__init__. I just wanted to say, that even if the server is
listening on that port, that connection attempts may fail.
 
B

Benjamin Han

Well, it might lock the file multiple times in the same
process. That's the problem:
----
import fcntl

def getlockfile():
return open('serverStart.lock', 'w')

def getlock(f):
try:
lock(f.fileno(), fcntl.LOCK_EX|fcntl.LOCK_NB)
except:
return False
return True

def doit():
f1 = getlockfile()
f2 = getlockfile()
print getlock(f1), getlock(f2)

lock = fcntl.lockf
doit()

lock = fcntl.flock
doit()
---

Output is:
True True
True False

Ok I've since changed all lockf() to flock(), but from the "ps" log
during the stress test, I still get two servers started up, listed by
ps like this:

27308 ?? S 0:01.25 python -OO fooServer.py
27465 ?? SV 0:01.25 python -OO fooServer.py

In this case the one with pid 27308 was the first server process
started. Some time later, with unknown reason, another server entry
will show up in ps log, but the status is always "RV" or "SV" (e.g.,
pid 27465). The way I started the server in the code is os.system()
call, and from the man page of ps, 'V' means "The process is suspended
during a vfork."

The weird thing is I also have log statements in the server script, and
from the log file only ONE copy is actually started. I start to suspect
that I might have misunderstood the ps log - could it be there really
is only ONE copy running, despite there're (up to) 2 entries shown in
the ps log?
I just wanted to say, that even if the server is
listening on that port, that connection attempts may fail.

How is it possible? Could you explain a bit more?

Thanks again,

Ben
 
K

Krzysztof Stachlewski

How is it possible? Could you explain a bit more?

Your server may be accepting connections slower than
the speed at which new connections are coming.
In such situation your operating system maintains a list
of those awaiting connections. You can control the length of
this list with the "backlog" parameter of the "listen" function.
If the list is full, new connections may simply be dropped
as if the server was not listening at all.

Stach
 
B

Benjamin Han

Your server may be accepting connections slower than
the speed at which new connections are coming.
In such situation your operating system maintains a list
of those awaiting connections. You can control the length of
this list with the "backlog" parameter of the "listen" function.
If the list is full, new connections may simply be dropped
as if the server was not listening at all.

Stach

Thanks for the tip - the default queue size is 5. I've since changed it
into 100 (much more than what it actually needs), but the original
problem remains.
 
B

Benjamin Han

27308 ?? S 0:01.25 python -OO fooServer.py
27465 ?? SV 0:01.25 python -OO fooServer.py

In this case the one with pid 27308 was the first server process
started. Some time later, with unknown reason, another server entry
will show up in ps log, but the status is always "RV" or "SV" (e.g.,
pid 27465). The way I started the server in the code is os.system()
call, and from the man page of ps, 'V' means "The process is suspended
during a vfork."

The weird thing is I also have log statements in the server script, and
from the log file only ONE copy is actually started. I start to suspect
that I might have misunderstood the ps log - could it be there really
is only ONE copy running, despite there're (up to) 2 entries shown in
the ps log?

Ok this is by far the most puzzling one: I use logger module in the
server script to notify its execution. The code is like this:

--- LOGGING CODE STARTS ---
if FOO_COMM_DEBUG:
myPID=os.getpid()
logger = logging.getLogger('FOO Comm')
hdlr = logging.FileHandler(TMP_PATH+'fooComm.log')
formatter = logging.Formatter('%(asctime)s %(message)s')
hdlr.setFormatter(formatter)
logger.addHandler(hdlr)
logger.setLevel(logging.INFO)

def log (info):
logger.info('Server %d: %s'%(myPID,info))

if FOO_COMM_DEBUG: log('process %d started me'%os.getppid())
--- LOGGING CODE ENDS ---

This is the FIRST block of code right after my "import ..." statement
in the server script, so there's no way that it could be skipped if the
script gets executed. Of course FOO_COMM_DEBUG is set to 1.

But as described, the ps log showed two entries of fooServer.py, but in
the log produced, only one server actually was started (I can tell that
from the PID in the log file). I really can't explain this discrepency!
 
A

Alan Kennedy

[Benjamin Han]
But as described, the ps log showed two entries of fooServer.py, but in
the log produced, only one server actually was started (I can tell that
from the PID in the log file). I really can't explain this discrepency!

Quick sanity check: Are you using an operating system that reports an
entry in the ps list for every *thread* of a process, rather an entry
for every process, as you might expect. So if you have 2 threads
running under the "fooServer.py" process, you get 2 entries in the
output of ps, rather than 1.

Some unixen do this.

HTH,
 
B

Benjamin Han

[Benjamin Han]
But as described, the ps log showed two entries of fooServer.py, but in
the log produced, only one server actually was started (I can tell that
from the PID in the log file). I really can't explain this discrepency!

Quick sanity check: Are you using an operating system that reports an
entry in the ps list for every *thread* of a process, rather an entry
for every process, as you might expect. So if you have 2 threads
running under the "fooServer.py" process, you get 2 entries in the
output of ps, rather than 1.

Some unixen do this.

HTH,

Hm, this is Mac OS X 10.3.2, and I checked the man page of ps, it has
another option '-M' to show the threads (which I didn't use to produce
the ps log in question) - but of course they all show up with the same
PID, not the phenomena I saw in the ps log file (where more than one
fooServer.py processes showed up with unique PIDs).

But fooServer.py does use more than one thread (both from the
SocketServer.ThreadingTCPServer and from some other threads it creates).
 
A

Alan Kennedy

[[email protected]]
[Benjamin Han]
[Krzysztof Stachlewski]
[Benjamin Han]
Thanks for the tip - the default queue size is 5. I've since changed it
into 100 (much more than what it actually needs), but the original
problem remains.

Another quick point: even if you set the backlog to 100 connections,
that only gives you 3.3333r seconds of grace before the backlog of
connections fills up, given the "up to 30 connections/second" you
mentioned in another message.

The 101st client to attempt a connect() will fail, if the server
doesn't start accept()ing connections before the backlog fills up.

HTH,
 
K

Krzysztof Stachlewski

Thanks for the tip - the default queue size is 5. I've since changed it
into 100 (much more than what it actually needs), but the original
problem remains.

I don't really think that your OS supports
so big backlog queues.

Stach
 
B

Benjamin Han

[[email protected]]
[Benjamin Han]
[Krzysztof Stachlewski]
[Benjamin Han]
Thanks for the tip - the default queue size is 5. I've since changed it
into 100 (much more than what it actually needs), but the original
problem remains.

Another quick point: even if you set the backlog to 100 connections,
that only gives you 3.3333r seconds of grace before the backlog of
connections fills up, given the "up to 30 connections/second" you
mentioned in another message.

The 101st client to attempt a connect() will fail, if the server
doesn't start accept()ing connections before the backlog fills up.

HTH,

The server is threaded, so it only takes fractions of a sec to finish
off a request. Actually from the log once the server is up there is
nothing I can see which could fill the entire queue (even if with a
much smaller size).

My hunch is that the problem is somewhere else. First I can't still
account for the discrepancy between the server log file and ps log
file: the former told me only one server instance has been started, but
the later showed from time to time a second instance had indeed been
fired up (with state "RV" listed in the log file).

BTW I use this simple line in bash to collec the ps log:

while [ '1' = '1' ] ; do ps awx | grep python | grep -v grep>> ps.log ;
echo --- >> ps.log ; done

I don't suppose this could interfere anything?
 
B

Benjamin Han

Here is the situation: I want my server started up upon connection.
When the first connection comes in, the server is not running. The
client realizes the fact, and then starts up the server and tries to
connect again. This of course all happens on the same machine (local
connection only).

The connections can come in as fast as 30+/sec, so the server is
threaded (using SocketServer.ThreadingTCPServer). Further, the server
initialization is threaded into two: one is to do the real, lengthy
setup, the other is start lisenting ASAP.

... from the "ps" log during the stress test, I still get two servers
started up, listed by ps like this:

27308 ?? S 0:01.25 python -OO fooServer.py
27465 ?? SV 0:01.25 python -OO fooServer.py

In this case the one with pid 27308 was the first server process
started. Some time later, with unknown reason, another server entry
will show up in ps log, but the status is always "RV" or "SV" (e.g.,
pid 27465). The way I started the server in the code is os.system()
call, and from the man page of ps, 'V' means "The process is suspended
during a vfork."

The weird thing is I also have log statements in the server script, and
from the log file only ONE copy is actually started. I start to suspect
that I might have misunderstood the ps log - could it be there really
is only ONE copy running, despite there're (up to) 2 entries shown in
the ps log?

Because of this discrepancy between the ps log and the log produced by
running the server script, I started to suspect that the second server
instance shown in the ps log was a forked process, especially since it
always came with a 'V' (vforked) state. Adding a "-o flags" to the ps
command showed the following (exerpt);

8896 Fri Feb 13 03:47:25 2004 R 2004004 - - 2a567f8
python -OO fooServer.py
8997 Fri Feb 13 03:47:34 2004 RV 8000014 - - 2a54028
python -OO fooServer.py

After checking sys/proc.h:

#define P_VFORK 0x2000000 /* process has vfork children */
#define P_INVFORK 0x8000000 /* proc in vfork */

Basically confirmed that the server process was vforked. I then did a
"grep -r fork *" but none showed in any of my code. My question then
is, where could such a "hidden" vfork happen? I did use the following
in my code:

1. pyDNS (http://pydns.sourceforge.net/)
2. os.popen(), os.popen2(), os.system()

Any hint is extremely welcome!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,009
Latest member
GidgetGamb

Latest Threads

Top