Python randomly exits with Linux OS error -9 or -15


Janis

Hello!

I have this problem with my script exiting randomly with Linux OS
status code -9 (most often) or -15 (also sometimes, but much more
rarely). As far as I understand -9 corresponds to Bad file descriptor
and -15 Block device required.

1) Is there a way to find out what exactly causes the Python
process to exit?
2) What could be the reason for Python exiting with these status codes?

The script is a crawler that crawls several web sites to download web
pages and extract information. Most often it exits after having run
for 2 hours and having downloaded ~24 000 files. Some specific web
sites are more affected than others, i.e., there are other instances
of the script running in parallel that download more pages and
complete normally. That could be related to the speed at which each
page is returned, etc.

I have a try/except block in the root of the script which works very
well to catch any kind of exception, but in these cases either Python
does not catch the exception or fails to log it into the MySQL error
log table.

I cannot use a debugger because the code exits only after it has run
for an hour or more. To try to catch the exact place, I have put
logging after each line in some places. The logging prints to stdout,
writes to a file and logs the line into the MySQL DB. This has revealed
that the code may exit on random simple lines such as
time.sleep(0.1) (most often), f = opener.open(request) (also often),
but sometimes also on such simple statements as list.add('string').

It can also be that the problem occurs earlier and the script then
exits on the next attempt to do any output: stdout (i.e. a regular
print '1'), writing to a file, or logging into MySQL. I have tried
putting each of these first in the debug log function, but the script
never failed in between these lines.

The main libraries in use are MySQLdb, urllib and urllib2, but none
of them seems to be the direct cause of the problem: there is no
direct call to any of them at the point of exit. I suspect that this
could be related to some garbage collection process.

Versions tried with: Python 2.6, Python 2.7 (and I think it happened
also in Python 2.5, but I'm not sure)

Thanks,
Janis
 

Alain Ketterlin

Janis said:
I have this problem with my script exiting randomly with Linux OS
status code -9 (most often) or -15 (also sometimes, but much more
rarely). As far as I understand -9 corresponds to Bad file descriptor
and -15 Block device required.

How do you get -9 and -15? Exit status is supposed to be between 0 and
127. With bash, 128+N is also used for processes that terminate on a
signal. At the Python level, Popen.wait() in the subprocess module uses
negative numbers for signal-terminated processes. And 9 is SIGKILL and
15 is SIGTERM.
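
To make that convention concrete, here is a minimal sketch (written for Python 3 on Linux, not the Python 2 versions discussed in this thread) that turns a Popen return code into a readable description. The helper name describe_returncode is made up for this illustration.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode):
    """Describe a subprocess.Popen.returncode value (hypothetical helper)."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # A negative value means the child was terminated by signal -returncode.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (-returncode, name)
    return "exited with status %d" % returncode


if __name__ == "__main__":
    # Demo: a child process that sends SIGKILL to itself.
    child = subprocess.Popen(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    )
    child.wait()
    print(describe_returncode(child.returncode))
```

Logging this description in the parent, instead of the bare number, avoids mistaking a signal number for an errno value.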

My guess is that your script hits a limit, e.g., number of open files,
or stack-size, or... But of course it's only a guess.

-- Alain.
 

Adam Skutt

Alain Ketterlin said:
How do you get -9 and -15? Exit status is supposed to be between 0 and
127.

0-255 are perfectly legal in UNIX. Chances are that something is
interpreting the unsigned integer as a signed integer accidentally.
Of course, without any output, there's no way to know for sure.

Adam
 

Adam Skutt

Janis said:
I have this problem with my script exiting randomly with Linux OS
status code -9 (most often) or -15 (also sometimes, but much more
rarely). As far as I understand -9 corresponds to Bad file descriptor
and -15 Block device required.

As Alain already said, you're probably hitting a resource limit of
some sort or another and your program is being forcibly terminated as
a result. You may just need to increase ulimits or system-wide
limits, or you may have a resource leak. There may be output in the
kernel logs (dmesg) or some /var/log file if this is the case.
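
As a rough illustration of checking those limits from inside the script itself, the sketch below uses the standard resource module (Unix only, Python 3). Which limits matter here is an assumption; the LIMITS table and helper names are chosen only for this example. Note that a system-wide out-of-memory kill is not a per-process limit and will not show up here.

```python
import resource

# A few per-process limits that a long-running crawler might plausibly hit.
# This selection is an assumption for illustration, not a complete list.
LIMITS = {
    "open files": resource.RLIMIT_NOFILE,
    "address space": resource.RLIMIT_AS,
    "stack size": resource.RLIMIT_STACK,
    "processes": resource.RLIMIT_NPROC,
}


def format_limit(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits listed above."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


if __name__ == "__main__":
    for name, (soft, hard) in sorted(current_limits().items()):
        print("%-14s soft=%s hard=%s"
              % (name, format_limit(soft), format_limit(hard)))
```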

Janis said:
1) Is there a way how I could find out what exactly causes Python
process to exit?

Use a debugger or at least turn on core dumps, so you have something
to examine after the crash. Tracking through the output to figure out
where you crashed in the Python code is difficult, but possible.
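
If the process is being terminated by a signal rather than crashing, one thing that may help is a handler that dumps the Python stack when SIGTERM arrives. This is only a sketch (Python 3, function names invented for the example), and it cannot help with SIGKILL, which a process can neither catch nor ignore.

```python
import signal
import sys
import traceback


def log_and_exit(signum, frame):
    """Write the interrupted Python stack to stderr, then exit."""
    sys.stderr.write("Received signal %d; stack at that moment:\n" % signum)
    traceback.print_stack(frame, file=sys.stderr)
    sys.stderr.flush()
    # 128 + N mirrors the shell convention for signal-terminated processes.
    sys.exit(128 + signum)


def install_sigterm_logger():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, log_and_exit)
```

Calling install_sigterm_logger() once at startup is enough; the stack printed on SIGTERM shows which line the script was executing when the signal was delivered.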

Janis said:
I have a try-catch block in the root of the script which works very
well to catch any kind of exceptions, but in these cases either Python
does not catch the exception or fails to log it into MySQL error log
table.

You need to fix your handler to log the exception to stderr or
similar, as it will make fixing your application much easier. It's
virtually impossible for anyone to help you if you don't know exactly
what's going on.

It's also highly likely that your current handler is throwing an
exception while running and masking the original error. I would
seriously advise taking it out, at least temporarily. This is why
catch-all handlers tend to be a poor idea, as they're rarely robust in
the cases you didn't consider.
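
A minimal sketch of a top-level handler along those lines (Python 3; the function name and the secondary_log parameter are hypothetical, standing in for something like the MySQL logger): the traceback goes to stderr first, and a failure in the secondary logger is reported rather than allowed to mask the original error.

```python
import sys
import traceback


def run_with_stderr_logging(main, secondary_log=None):
    """Run main(); on failure, write the traceback to stderr before anything else."""
    try:
        main()
        return 0
    except Exception:
        text = traceback.format_exc()
        # stderr comes first: it has the fewest moving parts.
        sys.stderr.write(text)
        sys.stderr.flush()
        if secondary_log is not None:
            try:
                # e.g. a database logger; it may itself fail.
                secondary_log(text)
            except Exception:
                sys.stderr.write("secondary logging failed:\n")
                traceback.print_exc(file=sys.stderr)
        return 1
```

The return value can be passed to sys.exit() so the caller still sees a non-zero status after a failure.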

Adam
 

Janis

Thank you all for the help! I will need to think a bit about the other
suggestions.

But, Alain, as to how I get -9 and -15:

I have the following code that has caught these:

p = subprocess.Popen([Config.PYTHON_EXE, 'Load.py', "%s" % (row[1],)],
                     bufsize=0, executable=None, stdin=None, stdout=None,
                     stderr=subprocess.PIPE, preexec_fn=None, close_fds=False,
                     shell=False, cwd='../Loader/')
stdout, stderr = p.communicate()
if p.returncode != 0:
    ....

So you say it's SIGKILL and SIGTERM? Then I guess these could be
misleading statuses from those cases when I have terminated the
sessions myself, and when there is the real problem apparently the
caller detected nothing here. SIGKILL and SIGTERM would probably also
explain why there was nothing in stderr.

Thanks,
Janis
P.S. The problem also seems to occur when I call the process directly
and not via Popen from another script.
 

Martin P. Hellwig

On 09/04/2012 11:01, Janis wrote:
<cut weird exit codes>
My experience is that this kind of behavior is observed when (from
most to least likely):
- Your kernel barfs on a limit, e.g. space/inodes/processes/memory/etc.
- You have a linked library mismatch
- You have bit rot on your system
- You have a faulty linked library
- You have a faulty kernel

The last two are academic for me, as I have never seen them in real
life, though they are possible. The bit rot one has only bitten me once
in the last 15 years (since then I use RAID on all but convenience
systems).

hth
 

Janis

I have confirmed that the signal involved is SIGKILL and, yes,
apparently the OS is simply running out of memory.
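
For anyone chasing a similar problem, a rough sketch of how memory growth could be tracked from inside such a script (Python 3, Unix only; the function names and the every-N-pages scheme are made up for this example, not what I actually ran):

```python
import resource
import sys


def peak_rss_kib():
    """Peak resident set size of this process.

    On Linux ru_maxrss is reported in kilobytes; other platforms may
    use different units.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def log_memory_every(n_pages, page_count, out=sys.stderr):
    """Write peak memory to `out` on every n_pages-th page; return True if logged."""
    if page_count % n_pages == 0:
        out.write("pages=%d peak_rss_kib=%d\n" % (page_count, peak_rss_kib()))
        return True
    return False
```

A peak value that keeps climbing with the page count points towards a leak rather than a one-off spike.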

Thank you all, again!

Best Regards,
Janis
 