Python reliability


Ville Voipio

I would need to make some high-reliability software
running on Linux in an embedded system. Performance
(or lack of it) is not an issue, reliability is.

The piece of software is rather simple, probably a
few hundred lines of code in Python. There is a need
to interact with the network using the socket module,
and then probably a need to do something hardware-
related which will get its own driver written in
C.

Threading and other more error-prone techniques can
be left aside, everything can run in one thread with
a poll loop.
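A loop of that shape might be sketched roughly as follows; the instrument address, query format, and timeout are invented here for illustration:

```python
import select
import socket

def query_once(sock, addr, payload, timeout_ms):
    """Send one UDP query and wait up to timeout_ms for the reply.
    Returns the reply bytes, or None on timeout (treated as a fault)."""
    sock.sendto(payload, addr)
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    if poller.poll(timeout_ms):
        data, _ = sock.recvfrom(4096)
        return data
    return None

def main():
    # Hypothetical main loop: one thread, one socket, no shared state.
    instrument = ("192.0.2.10", 9000)   # made-up address
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        reply = query_once(sock, instrument, b"QUERY", 1000)
        if reply is None:
            pass  # timeout: handle as an instrument fault
```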

The software should be running continuously for
practically forever (at least a year without a reboot).
Is the Python interpreter (on Linux) stable and
leak-free enough to achieve this?

- Ville
 

Paul Rubin

Ville Voipio said:
The software should be running continuously for
practically forever (at least a year without a reboot).
Is the Python interpreter (on Linux) stable and
leak-free enough to achieve this?

I would say give the app the heaviest stress testing that you can
before deploying it, checking carefully for leaks and crashes. I'd
say that regardless of the implementation language.
 

Paul Rubin

Steven D'Aprano said:
If performance is really not such an issue, would it really matter if you
periodically restarted Python? Starting Python takes a tiny amount of time:

If you have to restart an application, every network peer connected to
it loses its connection. Think of a phone switch. Do you really want
your calls dropped every few hours of conversation time, just because
some lame application decided to restart itself? Phone switches go to
great lengths to keep running through both hardware failures and
software upgrades, without dropping any calls. That's the kind of
application it sounds like the OP is trying to run.

To the OP: besides Python you might also consider Erlang.
 

Steven D'Aprano

I would need to make some high-reliability software
running on Linux in an embedded system. Performance
(or lack of it) is not an issue, reliability is.
[snip]

The software should be running continuously for
practically forever (at least a year without a reboot).
Is the Python interpreter (on Linux) stable and
leak-free enough to achieve this?

If performance is really not such an issue, would it really matter if you
periodically restarted Python? Starting Python takes a tiny amount of time:

$ time python -c pass
real 0m0.164s
user 0m0.021s
sys 0m0.015s

If performance isn't an issue, your users may not even care about ten
times that delay even once an hour. In other words, build your software to
deal gracefully with restarts, and your users won't even notice or care if
it restarts.
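One way to read "deal gracefully with restarts" is to run the program under a small supervisor that simply starts it again whenever it exits. A minimal sketch, with a made-up backoff and a restart cap that exists only so the sketch can be exercised:

```python
import subprocess
import sys
import time

def supervise(cmd, max_restarts=None):
    """Restart cmd every time it exits.  max_restarts exists only so the
    sketch is testable; a real supervisor would loop forever."""
    restarts = 0
    while max_restarts is None or restarts < max_restarts:
        status = subprocess.call(cmd)
        print("child exited with status %d, restarting" % status,
              file=sys.stderr)
        restarts += 1
        time.sleep(0.1)  # small backoff so a crash loop can't spin the CPU
    return restarts
```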

I'm not saying that you will need to restart Python once an hour, or even
once a month. But if you did, would it matter? What's more important is
the state of the operating system. (I'm assuming that, with a year's uptime
in the requirements, you aren't even thinking of WinCE.)
 

George Sakkis

Steven said:
I would need to make some high-reliability software
running on Linux in an embedded system. Performance
(or lack of it) is not an issue, reliability is.
[snip]

The software should be running continuously for
practically forever (at least a year without a reboot).
Is the Python interpreter (on Linux) stable and
leak-free enough to achieve this?

If performance is really not such an issue, would it really matter if you
periodically restarted Python? Starting Python takes a tiny amount of time:

You must have missed or misinterpreted the "The software should be
running continuously for practically forever" part. The problem of
restarting python is not the 200 msec lost but putting at stake
reliability (e.g. for health monitoring devices, avionics, nuclear
reactor controllers, etc.) and robustness (e.g. a computation that
takes weeks of cpu time to complete is interrupted without the
possibility to restart from the point it stopped).

George
 

Neal Norwitz

Ville said:
The software should be running continuously for
practically forever (at least a year without a reboot).
Is the Python interpreter (on Linux) stable and
leak-free enough to achieve this?

Jp gave you the answer that he has done this.

I've spent quite a bit of time since 2.1 days trying to improve the
reliability. I think it has gotten much better. Valgrind is run on
(nearly) every release. We look for various kinds of problems. I try
to review C code for these sorts of problems etc.

There are very few known issues that can crash the interpreter. I
don't know of any memory leaks. socket code is pretty well tested and
heavily used, so you should be in fairly safe territory, particularly
on Unix.

n
 

Steven D'Aprano

George said:
Steven D'Aprano wrote:

I would need to make some high-reliability software
running on Linux in an embedded system. Performance
(or lack of it) is not an issue, reliability is.
[snip]


The software should be running continuously for
practically forever (at least a year without a reboot).
Is the Python interpreter (on Linux) stable and
leak-free enough to achieve this?

If performance is really not such an issue, would it really matter if you
periodically restarted Python? Starting Python takes a tiny amount of time:


You must have missed or misinterpreted the "The software should be
running continuously for practically forever" part. The problem of
restarting python is not the 200 msec lost but putting at stake
reliability (e.g. for health monitoring devices, avionics, nuclear
reactor controllers, etc.) and robustness (e.g. a computation that
takes weeks of cpu time to complete is interrupted without the
possibility to restart from the point it stopped).


Er, no, I didn't miss that at all. I did miss that it
needed continual network connections. I don't know if
there is a way around that issue, although mobile
phones move in and out of network areas, swapping
connections when and as needed.

But as for reliability, well, tell that to Buzz Aldrin
and Neil Armstrong. The Apollo 11 moon lander rebooted
multiple times on the way down to the surface. It was
designed to recover gracefully when rebooting unexpectedly:

http://www.hq.nasa.gov/office/pao/History/alsj/a11/a11.1201-pa.html

I don't have an authoritative source of how many times
the computer rebooted during the landing, but it was
measured in the dozens. Calculations were performed in
an iterative fashion, with an initial estimate that was
improved over time. If a calculation was interrupted the
computer lost no more than one iteration.

I'm not saying that this strategy is practical or
useful for the original poster, but it *might* be. In a
noisy environment, it pays to design a system that can
recover transparently from a lost connection.

If your heart monitor can reboot in 200 ms, you might
miss one or two beats, but so long as you pick up the
next one, that's just noise. If your calculation takes
more than a day of CPU time to complete, you should
design it in such a way that you can save state and
pick it up again when you are ready. You never know
when the cleaner will accidentally unplug the computer...
 

Ville Voipio

I would say give the app the heaviest stress testing that you can
before deploying it, checking carefully for leaks and crashes. I'd
say that regardless of the implementation language.

Goes without saying. But I would like to be confident (or as
confident as possible) that all bugs are mine. If I use plain
C, I think this is the case. Of course, bad memory management
in the underlying platform will wreak havoc. I am planning to
use Linux 2.4.somethingnew as the OS kernel, and there I have
not experienced too many problems before.

Adding the Python interpreter adds one layer of uncertainty.
On the other hand, I am after the simplicity of programming
offered by Python.

- Ville
 

Ville Voipio

If performance is really not such an issue, would it really matter if you
periodically restarted Python? Starting Python takes a tiny amount of time:

Uhhh. Sounds like playing with Microsoft :) I know of a mission-
critical system which was restarted every week due to some memory
leaks. If it wasn't, it crashed after two weeks. Guess which
platform...
$ time python -c pass
real 0m0.164s
user 0m0.021s
sys 0m0.015s

This is on the limit of being acceptable. I'd say that a one-second
time lag is the maximum. The system is a safety system after all,
and there will be a hardware watchdog to take care of odd crashes.
The software itself is stateless in the sense that its previous
state does not affect the next round. Basically, it is just checking
a few numbers over the network. Even the network connection is
stateless (single UDP packet pairs) to avoid TCP problems with
partial closings, etc.

There are a gazillion things which may go wrong. A stray cosmic
ray may change the state of one bit in the wrong place of memory,
and that's it, etc. So, the system has to be able to recover from
pretty much everything. I will in any case build an independent
process which probes the state of the main process. However,
I hope it is never really needed.
I'm not saying that you will need to restart Python once an hour, or even
once a month. But if you did, would it matter? What's more important is
the state of the operating system. (I'm assuming that, with a year's uptime
in the requirements, you aren't even thinking of WinCE.)

Not even in my worst nightmares! The platform will be an embedded
Linux computer running 2.4.somethingnew.

- Ville
 

Paul Rubin

Ville Voipio said:
Goes without saying. But I would like to be confident (or as
confident as possible) that all bugs are mine. If I use plain
C, I think this is the case. Of course, bad memory management
in the underlying platform will wreak havoc. I am planning to
use Linux 2.4.somethingnew as the OS kernel, and there I have
not experienced too many problems before.

You might be better off with a 2.6 series kernel. If you use Python
conservatively (be careful with the most advanced features, and don't
stress anything too hard) you should be ok. Python works pretty well
if you use it the way the implementers expected you to. Its
shortcomings are when you try to press it to its limits.

You do want reliable hardware with ECC and all that, maybe with multiple
servers and automatic failover. This site might be of interest:

http://www.linux-ha.org/
 

Steven D'Aprano

Ville said:
There are a gazillion things which may go wrong. A stray cosmic
ray may change the state of one bit in the wrong place of memory,
and that's it, etc. So, the system has to be able to recover from
pretty much everything. I will in any case build an independent
process which probes the state of the main process. However,
I hope it is never really needed.

If you have enough hardware grunt, you could think
about having three independent processes working in
parallel. They vote on their output, and best out of
three gets reported back to the user. In other words,
only if all three results are different does the device
throw its hands up in the air and say "I don't know!"
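The two-out-of-three vote itself is only a few lines; all the real difficulty is in keeping the three processes independent:

```python
def vote(a, b, c):
    """Return the majority value of three results, or None when all
    three disagree (the "I don't know!" case above)."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None
```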

Of course, unless you are running each of them on an
independent set of hardware and OS, you really aren't
getting that much benefit. And then there is the
question, can you trust the voting mechanism... But if
this is so critical you are worried about cosmic rays,
maybe it is the way to go.

If it is not a secret, what are you monitoring with
this device?
 

Ville Voipio

If you have enough hardware grunt, you could think
about having three independent processes working in
parallel. They vote on their output, and best out of
three gets reported back to the user. In other words,
only if all three results are different does the device
throw its hands up in the air and say "I don't know!"

Ok, I will give you a bit more information, so that the
situation is a bit clearer. (Sorry, I cannot tell you
the exact application.)

The system is a safety system which supervises several
independent measurements (two or more). The measurements
are carried out by independent measurement instruments
which have their independent power supplies, etc.

The application communicates with the independent
measurement instruments through the network. Each
instrument is regularly queried for its measurement
results and status information. If the results given
by different instruments differ more than a given
amount, then an alarm is set (relay contacts opened).

Naturally, in case of equipment malfunction, the
alarm is set. This covers a wide range of problems from
errors reported by the instrument to physical failures
or program bugs.
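The core decision described in the last two paragraphs fits in a few lines. A sketch, where the names and the fail-safe convention (True means open the relay contacts) are invented here:

```python
def alarm_needed(readings, tolerance):
    """True if the relay contacts should be opened.  A missing or
    faulted reading (None) is treated as a malfunction, so the
    comparison fails safe."""
    if not readings or any(r is None for r in readings):
        return True
    return max(readings) - min(readings) > tolerance
```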

The system has several weak spots. However, the basic
principle is simple: if anything goes wrong, start
yelling. A false alarm is costly, but failing to give the
alarm when required must never happen.

I am not building a redundant system with independent
instruments voting. At this point I am trying to minimize
the false alarms. This is why I want to know if Python
is reliable enough to be used in this application.

By the postings I have seen in this thread it seems that
the answer is positive. At least as long as I do not try to
apply any adventurous programming techniques.

- Ville
 

Ville Voipio

You might be better off with a 2.6 series kernel. If you use Python
conservatively (be careful with the most advanced features, and don't
stress anything too hard) you should be ok. Python works pretty well
if you use it the way the implementers expected you to. Its
shortcomings are when you try to press it to its limits.

Just one thing: how reliable is the garbage collecting system?
Should I try to either not produce any garbage or try to clean
up manually?
You do want reliable hardware with ECC and all that, maybe with multiple
servers and automatic failover. This site might be of interest:

Well... Here the uptime benefit from using several servers is
not economically justifiable. I am right now at the phase of
trying to minimize the downtime with given hardware resources.
This is not flying; downtime does not kill anyone. I just want
to avoid choosing tools which belong more to the problem than
to the solution set.

- Ville
 

Paul Rubin

Ville Voipio said:
Just one thing: how reliable is the garbage collecting system?
Should I try to either not produce any garbage or try to clean
up manually?

The GC is a simple, manually-updated reference counting system
augmented with some extra contraption to resolve cyclic dependencies.
It's extremely easy to make errors with the reference counts in C
extensions, and either leak references (causing memory leaks) or
forget to add them (causing double-free crashes). The standard
libraries are pretty careful about managing references but if you're
using 3rd party C modules, or writing your own, then watch out.

There is no way you can avoid making garbage. Python conses
everything, even integers (small positive ones are cached). But I'd
say, avoid making cyclic dependencies, be very careful if you use the
less popular C modules or any 3rd party ones, and stress test the hell
out of your app while monitoring memory usage very carefully. If you
can pound it with as much traffic in a few hours as it's likely to see
in a year of deployment, without memory leaks or thread races or other
errors, that's a positive sign.
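During that kind of stress test, the standard `gc` module can help check whether the app is creating reference cycles at all. A small helper sketch (the reporting convention is made up):

```python
import gc

def collect_cycles():
    """Force a collection and return (count, objects) for everything
    that was only reclaimable by the cycle detector.  During a stress
    test, a steadily growing count suggests cyclic garbage."""
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep collected objects in gc.garbage
    found = gc.collect()
    cycles = list(gc.garbage)
    del gc.garbage[:]
    gc.set_debug(0)
    return found, cycles
```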
Well... Here the uptime benefit from using several servers is not
economically justifiable. I am right now at the phase of trying to
minimize the downtime with given hardware resources. This is not
flying; downtime does not kill anyone. I just want to avoid choosing
tools which belong more to the problem than to the solution set.

You're probably ok with Python in this case.
 

Max M

Ville said:
Goes without saying. But I would like to be confident (or as
confident as possible) that all bugs are mine. If I use plain
C, I think this is the case. Of course, bad memory management
in the underlying platform will wreak havoc.

Python isn't perfect, but I do believe that it is as good as the best of
the major "standard" systems out there.

You will have *far* greater chances of introducing errors yourself by
coding in C than you will encounter in Python.

You can see the bugs fixed in recent versions, and see for yourself
whether they would have crashed your system. That should be an indicator:

http://www.python.org/2.4.2/NEWS.html


--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
 

Tom Anderson

The GC is a simple, manually-updated reference counting system augmented
with some extra contraption to resolve cyclic dependencies. It's
extremely easy to make errors with the reference counts in C extensions,
and either leak references (causing memory leaks) or forget to add them
(causing double-free crashes).

Has anyone looked into using a real GC for python? I realise it would be a
lot more complexity in the interpreter itself, but it would be faster,
more reliable, and would reduce the complexity of extensions.

Hmm. Maybe it wouldn't make extensions easier or more reliable. You'd
still need some way of figuring out which variables in C-land held
pointers to objects; if anything, that might be harder, unless you want to
impose a horrendous JAI-like bondage-and-discipline interface.
There is no way you can avoid making garbage. Python conses everything,
even integers (small positive ones are cached).

So python doesn't use the old Smalltalk-80 SmallInteger hack, or similar?
Fair enough - the performance gain is nice, but the extra complexity would
be a huge pain, I imagine.

tom
 

Aahz

Has anyone looked into using a real GC for python? I realise it would be a
lot more complexity in the interpreter itself, but it would be faster,
more reliable, and would reduce the complexity of extensions.

Hmm. Maybe it wouldn't make extensions easier or more reliable. You'd
still need some way of figuring out which variables in C-land held
pointers to objects; if anything, that might be harder, unless you want to
impose a horrendous JAI-like bondage-and-discipline interface.

Bingo! There's a reason why one Python motto is "Plays well with
others".
 

Mike Meyer

Tom Anderson said:
Has anyone looked into using a real GC for python? I realise it would
be a lot more complexity in the interpreter itself, but it would be
faster, more reliable, and would reduce the complexity of extensions.

Hmm. Maybe it wouldn't make extensions easier or more reliable. You'd
still need some way of figuring out which variables in C-land held
pointers to objects; if anything, that might be harder, unless you
want to impose a horrendous JAI-like bondage-and-discipline interface.

Wouldn't necessarily be faster, either. I rewrote a program that
built a static data structure of a couple of hundred thousand objects
and then went traipsing through that while generating a few hundred
objects in a compiled language with a real garbage collector. The
resulting program ran about an order of magnitude slower than the
Python version.

Profiling revealed that it was spending 95% of its time in the
garbage collector, marking and sweeping that large data structure.

There's lots of research on dealing with this problem, as my usage
pattern isn't unusual - just a little extreme. Unfortunately, none of
them were applicable to compiled code without a serious performance
impact on pretty much everything. Those could probably be used in
Python without a problem.

<mike
 

Thomas Bartkus

Ville Voipio said:
I would need to make some high-reliability software
running on Linux in an embedded system. Performance
(or lack of it) is not an issue, reliability is.
The software should be running continuously for
practically forever (at least a year without a reboot).
Is the Python interpreter (on Linux) stable and
leak-free enough to achieve this?
Adding the Python interpreter adds one layer on uncertainty.
On the other hand, I am after the simplicity of programming
offered by Python.
<snip>

All in all, it would seem that the reliability of the Python run time is the
least of your worries. The best multi-tasking operating systems do a good
job of segregating different processes BUT what multitasking operating
system meets the standard you request in that last paragraph? Assuming that
the Python interpreter itself is robust enough to meet that standard, what
about that other 99% of everything else that is competing with your Python
script for cpu, memory, and other critical resources? Under ordinary Linux,
your Python script will be interrupted frequently and regularly by processes
entirely outside of Python's control.

You may not want a multitasking OS at all but rather a single tasking OS
where nothing happens that isn't 100% under your program control. Or if you
do need a multitasking system, you probably want something designed for the
type of rugged use you are demanding. I would google "embedded systems".
If you want to use Python/Linux, I might suggest you search "Embedded
Linux".

And I wouldn't be surprised if some dedicated microcontrollers aren't
showing up with Python capability. In any case, it would seem you need more
control than a Python interpreter would receive when running under Linux.

Good Luck.
Thomas Bartkus
 

Paul Rubin

Tom Anderson said:
Has anyone looked into using a real GC for python? I realise it would
be a lot more complexity in the interpreter itself, but it would be
faster, more reliable, and would reduce the complexity of extensions.

The next PyPy sprint (this week I think) is going to focus partly on GC.
Hmm. Maybe it wouldn't make extensions easier or more reliable. You'd
still need some way of figuring out which variables in C-land held
pointers to objects; if anything, that might be harder, unless you
want to impose a horrendous JAI-like bondage-and-discipline interface.

I'm not sure what JAI is (do you mean JNI?) but you might look at how
Emacs Lisp does it. You have to call a macro to protect intermediate
heap results in C functions from being GC'd, so it's possible to make
errors, but it cleans up after itself and is generally less fraught
with hazards than Python's method is.
 
