The reliability of Python threads

  • Thread starter Carl J. Van Arsdall

Paddy

|> Three to four months before `strange errors`? I'd spend some time
|> correlating logs; not just for your program, but for everything running
|> on the server. Then I'd expect to cut my losses and arrange to safely
|> re-start the program every TWO months.
|> (I'd arrange the re-start after collecting logs but before their
|> analysis. Life is too short).

Forget it. That strategy is fine in general, but is a waste of time
where threading issues are involved (or signal handling, or some types
of communication problem, for that matter).

Nah, it's a great strategy. It keeps you up and running when all you
know for sure is that you will most likely be able to keep things
together for three months normally.
The OP only thinks it's a threading problem - it doesn't matter what the
true fix will be, as long as arranging to re-start the server well
before it's likely to go down doesn't take too long compared to your
exploration of the problem, and, of course, you have to be able to
afford the glitch in availability.
 

Carl J. Van Arsdall

Aahz said:
[snip]

My response is that you're asking the wrong questions here. Our database
server locked up hard Sunday morning, and we still have no idea why (the
machine itself, not just the database app). I think it's more important
to focus on whether you have done all that is reasonable to make your
application reliable -- and then put your efforts into making your app
recoverable.
Well, I assume that I have done all I can to make it reliable. This
list is usually my last resort, or a place where I come hoping to find
ideas that aren't coming to me naturally. The only other thing I
thought to come up with was that there might be network errors. But
I've gone back and forth on that, because TCP should handle that for me
and I shouldn't have to deal with it directly in Pyro, although I've
added (and continue to add) checks in places that appear appropriate
(and in some cases, checks because I prefer to be paranoid about errors).

I'm particularly making this comment in the context of your later point
about the bug showing up only every three or four months.

Side note: without knowing what error messages you're getting, there's
not much anybody can say about your programs or the reliability of
threads for your application.
Right, I wasn't coming here to get someone to debug my app, I'm just
looking for ideas. I constantly am trying to find new ways to improve
my software and new ways to reduce bugs, and when I get really stuck,
new ways to track bugs down. The exception won't mean much, but I can
say that the error appears to me as bad data. I do checks prior to
performing actions on any data, if the data doesn't look like what it
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad. In
tracking down the problem a couple guys mentioned that problems like
that usually are a race condition. From here I examined my code,
checked out all the locking stuff, made sure it was good, and wasn't
able to find anything. Being that there's one lock and the critical
sections are well defined, I'm having difficulty. One idea I have to
try and get a better understanding might be to check data before it's
stored. Again, I still don't know how it would get messed up nor can I
reproduce the error on my own.

Do any of you think that would be a good practice for trying to track
this down? (Check the data after reading it, check the data before
saving it)
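
A minimal sketch of that check-on-read / check-on-write idea; the record
layout and the validity rule are only placeholders for whatever the real
data looks like:

import threading

class GuardedStore(object):
    """Validate shared data both when it is written and when it is read,
    so a corrupted value is caught close to where it went bad."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    @staticmethod
    def _check(key, value):
        # Placeholder for "looks like what it should look like".
        if not isinstance(value, str) or not value.strip():
            raise ValueError("bad data for %r: %r" % (key, value))

    def put(self, key, value):
        self._check(key, value)          # check before saving
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            value = self._data[key]
        self._check(key, value)          # check after reading
        return value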



--

Carl J. Van Arsdall
(e-mail address removed)
Build and Release
MontaVista Software
 

Nick Maclaren

|>
|> > |> Three to four months before `strange errors`? I'd spend some time
|> > |> correlating logs; not just for your program, but for everything running
|> > |> on the server. Then I'd expect to cut my losses and arrange to safely
|> > |> re-start the program every TWO months.
|> > |> (I'd arrange the re-start after collecting logs but before their
|> > |> analysis. Life is too short).
|> >
|> > Forget it. That strategy is fine in general, but is a waste of time
|> > where threading issues are involved (or signal handling, or some types
|> > of communication problem, for that matter).
|>
|> Nah, Its a great strategy. it keeps you up and running when all you
|> know for sure is that you will most likely be able to keep things
|> together for three months normally.
|>
|> The OP only thinks its a threading problem - it doesn't matter what the
|> true fix will be, as long as arranging to re-start the server well
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> before its likely to go down doesn't take too long, compared to your
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> exploration of the problem, and, of course, you have to be able to
|> afford the glitch in availability.

Consider the marked phrase in the context of a Poisson process failure
model, and laugh. If you don't understand why I say that, I suggest
finding out the properties of the Poisson process!


Regards,
Nick Maclaren.
 

Paddy

|> > |> Three to four months before `strange errors`? I'd spend some time
|> > |> correlating logs; not just for your program, but for everything running
|> > |> on the server. Then I'd expect to cut my losses and arrange to safely
|> > |> re-start the program every TWO months.
|> > |> (I'd arrange the re-start after collecting logs but before their
|> > |> analysis. Life is too short).
|> >
|> > Forget it. That strategy is fine in general, but is a waste of time
|> > where threading issues are involved (or signal handling, or some types
|> > of communication problem, for that matter).
|>
|> Nah, Its a great strategy. it keeps you up and running when all you
|> know for sure is that you will most likely be able to keep things
|> together for three months normally.
|>
|> The OP only thinks its a threading problem - it doesn't matter what the
|> true fix will be, as long as arranging to re-start the server well
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> before its likely to go down doesn't take too long, compared to your
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> exploration of the problem, and, of course, you have to be able to
|> afford the glitch in availability.

Consider the marked phrase in the context of a Poisson process failure
model, and laugh. If you don't understand why I say that, I suggest
finding out the properties of the Poisson process!

Regards,
Nick Maclaren.
No, you should think of the service that needs to be up. You seem to be
talking about how it can't be fixed rather than looking for ways to
keep things going. A little learning is fine but "it can't
theoretically be fixed" is no solution.
With a program that stays up for that long, the situation will usually
work out for the better when either software versions are upgraded, or
OS and drivers are upgraded. (Sometimes as a result of the analysis,
sometimes not).

Keep your eye on the goal and you're more likely to score!

- Paddy.
 

Paul Rubin

Paddy said:
No, you should think of the service that needs to be up. You seem to be
talking about how it can't be fixed rather than looking for ways to
keep things going.

But you're proposing cargo cult programming. There is no reason
whatsoever to expect that restarting the server now and then will help
the problem in the slightest. Nick used the fancy term Poisson
process but it just means that the probability of failure at any
moment is independent of what's happened in the past, like the
spontaneous radioactive decay of an atom. It's not like a mechanical
system where some part gradually gets worn out and eventually breaks,
so you can prevent the failure by replacing the part every so often.
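
That memoryless claim is easy to check numerically: if times to failure
are exponentially distributed, having already survived s days does not
change the chance of surviving t more days, which is exactly why a
scheduled restart buys nothing. A rough simulation, with the failure
rate picked arbitrarily:

import random

rate = 1.0 / 90.0    # say, one failure per 90 days on average
s, t = 60.0, 30.0    # already up 60 days; will it survive 30 more?
n = 1000000

samples = [random.expovariate(rate) for _ in range(n)]
survived_s = [x for x in samples if x > s]

p_cond = sum(1 for x in survived_s if x > s + t) / len(survived_s)
p_plain = sum(1 for x in samples if x > t) / len(samples)

print("P(T > s+t | T > s) = %.3f" % p_cond)    # ~0.72
print("P(T > t)           = %.3f" % p_plain)   # ~0.72, the same
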
A little learning is fine but "it can't theoretically be fixed" is
no solution.

The best you can do is identify the unfixable situations precisely and
work around them. Precision is important.

The next best thing is to have several servers running simultaneously,
with failure detection and automatic failover.
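
For what it's worth, the failover part can be as dumb as trying each
server in turn; the addresses and the request() call below are only
stand-ins for the real protocol:

import socket

SERVERS = [("primary.example.com", 9000), ("backup.example.com", 9000)]

def request(address, payload, timeout=5.0):
    # Stand-in for the real client call (Pyro, HTTP, whatever).
    sock = socket.create_connection(address, timeout=timeout)
    try:
        sock.sendall(payload)
        return sock.recv(4096)
    finally:
        sock.close()

def request_with_failover(payload):
    last_error = None
    for address in SERVERS:
        try:
            return request(address, payload)   # first healthy server wins
        except socket.error as err:            # failure detection
            last_error = err
    raise RuntimeError("all servers failed: %s" % last_error)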

If a server is failing at random every few months, trying to prevent
that by restarting it every so often is just shooting in the dark.
Think of your server stopping now and then because there's a power
failure, where you get power failures every few months on the average.
Shutting down your server once a month, unplugging it, and plugging it
back in will do nothing to prevent those outages. You need to either
identify and fix whatever is causing the power outages, or install a
backup generator.
 

Paddy

But you're proposing cargo cult programming.
I don't know that term. What I'm proposing is that if, for example, a
process stops running three times in a year at roughly three to four
months intervals, and it should have stayed up; then restart the
server sooner, at a time of your choosing, whilst taking other
measures to investigate the error.
There is no reason
whatsoever to expect that restarting the server now and then will help
the problem in the slightest.
That's where we most likely differ. The problem is only indirectly the
program failing. The customer wants reliable service, which you can get
from unreliable components. It happens all the time in
firmware-controlled systems that periodically reboot themselves as a
matter of course.
Nick used the fancy term Poisson
process but it just means that the probability of failure at any
moment is independent of what's happened in the past, like the
spontaneous radioactive decay of an atom. It's not like a mechanical
system where some part gradually gets worn out and eventually breaks,
so you can prevent the failure by replacing the part every so often.
Whilst you sit agreeing on how many fairies can dance on the end of a
pin or not, your company could be losing customers. You and Nick seem
to be saying it *must* be Poisson, therefore we can't do...
The best you can do is identify the unfixable situations precisely and
work around them. Precision is important.
I'm sorry, but your argument reminds me of when Western statistical
quality control first met with the Japanese Zero defects methodologies.
We had argued ourselves into accepting a certain amount of defective
cars getting out to customers as the result of our theories. The
Japanese practices emphasized *no* defects were acceptable at the
customer, and they seemed to deliver better made cars.
The next best thing is have several servers running simultaneously,
with failure detection and automatic failover.
Yah, finally. I can work with that.
If a server is failing at random every few months, trying to prevent
that by restarting it every so often is just shooting in the dark.
"at random" - "every few months"
Me thinking it happens "every few months" allows me to search for a
fix.
If thinking it happens "at random" leads you to a brick wall, then
switch!
Think of your server stopping now and then because there's a power
failure, where you get power failures every few months on the average.
Shutting down your server once a month, unplugging it, and plugging it
back in will do nothing to prevent those outages. You need to either
identify and fix whatever is causing the power outages, or install a
backup generator.

Yep. I also know that a mad bloke entering the server room with a
hammer every three to four months is also not likely to be fixed by
restarting the server every two months ;-)

- Paddy.
 

Paul Rubin

Paddy said:
I don't know that term.
http://en.wikipedia.org/wiki/Cargo_cult_programming

What I'm proposing is that if, for example, a process stops running
three times in a year at roughly three to four months intervals,
and it should have stayed up; then restart the server sooner, at a
time of your choosing,

What makes you think that restarting the server will make it less
likely to fail? It sounds to me like there's zero evidence of that,
since you say "roughly three or four month intervals" and talk about
threading and race conditions. If it's failing every 3 months, 15
days and 2.43 hours like clockwork, that's different, sure, restart it
every three months. But the description I see so far sounds like a
random failure caused by some events occurring with low enough
probability that they only happen on average every few months of
operation. That kind of thing is very common and is often best
diagnosed by instrumenting the hell out of the code.
That's where we most likely differ.

Do you think there is a reason to expect that restarting the server
will help the problem in the slightest? I realize you seem to expect
that, but you have not given a REASON. That's what I mean by cargo
cult programming.
Whilst you sit agreeing on how many fairies can dance on the end of a
pin or not, your company could be losing customers. You and Nick seem
to be saying it *must* be Poisson, therefore we can't do...

I dunno about Nick, I'm saying it's best to assume that it's Poisson
and do whatever is necessary to diagnose and fix the bug, and that the
voodoo measure you're proposing is not all that likely to help and it
will take years to find out whether it helps or not (i.e. restarting
after 3 months and going another 3 months without a failure proves
nothing).
I'm sorry, but your argument reminds me of when Western statistical
quality control first met with the Japanese Zero defects methodologies.
We had argued ourselves into accepting a certain amount of defective
cars getting out to customers as the result of our theories. The
Japanese practices emphasized *no* defects were acceptable at the
customer, and they seemed to deliver better made cars.

I don't see your point. You're the one who wants to keep operating
defective software instead of fixing it.
"at random" - "every few months"
Me thinking it happens "every few months" allows me to search for a
fix. If thinking it happens "at random" leads you to a brick wall,
then switch!

But you need evidence before you can say it happens every few months.
Do you have, say, a graph of the exact dates and times of failure, the
number of requests processed so far, etc.? If it happened at some
exact or almost exact uniform time interval or precisely once every
1.273 million requests or whatever, that tells you something. But the
earlier description didn't sound like that. Restarting the server is
not much better than carrying a lucky rabbit's foot.
 

Nick Maclaren

|>
|> No, you should think of the service that needs to be up. You seem to be
|> talking about how it can't be fixed rather than looking for ways to
|> keep things going. A little learning is fine but "it can't
|> theoretically be fixed" is no solution.

I suggest that you do invest in a little learning and look up Poisson
processes.

|> Keep your eye on the goal and you're more likely to score!

And, if you have your eye on the wrong goal, you would generally be
better off not scoring :)


Regards,
Nick Maclaren.
 

skip

Paul> I dunno about Nick, I'm saying it's best to assume that it's
Paul> Poisson and do whatever is necessary to diagnose and fix the bug,
Paul> and that the voodoo measure you're proposing is not all that
Paul> likely to help and it will take years to find out whether it helps
Paul> or not (i.e. restarting after 3 months and going another 3 months
Paul> without a failure proves nothing).

What makes you think Paddy indicated he wouldn't try to solve the problem?
Here's what he wrote:

What I'm proposing is that if, for example, a process stops running
three times in a year at roughly three to four months intervals, and it
should have stayed up; then restart the server sooner, at a time of
your choosing, whilst taking other measures to investigate the error.

I see nothing wrong with trying to minimize the chances of a problem rearing
its ugly head while at the same time trying to investigate its cause (and
presumably solve it).

Skip
 

Paul Rubin

What makes you think Paddy indicated he wouldn't try to solve the problem?
Here's what he wrote:

What I'm proposing is that if, for example, a process stops running
three times in a year at roughly three to four months intervals,
and it should have stayed up; then restart the server sooner, at a
your choosing, whilst taking other measures to investicate the error.

Well, ok, that's better than just rebooting every so often and leaving
it at that, like the firmware systems he cited.
I see nothing wrong with trying to minimize the chances of a problem

I think a measure to minimize the chance of some problem is only valid
if there's some plausible theory that it WILL decrease the chance of
the problem (e.g. if there's reason to think that the problem is
caused by a very slow resource leak, but that hasn't been suggested).
That's the part that I'm missing from this story.

One thing I'd certainly want to do is set up a test server under a
much heavier load than the real server sees, and check whether the
problem occurs faster.
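
A throwaway soak-test harness for that is only a few lines; hit_server()
below is a placeholder for one real client request, and every failure is
logged with a timestamp so the failure times can later be compared
against a "random" versus "wears out over time" picture:

import logging, random, threading, time

logging.basicConfig(filename="soak.log", level=logging.INFO,
                    format="%(asctime)s %(threadName)s %(message)s")

def hit_server():
    time.sleep(random.uniform(0.0, 0.01))   # replace with a real request

def worker(stop):
    while not stop.is_set():
        try:
            hit_server()
        except Exception:
            logging.exception("request failed")

def soak(n_threads=50, hours=72.0):
    stop = threading.Event()
    threads = [threading.Thread(target=worker, args=(stop,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    time.sleep(hours * 3600)                # run much hotter, much longer
    stop.set()
    for t in threads:
        t.join()

if __name__ == "__main__":
    soak()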
 

Hendrik van Rooyen

Carl J. Van Arsdall said:
Right, I wasn't coming here to get someone to debug my app, I'm just
looking for ideas. I constantly am trying to find new ways to improve
my software and new ways to reduce bugs, and when I get really stuck,
new ways to track bugs down. The exception won't mean much, but I can
say that the error appears to me as bad data. I do checks prior to
performing actions on any data, if the data doesn't look like what it
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad. In
tracking down the problem a couple guys mentioned that problems like
that usually are a race condition. From here I examined my code,
checked out all the locking stuff, made sure it was good, and wasn't
able to find anything. Being that there's one lock and the critical
sections are well defined, I'm having difficulty. One idea I have to

Are you 100% rock bottom gold plated guaranteed sure that there is
not something else that is also critical that you just haven't realised is?

This stuff is never obvious before the fact - and always seems stupid
afterward, when you have found it. Your best (some would say only)
weapon is your imagination, fueled by scepticism...
try and get a better understanding might be to check data before it's
stored. Again, I still don't know how it would get messed up nor can I
reproduce the error on my own.

Do any of you think that would be a good practice for trying to track
this down? (Check the data after reading it, check the data before
saving it)

Nothing wrong with doing that to find a bug - not as a general
practice, of course - that would be too pessimistic.

In hard to find bugs - doing anything to narrow the time and place
of the error down is fair game - the object is to get you to read
some code that you *know works* with new eyes...

I build in a global boolean variable that I call trace, and when it's on
I do all sort of weird stuff, giving a running commentary (either by
print or in some log like file) of what the programme is doing,
like read this, wrote that, received this, done that here, etc.
A bare useful minimum is a "we get here" indicator like the routine
name, but the data helps a lot too.

Compared to an assert, it does not stop the execution, and you
could get lucky by cross correlating such "traces" from different
threads. - or better, if you use a queue or a pipe for the "log",
you might see the timing relationships directly.

But this in itself is fraught with danger, as you can hit file size
limits, or slow the whole thing down to unusability.

On the other hand it does not generate the volume that a genuine
trace does, it is easier to read, and you can limit it to the bits that
you are currently suspicious of.
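
A sketch of that trace-flag-plus-queue idea; the names are illustrative,
but the point is that one consumer thread drains a single queue, so
entries from different threads come out in the order they were logged:

import queue, threading, time

TRACE = True                 # the global boolean; turn off in production
_trace_q = queue.Queue()

def trace(msg, *args):
    if TRACE:
        _trace_q.put((time.time(),
                      threading.current_thread().name,
                      msg % args))

def _drain(path="trace.log"):
    with open(path, "a") as f:
        while True:
            stamp, name, text = _trace_q.get()
            f.write("%.3f %-12s %s\n" % (stamp, name, text))
            f.flush()

threading.Thread(target=_drain, daemon=True).start()

# Sprinkled through the suspect code paths:
#   trace("read %r from %s", data, source)
#   trace("about to store %r", data)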

Programming is such fun...

hth - Hendrik
 

Nick Maclaren

|>
|> What makes you think Paddy indicated he wouldn't try to solve the problem?
|> Here's what he wrote:
|>
|> What I'm proposing is that if, for example, a process stops running
|> three times in a year at roughly three to four months intervals, and it
|> should have stayed up; then restart the server sooner, at a time of
|> your choosing, whilst taking other measures to investigate the error.
|>
|> I see nothing wrong with trying to minimize the chances of a problem rearing
|> its ugly head while at the same time trying to investigate its cause (and
|> presumably solve it).

No, nor do I, but look more closely. His quote makes it quite clear that
he has got it firmly in his mind that this is a degradation problem, and
so regular restarting will improve the reliability. Well, it could also
be one where failure becomes LESS likely the longer the server stays up
(i.e. the "settling down" problem).

No problem is as hard to find as one where you are firmly convinced that
it is somewhere other than where it is.


Regards,
Nick Maclaren.
 

Paddy

|> What makes you think Paddy indicated he wouldn't try to solve the problem?
|> Here's what he wrote:
|>
|> What I'm proposing is that if, for example, a process stops running
|> three times in a year at roughly three to four months intervals, and it
|> should have stayed up; then restart the server sooner, at a time of
|> your choosing, whilst taking other measures to investigate the error.
|>
|> I see nothing wrong with trying to minimize the chances of a problem rearing
|> its ugly head while at the same time trying to investigate its cause (and
|> presumably solve it).

No, nor do I, but look more closely. His quote makes it quite clear that
he has got it firmly in his mind that this is a degradation problem, and
so regular restarting will improve the reliability. Well, it could also
be one where failure becomes LESS likely the longer the server stays up
(i.e. the "settling down" problem).

If in the past year the settling down problem did not rear its head
when the server crashed after three to four months and was restarted,
then why not implement a regular, notified downtime - whilst also
looking into the problem in more depth?

* You are already having to restart.
* Restarts last for 3-4 months.
Why burden yourself with "Oh but it could fail once in three hours,
you've not proved that it can't, we'll have to stop everything whilst
we do a thorough investigation. Is it Poisson? Is it 'settling down'?
Just wait whilst I prepare my next doctoral thesis... "

- Okay, the last was extreme, but cathartic :)
No problem is as hard to find as one where you are firmly convinced that
it is somewhere other than where it is. Amen!

Regards,
Nick Maclaren.
- Paddy.
 

Carl J. Van Arsdall

Hendrik said:

Are you 100% rock bottom gold plated guaranteed sure that there is
not something else that is also critical that you just haven't realised is?
100%? No, definitely not. I know myself, as I explore this option and
other options, I will of course be going into and out of the code,
looking for that small piece I might have missed. But I'm like a modern
operating system, I do lots of things at once. So after being unable to
solve it the first few times, I thought to pose a question, but as I
pose the question that never means that I'm done looking at my code and
hoping I missed something. I'd much rather have this be my fault...
that means I have a much higher probability of fixing it. But I sought
to explore some tips given to me. Ah, but the day I could be 100%
sure, that would be a good day (hell, I'd go ask for a raise for being
the best coder ever!)
This stuff is never obvious before the fact - and always seems stupid
afterward, when you have found it. Your best (some would say only)
weapon is your imagination, fueled by scepticism...
Yea, seriously!

Nothing wrong with doing that to find a bug - not as a general
practice, of course - that would be too pessimistic.

In hard to find bugs - doing anything to narrow the time and place
of the error down is fair game - the object is to get you to read
some code that you *know works* with new eyes...
I really like that piece of wisdom, I'll add that to my list of coding
mantras. Thanks!
I build in a global boolean variable that I call trace, and when it's on
I do all sort of weird stuff, giving a running commentary (either by
print or in some log like file) of what the programme is doing,
like read this, wrote that, received this, done that here, etc.
A bare useful minimum is a "we get here" indicator like the routine
name, but the data helps a lot too.
Yea, I do some of that too. I use that with conditional print
statements to stderr when I'm doing my validation against my test
cases. But I could definitely do more of them. The thing will be
simulating the failure. In the production server, thousands of printed
messages would be bad.

I've done short but heavy simulations, but to no avail. For example,
I'll have a couple of systems infinitely loop and beat on the system.
This is a much heavier load than the system will ever normally face, as
it's hit a lot at once and then idles for a while. The test environment
constantly hits it, and I let that run for several days. Maybe a longer
run is needed, but how long is reasonable before determining that it's
something beyond my control?
Compared to an assert, it does not stop the execution, and you
could get lucky by cross correlating such "traces" from different
threads. - or better, if you use a queue or a pipe for the "log",
you might see the timing relationships directly.
Ah, store the logs in a rotating queue of fixed size? That would work
pretty well to maintain control on a large run, thanks!
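
Something like that rotating queue can be had almost for free with a
bounded deque; the size and names here are only illustrative:

import collections, threading, time

_buf = collections.deque(maxlen=10000)   # keeps only the last 10000 entries
_buf_lock = threading.Lock()

def trace(msg):
    with _buf_lock:
        _buf.append("%.3f %s %s" % (time.time(),
                                    threading.current_thread().name, msg))

def dump(path="crash-trace.log"):
    # Call this when the bad-data check fires.
    with _buf_lock:
        with open(path, "w") as f:
            f.write("\n".join(_buf))
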
But this in itself is fraught with danger, as you can hit file size
limits, or slow the whole thing down to unusability.

On the other hand it does not generate the volume that a genuine
trace does, it is easier to read, and you can limit it to the bits that
you are currently suspicious of.

Programming is such fun...
Yea, I'm one of those guys who really gets a sense of satisfaction out
of coding. Thanks for the tips.

-carl

--

Carl J. Van Arsdall
(e-mail address removed)
Build and Release
MontaVista Software
 

Hendrik van Rooyen

8< ---------------------------------------------------
Yea, I do some of that too. I use that with conditional print
statements to stderr when I'm doing my validation against my test
cases. But I could definitely do more of them. The thing will be

When I read this - I thought - probably your stuff is working
perfectly - on your test cases - you could try to send it some
random data and to see what happens - seeing as you have a test
server, throw the kitchen sink at it.

Possibly "random" here means something that "looks like" data
but that is malformed in some way. Kind of try to "trick" the
system to get it to break reliably.

I'm sorry I can't be more specific - it sounds so weak, and you
probably already have test cases that "must fail" but I don't
know how to put it any better...
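
A crude way to throw that kitchen sink is to take known-good requests
and randomly corrupt a few bytes before sending them; send_request()
below is just a stand-in for the real client call:

import random

def mutate(payload):
    # Assumes non-empty byte payloads.
    data = bytearray(payload)
    for _ in range(random.randint(1, 5)):
        data[random.randrange(len(data))] = random.randrange(256)
    return bytes(data)

def fuzz(good_payloads, send_request, rounds=10000):
    for _ in range(rounds):
        payload = mutate(random.choice(good_payloads))
        try:
            send_request(payload)
        except Exception as err:
            print("server choked on %r: %s" % (payload, err))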

- Hendrik
 

Carl J. Van Arsdall

Hendrik said:
[snip]
could definitely do more of them. The thing will be

When I read this - I thought - probably your stuff is working
perfectly - on your test cases - you could try to send it some
random data and to see what happens - seeing as you have a test
server, throw the kitchen sink at it.

Possibly "random" here means something that "looks like" data
but that is malformed in some way. Kind of try to "trick" the
system to get it to break reliably.

I'm sorry I can't be more specific - it sounds so weak, and you
probably already have test cases that "must fail" but I don't
know how to put it any better...
Well, sometimes a weak analogy is the best thing because it allows me to
fill in the blanks: "How can I throw a kitchen sink at it in a way I
never have before?"

And away my mind goes, so thank you.

-carl

--

Carl J. Van Arsdall
(e-mail address removed)
Build and Release
MontaVista Software
 

Aahz

Well, I assume that I have done all I can to make it reliable. This
list is usually my last resort, or a place where I come hoping to find
ideas that aren't coming to me naturally. The only other thing I
thought to come up with was that there might be network errors. But
I've gone back and forth on that, because TCP should handle that for me
and I shouldn't have to deal with it directly in Pyro, although I've
added (and continue to add) checks in places that appear appropriate
(and in some cases, checks because I prefer to be paranoid about errors).

My point is that an app that dies only once every few months under load
is actually pretty damn stable! That is not the kind of problem that
you are likely to stimulate.
Right, I wasn't coming here to get someone to debug my app, I'm just
looking for ideas. I constantly am trying to find new ways to improve
my software and new ways to reduce bugs, and when I get really stuck,
new ways to track bugs down. The exception won't mean much, but I can
say that the error appears to me as bad data. I do checks prior to
performing actions on any data, if the data doesn't look like what it
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad. In
tracking down the problem a couple guys mentioned that problems like
that usually are a race condition. From here I examined my code,
checked out all the locking stuff, made sure it was good, and wasn't
able to find anything. Being that there's one lock and the critical
sections are well defined, I'm having difficulty. One idea I have to
try and get a better understanding might be to check data before it's
stored. Again, I still don't know how it would get messed up nor can I
reproduce the error on my own.

Do any of you think that would be a good practice for trying to track
this down? (Check the data after reading it, check the data before
saving it)

What we do at my company is maintain log files. When we think we have
identified a potential choke point for problems, we add a log call.
Tracking this down will involve logging the changes to your data until
you can figure out where it goes wrong -- once you know where it goes
wrong, you have an excellent chance of figuring out why.
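
With the standard logging module that sort of choke-point log can stay
in permanently and be turned up only while hunting; store_record() is a
placeholder for the real code path:

import logging

logging.basicConfig(filename="datapath.log", level=logging.INFO,
                    format="%(asctime)s %(threadName)s %(message)s")
log = logging.getLogger("datapath")

def store_record(record):
    log.info("about to store %r", record)    # choke point: before the write
    # ... the real storage code goes here ...
    log.info("stored %r", record)            # choke point: after the write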
 

Steve Holden

Carl said:
Aahz said:
[snip]

My response is that you're asking the wrong questions here. Our database
server locked up hard Sunday morning, and we still have no idea why (the
machine itself, not just the database app). I think it's more important
to focus on whether you have done all that is reasonable to make your
application reliable -- and then put your efforts into making your app
recoverable.
Well, I assume that I have done all I can to make it reliable. This
list is usually my last resort, or a place where I come hoping to find
ideas that aren't coming to me naturally. The only other thing I
thought to come up with was that there might be network errors. But
I've gone back and forth on that, because TCP should handle that for me
and I shouldn't have to deal with it directly in Pyro, although I've
added (and continue to add) checks in places that appear appropriate
(and in some cases, checks because I prefer to be paranoid about errors).

I'm particularly making this comment in the context of your later point
about the bug showing up only every three or four months.

Side note: without knowing what error messages you're getting, there's
not much anybody can say about your programs or the reliability of
threads for your application.
Right, I wasn't coming here to get someone to debug my app, I'm just
looking for ideas. I constantly am trying to find new ways to improve
my software and new ways to reduce bugs, and when I get really stuck,
new ways to track bugs down. The exception won't mean much, but I can
say that the error appears to me as bad data. I do checks prior to
performing actions on any data, if the data doesn't look like what it
should look like, then the system flags an exception.

The problem I'm having is determining how the data went bad. In
tracking down the problem a couple guys mentioned that problems like
that usually are a race condition. From here I examined my code,
checked out all the locking stuff, made sure it was good, and wasn't
able to find anything. Being that there's one lock and the critical
sections are well defined, I'm having difficulty. One idea I have to
try and get a better understanding might be to check data before it's
stored. Again, I still don't know how it would get messed up nor can I
reproduce the error on my own.

Do any of you think that would be a good practice for trying to track
this down? (Check the data after reading it, check the data before
saving it)
Are you using memory with built-in error detection and correction?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007
 

John Nagle

Aahz said:
My point is that an app that dies only once every few months under load
is actually pretty damn stable! That is not the kind of problem that
you are likely to stimulate.

This has all been so vague. How does it die?

It would be useful if Python detected obvious deadlock. If all threads
are blocked on mutexes, you're stuck, and at that point, it's time
to abort and do tracebacks on all threads. You shouldn't have to
run under a debugger to detect that.

Then a timer, so that if the global interpreter lock (GIL)
stays locked for more than N seconds, you get an abort and a traceback.
That way, if you get stuck in some C library, it gets noticed.

Those would be some good basic facilities to have in thread support.

In real-time work, you usually have a high-priority thread which
wakes up periodically and checks that a few flags have been set
indicating progress of the real time work, then clears the flags.
Throughout the real time code, flags are set indicating progress
for the checking thread to notice. All serious real time systems
have some form of stall timer like that; there's often a stall
timer in hardware.
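
In Python, a rough approximation of that stall timer is an ordinary
watchdog thread plus sys._current_frames() for the tracebacks; the
interval and the flag scheme below are only a sketch:

import sys, threading, time, traceback

progress = {}                        # thread name -> "made progress" flag

def made_progress():
    # Workers call this at known-good points in their loop.
    progress[threading.current_thread().name] = True

def watchdog(interval=60):
    while True:
        time.sleep(interval)
        stalled = [name for name, ok in list(progress.items()) if not ok]
        if stalled:
            sys.stderr.write("stalled threads: %s\n" % stalled)
            for tid, frame in sys._current_frames().items():
                sys.stderr.write("--- thread id %s ---\n" % tid)
                traceback.print_stack(frame, file=sys.stderr)
        for name in list(progress):
            progress[name] = False   # must be set again before the next check

threading.Thread(target=watchdog, daemon=True).start()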

John Nagle
 

Carl J. Van Arsdall

Steve said:
[snip]

Are you using memory with built-in error detection and correction?
You mean in the hardware? I'm not really sure; I'd assume so, but is
there any way I can check on this? If the hardware isn't doing that, is
there anything I can do with my software to offer more stability?





--

Carl J. Van Arsdall
(e-mail address removed)
Build and Release
MontaVista Software
 
