CPython thread starvation

John Nagle

I have a multi-threaded CPython program, which has up to four
threads. One thread is simply a wait loop monitoring the other
three and waiting for them to finish, so it can give them more
work to do. When the work threads, which read web pages and
then parse them, are compute-bound, I've had the monitoring thread
starved of CPU time for as long as 120 seconds.
It's sleeping for 0.5 seconds, then checking on the other threads
and for new work to do, so the monitoring thread isn't using much
compute time.
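
In outline it's something like this (just a sketch; the method
names are made up):

    import time

    def monitor(workers, get_more_work):
        # Poll every half second; the monitor itself does almost no work.
        while True:
            time.sleep(0.5)
            for w in workers:
                if not w.is_busy():            # made-up "finished?" check
                    w.assign(get_more_work())  # hand the idle worker more work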

I know that the CPython thread dispatcher sucks, but I didn't
realize it sucked that bad. Is there a preference for running
threads at the head of the list (like UNIX, circa 1979) or
something like that?

(And yes, I know about "multiprocessing". These threads are already
in one of several service processes. I don't want to launch even more
copies of the Python interpreter. The threads are usually I/O bound,
but when they hit unusually long web pages, they go compute-bound
during parsing.)

Setting "sys.setcheckinterval" from the default to 10000 seems
to have little effect. This is on Windows 7.

John Nagle
 
Paul Rubin

John Nagle said:
I know that the CPython thread dispatcher sucks, but I didn't
realize it sucked that bad. Is there a preference for running
threads at the head of the list (like UNIX, circa 1979) or
something like that?

I think it's left up to the OS thread scheduler, Windows in your case.
See http://www.dabeaz.com/python/NewGIL.pdf starting around slide 18.

One idea that comes to mind is putting a periodic interrupt and signal
handler into your main thread, to make sure the GIL gets released every
so often.
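
Roughly this, at least on POSIX (SIGALRM and setitimer aren't
available on Windows, so treat it as a sketch of the idea rather
than a drop-in fix):

    import signal

    def _tick(signum, frame):
        # The handler body doesn't matter; delivering the signal forces the
        # interpreter to schedule the main thread so it can run the handler.
        pass

    signal.signal(signal.SIGALRM, _tick)
    signal.setitimer(signal.ITIMER_REAL, 0.5, 0.5)   # fire every 0.5 s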
 
Adam Skutt

     I have a multi-threaded CPython program, which has up to four
threads.  One thread is simply a wait loop monitoring the other
three and waiting for them to finish, so it can give them more
work to do.  When the work threads, which read web pages and
then parse them, are compute-bound, I've had the monitoring thread
starved of CPU time for as long as 120 seconds.

How exactly are you determining that this is the case?
    I know that the CPython thread dispatcher sucks, but I didn't
realize it sucked that bad.  Is there a preference for running
threads at the head of the list (like UNIX, circa 1979) or
something like that?

Not in CPython, which is at the mercy of what the operating system
does. Under the covers, CPython uses a semaphore on Windows, and Windows
semaphores do not have FIFO ordering, as per http://msdn.microsoft.com/en-us/library/windows/desktop/ms685129(v=vs.85).aspx.
As a result, I think your thread is succumbing to the same issues that
impact signal delivery, as described on slides 22-24 and 35-41 of
http://www.dabeaz.com/python/GIL.pdf.

I'm not sure there's any easy or reliable way to "fix" that from your
code. I am not a WinAPI programmer though, and I'd suggest finding
one to help you out. It doesn't appear possible to change the
scheduling policy for semaphores programmatically, and I don't know
how closely they pay attention to thread priority.

That's just a guess though, and finding out for sure would take some
low-level debugging. However, it seems to be the most probable
situation assuming your code is correct.
    (And yes, I know about "multiprocessing".  These threads are already
in one of several service processes.  I don't want to launch even more
copies of the Python interpreter.

Why? There's little harm in launching more instances. Processes have
some additional startup and memory overhead compared to threads, but I
can't imagine it would be an issue. Given what you're trying to do,
I'd expect to run out of other resources long before I ran out of
memory because I created too many processes or threads.
 The threads are usually I/O bound,
but when they hit unusually long web pages, they go compute-bound
during parsing.)

If your concern is being CPU oversubscribed by using lots of
processes, I suspect it's probably misplaced. A whole mess of
CPU-bound tasks is pretty much the easiest case for a scheduler to
handle.

Adam
 
John Nagle

How exactly are you determining that this is the case?

Found the problem. The threads, after doing their
compute-intensive work of examining pages, stored some URLs they'd found.
The code that stored them looked them up with "getaddrinfo()", and
did this while a lock was set. On CentOS, "getaddrinfo()" at the
glibc level doesn't always cache locally (ref
https://bugzilla.redhat.com/show_bug.cgi?id=576801). Python
doesn't cache either. So huge numbers of DNS requests were being
made. For some pages being scanned, many of the domains required
accessing a rather slow DNS server. The combination of thousands
of instances of the same domain, a slow DNS server, and no caching
slowed the crawler down severely.

Added a local cache in the program to prevent this.
Performance much improved.
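
Conceptually the cache is just a dict in front of the lookup,
something like this (a sketch, not the actual code):

    import socket

    _addr_cache = {}   # lives only for the one site being analyzed

    def cached_getaddrinfo(host, port=80):
        key = (host, port)
        if key not in _addr_cache:
            _addr_cache[key] = socket.getaddrinfo(host, port)
        return _addr_cache[key]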

John Nagle
 
Paul Rubin

John Nagle said:
The code that stored them looked them up with "getaddrinfo()", and
did this while a lock was set.

Don't do that!!
Added a local cache in the program to prevent this.
Performance much improved.

Better to release the lock while the getaddrinfo is running, if you can.
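
That is, do the slow lookup with no lock held and re-acquire the lock
only to store the result, roughly (hypothetical names):

    import socket
    import threading

    store_lock = threading.Lock()   # hypothetical lock guarding the URL store
    stored = {}

    def store_url(url, host):
        addrinfo = socket.getaddrinfo(host, 80)   # slow; no lock held here
        with store_lock:                          # lock only for the fast update
            stored[url] = addrinfo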
 
John Nagle

Don't do that!!


Better to release the lock while the getaddrinfo is running, if you can.

I may do that to prevent the stall. But the real problem was all
those DNS requests. Parallelizing them wouldn't help much when it took
hours to grind through them all.

John Nagle
 
Paul Rubin

John Nagle said:
I may do that to prevent the stall. But the real problem was all
those DNS requests. Parallelizing them wouldn't help much when it took
hours to grind through them all.

True dat. But building a DNS cache into the application seems like a
kludge. Unless the number of requests is insane, running a caching
nameserver on the local box seems cleaner.
 
John Nagle

True dat. But building a DNS cache into the application seems like a
kludge. Unless the number of requests is insane, running a caching
nameserver on the local box seems cleaner.

I know. When I have a bit more time, I'll figure out why
CentOS 5 and Webmin didn't set up a caching DNS resolver by
default.

Sometimes the number of requests IS insane. When the
system hits a page with a thousand links, it has to resolve
all of them. (Beyond a thousand links, we classify it as
link spam and stop. The record so far is a page with over
10,000 links.)

John Nagle
 
Roy Smith

Paul Rubin said:
True dat. But building a DNS cache into the application seems like a
kludge. Unless the number of requests is insane, running a caching
nameserver on the local box seems cleaner.

I agree that application-level name cacheing is "wrong", but sometimes
doing it the wrong way just makes sense. I could whip up a simple
cacheing wrapper around getaddrinfo() in 5 minutes. Depending on the
environment (both technology and bureaucracy), getting a cacheing
nameserver installed might take anywhere from 5 minutes to a few days to
kicking a dead whale down the beach (if you need to involve your
corporate IT department) to it just ain't happening (if you need to
involve your corporate IT department).

Doing DNS cacheing correctly is non-trivial. In fact, if you're
building it on top of getaddrinfo(), it may be impossible, since I don't
think getaddrinfo() exposes all the data you need (i.e. TTL values).
But, doing a half-assed job of cache expiration is better than not
expiring your cache at all. I would suggest (from experience) that if
you build a getaddrinfo() wrapper, you have cache entries time out after
a fairly short time. From the problem description, it sounds like using
a 1-minute timeout would get 99% of the benefit and might keep you from
doing some bizarre things.
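
Something along these lines, say (illustrative only; a wrapper shared
between threads would also want a lock around the cache):

    import socket
    import time

    _cache = {}      # (host, port) -> (expires_at, result)
    _TTL = 60.0      # the short timeout suggested above

    def getaddrinfo_cached(host, port):
        key = (host, port)
        hit = _cache.get(key)
        if hit is not None and hit[0] > time.time():
            return hit[1]
        result = socket.getaddrinfo(host, port)
        _cache[key] = (time.time() + _TTL, result)
        return result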

PS -- I've also learned by experience that nscd can mess up. If DNS
starts doing stuff that doesn't make sense, my first line of attack is
usually killing and restarting the local nscd. Often enough, that
solves the problem, and it rarely causes any problems that anybody would
notice.
 
Paul Rubin

Roy Smith said:
I agree that application-level name cacheing is "wrong", but sometimes
doing it the wrong way just makes sense. I could whip up a simple
cacheing wrapper around getaddrinfo() in 5 minutes. Depending on the
environment (both technology and bureaucracy), getting a cacheing
nameserver installed might take anywhere from 5 minutes to a few days to ...

IMHO this really isn't one of those times. The in-app wrapper would
only be usable to just that process, and we already know that the OP has
multiple processes running the same app on the same machine. They would
benefit from being able to share the cache, so now your wrapper gets
more complicated. If it's not a nameserver then it's something that
fills in for one. And then, since the application appears to be a large
scale web spider, it probably wants to run on a cluster, and the cache
should be shared across all the machines. So you really probably want
an industrial strength nameserver with a big persistent cache, and maybe
a smaller local cache because of high locality when crawling specific
sites, etc.

Also, since this is a production application, doing something in 5
minutes is less important than making it solid and configurable.
 
John Nagle

IMHO this really isn't one of those times. The in-app wrapper would
only be usable to just that process, and we already know that the OP has
multiple processes running the same app on the same machine. They would
benefit from being able to share the cache, so now your wrapper gets
more complicated. If it's not a nameserver then it's something that
fills in for one. And then, since the application appears to be a large
scale web spider, it probably wants to run on a cluster, and the cache
should be shared across all the machines. So you really probably want
an industrial strength nameserver with a big persistent cache, and maybe
a smaller local cache because of high locality when crawling specific
sites, etc.

Each process is analyzing one web site, and has its own cache.
Once the site is analyzed, which usually takes about a minute,
the cache disappears. Multiple threads are reading multiple pages
from the web site during that time.

A local cache is enough to fix the huge overhead problem of
doing a DNS lookup for every link found. One site with a vast
number of links took over 10 hours to analyze before this fix;
now it takes about four minutes. That solved the problem.
We can probably get an additional minor performance boost with a real
local DNS daemon, and will probably configure one.

We recently changed servers from Red Hat to CentOS, and management
from CPanel to Webmin. Before the change, we had a local DNS daemon
with cacheing, so we didn't have this problem. Webmin's defaults
tend to be on the minimal side.

The DNS information is used mostly to help decide whether two URLs
actually point to the same IP address, as part of deciding whether a
link is on-site or off-site. Most of those links will never be read.
We're not crawling the entire site, just looking at likely pages to
find the name and address of the business behind the site. (It's
part of our "Know who you're dealing with" system, SiteTruth.)

John Nagle
 
Roy Smith

Paul Rubin said:
IMHO this really isn't one of those times. The in-app wrapper would
only be usable to just that process, and we already know that the OP has
multiple processes running the same app on the same machine. They would
benefit from being able to share the cache, so now your wrapper gets
more complicated.

So, use memcache. Trivial to set up, easy Python integration, and it
has the expiration mechanism built in. Not to mention it has a really
cute web site (http://memcached.org/).
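For example, with the python-memcached client (one of several Python
bindings; the key scheme and 60-second expiry here are just illustrative):

    import socket
    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def resolve(host):
        key = 'dns:' + host
        addr = mc.get(key)
        if addr is None:
            addr = socket.getaddrinfo(host, 80)[0][4][0]   # first resolved IP
            mc.set(key, addr, time=60)                     # expire after 60 s
        return addr

That way every worker process on the box shares one cache.
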
Also, since this is a production application, doing something in 5
minutes is less important than making it solid and configurable.

Maybe. On the other hand, the time you save with a 5 minute solution
can be spent solving other, harder, problems.
 
