Joey Bersche
I've been experiencing an intermittent crash where no Python stack
trace is provided. It happens in a URL-downloading process that can
run for up to 12 hours and crawls about 50,000 URLs.
I'm using urllib2 for the downloads. There are 5-10 downloading
threads, and some custom website exploration code for providing the
urls to crawl.
The downloads are completed in memory (not piped), then saved to a
file. Per-domain / per-IP niceness guidelines are also upheld, so lots
of concurrent downloads and exploration tasks are either waiting or
running, sometimes up to 40 at once. As a result, I've seen the
process memory footprint climb upwards of 800 MB.
About 20-40% of the time, the entire process bails out with no stack
trace, at no consistent memory footprint or running time, sometimes
after as little as 2 hours. My guess is that there is a bug in urllib2
or some third-party software I'm using, or that it was not meant to be
run in a multithreaded environment. Decreasing the
bandwidth/aggressiveness of the crawler MAY have an effect on the
frequency, but I haven't done any formal 'studies' on that yet. My
current solution is to restart the crawler, but this is bad business
for the websites (recrawling) and costs me extra crawl time.
I bet that switching to a 1-download-per-process scenario, with Pyro
for IPC (to uphold niceness rules, etc.), would fix this situation.
From reading about similar SIGABRT issues, I suspect it has something
to do with the multithreading. But I figured I'd ask around before I
take such drastic measures.
Since the process is so long-running, I have not tried running strace,
and I'm not even sure if it would make sense to me or someone else.
Let me know if you have a method of catching just the last 1000 calls
and not saving earlier ones or whatever, if that would be useful.
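To make that idea concrete, here is a rough sketch of the kind of
"keep only the tail" filter I have in mind. It is an untested idea,
not something I have run against the crawler. The names
keep_last_lines and main are made up for illustration, and deque's
maxlen argument needs Python 2.6 or later, so it would not run on
2.4 as-is.

```python
import collections
import sys


def keep_last_lines(stream, limit=1000):
    """Return the final `limit` lines read from `stream`, oldest first."""
    # A deque with maxlen silently drops the oldest entry once it is full,
    # so memory use stays bounded no matter how long the input runs.
    tail = collections.deque(maxlen=limit)
    for line in stream:
        tail.append(line)
    return list(tail)


def main():
    # Intended use as a pipe filter: read everything from stdin and, once
    # the input ends, print only the lines that were still in the buffer.
    sys.stdout.writelines(keep_last_lines(sys.stdin))
```

Saved as, say, keep_tail.py (a hypothetical file name) with a call to
main() at the bottom, it could sit at the end of a pipe such as
"strace -f -p PID 2>&1 | python keep_tail.py > tail.txt", since strace
writes its trace to stderr. The output only appears once strace exits,
which should happen when the traced process dies.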
I'm using an older version of Python, 2.4.4c1. Since the bug is
intermittent, I'm not sure yet whether an upgrade to Python 2.5 has
solved my problem.
Does anyone have any clues for me to try? My threading code uses a
messaging queue per thread, and one notification queue that the main
thread checks and assigns new crawls back to free threads. No other
variables are referenced by multiple threads other than the thread
objects themselves (to my knowledge).
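For reference, the queue layout described above looks roughly like the
following. This is a stripped-down sketch, not the actual crawler:
fake_download stands in for the real urllib2 call, the niceness rules
are left out, and the names worker, crawl and STOP are invented for
the example. It is written for a current Python (the "except ... as"
syntax does not exist in 2.4).

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

STOP = None  # sentinel telling a worker thread to exit


def fake_download(url):
    # Placeholder for the real download; returns a dummy body.
    return "body of " + url


def worker(worker_id, inbox, notifications):
    # Each worker reads only its own inbox and reports on the shared queue.
    while True:
        url = inbox.get()
        if url is STOP:
            break
        try:
            result = fake_download(url)
        except Exception as exc:
            result = exc  # report the failure instead of dying silently
        notifications.put((worker_id, url, result))


def crawl(urls, num_workers=5):
    notifications = queue.Queue()
    inboxes = [queue.Queue() for _ in range(num_workers)]
    threads = [
        threading.Thread(target=worker, args=(i, inboxes[i], notifications))
        for i in range(num_workers)
    ]
    for thread in threads:
        thread.start()

    pending = list(urls)
    results = {}
    busy = 0
    # Hand one URL to each worker to get things started.
    for inbox in inboxes:
        if pending:
            inbox.put(pending.pop())
            busy += 1
    # The main thread waits for notifications and reassigns free workers.
    while busy:
        worker_id, url, result = notifications.get()
        results[url] = result
        busy -= 1
        if pending:
            inboxes[worker_id].put(pending.pop())
            busy += 1

    for inbox in inboxes:
        inbox.put(STOP)
    for thread in threads:
        thread.join()
    return results
```

The only objects shared between threads here are the queues
themselves; everything else is touched by the main thread alone.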