John Bokma said:
> Good question. I think that ParallelUA = UA + threading, and doesn't
> add anything, and since UA is more a core module, I prefer the latter.
I think ParallelUA = UA + non-blocking IO, rather than threading. Assuming
it is well implemented (I haven't used ParallelUA enough to know), I think
non-blocking IO is better than threads for this task.
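For what it's worth, the non-blocking approach looks roughly like this with LWP::Parallel::UserAgent — a minimal sketch based on its documented interface; the URLs and the connection limit are just placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Parallel::UserAgent;
use HTTP::Request;

# One process, one event loop: all registered requests are serviced
# concurrently with non-blocking IO -- no threads, no forking.
my $pua = LWP::Parallel::UserAgent->new;
$pua->max_req(5);     # illustrative cap on simultaneous requests
$pua->timeout(30);

for my $url (qw(http://example.com/a http://example.com/b)) {
    # register() queues a request; it returns an error response on failure
    if (my $err = $pua->register(HTTP::Request->new(GET => $url))) {
        warn $err->error_as_HTML;
    }
}

# wait() runs the event loop until every registered request is done
my $entries = $pua->wait;
for my $key (keys %$entries) {
    my $res = $entries->{$key}->response;
    printf "%s: %s\n", $res->request->uri, $res->code;
}
```

Since everything stays in one process, the results are right there in Perl memory when wait() returns, which is one practical advantage over forking.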
> Also, with ParallelUA the documentation was a bit unclear to me.
OK, fair enough. Anything in particular you found unclear?
> I want to have n workers in parallel, each getting a request from a
> queue, fetching the page, storing the result, and moving on to the next.
Do you store the results on the filesystem, in a DB, or in Perl memory?
Is the queue dynamically added to (based on the results returned from
earlier tasks in the queue) or is it built in a start-up phase and then
only consumed from then on?
If the queue is dynamically added to, that argues for threads. If each
page-fetch takes less than 1/20 of a second or so (and there are tens of
thousands of them), that also argues for threads (although I might instead
just batch them up into chunks of several page fetches). Otherwise, I'd
go with forking with Parallel::ForkManager (or ParallelUA).
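The forking version is short with Parallel::ForkManager. A sketch for the static-queue case — worker count, URLs, and filenames are illustrative; note that forked children cannot add to the parent's queue or share Perl memory, which is why the dynamic-queue case argues for threads instead:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
use LWP::UserAgent;

my $max_workers = 5;                          # "n workers", tune to taste
my $pm = Parallel::ForkManager->new($max_workers);
my $ua = LWP::UserAgent->new(timeout => 30);

# Queue built up front in a start-up phase, then only consumed
my @queue = qw(http://example.com/a http://example.com/b);

URL: for my $url (@queue) {
    $pm->start and next URL;                  # parent forks and moves on
    # --- child process from here ---
    my $res = $ua->get($url);
    if ($res->is_success) {
        # store on the filesystem: a child's memory dies with it
        my ($name) = $url =~ m{([^/]+)\z};
        open my $fh, '>', "fetched-$name" or die "open: $!";
        print {$fh} $res->decoded_content;
        close $fh;
    }
    $pm->finish;                              # child exits
}
$pm->wait_all_children;                       # parent reaps all workers
```

Batching several URLs per child, as suggested above, just means looping over a slice of @queue between start() and finish(), which amortizes the fork cost when individual fetches are very fast.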
> Sleeping (not wasting CPU cycles) in between each fetch.
This part I'm not sure of. Why sleep rather than just fetch the next
item from the queue? Are you sleeping only in the case of an empty queue
(which of course only makes sense if the queue is dynamic)? Or to avoid
overloading the remote server(s) you are fetching from?
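If it's the last of those — throttling so you don't hammer the remote server — the usual pattern is a fixed delay between requests inside each worker, something like this sketch (the delay value and URL list are made up):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Time::HiRes qw(sleep);   # allows fractional-second delays

my $ua    = LWP::UserAgent->new(timeout => 30);
my $delay = 1.5;             # seconds between fetches; illustrative value
my @urls  = qw(http://example.com/a http://example.com/b);

for my $url (@urls) {
    my $res = $ua->get($url);
    # ... store $res->decoded_content somewhere ...
    sleep $delay;            # be polite to the remote server
}
```

With n workers all pointed at the same host you'd want the per-worker delay scaled up accordingly, since the server sees the aggregate request rate.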
Xho