Mechanize, MySQL and threads - deadlock?

Marc Weber

First of all: I'm still new to Ruby, so pointing me to documentation or
books is fine.

Use case:

Use Mechanize to gather information. Because there are many pages, I'd
like to run multiple threads, each fetching pages. The fetched data
should be written to a MySQL database.

Can you point me to information telling me how to do this?

The failure looks like this now:

/pr/tasks/get_data_ruby/tasks.rb:364:in `join': deadlock detected (fatal)
from /pr/tasks/get_data_ruby/tasks.rb:364:in `block in run_tasks_wait'
from /pr/tasks/get_data_ruby/tasks.rb:364:in `each'
from /pr/tasks/get_data_ruby/tasks.rb:364:in `run_tasks_wait'
from get-data.rb:37:in `<mai

What causes such deadlocks in the first place?

Details about my implementation:
=================================
Ruby version: ruby 1.9.1p378 (2010-01-10 revision 26273) [x86_64-linux]
sequel-3.8.0
mysqlplus-0.1.1

Because things always go wrong, I'd like to store state in the database
so that I can resume work where the script failed.

To keep things simple I tried giving each thread its own agent and DB
connection:


def newDBConnection
  Sequel.connect(
    :adapter  => 'mysql',
    :user     => 'root',
    :host     => 'localhost',
    :database => 'get_data',
    :password => 'XXX')
end

# share one agent and db connection per thread
class MyThread < Thread
  def agent
    if !@agent
      @agent = Mechanize.new
      @agent.max_history = 1
    end
    @agent
  end

  def db
    @dbCache ||= newDBConnection
  end
end

Next I defined a task which reuses the db and Mechanize agent from the
thread which is running the task:

class Task
  def run
    # remember the running thread, then call task, which subclasses override
    @thread = Thread.current
    task
  end

  def agent
    @agent ||= @thread.agent
  end

  def db
    @dbCache ||= @thread.db
  end
end



Next I wrote a simple function taking a thread class (MyThread), a list
of tasks and the number of threads. It spawns parallel threads, each
taking tasks from the task list (a Queue). Any running task may add more
tasks to the queue. The script should run until all tasks are done.

# t: class extending Thread
# tasks: type Queue.new
# parallel: num of threads used to run those tasks
def run_tasks_wait(t, tasks, parallel)
  working = 0
  threads = []
  # start `parallel` worker threads
  (1..parallel).each {|i|
    threads << t.new {
      firstTime = true
      while working > 0 || firstTime
        firstTime = false
        while task = tasks.pop
          working += 1
          $log.debug("starting task #{task.to_s}")
          $log.catchAndLog "caught exception in main worker thread" do
            task.run if !task.nil?
          end
          $log.debug("finished task #{task.to_s} threads-working: #{working}")
          working -= 1
        end
        # even if there is nothing left in the queue, keep this thread alive
        # while another thread is still working: that thread may push
        # additional tasks to the queue
        sleep 1
      end
    }
  }
  # wait for threads
  threads.each {|t| t.join() }
end
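
For comparison, here is a minimal sketch of how the same worker loop could be written while keeping the Queue. It is not the code from the script above: the name run_tasks_wait_sketch is hypothetical, logging is left out, and it assumes that tasks are only pushed by other running tasks. The two ideas it illustrates are a non-blocking pop (Queue#pop(true) raises ThreadError on an empty queue instead of sleeping) and a counter that is only touched while holding a Mutex.

```ruby
require 'thread' # needed for Queue on Ruby 1.9, harmless on later versions

# Hypothetical variant of run_tasks_wait; assumes every task responds to #run.
def run_tasks_wait_sketch(thread_class, tasks, parallel)
  lock = Mutex.new
  working = 0 # number of tasks currently running, only touched under lock

  threads = (1..parallel).map do
    thread_class.new do
      loop do
        # Fetch a task and decide whether we are done in one locked step.
        task, done = lock.synchronize do
          t = begin
            tasks.pop(true) # non-blocking: raises ThreadError when empty
          rescue ThreadError
            nil
          end
          working += 1 if t
          # done: nothing queued and no running task that could push more
          [t, t.nil? && working.zero?]
        end

        break if done

        if task
          begin
            task.run
          ensure
            lock.synchronize { working -= 1 }
          end
        else
          sleep 0.1 # queue empty, but another task may still add work
        end
      end
    end
  end

  # wait for threads
  threads.each { |thread| thread.join }
end
```

Because no worker ever sleeps inside a blocking pop, the final join always has at least one thread that can still make progress or exit.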


Thanks for any pointers
Marc Weber
 
Marc Weber

# t: class extending Thread
# tasks: type Queue.new
# parallel: num of threads used to run those tasks
def run_tasks_wait(t, tasks, parallel)
Replacing the Queue with an Array seems to fix the issue.

Marc
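
A likely explanation for why the Array helps, shown as a small sketch rather than a confirmed diagnosis of the script above: Array#pop returns nil on an empty array, while Queue#pop puts the calling thread to sleep until something is pushed. If every worker is asleep inside Queue#pop and the main thread is waiting in join, no thread can ever wake up, and Ruby reports that as a deadlock. The variable names below are made up for the illustration.

```ruby
require 'thread' # needed for Queue on Ruby 1.9, harmless on later versions

queue = Queue.new
array = []

# Array#pop on an empty array returns nil immediately.
array_result = array.pop

# Queue#pop(true) is the non-blocking form: it raises ThreadError when empty.
queue_result = begin
  queue.pop(true)
rescue ThreadError
  :empty
end

# A plain Queue#pop blocks. The join uses a timeout here so that this
# script itself does not run into the deadlock it is describing.
blocked = Thread.new { queue.pop }
still_waiting = blocked.join(0.2).nil? # nil means the thread has not finished
blocked.kill

puts "Array#pop:       #{array_result.inspect}"
puts "Queue#pop(true): #{queue_result.inspect}"
puts "Queue#pop still blocked after 0.2s: #{still_waiting}"
```

Note that an Array is not a drop-in thread-safe replacement for a Queue, so a non-blocking pop on the Queue may be the safer way to get the same effect.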
 
