trouble with threads


gm gm

I am trying to expand my web crawler to use multiple threads (with
mechanize), and I am having some trouble. It seems that each thread is
not creating a local variable, but rather that they are all sharing the
"index" variable below:

threads = []
mutex = Mutex.new

10.times do |i|
  # each thread starts at offset i and steps by 10 through @will_visit
  threads << Thread.new(i) { |index|
    while index < @will_visit.size
      current_link = @will_visit[index]
      begin
        index += 10
        puts current_link
        page = @agent.get(current_link)
        if page.kind_of?(WWW::Mechanize::Page)
          page.links.each do |link|
            mutex.synchronize do
              if validLink?(link)
                @will_visit.push(link.href)
              end
            end
          end
        end

        puts "Currently visiting page #{index} of #{@will_visit.size}"
      rescue Exception => msg
        puts "Error with " + current_link
        puts msg
        puts msg.backtrace
      end
    end
  }
end

threads.each { |t| t.join }

From what I have read on Google, the 'index' variable should be
independent between threads, but it seems that it is shared. The problem
may also be with the fact that @agent is shared, but I am not sure.
 

MenTaLguY

gm said:
From what I have read on Google, the 'index' variable should be
independent between threads, but it seems that it is shared. The problem
may also be with the fact that @agent is shared, but I am not sure.

'index' is indeed independent here; @agent being shared, however, is
very likely to cause problems. As far as I know, WWW::Mechanize agents
are not safe for use by multiple threads. Each thread will need its
own agent.

-mental
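
To make that suggestion concrete, here is a minimal sketch of the one-agent-per-thread approach. The seed URLs, the thread count, and the every-Nth-link work split are assumptions made up for the example, not code from this thread:

require 'rubygems'
require 'thread'
require 'mechanize'   # older mechanize versions expose the WWW::Mechanize namespace

urls  = ["http://example.com/a", "http://example.com/b"]   # hypothetical seed list
mutex = Mutex.new
found = []

threads = []
4.times do |i|
  threads << Thread.new(i) do |index|
    agent = WWW::Mechanize.new          # one agent per thread, never shared
    while index < urls.size
      page = agent.get(urls[index])
      if page.kind_of?(WWW::Mechanize::Page)
        # only the shared array needs the mutex
        mutex.synchronize { found.concat(page.links.map { |l| l.href }) }
      end
      index += 4                        # each thread takes every 4th URL
    end
  end
end

threads.each { |t| t.join }

With one agent per thread, no two requests ever touch the same agent's internal state.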
 

Judson Lester

MenTaLguY said:
'index' is indeed independent here; @agent being shared, however, is
very likely to cause problems. As far as I know, WWW::Mechanize agents
are not safe for use by multiple threads. Each thread will need its
own agent.

That agrees completely with my direct experience. WWW::Mechanize is,
sadly, not threadsafe - it reuses a buffer for each request, and if
you start a new request before another completes, the new request
will clobber the input buffer. I gladly share my painfully won
experience with you.

Judson
 

gm gm

Judson said:
That agrees completely with my direct experience. WWW::Mechanize is,
sadly, not threadsafe - it reuses a buffer for each request, and if
you start a new request before another completes, the new request
will clobber the input buffer. I gladly share my painfully won
experience with you.

Judson

That would explain the crazy output. I will try to modify it a bit and
see how it works. Thanks!
 

gm gm

Judson said:
That agrees completely with my direct experience. WWW::Mechanize is,
sadly, not threadsafe - it reuses a buffer for each request, and if
you start a new request before another completes, the new request
will clobber the input buffer. I gladly share my painfully won
experience with you.

Judson

That seemed to make a big difference. Also, do you think I need to put
a mutex around the 'out.puts' call (i.e. file output)?
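
As a concrete picture of what that would look like, here is a small sketch of wrapping writes to a shared file handle in a mutex so lines from different threads cannot interleave; the file name and the message text are made up for the example:

require 'thread'

out       = File.open("crawl.log", "w")   # hypothetical shared log file
out_mutex = Mutex.new

threads = []
4.times do |i|
  threads << Thread.new(i) do |index|
    # serialize writes so each line reaches the file intact
    out_mutex.synchronize { out.puts "thread #{index} finished a page" }
  end
end

threads.each { |t| t.join }
out.close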
 

7stud --

gm said:
I am trying to expand my web crawler to use multiple threads (with
mechanize), and I am having some trouble. It seems that each thread is
not creating a local variable, but rather that they are all sharing the
"index" variable


threads = []

10.times do |i|
  threads << Thread.new(i) do |index|
    if index == 0
      index += 100
    end

    puts index
  end
end

threads.each do |t|
  t.join
end

--output:--
100
1
2
3
4
5
6
7
8
9


If the threads shared the index variable, then each line of the output
would be 100. Instead, i is assigned to each thread's own index, and
nothing a thread does to its index variable has any effect on another
thread's index variable.
 
