Thread and HTTP troubles

Keegan Dunn

I'm trying to write a threaded program that will run through a list of
web sites and download/process a set number of them at a
time (maintaining a pool of threads that can process page
downloads/processing). I have something simple working, but I am
unsure how to approach the "pool" of threads idea. Is that even the
way to go about processing multiple pages simultaneously? Is there a
better way?

Also, how can I deal with a "socket read timeout" error? I have the
http get call wrapped in a begin...rescue...end block, but it doesn't
seem to be catching it. Here is the code in question:

def getHTTP(site)
  siteHost = site.gsub(/http:\/\//,'').gsub(/\/.*/,'')
  begin
    masterSite = Net::HTTP.new(siteHost, 80)
    siteURL = "/" + site.gsub(/http:\/\//,'').gsub(siteHost,'')
    resp, data = masterSite.get2(siteURL, nil)
    return data
  rescue
    return "-999"
  end
end


Sorry about the two for one question :p

Thanks!
 

Robert Klemme

Keegan Dunn said:
I'm trying to write a threaded program that will run through a list of
web sites and download/process a set number of them at a
time (maintaining a pool of threads that can process page
downloads/processing). I have something simple working, but I am
unsure how to approach the "pool" of threads idea. Is that even the
way to go about processing multiple pages simultaneously? Is there a
better way?

It's most likely the most efficient way. You need these ingredients:

- a thread safe queue
- a pool of processors
- a main thread that does the distribution of work

You also likely want to have a class or method that deals with the details
of fetching data and analysing / storing it to keep thread body blocks
small.

# untested but you'll get the picture
require 'thread'

THREADS = 10
TERM = Object.new
queue = Queue.new
threads = []

THREADS.times do
  threads << Thread.new(queue) do |q|
    until TERM == (url = q.deq)
      begin
        # get data from url
      rescue
        # in case of timeout try again by putting it back
      end
    end
  end
end

# now read urls and distribute work
while (line = gets)
  line.chomp!
  queue.enq line
end

# write terminators
THREADS.times { queue.enq TERM }

# ... and wait for threads to terminate properly
threads.each { |t| t.join }

# exiting
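To make the "put it back" comment concrete, here is a runnable variant of the same pool. It is only a sketch: fake_fetch and the example urls are stand-ins, not from the original; a real worker would call Net::HTTP there and re-enqueue the url on timeout.

```ruby
THREADS = 4
TERM = Object.new
queue = Queue.new      # Queue is thread safe; very old Rubies need require 'thread'
results = Queue.new    # thread-safe collector for processed pages

def fake_fetch(url)
  "body of #{url}"     # stands in for the real HTTP fetch (assumption)
end

threads = Array.new(THREADS) do
  Thread.new do
    until TERM == (url = queue.deq)
      begin
        results.enq(fake_fetch(url))
      rescue
        queue.enq(url)   # in case of failure, put the url back for a retry
      end
    end
  end
end

%w[http://a.example http://b.example http://c.example].each { |u| queue.enq(u) }
THREADS.times { queue.enq TERM }
threads.each(&:join)

puts results.size   # prints 3
```

One caveat with the re-enqueue trick: if a url is put back after the terminators have been queued, the retry lands behind the TERM objects, so a real version would want a retry limit or a separate retry queue.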
Also, how can I deal with a "socket read timeout" error? I have the
http get call wrapped in a begin...rescue...end block, but it doesn't
seem to be catching it. Here is the code in question:

def getHTTP(site)
  siteHost = site.gsub(/http:\/\//,'').gsub(/\/.*/,'')
  begin
    masterSite = Net::HTTP.new(siteHost, 80)
    siteURL = "/" + site.gsub(/http:\/\//,'').gsub(siteHost,'')
    resp, data = masterSite.get2(siteURL, nil)
    return data
  rescue
    return "-999"
  end
end

You'll likely need to catch another exception. Try "rescue Exception => e"
and then print e's class.
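A minimal sketch of that diagnostic; the raise here stands in for the failing HTTP call, purely for illustration:

```ruby
require 'timeout'

begin
  # stands in for the Net::HTTP call that was timing out
  raise Timeout::Error, "execution expired"
rescue Exception => e
  puts e.class   # prints Timeout::Error
end
```

Once the class is known, narrow the rescue to it; leaving `rescue Exception` in place would also swallow things like Interrupt.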
Sorry about the two for one question :p

You get one answer for free. :)

Kind regards

robert
 
L

Leslie Hensley

You'll also want to require 'resolv-replace'. Otherwise all of your
threads will block whenever any thread does a name lookup. Hopefully
this won't be needed once Rite gets here...
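Loading it is a one-liner; a small sketch (resolving localhost, so it runs without a network):

```ruby
require 'resolv-replace'   # swaps the blocking C resolver for the pure-Ruby Resolv library
require 'net/http'

# With resolv-replace loaded, TCPSocket (and therefore Net::HTTP)
# resolves hostnames via Resolv in Ruby code, which lets other
# threads run during a DNS lookup instead of blocking them all.
addr = Resolv.getaddress("localhost")
puts addr
```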

Leslie Hensley

 

Keegan Dunn

I noticed the threads were doing that. I meant to ask about that as
well. Thank you for the help, Leslie and Robert.


You'll also want to require 'resolv-replace'. Otherwise all of your
threads will block whenever any thread does a name lookup. Hopefully
this won't be needed once Rite gets here...

Leslie Hensley



 

Jim Weirich

Robert Klemme said:

You'll likely need to catch another exception. Try "rescue Exception =>
e" and then print e's class.

The error in question is Timeout::Error which inherits from Interrupt
which in turn inherits from SignalException. Since a plain vanilla rescue
clause will only rescue exceptions deriving from StandardError (and
SignalException is not derived from StandardError), it won't pick up this
exception.

If you use

begin
  # stuff
rescue Timeout::Error => ex
  # handle timeout
end

you should be ok.
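A self-contained sketch of the fix, with Timeout.timeout plus a sleep standing in for the slow socket read so it runs without a network. The method name and the 0.05 s deadline are illustrative only; note that later Rubies moved Timeout::Error in the exception hierarchy, but rescuing it by name works either way.

```ruby
require 'timeout'

# Simulates getHTTP's failure mode: the "read" outlives the deadline,
# Timeout::Error is raised, and the explicit rescue converts it into
# the "-999" sentinel the original code used.
def fetch_with_deadline
  Timeout.timeout(0.05) do
    sleep 1            # stands in for a slow Net::HTTP socket read
    "page body"
  end
rescue Timeout::Error
  "-999"               # caught because the class is named explicitly
end

puts fetch_with_deadline   # prints -999
```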
 

Keegan Dunn

Thank you for the elaboration.


Jim Weirich said:


The error in question is Timeout::Error which inherits from Interrupt
which in turn inherits from SignalException. Since a plain vanilla rescue
clause will only rescue exceptions deriving from StandardError (and
SignalException is not derived from StandardError), it won't pick up this
exception.

If you use

begin
  # stuff
rescue Timeout::Error => ex
  # handle timeout
end

you should be ok.
 
