Thread and HTTP troubles

Keegan Dunn

I'm trying to write a threaded program that will run through a list of
web sites and download/process a set number of them at a
time (maintaining a pool of threads that can process page
downloads/processing). I have something simple working, but I am
unsure how to approach the "pool" of threads idea. Is that even the
way to go about processing multiple pages simultaneously? Is there a
better way?

Also, how can I deal with a "socket read timeout" error? I have the
http get call wrapped in a begin...rescue...end block, but it doesn't
seem to be catching it. Here is the code in question:

def getHTTP(site)
  siteHost = site.gsub(/http:\/\//,'').gsub(/\/.*/,'')
  begin
    masterSite = Net::HTTP.new(siteHost, 80)
    siteURL = "/" + site.gsub(/http:\/\//,'').gsub(siteHost,'')
    resp, data = masterSite.get2(siteURL, nil)
    return data
  rescue
    return "-999"
  end
end


Sorry about the two for one question :p

Thanks!
 

Robert Klemme

Keegan Dunn said:
I'm trying to write a threaded program that will run through a list of
web sites and download/process a set number of them at a
time (maintaining a pool of threads that can process page
downloads/processing). I have something simple working, but I am
unsure how to approach the "pool" of threads idea. Is that even the
way to go about processing multiple pages simultaneously? Is there a
better way?

It's most likely the most efficient way. You need these ingredients:

- a thread safe queue
- a pool of processors
- a main thread that does the distribution of work

You also likely want to have a class or method that deals with the details
of fetching data and analysing / storing it to keep thread body blocks
small.

# untested but you'll get the picture
require 'thread'

THREADS = 10
TERM = Object.new
queue = Queue.new
threads = []

THREADS.times do
  threads << Thread.new(queue) do |q|
    until TERM == (url = q.deq)
      begin
        # get data from url
      rescue
        # in case of timeout try again by putting it back
      end
    end
  end
end

# now read urls and distribute work
while (line = gets)
  line.chomp!
  queue.enq line
end

# write terminators
THREADS.times { queue.enq TERM }

# ... and wait for threads to terminate properly
threads.each { |t| t.join }

# exiting
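To make the "put it back" comment concrete, here is a runnable variant of the same pool. It is only a sketch: fake_fetch and the example urls are stand-ins, not from the original; a real worker would call Net::HTTP there and re-enqueue the url on timeout.

```ruby
THREADS = 4
TERM = Object.new
queue = Queue.new      # Queue is thread safe; very old Rubies need require 'thread'
results = Queue.new    # thread-safe collector for processed pages

def fake_fetch(url)
  "body of #{url}"     # stands in for the real HTTP fetch (assumption)
end

threads = Array.new(THREADS) do
  Thread.new do
    until TERM == (url = queue.deq)
      begin
        results.enq(fake_fetch(url))
      rescue
        queue.enq(url)   # in case of failure, put the url back for a retry
      end
    end
  end
end

%w[http://a.example http://b.example http://c.example].each { |u| queue.enq(u) }
THREADS.times { queue.enq TERM }
threads.each(&:join)

puts results.size   # prints 3
```

One caveat with the re-enqueue trick: if a url is put back after the terminators have been queued, the retry lands behind the TERM objects, so a real version would want a retry limit or a separate retry queue.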
Also, how can I deal with a "socket read timeout" error? I have the
http get call wrapped in a begin...rescue...end block, but it doesn't
seem to be catching it. Here is the code in question:

def getHTTP(site)
  siteHost = site.gsub(/http:\/\//,'').gsub(/\/.*/,'')
  begin
    masterSite = Net::HTTP.new(siteHost, 80)
    siteURL = "/" + site.gsub(/http:\/\//,'').gsub(siteHost,'')
    resp, data = masterSite.get2(siteURL, nil)
    return data
  rescue
    return "-999"
  end
end

You'll likely need to catch another exception. Try "rescue Exception => e"
and then print e's class.
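A minimal sketch of that diagnostic; the raise here stands in for the failing HTTP call, purely for illustration:

```ruby
require 'timeout'

begin
  # stands in for the Net::HTTP call that was timing out
  raise Timeout::Error, "execution expired"
rescue Exception => e
  puts e.class   # prints Timeout::Error
end
```

Once the class is known, narrow the rescue to it; leaving `rescue Exception` in place would also swallow things like Interrupt.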
Sorry about the two for one question :p

You get one answer for free. :)

Kind regards

robert
 
L

Leslie Hensley

You'll also want to require 'resolv-replace'. Otherwise all of your
threads will block whenever any thread does a name lookup. Hopefully
this won't be needed once Rite gets here...
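Loading it is a one-liner; a small sketch (resolving localhost, so it runs without a network):

```ruby
require 'resolv-replace'   # swaps the blocking C resolver for the pure-Ruby Resolv library
require 'net/http'

# With resolv-replace loaded, TCPSocket (and therefore Net::HTTP)
# resolves hostnames via Resolv in Ruby code, which lets other
# threads run during a DNS lookup instead of blocking them all.
addr = Resolv.getaddress("localhost")
puts addr
```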

Leslie Hensley

 

Keegan Dunn

I noticed the threads were doing that. I meant to ask about that as
well. Thank you for the help, Leslie and Robert.


You'll also want to require 'resolv-replace'. Otherwise all of your
threads will block whenever any thread does a name lookup. Hopefully
this won't be needed once Rite gets here...

Leslie Hensley



 

Jim Weirich

Robert Klemme said:

You'll likely need to catch another exception. Try "rescue Exception =>
e" and then print e's class.

The error in question is Timeout::Error which inherits from Interrupt
which in turn inherits from SignalException. Since a plain vanilla rescue
clause will only rescue exceptions deriving from StandardError (and
SignalException is not derived from StandardError), it won't pick up this
exception.

If you use

begin
  # stuff
rescue Timeout::Error => ex
  # handle timeout
end

you should be ok.
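A self-contained sketch of the fix, with Timeout.timeout plus a sleep standing in for the slow socket read so it runs without a network. The method name and the 0.05 s deadline are illustrative only; note that later Rubies moved Timeout::Error in the exception hierarchy, but rescuing it by name works either way.

```ruby
require 'timeout'

# Simulates getHTTP's failure mode: the "read" outlives the deadline,
# Timeout::Error is raised, and the explicit rescue converts it into
# the "-999" sentinel the original code used.
def fetch_with_deadline
  Timeout.timeout(0.05) do
    sleep 1            # stands in for a slow Net::HTTP socket read
    "page body"
  end
rescue Timeout::Error
  "-999"               # caught because the class is named explicitly
end

puts fetch_with_deadline   # prints -999
```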
 

Keegan Dunn

Thank you for the elaboration.


Jim Weirich said:


The error in question is Timeout::Error which inherits from Interrupt
which in turn inherits from SignalException. Since a plain vanilla rescue
clause will only rescue exceptions deriving from StandardError (and
SignalException is not derived from StandardError), it won't pick up this
exception.

If you use

begin
  # stuff
rescue Timeout::Error => ex
  # handle timeout
end

you should be ok.
 
