open-uri / net/http bug?

Dick Davies

I was trying to use RSSscraper to pull some web forums, and something
low-level went bang in the Net::* libraries.

I found some old references to this error from last year, and I
got the impression it was platform specific?

Can anyone else let me know if this causes problems for them?

It's obviously site-specific: url = 'http://www.google.com' has no problems...

Here's the minimal code (the open(url) call is 'line 6' of regex.rb in the traces below):

require 'open-uri'

url = 'http://p218.ezboard.com/fdebatingukfrm9'
page = open(url).readlines

If I run this I get:

rasputin@lb:rss$ ./regex.rb
/data/ruby/lib/ruby/1.9/net/protocol.rb:135:in `sysread': End of file reached (EOFError)
from /data/ruby/lib/ruby/1.9/net/protocol.rb:135:in `rbuf_fill'
from /data/ruby/lib/ruby/1.9/net/protocol.rb:116:in `readuntil'
from /data/ruby/lib/ruby/1.9/net/protocol.rb:126:in `readline'
from /data/ruby/lib/ruby/1.9/net/http.rb:1850:in `read_status_line'
from /data/ruby/lib/ruby/1.9/net/http.rb:1839:in `read_new'
from /data/ruby/lib/ruby/1.9/net/http.rb:934:in `request'
from /data/ruby/lib/ruby/1.9/net/http.rb:834:in `request_get'
from /data/ruby/lib/ruby/1.9/open-uri.rb:545:in `proxy_open'
... 7 levels...
from /data/ruby/lib/ruby/1.9/open-uri.rb:134:in `open_uri'
from /data/ruby/lib/ruby/1.9/open-uri.rb:424:in `open'
from /data/ruby/lib/ruby/1.9/open-uri.rb:85:in `open'
from ./regex.rb:6

This is exactly the error I was getting from RSSscraper.
If it helps narrow it down, through a proxy I get:

rasputin@lb:rss$ ./regex.rb
/data/ruby/lib/ruby/1.9/open-uri.rb:574:in `proxy_open': 503 Service Unavailable (OpenURI::HTTPError)
from /data/ruby/lib/ruby/1.9/open-uri.rb:167:in `open_loop'
from /data/ruby/lib/ruby/1.9/open-uri.rb:164:in `catch'
from /data/ruby/lib/ruby/1.9/open-uri.rb:164:in `open_loop'
from /data/ruby/lib/ruby/1.9/open-uri.rb:134:in `open_uri'
from /data/ruby/lib/ruby/1.9/open-uri.rb:424:in `open'
from /data/ruby/lib/ruby/1.9/open-uri.rb:85:in `open'
from ./regex.rb:6
 
Chad Fowler

It appears to me that this site refuses to respond unless you have a
recognized User-agent set in the request header. That's probably the
problem with open-uri.

Chad
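
A minimal sketch of setting that header with open-uri, assuming the missing User-agent really is what the site objects to. open-uri treats string-keyed options as request header fields. The helper names and the user-agent string are made up for illustration, and on the Ruby used in this thread you would call plain open(url, ...) instead of URI.open.

```ruby
require 'open-uri'

# Assumption: any browser-like string will do; this value is illustrative only.
DEFAULT_USER_AGENT = 'Mozilla/5.0 (compatible; RSSscraper)'

# Hypothetical helper: builds the header hash handed to open-uri.
def request_headers(user_agent = DEFAULT_USER_AGENT)
  { 'User-Agent' => user_agent }
end

# Hypothetical helper: fetches the page and returns its lines.
# String keys in the options hash are sent as HTTP request headers.
def fetch_lines(url, user_agent = DEFAULT_USER_AGENT)
  URI.open(url, request_headers(user_agent)) { |page| page.readlines }
end

# Usage (needs network access):
#   page = fetch_lines('http://p218.ezboard.com/fdebatingukfrm9')
```

Nothing is fetched until fetch_lines is called, so the header can be checked on its own first.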
 
Dick Davies

Ah crap. wget worked fine.

Is there a workaround (other than wget'ting the file to a local
webserver and pulling it from there)? I can't see an easy way of
adding a User-agent header to the net/http.rb request.....
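
For reference, a rough sketch of how a header can be attached when using net/http directly: build the request object, set the field, then send it. The method names build_get and fetch_body are hypothetical, and 'RssScraper' is just the string used later in this thread.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a GET request carrying a custom User-Agent.
def build_get(url, user_agent)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request['User-Agent'] = user_agent # header names are case-insensitive
  [uri, request]
end

# Hypothetical helper: sends the request and returns the response body.
def fetch_body(url, user_agent)
  uri, request = build_get(url, user_agent)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request).body }
end

# Usage (needs network access):
#   body = fetch_body('http://p218.ezboard.com/fdebatingukfrm9', 'RssScraper')
```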
 
Dick Davies

Bad form to reply to myself, but for the record, adding a
header was incredibly easy:

.....
class DukPolScanner < RSSscraper::AbstractScanner
  def initialize
    @get_headers = {'User-agent' => 'RssScraper' }
.....

Thanks to Chad for the pointer, and to RSSscraper's creator for a
well-designed tool....
 
