Net::HTTP.get has a 50K limit?

Meihua Liang

I'm trying to write a screen scraper and am running into what looks like a
50K limit on the data returned:

require "net/http"
begin
  Net::HTTP.start("www.washingtonpost.com", 80) { |http|
    response, = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026")
    data = response.body
    puts data.length
  }
rescue => err
  puts "Error: #{err}"
  exit
end

The last line prints 52166 (the file is considerably bigger). What did I
do wrong?
 
Vivek Nallur

Just a small change is required.

|require "net/http"
|begin
| Net::HTTP.start("www.washingtonpost.com", 80){ |http|
| response , = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026")

File.open("/some/file", "wb+") { |f|
  resp, = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026", nil) { |gotit|
    f.print(gotit)
  }
}


| data=response.body
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The data is not part of the response any more. This behaviour has changed
since Ruby 1.6.

regs
Vivek


| puts data.length
| }
|rescue => err
| puts "Error: #{err}"
| exit
|end
|
|The last line returns 52166 . (the file is considerable bigger) What did I
|do wrong?
 
Meihua Liang

I tried your suggestion (i.e. using a block and writing the output to a file;
see the copy below), but it still cuts the page short, just as before. The
actual web page is ~58K but I'm only getting ~51K. Any more suggestions?

require "net/http"
# using a block
Net::HTTP.start("www.washingtonpost.com", 80) { |http|
  File.open('result.txt', 'wb+') { |f|
    resp, = http.get('/wl/jobs/JS_JobSearch?TS=1012409733026', nil) { |str|
      f.print(str)
    }
  }
}
 
Robert Klemme

Meihua Liang said:
I tried your suggestion (i.e. using a block and writing the output to a file;
see the copy below), but it still cuts the page short, just as before. The
actual web page is ~58K but I'm only getting ~51K. Any more suggestions?

Did you verify with wget that the server actually serves the complete
document? If not, that's what I'd do.
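
Alongside checking with wget, the response headers Net::HTTP actually received can settle the same question from Ruby itself. Below is a minimal diagnostic sketch, not from the thread: the helper name and port parameter are my own choices. Comparing the server's advertised Content-Length (or a chunked Transfer-Encoding) against the received byte count shows whether the server sent less than expected or the data was lost on the client side.

```ruby
require "net/http"

# Hypothetical diagnostic helper: fetch a path, print the status line
# and every response header, then report how many body bytes actually
# arrived so they can be compared with the advertised Content-Length.
def dump_response_info(host, path, port = 80)
  Net::HTTP.start(host, port) do |http|
    response = http.get(path)
    puts "Status: #{response.code} #{response.message}"
    response.each_header { |name, value| puts "#{name}: #{value}" }
    puts "Body bytes received: #{response.body ? response.body.bytesize : 0}"
    response
  end
end

# Usage (hits the network):
#   dump_response_info("www.washingtonpost.com",
#                      "/wl/jobs/JS_JobSearch?TS=1012409733026")
```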

robert
 
Meihua Liang

Yes, the server serves the complete document. The browser renders it nicely,
and I also verified via wget, which gives a complete copy. I still don't know
why Net::HTTP prematurely cuts off the document.

meihua
 
Daniel Lichtenberger

Hi!

Meihua said:
Yes, the server serves the complete document. The browser renders it
nicely, and I also verified via wget, which gives a complete copy. I still
don't know why Net::HTTP prematurely cuts off the document.

I played around with your script a bit and noticed something strange: when
trying to fetch the file via telnet, it is also cut off early. However, as
you said, wget correctly retrieves the whole document. Why? wget sends a
User-Agent header field, and only then is the whole document served. So, by
adding a User-Agent header field to your request, it works for me (with
Ruby 1.8):

response = http.get("/wl/jobs/JS_JobSearch?TS=1012409733026",
                    {"user-agent" => "blub"})

returns around 60000 bytes in response.body.
When writing web spiders, you sometimes have to outsmart the web servers ;).

Hth,
Daniel
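
For completeness, Daniel's fix folded back into the original script might look like the sketch below. The helper name, port parameter, and User-Agent string are my own choices for illustration, not from the thread; any non-empty User-Agent appears to satisfy this server.

```ruby
require "net/http"

# Some servers (this one apparently included) truncate or vary the
# response for clients that send no User-Agent header; wget sends one
# by default, while the bare Net::HTTP script did not.
HEADERS = { "user-agent" => "ruby-example/1.0" }

# Fetch a path, always sending a User-Agent header, and return the body.
def fetch_with_agent(host, path, port = 80)
  Net::HTTP.start(host, port) do |http|
    response = http.get(path, HEADERS)
    response.body
  end
end

# Usage (hits the network):
#   body = fetch_with_agent("www.washingtonpost.com",
#                           "/wl/jobs/JS_JobSearch?TS=1012409733026")
#   puts body.length
```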
 
