screen scraping programaticprogrammer.com ?

7stud -- · Sep 13, 2007

The following is from "Programming Ruby" (2nd edition), p. 133:

----
require "net/http"

h = Net::HTTP.new("www.programaticprogrammer.com", 80)
response = h.get("/index.html")

if response.message == "OK"
  puts response.body.scan(/<img src="(.*?)"/m).uniq
end
----

It doesn't work: nothing is printed. So, I modified it a little:

-----
require "net/http"

h = Net::HTTP.new("www.programaticprogrammer.com", 80)
response = h.get("/index.html")

puts response.message
puts response.code

if response.message == "OK"
  puts "*"
  puts response.body.scan(/<img src="(.*?)"/m).uniq
end
-----

and the output was:

Found
302

I clicked a link on their home page and tried to access the page that
was displayed, but I got the same result. What am I doing wrong?

Ronald Fischer · Sep 13, 2007

Wrong URL. How about using www.pragmaticprogrammer.com instead?

I think the prOGRamatic programmers are slowly dying out anyway in
favour of the pragmatic programmers.... ;-)

Ronald
-- 
Ronald Fischer
Phone: +49-89-452133-162

7stud -- · Sep 13, 2007

Ronald said:
Wrong URL. How about using www.pragmaticprogrammer.com instead?

Whoops. Thanks.
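
The "Found" / 302 output earlier in the thread is an HTTP redirect, not a page, which is why the `response.message == "OK"` branch never ran. As a rough sketch only (the method name `fetch_following_redirects` and the injectable `fetcher` argument are made up for illustration, not taken from the book), a script could follow the Location header a limited number of times before giving up:

```ruby
require "net/http"
require "uri"

# Follow up to `limit` redirects and return the final response.
# `fetcher` is a hypothetical hook: any callable that takes a URI and
# returns a Net::HTTPResponse. It defaults to Net::HTTP.get_response.
def fetch_following_redirects(url, limit = 5, fetcher = Net::HTTP.method(:get_response))
  raise ArgumentError, "too many redirects" if limit.zero?

  uri = URI(url)
  response = fetcher.call(uri)

  # Only 3xx responses are followed; everything else is returned as-is.
  return response unless response.is_a?(Net::HTTPRedirection)

  location = response["location"]
  # Some 3xx responses (304, for example) carry no Location header.
  return response if location.nil?

  # Location may be relative, so resolve it against the current URI.
  target = URI.join(uri.to_s, location).to_s
  fetch_following_redirects(target, limit - 1, fetcher)
end
```

Calling `fetch_following_redirects("http://www.pragmaticprogrammer.com/index.html")` and testing the result with `is_a?(Net::HTTPSuccess)` would then stand in for the `response.message == "OK"` check.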

John Joyce · Sep 14, 2007

Just remember that with screen scraping you are anticipating a file
served by a server, and on top of that you are generally anticipating
a very particular structure in that document. Web sites change
frequently and without notice, and even the smallest change can break
your scraper. So inspect the various pages of the sites you plan to
scrape carefully, and write your scraper to check for what it expects
and not fail if something isn't found.
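
As a hedged illustration of "check for things and don't fail", here is a small sketch. The helper names `image_sources` and `report_images` are invented for this example, and the regex is only a loosened version of the one in the original script. It tolerates missing bodies, other attributes before `src`, single or double quotes, and upper-case tags, and it warns instead of crashing when nothing is found:

```ruby
# Extract unique image URLs from an HTML string.
# Returns an empty array instead of raising when the body is nil,
# empty, or simply contains no <img> tags.
def image_sources(html)
  return [] if html.nil? || html.empty?

  # Allow other attributes before src, either quote style, any case.
  pattern = /<img\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']/i
  html.scan(pattern).flatten.uniq
end

# Print what was found, or warn (on stderr) that the page may have changed.
def report_images(html)
  sources = image_sources(html)
  if sources.empty?
    warn "no <img> tags found; the page layout may have changed"
  else
    puts sources
  end
  sources
end
```

A regex like this is still fragile in exactly the way described here; it only makes the failure quiet and visible rather than fatal.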

With some clever programming and a little knowledge of the site, you
can make a simple but smart scraper. However, it will still be pretty
fragile. HTML/XHTML is just too loose and too much like human
language: full of ambiguity and implicit meaning that humans would
get, but that machines work hard at and still fail.
