Page crawling and URL grabbing

Patrick L.

Hey guys,
I'm trying to write an application that goes onto a website (istockphoto
specifically), opens up istockphoto.com/file_browse.php and grabs the
URLs of the photos that appear there.

It's my first time doing something like this. I'm reading some
documentation right now...but a hand would be greatly appreciated. I'm
not really sure how to do regex on an HTML file...or even find the right
stuff within that file. I'm guessing it's...

open('http://www.istockphoto.com/file_browse.php/') do |f|
f.find # dot something something
end

but I really have no idea. Any help would be great - thanks in advance!
 
Jesús Gabriel y Galán

Generally speaking, regular expressions are not the best tool to extract
information from HTML. Take a look at these other tools:

Mechanize
Hpricot
Scrubyt
Nokogiri
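
To see why regular expressions tend to be brittle on HTML, here is a minimal sketch. The markup and the example.com URLs are made up for illustration and are not taken from the istockphoto page. The pattern assumes one exact attribute order, quoting style, and letter case.

```ruby
# Hypothetical markup (assumption): three equivalent ways to write the same tag.
html = <<~HTML
  <img class="searchImg" src="http://example.com/a.jpg">
  <img src='http://example.com/b.jpg' class='searchImg'>
  <IMG CLASS="searchImg"
       SRC="http://example.com/c.jpg">
HTML

# A naive pattern: class first, then src, double quotes, lowercase, one line.
naive = /<img class="searchImg" src="([^"]+)">/

# scan returns one array per match; flatten gives the captured URLs.
urls = html.scan(naive).flatten
puts urls.inspect
```

Only the first tag matches. The second (single quotes, src before class) and the third (uppercase, split across two lines) are slipped past the pattern, which is the kind of variation an HTML parser absorbs for you.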

This is an example that might get you started, although I recommend taking
a look at the above tools:

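require 'open-uri'
require 'hpricot'

# Fetch the page and parse it (on Ruby 3.0+, call URI.open instead of open)
h = Hpricot(open("http://www.istockphoto.com/file_browse.php"))
# Select every element whose class attribute is searchImg
imgs = h.search("//[@class = searchImg]")
# Collect the src attribute of each matching element
imgs.map {|img| img["src"]}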

# => ["http://www2.istockphoto.com/file_th...3/1/istockphoto_8137463-budapest-by-night.jpg",
"http://www2.istockphoto.com/file_th...8139472-four-antique-wood-tennis-racquets.jpg",
"http://www2.istockphoto.com/file_th...0/1/istockphoto_6731990-two-female-lovers.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8308377/1/istockphoto_8308377-beauty.jpg",
"http://www2.istockphoto.com/file_th...ckphoto_6349299-lovers-interested-in-smth.jpg",
"http://www2.istockphoto.com/file_th...03/1/istockphoto_8322403-happy-piggy-bank.jpg",
"http://www2.istockphoto.com/file_th...-cetara-little-town-in-amalfi-coast-italy.jpg",
"http://www2.istockphoto.com/file_th...94/1/istockphoto_8322394-yellow-red-paper.jpg",
"http://www1.istockphoto.com/file_th...stockphoto_4660654-the-art-of-eye-shadows.jpg",
"http://www1.istockphoto.com/file_th...photo_8301075-3d-render-of-the-olive-tree.jpg",
"http://www1.istockphoto.com/file_thumbview_approve/6921717/1/istockphoto_6921717-manicure.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8322391/1/istockphoto_8322391-pomegranate.jpg",
"http://www2.istockphoto.com/file_th.../istockphoto_8138975-junger-mann-seitlich.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139815/1/istockphoto_8139815-winter.jpg",
"http://www2.istockphoto.com/file_th...53-beadworkafrican_pictureframe_p3406-jpg.jpg",
"http://www2.istockphoto.com/file_th...7/1/istockphoto_8139787-statue-of-liberty.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8322388/1/istockphoto_8322388-cold-winter-day.jpg",
"http://www2.istockphoto.com/file_th...2/1/istockphoto_8139602-statue-of-liberty.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8137801/1/istockphoto_8137801-litchi.jpg",
"http://www2.istockphoto.com/file_th...6/1/istockphoto_8139406-statue-of-liberty.jpg",
"http://www1.istockphoto.com/file_th...stockphoto_6850893-polka-dot-wedding-cake.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139802/1/istockphoto_8139802-snow-woman.jpg",
"http://www2.istockphoto.com/file_th.../istockphoto_8322364-white-cherry-blossom.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139808/1/istockphoto_8139808-airport.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8322357/1/istockphoto_8322357-ciruit.jpg",
"http://www2.istockphoto.com/file_thumbview_approve/8139597/1/istockphoto_8139597-cheese-and-wine.jpg",
"http://www2.istockphoto.com/file_th.../1/istockphoto_8138075-employee-of-office.jpg"]


You should customize the criteria used to choose the images. In my little
example I selected all tags with the class searchImg, which at a quick
glance seemed to be what you wanted, but double check.

I recall reading somewhere that Nokogiri has better XPath support than
Hpricot, so check it out.

Jesus.
 
Miroslaw Niegowski

2009/1/27 Patrick L. said:
Hey guys,
I'm trying to write an application that goes onto a website (istockphoto
specifically), opens up istockphoto.com/file_browse.php and grabs the
URLs of the photos that appear there.

It's my first time doing something like this. I'm reading some
documentation right now...but a hand would be greatly appreciated. I'm
not really sure how to do regex on an HTML file...or even find the right
stuff within that file. I'm guessing it's...

open('http://www.istockphoto.com/file_browse.php/') do |f|
f.find # dot something something
end


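Try Mechanize.
It's easy:

require 'mechanize'

agent = Mechanize.new  # WWW::Mechanize.new in releases before 1.0
agent.user_agent_alias = 'Mac Safari'
page = agent.get('http://www.istockphoto.com/file_browse.php')
page.links_with(:text => /jpg/)  # links whose text matches /jpg/
...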
 
Tsunami Script

Mechanize is very easy and intuitive. You could basically learn to use
it just by playing with it in irb. Combine that with reading the docs,
and you're good to go.
 
