Page crawling and URL grabbing

Discussion in 'Ruby' started by Patrick L., Jan 27, 2009.

  1. Patrick L.

    Patrick L. Guest

    Hey guys,
    I'm trying to write an application that goes onto a website (istockphoto
    specifically), opens up istockphoto.com/file_browse.php and grabs the
    URLs of the photos that appear there.

    It's my first time doing something like this. I'm reading some
    documentation right now...but a hand would be greatly appreciated. I'm
    not really sure how to do regex on an HTML file...or even find the right
    stuff within that file. I'm guessing it's something like:

    open('http://www.istockphoto.com/file_browse.php/') do |f|
    f.find # dot something something
    end

    but I really have no idea. Any help would be great - thanks in advance!
    --
    Posted via http://www.ruby-forum.com/.
    Patrick L., Jan 27, 2009
    #1

  2. On Tue, Jan 27, 2009 at 1:55 AM, Patrick L. wrote:
    > Hey guys,
    > I'm trying to write an application that goes onto a website (istockphoto
    > specifically), opens up istockphoto.com/file_browse.php and grabs the
    > URLs of the photos that appear there.
    >
    > It's my first time doing something like this. I'm reading some
    > documentation right now...but a hand would be greatly appreciated. I'm
    > not really sure how to do regex on an html file...or even find the right
    > stuff within that file. I'm guessing its..


    Generally speaking, regular expressions are not the best tool to extract
    information from HTML. Take a look at these other tools:

    Mechanize
    Hpricot
    Scrubyt
    Nokogiri

    This is an example that might get you started, although I recommend taking
    a look at the above tools:

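    require 'open-uri'
    require 'hpricot'

    # Fetch the browse page and parse it with Hpricot
    h = Hpricot(open("http://www.istockphoto.com/file_browse.php"))
    # Select every element whose class attribute is searchImg
    imgs = h.search("//[@class = searchImg]")
    # Collect the src attribute of each matched element
    imgs.map {|img| img["src"]}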

    # => ["http://www2.istockphoto.com/file_thumbview_approve/8137463/1/istockphoto_8137463-budapest-by-night.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8139472/1/istockphoto_8139472-four-antique-wood-tennis-racquets.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/6731990/1/istockphoto_6731990-two-female-lovers.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8308377/1/istockphoto_8308377-beauty.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/6349299/1/istockphoto_6349299-lovers-interested-in-smth.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8322403/1/istockphoto_8322403-happy-piggy-bank.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8138976/1/istockphoto_8138976-tower-guard-of-cetara-little-town-in-amalfi-coast-italy.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8322394/1/istockphoto_8322394-yellow-red-paper.jpg",
    "http://www1.istockphoto.com/file_thumbview_approve/4660654/1/istockphoto_4660654-the-art-of-eye-shadows.jpg",
    "http://www1.istockphoto.com/file_thumbview_approve/8301075/1/istockphoto_8301075-3d-render-of-the-olive-tree.jpg",
    "http://www1.istockphoto.com/file_thumbview_approve/6921717/1/istockphoto_6921717-manicure.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8322391/1/istockphoto_8322391-pomegranate.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8138975/1/istockphoto_8138975-junger-mann-seitlich.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8139815/1/istockphoto_8139815-winter.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8137153/1/istockphoto_8137153-beadworkafrican_pictureframe_p3406-jpg.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8139787/1/istockphoto_8139787-statue-of-liberty.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8322388/1/istockphoto_8322388-cold-winter-day.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8139602/1/istockphoto_8139602-statue-of-liberty.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8137801/1/istockphoto_8137801-litchi.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8139406/1/istockphoto_8139406-statue-of-liberty.jpg",
    "http://www1.istockphoto.com/file_thumbview_approve/6850893/1/istockphoto_6850893-polka-dot-wedding-cake.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8139802/1/istockphoto_8139802-snow-woman.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8322364/1/istockphoto_8322364-white-cherry-blossom.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8139808/1/istockphoto_8139808-airport.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8322357/1/istockphoto_8322357-ciruit.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8139597/1/istockphoto_8139597-cheese-and-wine.jpg",
    "http://www2.istockphoto.com/file_thumbview_approve/8138075/1/istockphoto_8138075-employee-of-office.jpg"]


    You should customize the criteria used to choose the images. In my little
    example I selected every tag with the class searchImg, which at a quick
    glance seemed to be what you wanted, but double-check.
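
    To make "customize the criteria" a bit more concrete, here is a rough
    sketch of how narrowing the XPath changes what gets matched. It uses
    REXML from Ruby's standard distribution only so that it runs without extra
    gems or network access. REXML is an XML parser and needs well-formed
    markup, so for the real page Hpricot or Nokogiri is still the tool to use.
    SAMPLE_MARKUP, image_sources and tags_with_class are made-up names for
    illustration, and the markup is not what istockphoto actually serves.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for the browse page (assumption:
# the real page is larger and may not be valid XML).
SAMPLE_MARKUP = <<~HTML
  <div>
    <img class="searchImg" src="http://example.com/thumbs/1.jpg"/>
    <img class="logo" src="http://example.com/logo.png"/>
    <a class="searchImg" href="http://example.com/file/2">details</a>
    <img class="searchImg" src="http://example.com/thumbs/2.jpg"/>
  </div>
HTML

# Names of all elements, of any tag, whose class attribute equals css_class.
def tags_with_class(markup, css_class)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//*[@class='#{css_class}']").map(&:name)
end

# src values of <img> elements only, with the same class test.
def image_sources(markup, css_class)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//img[@class='#{css_class}']")
              .map { |img| img.attributes['src'] }
end

p tags_with_class(SAMPLE_MARKUP, 'searchImg')
p image_sources(SAMPLE_MARKUP, 'searchImg')
```

    The first query also picks up the <a> element, which has no src, so
    restricting the expression to img keeps stray entries out of the result.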

    I recall reading somewhere that Nokogiri has better XPath support than
    Hpricot, so check it out.

    Jesus.
    Jesús Gabriel y Galán, Jan 27, 2009
    #2

  3. 2009/1/27 Patrick L.:
    > Hey guys,
    > I'm trying to write an application that goes onto a website (istockphoto
    > specifically), opens up istockphoto.com/file_browse.php and grabs the
    > URLs of the photos that appear there.
    >
    > It's my first time doing something like this. I'm reading some
    > documentation right now...but a hand would be greatly appreciated. I'm
    > not really sure how to do regex on an html file...or even find the right
    > stuff within that file. I'm guessing its..
    >
    > open('http://www.istockphoto.com/file_browse.php/') do |f|
    > f.find # dot something something
    > end



    Try Mechanize.
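    It's easy:

    require 'rubygems'
    require 'mechanize'

    agent = WWW::Mechanize.new   # newer Mechanize releases drop the WWW:: prefix
    agent.user_agent_alias = 'Mac Safari'   # send a browser-like User-Agent
    page = agent.get('http://www.istockphoto.com/file_browse.php')
    page.links.text(/jpg/)   # links whose text matches /jpg/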
    ...
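
    Whichever library hands you the raw href or src values, you will probably
    still want to keep only the image files and turn relative paths into
    absolute URLs. A minimal sketch with the standard uri library, assuming
    the values are already collected in an array; jpg_urls and the sample
    paths are invented for illustration and are not taken from the real page:

```ruby
require 'uri'

BASE_URL = 'http://www.istockphoto.com/file_browse.php'

# Resolve each href against the page URL, keep only .jpg/.jpeg paths,
# and drop duplicates.
def jpg_urls(base, hrefs)
  hrefs.map { |href| URI.join(base, href) }
       .select { |uri| uri.path =~ /\.jpe?g\z/i }
       .map(&:to_s)
       .uniq
end

# Hypothetical values such as a scraper might return.
sample_hrefs = [
  '/file_thumbview_approve/1/1/sample-one.jpg',
  'http://www2.istockphoto.com/file_thumbview_approve/2/1/sample-two.jpg',
  '/index.php',
  '/file_thumbview_approve/1/1/sample-one.jpg'
]

puts jpg_urls(BASE_URL, sample_hrefs)
```

    Matching on the parsed path rather than the whole string means a query
    string that merely contains "jpg" does not count as an image.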
    Miroslaw Niegowski, Jan 27, 2009
    #3
  4. Patrick L.

    Patrick L. Guest

    Miroslaw Niegowski wrote:
    > 2009/1/27 Patrick L.:
    >> open('http://www.istockphoto.com/file_browse.php/') do |f|
    >> f.find # dot something something
    >> end

    >
    >
    > Try Mechanize.
    > It's easy :
    >
    > agent = WWW::Mechanize.new
    > agent.user_agent_alias='Mac Safari'
    > page = agent.get('http://www.istockphoto.com/file_browse.php');
    > page.links.text(/jpg/)
    > ...


    That's great, or it sounds great. Is there any documentation aside from
    blog posts and this: http://mechanize.rubyforge.org/mechanize/ ? What
    did you use to learn it?

    --
    Posted via http://www.ruby-forum.com/.
    Patrick L., Jan 27, 2009
    #4
  5. Mechanize is very easy and intuitive. You could basically learn to
    use it just by playing with it in irb. Combine that with reading
    the docs, and you're good to go.
    --
    Posted via http://www.ruby-forum.com/.
    Tsunami Script, Jan 27, 2009
    #5