confusion trying to get IMG tags from html page

pkellner

I'm trying to download images from a web page that has them listed with
html like what I've pasted below. Basically, I want to iterate through
all the <IMG tags and grab the SRC= info and download those files.
I've tried a bunch of things with not much luck. Here is my last
attempt. Any help would be appreciated.

require 'net/http'
require 'rexml/document'

Net::HTTP.start('www.myphotowebsite.com') do |http|
  response = http.get('/terry/temp/2005-06-18%20Kiss%20of%20Death%203/')
  puts "Code = #{response.code}"
  puts "Message = #{response.message}"
  #puts "Body = #{response.body}"

  #parser = HTMLTree::XMLParser.new(false,false)
  #parser.feed(client.getContent(url))
  xml=response.body

  xml.elements.each('//HREF]') do |node|

  end

<IMG SRC="/icons/image2.gif" ALT=""> <A HREF="IMG_1516.jpg">IMG_1516.jpg</A> 28-Jul-2005 08:59 233k
<IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A HREF="IMG_1517.jpg">IMG_1517.jpg</A> 18-Jun-2005 08:03 819k
<IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A HREF="IMG_1518.jpg">IMG_1518.jpg</A> 28-Jul-2005 09:00 398k
<I
 
Charles Steinman

That isn't valid XML (tags without matching end-tags must have a
trailing slash), so the parser probably doesn't understand it. Assuming
the HTML isn't too complicated, you should be able to get the info with
regular expressions.
 
pkellner

I was really hoping for some code or pseudocode. I'm new to Ruby and
have been thrashing over this for hours. I promise to put some back
later when I know more about this. (And sadly, I'm not a regular
expression wizard.)

Thanks


 
James Britt

pkellner said:
I was really hoping for some code or pseudo code. I'm new to ruby and
have been thrashing over this for hours. I promise to put some back
later when I know more about this. (and sadly, I'm not a regular
expression wizard)

I use WWW::Mechanize to slurp down numerous CafePress shop pages and
snarf out the img info, which I use to automagically create the product
pages for rubystuff.com.

The code sample here is a much simplified version.

Mechanize lets you use custom classes to encapsulate node types, which
in turn makes it simpler to manipulate assorted HTML elements. I need
to extract assorted data from image URLs, so I coded up some additional
trickery not shown here.

Also note that some sites reject bots, spiders, etc. when the declared
user-agent is not something acceptable. Hence the random selection from
UA here.

#!/usr/local/bin/ruby

require 'mechanize'

UA = [
  'Windows IE 6',
  'Windows Mozilla',
  'Mac Safari',
  'Mac Mozilla',
  'Linux Mozilla',
  'Linux Konqueror' ]

# Wrap certain nodes in an Img class to make
# node attribute access a bit easier to grok.
class Img
  attr_reader :alt, :src

  def initialize( node )
    @node = node
    @alt = ''
    @src = ''

    if @node.attributes[ 'alt' ]
      @alt = @node.attributes[ 'alt' ].to_s.strip
    end
    if @node.attributes[ 'src' ]
      @src = @node.attributes[ 'src' ].to_s.strip
    end
  end
end

# Now with Rails tote bags and thongs and stuff!
url = 'http://www.cafepress.com/rubyonrailsshop'

agent = WWW::Mechanize.new {|a| a.log = Logger.new( STDERR ) }
# rand( n ) returns an integer from 0 up to n - 1, so pass the full
# size; rand( UA.size - 1 ) would never pick the last alias.
agent.user_agent_alias = UA[ rand( UA.size ) ]

# This tells Mechanize to watch for certain elements, and
# map matching nodes to the keyed class. Here, when an img
# element is encountered, mechanize will use the node to create
# an Img object and store it for us.
agent.watch_for_set = { 'img' => Img }

page = agent.get( url )

# Get the watch items we're interested in
images = page.watches[ 'img' ]

# What did we get?
images.each do |img|
  p img.src
end

#----------------

Hope this helps.

Get Mechanize from rubyforge.org, from the wee project page.

http://rubyforge.org/projects/wee/


James Britt

--

http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
 
Brian Schröder

Or use the simplest variant:

require 'open-uri'

open('www.myphotowebsite.com') do |http|
  response = http.get('/terry/temp/2005-06-18%20Kiss%20of%20Death%203/')
  puts response.body.scan(/[^\t "'=]+\.(?:jpg|gif|png)/).flatten
end

regards,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/
 
Brian Schröder

Obviously that should have read:

require 'net/http'

Net::HTTP.start('www.myphotowebsite.com') do |http|
  response = http.get('/terry/temp/2005-06-18%20Kiss%20of%20Death%203/')
  puts response.body.scan(/[^\t "']+\.(?:jpg|gif|png)/).flatten
end
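
As a rough check of what this pattern picks up, the sketch below runs it against two entries of the directory listing from the original post, held in a local string so no network access is needed. The variable names `listing`, `broad_pattern`, and `matches` are made up for this illustration.

```ruby
# Two entries from the directory listing in the original post.
listing = <<~HTML
  <IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A HREF="IMG_1517.jpg">IMG_1517.jpg</A> 18-Jun-2005 08:03 819k
  <IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A HREF="IMG_1518.jpg">IMG_1518.jpg</A> 28-Jul-2005 09:00 398k
HTML

# The pattern from the post above: any run of characters other than tab,
# space, or quotes that ends in an image extension.
broad_pattern = /[^\t "']+\.(?:jpg|gif|png)/

matches = listing.scan(broad_pattern)
puts matches
# The result holds the icon path and each file name from HREF, but also
# the link text with a leading ">" (for example ">IMG_1517.jpg"),
# because ">" is not in the excluded character set.
```

So the broad pattern finds every name, at the cost of duplicates and a stray `>` on the link text; calling `uniq` on the result or restricting the match to text inside tags avoids that.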

regards,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/
 
William James

Brian said:
require 'net/http'

Net::HTTP.start('www.myphotowebsite.com') do |http|
  response = http.get('/terry/temp/2005-06-18%20Kiss%20of%20Death%203/')
  puts response.body.scan(/[^\t "']+\.(?:jpg|gif|png)/).flatten
end

If you want only picture names that are in tags:

puts response.body.scan(/<[^>]*?([^\t "']+\.(?:jpg|gif|png))[^>]*>/)
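
Putting the pieces of this thread together, here is a minimal sketch of the whole task using only the standard library: pull the image names out of SRC= and HREF= attribute values, resolve them against the page URL, and save each file. The helper names `image_paths` and `download_images` are hypothetical, and the attribute pattern is an assumption that suits simple listings like the one in the original post; it is not a general HTML parser.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: collect image paths from SRC= and HREF= attribute
# values only, so link text such as ">IMG_1517.jpg" is not picked up.
# Duplicates (the repeated icon) are dropped.
def image_paths(html)
  html.scan(/\b(?:src|href)\s*=\s*["']([^"']+\.(?:jpg|gif|png))["']/i).flatten.uniq
end

# Hypothetical helper: fetch the page, resolve each path against the page
# URL, and save the file under its base name in dir. Makes network requests.
def download_images(page_url, dir = '.')
  html = Net::HTTP.get(URI(page_url))
  image_paths(html).each do |path|
    image_uri = URI.join(page_url, path)
    target = File.join(dir, File.basename(image_uri.path))
    File.binwrite(target, Net::HTTP.get(image_uri))
  end
end

# Local check against two entries of the listing; no network needed.
sample = <<~HTML
  <IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A HREF="IMG_1517.jpg">IMG_1517.jpg</A> 18-Jun-2005 08:03 819k
  <IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A HREF="IMG_1518.jpg">IMG_1518.jpg</A> 28-Jul-2005 09:00 398k
HTML

puts image_paths(sample)
```

To fetch the files for real, one would call `download_images` with the full page URL, for example `download_images('http://www.myphotowebsite.com/terry/temp/2005-06-18%20Kiss%20of%20Death%203/')`. `URI.join` takes care of both the relative names (IMG_1517.jpg) and the site-absolute icon path (/icons/image2.gif).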
 
