confusion trying to get IMG tags from html page

pkellner · Jul 29, 2005

I'm trying to download images from a web page that has them listed with
html like what I've pasted below. Basically, I want to iterate through
all the <IMG tags and grab the SRC= info and download those files.
I've tried a bunch of things with not much luck. Here is my last
attempt. Any help would be appreciated.

require 'net/http'
require 'rexml/document'

Net::HTTP.start('www.myphotowebsite.com') do |http|
  response = http.get('/terry/temp/2005-06-18%20Kiss%20of%20Death%203/')
  puts "Code = #{response.code}"
  puts "Message = #{response.message}"
  #puts "Body = #{response.body}"

  #parser = HTMLTree::XMLParser.new(false,false)
  #parser.feed(client.getContent(url))
  xml=response.body

  xml.elements.each('//HREF]') do |node|

  end
end

<IMG SRC="/icons/image2.gif" ALT=""> <A
HREF="IMG_1516.jpg">IMG_1516.jpg</A> 28-Jul-2005 08:59
233k
<IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A
HREF="IMG_1517.jpg">IMG_1517.jpg</A> 18-Jun-2005 08:03
819k
<IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A
HREF="IMG_1518.jpg">IMG_1518.jpg</A> 28-Jul-2005 09:00
398k
<I

Charles Steinman · Jul 30, 2005

That isn't valid XML (tags without matching end-tags must have a
trailing slash), so the parser probably doesn't understand it. Assuming
the HTML isn't too complicated, you should be able to get the info with
regular expressions.
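
A minimal sketch of the regular-expression approach, run against the listing fragment from the question. The names LISTING and attribute_values are made up for illustration, and the pattern assumes double-quoted attribute values as in the sample; it is not a general HTML parser.

```ruby
# Sample directory listing taken from the question above.
LISTING = <<~HTML
  <IMG SRC="/icons/image2.gif" ALT=""> <A
  HREF="IMG_1516.jpg">IMG_1516.jpg</A> 28-Jul-2005 08:59
  233k
  <IMG SRC="/icons/image2.gif" ALT="[IMG]"> <A
  HREF="IMG_1517.jpg">IMG_1517.jpg</A> 18-Jun-2005 08:03
  819k
HTML

# Return the values of the given attribute (e.g. "src" or "href").
# Assumes double-quoted values; matching is case-insensitive.
def attribute_values(html, name)
  pattern = /\b#{Regexp.escape(name)}\s*=\s*"([^"]*)"/i
  html.scan(pattern).flatten
end

icons  = attribute_values(LISTING, 'src')   # the icon shown next to each entry
photos = attribute_values(LISTING, 'href')  # the linked image files

puts photos
```

Note that in this particular listing every SRC points at the same icon; the photos themselves are the HREF targets, so that is probably the attribute to collect.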

pkellner · Jul 30, 2005

I was really hoping for some code or pseudo code. I'm new to ruby and
have been thrashing over this for hours. I promise to put some back
later when I know more about this. (and sadly, I'm not a regular
expression wizard)

Thanks

James Britt · Jul 30, 2005

pkellner said:
I was really hoping for some code or pseudo code. I'm new to ruby and
have been thrashing over this for hours. I promise to put some back
later when I know more about this. (and sadly, I'm not a regular
expression wizard)

I use WWW::Mechanize to slurp down numerous CafePress shop pages and
snarf out the img info, which I use to automagically create the product
pages for rubystuff.com.

The code sample here is a much simplified version.

Mechanize lets you use custom classes to encapsulate node types, which
in turn makes it simpler to manipulate assorted HTML elements. I need
to extract assorted data from image URLs, so I coded up some additional
trickery not shown here.

Also note that some sites reject bots, spiders, etc. when the declared
user-agent is not something acceptable. Hence the random selection from
UA here.

#!/usr/local/bin/ruby

require 'mechanize'

UA = [
  'Windows IE 6',
  'Windows Mozilla',
  'Mac Safari',
  'Mac Mozilla',
  'Linux Mozilla',
  'Linux Konqueror' ]

# Wrap certain nodes in an Img class to make
# node attribute access a bit easier to grok.
class Img
  attr_reader :alt, :src

  def initialize( node )
    @node = node
    @alt = ''
    @src = ''

    if @node.attributes[ 'alt' ]
      @alt = @node.attributes[ 'alt' ].to_s.strip
    end
    if @node.attributes[ 'src' ]
      @src = @node.attributes[ 'src' ].to_s.strip
    end
  end
end

# Now with Rails tote bags and thongs and stuff!
url = 'http://www.cafepress.com/rubyonrailsshop'

agent = WWW::Mechanize.new {|a| a.log = Logger.new( STDERR ) }
# rand( UA.size ) covers every index; rand( UA.size - 1 ) would never pick the last entry.
agent.user_agent_alias = UA[ rand( UA.size ) ]

# This tells Mechanize to watch for certain elements, and
# map matching nodes to the keyed class. Here, when an img
# element is encountered, mechanize will use the node to create
# an Img object and store it for us.
agent.watch_for_set = { 'img' => Img }

page = agent.get( url )

# Get the watch items we're interested in
images = page.watches[ 'img' ]

# What did we get?
images.each do |img|
  p img.src
end

#----------------
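
For readers using plain Net::HTTP instead of Mechanize, the same user-agent idea might be sketched as below. The agent strings and the helper name browser_like_get are illustrative assumptions, not values taken from Mechanize or from the code above.

```ruby
require 'net/http'

# Illustrative browser-like agent strings (assumptions, not Mechanize's aliases).
AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
  'Mozilla/5.0 (X11; Linux x86_64)'
].freeze

# Build a GET request for the given path that declares a randomly chosen agent.
def browser_like_get(path)
  Net::HTTP::Get.new(path, 'User-Agent' => AGENTS.sample)
end

request = browser_like_get('/rubyonrailsshop')
puts request['User-Agent']
```

The request object can then be sent with something like Net::HTTP.start(host) { |http| http.request(request) }.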

Hope this helps.

Get Mechanize from rubyforge.org, from the wee project page.

http://rubyforge.org/projects/wee/

James Britt

--

http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys

Brian Schröder · Jul 30, 2005

Or use the simplest variant:

require 'open-uri'

open('www.myphotowebsite.com') do |http|
  response = http.get('/terry/temp/2005-06-18%20Kiss%20of%20Death%203/')
  puts response.body.scan(/[^\t "'=]+\.(?:jpg|gif|png)/).flatten
end

regards,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

Brian Schröder · Jul 30, 2005

Obviously, that should have read:

require 'net/http'

Net::HTTP.start('www.myphotowebsite.com') do |http|
  response = http.get('/terry/temp/2005-06-18%20Kiss%20of%20Death%203/')
  puts response.body.scan(/[^\t "']+\.(?:jpg|gif|png)/).flatten
end

regards,

Brian

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

William James · Jul 30, 2005

Brian said:
require 'net/http'

Net::HTTP.start('www.myphotowebsite.com') do |http|
  response = http.get('/terry/temp/2005-06-18%20Kiss%20of%20Death%203/')
  puts response.body.scan(/[^\t "']+\.(?:jpg|gif|png)/).flatten
end

If you want only picture names that are in tags:

puts response.body.scan(/<[^>]*?([^\t "']+\.(?:jpg|gif|png))[^>]*>/)
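
To get from the scanned names to files on disk, one possible sketch follows. BASE is the placeholder address from the question, and image_urls and download are hypothetical helper names; error handling is deliberately minimal.

```ruby
require 'net/http'
require 'uri'

# Placeholder page address from the question.
BASE = 'http://www.myphotowebsite.com/terry/temp/2005-06-18%20Kiss%20of%20Death%203/'

# Pull image names out of tags and resolve them against the page URL,
# so relative names and absolute paths both become full URLs.
def image_urls(html, base)
  names = html.scan(/<[^>]*?([^\t "']+\.(?:jpg|gif|png))[^>]*>/).flatten
  names.uniq.map { |name| URI.join(base, name).to_s }
end

# Fetch one URL and save it under its file name in the current directory.
# Returns nil without writing anything if the server does not answer 2xx.
def download(url)
  uri = URI.parse(url)
  response = Net::HTTP.get_response(uri)
  return unless response.is_a?(Net::HTTPSuccess)

  File.binwrite(File.basename(uri.path), response.body)
end
```

With a page body in hand, usage would look like image_urls(response.body, BASE).each { |url| download(url) }.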
