Retrieve URLs of all JPEGs at a web page URL

grimmus

Hi,

I would like to achieve something like Facebook does when you post a
link: it shows the images located at the URL you entered so you can
choose which one to display as a summary.

I was thinking I could loop through the HTML of a page with a regex
and store all the JPEG URLs in an array. Then I could open the images
one by one and save them as thumbnails, with something like the code
below.

Can anyone provide some advice on how best to achieve this?

Sorry for the noob questions and syntax!

from PIL import Image
import urllib

image_uri = 'http://www.gstatic.com/news/img/bluelogo/en_ie/news.gif'
# urlretrieve returns (local_filename, headers); index [0] is the file path
PILInfo = Image.open(urllib.urlretrieve(image_uri)[0])
PILInfo.save("saved-image.gif")
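Since the stated goal is thumbnails rather than full-size copies, PIL's Image.thumbnail() can do the shrinking. A minimal sketch, assuming Pillow/PIL is installed; the helper name and default size are just illustrative:

```python
from io import BytesIO
from PIL import Image

def make_thumbnail(image_bytes, size=(128, 128)):
    """Return a PIL image shrunk to fit inside `size`, keeping aspect ratio."""
    img = Image.open(BytesIO(image_bytes))
    img.thumbnail(size)  # resizes in place; never enlarges
    return img
```

For example, `make_thumbnail(open("saved-image.gif", "rb").read())` would give a thumbnail of the image saved above.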
 
Chris Rebert

grimmus wrote:
> Hi,
>
> I would like to achieve something like Facebook has when you post a
> link. It shows images located at the URL you entered so you can choose
> what one to display as a summary.
>
> I was thinking i could loop through the html of a page with a regex
> and store all the jpeg url's in an array. Then, i could open the
> images one by one and save them as thumbnails with something like
> below.

0. Install BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/)

1:
# untested
from BeautifulSoup import BeautifulSoup
import urllib

page_url = "http://the.url.here"

# urllib.urlopen's result isn't a context manager in Python 2,
# so close it explicitly rather than using `with`
f = urllib.urlopen(page_url)
try:
    soup = BeautifulSoup(f.read())
finally:
    f.close()

for img_tag in soup.findAll("img"):
    relative_url = img_tag["src"]  # tag attributes are dict-style, not .src
    img_url = make_absolute(relative_url, page_url)
    save_image_from_url(img_url)

2. Write make_absolute() and save_image_from_url()

3. Profit.
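Step 2's two helpers might look something like this. A sketch in Python 3 naming (in the thread's Python 2, urljoin lives in the urlparse module and urlopen in urllib); the out_dir parameter and "unnamed" fallback are my own additions:

```python
import os
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def make_absolute(relative_url, page_url):
    """Resolve a possibly-relative URL against the page it came from."""
    # urljoin handles absolute, root-relative and relative URLs uniformly
    return urljoin(page_url, relative_url)

def save_image_from_url(img_url, out_dir="."):
    """Download img_url and save it under its basename; return the path."""
    filename = os.path.basename(urlparse(img_url).path) or "unnamed"
    path = os.path.join(out_dir, filename)
    with open(path, "wb") as f:
        f.write(urlopen(img_url).read())
    return path
```

For example, `make_absolute("img/a.jpg", "http://example.com/gallery/")` yields `"http://example.com/gallery/img/a.jpg"`.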

Cheers,
Chris
 
Stefan Behnel

Chris said:
> page_url = "http://the.url.here"
>
> with urllib.urlopen(page_url) as f:
> soup = BeautifulSoup(f.read())
> for img_tag in soup.findAll("img"):
> relative_url = img_tag.src
> img_url = make_absolute(relative_url, page_url)
> save_image_from_url(img_url)
>
> 2. Write make_absolute() and save_image_from_url()

Note that lxml.html provides a make_links_absolute() function.

Also untested:

from lxml import html

# parse() returns an ElementTree; make_links_absolute() lives on the root element
doc = html.parse(page_url).getroot()
doc.make_links_absolute(page_url)

# element attributes are read with .get(), not as .src
urls = [img.get('src') for img in doc.xpath('//img')]

Then use e.g. urllib2 to save the images.
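That last step could be sketched as below, here with Python 3's urllib.request (the Python 3 home of urllib2's urlopen); the numbered filename pattern is just an illustration:

```python
from urllib.request import urlopen

def download_all(urls, name_pattern="image_{}.jpg"):
    """Fetch each URL and write it to a numbered local file."""
    saved = []
    for i, u in enumerate(urls):
        data = urlopen(u).read()
        filename = name_pattern.format(i)
        with open(filename, "wb") as f:
            f.write(data)
        saved.append(filename)
    return saved
```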

Stefan
 
Paul McGuire

Stefan Behnel wrote:
> Also untested:
>
>         from lxml import html
>
>         doc = html.parse(page_url)
>         doc.make_links_absolute(page_url)
>
>         urls = [ img.src for img in doc.xpath('//img') ]
>
> Then use e.g. urllib2 to save the images.

A pyparsing approach looks similar:

from pyparsing import makeHTMLTags, htmlComment
import urllib

url = "http://the.url.here"
html = urllib.urlopen(url).read()

# makeHTMLTags returns (openTag, closeTag); only the opener is needed here
imgTag = makeHTMLTags("img")[0]
imgTag.ignore(htmlComment)

# each match exposes the tag's attributes by name, e.g. .src
urls = [img.src for img in imgTag.searchString(html)]
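One detail from the original question that none of the snippets handle: grimmus wanted only the JPEG URLs. Any of the collected urls lists could be filtered with a small helper like this sketch (Python 3's urllib.parse shown; the function name is my own):

```python
from urllib.parse import urlparse

def jpeg_urls(urls):
    """Keep only URLs whose path ends in .jpg or .jpeg (case-insensitive)."""
    # parse first so query strings like ?size=big don't hide the extension
    return [u for u in urls
            if urlparse(u).path.lower().endswith((".jpg", ".jpeg"))]
```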

-- Paul
 
