Retrieve URLs of all JPEGs at a web page URL

grimmus

Hi,

I would like to achieve something like Facebook does when you post a
link: it shows the images located at the URL you entered so you can
choose which one to display as a summary.

I was thinking I could loop through the HTML of a page with a regex
and store all the JPEG URLs in an array. Then I could open the images
one by one and save them as thumbnails, with something like the code
below.

Can anyone provide some advice on how best to achieve this?

Sorry for the noob questions and syntax!

from PIL import Image
import urllib

image_uri = 'http://www.gstatic.com/news/img/bluelogo/en_ie/news.gif'
# urlretrieve returns (local_filename, headers); index [0] is the file path
PILInfo = Image.open(urllib.urlretrieve(image_uri)[0])
PILInfo.save("saved-image.gif")
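Since the stated goal is thumbnails rather than full-size copies, PIL's Image.thumbnail() can do the shrinking. A minimal sketch, assuming Pillow/PIL is installed; the helper name and default size are just illustrative:

```python
from io import BytesIO
from PIL import Image

def make_thumbnail(image_bytes, size=(128, 128)):
    """Return a PIL image shrunk to fit inside `size`, keeping aspect ratio."""
    img = Image.open(BytesIO(image_bytes))
    img.thumbnail(size)  # resizes in place; never enlarges
    return img
```

For example, `make_thumbnail(open("saved-image.gif", "rb").read())` would give a thumbnail of the image saved above.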
 
Chris Rebert

grimmus wrote:
> Hi,
>
> I would like to achieve something like Facebook has when you post a
> link. It shows images located at the URL you entered so you can choose
> what one to display as a summary.
>
> I was thinking i could loop through the html of a page with a regex
> and store all the jpeg url's in an array. Then, i could open the
> images one by one and save them as thumbnails with something like
> below.

0. Install BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/)

1:
# untested
from BeautifulSoup import BeautifulSoup
import urllib

page_url = "http://the.url.here"

# urllib.urlopen's result isn't a context manager in Python 2,
# so close it explicitly rather than using `with`
f = urllib.urlopen(page_url)
try:
    soup = BeautifulSoup(f.read())
finally:
    f.close()

for img_tag in soup.findAll("img"):
    relative_url = img_tag["src"]  # tag attributes are dict-style, not .src
    img_url = make_absolute(relative_url, page_url)
    save_image_from_url(img_url)

2. Write make_absolute() and save_image_from_url()

3. Profit.
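Step 2's two helpers might look something like this. A sketch in Python 3 naming (in the thread's Python 2, urljoin lives in the urlparse module and urlopen in urllib); the out_dir parameter and "unnamed" fallback are my own additions:

```python
import os
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def make_absolute(relative_url, page_url):
    """Resolve a possibly-relative URL against the page it came from."""
    # urljoin handles absolute, root-relative and relative URLs uniformly
    return urljoin(page_url, relative_url)

def save_image_from_url(img_url, out_dir="."):
    """Download img_url and save it under its basename; return the path."""
    filename = os.path.basename(urlparse(img_url).path) or "unnamed"
    path = os.path.join(out_dir, filename)
    with open(path, "wb") as f:
        f.write(urlopen(img_url).read())
    return path
```

For example, `make_absolute("img/a.jpg", "http://example.com/gallery/")` yields `"http://example.com/gallery/img/a.jpg"`.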

Cheers,
Chris
 
Stefan Behnel

Chris said:
> page_url = "http://the.url.here"
>
> with urllib.urlopen(page_url) as f:
> soup = BeautifulSoup(f.read())
> for img_tag in soup.findAll("img"):
> relative_url = img_tag.src
> img_url = make_absolute(relative_url, page_url)
> save_image_from_url(img_url)
>
> 2. Write make_absolute() and save_image_from_url()

Note that lxml.html provides a make_links_absolute() function.

Also untested:

from lxml import html

# parse() returns an ElementTree; make_links_absolute() lives on the root element
doc = html.parse(page_url).getroot()
doc.make_links_absolute(page_url)

# element attributes are read with .get(), not as .src
urls = [img.get('src') for img in doc.xpath('//img')]

Then use e.g. urllib2 to save the images.
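That last step could be sketched as below, here with Python 3's urllib.request (the Python 3 home of urllib2's urlopen); the numbered filename pattern is just an illustration:

```python
from urllib.request import urlopen

def download_all(urls, name_pattern="image_{}.jpg"):
    """Fetch each URL and write it to a numbered local file."""
    saved = []
    for i, u in enumerate(urls):
        data = urlopen(u).read()
        filename = name_pattern.format(i)
        with open(filename, "wb") as f:
            f.write(data)
        saved.append(filename)
    return saved
```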

Stefan
 
Paul McGuire

Stefan Behnel wrote:
> Also untested:
>
>         from lxml import html
>
>         doc = html.parse(page_url)
>         doc.make_links_absolute(page_url)
>
>         urls = [ img.src for img in doc.xpath('//img') ]
>
> Then use e.g. urllib2 to save the images.

A pyparsing approach looks similar:

from pyparsing import makeHTMLTags, htmlComment
import urllib

url = "http://the.url.here"
html = urllib.urlopen(url).read()

# makeHTMLTags returns (openTag, closeTag); only the opener is needed here
imgTag = makeHTMLTags("img")[0]
imgTag.ignore(htmlComment)

# each match exposes the tag's attributes by name, e.g. .src
urls = [img.src for img in imgTag.searchString(html)]
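One detail from the original question that none of the snippets handle: grimmus wanted only the JPEG URLs. Any of the collected urls lists could be filtered with a small helper like this sketch (Python 3's urllib.parse shown; the function name is my own):

```python
from urllib.parse import urlparse

def jpeg_urls(urls):
    """Keep only URLs whose path ends in .jpg or .jpeg (case-insensitive)."""
    # parse first so query strings like ?size=big don't hide the extension
    return [u for u in urls
            if urlparse(u).path.lower().endswith((".jpg", ".jpeg"))]
```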

-- Paul
 
