Get directory from http web site

rock69

Hi all :)

I was wondering if there's some neat and easy way to get the entire
contents of a directory at a specific web url address.

I have the following link:

http://www.infomedia.it/immagini/riviste/covers/cp

and as you can see it's just a list containing all the files (images)
that I need. Is it possible to retrieve this list (not the physical
files) and have it stored in a variable of type list or something?

And, if so, what would be the easiest and most efficient way?

Thank you so much in advance.

Rock
 
Sybren Stuvel

rock69 enlightened us with:
I was wondering if there's some neat and easy way to get the entire
contents of a directory at a specific web url address. [...] Is it
possible to retrieve this list (not the physical files) and have it
stored in a variable of type list or something?

Check out the chapter on HTML parsing at
http://www.diveintopython.org/

Sybren
 
Kent Johnson

rock69 said:
Hi all :)

I was wondering if there's some neat and easy way to get the entire
contents of a directory at a specific web url address.

I have the following link:

http://www.infomedia.it/immagini/riviste/covers/cp

and as you can see it's just a list containing all the files (images)
that I need. Is it possible to retrieve this list (not the physical
files) and have it stored in a variable of type list or something?

BeautifulSoup and urllib do this easily:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib
>>> data = urllib.urlopen('http://www.infomedia.it/immagini/riviste/covers/cp/').read()
>>> soup = BeautifulSoup(data)
>>> anchors = soup.fetch('a')
>>> len(anchors)
164
>>> for a in anchors[:10]:
...     print a['href'], a.string
...
?N=D Name
?M=A Last modified
?S=A Size
?D=A Description
/immagini/riviste/covers/ Parent Directory
cp100.jpg cp100.jpg
cp100sm.jpg cp100sm.jpg
cp101.jpg cp101.jpg
cp101sm.jpg cp101sm.jpg
cp102.jpg cp102.jpg

http://www.crummy.com/software/BeautifulSoup/
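
The session above is Python 2 with BeautifulSoup 3. On Python 3, roughly the same list can probably be built with the standard library alone. This is only a sketch: the names AnchorCollector, extract_file_names and list_directory are made up for illustration, and it assumes an Apache-style index page like the one shown, where sort links start with "?" and the parent directory link contains a "/".

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_file_names(html):
    """Return the plain file names linked from an index page.

    Assumption: sort links look like "?N=D" and anything containing "/"
    (parent directory, subdirectories, absolute URLs) is not a file here.
    """
    parser = AnchorCollector()
    parser.feed(html)
    return [h for h in parser.hrefs if not h.startswith("?") and "/" not in h]


def list_directory(url):
    """Download the index page at url and return its file names."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_file_names(html)
```

Calling list_directory('http://www.infomedia.it/immagini/riviste/covers/cp/') should then give a plain Python list of names, with no image data downloaded. Parsing is kept separate from fetching so it can be tried on a saved copy of the page first.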

Kent
 
lemon97

You might also want to modify your c:/python/Lib/urllib.py file by adding or changing the following headers. They need to go in a single list, because a second assignment to self.addheaders would replace the first:

self.addheaders = [
    ('User-agent', 'Mozilla/4.0'),  # make the server think the request comes from a browser
    ('Referer', 'http://www.infomedia.it'),  # make the site think you clicked a link on one of its own pages
]
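
Editing the standard library in place changes the behaviour of every script on the machine, so it may be safer to send the same headers per request. A minimal Python 3 sketch follows; build_request and fetch are hypothetical helper names, and the header values are simply the ones suggested above.

```python
from urllib.request import Request, urlopen


def build_request(url, referer="http://www.infomedia.it"):
    """Build a request carrying browser-like headers (illustrative helper)."""
    headers = {
        "User-Agent": "Mozilla/4.0",  # pose as a browser
        "Referer": referer,           # pretend the link was clicked on the site
    }
    return Request(url, headers=headers)


def fetch(url):
    """Return the raw bytes of the page at url, sent with the headers above."""
    with urlopen(build_request(url)) as response:
        return response.read()
```

Whether the server actually checks either header is not known from this thread; if a plain request already works, none of this is needed.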
 
