saving a webpage's links to the hard disk

Jetus

Is there a good place to find code that will help me save a webpage's
links to the local drive after I have used urllib2 to retrieve the page?
I often have to view these pages when I do not have access to the
internet.
 
castironpi

Don't reinvent the wheel; use wget: http://en.wikipedia.org/wiki/Wget

A lot of the functionality is already present.

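import urllib
import urlparse
from htmllib import HTMLParser
from formatter import NullFormatter

# Download the page to a local file.
urllib.urlretrieve( 'http://python.org/', 'main.htm' )

# Parse the saved file; the parser collects every link in anchorlist.
parser= HTMLParser( NullFormatter( ) )
parser.feed( open( 'main.htm' ).read( ) )

# Resolve each (possibly relative) link against the page's URL.
for a in parser.anchorlist:
    print urlparse.urljoin( 'http://python.org/', a )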

Output snipped:

...
http://python.org/psf/
http://python.org/dev/
http://python.org/links/
http://python.org/download/releases/2.5.2
http://docs.python.org/
http://python.org/ftp/python/2.5.2/python-2.5.2.msi
...
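
For anyone reading this on Python 3: htmllib is gone there, and urlparse moved into urllib.parse. A rough equivalent of the link-listing step might look like the sketch below. The names AnchorCollector and absolute_links are made up for illustration, and the sample HTML is a stand-in for a downloaded page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.anchorlist.append(value)


def absolute_links(html, base_url):
    """Return every anchor in html as an absolute URL."""
    parser = AnchorCollector()
    parser.feed(html)
    return [urljoin(base_url, a) for a in parser.anchorlist]


if __name__ == "__main__":
    # Stand-in for the contents of a saved page.
    sample = '<a href="/psf/">PSF</a> <a href="http://docs.python.org/">Docs</a>'
    for link in absolute_links(sample, "http://python.org/"):
        print(link)
```

This only looks at <a href=...>; images, stylesheets and scripts would need their own tags and attributes handled the same way.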
 
Jetus

How can I modify or add to the above code, so that the file references
are saved to specified local directories, AND the saved webpage makes
reference to the new saved files in the respective directories?
Thanks for your help in advance.
 
castironpi

You'd have to convert each URL in the loop to a file system path; try
writing it as is, creating the directories with os.makedirs( ). You'd
also have to replace the links in the saved file's contents, so your
best bet might be prefixing them with localhost and spawning a small
bounce-router.
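
A minimal sketch of the URL-to-path step, in Python 3 and under some assumptions: the function names, the "mirror" root directory and the index.html convention for directory URLs are all invented here, not anything from the thread. The host name is used as a directory as is, and the query string is dropped.

```python
import os
from urllib.parse import urlparse


def url_to_local_path(url, root="mirror"):
    """Map a URL to a path under root: root/host/dir/file (hypothetical layout)."""
    parts = urlparse(url)
    path = parts.path
    # A URL ending in "/" names a directory; store it as index.html (assumption).
    if not path or path.endswith("/"):
        path += "index.html"
    # Drop empty, "." and ".." segments so the result stays under root.
    segments = [s for s in path.split("/") if s and s not in (".", "..")]
    return os.path.join(root, parts.netloc, *segments)


def save_bytes(url, data, root="mirror"):
    """Write data to the local path for url, creating directories first."""
    target = url_to_local_path(url, root)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as f:
        f.write(data)
    return target
```

The bytes passed to save_bytes could come from something like urllib.request.urlopen(url).read(). Two URLs that differ only in their query string would land on the same file with this layout, so a real mirror would need a rule for that.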
 
Diez B. Roggisch

Jetus said:
How can I modify or add to the above code, so that the file references
are saved to specified local directories, AND the saved webpage makes
reference to the new saved files in the respective directories?
Thanks for your help in advance.

how about you *try* to do so - and if you have actual problems, you come
back and ask for help? Alternatively, there's always guru.com

Diez
 
castironpi

I've tried, to no avail. How does the open-source plug to Python
look/work? Firefox was able to spawn Python in a toolbar in a distant
land. Does it still? I believe under DOM, return a file named X that
contains a list of changes to make to the page, or put it at the top
of one, to be removed by Firefox. At that point, X would pretty much
be the last lexically sorted file in a pre-established directory. Files
are really easy to create and add syntax to, if you create a bunch of
them. Sector size was bouncing though, which brings that all the way
up to file system.

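// Pseudocode: walk the page's links and swap in the replacement for each match.
int pyID= 0;
for( int docID= 0; docID< doc.links.length; docID++ ) {
    if ( doc.links[ docID ]== pythonfileA.links[ pyID ] ) {
        doc.links[ docID ].anchor= pythonfileB.links[ pyID ];
        pyID++;
    }
}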
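
The same idea, swapping each link in the page for its replacement, could also be done in Python on the saved file before it is ever opened in a browser. Below is a rough Python 3 sketch; rewrite_links and the replacements dict (absolute URL mapped to local path) are hypothetical names, and the regular expression is a shortcut that only handles quoted href attributes, not a full HTML parser.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or href='...'; unquoted attribute values are not handled.
HREF_RE = re.compile(r'(href\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def rewrite_links(html, base_url, replacements):
    """Swap each href whose absolute form is a key of replacements."""

    def swap(match):
        prefix, quote, link = match.groups()
        absolute = urljoin(base_url, link)
        # Links without a replacement are left exactly as they were.
        new_link = replacements.get(absolute, link)
        return prefix + quote + new_link + quote

    return HREF_RE.sub(swap, html)
```

For example, with replacements = {"http://python.org/psf/": "python.org/psf/index.html"}, an anchor pointing at /psf/ on a page fetched from http://python.org/ would be rewritten to the local path, while every other link stays untouched.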
 
