saving a webpage's links to the hard disk

Discussion in 'Python' started by Jetus, May 4, 2008.

  1. Jetus

    Jetus Guest

    Is there a good place to look to see where I can find some code that
    will help me to save a webpage's links to the local drive, after I have
    used urllib2 to retrieve the page?
    Many times I have to view these pages when I do not have access to the
    internet.
     
    Jetus, May 4, 2008
    #1

  2. On Sun, 04 May 2008 01:33:45 -0300, Jetus <> wrote:

    > Is there a good place to look to see where I can find some code that
    > will help me to save a webpage's links to the local drive, after I have
    > used urllib2 to retrieve the page?
    > Many times I have to view these pages when I do not have access to the
    > internet.


    Don't reinvent the wheel and use wget
    http://en.wikipedia.org/wiki/Wget
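
    For offline viewing, wget can also rewrite the links for you; something
    like the following (all standard wget options) mirrors a page together
    with the files it needs and converts its links to point at the local
    copies:

    wget --mirror --convert-links --page-requisites http://python.org/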

    --
    Gabriel Genellina
     
    Gabriel Genellina, May 4, 2008
    #2

  3. Jetus

    Guest

    On May 4, 12:33 am, "Gabriel Genellina" <>
    wrote:
    > On Sun, 04 May 2008 01:33:45 -0300, Jetus <> wrote:
    >
    > > Is there a good place to look to see where I can find some code that
    > > will help me to save a webpage's links to the local drive, after I have
    > > used urllib2 to retrieve the page?
    > > Many times I have to view these pages when I do not have access to the
    > > internet.

    >
    > Don't reinvent the wheel and use wget
    > http://en.wikipedia.org/wiki/Wget
    >
    > --
    > Gabriel Genellina


    A lot of the functionality is already present.

    import urllib
    urllib.urlretrieve('http://python.org/', 'main.htm')  # save the page to disk
    from htmllib import HTMLParser
    from formatter import NullFormatter
    parser = HTMLParser(NullFormatter())  # NullFormatter: parse only, no output
    parser.feed(open('main.htm').read())
    parser.close()
    import urlparse
    for a in parser.anchorlist:  # anchorlist collects every href seen
        print urlparse.urljoin('http://python.org/', a)  # make links absolute

    Output snipped:

    ...
    http://python.org/psf/
    http://python.org/dev/
    http://python.org/links/
    http://python.org/download/releases/2.5.2
    http://docs.python.org/
    http://python.org/ftp/python/2.5.2/python-2.5.2.msi
    ...
     
    , May 4, 2008
    #3
  4. Jetus

    Jetus Guest

    On May 4, 7:22 am, wrote:
    > On May 4, 12:33 am, "Gabriel Genellina" <>
    > wrote:
    >
    > > On Sun, 04 May 2008 01:33:45 -0300, Jetus <> wrote:

    >
    > > > Is there a good place to look to see where I can find some code that
    > > > will help me to save a webpage's links to the local drive, after I have
    > > > used urllib2 to retrieve the page?
    > > > Many times I have to view these pages when I do not have access to the
    > > > internet.

    >
    > > Don't reinvent the wheel and use wget
    > > http://en.wikipedia.org/wiki/Wget

    >
    > > --
    > > Gabriel Genellina

    >
    > A lot of the functionality is already present.
    >
    > import urllib
    > urllib.urlretrieve( 'http://python.org/', 'main.htm' )
    > from htmllib import HTMLParser
    > from formatter import NullFormatter
    > parser= HTMLParser( NullFormatter( ) )
    > parser.feed( open( 'main.htm' ).read( ) )
    > import urlparse
    > for a in parser.anchorlist:
    >     print urlparse.urljoin( 'http://python.org/', a )
    >
    > Output snipped:
    >
    > ...
    > http://python.org/psf/
    > http://python.org/dev/
    > http://python.org/links/
    > http://python.org/download/releases/2.5.2
    > http://docs.python.org/
    > http://python.org/ftp/python/2.5.2/python-2.5.2.msi
    > ...


    How can I modify or add to the above code, so that the file references
    are saved to specified local directories, AND the saved webpage makes
    reference to the new saved files in the respective directories?
    Thanks for your help in advance.
     
    Jetus, May 7, 2008
    #4
  5. Jetus

    Guest

    On May 7, 1:40 am, Jetus <> wrote:
    > On May 4, 7:22 am, wrote:
    >
    > > On May 4, 12:33 am, "Gabriel Genellina" <>
    > > wrote:

    >
    > > > On Sun, 04 May 2008 01:33:45 -0300, Jetus <> wrote:

    >
    > > > > Is there a good place to look to see where I can find some code that
    > > > > will help me to save a webpage's links to the local drive, after I have
    > > > > used urllib2 to retrieve the page?
    > > > > Many times I have to view these pages when I do not have access to the
    > > > > internet.

    >
    > > > Don't reinvent the wheel and use wget
    > > > http://en.wikipedia.org/wiki/Wget

    >
    > > > --
    > > > Gabriel Genellina

    >
    > > A lot of the functionality is already present.

    >
    > > import urllib
    > > urllib.urlretrieve( 'http://python.org/', 'main.htm' )
    > > from htmllib import HTMLParser
    > > from formatter import NullFormatter
    > > parser= HTMLParser( NullFormatter( ) )
    > > parser.feed( open( 'main.htm' ).read( ) )
    > > import urlparse
    > > for a in parser.anchorlist:
    > >     print urlparse.urljoin( 'http://python.org/', a )

    >
    > > Output snipped:

    >
    > > ...
    > > http://python.org/psf/
    > > http://python.org/dev/
    > > http://python.org/links/
    > > http://python.org/download/releases/2.5.2
    > > http://docs.python.org/
    > > http://python.org/ftp/python/2.5.2/python-2.5.2.msi
    > > ...

    >
    > How can I modify or add to the above code, so that the file references
    > are saved to specified local directories, AND the saved webpage makes
    > reference to the new saved files in the respective directories?
    > Thanks for your help in advance.


    You'd have to convert the URLs in the loop to file-system paths, creating
    the directories as you go with os.makedirs(). You'd also have to replace
    the links inside the saved file; your best bet might be prefixing them
    with localhost and spawning a small bounce-router.
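
    As a rough, untested sketch of that first idea (the helper names are
    mine, the href rewriting is a naive string replacement rather than a
    real HTML rewrite, and it assumes the anchorlist from the snippet
    earlier in the thread):

    import os, urllib, urlparse

    def save_links(base, anchors, destdir='saved'):
        # Fetch each link into destdir, mirroring the URL's path.
        mapping = {}
        for a in anchors:
            url = urlparse.urljoin(base, a)
            path = urlparse.urlparse(url).path.lstrip('/')
            if not path or path.endswith('/'):
                path += 'index.html'  # directory-style URLs need a filename
            local = os.path.join(destdir, path)
            d = os.path.dirname(local)
            if d and not os.path.isdir(d):
                os.makedirs(d)  # create the local directories as we go
            urllib.urlretrieve(url, local)
            mapping[a] = local
        return mapping

    def rewrite_links(htmlfile, mapping):
        # Point the saved page's hrefs at the local copies.
        text = open(htmlfile).read()
        for a, local in mapping.items():
            text = text.replace('href="%s"' % a, 'href="%s"' % local)
        open(htmlfile, 'w').write(text)

    rewrite_links('main.htm', save_links('http://python.org/', parser.anchorlist))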
     
    , May 7, 2008
    #5
  6. Jetus wrote:

    > On May 4, 7:22 am, wrote:
    >> On May 4, 12:33 am, "Gabriel Genellina" <>
    >> wrote:
    >>
    >> > On Sun, 04 May 2008 01:33:45 -0300, Jetus <> wrote:

    >>
    >> > > Is there a good place to look to see where I can find some code that
    >> > > will help me to save a webpage's links to the local drive, after I have
    >> > > used urllib2 to retrieve the page?
    >> > > Many times I have to view these pages when I do not have access to
    >> > > the internet.

    >>
    >> > Don't reinvent the wheel and use wget
    >> > http://en.wikipedia.org/wiki/Wget

    >>
    >> > --
    >> > Gabriel Genellina

    >>
    >> A lot of the functionality is already present.
    >>
    >> import urllib
    >> urllib.urlretrieve( 'http://python.org/', 'main.htm' )
    >> from htmllib import HTMLParser
    >> from formatter import NullFormatter
    >> parser= HTMLParser( NullFormatter( ) )
    >> parser.feed( open( 'main.htm' ).read( ) )
    >> import urlparse
    >> for a in parser.anchorlist:
    >>     print urlparse.urljoin( 'http://python.org/', a )
    >>
    >> Output snipped:
    >>
    >> ...
    >> http://python.org/psf/
    >> http://python.org/dev/
    >> http://python.org/links/
    >> http://python.org/download/releases/2.5.2
    >> http://docs.python.org/
    >> http://python.org/ftp/python/2.5.2/python-2.5.2.msi
    >> ...

    >
    > How can I modify or add to the above code, so that the file references
    > are saved to specified local directories, AND the saved webpage makes
    > reference to the new saved files in the respective directories?
    > Thanks for your help in advance.


    How about you *try* to do so - and if you have actual problems, come
    back and ask for help? Alternatively, there's always guru.com

    Diez
     
    Diez B. Roggisch, May 7, 2008
    #6
  7. Jetus

    Guest

    On May 7, 8:36 am, "Diez B. Roggisch" <> wrote:
    > Jetus wrote:
    > > On May 4, 7:22 am, wrote:
    > >> On May 4, 12:33 am, "Gabriel Genellina" <>
    > >> wrote:

    >
    > >> > On Sun, 04 May 2008 01:33:45 -0300, Jetus <> wrote:

    >
    > >> > > Is there a good place to look to see where I can find some code that
    > >> > > will help me to save a webpage's links to the local drive, after I have
    > >> > > used urllib2 to retrieve the page?
    > >> > > Many times I have to view these pages when I do not have access to
    > >> > > the internet.

    >
    > >> > Don't reinvent the wheel and use wget
    > >> > http://en.wikipedia.org/wiki/Wget

    >
    > >> > --
    > >> > Gabriel Genellina

    >
    > >> A lot of the functionality is already present.

    >
    > >> import urllib
    > >> urllib.urlretrieve( 'http://python.org/', 'main.htm' )
    > >> from htmllib import HTMLParser
    > >> from formatter import NullFormatter
    > >> parser= HTMLParser( NullFormatter( ) )
    > >> parser.feed( open( 'main.htm' ).read( ) )
    > >> import urlparse
    > >> for a in parser.anchorlist:
    > >>     print urlparse.urljoin( 'http://python.org/', a )

    >
    > >> Output snipped:

    >
    > >> ...
    > >> http://python.org/psf/
    > >> http://python.org/dev/
    > >> http://python.org/links/
    > >> http://python.org/download/releases/2.5.2
    > >> http://docs.python.org/
    > >> http://python.org/ftp/python/2.5.2/python-2.5.2.msi
    > >> ...

    >
    > > How can I modify or add to the above code, so that the file references
    > > are saved to specified local directories, AND the saved webpage makes
    > > reference to the new saved files in the respective directories?
    > > Thanks for your help in advance.

    >
    > How about you *try* to do so - and if you have actual problems, come
    > back and ask for help? Alternatively, there's always guru.com
    >
    > Diez


    I've tried, to no avail. How does the open-source plug-in for Python
    look/work? Firefox was able to spawn Python in a toolbar in a distant
    land. Does it still? I believe under the DOM, you'd return a file named
    X that contains a list of changes to make to the page, or put it at the
    top of one, to be removed by Firefox. At that point, X would pretty much
    be the last lexically-sorted file in a pre-established directory. Files
    are really easy to create and add syntax to, if you create a bunch of
    them. Sector size was bouncing, though, which brings that all the way
    up to the file system.

    // pseudocode: swap each matching link's anchor for its local copy
    // (loop bound assumed)
    for ( int docID= 0; docID < doc.links.length; docID++ ) {
        if ( doc.links[ docID ]== pythonfileA.links[ pyID ] ) {
            doc.links[ docID ].anchor= pythonfileB.links[ pyID ];
            pyID++;
        }
    }
     
    , May 8, 2008
    #7
