Re: How to save web pages for offline reading?

Discussion in 'Python' started by Anand Pillai, Aug 19, 2003.

  1. Anand Pillai (Guest)

    I hope this thread is not dead.

    I would like to know what you decided at the end :)
    Harvestman has about 10 active subscribers right now
    and some corporates in India and umpteen of my own friends
    use it for their personal 'harvesting' needs :->

    I hope you downloaded at least the (new) binaries
    and gave it a go!

    -Anand

    (Will Stuyvesant) wrote in message news:<>...
    > > [Carsten Gehling]
    > > Well since you ARE on Windows:
    > >
    > > Open the page in Internet Explorer, choose "File" and "Save As...".
    > > You've now saved all necessary files.
    > >

    >
    > I know. But I can't do File - Save As from Python :) I guess it
    > can be done via COM?
    >
    > > > I thought this whole thing would be easy with all those Python
    > > > internet modules in the standard distro: httplib, urllib, urllib2,
    > > > FancyURLxxx etc. Being able to download a "complete" page *from
    > > > Python source* would be very nice for my particular application.

    > >
    > > Well it's doable with those libraries, but you have to put your
    > > own meat on the bones.
    > >
    > > 1) Use httplib to get the page first.
    > > 2) Parse it for all "src" attributes, and get the supporting files.
    > > The parsing can be done with an HTML parser ...

    >
    > That would be htmllib.
    >
    > What you describe is what I am going to do actually, when I have time
    > again. I was about to do it when I thought "somebody must have been
    > doing this before". It seems Mr. Pillai in another reply has done
    > something similar, but I couldn't figure it out from his source code.
    >
    > Thank you all for the help!
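
    On the COM question quoted above: a rough sketch of driving Internet
    Explorer's "Save As..." from Python, assuming the pywin32 extensions
    are installed (untested; 4 and 0 are the standard OLECMDID_SAVEAS and
    OLECMDEXECOPT_DODEFAULT values, and IE's dialog still pops up, so this
    is only half-automated):

    import time
    import win32com.client

    def save_page(url):
        # drive Internet Explorer and invoke its "Save As..." command;
        # the user still has to confirm the dialog
        ie = win32com.client.Dispatch('InternetExplorer.Application')
        ie.Visible = 1
        ie.Navigate(url)
        while ie.Busy:      # wait until the page has finished loading
            time.sleep(0.5)
        ie.ExecWB(4, 0)     # OLECMDID_SAVEAS, OLECMDEXECOPT_DODEFAULT

    save_page('http://www.python.org')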
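
    And a rough sketch of the two-step approach described above, fetching
    a page and its "src" files with urllib and htmllib (it only handles
    <img src="...">; a complete version would also follow <script src>
    and stylesheet links, and rewrite the references in the saved HTML):

    import htmllib
    import formatter
    import os
    import urllib
    import urlparse

    class SrcCollector(htmllib.HTMLParser):
        # collect the src attribute of every <img> tag seen
        def __init__(self):
            htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
            self.srcs = []
        def handle_image(self, src, alt, *args):
            self.srcs.append(src)

    def save_complete(url, todir='saved'):
        if not os.path.isdir(todir):
            os.mkdir(todir)
        html = urllib.urlopen(url).read()
        file(os.path.join(todir, 'index.html'), 'wb').write(html)
        p = SrcCollector()
        p.feed(html)
        p.close()
        for src in p.srcs:
            # resolve relative references and fetch the supporting files
            absolute = urlparse.urljoin(url, src)
            name = os.path.basename(urlparse.urlparse(absolute)[2])
            if name:
                urllib.urlretrieve(absolute, os.path.join(todir, name))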
     
    Anand Pillai, Aug 19, 2003
    #1

  2. > [Anand Pillai]
    > I hope this thread is not dead.
    >
    > I would like to know what you decided at the end :)
    > Harvestman has about 10 active subscribers right now
    > and some corporates in India and umpteen of my own friends
    > use it for their personal 'harvesting' needs :->
    >
    > I hope you downloaded at least the (new) binaries
    > and gave it a go!


    Just downloaded it and I will study it. Thanks!

    What I use now is suboptimal, but it works well enough for simple
    offline reading of blogs, and it is fast (no .GIFs etc.). It stores a
    bunch of
    blogs I want to read in a directory (if they are new, checked by
    content-size and stuff). I am sure it can be improved a lot, comments
    welcome :)

    # file getsites.py

    ##
    # DEFINITION bloglist.ini format
    # This is like what effnews (www.effbot.org) uses. But I stopped
    # using it because it gets too expensive to read online with a
    # telephone connection.
    # A file in bloglist.ini format contains a sequence of URI and Title
    # pairs: URIs are on a line that starts with a '+', and then the title
    # for that URI follows.
    #
    # for instance:
    #
    # +http://www.python.org
    # Python Language Website
    # +http://online.effbot.org
    # online.effbot.org
    ###

    import sys
    import os
    import time
    import urllib

    TMPDIR = 'getsites'

    NEWDELTA = 60
    # If a new blogentry differs less than NEWDELTA bytes from the old
    # entry, then it is considered not new (suspicion of only a date
    # change etc.)
    # For instance, Simon Willison's weblog changed 57 bytes but without
    # new content: only the "2 days 7 hours ago" message changed.
    # Pyzine.com changed 400 bytes because of a new "random abstract".
    # So far, Industry Toulouse is the only site that changes bytes
    # without new content but does send a content-length; that is
    # because it shows a random quote every time you visit it.

    # Some titles in a bloglist.ini file may need editing; for instance
    # the title 'zone ::: effbot' will create an IOError: "No such file
    # or directory: 'zone ::: effbot'" (on my Windows laptop).
    DEFAULTFILE = 'bloglist.ini'


    wfm = "This file has a size but can not have its date changed?"
    # weird file message

    ##
    # Change the date of a file
    #
    # @param filename The file that will have its date changed
    # @param ddmmyyyy String with date
    # @return Side-effect: file's date changed to ddmmyyyy
    ##
    def setdate(filename, ddmmyyyy):
        day = int(ddmmyyyy[:2])
        month = int(ddmmyyyy[2:4])
        year = int(ddmmyyyy[4:])
        # 12 hours, 1 minute: gives 01:01 PM, odd
        t = time.mktime((year, month, day, 12, 1, 0, 0, 0, 0))
        os.utime(filename, (t, t))



    ##
    # An example of the list returned:
    # [{'http://myurl.com/index.html': 'My Site Name'},
    # {'http://other.net/c.htm': 'spam'}]
    #
    # @param bloglistfile Open file in bloglist.ini format
    # @return List of dicts, every dict has one uri:title pair
    ##
    def getsitedict(bloglistfile):
        entry = {}    # a dict with one uri:title pair
        haveurl = ''  # the uri of the entry we are adding now
        sites = []    # a list of dicts with uri:title
        for line in bloglistfile:
            line = line.strip()
            if not line:            # skip blank lines
                continue
            if line[0] == '+':      # a new url
                if entry:           # the previous uri:title is complete
                    sites.append(entry)
                    entry = {}
                haveurl = line[1:]
            elif haveurl:           # collecting a title
                if entry.has_key(haveurl):
                    entry[haveurl] = entry[haveurl] + line
                else:
                    entry[haveurl] = line
        if entry:                   # don't drop the last entry
            sites.append(entry)
        return sites


    ##
    # Check and maybe download sites
    #
    # @param sites List of dicts, every dict has one uri:title pair
    # @return Side-effect: new HTML in TMPDIR
    ##
    def getsites(sites):
        for site in sites:
            uri = site.keys()[0]
            title = site[uri]
            filename = os.path.join(TMPDIR, title + '.html')
            print
            print title
            print uri
            try:
                fp = urllib.urlopen(uri)
            except IOError:
                print 'ERROR: no connection'
                continue
            oldsize = 0
            try:
                oldsize = os.path.getsize(filename)
            except OSError:
                pass
            newsize = 0
            for k, v in fp.headers.items():
                if k.lower() == 'content-length':
                    newsize = long(v)
                    break
            if (oldsize == 0) or (newsize != oldsize):
                # There is an HTTP content-length and it is not the same
                # as the file we already have (new != old), or we don't
                # have a file already (old == 0).
                print 'oldsize', oldsize, 'newsize', newsize
                print 'Downloading: '
                try:
                    op = file(filename, "wb")
                except IOError:
                    print 'Illegal filename:', filename
                    continue
                n = 0
                while 1:
                    s = fp.read(8192)
                    if not s:
                        break
                    op.write(s)
                    n = n + len(s)
                fp.close()
                op.close()
                for k, v in fp.headers.items():
                    print k, "=", v
                print "stored %s (%s bytes)" % (filename, n)
                if ((oldsize > 0)       # there is an old file
                        and (newsize == 0)  # no HTTP content-length
                        and (abs(n - oldsize) <= NEWDELTA)):  # "no change"
                    # Change date of saved blogs that do not send HTTP
                    # content-length and that do not appear to have
                    # changed.
                    # TODO: this also removes NEW pages that have the same
                    # length as the old ones. It would be better to check
                    # the file content, viz. >= 95% same content. Or
                    # check the content of the first 8192 bytes?
                    try:
                        setdate(filename, '31012001')
                    except IOError:
                        print wfm
                    print 'Setting date to 31012001 (untrusted "new")'


    if __name__ == '__main__':
        # handle command-line options
        if len(sys.argv) < 2:
            filename = DEFAULTFILE
        else:
            filename = sys.argv[1]

        # check and maybe set up the download directory
        try:
            os.chdir(TMPDIR)
        except OSError:
            os.mkdir(TMPDIR)
            os.chdir(TMPDIR)
        os.chdir('..')

        # get the list of sites
        sites = getsitedict(file(filename, 'r'))

        # check and maybe download sites
        getsites(sites)
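
    The TODO above, comparing ">= 95% same content" instead of byte
    counts, could be sketched with difflib; the 0.95 threshold and the
    8192-byte window are just the numbers from the comment:

    import difflib

    def looks_unchanged(oldpage, newpage, threshold=0.95):
        # compare only the first 8192 bytes of each version and treat
        # the page as unchanged when they are at least 95% similar
        sm = difflib.SequenceMatcher(None, oldpage[:8192], newpage[:8192])
        return sm.ratio() >= threshold

    To use it, getsites() would have to read the old file before
    overwriting it and pass both versions in.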
     
    Will Stuyvesant, Aug 19, 2003
    #2

  3. Vattekkat Satheesh Babu, Aug 20, 2003
    #3
