Python Google Server

fuzzyman

I've hacked together a 'GoogleCacheServer'. It is based on
SimpleHTTPServer. Run the following script (hopefully google groups
won't mangle the indentation) and set your browser proxy settings to
'localhost:8000'. It will let you browse the internet using google's
cache. Obviously you'll miss images, javascript, css files, etc.

See the world as google sees it!

(This is actually an 'inventive' short-term measure to get round a
restrictive internet policy at work :) I'll probably put it in the
Python Cookbook as it's quite fun (so if line lengths or indentation
are mangled here, try there). Tested on Windows XP, with Python 2.3 and IE.



# Copyright Michael Foord, 2004 & 2005.
# Released subject to the BSD License
# Please see http://www.voidspace.org.uk/documents/BSD-LICENSE.txt

# For information about bugfixes, updates and support, please join the
# Pythonutils mailing list.
# http://voidspace.org.uk/mailman/listinfo/pythonutils_voidspace.org.uk
# Comments, suggestions and bug reports welcome.
# Scripts maintained at http://www.voidspace.org.uk/python/index.shtml
# E-mail (e-mail address removed)

import google
import BaseHTTPServer
import shutil
from StringIO import StringIO
import urlparse

__version__ = '0.1.0'


"""
This is a simple implementation of a server that fetches web pages
from the google cache.

It lets you explore the internet from your browser, using the google
cache.

Run this script and then set your browser proxy settings to
localhost:8000

Needs google.py (and a google license key).
See http://pygoogle.sourceforge.net/
and http://www.google.com/apis/
"""

cached_types = ['txt', 'html', 'htm', 'shtml', 'shtm', 'cgi', 'pl', 'py']
google.setLicense(google.getLicense())
googlemarker = '''<i>Google is not affiliated with the authors of this page nor responsible for its content.</i></font></center></td></tr></table></td></tr></table>\n<hr>\n'''
markerlen = len(googlemarker)

class googleCacheHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    server_version = "googleCache/" + __version__
    cached_types = cached_types
    googlemarker = googlemarker
    markerlen = markerlen

    def do_GET(self):
        f = self.send_head()
        if f:
            self.copyfile(f, self.wfile)
            f.close()

    def send_head(self):
        """Common code for GET and HEAD commands.

        This sends the response code and MIME headers.

        Return value is either a file object (which has to be copied
        to the outputfile by the caller unless the command was HEAD,
        and must be closed by the caller under all circumstances), or
        None, in which case the caller has nothing further to do.
        """
        print self.path
        url = urlparse.urlparse(self.path)[2]
        dotloc = url.find('.') + 1
        if dotloc and url[dotloc:] not in self.cached_types:
            return None    # not a cached type - don't even try

        thepage = google.doGetCachedPage(self.path)
        headerpos = thepage.find(self.googlemarker)
        if headerpos != -1:    # remove the google header
            pos = self.markerlen + headerpos
            thepage = thepage[pos:]

        f = StringIO(thepage)

        self.send_response(200)
        self.send_header("Content-type", 'text/html')
        self.send_header("Content-Length", str(len(thepage)))
        self.end_headers()
        return f

    def copyfile(self, source, outputfile):
        shutil.copyfileobj(source, outputfile)


def test(HandlerClass=googleCacheHandler,
         ServerClass=BaseHTTPServer.HTTPServer):
    BaseHTTPServer.test(HandlerClass, ServerClass)


if __name__ == '__main__':
    test()
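
If you want to check the proxy from a script rather than a browser, a
minimal smoke test might look like this (a sketch, assuming the server
above is already running on localhost:8000; the target URL is just an
example):

import urllib2

# Send the request through the local cache proxy instead of
# fetching the page directly.
proxy_support = urllib2.ProxyHandler({'http': 'http://localhost:8000'})
opener = urllib2.build_opener(proxy_support)

page = opener.open('http://www.python.org/index.html').read()
print page[:200]    # start of the cached copy, if google has one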
 

vegetax

(e-mail address removed) wrote:

lol, cool hack!! Make a slashdot article about it!!
[snip..]
 

vegetax

It works on Opera and Firefox on Linux, but you can't search in the
google cache! It would be more useful if you could somehow search only
in the cache, instead of putting in the straight link. Maybe you could
put a magic url to search the cache, like search:"search terms"
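
A rough sketch of how that magic url could be wired in (handle_search
is a hypothetical method added to googleCacheHandler; it assumes
pygoogle's doGoogleSearch returns an object whose .results entries
carry URL, title and snippet attributes, and some browsers may not
forward an unknown search: scheme to a proxy at all):

# Dispatch at the top of send_head:
#     if self.path.startswith('search:'):
#         return self.handle_search(self.path[len('search:'):])

def handle_search(self, query):
    # Render google search results as a simple HTML list of links.
    data = google.doGoogleSearch(query)
    items = ['<li><a href="%s">%s</a><br>%s</li>' %
             (r.URL, r.title, r.snippet) for r in data.results]
    thepage = ('<html><body><h3>Cache search: %s</h3>\n<ul>\n%s\n'
               '</ul></body></html>' % (query, '\n'.join(items)))
    f = StringIO(thepage)
    self.send_response(200)
    self.send_header("Content-type", 'text/html')
    self.send_header("Content-Length", str(len(thepage)))
    self.end_headers()
    return f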

 

Fuzzyman

vegetax said:
It works on Opera and Firefox on Linux, but you can't search in the
google cache! It would be more useful if you could somehow search only
in the cache, instead of putting in the straight link. Maybe you could
put a magic url to search the cache, like search:"search terms"

Thanks for the report. I've also tried it with Firefox on Windows.

Yeah - google search results aren't cached!! Perhaps anything in a
google domain ought to pass straight through. That could be done by
testing the domain and using urllib2 to fetch the page.

I've just tested the following, which works.

Add the following two lines to the start of the code:

import urllib2
txheaders = {'User-agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'}

Then change the start of the send_head method to this:

def send_head(self):
    """Only GET implemented for this.
    This sends the response code and MIME headers.
    Return value is a file object, or None.
    """
    print 'Request :', self.path    # traceback to sys.stdout
    url_tuple = urlparse.urlparse(self.path)
    url = url_tuple[2]
    domain = url_tuple[1]
    if domain.find('.google.') != -1:    # bypass the cache for google domains
        req = urllib2.Request(self.path, None, txheaders)
        return urllib2.urlopen(req)

[snip..]
 

vegetax

Fuzzyman wrote:

[snip..]


Doesn't work - the browser keeps asking me to save the page.

This one works =)
print 'Request :', self.path    # traceback to sys.stdout
url_tuple = urlparse.urlparse(self.path)
url = url_tuple[2]
domain = url_tuple[1]
if domain.find('.google.') != -1:    # bypass the cache for google domains
    req = urllib2.Request(self.path, None, txheaders)
    self.send_response(200)
    self.send_header("Content-type", 'text/html')
    self.end_headers()
    return urllib2.urlopen(req)
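
The difference is that the working version sends the status line and a
Content-Type header before handing back the urlopen object, so the
browser knows it is receiving HTML rather than an untyped stream it
should save to disk.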
 

Paul Rubin

Fuzzyman wrote:
(This is actually an 'inventive' short-term measure to get round a
restrictive internet policy at work :)

If that means what I think, you're better off setting up a
url-rewriting proxy server on some other machine, one that uses SSL on
the browser side. There's one written in Perl at:

http://www.jmarshall.com/tools/cgiproxy/

Presumably you're surfing through some oppressive firewall, and the
SSL going into the proxy prevents the firewall from logging all the
destination URLs going past it (and the content too, for that matter).
 

vegetax

Fuzzyman said:
Of course - sorry. Thanks for the fix. Out of interest - why are you
using this... just for curiosity, or is it helpful?

Because it's fun to surf the google cache =)
 
