Instrumented web proxy

Andrew McLean

I would like to write a web (HTTP) proxy which I can instrument to
automatically extract information from certain web sites as I browse
them. Specifically, I would want to process URLs that match a particular
regexp. For those URLs I would have code that parsed the content and
logged some of it.

Think of it as web scraping under manual control.
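
To make the idea concrete, here is a minimal sketch of the "match a URL, parse the content, log some of it" step on its own, independent of any proxy. The URL pattern, the choice of extracting the page title, and the names URL_PATTERN, TITLE_PATTERN and process are all assumptions for illustration, not anything from the thread.

```python
import logging
import re

# Hypothetical pattern: which sites are of interest is an assumption.
URL_PATTERN = re.compile(r"^http://www\.example\.com/items/\d+$")
# Hypothetical extraction: pull out the page title as the "information".
TITLE_PATTERN = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

log = logging.getLogger("scraper")


def process(url, body):
    """Return the page title if the URL is of interest, else None."""
    if not URL_PATTERN.match(url):
        return None  # not a URL we care about
    match = TITLE_PATTERN.search(body)
    if match is None:
        return None  # URL matched, but nothing to extract
    title = match.group(1).strip()
    log.info("%s -> %s", url, title)
    return title
```

A real extractor would probably use an HTML parser rather than a regexp on the body; the regexp just keeps the sketch short.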

I found this list of Python web proxies

http://www.xhaus.com/alan/python/proxies.html

Tiny HTTP Proxy in Python looks promising as it's nominally simple (not
many lines of code)

http://www.okisoft.co.jp/esc/python/proxy/

It does what it's supposed to, but I'm a bit at a loss as to where to
intercept the traffic. I suspect it should be quite straightforward, but
I'm finding the code a bit opaque.

Any suggestions?

Andrew
 
Miki

Hello Andrew,

Andrew McLean said:
Tiny HTTP Proxy in Python looks promising as it's nominally simple (not
many lines of code)

http://www.okisoft.co.jp/esc/python/proxy/

It does what it's supposed to, but I'm a bit at a loss as to where to
intercept the traffic. I suspect it should be quite straightforward, but
I'm finding the code a bit opaque.

Any suggestions?

From a quick look at the code, you need to hook into do_GET, where
you have the URL (see the urlunparse line).
If you want the actual content of the page, you'll need to hook into
_read_write (data = i.recv(8192)).
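
Note that recv(8192) returns at most 8192 bytes per call, so a hook placed there will generally see a page in several pieces and would have to accumulate them. As an alternative, here is a rough sketch of a separate, minimal proxy that fetches the whole page before relaying it, so the hook gets the URL and the complete body together. It is not the Tiny HTTP Proxy code: it is written against the Python 3 standard library, handles plain HTTP GET only, and the names INTERESTING, captured, inspect, LoggingProxy and make_server are made up for illustration.

```python
import re
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical pattern for the pages to inspect.
INTERESTING = re.compile(r"^http://[^/]+/items/")
captured = []  # (url, body) pairs collected by the hook


def inspect(url, body):
    """Hook: called with the full URL and the complete response body."""
    if INTERESTING.match(url):
        captured.append((url, body))


class LoggingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # When a browser talks to a proxy, the request line carries the
        # absolute URL, so self.path is the full URL here.
        url = self.path
        # An empty ProxyHandler stops urllib from using any system proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        try:
            with opener.open(url, timeout=30) as upstream:
                body = upstream.read()
                status = upstream.status
                ctype = upstream.headers.get(
                    "Content-Type", "application/octet-stream")
        except urllib.error.URLError:
            # Upstream failures, including HTTP error statuses, are all
            # reported as 502 in this sketch.
            self.send_error(502)
            return
        inspect(url, body)
        # Relay only the status, content type and body to the browser.
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence the default per-request logging


def make_server(port=8080):
    """Build the proxy server; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer(("127.0.0.1", port), LoggingProxy)
```

To try it, run make_server().serve_forever() and point the browser's HTTP proxy setting at 127.0.0.1:8080. Most response headers, HTTPS (CONNECT) and other request methods are left out to keep the sketch short.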

HTH,
 
Paul Rubin

Andrew McLean said:
I would like to write a web (HTTP) proxy which I can instrument to
automatically extract information from certain web sites as I browse
them. Specifically, I would want to process URLs that match a
particular regexp. For those URLs I would have code that parsed the
content and logged some of it.

Think of it as web scraping under manual control.

I've used Proxy 3 for this. It's a very cool program with powerful
capabilities for on-the-fly HTML rewriting.

http://theory.stanford.edu/~amitp/proxy.html
 
