A better webpage filter

Anton Vredegoor

For a few days now I've been experimenting with a setup that lets me
send the source code of the web page I'm reading through a Python
script and then into a new tab in Mozilla. The new tab is opened
automatically, so the process feels very natural, although there's a lot
of reading, filtering and writing behind the scenes.

I want to do three things with this post:

A) Explain the process so that people can try it for themselves and say
"Hey stupid, I've been doing the same thing with greasemonkey for ages",
or maybe "You're great, this is easy to see, since the crux of the
biscuit is the apostrophe." Both kinds of comments are very welcome.

B) Explain why I want such a thing.

C) If this approach is still valid after all of the above, ask for help
writing a better Python htmlfilter.py

So here we go:

A) Explain the process

We need:

- Mozilla Firefox http://en-us.www.mozilla.com/en-US/
- the ViewSourceWith add-on https://addons.mozilla.org/firefox/394/
- a batch file (on Windows):
    (htmfilter.bat)
    d:\python25\python.exe D:\Python25\Scripts\htmlfilter.py "%1" > out.html
    start out.html
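- a Python script:
    # htmlfilter.py
    # Blank out every tag whose text contains one of the given strings.

    import sys

    def htmlfilter(fname, skip=()):
        f = open(fname)
        data = f.read()
        f.close()
        # Collect the (start, end) positions of every '<' ... '>' span.
        L = []
        j = None
        for i, x in enumerate(data):
            if x == '<':
                j = i
            elif x == '>' and j is not None:
                L.append((j, i))
                j = None
        R = list(data)
        # Work backwards so earlier positions stay valid after each cut.
        for i, j in reversed(L):
            s = data[i:j+1]
            for x in skip:
                if x in s:
                    R[i:j+1] = ' '
                    break
        return ''.join(R)

    def test():
        if len(sys.argv) == 2:
            skip = ['div', 'table']
            fname = sys.argv[1].strip()
            print(htmlfilter(fname, skip))

    if __name__ == '__main__':
        test()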

Now install the htmlfilter.py file in your Python scripts dir and adapt
the batch file to point to it.

To run the batch file through the ViewSourceWith add-on: go to some
web page, left click and view the source with the batch file.

B) Explain why I want such a thing.

OK, maybe this should have been the thing to start with, but hey, it's
such an interesting technique it's almost a waste not to give it a chance
before my idea is dissed :)

Most web pages I visit lately take up so much room for ads (even with
adblocker installed) that the mere 20 columns of text left for reading
slow me down unacceptably. I have tried clicking 'print this' or
'printer friendly', or using 'no style' from the Mozilla menu and
switching back again for other pages, but it was tedious to say the
least. Every web page has different conventions. In the end I just
started editing web pages' source code by hand, cutting out the beef and
saving it as an HTML file with only text, no scripts or formatting. But
that was also not very satisfying because raw web pages are *big*.

Then I found out I could often just replace all 'table' or 'div'
tags with a space and the page, although not very HTML compliant any
more, still loads and often the text looks a lot better. This worked for
at least 50 percent of the pages and restored my autonomy and
independence in reading web pages! (Which I do a lot by the way, maybe
for most people the problem is not very irritating, because they don't
read as much? Tell me that too, I want to know :)

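The "replace the tags with a space" idea can also be sketched with a regular expression. This is only a rough sketch, assuming Python 3; strip_tags is a made-up name and is not the htmlfilter.py from this post. It matches on the tag name itself rather than on any text inside the tag, so something like <a class="divider"> is left alone. It will still trip over a '>' inside an attribute value.

```python
import re

def strip_tags(html, names=("div", "table")):
    """Replace opening and closing tags with the given names by a space."""
    # Hypothetical helper for illustration only.
    # \b keeps 'div' from matching the start of a longer tag name.
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"</?(?:%s)\b[^>]*>" % alternatives, re.IGNORECASE)
    return pattern.sub(" ", html)

sample = '<div class="ad"><p>Text <a class="divider" href="#">link</a></p></div>'
print(strip_tags(sample))
```

Only the tags themselves go away; whatever sat between them stays in the page, which is the same effect as described above.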
C) Ask for help writing a better Python htmlfilter.py

Please. You can see the code for yourself, this can surely be done better :)

A.
 
Gabriel Genellina

On Sat, 24 Mar 2007 15:45:41 -0300, Anton Vredegoor wrote:
[...]
A) Explain the process so that people can try it for themselves and say
"Hey stupid, I've been doing the same thing with greasemonkey for ages",
or maybe "You're great, this is easy to see, since the crux of the
biscuit is the apostrophe." Both kinds of comments are very welcome.

I use the Opera browser: http://www.opera.com
Among other things (like having had tabs for ages!), it lets you:
- enable/disable tables and divs (like you do)
- enable/disable images with a keystroke, or only show cached images.
- enable/disable CSS
- banner suppression (aggressive)
- enable/disable scripting
- "fit to page width" (for those annoying sites that insist on using a
fixed width of about 400 pixels, less than 1/3 of my actual screen size)
- apply your custom CSS or javascript on any page
- edit the page source and *refresh* the original page to reflect your
changes

All of this makes for very smooth web navigation, especially on a slow
computer or slow connection.
 
Anton Vredegoor

Gabriel said:
I use the Opera browser: http://www.opera.com

Thanks! I forgot about that one. It does what I want natively so I will
go that route for now. Still, I think there must be some use for my
method of filtering. It's just too good not to have some use :) Maybe
in the future, when web pages add new advertising tactics faster than
web browser builders can change their toolbox or instruct their users.
After all, I was editing the filter script on one screen and another
screen was using the new filter as soon as I had saved it.

Maybe someday someone will write a GUI where one can click some radio
buttons that would define what goes through and what not. Possibly such
a filter could be collectively maintained on a live webpage with an
update frequency of a few seconds or something. Just to make sure we're
prepared for the worst :)

A.
 
John J. Lee

Anton Vredegoor said:
Most web pages I visit lately are taking so much room for ads (even
with adblocker installed) that the mere 20 columns of text that are
available for reading are slowing me down unacceptably. I have tried
[...]

http://webcleaner.sourceforge.net/


Not actually tried it myself, though did browse some of the code once
or twice -- does some clever stuff.

Lots of other Python-implemented HTTP proxies, some of which are
relevant (though AFAIK all less sophisticated than webcleaner), are
listed on Alan Kennedy's nice page here:

http://xhaus.com/alan/python/proxies.html


A surprising amount of diversity there.


John
 
Anton Vredegoor

John said:
http://webcleaner.sourceforge.net/

Thanks, I will look into it sometime. Essentially my problem has been
solved by switching to opera, but old habits die hard and I find myself
using Mozilla and my little script more often than would be logical.

Maybe the idea of having a *Python* script open at all times through
which all content goes is just too tempting. I mean, if there's some
possible irritation on a site, theoretically I could just write a
specific function to get rid of it. This mindset works as a
placebo on my web browsing experience, so that the actual problems don't
always even need to be solved ... I hope I'm not losing all traditional
programmers here with this approach :)

John said:
Lots of other Python-implemented HTTP proxies, some of which are
relevant (though AFAIK all less sophisticated than webcleaner), are
listed on Alan Kennedy's nice page here:

http://xhaus.com/alan/python/proxies.html

At least now I know what general category seems to be nearest to my
solution, so thanks again for that. However, my solution is not really
doing anything like the programs on this page (although it is related to
removing ads); instead it tries to modify a copy of the page after
it's been saved on disk. This removes all kinds of links and enables one
to definitively reshape the form the page will take. As such
it is more concerned with the metaphysical image the page makes on the
user's brain and less with the actual content or the security aspects.

One thing I noticed though on that (nice!) Alan Kennedy page is that
there was a script that was so small that it didn't even have a homepage
but instead it just relied on a google groups post! I guess you can see
that I liked that one :)

My filter is even smaller. I've tried to make it smaller still by
removing the batch file and using webbrowser.open(some cStringIO object),
but that didn't work on Windows.
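One likely reason for that: webbrowser.open takes a URL string, and an in-memory buffer has no address a browser could load. A possible way around it, sketched here under the assumption of Python 3, is to write the filtered text to a temporary file and pass its file URL instead. The name show_html and its opener parameter are made up for this illustration.

```python
import os
import tempfile
import webbrowser
from pathlib import Path

def show_html(html, opener=webbrowser.open):
    """Write html to a temporary .html file and open its file URL."""
    # Hypothetical helper: the browser needs a real file to point at,
    # so the text is written to disk before anything is opened.
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(html)
    # as_uri() builds a proper file:// URL, also for Windows drive paths.
    opener(Path(path).as_uri())
    return path
```

Calling show_html(htmlfilter(fname, skip)) at the end of the script would presumably stand in for the "> out.html" and "start out.html" lines of the batch file. The temporary file is not cleaned up here, since the browser may still be reading it.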

regards,

A.
 
Gabriel Genellina

On Mon, 26 Mar 2007 06:06:00 -0300, Anton Vredegoor wrote:
Maybe the idea of having a *Python* script open at all times through
which all content goes is just too tempting. I mean, if there's some
possible irritation on a site, theoretically I could just write a
specific function to get rid of it.
[...]

If you don't mind using JavaScript instead of Python, UserJS is for you:
http://www.opera.com/support/tutorials/userjs/
 
