New kind of Spam Filter

Roedy Green

I wonder if anyone would be interested in a beast such as this.
It is a sort of roll-your-own spam filter. I have been trying out
various filters and none work. They all have some fatal flaw that I
can't fix.


What I am proposing is a simple Javamail framework that looks at
messages on the server, and runs a number of user written filters on
them.

A user written filter gets passed a MimeMessage object, and returns a
float representing the probability this is spam, or the probability
this is definitely good. The user implements either an IsSpam or
IsHam interface.

It might come with a number of canned filters, e.g. everyone in my Eudora
address book is considered ham, all mail not addressed to me is spam,
all mail addressed to more than N people is spam. Other filters might
recognize variations on the current worm of the day's email, or reject
Chinese or Korean messages.
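As a rough illustration of how small such canned filters could be, here is a minimal sketch. It avoids the JavaMail dependency by using a hypothetical MailSummary class in place of MimeMessage, and the class and method names are invented for this example. The -1 "no opinion" value follows the interface posted later in the thread.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the few message fields these filters need. */
final class MailSummary {
    final String from;
    final List<String> recipients;

    MailSummary(String from, List<String> recipients) {
        this.from = from;
        this.recipients = recipients;
    }
}

/** Illustrative canned filters: 0.0 = ham, 1.0 = spam, -1 = no opinion. */
final class CannedFilters {
    static final float NO_OPINION = -1f;

    /** Senders in the address book (stored in lower case) are treated as ham. */
    static float addressBook(MailSummary m, Set<String> addressBook) {
        return addressBook.contains(m.from.toLowerCase(Locale.ROOT)) ? 0.0f : NO_OPINION;
    }

    /** Mail that does not list my address among its recipients is treated as spam. */
    static float notAddressedToMe(MailSummary m, String myAddress) {
        for (String recipient : m.recipients) {
            if (recipient.equalsIgnoreCase(myAddress)) {
                return NO_OPINION;
            }
        }
        return 1.0f;
    }

    /** Mail sent to more than maxRecipients addresses is treated as spam. */
    static float tooManyRecipients(MailSummary m, int maxRecipients) {
        return m.recipients.size() > maxRecipients ? 1.0f : NO_OPINION;
    }
}
```

A real filter would read the same fields from the MimeMessage headers; the all-or-nothing scores simply mirror the wording of the examples above.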

The advantage is, you can add any feature you like without having to
write an entire program. You can write a filter just to get rid of a
particular class of annoying spam, like Nigerian scam letters.

You might write custom filters for your customers so they don't have
to do fancy configuring. You just start the thing up then ignore it.

It would have no GUI, just a configuration file written in Java that
you compile to create the app you need.

Alternatively it might use Class.forName and not require compilation
of the config file.
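Loading filters by name could look roughly like the sketch below. The PluggableFilter interface, the ShoutingSubjectFilter example and the FilterLoader class are all hypothetical; the only assumption about real filters is that they have a no-argument constructor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical minimal filter contract, standing in for the real one. */
interface PluggableFilter {
    float score(String subject);
}

/** Example filter that a configuration file could name. */
class ShoutingSubjectFilter implements PluggableFilter {
    @Override
    public float score(String subject) {
        // An all-capitals subject containing a letter is rated likely spam; otherwise no opinion.
        boolean hasLetter = subject.chars().anyMatch(Character::isLetter);
        boolean allCaps = subject.equals(subject.toUpperCase(Locale.ROOT));
        return hasLetter && allCaps ? 0.9f : -1f;
    }
}

final class FilterLoader {
    /** Instantiates each named class through its no-argument constructor. */
    static List<PluggableFilter> load(List<String> classNames)
            throws ReflectiveOperationException {
        List<PluggableFilter> filters = new ArrayList<>();
        for (String name : classNames) {
            Class<? extends PluggableFilter> type =
                    Class.forName(name).asSubclass(PluggableFilter.class);
            filters.add(type.getDeclaredConstructor().newInstance());
        }
        return filters;
    }
}
```

The class names would come from a plain text configuration file, so adding a filter means dropping a class on the classpath and adding one line, with no recompilation of the framework.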

It either deletes the message, or perhaps adds a "probable spam"
indicator to the subject line, for filtering in the email program
or a manual lookover.

Ideally people might contribute their user-written filters for others
to use and/or modify.

To reduce RAM overhead, since it runs all the time, you might compile
it with JET.

I already have much of this code working as part of my bulk remailer.
 
Wojtek

What I am proposing is a simple Javamail framework that looks at
messages on the server, and runs a number of user written filters on
them.

I own my own domain, but do not manage the servers. So I get all the
SPAM directed at any address in the domain. My provider currently does
not filter anything (which is OK, as I prefer to do it myself).

Anyway, for this type of app to work for me, it would run on my
workstation and it would need FTP capability. That is:
- open an FTP connection to the mail directory
- scan through the files
-- apply the filter criteria
--- delete SPAM
- cleanup
It might come with a number of canned filters, e.g. everyone in my Eudora
address book is considered ham, all mail not addressed to me is spam,
all mail addressed to more than N people is spam. Other filters might
recognize variations on the current worm of the day's email, or reject
Chinese or Korean messages.

I get a LOT of emails with identical subject lines (or very close to
identical), so some method of saving the subject patterns between
"runs" would be good.
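One way to catch "very close to identical" subject lines is to normalise them before counting. The sketch below is only an assumption about what "close" means (case, digits and punctuation are ignored), and the SubjectTally class is hypothetical. Persisting the counts between runs is left out here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Counts how often a normalised subject line has been seen. */
final class SubjectTally {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Lower-cases, drops everything except letters and spaces, collapses whitespace. */
    static String normalise(String subject) {
        return subject.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\s]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** Records the subject and returns how often its normalised form has now been seen. */
    int record(String subject) {
        return counts.merge(normalise(subject), 1, Integer::sum);
    }
}
```

A filter could then treat any subject whose count passes some limit as suspicious.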
 
Roedy Green

Anyway, for this type of app to work for me, it would run on my
workstation and it would need FTP capability. That is:
- open an FTP connection to the mail directory
- scan through the files
-- apply the filter criteria
--- delete SPAM
- cleanup

Why would FTP be preferable to POP3 to do this?
 
Wojtek

The missing assumed step was that legitimate email would be left on
the server for the email client.
Why would FTP be preferable to POP3 to do this?

Well, I don't use Eudora....

I am not sure how this would work with an email client. I assumed that
it was for host side: filter the email before the client gets it via
POP3.

Or does it interpose itself between the email client and the IP stack?
If so, it would be best placed on my firewall, where it would act for
all my internal machines.

Or do I use it to get my email (via POP3) from the ISP to my
firewall, then I point my email client to the firewall which then acts
as the POP3 email server? If so, will it provide POP3 services, or do
I need to get one?

Besides my own domain, I am also the webmaster for another. That one
has several people who get email on it. How would this work for that
domain (I cannot run this on the actual mail server as it is owned by
the ISP)? All those users use Outlook.

I like the idea, especially the configurable filters. The email client
I use (PMMail 2000) already has built in filters as well as a somewhat
arcane filter language. But it does not have "memory" nor the
capability to compare all the emails as a group for patterns.
 
Roedy Green

Or does it interpose itself between the email client and the IP stack?
If so, it would be best placed on my firewall, where it would act for
all my internal machines.

There are two approaches I have seen used.

The simpler approach is to have the spam filter act like an email
client. It goes, sniffs the mail, and deletes spam and leaves the
good stuff on the server. Then you run your real email program. The
problem is some new spam could have come in between the time you ran
the filter and the time you picked up your mail. The filter usually only
downloads the first paragraph or so of the message to decide if it is spam.

The other technique is to implement the filter as an email proxy
server. You then have your email program talk to localhost:9999
instead of the regular mailserver. The K9 people did this, but only
on the POP3 side. Eudora thus does not work since you can't configure
the SMTP side independently.

To deal with spam you can either delete it on the server, or mark it
specially e.g. put [Spam] in subject line, where it is easy for the
mail program to filter it.
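The delete-or-mark choice could hang off the spam probability, roughly as in this sketch. The two thresholds are made-up values for illustration, and the Verdict and SpamPolicy names are hypothetical.

```java
/** What the framework could do with a scored message. */
enum Verdict { KEEP, TAG, DELETE }

final class SpamPolicy {
    // Hypothetical thresholds; the thread does not specify any.
    static final float TAG_AT = 0.5f;
    static final float DELETE_AT = 0.95f;

    /** Maps a spam probability to an action; low or negative scores are kept. */
    static Verdict decide(float probabilityIsSpam) {
        if (probabilityIsSpam >= DELETE_AT) {
            return Verdict.DELETE;
        }
        if (probabilityIsSpam >= TAG_AT) {
            return Verdict.TAG;
        }
        return Verdict.KEEP;
    }

    /** Prefixes the subject with [Spam] unless it is already tagged. */
    static String tagSubject(String subject) {
        return subject.startsWith("[Spam]") ? subject : "[Spam] " + subject;
    }
}
```

Keeping a tagging band below the delete threshold leaves borderline mail on the server for the mail program, or the user, to sort out.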

For this virus-generated stuff where the messages themselves are
fairly fat with an attachment, it makes sense to delete on the server
without downloading the whole thing.
 
Roedy Green

Besides my own domain, I am also the webmaster for another. That one
has several people who get email on it. How would this work for that
domain (I cannot run this on the actual mail server as it is owned by
the ISP)? All those users use Outlook.

I was thinking of something that ran client side, since it would talk
POP3. However, you could run it on the server, just talking to the
mailserver locally.

Probably more efficient though would be to write a different framework
that ran the same filters (not as tight as for individuals) to filter
all mail coming into an ISP.

The main idea of the program is that it is user-extensible with
arbitrarily complicated Java code. It would be fairly trivial to add
a blacklist ISP filter, a friends/enemies filter. All the work does
not fall on one person. You don't have the political problem of
convincing the author your style of filter is important or explaining
just how your filter should work.

Java code is something we all understand. Learning how to write
filters by GUI often gets you 90% of the way to where you want to be and
leaves you dangling.
 
Roedy Green

I like the idea, especially the configurable filters. The email client
I use (PMMail 2000) already has built in filters as well as a somewhat
arcane filter language.

If you are trying to support clients, they all have different email
programs. Further you can't usually figure out the filter then send
it to them. You have to coach them through setting it up themselves.
Phhht!
 
Roedy Green

But it does not have "memory" nor the
capability to compare all the emails as a group for patterns.

This is trickier.

The framework could ask the filter if it wanted to be called twice.
It would get a chance to look at all the incoming new mail first, then
on the second pass make its final decision.

Or perhaps it would only get one look at each new piece, but it would
be at liberty to maintain its own persistent internal state.
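The two-pass idea might be sketched as follows. The TwoPassFilter contract is hypothetical, messages are reduced to their subject lines, and the 0.9 score is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical two-pass contract: observe every message, then judge each one. */
interface TwoPassFilter {
    void observe(String subject);

    float probabilityIsSpam(String subject);
}

/** Flags subjects that occur at least `threshold` times in the same batch. */
final class RepeatedSubjectFilter implements TwoPassFilter {
    private final Map<String, Integer> seen = new HashMap<>();
    private final int threshold;

    RepeatedSubjectFilter(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public void observe(String subject) {
        seen.merge(subject, 1, Integer::sum);
    }

    @Override
    public float probabilityIsSpam(String subject) {
        return seen.getOrDefault(subject, 0) >= threshold ? 0.9f : -1f;
    }
}

final class TwoPassRunner {
    static List<Float> run(TwoPassFilter filter, List<String> subjects) {
        for (String subject : subjects) {
            filter.observe(subject); // pass 1: look at all the new mail
        }
        List<Float> scores = new ArrayList<>();
        for (String subject : subjects) {
            scores.add(filter.probabilityIsSpam(subject)); // pass 2: final decision
        }
        return scores;
    }
}
```

With the one-look alternative, the same map would simply be kept across runs instead of being filled in a first pass.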
 
Rach

You could have a client side program which connected to the mail server on
demand, and your mail program connected to this new proxy.

Then your mail proxy would run its filters before sending the refined list
to the mail client. To your mail client this is invisible, and would work
with all mail clients.

I'd be interested in helping.
 
Gary M

I wonder if anyone would be interested in a beast such as this.
It is a sort of roll-your-own spam filter. I have been trying out
various filters and none work. They all have some fatal flaw that I
can't fix.

Hi Roedy, this is a subject of great interest for me.

As a long aside, I have written my own spam terminator program, which I
called (a misnomer now) PopSpam. It uses Javamail with Jetty embedded as
a default servlet container for the GUI. It worked along the lines of a
mail client that polls the POP or IMAP server periodically and executes
white and black rules.

I built it over MySQL, but with data access interfaces that allow other
persistence layers to be built. Likewise, rules are pluggable: I employ
a simple Java interface that lets you write your own rules and, importantly,
prioritize their execution, as rules can be expensive endeavours (example: you
want to perform your body checks as a last resort, as this requires a full
message download). It also has some self-optimization built in that, for
example, orders the most successful rules to be checked first.
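To make the ordering idea concrete, here is a minimal sketch. It is not PopSpam's code: the Letter, BlackRule and RuleRunner classes are hypothetical, and the two cost tiers (headers only versus full body) are an assumption about how "expensive" might be modelled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical message stand-in; a real body would be downloaded only on demand. */
final class Letter {
    final String subject;
    final String body;

    Letter(String subject, String body) {
        this.subject = subject;
        this.body = body;
    }
}

/** Hypothetical black rule with a cost tier and a running hit count. */
final class BlackRule {
    final String name;
    final int cost; // 0 = headers only, 1 = needs the full body
    final Predicate<Letter> test;
    int hits;

    BlackRule(String name, int cost, Predicate<Letter> test) {
        this.name = name;
        this.cost = cost;
        this.test = test;
    }
}

final class RuleRunner {
    // Cheapest tier first; within a tier, the rule with the most hits so far first.
    private static final Comparator<BlackRule> ORDER =
            Comparator.comparingInt((BlackRule r) -> r.cost)
                    .thenComparing(Comparator.comparingInt((BlackRule r) -> r.hits).reversed());

    private final List<BlackRule> rules = new ArrayList<>();

    void add(BlackRule rule) {
        rules.add(rule);
    }

    /** Returns the name of the first matching rule, if any, and bumps its hit count. */
    Optional<String> firstMatch(Letter letter) {
        rules.sort(ORDER);
        for (BlackRule rule : rules) {
            if (rule.test.test(letter)) {
                rule.hits++;
                return Optional.of(rule.name);
            }
        }
        return Optional.empty();
    }
}
```

Because the sort is stable, rules with equal cost and hit count keep the order in which they were added.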

I run the app over a number of friends' and family members' accounts and I have a
better than 99% success rate. I have adopted a fundamental position that
statistical score-based filters are unworkable for the _common_ user
(important distinction), because these statistical methods require
users/organizations to maintain corpora of 'bad' and 'good' spam (what
does the pharmacy think when its legit viagra mail gets wiped?). So my
approach is based entirely on deterministic rules with spam features that
are personalized to the user/corporation; that said, it would be
relatively simple to employ some rules that query a model. I just don't
see much mileage in it for most users.

My basic premise is that everything that meets a black rule is deleted
(actually stored, in case of a false positive) and everything that meets a
white rule is allowed through. This means it would be possible for me to
receive email from you even though you are not known to me. However, if
your message was spammy you'd probably not get through. That is the
Achilles heel, and every antispam solution has one. This is mitigated as I
store the message for recovery.

My ultimate goal is to build an antispam program for admins, rather than
for individual mail users, that acts as an SMTP proxy prechecking messages.
It would require the mail user to review caught spam periodically and to
create and customize their white rules.

The reason for this approach is that I feel spammers and most antispam
tools exploit both sides of the same problem: bandwidth is cheap, so I
can spam as much as I like, says the spammer. The antispam tool says
bandwidth is cheap so I can download the whole message and check it at
the client. The antispam tool I envision is attractive to organizations
as it would address this fallacy.

Anyway getting back to your post. I am interested in this project if it
gets going. I'd be willing to help out in any way I can. My own effort
has reached a stable point, but I have lost some enthusiasm to finish it
given the almost daily announcements of similar tools that are making for
a rather crowded marketspace.

Gary
 
Gary M

I run the app over a number of friends' and family members' accounts and I have a
better than 99% success rate. I have adopted a fundamental position that
statistical score-based filters are unworkable for the _common_ user
(important distinction), because these statistical methods require
users/organizations to maintain corpora of 'bad' and 'good' spam

This should read "'bad' and 'good' messages".
 
David Segall

Roedy Green said:
I wonder if anyone would be interested in a beast such as this.
It is a sort of roll-your-own spam filter. I have been trying out
various filters and none work. They all have some fatal flaw that I
can't fix.

I agree, but I think it may be more productive to implement an email
client to solve the problem. The client already knows the format of
your address book and has "read" and, if necessary, stored your
previous emails.

What I am proposing is a simple Javamail framework that looks at
messages on the server, and runs a number of user written filters on
them.

A user written filter gets passed a MimeMessage object, and returns a
float representing the probability this is spam, or the probability
this is definitely good. The user implements either an IsSpam or
IsHam interface.

It may be preferable to provide an optional separate set of filters
for the message headers to avoid a double download of legitimate large
emails. Actually this would be useful to avoid the download of the
attachments in the current wave of spam which, in my case, is around
200MB per day.

It might come with a number of canned filters, e.g. everyone in my Eudora
address book is considered ham, all mail not addressed to me is spam,
all mail addressed to more than N people is spam. Other filters might
recognize variations on the current worm of the day's email, or reject
Chinese or Korean messages.

The advantage is, you can add any feature you like without having to
write an entire program. You can write a filter just to get rid of a
particular class of annoying spam, like Nigerian scam letters.

You might write custom filters for your customers so they don't have
to do fancy configuring. You just start the thing up then ignore it.

It would have no GUI, just a configuration file written in Java that
you compile to create the app you need.

Alternatively it might use Class.forName and not require compilation
of the config file.

It either deletes the message, or perhaps adds a "probable spam"
indicator to the subject line, for filtering in the email program
or a manual lookover.

Ideally people might contribute their user-written filters for others
to use and/or modify.

I like the idea of a library of user-written filters, particularly
because some of them would have to interpret one or more of the many
address list formats used by email clients. I would really like a
filter which tells me that the return address is invalid, but it would
require a much better knowledge of the protocols than I possess.

To reduce RAM overhead, since it runs all the time, you might compile
it with JET.

I already have much of this code working as part of my bulk remailer.
That's a good argument for ignoring my idea of writing an email
client. :) Publish a "pre-Alpha" version and see what happens.
 
Neil Campbell

Roedy said:
I wonder if anyone would be interested in a beast such as this.
It is a sort of roll-your-own spam filter. I have been trying out
various filters and none work. They all have some fatal flaw that I
can't fix.


What I am proposing is a simple Javamail framework that looks at
messages on the server, and runs a number of user written filters on
them.

I think the problem with this approach is that simple user-written filters
aren't usually terribly successful. To effectively keep out the majority
of spam you need much more sophisticated techniques.

If your framework provides a way for people to write things like Bayesian
filters more easily, then it would be very valuable; however I think it
will only be as good as the filters available for it. Most of the very
simple filters can usually be implemented by the mail client itself, of
course (at least in KMail and Outlook, presumably in others as well).

In my opinion, you'd have to think about what your system would provide that
similar tools don't.
 
Roedy Green

That's a good argument for ignoring my idea of writing an email
client. :) Publish a "pre-Alpha" version and see what happens.

Here is a first cut at the interface for spam filters:

package com.mindprod.spam;

import javax.mail.internet.MimeMessage;

/**
 * Interface for a spam filter.
 *
 * @author Roedy Green
 * @version 1.0
 * @since 2003-09-22
 */
public interface SpamDetect
{
    /**
     * What is the probability the given message is spam?
     * 0.0 = definitely good.
     * 0.5 = 50-50 odds
     * 1.0 = absolutely certainly spam.
     * -1 = no opinion.
     *
     * @param message MimeMessage from which you can extract any fields
     *                of interest.
     *
     * @return probability
     */
    public float probabilityIsSpam ( MimeMessage message );

    /**
     * Fire up this filter.
     * Do any one-time initialisation,
     * e.g. load tables, restore persistent state.
     */
    public void open();

    /**
     * Shutdown this filter,
     * e.g. save persistent state, free resources.
     */
    public void close();

}
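The interface leaves open how the framework would merge the answers of several filters. One possible policy, offered purely as an assumption, is to ignore "no opinion" answers and average the rest. The sketch uses a hypothetical Detector interface that is generic in the message type so that it compiles without JavaMail; with the real interface, M would be MimeMessage.

```java
import java.util.List;

/** Stand-in for the posted interface, generic so the sketch needs no JavaMail. */
interface Detector<M> {
    float probabilityIsSpam(M message);
}

final class ScoreCombiner {
    static final float NO_OPINION = -1f;

    /** Averages the answers of all filters that have an opinion; -1 if none has. */
    static <M> float combine(List<Detector<M>> filters, M message) {
        float sum = 0f;
        int opinions = 0;
        for (Detector<M> filter : filters) {
            float p = filter.probabilityIsSpam(message);
            if (p >= 0f && p <= 1f) { // anything outside 0..1 counts as "no opinion"
                sum += p;
                opinions++;
            }
        }
        return opinions == 0 ? NO_OPINION : sum / opinions;
    }
}
```

Other policies are just as defensible, for example letting a definite 0.0 from a whitelist filter override everything else.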
 
Roedy Green

In my opinion, you'd have to think about what your system would provide that
similar tools don't.

SpamDetective: would handle large numbers of messages, which
SpamDetective does not.

K9, SpamBayes: would let you use it with Eudora, which K9 does not.

MailWasher: would let you use it with large numbers of messages, which
MailWasher does not.

Various server-based solutions: would let you use it without the
co-operation of your ISP or server admin folk.

Vipul's Razor: easier to install and configure, if you just used
canned filters. Perhaps someone could even build a filter that used
the Razor protocol.

SaProxy: uses 80 MB of RAM. Presumably we could do better with JET
compilation.

Bogofilter: C source only, does not run on Windows.

The key thing is the ability to whip up your own little filter to nail
your own particular problem using your familiar Java tools.
 
Mark Thornton

Roedy said:
I wonder if anyone would be interested in a beast such as this.
It is a sort of roll-your-own spam filter. I have been trying out
various filters and none work. They all have some fatal flaw that I
can't fix.


What I am proposing is a simple Javamail framework that looks at
messages on the server, and runs a number of user written filters on
them.

To be effective in the current circumstances, the filter needs to
actually run on the server so that it can work while your client
computer is switched off or disconnected from the net. My ISP limits the
size of my mailbox on their server to 10MB, thus with >1100 messages in
the past 12 hours (~165MB) the box would have filled many times over if
my machine had not been continuously collecting (and filtering) them.

In some cases a server based filter might be able to reject a message
before the entire message had been received (e.g. based on the title or
when a .exe attachment is encountered).

Mark Thornton

P.S. Of course I can't compete with the 5GB you received, but this is a
competition I would rather not be in.
 
Neil Campbell

Roedy said:
The key thing is the ability to whip up your own little filter to nail
your own particular problem using your familiar Java tools.

Fair enough, but in my experience these sorts of problems are those in which
you want to block all messages with a particular phrase in the subject
line, or messages from a particular domain. In these cases, the mail
client often provides enough functionality to deal with it.

The more complex cases of blocking spam in general are difficult to deal
with using user-written filters. Tools like Popfile go some way to
stopping these, and implementing similar tools again would be
time-consuming at best.

I agree totally with the validity of permitting this sort of filtering to be
done at the client; it is usually impractical to persuade an ISP to
implement something useful.

Please don't interpret these comments as negative; I think the project is
definitely a worthwhile one. I simply feel that the sort of 'little
filters' that could be easily written for this would be somewhat limited.
If this is taken further, however, I'd love to add support for it to my
mail client (which I'm gradually progressing towards a workable release).
 
Wojtek

There are two approaches I have seen used.

The other technique is to implement the filter as an email proxy
server. You then have your email program talk to localhost:9999
instead of the regular mailserver.

That makes sense....

The K9 people did this, but only
on the POP3 side. Eudora thus does not work since you can't configure
the SMTP side independently.

Really? That's strange. My ISP has SMTP for outgoing, yet I get my
email from my domain via POP3. My domain provider does not allow SMTP
unless it comes from their own network (dialup accounts).
To deal with spam you can either delete it on the server, or mark it
specially e.g. put [Spam] in subject line, where it is easy for the
mail program to filter it.

For this virus-generated stuff where the messages themselves are
fairly fat with an attachment, it makes sense to delete on the server
without downloading the whole thing.

So you grab the first X bytes via what? I would think FTP, or can POP3
do this?
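For reference, POP3 itself can do this: RFC 1939 defines an optional TOP command, "TOP msg n", which returns the headers of message msg plus the first n lines of its body. The sketch below shows the exchange on an already authenticated session. The Pop3Top class is hypothetical, and whether a given server supports TOP is an assumption to check.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

final class Pop3Top {
    /**
     * Sends "TOP msg n" and returns the headers plus the first n body lines.
     * Assumes the reader and writer belong to an authenticated POP3 session.
     */
    static List<String> top(BufferedReader in, Writer out, int messageNumber, int bodyLines)
            throws IOException {
        out.write("TOP " + messageNumber + " " + bodyLines + "\r\n");
        out.flush();

        String status = in.readLine();
        if (status == null || !status.startsWith("+OK")) {
            throw new IOException("TOP failed: " + status);
        }

        List<String> lines = new ArrayList<>();
        while (true) {
            String line = in.readLine();
            if (line == null) {
                throw new IOException("Connection closed before end of response");
            }
            if (line.equals(".")) {
                break; // a lone dot terminates the multi-line response
            }
            // Undo byte-stuffing: the server prefixes lines that start with "." with an extra "."
            lines.add(line.startsWith(".") ? line.substring(1) : line);
        }
        return lines;
    }
}
```

In real use the reader and writer would wrap the input and output streams of a socket connected to the POP3 server after USER and PASS have succeeded.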
 
Wojtek

I was thinking of something that ran client side, since it would talk
POP3. However, you could run it on the server, just talking to the
mailserver locally.

Probably more efficient though would be to write a different framework
that ran the same filters (not as tight as for individuals) to filter
all mail coming into an ISP.

The main idea of the program is that it is user-extensible with
arbitrarily complicated Java code. It would be fairly trivial to add
a blacklist ISP filter, a friends/enemies filter. All the work does
not fall on one person. You don't have the political problem of
convincing the author your style of filter is important or explaining
just how your filter should work.

Ok, then the app would have all filters, then using a config file
(XML?) you would configure the filters and which ones were live.
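A config file of that kind need not be XML. As one possible shape, the sketch below reads a java.util.Properties file in which a "filters" key lists the live filters and other keys carry their settings. The key names and the FilterConfig class are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class FilterConfig {
    private final Properties props = new Properties();

    FilterConfig(Reader source) throws IOException {
        props.load(source);
    }

    /** Names listed under the hypothetical "filters" key, in the order given. */
    List<String> liveFilters() {
        List<String> names = new ArrayList<>();
        for (String name : props.getProperty("filters", "").split(",")) {
            if (!name.trim().isEmpty()) {
                names.add(name.trim());
            }
        }
        return names;
    }

    /** Reads an integer setting such as "TooManyRecipients.max", with a fallback. */
    int intSetting(String key, int fallback) {
        String value = props.getProperty(key);
        return value == null ? fallback : Integer.parseInt(value.trim());
    }
}
```

A file for this sketch might contain the two lines "filters = AddressBook, TooManyRecipients" and "TooManyRecipients.max = 20". Properties can also read an XML variant through loadFromXML if XML is preferred.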
 
Wojtek

This is trickier.

The framework could ask the filter if it wanted to be called twice.
It would get a chance to look at all the incoming new mail first, then
on the second pass make its final decision.

Or perhaps it would only get one look at each new piece, but it would
be at liberty to maintain its own persistent internal state.

I think persistent storage. The greater the sample, the more accurate
the analysis.

Maybe even a central server with the signatures? Hmm, I think this has
been done already. But we can do it "better" :))

Each filter would (should?) have the option of saving some state
information. If nothing else a simple hit count. Either through its
own code, or using the framework's classes.
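A framework-supplied helper for that simple hit count could be as small as the following sketch, which keeps the counts in a properties file. The HitCounts class and the file layout are hypothetical; loading in the constructor and calling save() would map naturally onto the open() and close() methods of the posted interface.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Keeps per-filter hit counts in a properties file between runs. */
final class HitCounts {
    private final Properties counts = new Properties();
    private final Path file;

    /** Loads existing counts if the file is already there. */
    HitCounts(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                counts.load(in);
            }
        }
    }

    /** Adds one hit for the filter and returns the new total. */
    int hit(String filterName) {
        int updated = get(filterName) + 1;
        counts.setProperty(filterName, Integer.toString(updated));
        return updated;
    }

    int get(String filterName) {
        return Integer.parseInt(counts.getProperty(filterName, "0"));
    }

    /** Writes the counts back so the next run can pick them up. */
    void save() throws IOException {
        try (OutputStream out = Files.newOutputStream(file)) {
            counts.store(out, "spam filter hit counts");
        }
    }
}
```

Richer state, such as the subject signatures mentioned above, would need a different format, but the load-on-start and save-on-exit pattern stays the same.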
 
