New kind of Spam Filter

Wojtek

The main idea of the program is that it is user-extensible with
arbitrarily complicated Java code. It would be fairly trivial to add
a blacklist ISP filter or a friends/enemies filter.

Each filter could return a probability. The framework could then
tally the probabilities and assign a weight to the email. If the
weight value surpasses some amount (or a filter returns some
significant probability), then the email is SPAM, not the real piggy
stuff (love the SPAM <-> HAM usage :)).
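
A minimal sketch of the tallying idea described above. The interface name ProbabilityFilter, the class SpamTally, and both threshold values are assumptions made for illustration; the thread does not fix any of them.

```java
import java.util.List;

// Hypothetical filter contract: returns a spam probability in [0.0, 1.0].
interface ProbabilityFilter {
    double probability(String rawMessage);
}

class SpamTally {
    // Assumed thresholds, chosen only for the example.
    static final double WEIGHT_LIMIT = 1.5;
    static final double VETO_LIMIT = 0.95;

    // Sums the filter probabilities into a weight. The message counts as
    // spam if the weight passes WEIGHT_LIMIT, or if any single filter
    // reports at least VETO_LIMIT.
    static boolean isSpam(String rawMessage, List<ProbabilityFilter> filters) {
        double weight = 0.0;
        for (ProbabilityFilter f : filters) {
            double p = f.probability(rawMessage);
            if (p >= VETO_LIMIT) {
                return true;
            }
            weight += p;
        }
        return weight > WEIGHT_LIMIT;
    }
}
```

A plain sum grows with the number of filters, so the weight limit would have to be tuned to however many filters are live.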
 
Roedy Green

So you grab the first X bytes via what? I would think FTP, or can POP3
do this?

I know spamDetective does this. I gather it just starts the read and
aborts. In JavaMail, the reading is transparent. You don't really
know when the I/O goes on unless you monitor traffic to figure out
how it works. It may thus be harder in JavaMail to avoid downloading
more than you need.
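
On the POP3 question: RFC 1939 defines an optional TOP command ("TOP msg n") that returns the headers plus the first n lines of the body, where the server supports it. The "start the read and abort" approach can be sketched on any InputStream as below. PrefixReader is a hypothetical helper name, and how the stream is obtained from the mail server is left out on purpose.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class PrefixReader {
    // Reads at most maxBytes from the stream, then stops reading.
    // The caller is expected to close the stream (and the connection
    // behind it) afterwards so the rest is never transferred.
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (out.size() < maxBytes) {
            int want = Math.min(buf.length, maxBytes - out.size());
            int got = in.read(buf, 0, want);
            if (got < 0) {
                break; // end of stream before the limit was reached
            }
            out.write(buf, 0, got);
        }
        return out.toByteArray();
    }
}
```

Bytes already buffered by the operating system or the server may still cross the wire, so this limits the download rather than cutting it off exactly.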
 
Roedy Green

Ok, then the app would have all filters, then using a config file
(XML?) you would configure the filters and which ones were live.

Given this is a tool for Java programmers, you might just write a
piece of Java code that lists the filters you want in the order you
want. I would want to avoid Class.forName, to give native code
optimisers the best possible shot.
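
One way the "list the filters in Java code" idea could look. MailFilter, the two placeholder filters, and FilterConfig are all hypothetical names; the point is only that the filters are constructed directly, in order, with no reflection.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical filter contract, for illustration only.
interface MailFilter {
    double probability(String rawMessage);
}

// Placeholder filters; real ones would inspect headers properly.
class BlacklistFilter implements MailFilter {
    public double probability(String rawMessage) {
        return rawMessage.contains("spammer.example") ? 1.0 : 0.0;
    }
}

class FriendsFilter implements MailFilter {
    public double probability(String rawMessage) {
        return rawMessage.contains("friend@example.org") ? 0.0 : 0.5;
    }
}

class FilterConfig {
    // Filters run in the order listed here. Editing this list and
    // recompiling replaces the config file.
    static final List<MailFilter> ACTIVE = Collections.unmodifiableList(
        Arrays.<MailFilter>asList(new FriendsFilter(), new BlacklistFilter()));
}
```

The trade-off is that changing the live filters means recompiling, which is acceptable if the users are Java programmers anyway.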
 
Roedy Green

P.S. Of course I can't compete with the 5GB you received, but this is a
competition I would rather not be in.

The ISP has to pay for the bandwidth of all this crud, at about 50K
each. It has to be stopped even before it arrives.
That's why I have for now a semi secret email account visible to
humans but not to most robots on my site.
 
GaryM

You don't know
really when the i/o goes on unless you monitor traffic to figure out
how it works. It may thus be harder in JavaMail to avoid downloading
more than you need.

The Message will only get header info unless you ask for the body.
Been there, sniffed that, if you get my meaning.
 
Roedy Green

The Message will only get header info unless you ask for the body.
Been there, sniffed that, if you get my meaning.

There are intermediate levels between just header and whole thing that
are useful in spam detection, namely:

Just the first X characters of the body.

Just the body without the attachments.
 
GaryM

There are intermediate levels between just header and whole thing
that are useful in spam detection, namely:

just the first X characters of the body.

I think that if I am at the body level doing a lexical analysis,
then the whole body will always perform better than anything less.
Don't forget the numero uno spam feature, to wit, the unsubscribe
message and all of its permutations, which is always near the end.

Just the body without the attachments.

In a multipart MIME message there is no distinction between 'body' and
'attachment'. These are all parts, and you can make an intelligent guess
by looking at the MIME type and its disposition. Sadly there are no
guarantees where any will occur, so you must parse them all or parse
until you meet an assumption (like the first text/* part is the one I
want to analyze).
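
The "first text/* part" assumption can be sketched as a depth-first walk over the parts. MimeNode below is a simplified stand-in invented for this example, not the JavaMail API; a real implementation would walk JavaMail's Part and Multipart objects instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for a parsed MIME part (hypothetical, not JavaMail).
class MimeNode {
    final String contentType;   // e.g. "text/plain", "multipart/mixed"
    final String disposition;   // "inline", "attachment", or null
    final String text;          // decoded text, or null for non-text parts
    final List<MimeNode> children = new ArrayList<MimeNode>();

    MimeNode(String contentType, String disposition, String text) {
        this.contentType = contentType;
        this.disposition = disposition;
        this.text = text;
    }

    MimeNode add(MimeNode child) {
        children.add(child);
        return this;
    }
}

class TextPartFinder {
    // Depth-first search for the first text/* part that is not marked
    // as an attachment. Returns null if no such part exists.
    static MimeNode firstTextPart(MimeNode node) {
        String type = node.contentType.toLowerCase(Locale.ROOT);
        boolean attachment = "attachment".equalsIgnoreCase(node.disposition);
        if (type.startsWith("text/") && !attachment) {
            return node;
        }
        for (MimeNode child : node.children) {
            MimeNode hit = firstTextPart(child);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```

Because the walk recurses into every child, nested multiparts are covered by the same method.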

IMHO, body checks are definitely the most expensive and you can obtain
excellent performance without them, but they are useful as a last
resort. If not using JavaMail, then you can just read the stream and
abort when you've seen enough, but you will need to decipher MIME
boundaries on the fly and decode base64, quoted-printable etc.

Some other pointers with mail body checks are:

Embedded RFC822 messages, which, if they are multipart, require
recursion to parse. Here JavaMail is cumbersome but can be wrapped
easily to do the job.

Strip HTML or not? You may have seen Her<fhjhdfjhd>bal remedy, which
is rendered as Herbal in all mail clients that render HTML. In general
it is best to strip HTML, but sometimes the URLs in the body are more
incriminating than the domains in the headers.
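
A rough sketch of both halves of that trade-off: collect the URLs first, then strip the tags. HtmlScrub and its two regular expressions are assumptions made for this example. The tag pattern is deliberately crude and will not cope with comments or a '>' inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlScrub {
    // Anything between '<' and the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Crude URL pattern: runs until whitespace, a quote, or an angle bracket.
    private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Removes tags so that words split by bogus tags are rejoined.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Collects URLs from the raw HTML, before any stripping.
    static List<String> urls(String html) {
        List<String> found = new ArrayList<String>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Calling urls on the raw body and stripTags afterwards keeps the link targets available to a URL filter while the word filters see the rejoined text.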

Just a few thoughts,

Gary
 
brougham3

Roedy Green said:
* What is the probability the given message is spam?
* 0.0 = definitely good.
* 0.5 = 50-50 odds
* 1.0 = absolutely certainly spam.
* -1 = no opinion.

What's the difference between 50/50 odds and no opinion?
 
Roedy Green

What's the difference between 50/50 odds and no opinion?

Let's say you kept running filters until the average went either
below 0.1 or above 0.9. A 50-50 result adds uncertainty to the
moving average, pulling it away from either end. "No opinion" has no
effect on the average.
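
The difference can be made concrete with a small sketch of that running average. OpinionFilter and RunningVerdict are hypothetical names; the 0.1 and 0.9 cut-offs and the -1 convention come from the posts above.

```java
import java.util.List;

// Hypothetical filter contract: 0.0 to 1.0 is a probability, -1 is "no opinion".
interface OpinionFilter {
    double probability(String rawMessage);
}

class RunningVerdict {
    static final double NO_OPINION = -1.0;

    // Runs filters in order and keeps a running average of their answers.
    // Stops early once the average drops below 0.1 or rises above 0.9.
    // Returns NO_OPINION if no filter expressed a view.
    static double evaluate(String rawMessage, List<OpinionFilter> filters) {
        double sum = 0.0;
        int count = 0;
        for (OpinionFilter f : filters) {
            double p = f.probability(rawMessage);
            if (p < 0.0) {
                continue; // no opinion: the average is left untouched
            }
            sum += p;
            count++;
            double average = sum / count;
            if (average < 0.1 || average > 0.9) {
                return average; // confident enough, skip remaining filters
            }
        }
        return count == 0 ? NO_OPINION : sum / count;
    }
}
```

With answers of 0.8 and -1 the average stays at 0.8, whereas 0.8 and 0.5 pull it down to 0.65.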
 
Wojtek

Here is a first cut at the interface for spam filters:

And if we want the filters to have some sort of storage capability:
public interface SpamDetect
{
    // Called by the framework to give the filter its storage.
    public void setStorage( SpamStorage storage );
}

Where SpamStorage is a concrete class:
- initialized by the framework
- reference kept in a List
- contains a "key" generated by the framework to uniquely identify
this filter (for a separate table?)
- uses an interface to a database layer

The filter can ignore or use the storage as it wishes. The framework
will be responsible for cleanup.
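
One possible shape for such a storage class, as a sketch only. The StorageBackend interface, the in-memory MemoryBackend, and the save/load method names are all assumptions; the post specifies only that SpamStorage is concrete, carries a framework-generated key, and talks to a database layer through an interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical database layer; a real one might sit on JDBC.
interface StorageBackend {
    void put(String table, String name, String value);
    String get(String table, String name);
}

// In-memory stand-in so the sketch runs without a database.
class MemoryBackend implements StorageBackend {
    private final Map<String, Map<String, String>> tables =
        new HashMap<String, Map<String, String>>();

    public void put(String table, String name, String value) {
        Map<String, String> rows = tables.get(table);
        if (rows == null) {
            rows = new HashMap<String, String>();
            tables.put(table, rows);
        }
        rows.put(name, value);
    }

    public String get(String table, String name) {
        Map<String, String> rows = tables.get(table);
        return rows == null ? null : rows.get(name);
    }
}

// Concrete class the framework creates and hands to each filter.
class SpamStorage {
    private final String key;            // framework-generated, unique per filter
    private final StorageBackend backend;

    SpamStorage(String key, StorageBackend backend) {
        this.key = key;
        this.backend = backend;
    }

    void save(String name, String value) {
        backend.put(key, name, value);
    }

    String load(String name) {
        return backend.get(key, name);
    }
}
```

Each filter sees only its own keyed slice, so two filters can use the same value names without colliding.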
 
Roedy Green

The filter can ignore or use the storage as it wishes. The framework
will be responsible for cleanup.

What would be the problem with just persisting to little serialised
files?
 
Wojtek

What would be the problem with just persisting to little serialised
files?

That would mean that each filter that needed storage would have to
manage its own storage IO. Duplication of effort.

That is what frameworks are for. To provide common services to
processes. You would not want each servlet (in a Web app) to have its
own logging code.
 
