New kind of Spam Filter

Wojtek · Sep 23, 2003

The main idea of the program is that it is user-extensible with
arbitrarily complicated Java code. It would be fairly trivial to add
a blacklist ISP filter or a friends/enemies filter.

Each filter could return a probability. The framework could then
tally the probabilities and assign a weight to the email. If the
weight value surpasses some amount (or a filter returns some
significant probability), then the email is SPAM, not the real piggy
stuff (love the SPAM <-> HAM usage).
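
A minimal sketch of the tally idea described above. The class name, the two threshold values and the lambda filters are all assumptions made for illustration; nothing here is taken from the actual program.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical framework class; none of these names come from the thread.
final class SpamTally {
    // Assumed thresholds, chosen only for illustration.
    static final double TOTAL_THRESHOLD = 1.5;   // combined weight that marks an email as spam
    static final double SINGLE_THRESHOLD = 0.95; // one filter this sure is enough on its own

    // Each filter maps the raw message text to a probability in [0.0, 1.0].
    static boolean isSpam(String message, List<ToDoubleFunction<String>> filters) {
        double weight = 0.0;
        for (ToDoubleFunction<String> filter : filters) {
            double p = filter.applyAsDouble(message);
            if (p >= SINGLE_THRESHOLD) {
                return true; // a single filter returned a significant probability
            }
            weight += p; // tally the probabilities into one weight
        }
        return weight > TOTAL_THRESHOLD;
    }

    public static void main(String[] args) {
        // Two toy filters, purely illustrative.
        List<ToDoubleFunction<String>> filters = Arrays.asList(
                m -> m.toLowerCase().contains("unsubscribe") ? 0.8 : 0.1,
                m -> m.toLowerCase().contains("herbal") ? 0.9 : 0.1);
        System.out.println(isSpam("Herbal remedy! Click to unsubscribe.", filters));
        System.out.println(isSpam("Lunch on Friday?", filters));
    }
}
```

A plain sum grows with the number of filters, so the total threshold would have to be tuned to however many filters are live.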

Roedy Green · Sep 23, 2003

So you grab the first X bytes via what? I would think FTP, or can POP3
do this?

I know spamDetective does this. I gather it just starts the read and
aborts. In JavaMail, the reading is transparent. You don't know
really when the i/o goes on unless you monitor traffic to figure out
how it works. It may thus be harder in JavaMail to avoid downloading
more than you need.

Roedy Green · Sep 23, 2003

Ok, then the app would have all filters, then using a config file
(XML?) you would configure the filters and which ones were live.

Given this is a tool for Java programmers, you might just write a
piece of Java code that listed the filters you wanted in the order you
wanted. I would want to avoid Class.forName, to give native code
optimisers the best possible shot.
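
A rough sketch of what "configuration as Java code" might look like. The MailFilter interface, the two filter classes and the example addresses are hypothetical, invented only to show the shape of the idea.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical filter contract; the thread does not show the real interface.
interface MailFilter {
    double spamProbability(String rawMessage);
}

// Illustrative filter: flags mail from an assumed blacklisted domain.
final class BlacklistFilter implements MailFilter {
    public double spamProbability(String rawMessage) {
        return rawMessage.contains("@spammer.example") ? 1.0 : 0.5;
    }
}

// Illustrative filter: trusts mail from an assumed known friend.
final class FriendsFilter implements MailFilter {
    public double spamProbability(String rawMessage) {
        return rawMessage.contains("friend@example.org") ? 0.0 : 0.5;
    }
}

final class FilterChain {
    // The "configuration": edit this list to enable, disable or reorder filters.
    // Plain constructor calls keep every class visible to the compiler,
    // with no Class.forName lookup at run time.
    static List<MailFilter> activeFilters() {
        return Collections.unmodifiableList(Arrays.<MailFilter>asList(
                new FriendsFilter(),
                new BlacklistFilter()));
    }
}
```

The trade-off is that changing the filter list means recompiling, which is presumably acceptable for a tool aimed at Java programmers.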

Roedy Green · Sep 23, 2003

P.S. Of course I can't compete with the 5GB you received, but this is a
competition I would rather not be in.

The ISP has to pay for the bandwidth of all this crud, at about 50K
each. It has to be stopped even before it arrives.
That's why, for now, I have a semi-secret email account on my site,
visible to humans but not to most robots.

GaryM · Sep 23, 2003

You don't know
really when the i/o goes on unless you monitor traffic to figure out
how it works. It may thus be harder in JavaMail to avoid downloading
more than you need.

The Message will only get header info unless you ask for the body.
Been there, sniffed that, if you get my meaning.

Roedy Green · Sep 23, 2003

The Message will only get header info unless you ask for the body.
Been there, sniffed that, if you get my meaning.

There are intermediate levels between just header and whole thing that
are useful in spam detection, namely:

just the first X characters of the body.

Just the body without the attachments.

GaryM · Sep 23, 2003

There are intermediate levels between just header and whole thing
that are useful in spam detection, namely:

just the first X characters of the body.

I think that if I am at the body level doing a lexical analysis,
then the whole body will always perform better than anything less.
Don't forget that the numero uno spam feature, to wit, the Unsubscribe
message and all of its permutations, is always near the end.

Just the body without the attachments.

In a multipart MIME message there is no distinction between 'body' and
'attachment'. These are all parts, and you can make an intelligent guess
by looking at each part's MIME type and its disposition. Sadly there are no
guarantees where any will occur, so you must parse them all or parse
until you meet an assumption (like "the first text/* part is the one I
want to analyze").

IMHO, body checks are definitely the most expensive and you can obtain
excellent performance without them, but they are useful as a last
resort. If not using JavaMail, then you can just read the stream and
abort when you've seen enough, but you will need to decipher MIME
boundaries on the fly and decode Base64, quoted-printable etc.
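
To give a feel for the decoding work involved, here is a small sketch covering only the two transfer encodings named above, not MIME boundary handling. java.util.Base64 is a standard class in later Java versions (8 and up); the quoted-printable decoder is hand-rolled and simplified, and the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.util.Base64;

// Hypothetical helper; names are illustrative only.
final class TransferDecoding {

    // The MIME decoder ignores line breaks inside the encoded text.
    static byte[] decodeBase64(String encoded) {
        return Base64.getMimeDecoder().decode(encoded);
    }

    // Simplified quoted-printable decoder: handles "=XX" escapes and soft line breaks.
    static byte[] decodeQuotedPrintable(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '=' && encoded.startsWith("\r\n", i + 1)) {
                i += 3;                       // soft line break "=CRLF": drop it
            } else if (c == '=' && encoded.startsWith("\n", i + 1)) {
                i += 2;                       // tolerate a bare LF soft break
            } else if (c == '=' && i + 2 < encoded.length()
                    && Character.digit(encoded.charAt(i + 1), 16) >= 0
                    && Character.digit(encoded.charAt(i + 2), 16) >= 0) {
                // "=XX" is one byte written as two hex digits
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 3;
            } else {
                out.write(c);                 // literal character (input is expected to be ASCII)
                i++;
            }
        }
        return out.toByteArray();
    }
}
```

The decoded bytes still have to be turned into text using the charset declared in the part's Content-Type header, which this sketch leaves to the caller.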

Some other pointers with mail body checks are:

Embedded RFC822 messages, which require recursion to parse if they are
themselves multipart. Here JavaMail is cumbersome but can be wrapped
easily to do the job.

Strip HTML or not? You may have seen Her<fhjhdfjhd>bal remedy, which
is rendered as Herbal in all mail clients that render HTML. In general
it is best to strip HTML, but sometimes the URLs in the body are more
incriminating than the domains in the headers.
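
A naive sketch of both halves of that trade-off: pull the link targets out first, then strip the tags. The class and method names are invented for illustration, and regular expressions like these are only a rough approximation of real HTML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class HtmlStripper {
    // Anything between '<' and '>' is treated as a tag (a simplification).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Double-quoted href attribute values only (also a simplification).
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Removes tags so that "Her<fhjhdfjhd>bal" reads as "Herbal" again.
    static String stripTags(String body) {
        return TAG.matcher(body).replaceAll("");
    }

    // Collects link targets before they are lost to stripping.
    static List<String> extractLinks(String body) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(body);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

Calling extractLinks before stripTags keeps the URLs available to a separate filter while the text filters see what the recipient would actually read.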

Just a few thoughts,

Gary

brougham3 · Sep 26, 2003

Roedy Green said:
* What is the probability the given message is spam?
* 0.0 = definitely good.
* 0.5 = 50-50 odds
* 1.0 = absolutely certainly spam.
* -1 = no opinion.

What's the difference between 50/50 odds and no opinion?

Roedy Green · Sep 26, 2003

What's the difference between 50/50 odds and no opinion?

Let's say you kept running filters until the average went either
below 0.1 or above 0.9. A 50-50 result adds uncertainty to the
moving average, pulling it away from either end. "No opinion" has no
effect on the average.
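
The difference shows up as soon as you compute the average. A small sketch, assuming the -1 "no opinion" convention quoted above; the class and method names are made up for this example.

```java
// Hypothetical helper; names are illustrative only.
final class OpinionAverage {
    static final double NO_OPINION = -1.0;

    // Mean of all real opinions; NO_OPINION if every filter abstained.
    static double average(double... scores) {
        double sum = 0.0;
        int count = 0;
        for (double s : scores) {
            if (s == NO_OPINION) {
                continue; // abstaining filters are left out entirely
            }
            sum += s;
            count++;
        }
        return count == 0 ? NO_OPINION : sum / count;
    }

    public static void main(String[] args) {
        // Two confident filters plus one that abstains: the average stays at 0.95.
        System.out.println(average(0.95, 0.95, NO_OPINION));
        // Same two filters plus a 50-50 vote: the average is pulled down to about 0.8.
        System.out.println(average(0.95, 0.95, 0.5));
    }
}
```

With a stopping rule of "above 0.9", the first message is already classified, while the second needs more filters to run.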

Wojtek · Sep 27, 2003

Here is a first cut at the interface for spam filters:

And if we want the filters to have some sort of storage capability:

public interface SpamDetect
{
    // Called by the framework to hand the filter its storage.
    public void setStorage( SpamStorage storage );
}

Where SpamStorage is a concrete class:
- initialized by the framework
- reference kept in a List
- contains a "key" generated by the framework to uniquely identify
this filter (for a separate table?)
- uses an interface to a database layer

The filter can ignore or use the storage as it wishes. The framework
will be responsible for cleanup.

Roedy Green · Sep 27, 2003

The filter can ignore or use the storage as it wishes. The framework
will be responsible for cleanup.

What would be the problem with just persisting to little serialised
files?

Wojtek · Sep 27, 2003

What would be the problem with just persisting to little serialised
files?

That would mean that each filter that needed storage would have to
manage its own storage IO. Duplication of effort.

That is what frameworks are for. To provide common services to
processes. You would not want each servlet (in a Web app) to have its
own logging code.
