speed of spambayes?

Paul Rubin

Can someone using spambayes tell me about how fast it runs? I'm using
Spamassassin right now but it takes around 1.5 seconds to process a
message on a 2 GHz Athlon. I believe part of that time is spent doing
network lookups to check the source addresses against various spam
blacklists. I want to crunch through several gigabytes of spam
folders to see if any legitimate messages got trapped, so I need a fast
classifier with a low false negative rate (it's ok if the false
positive rate isn't so low, since almost all the messages in these
folders are already spam).

Thanks.
 
John J. Lee

Paul Rubin said:
> Can someone using spambayes tell me about how fast it runs? I'm using

Damned slow if you're on dialup and IMAP with SSL. At least, it used
to be, but that was a while back...

> Spamassassin right now but it takes around 1.5 seconds to process a
> message on a 2 GHz Athlon. I believe part of that time is spent doing
> network lookups to check the source addresses against various spam
> blacklists. I want to crunch through several gigabytes of spam
> folders to see if any legitimate messages got trapped, so need a fast

Well, that's only a couple of days even if it's mostly CPU :)

IIRC, I don't think spambayes can be much slower than that -- I could
tell roughly how much was CPU and how much network when I used it,
because I could watch the lights blinking on my modem.

> classifier with a low false negative rate (it's ok if the false
> positive rate isn't so low, since almost all the messages in these
> folders are already spam).

You might want to tune it a bit first, then.


John
 
Paul Rubin

> Well, that's only a couple of days even if it's mostly CPU :)

No, it's much more than a few days. My spamassassin-based classifier
seems to process my mail files at about 20 MB per hour (maybe less),
so 50 hours per GB (maybe more). I have about 5 GB of spam that I
want to process, so that's at least 1.5 weeks of nonstop despamming.
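The arithmetic behind that estimate can be sketched roughly as follows. This only restates the figures above (20 MB per hour, 5 GB of mail); the function name and the use of decimal gigabytes (1000 MB per GB, which matches "50 hours per GB") are assumptions made for illustration.

```python
# Back-of-the-envelope estimate using the figures quoted in the post.
MB_PER_HOUR = 20   # observed throughput ("maybe less", so this is optimistic)
MB_PER_GB = 1000   # assumption: decimal gigabytes, matching "50 hours per GB"
CORPUS_GB = 5      # amount of spam to process


def processing_hours(corpus_gb, mb_per_hour, mb_per_gb=MB_PER_GB):
    """Hours needed to process corpus_gb at a steady mb_per_hour."""
    return corpus_gb * mb_per_gb / mb_per_hour


hours = processing_hours(CORPUS_GB, MB_PER_HOUR)
days = hours / 24
weeks = days / 7
print(f"{hours:.0f} hours = {days:.1f} days = {weeks:.1f} weeks")
# prints: 250 hours = 10.4 days = 1.5 weeks
```

Since 20 MB per hour is the optimistic figure, the real run time would be at least this long.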

> You might want to tune it a bit first, then.

Hmm, good point, spam filters are usually set up the other way.

Thanks.
 
Emile van Sebille

Paul Rubin:
> Can someone using spambayes tell me about how fast it runs?

IIRC, Tim Peters did some specific measurements during spambayes
development.

.... aah - here it is: (from message id
(e-mail address removed))
in http://mail.python.org/pipermail/python-dev/2002-August.txt.gz

[Eric S. Raymond]
I'm in the process of speed-tuning this now. I intend for it to be
blazingly fast, usable for sites that process 100K mails a day, and I
think I know how to do that. This is not a natural application for
Python :).

[Tim Peters]
I'm not sure about that. The all-Python version I checked in added 20,000
Python-Dev messages to the database in 2 wall-clock minutes. The time for
computing the statistics, and for scoring, is simply trivial (this wouldn't
be true of a "normal" Bayesian classifier (NBC), but Graham skips most of
the work an NBC does, in particular favoring fast classification time over
fast model-update time).

This was 15 months ago, and I'm not sure how that relates to GBs per
howlongs, but it's something to start with.
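As a very rough conversion of those numbers into "GBs per howlongs", here is a small sketch. The 20,000 messages and 2 minutes come from the quoted message; the 5 KB average message size is purely a hypothetical value picked for illustration, and the quoted figure is for adding messages to the database rather than scoring, so treat the result as an order-of-magnitude guess at best.

```python
# Convert "20,000 messages in 2 wall-clock minutes" into a data rate.
MESSAGES = 20_000
WALL_CLOCK_SECONDS = 2 * 60
ASSUMED_AVG_MESSAGE_KB = 5  # hypothetical average size, not from the thread


def messages_per_second(messages, seconds):
    """Average message throughput."""
    return messages / seconds


def gb_per_hour(msgs_per_sec, avg_message_kb):
    """Data rate in decimal GB per hour for a given average message size."""
    kb_per_hour = msgs_per_sec * avg_message_kb * 3600
    return kb_per_hour / 1_000_000


rate = messages_per_second(MESSAGES, WALL_CLOCK_SECONDS)
print(f"{rate:.0f} messages/second")
print(f"about {gb_per_hour(rate, ASSUMED_AVG_MESSAGE_KB):.1f} GB/hour "
      f"at {ASSUMED_AVG_MESSAGE_KB} KB per message")
```

With a different average message size the GB/hour figure scales linearly, so it is worth measuring that on the actual mail folders before relying on it.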
 
Aahz

> Can someone using spambayes tell me about how fast it runs? I'm using
> Spamassassin right now but it takes around 1.5 seconds to process a
> message on a 2 GHz Athlon. I believe part of that time is spent doing
> network lookups to check the source addresses against various spam
> blacklists. I want to crunch through several gigabytes of spam
> folders to see if any legitimate messages got trapped, so need a fast
> classifier with a low false negative rate (it's ok if the false
> positive rate isn't so low, since almost all the messages in these
> folders are already spam).

Maybe I shouldn't tell you this so you're forced to use a Python app,
but you can either disable the network checks or run spamd. ;-)
 
Paul Rubin

> Maybe I shouldn't tell you this so you're forced to use a Python app,
> but you can either disable the network checks or run spamd. ;-)

I do use spamc/spamd, but spamd is what's doing the network checks.
If I start my own spamd instance I can disable them, so I'll probably
do that.
 
Aahz

> I do use spamc/spamd, but spamd is what's doing the network checks.
> If I start my own spamd instance I can disable them, so I'll probably
> do that.

spamd is supposed to cache the network checks; perhaps it's not properly
configured.
 
Paul Rubin

> spamd is supposed to cache the network checks; perhaps it's not properly
> configured.

I'm running tens of thousands of messages through it one after another.
It could really be that many distinct addresses.
 
Aahz

> I'm running tens of thousands of messages through it one after another.
> It could really be that many distinct addresses.

Maybe. OTOH, my ISP (Panix) uses SA for several thousand customers, so
I'm sure they can't be taking 1.5 seconds per message. Either they're
running much heavier hardware than you (unlikely, IMO) or they've got
some config difference that makes it work.
 
Paul Rubin

> Maybe. OTOH, my ISP (Panix) uses SA for several thousand customers, so
> I'm sure they can't be taking 1.5 seconds per message. Either they're
> running much heavier hardware than you (unlikely, IMO) or they've got
> some config difference that makes it work.

I wonder if it's possible to import the spam address blacklists
(update the local copy a few times a day) instead of doing constant
network hits on them. That would speed things up a lot.
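
To make the idea concrete, here is a minimal sketch of a local blacklist copy that is refreshed a few times a day and queried in memory. Everything in it is hypothetical: the `LocalBlacklist` class, the `fetch` callable, and the demo addresses are made up for illustration, and it says nothing about how (or whether) any real blacklist provider lets you download its list.

```python
import time


class LocalBlacklist:
    """In-memory copy of a blacklist, reloaded when it gets stale."""

    def __init__(self, fetch, refresh_seconds=6 * 3600, clock=time.monotonic):
        self._fetch = fetch                  # callable returning an iterable of addresses
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._addresses = frozenset()
        self._loaded_at = None               # None means "never loaded"

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._refresh_seconds:
            self._addresses = frozenset(self._fetch())
            self._loaded_at = now

    def is_listed(self, address):
        """Local set lookup; touches the network only when a refresh is due."""
        self._refresh_if_stale()
        return address in self._addresses


def fetch_demo_list():
    # Stand-in for downloading the real list; these are documentation-only IPs.
    return ["192.0.2.1", "198.51.100.7"]


blacklist = LocalBlacklist(fetch_demo_list)
print(blacklist.is_listed("192.0.2.1"))    # True
print(blacklist.is_listed("203.0.113.9"))  # False
```

With a six-hour refresh interval the list is fetched about four times a day, and every per-message check in between is a plain set membership test.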
 
