mbox despamming script

Paul Rubin · Nov 27, 2003

I was surprised there was no obvious way with spamassassin (maybe I
shoulda looked at spambayes) to split an existing mbox file into its
spam and non-spam messages. So I wrote one. It's pretty slow, taking
around 1.5 seconds per message on a 2 GHz Athlon, making me wonder how
serious ISPs getting thousands of incoming messages per hour can run
anything like spamassassin on all of them. But for my purposes it's ok.
Comments and improvements are welcome.
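
To put the 1.5 seconds per message figure in perspective, here is a rough back-of-the-envelope sketch. It assumes messages are filtered strictly one at a time by a single process, which is how the script below works but not necessarily how an ISP would deploy a filter; the function names are made up for illustration.

```python
SECONDS_PER_MESSAGE = 1.5  # figure quoted in the post above

def messages_per_hour(seconds_per_message):
    """Messages one serial filter process can handle in an hour."""
    return 3600.0 / seconds_per_message

def hours_for(n_messages, seconds_per_message):
    """Wall-clock hours needed to filter n_messages one after another."""
    return n_messages * seconds_per_message / 3600.0

if __name__ == "__main__":
    print(messages_per_hour(SECONDS_PER_MESSAGE))   # 2400.0
    print(hours_for(10000, SECONDS_PER_MESSAGE))    # a bit over 4 hours
```

So a single serial process tops out at about 2400 messages per hour at that speed, and a 10,000-message mbox takes a little over four hours.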

================================================================

#!/usr/bin/python

# Spam filter for mbox files. Reads mailfile and makes two new
# files, mailfile.spam and mailfile.ham, containing the spam and non-spam
# messages from mailfile as determined by piping through spamc.

# Copyright 2003 Paul Rubin <http://www.paulrubin.com>
# Copying permission: GNU General Public License ver. 2, <http://www.gnu.org>

import mailbox,os,sys
from time import time

def mktemp():
    import sha,os,time
    d = sha.new("spam:%s,%s"%(os.getpid(),time.time())).hexdigest()
    return "spam%s.temp"% d[:10]

tempfilename = mktemp()

def main():
    print sys.argv
    if len(sys.argv) > 1:
        filename = sys.argv[1]
    else:
        print "Usage: spam.py mboxfile"
        sys.exit(1)    # no mbox file given, so there is nothing to do

    print "marking up", filename
    mailfile = open(filename, 'r')
    ham = open(filename + ".ham", 'w')
    spam = open(filename + ".spam", 'w')

    mbox = mailbox.UnixMailbox(mailfile)
    i = 0

    while 1:
        i += 1
        m1 = mailfile.tell()    # offset of this message's "From " line
        msg = mbox.next()
        if not msg: break
        body = msg.fp.read()
        envelope = env_header(mailfile, m1)
        print "%5d"%i, m1, mailfile.tell(), msg.startofbody, len(body),
        is_spam, txt = spam_filter (envelope, msg, body)
        print ['HAM','SPAM'][is_spam]

        if is_spam:
            spam.write(txt)
        else:
            ham.write(txt)

def spam_filter(envelope, msg, body):
    # reassemble the message and pipe it through spamc into the temp file
    txt = envelope + ''.join(msg.headers) + '\n' + body
    out = os.popen("spamc > %s"% tempfilename, "w")
    out.write(txt)
    out.close()

    # the length of X-Spam-Level in spamc's output is taken as the score
    t = mailbox.UnixMailbox(open(tempfilename))
    spam_level = len(t.next().get('X-Spam-Level', ''))
    txt = open(tempfilename).read()
    return (spam_level >= 5, txt)

def env_header(fp, pos):
    t = fp.tell()
    fp.seek(pos)
    e = fp.readline()
    fp.seek(t)
    return e

try:
    t=time()
    main()
    dt = time()-t
    print "elapsed: %d min %d sec"% divmod(int(dt), 60)
finally:
    # the temp file only exists once spamc has run at least once
    if os.path.exists(tempfilename):
        os.unlink(tempfilename)
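
The script above is Python 2. For anyone trying the same idea on Python 3, here is a minimal sketch of just the classification step, i.e. reading X-Spam-Level out of a message that has already been through spamc. It assumes the header carries one `*` per point, which is what the script's `len(...) >= 5` test implies; `spam_level`, `is_spam` and `SPAM_THRESHOLD` are names invented for this sketch, and it counts asterisks rather than taking the raw header length.

```python
from email import message_from_string

SPAM_THRESHOLD = 5  # same cutoff as `spam_level >= 5` in the script above

def spam_level(raw_message):
    """Count the asterisks in X-Spam-Level; 0 if the header is missing."""
    msg = message_from_string(raw_message)
    return msg.get('X-Spam-Level', '').count('*')

def is_spam(raw_message, threshold=SPAM_THRESHOLD):
    """True if the message scored at or above the threshold."""
    return spam_level(raw_message) >= threshold

if __name__ == "__main__":
    sample = "Subject: cheap pills\nX-Spam-Level: *******\n\nbuy now\n"
    print(spam_level(sample), is_spam(sample))
```

Counting `*` characters instead of taking the header length means stray whitespace in the header does not inflate the score.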

Michael Hudson · Nov 27, 2003

Paul Rubin said:
I was surprised there was no obvious way with spamassassin (maybe I
shoulda looked at spambayes) to split an existing mbox file into its
spam and non-spam messages. So I wrote one. It's pretty slow, taking
around 1.5 seconds per message on a 2 GHz Athlon, making me wonder how
serious ISPs getting thousands of incoming messages per hour can run
anything like spamassassin on all of them. But for my purposes it's ok.
Comments and improvements are welcome.

It's my experience that mailbox is pretty slow at reading mbox files.
I have memories of speeding up some mail-statistics gathering stuff by
a large amount by implementing my own mbox "parser" (basically
s.find('\n\nFrom ') or similar, I forget). I'm not sure I'd like to
use this approach on something less forgiving than stats, though.
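
A minimal sketch of that kind of hand-rolled splitter, in Python 3, might look like the following. It is a guess at the approach from the one-line description above, not the code actually used; `split_mbox` is a made-up name, and it assumes the whole mbox fits in memory as one string.

```python
def split_mbox(text):
    """Split mbox text into messages at each blank line followed by 'From '."""
    messages = []
    start = 0
    while True:
        # Find the next envelope line that directly follows a blank line.
        cut = text.find('\n\nFrom ', start)
        if cut == -1:
            break
        messages.append(text[start:cut + 1])  # keep the message's last newline
        start = cut + 2                       # skip the blank separator line
    if text[start:].strip():
        messages.append(text[start:])         # final message in the file
    return messages

if __name__ == "__main__":
    sample = ("From a@example.com Thu Nov 27 00:00:00 2003\n"
              "Subject: one\n\nfirst body\n\n"
              "From b@example.com Thu Nov 27 00:00:01 2003\n"
              "Subject: two\n\nsecond body\n")
    for m in split_mbox(sample):
        print(repr(m[:20]))
```

The catch is that a body paragraph that happens to start with an unquoted "From " right after a blank line gets treated as a new message, which is fine for statistics and not so fine when the output is going to be your mail.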

Cheers,
mwh
