Reading Huge UnixMailbox Files

Brandon McGinty · Apr 26, 2011

List,
I'm trying to import hundreds of thousands of e-mail messages into a
database with Python.
However, some of these mailboxes are so large that they cause
errors when read with the standard mailbox module.
I created a buffered reader that reads chunks of the mailbox, splits
them using the re.split function with a compiled regexp, and imports
each piece as a message.
Based on timings, the regular expression work appears to be the
bottleneck.
I'm wondering if there is a faster way to do this, or some other method
that you all would recommend.

Brandon McGinty

Nobody · Apr 26, 2011

Consider using awk. In my experience, high-level languages tend to have
slower regex libraries than simple tools such as sed and awk.

E.g. the following script reads a mailbox on stdin and writes a separate
file for each message:

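#!/usr/bin/awk -f
# Split an mbox on stdin into one file per message:
# 000000.mbox, 000001.mbox, ...
BEGIN {
    num = 0;
    ofile = "";
}

# A line beginning with "From " starts a new message.
/^From / {
    if (ofile != "") close(ofile);
    ofile = sprintf("%06d.mbox", num);
    num++;
}

# Copy each line to the current output file; anything before the
# first "From " line is skipped, since no file is open yet.
ofile != "" {
    print > ofile;
}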

It would be simple to modify it to start a new file after a given number
of messages or a given number of lines.

You can then read the resulting smaller mailboxes using your Python script.
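
For that last step, here is a minimal sketch in Python 3 of reading the per-message files back in. It assumes the files were written as NNNNNN.mbox into a single directory, as the awk script above does; the name iter_split_messages and the directory argument are illustrative only and not from the original script.

```python
import email
import glob
import os


def iter_split_messages(directory):
    """Yield (path, Message) for each *.mbox file in name order.

    Hypothetical helper: assumes one message per file, as written by
    the awk script.
    """
    # Zero-padded names sort correctly as plain strings.
    for path in sorted(glob.glob(os.path.join(directory, "*.mbox"))):
        with open(path, "rb") as fh:
            # Each file starts with its "From " separator line; skip it
            # so the parser sees the headers first.
            first = fh.readline()
            if not first.startswith(b"From "):
                fh.seek(0)
            yield path, email.message_from_binary_file(fh)
```

Only one message is held in memory at a time, so the size of the original mailbox no longer matters at this stage.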

Dan Stromberg · Apr 26, 2011

For the archive: This assumes traditional mbox. A SysV-ish sendmail,
for example, may not like it.

Nobody · Apr 27, 2011

sendmail itself doesn't deal with mailboxes or spool files; that task is
left to the local delivery agent (e.g. mail.local or procmail).

To clarify: the awk script assumes that any line beginning with
"From " is the start of a message; any matching lines in the message body
must be escaped. sendmail will do this if the mailer has the "E" flag
(F=...E...).

If lines beginning with "From " are only escaped when preceded by a blank
line, you need to maintain a flag which is set when the current line is
the first line in the file or is preceded by a blank line, and cleared
otherwise; a "From " line only starts a new message while the flag is
set. This is the behaviour of sendmail's mail.local, and of procmail
when invoked with the -Y flag (this is the default when sendmail is
configured with FEATURE(local_procmail)) or when no Content-Length header
is present.
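
The flag logic can be sketched in Python 3 as follows. This is a rough illustration, not a tested replacement for the mailbox module; the name split_mbox is made up for the example. It uses a plain startswith() test instead of a regular expression, which may help with the bottleneck described in the original post, though I have not measured it.

```python
def split_mbox(lines):
    """Yield each message as a list of lines (bytes).

    A line starting with b"From " only counts as a separator when it is
    the first line of the input or directly follows a blank line.
    Anything before the first separator is yielded as its own chunk.
    """
    current = []
    prev_blank = True  # the first line counts as "preceded by a blank line"
    for line in lines:
        if prev_blank and line.startswith(b"From "):
            if current:
                yield current
            current = []
        current.append(line)
        prev_blank = line in (b"\n", b"\r\n")
    if current:
        yield current
```

A file opened in binary mode can be passed in directly, e.g. `for msg_lines in split_mbox(open(path, "rb")): ...`, so the mailbox is read one line at a time rather than loaded whole.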

If lines beginning with "From " aren't escaped (relying upon a
Content-Length header), you need to find some other approach (which
probably won't involve traditional line-oriented tools). You also need to
be really careful when processing such files.
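
As a rough sketch of such an approach in Python 3: read the headers of each message, take the body length from Content-Length, and read exactly that many bytes before looking for the next "From " line. The function name is made up for the example, and it assumes every message carries a correct Content-Length header and that the file is opened in binary mode. It raises an error rather than guessing when those assumptions fail.

```python
def iter_content_length_messages(fh):
    """Yield (headers, body) as bytes from a binary file object."""
    while True:
        line = fh.readline()
        if not line:
            return  # end of file
        if line in (b"\n", b"\r\n"):
            continue  # blank line between messages
        if not line.startswith(b"From "):
            raise ValueError("expected 'From ' separator, got %r" % line[:40])
        header_lines = []
        length = None
        while True:
            line = fh.readline()
            if not line or line in (b"\n", b"\r\n"):
                break  # end of headers (or truncated file)
            header_lines.append(line)
            if line[:1] in (b" ", b"\t"):
                continue  # folded continuation of the previous header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("message without Content-Length header")
        # The body is taken by byte count, so unescaped "From " lines
        # inside it are never examined.
        body = fh.read(length)
        if len(body) != length:
            raise ValueError("truncated message body")
        yield b"".join(header_lines), body
```

A wrong Content-Length value shows up as a ValueError at the next message boundary instead of silently mis-splitting the rest of the file.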
