Reading Huge UnixMailbox Files

Brandon McGinty

List,
I'm trying to import hundreds of thousands of e-mail messages into a
database with Python.
However, some of these mailboxes are so large that they cause errors
when read with the standard mailbox module.
I created a buffered reader that reads chunks of the mailbox, splits
them using the re.split function with a compiled regexp, and imports
each chunk as a message.
Based on timings, the regular expression work appears to be the
bottleneck.
I'm wondering if there is a faster way to do this, or some other method
that you all would recommend.

Brandon McGinty
 
Nobody

Consider using awk. In my experience, high-level languages tend to have
slower regex libraries than simple tools such as sed and awk.

E.g. the following script reads a mailbox on stdin and writes a separate
file for each message:

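#!/usr/bin/awk -f
# Split a mailbox read from stdin into one file per message.
BEGIN {
    num = 0;
    ofile = "";
}

# A line beginning with "From " starts a new message.
/^From / {
    if (ofile != "") close(ofile);
    ofile = sprintf("%06d.mbox", num);
    num++;
}

# Copy the line to the current output file. Lines seen before the
# first "From " line are skipped, as no output file is open yet.
ofile != "" {
    print > ofile;
}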

It would be simple to modify it to start a new file after a given number
of messages or a given number of lines.
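For anyone who would rather stay in Python, here is a rough sketch of the same idea with a cap on messages per output file. It scans line by line with a plain prefix test instead of a regex. The function name `split_mbox` and the parameters `out_dir` and `per_file` are made up for illustration, and like the awk script it assumes every body line beginning with "From " has been escaped.

```python
import os


def split_mbox(src_path, out_dir, per_file=1000):
    """Split src_path into files holding at most per_file messages each.

    Assumes every line starting with b"From " begins a new message.
    Returns the list of output paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    count = 0  # messages written to the current output file
    try:
        with open(src_path, "rb") as src:
            for line in src:
                if line.startswith(b"From "):
                    # Open a new output file for the first message and
                    # whenever the current file is full.
                    if out is None or count >= per_file:
                        if out is not None:
                            out.close()
                        path = os.path.join(out_dir, "%06d.mbox" % len(paths))
                        paths.append(path)
                        out = open(path, "wb")
                        count = 0
                    count += 1
                # Lines before the first "From " line are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

Reading in binary mode avoids decoding errors on messages with unusual character sets; the file object is iterated lazily, so memory use stays roughly proportional to the longest line.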

You can then read the resulting smaller mailboxes using your Python script.
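As a minimal sketch of that last step, assuming the split files are small enough for the standard mailbox module to handle, each one can be opened with `mailbox.mbox` and iterated. The helper name `subjects` is hypothetical; a real import would hand each message to the database code instead of collecting one header.

```python
import mailbox


def subjects(path):
    """Return the Subject header of every message in a small mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        # Iterating an mbox yields one message object per message.
        return [msg["Subject"] for msg in box]
    finally:
        box.close()
```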
 
Dan Stromberg

For the archive: the awk script above assumes traditional mbox format.
A SysV-ish sendmail, for example, may not like it.
 
Nobody

sendmail itself doesn't deal with mailboxes or spool files; that task is
left to the local delivery agent (e.g. mail.local or procmail).

To clarify: the awk script assumes that any line beginning with
"From " is the start of a message; any matching lines in the message body
must be escaped. sendmail will do this if the mailer has the "E" flag
(F=...E...).

If lines beginning with "From " are only escaped when preceded by a blank
line, you need to maintain a flag which is set when the current line is
the first line in the file or is preceded by a blank line, and clear
otherwise; a "From " line then starts a new message only when the flag is
set. This is the behaviour of sendmail's mail.local, and of procmail
when invoked with the -Y flag (this is the default when sendmail is
configured with FEATURE(local_procmail)) or when no Content-Length header
is present.
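The flag logic can be sketched in Python roughly as follows. This is only an illustration: the generator name `iter_messages` is invented, it works on any iterable of byte lines (such as a file opened in binary mode), and anything that precedes the first "From " line is yielded as if it were a message.

```python
def iter_messages(lines):
    """Yield each message as a list of byte lines.

    A "From " line starts a new message only when it is the first line
    of the input or directly follows a blank line.
    """
    prev_blank = True  # the first line is treated like one after a blank line
    current = []
    for line in lines:
        if prev_blank and line.startswith(b"From "):
            if current:
                yield current
            current = []
        current.append(line)
        # Update the flag for the next line.
        prev_blank = line in (b"\n", b"\r\n")
    if current:
        yield current
```

Only one message is held in memory at a time, so this should cope with very large mailboxes as long as no single message is enormous.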

If lines beginning with "From " aren't escaped (relying upon a
Content-Length header), you need to find some other approach (which
probably won't involve traditional line-oriented tools). You also need to
be really careful when processing such files.
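One possible approach for that case, sketched in Python under the assumption that every message carries a correct Content-Length header giving the body size in bytes: read the headers, then read exactly that many bytes as the body. The name `iter_messages_cl` is hypothetical, folded header lines are not treated specially, and the function deliberately raises an error rather than guessing when the file does not look as expected.

```python
def iter_messages_cl(f):
    """Yield (headers, body) byte strings from a binary file object.

    Relies on Content-Length instead of "From " escaping in the body.
    """
    while True:
        line = f.readline()
        if not line:
            return  # end of file
        if line in (b"\n", b"\r\n"):
            continue  # blank separator between messages
        if not line.startswith(b"From "):
            raise ValueError("expected a 'From ' line, got %r" % line)
        headers = [line]
        length = None
        while True:
            line = f.readline()
            if not line or line in (b"\n", b"\r\n"):
                break  # end of the header block
            headers.append(line)
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip().decode("ascii"))
        if length is None:
            raise ValueError("message without a Content-Length header")
        body = f.read(length)
        if len(body) < length:
            raise ValueError("file ends before the declared body length")
        yield b"".join(headers), body
```

A wrong Content-Length value makes the next read land in the middle of a message, which is why the sketch checks for a "From " line at every message boundary and stops on a mismatch.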
 
