Finding messages in huge mboxes


Bastiaan Welmers

Hi,

I wonder if anyone has ever run into this same mbox issue.

I'm having the following problem:

I need to find messages in huge mbox files (50MB or more).
The following approach is (of course?) not very usable:

import mailbox

fp = open("mbox", "r")
archive = mailbox.UnixMailbox(fp)

i = 0
while i < message_number_needed:
    i += 1
    archive.next()       # skip messages up to the one we need

needed_message = archive.next()

Especially because I often need messages near the end
of the mbox file. So I tried the following (scanning backwards
for "From " lines with readline()):

i = 0
j = 0
while 1:
    i += 1
    fp.seek(-i, 2)       # whence=2: seek i bytes back from end of file
    line = fp.readline()
    if not line:
        break
    if line[:5] == 'From ':
        j += 1
        if j == total_messages - message_number_needed:
            archive.seekp = fp.tell()
            message = archive.next()
            # message found
            break

But this also seems to be slow and CPU-intensive.

Anyone who has a better idea?

Regards,

Bastiaan Welmers
 

Diez B. Roggisch

Bastiaan Welmers said:
Anyone who has a better idea?

AFAIK, MUAs usually use an mbox index file for faster access. The index is
computed once and updated whenever a new message is added. You could
create this index quite easily yourself by looping over the mbox and
pickling a list of tell()'ed positions. If you also store the creation date
of the index and the file size of the mbox file, you should be able to
create a function that updates the index whenever the underlying mbox
has changed. Another approach would be to perform index creation on a
regular basis using cron.

Regards,

Diez
 

Donn Cave

Bastiaan Welmers said:
I need to find messages in huge mbox files (50MB or more). ....
Especially because I often need messages at the end
of the MBOX file.
So I tried the following (scanning messages backwards
on found "From " lines with readline())

readline() is not your friend here. I suggest that
you read large blocks of data, like 8192 bytes for
example, and search them iteratively. Like,
next = block.find('\nFrom ', prev + 1)

This will give you the location of each message in
the current block, so you can split the block up
into a list of messages. (There will be an extra
chunk of data at the beginning of each block, before
the first "From " - recycle that onto the end of the
next block.)

Since file object buffering is at best useless in this
application, I would use posix.open, posix.lseek and
posix.read. Taking this approach, I find that reading
the last 10 messages in a 100 MB folder takes 0.05 sec.
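A rough sketch of that backward scan (the helper name and bookkeeping
are mine, not Donn's; it uses os.open/os.lseek/os.read, the portable
spellings of the posix calls, and recycles the leftover chunk onto the
next block as described above):

import os

def tail_offsets(path, count, blocksize=8192):
    # Collect the byte offsets of the last `count` "From " separator
    # lines by reading the mbox backwards in large blocks.
    fd = os.open(path, os.O_RDONLY)
    size = os.lseek(fd, 0, 2)            # whence=2: end of file
    offsets = []
    fragment = ''                        # data before a block's first boundary
    pos = size
    while pos > 0 and len(offsets) < count:
        start = max(0, pos - blocksize)
        os.lseek(fd, start, 0)
        # The fragment is contiguous with this block, so an index into
        # the combined string is just a file offset minus `start`.
        block = os.read(fd, pos - start) + fragment
        i = block.rfind('\nFrom ')
        while i != -1 and len(offsets) < count:
            offsets.append(start + i + 1)     # +1 skips the newline
            i = block.rfind('\nFrom ', 0, i)
        if start == 0 and block[:5] == 'From ' and len(offsets) < count:
            offsets.append(0)            # very first message in the file
        i = block.find('\nFrom ')
        if i == -1:
            fragment = block             # no boundary: carry the whole block
        else:
            fragment = block[:i]         # recycle the leading chunk
        pos = start
    os.close(fd)
    offsets.reverse()
    return offsets

fp.seek(offsets[0]) followed by a read to EOF then yields the last
`count` raw messages, split on the remaining offsets.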

Donn Cave, (e-mail address removed)
 

David M. Cooke

At some point, Donn Cave said:
readline() is not your friend here. I suggest that
you read large blocks of data, like 8192 bytes for
example, and search them iteratively. Like,
next = block.find('\nFrom ', prev + 1)

Unless, of course, you read '\nFr', then 'om ' in the next block...

I can't think of a simple way around this (except for reading by
lines). Concatenating the last two together means having to keep track of
what you've seen in the last block. Maybe picking off the last line
from the last block (using block.rfind('\n')) and concatenating that
to the beginning of the next.
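A sketch of that line-carrying fix for a forward scan (hypothetical
helper name; carrying the tail's newline keeps a split '\nFrom '
findable, and the search-start bookkeeping, my own addition, avoids
re-reporting separators already seen in the carried tail):

def message_offsets(path, blocksize=8192):
    # Forward block scan of an mbox. The (possibly partial) last line of
    # each block, newline included, is carried onto the front of the
    # next block, so a '\nFrom ' split across the edge is still found.
    offsets = []
    fp = open(path, 'rb')
    if fp.read(5) == 'From ':
        offsets.append(0)       # the first separator has no leading '\n'
    fp.seek(0)
    data = ''
    base = 0                    # file offset of data[0]
    done = 0                    # data[:done] was searched on the last pass
    while 1:
        block = fp.read(blocksize)
        if not block:
            break
        data += block
        # Separators wholly inside the carried tail were already
        # reported; only one spanning the old edge can be new.
        i = data.find('\nFrom ', max(0, done - 5))
        while i != -1:
            offsets.append(base + i + 1)    # +1: offset of the 'F'
            i = data.find('\nFrom ', i + 1)
        nl = data.rfind('\n')
        if nl < 0:
            nl = 0              # no newline at all: carry everything
        base += nl
        data = data[nl:]        # the carried tail, newline included
        done = len(data)
    fp.close()
    return offsets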
 

Donn Cave

Quoth (e-mail address removed) (David M. Cooke):
|> In article <[email protected]>,
|> ...
|>> I need to find messages in huge mbox files (50MB or more).
|> ...
|>> Especially because I often need messages at the end
|>> of the MBOX file.
|>> So I tried the following (scanning messages backwards
|>> on found "From " lines with readline())
|>
|> readline() is not your friend here. I suggest that
|> you read large blocks of data, like 8192 bytes for
|> example, and search them iteratively. Like,
|> next = block.find('\nFrom ', prev + 1)
|
| Unless, of course, you read '\nFr', then 'om ' in the next block...
|
| I can't think of a simple way around this (except for reading by
| lines). Concatenating the last two together means having to keep track of
| what you've seen in the last block. Maybe picking off the last line
| from the last block (using block.rfind('\n')) and concatenating that
| to the beginning of the next.

I'm reading from the end backwards, so the fragment is block[:start].
Append that to the block before it, and each block always will end at
a message boundary. If you start in the middle, you have to deal with
an extra boundary problem. If reading forward from the beginning, it
would be about as simple.

If I have overlooked some obvious problem with this, it wouldn't be
the first time, but I think it's as simple as it could be. The only
inelegance to it is that you have to scan the fragment at least twice
(one extra time for each time it's added to a new block).

Donn Cave, (e-mail address removed)
 

Miki Tebeka

Hello Bastiaan,

I need to find messages in huge mbox files (50MB or more).
...
Anyone who has a better idea?
I find that sometimes using the little Unix utilities (which are
available for M$ as well) gives very good performance.

--- last.py ---
#!/usr/bin/env python
from os import popen
from sys import argv

# Find last "From:" line
last = popen("grep -n 'From:' %s | tail -1" % argv[1]).read()
last = int(last.split(":")[0])
# Find total number of lines
size = popen("wc -l %s" % argv[1]).read()
size = int(size.split()[0].strip())
# Print the message
print popen("tail -%d %s" % (size - last, argv[1])).read()
--- last.py ---
Took less than 1 sec on my computer on an 11MB mailbox.

HTH.
Miki
 

Cameron Laird

Miki Tebeka said:
I find that sometimes using the little Unix utilities (which are
available for M$ as well) gives very good performance.
.
.
.
Absolutely.

Miki, I'd find this illustration even more compelling if it exploited
commands.getoutput(.)
in place of your triplicated
popen(.).read()
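For illustration, here is Miki's script reworked that way (a sketch;
commands.getoutput is the Python 2 function Cameron names, and last2.py
is just an illustrative file name):

--- last2.py ---
#!/usr/bin/env python
from commands import getoutput
from sys import argv

# Line number of the last "From:" line
last = int(getoutput("grep -n 'From:' %s | tail -1" % argv[1]).split(":")[0])
# Total number of lines in the mailbox
size = int(getoutput("wc -l %s" % argv[1]).split()[0])
# Print everything after the last "From:" line
print getoutput("tail -%d %s" % (size - last, argv[1]))
--- last2.py ---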
 

Erno Kuusela

Bastiaan Welmers said:
Especially because I often need messages near the end
of the mbox file. So I tried the following (scanning backwards
for "From " lines with readline()):

i = 0
j = 0
while 1:
    i += 1
    fp.seek(-i, 2)       # whence=2: seek i bytes back from end of file
    line = fp.readline()
    if not line:
        break
    if line[:5] == 'From ':
        j += 1
        if j == total_messages - message_number_needed:
            archive.seekp = fp.tell()
            message = archive.next()
            # message found
            break

But this also seems to be slow and CPU-intensive.

Something like this might work: the loop below scanned a 115MB mailbox
in about 1 second on a 1.2GHz K7. It extracts the next-to-last message,
but you get the idea. If you don't want to read the whole file into the
cache, you could adapt it to start with a smaller mmapped chunk from the
end of the file and enlarge it until you find what you want (a sketch of
that adaptation follows the code).


import os, re, mmap, sys
from cStringIO import StringIO
import email

fd = os.open(sys.argv[1], os.O_RDONLY)
size = os.fstat(fd).st_size
print size
buf = mmap.mmap(fd, size, access=mmap.ACCESS_READ)

# Offsets of every blank-line-plus-"From" message separator
message_offsets = []
for m in re.finditer(r'(?s)\n\nFrom', buf):
    message_offsets.append(m.start())

# Slice out the next-to-last message (+2 skips the two newlines)
msgfp = StringIO(buf[message_offsets[-2] + 2:message_offsets[-1] + 2])
msg = email.message_from_file(msgfp)
print msg['to']
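A sketch of the adaptation mentioned above (a hypothetical helper, not
erno's code; it maps the file read-only but scans only a window at the
end, doubling the window until enough separators turn up):

import os, re, mmap

def tail_message_span(path, which=-2, window=1 << 20):
    # Return the (start, end) byte span of the `which`-th message from
    # the end, scanning only the tail of the mmapped file.
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    sep = re.compile(r'\n\nFrom')
    while 1:
        start = max(0, size - window)
        # finditer on a compiled pattern accepts a start position, so
        # only the tail window is actually scanned (and paged in).
        offsets = [m.start() for m in sep.finditer(buf, start)]
        if start == 0 or len(offsets) >= abs(which):
            break
        window *= 2              # not enough separators yet: widen
    os.close(fd)
    if which == -1:
        end = size               # the last message runs to end of file
    else:
        end = offsets[which + 1] + 2
    return offsets[which] + 2, end

The returned span can be sliced out of the file and fed to
email.message_from_string as before.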

-- erno
 

Bastiaan Welmers

Miki said:
Hello Bastiaan,

I find that sometimes using the little Unix utilities (which are
available for M$ as well) gives very good performance.

Sounds like a very good idea. Thanks.

/Bastiaan
 

Bastiaan Welmers

Miklós said:
What about putting it into a database like MySQL? <pyWink>

Too much work to achieve this. It's just a Mailman archive mbox
which has to be opened, so I'd have to rewrite the pipermail
archiver.

/Bastiaan
 

Bastiaan Welmers

Diez said:
AFAIK, MUAs usually use an mbox index file for faster access. The index is
computed once and updated whenever a new message is added. You could
create this index quite easily yourself by looping over the mbox and
pickling a list of tell()'ed positions. If you also store the creation date
of the index and the file size of the mbox file, you should be able to
create a function that updates the index whenever the underlying mbox
has changed. Another approach would be to perform index creation on a
regular basis using cron.
Also a good idea. It's a Mailman archive, so I'd have to hack
Mailman to create an index file alongside the mbox file.

/Bastiaan
 
