MemoryError on reading mbox file


Christoph Krammer

Hello everybody,

I have to convert a huge mbox file (~1.5G) to MySQL.

I tried with the following simple code:

for m in mailbox.mbox(fileName):

    msg = m.as_string(True)
    hash = md5.new(msg).hexdigest()

    try:
        dbcurs.execute("""INSERT INTO archive (hash, msg) VALUES (%s,
        %s)""", (hash, msg))
    except MySQLdb.OperationalError, err:
        print "%s Error (%d): %s" % (fileName, err[0], err[1])
    else:
        print "%s: Message successfully added to database" % (hash,)

The problem seems to be the size of file, every time I try to execute
the script, after about 20000 messages, the following error occurs:

Traceback (most recent call last):
  File "email_to_mysql_mbox.py", line 21, in <module>
    for m in mailbox.mbox(fileName):
  File "/usr/lib/python2.5/mailbox.py", line 98, in itervalues
    value = self[key]
  File "/usr/lib/python2.5/mailbox.py", line 70, in __getitem__
    return self.get_message(key)
  File "/usr/lib/python2.5/mailbox.py", line 633, in get_message
    string = self._file.read(stop - self._file.tell())
MemoryError

My system has 512M RAM and 768M swap, which seems to run out at an
early stage of this. Is there a way to clean up memory for messages
already processed?

Thanks and regards,
Christoph
 

David

Christoph Krammer said:
My system has 512M RAM and 768M swap, which seems to run out at an
early stage of this. Is there a way to clean up memory for messages
already processed?

It may be that Python's garbage collection isn't keeping up with your app.

You could try periodically forcing it to run, e.g.:

import gc
gc.collect()

You can also fine-tune the GC settings and check what is using up your memory.

More info here: http://docs.python.org/lib/module-gc.html
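
As a rough sketch of that suggestion, written for a current Python 3 interpreter rather than the Python 2.5 used in this thread: gc.collect() returns the number of unreachable objects it found, and gc.get_threshold() shows the current tuning values. The helper name collect_every and the interval of 1000 are made up for illustration.

```python
import gc


def collect_every(items, interval=1000):
    """Yield items unchanged, forcing a full collection every `interval` items.

    `collect_every` and `interval` are hypothetical names for this sketch.
    """
    for count, item in enumerate(items, start=1):
        yield item
        if count % interval == 0:
            # gc.collect() returns the number of unreachable objects found.
            unreachable = gc.collect()
            print("after %d items: %d unreachable objects" % (count, unreachable))


# The thresholds for the three generations can be inspected (and changed
# with gc.set_threshold) when tuning the collector.
print("thresholds:", gc.get_threshold())
```

In the loop from the original post this would wrap the mailbox, as in `for m in collect_every(mailbox.mbox(fileName))`. Forcing a collection only helps when memory is held by reference cycles; it cannot free objects that are still referenced.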
 

Christoph Krammer

David said:
It may be that Python's garbage collection isn't keeping up with your app.

You could try periodically forcing it to run, e.g.:

import gc
gc.collect()

I tried this, but it did not solve the problem. When I invoke garbage
collection after every loop iteration, the amount of memory shown by top
stays the same for a very long time. Then at some point (at a different
message on each run), while the loop header is executing, memory usage
climbs until both RAM and swap reach 100%, and the MemoryError follows.

Could there be a problem in the mailbox module when processing very
large files?

Regards,
Christoph
 

Hrvoje Niksic

Christoph Krammer said:
I have to convert a huge mbox file (~1.5G) to MySQL.

Have you tried commenting out the MySQL portion of the code? Does the
code then manage to finish processing the mailbox?
 

Istvan Albert

Christoph Krammer said:
    string = self._file.read(stop - self._file.tell())
MemoryError

This line reads an entire message into memory as a string. Is it
possible that you have a huge email in there (hundreds of MB) with
some attachment encoded as text?

Either way, the truth is that many modules in the standard library are
not well equipped to deal with large amounts of data. Many of them
were developed before gigabyte-sized files were even possible to store,
let alone process. Hopefully P3K will alleviate many of these problems
through its extensive use of generators.

For now I would recommend that you split your mbox file into several
smaller ones (I think all you need is to split at the To: fields) and
run your script on these individual files.
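
A minimal sketch of such a split, assuming each message begins at a line starting with "From " (the separator that the mbox format and the mailbox module rely on, rather than the To: header fields). It streams the file one line at a time, so memory use does not depend on the size of the input. The function name split_mbox, the output file naming, and the chunk size are hypothetical, and the sketch trusts that "From " lines inside message bodies are already escaped, as they are in a well-formed mbox.

```python
import os


def split_mbox(path, out_dir, messages_per_file=1000):
    """Split the mbox at `path` into numbered files inside `out_dir`.

    Returns the list of files written. Only one line is held in memory
    at a time.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    count = 0
    with open(path, "rb") as src:
        for line in src:
            # A line starting with "From " marks the start of a message.
            if line.startswith(b"From "):
                if out is None or count == messages_per_file:
                    if out is not None:
                        out.close()
                    name = os.path.join(out_dir, "part-%05d.mbox" % len(written))
                    out = open(name, "wb")
                    written.append(name)
                    count = 0
                count += 1
            # Anything before the first "From " line is skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

Each resulting file is itself an mbox, so the original script can be run on the parts one after another.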

i.
 

Gabriel Genellina

On Wed, 12 Sep 2007 11:39:46 -0300, Istvan Albert wrote:
This line reads an entire message into memory as a string. Is it
possible that you have a huge email in there (hundreds of MB) with
some attachment encoded as text?

Printing start, stop, and stop - start inside that method would be an
easy way to find out whether that is the case.
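
One way to gather those numbers without editing mailbox.py, sketched for a current Python 3 interpreter: the private _lookup() method that get_message() calls returns the (start, stop) byte offsets of a message. Because it is an internal detail of the mailbox module, it may change between versions. The helper name largest_messages is hypothetical.

```python
import mailbox


def largest_messages(path, top=5):
    """Return up to `top` (size, key, start, stop) tuples, largest first."""
    box = mailbox.mbox(path, create=False)
    try:
        spans = []
        for key in box.keys():
            # _lookup() is private to the mailbox module (assumed to stay
            # available); it returns byte offsets into the mbox file.
            start, stop = box._lookup(key)
            spans.append((stop - start, key, start, stop))
    finally:
        box.close()
    spans.sort(reverse=True)
    return spans[:top]
```

A single entry with a size close to the size of the whole file would point to a message boundary that was not recognised, rather than to a genuinely huge email.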

The following idea could help fix it, or at least avoid reading the
whole message at once.
self._message_factory will eventually call the mailbox.Message
constructor, which also accepts a file object (instead of a huge string).
In that same module there is a utility class, _PartialFile ("A read-only
wrapper of part of a file"). _mboxMMDF.get_file() returns a
_PartialFile object, so I'd try this code (untested!):

def get_message(self, key):
    """Return a Message representation or raise a KeyError."""
    msg = self._message_factory(self.get_file(key, True))
    msg.set_from(msg.get_unixfrom()[5:])
    return msg
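
The same idea can also be tried from the calling side, without patching the module. A hedged sketch for a current Python 3 interpreter: get_file() returns a file-like object for one message, which can be read in fixed-size chunks so that the message never has to exist as one large string. The helper name iter_message_hashes and the 64 KB chunk size are hypothetical, and hashlib.md5 stands in for the md5 module used in the original post. The digest covers the raw bytes stored in the file, so it will not necessarily match a hash of as_string(True).

```python
import hashlib
import mailbox


def iter_message_hashes(path, chunk_size=64 * 1024):
    """Yield (key, md5 hex digest) for each message, reading in chunks."""
    box = mailbox.mbox(path, create=False)
    try:
        for key in box.keys():
            digest = hashlib.md5()
            # from_=True includes the leading "From " line in the data.
            msg_file = box.get_file(key, from_=True)
            try:
                # Read until the file-like object returns an empty chunk.
                for chunk in iter(lambda: msg_file.read(chunk_size), b""):
                    digest.update(chunk)
            finally:
                msg_file.close()
            yield key, digest.hexdigest()
    finally:
        box.close()
```

The database insert would still need the full message text, so this only helps for steps, such as hashing, that can work on a stream.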
 

Christoph Krammer

Istvan Albert said:
This line reads an entire message into memory as a string. Is it
possible that you have a huge email in there (hundreds of MB) with
some attachment encoded as text?

No, the largest single message in the mbox is about 100 KB.

Istvan Albert said:
For now I would recommend that you split your mbox file into several
smaller ones (I think all you need is to split at the To: fields) and
run your script on these individual files.

I got it to work by splitting the mbox file into single files, one
for each message, with the git-mailsplit tool, which is included in the
gitk package. This solved the problem for now.

Thanks for all your help.

Christoph
 
