HOWTO: Parsing email using Python part2

aspineux

Hello

I have written part 2 about parsing email.

You can find the article here:

http://blog.magiksys.net/parsing-email-using-python-content

This part is a lot longer.

The first part was about the mail headers. This second part focuses
on the mail content.

Today's mails include HTML-formatted text, pictures and other
attachments.

Mail parts

MIME allows all these items to be mixed in a single mail. But MIME is
complex, and not all emails comply with the standards.

Although few MUAs are strictly compliant with the RFCs, most come
close. The poorest emails come from home-made programs and mail-merge
applications. Even if bad emails are often useless (in my experience),
your email parser must at least handle them without crashing.

The Python email library does a wonderful job of splitting an email
into parts following the MIME philosophy.
The email parts can be split into 3 categories:

- The message content, which is usually in plain text or in HTML
  format, and is often included in both formats.
- Some data related to the message (often to the HTML part), like
  background pictures, the company's logo ...
- The attachments, which can be saved as separate files.

MIME doesn't clearly indicate which part is the message content. The
plain text followed by the HTML version are usually at the top so that
MIME-unaware mail readers can read them easily. We must be careful not
to use an ordinary attachment as the message of the email. This is
what my function search_message_bodies() tries to do.

To understand the complexity, I will explain how different content can
be mixed into a single email. Parts have a type that can be, among
others: 'text/plain', 'text/html', 'image/*', 'application/*' or
'multipart/*' to indicate a container. Containers can contain other
containers. Here are the most interesting containers defined by MIME:

multipart/mixed

Used to mix files of different types. Parts can be displayed inline or
as attachments depending on the Content-Disposition header (which is
often missing).
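
As a rough sketch of the point above (Python 3, standard library
only; the sample message and the helper name list_dispositions are
made up for illustration and are not part of the script further down),
the disposition of each part can be read with
get_content_disposition(), which returns None when the header is
missing:

```python
import email

# Made-up sample: the first part has no Content-Disposition header at all.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=mixed

--mixed
Content-Type: text/plain

Hello
--mixed
Content-Type: application/pdf; name="doc.pdf"
Content-Disposition: attachment; filename="doc.pdf"
Content-Transfer-Encoding: base64

JVBERi0=
--mixed--
"""

def list_dispositions(msg):
    """Return (content type, disposition) for each direct child of msg."""
    result = []
    for part in msg.get_payload():
        # get_content_disposition() gives 'inline', 'attachment' or None
        result.append((part.get_content_type(), part.get_content_disposition()))
    return result

dispositions = list_dispositions(email.message_from_string(RAW))
for ctype, disposition in dispositions:
    print(ctype, disposition)
```

A part without the header is neither clearly inline nor clearly an
attachment, so the caller has to decide how to treat it.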

multipart/alternative

Each part is an alternative version of the same content, each in a
different format. The formats are ordered by how faithful they are to
the original, with the least faithful first and the most faithful
last. You are supposed to process the last part whose format you are
able to handle. This is how a mail includes a text and an HTML version
of the same message. Be careful: sometimes one part or the other is
missing, and sometimes the text part just says: "Read the HTML part"!
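
The "last part you can handle" rule can be sketched like this
(Python 3; pick_alternative and the sample message are hypothetical
names used only for this illustration, and nested containers are not
handled):

```python
import email

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=alt

--alt
Content-Type: text/plain; charset=us-ascii

Hello World
--alt
Content-Type: text/html; charset=us-ascii

<b>Hello World</b>
--alt--
"""

def pick_alternative(msg, supported):
    """Return the last sub-part whose type is in `supported`, or None."""
    chosen = None
    for part in msg.get_payload():
        if part.get_content_type() in supported:
            chosen = part   # later parts are more faithful, so keep overwriting
    return chosen

msg = email.message_from_string(RAW)
best_for_html_reader = pick_alternative(msg, ("text/plain", "text/html"))
best_for_text_reader = pick_alternative(msg, ("text/plain",))
print(best_for_html_reader.get_content_type())
print(best_for_text_reader.get_content_type())
```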

multipart/related

Parts must be considered as an aggregate whole. The root part, which
is usually the first one, references the other parts inline using
their "Content-ID" header. This is how pictures are embedded into
HTML.

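To illustrate how a root part points at its related parts, here is a
small sketch (Python 3; the message, the Content-ID value and the
helper related_parts_by_cid are invented for the example, and the root
is simply assumed to be the first part):

```python
import email
import re

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary=rel

--rel
Content-Type: text/html; charset=us-ascii

<p>Hi <img src="cid:logo@example"></p>
--rel
Content-Type: image/gif
Content-ID: <logo@example>
Content-Transfer-Encoding: base64

R0lGODlhAQABAAAAACw=
--rel--
"""

def related_parts_by_cid(msg):
    """Map each Content-ID (without the surrounding '<>') to its part."""
    by_cid = {}
    for part in msg.get_payload():
        cid = part.get("Content-ID")
        if cid:
            by_cid[cid.strip().strip("<>")] = part
    return by_cid

msg = email.message_from_string(RAW)
root = msg.get_payload()[0]          # assumption: the root is the first part
html = root.get_payload()
by_cid = related_parts_by_cid(msg)
# collect the identifiers used in "cid:" URLs inside the HTML
referenced = re.findall(r'cid:([^"\'>\s]+)', html)
for cid in referenced:
    print(cid, "->", by_cid[cid].get_content_type())
```
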
Other containers exist: multipart/report is used for mail delivery
notifications and contains message/* parts. Python also handles these
message/* parts as messages: it splits them into a header and a body,
and the body is in turn parsed and split into parts. This is one
option, but I prefer to consider the attached message as a whole.
This is why I don't use Python's Message.walk() to iterate over the
parts of the message.
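
The difference can be seen with a small experiment (a Python 3
sketch; shallow_types is a hypothetical helper written for this
comparison, not the function used in the script): walk() descends into
an attached message/rfc822, while a hand-written traversal can stop at
it.

```python
import email

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=outer

--outer
Content-Type: text/plain

See the forwarded mail.
--outer
Content-Type: message/rfc822

Subject: inner
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=inner

--inner
Content-Type: text/plain

inner text
--inner
Content-Type: text/html

<p>inner text</p>
--inner--
--outer--
"""

def shallow_types(msg):
    """List content types, treating any message/* part as a single leaf."""
    types = []
    stack = [msg]
    while stack:
        part = stack.pop(0)
        ctype = part.get_content_type()
        types.append(ctype)
        if part.is_multipart() and not ctype.startswith("message/"):
            stack[:0] = part.get_payload()   # depth-first, in document order
    return types

msg = email.message_from_string(RAW)
walked = [part.get_content_type() for part in msg.walk()]
shallow = shallow_types(msg)
print(walked)
print(shallow)
```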

Here are some structures you can find when parsing emails from
different sources. The first one is my favorite, the one I use when I
have to send a complex email including multiple message formats,
related contents and attachments:

multipart/mixed
 |
 +-- multipart/related
 |    |
 |    +-- multipart/alternative
 |    |    |
 |    |    +-- text/plain
 |    |    +-- text/html
 |    |
 |    +-- image/gif
 |
 +-- application/msword
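
For illustration only, the structure above can be built with the
email.mime classes and printed back as an indented tree (a Python 3
sketch; build_message and tree_lines are names made up here, and the
image and document payloads are placeholder bytes, not real files):

```python
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_message():
    """Build mixed -> related -> alternative, as in the diagram above."""
    alternative = MIMEMultipart("alternative")
    alternative.attach(MIMEText("Hello World", "plain"))
    alternative.attach(MIMEText("<b>Hello World</b>", "html"))

    related = MIMEMultipart("related")
    related.attach(alternative)
    # placeholder bytes; the subtype is given explicitly so nothing is guessed
    related.attach(MIMEImage(b"GIF89a", _subtype="gif"))

    mixed = MIMEMultipart("mixed")
    mixed.attach(related)
    mixed.attach(MIMEApplication(b"placeholder", _subtype="msword"))
    return mixed

def tree_lines(part, depth=0):
    """Return one line per part, indented by nesting depth."""
    lines = ["    " * depth + part.get_content_type()]
    if part.is_multipart():
        for sub in part.get_payload():
            lines.extend(tree_lines(sub, depth + 1))
    return lines

lines = tree_lines(build_message())
print("\n".join(lines))
```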

You can also see a simple structure, without related contents or
attachments:

multipart/alternative
 |
 +-- text/plain
 +-- text/html

This unbalanced structure is also a valid one:

multipart/alternative
 |
 +-- text/plain
 +-- multipart/related
      |
      +-- text/html
      +-- image/gif

Attachments

Some attachments must be shown inline, but if you cannot render such
content, you must make it available as a separate attachment.
Regular attachments must have a filename, but sometimes it is stored
in the wrong place and sometimes it is simply missing. If the filename
contains non-ASCII characters it must be encoded using RFC 2231, but
nearly everybody wrongly uses RFC 2047 instead. The function
get_filename() searches for the filename and decodes it.
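
To make the two encodings concrete, here is a sketch in Python 3 (the
script below is Python 2; decode_filename and the two one-part
messages are invented for this example and unknown charsets are not
handled). The same filename is sent once the RFC 2231 way and once the
RFC 2047 way:

```python
import email
from email.header import decode_header
from email.utils import collapse_rfc2231_value

def decode_filename(part):
    """Return the filename of `part` as text, trying RFC 2231 then RFC 2047."""
    value = part.get_param("filename", None, "content-disposition")
    if not value:
        value = part.get_param("name", None)   # fallback: Content-Type header
    if not value:
        return None
    # RFC 2231 values come back as a (charset, language, text) tuple
    text = collapse_rfc2231_value(value).strip()
    # RFC 2047 encoded-words ("=?charset?Q?...?=") are left untouched by the
    # step above, so decode them explicitly
    pieces = []
    for chunk, charset in decode_header(text):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", "replace")
        pieces.append(chunk)
    return "".join(pieces)

rfc2231 = email.message_from_string(
    "Content-Type: image/png\n"
    "Content-Disposition: attachment; filename*=ISO-8859-1''caf%E9.png\n\n")
rfc2047 = email.message_from_string(
    "Content-Type: image/png\n"
    'Content-Disposition: attachment; filename="=?ISO-8859-1?Q?caf=E9.png?="\n\n')
name_2231 = decode_filename(rfc2231)
name_2047 = decode_filename(rfc2047)
print(repr(name_2231), repr(name_2047))
```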

Once you have a filename, you still need to sanitize it to be sure you
can save the file on your filesystem. Some characters are forbidden,
like '/' on Un*x and '\' on Windows, and characters must match your
filesystem charset (if one is in use). Also, Windows doesn't accept
some names like "COM1" or "NUL".
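
One possible way to do this is sketched below (Python 3; the function
sanitize_filename, its parameters and the exact rule set are
assumptions made for this example, loosely based on the constraints
listed above, and filename collisions are not handled):

```python
import re

# Assumed rule set: characters rejected by Windows or Un*x, plus control chars.
FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
RESERVED = set(["CON", "PRN", "AUX", "NUL"]
               + ["COM%d" % i for i in range(1, 10)]
               + ["LPT%d" % i for i in range(1, 10)])

def sanitize_filename(name, replacement="_", default="attachment"):
    """Return a name that should be safe to create on Un*x and Windows."""
    name = FORBIDDEN.sub(replacement, name)
    # strip leading/trailing dots and spaces (Windows ignores trailing ones)
    name = name.strip(" .")
    stem = name.split(".", 1)[0]
    if stem.upper() in RESERVED:       # "NUL.txt" is reserved too
        name = replacement + name
    return name or default

for candidate in ("../etc/passwd", "NUL.txt", "a\\b:c.png", ""):
    print(repr(candidate), "->", repr(sanitize_filename(candidate)))
```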

My function get_mail_contents() returns a list of Attachment objects
with the related attributes. When an attachment is of type text/*, the
payload must be decoded using the charset, if set. Be careful, the
charset is not always accurate! Use the function decode_text().
Attachments holding the message content can be found using the
is_body attribute.

The code

The code (Python 2) includes pieces from the first part and can be
downloaded here. Run without parameters, it parses the embedded
sample. The path of a saved raw email can be given as an argument.
Here is the output for the embedded sample:

Subject: u'Dr. Pointcarr\xe9'
From: (u'Alain Spineux', '(e-mail address removed)')
To: [(u'Dr. Pointcarr\xe9', '(e-mail address removed)')]
    filename=None is_body=text/plain type=text/plain charset=ISO-8859-1 desc=None size=12
        Hello World
    filename=None is_body=text/html type=text/html charset=ISO-8859-1 desc=None size=21
    filename=u'smile.png' is_body=None type=image/png charset=None desc=None size=473

And some explanation:

def search_message_bodies(mail)

This function navigates the MIME tree of the mail to retrieve the
parts that contain the message, along with their formats. It returns
something like this:

{ 'text/plain' : <email.message.Message instance at 0xXXXX>,
  'text/html' : <email.message.Message instance at 0xYYYY> }

And now the code:

import sys, os, re, StringIO
import email, mimetypes

invalid_chars_in_filename='<>:"/\\|?*\%\''+reduce(lambda x,y:x+chr(y), range(32), '')
invalid_windows_name='CON PRN AUX NUL COM1 COM2 COM3 COM4 COM5 COM6 COM7 COM8 COM9 LPT1 LPT2 LPT3 LPT4 LPT5 LPT6 LPT7 LPT8 LPT9'.split()

# email address REGEX matching the RFC 2822 spec from perlfaq9
# my $atom = qr{[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+};
# my $dot_atom = qr{$atom(?:\.$atom)*};
# my $quoted = qr{"(?:\\[^\r\n]|[^\\"])*"};
# my $local = qr{(?:$dot_atom|$quoted)};
# my $domain_lit = qr{\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]};
# my $domain = qr{(?:$dot_atom|$domain_lit)};
# my $addr_spec = qr{$local\@$domain};
#
# Python's translation

atom_rfc2822=r"[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+"
atom_posfix_restricted=r"[a-zA-Z0-9_#\$&'*+/=?\^`{}~|\-]+" # without '!' and '%'
atom=atom_rfc2822
dot_atom=atom + r"(?:\." + atom + ")*"
quoted=r'"(?:\\[^\r\n]|[^\\"])*"'
local="(?:" + dot_atom + "|" + quoted + ")"
domain_lit=r"\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]"
domain="(?:" + dot_atom + "|" + domain_lit + ")"
addr_spec=local + "\@" + domain

email_address_re=re.compile('^'+addr_spec+'$')

class Attachment:
    def __init__(self, part, filename=None, type=None, payload=None,
                 charset=None, content_id=None, description=None,
                 disposition=None, sanitized_filename=None, is_body=None):
        self.part=part          # original python part
        self.filename=filename  # filename in unicode (if any)
        self.type=type          # the mime-type
        self.payload=payload    # the MIME decoded content
        self.charset=charset    # the charset (if any)
        self.description=description    # if any
        self.disposition=disposition    # 'inline', 'attachment' or None
        self.sanitized_filename=sanitized_filename # cleanup your filename here (TODO)
        self.is_body=is_body    # usually in (None, 'text/plain' or 'text/html')
        self.content_id=content_id      # if any
        if self.content_id:
            # strip '<>' to ease search and replace in "root" content (TODO)
            if self.content_id.startswith('<') and self.content_id.endswith('>'):
                self.content_id=self.content_id[1:-1]

def getmailheader(header_text, default="ascii"):
    """Decode header_text if needed"""
    try:
        headers=email.Header.decode_header(header_text)
    except email.Errors.HeaderParseError:
        # This already happened in email.base64mime.decode();
        # return a sanitized ascii string instead.
        # This one fails: '=?UTF-8?B?15HXmdeh15jXqNeVINeY15DXpteUINeTJ9eV16jXlSDXkdeg15XXldeUINem15PXpywg15TXptei16bXldei15nXnSDXqdecINek15zXmdeZ?==?UTF-8?B?157XldeR15nXnCwg157Xldek16Ig157Xl9eV15wg15HXodeV15bXnyDXk9ec15DXnCDXldeh15gg157Xl9eR16rXldeqINep15wg15HXmdeQ?==?UTF-8?B?15zXmNeZ?='
        return header_text.encode('ascii', 'replace').decode('ascii')
    else:
        for i, (text, charset) in enumerate(headers):
            try:
                # store the decoded text back at index i
                headers[i]=unicode(text, charset or default, errors='replace')
            except LookupError:
                # if the charset is unknown, force default
                headers[i]=unicode(text, default, errors='replace')
        return u"".join(headers)

def getmailaddresses(msg, name):
    """retrieve addresses from header, 'name' supposed to be from, to, ..."""
    addrs=email.utils.getaddresses(msg.get_all(name, []))
    for i, (name, addr) in enumerate(addrs):
        if not name and addr:
            # only one string! Is it the address or is it the name ?
            # use the same for both and see later
            name=addr

        try:
            # address must be ascii only
            addr=addr.encode('ascii')
        except UnicodeError:
            addr=''
        else:
            # address must match address regex
            if not email_address_re.match(addr):
                addr=''
        # replace the raw pair at index i with the decoded one
        addrs[i]=(getmailheader(name), addr)
    return addrs

def get_filename(part):
    """Many mail user agents send attachments with the filename in
    the 'name' parameter of the 'content-type' header instead
    of in the 'filename' parameter of the 'content-disposition' header.
    """
    filename=part.get_param('filename', None, 'content-disposition')
    if not filename:
        filename=part.get_param('name', None) # default is 'content-type'

    if filename:
        # RFC 2231 must be used to encode parameters inside MIME header
        filename=email.Utils.collapse_rfc2231_value(filename).strip()

    if filename and isinstance(filename, str):
        # But a lot of MUA erroneously use RFC 2047 instead of RFC 2231
        # in fact almost everybody misuses RFC 2047 here !!!
        filename=getmailheader(filename)

    return filename

def _search_message_bodies(bodies, part):
    """recursive search of the multiple versions of the 'message' inside
    the message structure of the email, used by search_message_bodies()"""

    type=part.get_content_type()
    if type.startswith('multipart/'):
        # explore only true 'multipart/*'
        # because 'message/rfc822' parts are also python 'multipart'
        if type=='multipart/related':
            # the first part or the one pointed by start
            start=part.get_param('start', None)
            related_type=part.get_param('type', None)
            for i, subpart in enumerate(part.get_payload()):
                if (not start and i==0) or (start and start==subpart.get('Content-Id')):
                    _search_message_bodies(bodies, subpart)
                    return
        elif type=='multipart/alternative':
            # all parts are candidates and latest is best
            for subpart in part.get_payload():
                _search_message_bodies(bodies, subpart)
        elif type in ('multipart/report', 'multipart/signed'):
            # only the first part is candidate
            try:
                subpart=part.get_payload()[0]
            except IndexError:
                return
            else:
                _search_message_bodies(bodies, subpart)
                return

        elif type=='multipart/signed':
            # cannot handle this
            # (never reached: 'multipart/signed' is already handled above)
            return

        else:
            # unknown types must be handled as 'multipart/mixed'
            # This piece of code could probably be improved; I use a heuristic:
            # - if not already found, use first valid non 'attachment' parts found
            for subpart in part.get_payload():
                tmp_bodies=dict()
                _search_message_bodies(tmp_bodies, subpart)
                for k, v in tmp_bodies.iteritems():
                    if not subpart.get_param('attachment', None, 'content-disposition')=='':
                        # if not an attachment, initiate value if not already found
                        bodies.setdefault(k, v)
            return
    else:
        bodies[part.get_content_type().lower()]=part
        return

    return

def search_message_bodies(mail):
    """search message content into a mail"""
    bodies=dict()
    _search_message_bodies(bodies, mail)
    return bodies

def get_mail_contents(msg):
    """split an email in a list of attachments"""

    attachments=[]

    # retrieve messages of the email
    bodies=search_message_bodies(msg)
    # reverse bodies dict
    parts=dict((v,k) for k, v in bodies.iteritems())

    # organize the stack to handle depth-first search
    stack=[ msg, ]
    while stack:
        part=stack.pop(0)
        type=part.get_content_type()
        if type.startswith('message/'):
            # ('message/delivery-status', 'message/rfc822', 'message/disposition-notification'):
            # I don't want to explore the tree deeper here and just save source using msg.as_string()
            # but I don't use msg.as_string() because I want to use mangle_from_=False
            from email.Generator import Generator
            fp = StringIO.StringIO()
            g = Generator(fp, mangle_from_=False)
            g.flatten(part, unixfrom=False)
            payload=fp.getvalue()
            filename='mail.eml'
            attachments.append(Attachment(part, filename=filename, type=type,
                payload=payload, charset=part.get_param('charset'),
                description=part.get('Content-Description')))
        elif part.is_multipart():
            # insert new parts at the beginning of the stack (depth-first search)
            stack[:0]=part.get_payload()
        else:
            payload=part.get_payload(decode=True)
            charset=part.get_param('charset')
            filename=get_filename(part)

            disposition=None
            if part.get_param('inline', None, 'content-disposition')=='':
                disposition='inline'
            elif part.get_param('attachment', None, 'content-disposition')=='':
                disposition='attachment'

            attachments.append(Attachment(part, filename=filename, type=type,
                payload=payload, charset=charset,
                content_id=part.get('Content-Id'),
                description=part.get('Content-Description'),
                disposition=disposition, is_body=parts.get(part)))

    return attachments

def decode_text(payload, charset, default_charset):
    if charset:
        try:
            return payload.decode(charset), charset
        except UnicodeError:
            pass

    if default_charset and default_charset!='auto':
        try:
            return payload.decode(default_charset), default_charset
        except UnicodeError:
            pass

    for chset in [ 'ascii', 'utf-8', 'utf-16', 'windows-1252', 'cp850' ]:
        try:
            return payload.decode(chset), chset
        except UnicodeError:
            pass

    return payload, None

if __name__ == "__main__":

    raw="""MIME-Version: 1.0
Received: by 10.229.233.76 with HTTP; Sat, 2 Jul 2011 04:30:31 -0700 (PDT)
Date: Sat, 2 Jul 2011 13:30:31 +0200
Delivered-To: (e-mail address removed)
Message-ID: <CAAJL_=kPAJZ=fryb21wBOALp8-XOEL-(e-mail address removed)>
Subject: =?ISO-8859-1?Q?Dr.=20Pointcarr=E9?=
From: Alain Spineux <[email protected]>
To: =?ISO-8859-1?Q?Dr=2E_Pointcarr=E9?= <[email protected]>
Content-Type: multipart/mixed; boundary=mixed

--mixed
Content-Type: multipart/alternative; boundary=alternative

--alternative
Content-Type: text/plain; charset=ISO-8859-1

Hello World

--alternative
Content-Type: text/html; charset=ISO-8859-1

Hello World<br>
<br>

--alternative--
--mixed
Content-Type: image/png; name="smile.png"
Content-Disposition: attachment; filename="smile.png"
Content-Transfer-Encoding: base64

iVBORw0KGgoAAAANSUhEUgAAAA4AAAAOBAMAAADtZjDiAAAAMFBMVEUQEAhaUjlaWlp7e3uMezGU
hDGcnJy1lCnGvVretTnn5+/3pSn33mP355T39+//75SdwkyMAAAACXBIWXMAAA7EAAAOxAGVKw4b
AAAAB3RJTUUH2wcJDxEjgefAiQAAAAd0RVh0QXV0aG9yAKmuzEgAAAAMdEVYdERlc2NyaXB0aW9u
ABMJISMAAAAKdEVYdENvcHlyaWdodACsD8w6AAAADnRFWHRDcmVhdGlvbiB0aW1lADX3DwkAAAAJ
dEVYdFNvZnR3YXJlAF1w/zoAAAALdEVYdERpc2NsYWltZXIAt8C0jwAAAAh0RVh0V2FybmluZwDA
G+aHAAAAB3RFWHRTb3VyY2UA9f+D6wAAAAh0RVh0Q29tbWVudAD2zJa/AAAABnRFWHRUaXRsZQCo
7tInAAAAaElEQVR4nGNYsXv3zt27TzHcPup6XDBmDsOeBvYzLTynGfacuHfm/x8gfS7tbtobEM3w
n2E9kP5n9N/oPZA+//7PP5D8GSCYA6RPzjlzEkSfmTlz+xkgffbkzDlAuvsMWAHDmt0g0AUAmyNE
wLAIvcgAAAAASUVORK5CYII=
--mixed--
"""

    if len(sys.argv)>1:
        raw=open(sys.argv[1]).read()

    msg=email.message_from_string(raw)
    attachments=get_mail_contents(msg)

    subject=getmailheader(msg.get('Subject', ''))
    from_=getmailaddresses(msg, 'from')
    from_=('', '') if not from_ else from_[0]
    tos=getmailaddresses(msg, 'to')

    print 'Subject: %r' % subject
    print 'From: %r' % (from_, )
    print 'To: %r' % (tos, )

    for attach in attachments:
        # don't forget to sanitize 'filename' and to watch out
        # for filename collisions before saving:
        print '\tfilename=%r is_body=%s type=%s charset=%s desc=%s size=%d' % (
            attach.filename, attach.is_body, attach.type, attach.charset,
            attach.description,
            0 if attach.payload is None else len(attach.payload))

        if attach.is_body=='text/plain':
            # print first 3 lines
            payload, used_charset=decode_text(attach.payload, attach.charset, 'auto')
            for line in payload.split('\n')[:3]:
                # be careful, the console may be unable to display unicode characters
                if line:
                    print '\t\t', line
 
