HOWTO: Parsing email using Python part1

aspineux

Hi
I have written an article about parsing email using Python.
The article is at http://blog.magiksys.net/Parsing-email-using-python-header
and the full content is here.

Hope this helps someone.

Regards.


A lot of programs and libraries commonly used to send emails
don't comply with the RFCs. Ignoring such emails is not an option,
because every mail is important. A parser has to do its best with
them, as the most popular MUAs do.

Python has one of the best libraries for parsing emails: the
email package.

First part: how to decode mail headers

According to RFC 2047, non-ASCII text in a header must be encoded.
RFC 2822 distinguishes between different kinds of header fields:
*text fields like Subject: and address fields like To:, each with
different encoding rules. This is because RFC 822 forbids some ASCII
characters in certain places where they have a special meaning.
These characters can still be used once they are encoded, because
the encoded version doesn't disturb the parsing of the string.
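
To make the encoded form concrete, here is a minimal sketch, assuming Python 3 and the standard email.header module. It builds an RFC 2047 encoded-word from a non-ASCII string and decodes it again. The sample text is taken from the test message used later in this post; nothing else here comes from the article's own code.

```python
from email.header import Header, decode_header

# Build an RFC 2047 encoded-word for a non-ASCII string.
encoded = Header("Dr. Pointcarr\xe9", charset="iso-8859-1").encode()
print(encoded)  # ASCII only, of the form =?charset?encoding?text?=

# Decoding returns the raw bytes together with the charset name,
# so the caller still has to turn the bytes into text.
raw_bytes, charset = decode_header(encoded)[0]
print(raw_bytes.decode(charset))
```

The encoded string contains only ASCII characters, which is why it can travel safely inside a header.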

Python provides email.Header.decode_header() for decoding headers.
The function decodes each atom and returns a list of
(text, encoding) tuples that you still have to decode and join to
get the full text. This is done in my getmailheader() function.
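
The decode-and-join step can be sketched as follows. This is a rough Python 3 illustration, not the getmailheader() function from this post: decode_mime_words is a hypothetical name, and the isinstance check is there because Python 3's decode_header() can return either bytes or str for the text part.

```python
from email.header import decode_header

def decode_mime_words(value, default="ascii"):
    """Decode every (text, charset) pair and join the results (hypothetical helper)."""
    parts = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or default, errors="replace")
            except LookupError:
                # Unknown charset name: fall back to the default.
                text = text.decode(default, errors="replace")
        parts.append(text)
    return "".join(parts)

print(decode_mime_words("=?ISO-8859-1?Q?Dr.=20Pointcarr=E9?="))
print(decode_mime_words("Plain ASCII subject"))
```

Using errors="replace" keeps the function from raising on malformed input, at the cost of replacement characters in the result.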

For addresses, Python provides email.utils.getaddresses(),
which splits the addresses into a list of (display-name, address)
tuples. The display-name needs to be decoded too, and the addresses
must match the RFC 2822 syntax. The getmailaddresses() function
does the whole job.
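
As a small illustration of the splitting step alone, here is a sketch assuming Python 3. The names and example.com addresses are made up for the example; the input list plays the role of what msg.get_all() would return for an address header.

```python
from email.utils import getaddresses

# One string per header occurrence; a header may hold several addresses.
field_values = [
    "Alice Example <alice@example.com>, bob@example.com",
    "Carol Example <carol@example.com>",
]

pairs = getaddresses(field_values)
for display_name, address in pairs:
    print(repr(display_name), repr(address))
```

A bare address comes back with an empty display-name, which is the case getmailaddresses() has to handle explicitly.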

Here are the functions in action.

import re
import email
import email.errors
import email.utils
from email.header import decode_header

# email address REGEX matching the RFC 2822 spec
# from perlfaq9
# my $atom = qr{[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+};
# my $dot_atom = qr{$atom(?:\.$atom)*};
# my $quoted = qr{"(?:\\[^\r\n]|[^\\"])*"};
# my $local = qr{(?:$dot_atom|$quoted)};
# my $domain_lit = qr{\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]};
# my $domain = qr{(?:$dot_atom|$domain_lit)};
# my $addr_spec = qr{$local\@$domain};
#
# Python translation

atom_rfc2822=r"[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+"
# without '!' and '%'
atom_posfix_restricted=r"[a-zA-Z0-9_#\$&'*+/=?\^`{}~|\-]+"
atom=atom_rfc2822
dot_atom=atom + r"(?:\." + atom + ")*"
quoted=r'"(?:\\[^\r\n]|[^\\"])*"'
local="(?:" + dot_atom + "|" + quoted + ")"
domain_lit=r"\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]"
domain="(?:" + dot_atom + "|" + domain_lit + ")"
addr_spec=local + r"\@" + domain

email_address_re=re.compile('^'+addr_spec+'$')

raw="""MIME-Version: 1.0
Received: by 10.229.233.76 with HTTP; Sat, 2 Jul 2011 04:30:31 -0700
        (PDT)
Date: Sat, 2 Jul 2011 13:30:31 +0200
Delivered-To: (e-mail address removed)
Message-ID: <CAAJL_=kPAJZ=fryb21wBOALp8-XOEL-(e-mail address removed)>
Subject: =?ISO-8859-1?Q?Dr.=20Pointcarr=E9?=
From: Alain Spineux <[email protected]>
To: =?ISO-8859-1?Q?Dr=2E_Pointcarr=E9?= <[email protected]>
Content-Type: multipart/alternative;
 boundary=000e0cd68f223dea3904a714768b

--000e0cd68f223dea3904a714768b
Content-Type: text/plain; charset=ISO-8859-1

--
Alain Spineux

--000e0cd68f223dea3904a714768b
Content-Type: text/html; charset=ISO-8859-1



--
Alain Spineux


--000e0cd68f223dea3904a714768b--
"""

def getmailheader(header_text, default="ascii"):
    """Decode header_text if needed"""
    try:
        headers=decode_header(header_text)
    except email.errors.HeaderParseError:
        # This has already happened in email.base64mime.decode();
        # return a sanitized ascii string instead
        return header_text.encode('ascii', 'replace').decode('ascii')
    else:
        for i, (text, charset) in enumerate(headers):
            try:
                headers[i]=unicode(text, charset or default,
                                   errors='replace')
            except LookupError:
                # if the charset is unknown, force default
                headers[i]=unicode(text, default, errors='replace')
        return u"".join(headers)

def getmailaddresses(msg, name):
    """retrieve From:, To: and Cc: addresses"""
    addrs=email.utils.getaddresses(msg.get_all(name, []))
    for i, (name, addr) in enumerate(addrs):
        if not name and addr:
            # only one string! Is it the address or is it the name ?
            # use the same for both and see later
            name=addr

        try:
            # address must be ascii only
            addr=addr.encode('ascii')
        except UnicodeError:
            addr=''
        else:
            # address must match the address regex
            if not email_address_re.match(addr):
                addr=''
        addrs[i]=(getmailheader(name), addr)
    return addrs

msg=email.message_from_string(raw)
subject=getmailheader(msg.get('Subject', ''))
from_=getmailaddresses(msg, 'from')
from_=('', '') if not from_ else from_[0]
tos=getmailaddresses(msg, 'to')

print 'Subject: %r' % subject
print 'From: %r' % (from_, )
print 'To: %r' % (tos, )

And the output:

Subject: u'Dr. Pointcarr\xe9'
From: (u'Alain Spineux', '(e-mail address removed)')
To: [(u'Dr. Pointcarr\xe9', '(e-mail address removed)')]
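
The address regex can also be tried on its own. The sketch below assumes Python 3 and rebuilds the same pattern from the perlfaq9 translation; is_valid_address and the sample addresses are hypothetical and only there for illustration.

```python
import re

# Same building blocks as the perlfaq9 translation in the listing.
atom = r"[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+"
dot_atom = atom + r"(?:\." + atom + r")*"
quoted = r'"(?:\\[^\r\n]|[^\\"])*"'
local = "(?:" + dot_atom + "|" + quoted + ")"
domain_lit = r"\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]"
domain = "(?:" + dot_atom + "|" + domain_lit + ")"
email_address_re = re.compile("^" + local + "@" + domain + "$")

def is_valid_address(addr):
    """Return True if addr matches the addr-spec pattern (hypothetical helper)."""
    return email_address_re.match(addr) is not None

samples = [
    "alice@example.com",
    '"quoted local"@example.com',
    "alice@[192.168.0.1]",
    "a..b@example.com",
    "Alice <alice@example.com>",
]
for sample in samples:
    print("%-30s %s" % (sample, is_valid_address(sample)))
```

The pattern checks the bare addr-spec only, so a value that still carries a display-name is rejected; that is why getmailaddresses() applies it after getaddresses() has split name and address.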
 
