HOWTO: Parsing email using Python part2

Discussion in 'Python' started by aspineux, Jul 15, 2011.

  1. aspineux

    Hello

    I have written part 2 about parsing email.

    You can find the article here :

    http://blog.magiksys.net/parsing-email-using-python-content

    This part is a lot longer :

    The first part was about the mail headers. This second part focuses on
    the mail content.

    Today's mails include HTML-formatted text, pictures and other
    attachments.

    Mail parts

    MIME allows all these items to be mixed into a single mail. But MIME is
    complex and not all emails comply with the standards.

    Even if few MUAs are strictly compliant with the RFCs, most are close.
    The poorest emails come from self-made programs and mail merge
    applications. Even if bad emails are often useless (in my experience),
    your email parser must at least handle them without crashing.

    The Python email library does a wonderful job of splitting an email
    into parts following the MIME philosophy.
    The email parts can be split into 3 categories:

    - The message content, which is usually in plain text or in HTML
      format, and is often included in both formats.
    - Some data related to the message (often to the HTML part), like
      background pictures or a company logo.
    - The attachments, which can be saved as separate files.

    MIME doesn't clearly indicate which part is the message content. The
    plain text followed by the HTML version is usually at the top to
    allow MIME-unaware mail readers to read them easily. We must be
    careful not to use an ordinary attachment as the message of the email.
    This is what my function search_message_bodies() tries to do.

    To understand the complexity, I will explain how different content can
    be mixed into a single email. Parts have a type that can be, among
    others, 'text/plain', 'text/html', 'image/*', 'application/*', or
    'multipart/*' to indicate a container. Containers can contain other
    containers. Here are the most interesting containers defined by MIME:

    multipart/mixed

    Used to mix files of different types. Parts can be displayed inline or
    as attachments depending on the Content-Disposition header (which is
    often missing).
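
    As a minimal sketch of that check: the snippet below is written for
    Python 3.5 or later (unlike the Python 2 code in this post), and the
    helper name disposition_of() and the two sample parts are made up for
    illustration.

```python
import email

def disposition_of(part):
    """Return 'inline', 'attachment', or None when the header is missing."""
    value = part.get_content_disposition()  # available since Python 3.5
    return value if value in ("inline", "attachment") else None

# Hypothetical parts, reduced to the headers that matter here.
attached = email.message_from_string(
    'Content-Type: image/png\n'
    'Content-Disposition: attachment; filename="smile.png"\n\n')
unmarked = email.message_from_string('Content-Type: image/png\n\n')

print(disposition_of(attached))  # attachment
print(disposition_of(unmarked))  # None
```

    When the result is None, the reader has to guess how to present the
    part, for instance from its content type.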

    multipart/alternative

    Each part is an alternative version of the same content, each in a
    different format. The formats are ordered by how faithful they are to
    the original, with the least faithful first and the most faithful last.
    You are supposed to process the last part whose format you are able to
    handle. This is how a mail includes a text and an HTML version
    of the same message. Be careful: sometimes one part or the other is
    missing, and sometimes the text part just says "Read the HTML part"!
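
    The "last part you can handle" rule can be sketched as follows. This
    is Python 3, not the code from the article; pick_alternative() and the
    tiny sample message are assumptions made for the example.

```python
import email

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=alt

--alt
Content-Type: text/plain; charset=utf-8

Hello World

--alt
Content-Type: text/html; charset=utf-8

<b>Hello World</b>

--alt--
"""

def pick_alternative(container, supported=("text/plain", "text/html")):
    """Return the last direct subpart whose type is supported, or None."""
    best = None
    for subpart in container.get_payload():
        if subpart.get_content_type() in supported:
            best = subpart  # later parts are more faithful, keep overwriting
    return best

msg = email.message_from_string(RAW)
print(pick_alternative(msg).get_content_type())                   # text/html
print(pick_alternative(msg, ("text/plain",)).get_content_type())  # text/plain
```

    A text-only reader passes a shorter "supported" tuple and naturally
    falls back to the plain text version.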

    multipart/related

    The parts must be considered as an aggregate whole. The root part, which
    is usually the first one, references the other parts inline using their
    "Content-ID" header. This is how pictures are embedded into HTML.

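    To illustrate how the root part and the Content-ID headers fit
    together, here is a small Python 3 sketch. The sample message, the
    helper index_by_content_id() and the regular expression are
    assumptions for the example, and the image bytes are a placeholder,
    not a real GIF.

```python
import email
import re

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/related; boundary=rel

--rel
Content-Type: text/html; charset=utf-8

<p>Logo: <img src="cid:logo@example"></p>

--rel
Content-Type: image/gif
Content-ID: <logo@example>
Content-Transfer-Encoding: base64

R0lGODlhAQABAAAAACw=

--rel--
"""

def index_by_content_id(container):
    """Map each Content-ID (without the surrounding '<>') to its subpart."""
    index = {}
    for subpart in container.get_payload():
        cid = subpart.get("Content-ID")
        if cid:
            index[cid.strip().strip("<>")] = subpart
    return index

msg = email.message_from_string(RAW)
root = msg.get_payload()[0]  # assumed root: first part, no 'start' parameter
html = root.get_payload(decode=True).decode("utf-8")
by_cid = index_by_content_id(msg)
referenced = re.findall(r'cid:([^"\'>\s]+)', html)

for cid in referenced:
    print(cid, "->", by_cid[cid].get_content_type())
```
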
    Other containers exist: multipart/report is used for mail delivery
    notifications and contains message/* parts. Python handles these
    message/* parts as messages too: it splits them into a header and a
    body, and the body is in turn parsed and split into parts. This is
    one option, but I prefer to consider the attached message as a whole.
    This is why I don't use Python's Message.walk() to iterate over
    the parts of the message.
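
    The difference with Message.walk() can be seen on a small forwarded
    mail. This is a Python 3 sketch with a made-up sample; iter_parts() is
    a hypothetical name, and it follows the same stack idea as
    get_mail_contents() in the code of this post.

```python
import email

RAW = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=outer

--outer
Content-Type: text/plain

See the forwarded mail.

--outer
Content-Type: message/rfc822

Subject: inner
Content-Type: text/plain

Inner body

--outer--
"""

def iter_parts(msg):
    """Depth-first iteration that yields message/* parts as a whole."""
    stack = [msg]
    while stack:
        part = stack.pop(0)
        if part.get_content_type().startswith("message/"):
            yield part                      # do not descend into the attached mail
        elif part.is_multipart():
            stack[:0] = part.get_payload()  # children first: depth-first order
        else:
            yield part

msg = email.message_from_string(RAW)
mine = [p.get_content_type() for p in iter_parts(msg)]
walked = [p.get_content_type() for p in msg.walk()]
print(mine)    # ['text/plain', 'message/rfc822']
print(walked)  # walk() also yields the containers and the inner text/plain
```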

    Here are some structures you can find when parsing emails from
    different sources. The first one is my favorite, the one I use when I
    have to send a complex email including multiple message formats,
    related contents and attachments :

    multipart/mixed
     |
     +-- multipart/related
     |    |
     |    +-- multipart/alternative
     |    |    |
     |    |    +-- text/plain
     |    |    +-- text/html
     |    |
     |    +-- image/gif
     |
     +-- application/msword
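
    For readers who want to produce this layout themselves, here is one
    possible way to build it with the standard email.mime classes in
    Python 3. It is only a sketch: the payloads are placeholders, and the
    names build_message(), tree_lines() and "report.doc" are invented for
    the example.

```python
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_message():
    """Build mixed -> related -> alternative, as in the tree above."""
    alternative = MIMEMultipart("alternative")
    alternative.attach(MIMEText("Hello World", "plain"))  # least faithful first
    alternative.attach(MIMEText('<p>Hello <img src="cid:logo"></p>', "html"))

    related = MIMEMultipart("related")
    related.attach(alternative)                 # root part comes first
    image = MIMEImage(b"placeholder bytes", _subtype="gif")
    image.add_header("Content-ID", "<logo>")
    related.attach(image)

    mixed = MIMEMultipart("mixed")
    mixed.attach(related)
    doc = MIMEApplication(b"placeholder bytes", _subtype="msword")
    doc.add_header("Content-Disposition", "attachment", filename="report.doc")
    mixed.attach(doc)
    return mixed

def tree_lines(part, depth=0):
    """Return one indented line per part, containers before their children."""
    lines = ["    " * depth + part.get_content_type()]
    if part.is_multipart():
        for subpart in part.get_payload():
            lines.extend(tree_lines(subpart, depth + 1))
    return lines

print("\n".join(tree_lines(build_message())))
```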

    You can also find a simpler structure, without related contents or
    attachments :

    multipart/alternative
    |
    +-- text/plain
    +-- text/html

    This unbalanced structure is also a valid one :

    multipart/alternative
     |
     +-- text/plain
     +-- multipart/related
          |
          +-- text/html
          +-- image/gif

    Attachments

    Some attachments must be shown inline, but if you cannot render such
    content, you must make it available as a separate attachment.
    Regular attachments must have a filename, but sometimes it is stored in
    the wrong place and sometimes it is just missing. If the filename
    contains non-ASCII characters it must be encoded using RFC 2231, but
    almost everybody wrongly uses RFC 2047 instead. The function
    get_filename() searches for and decodes the filename.
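
    Here is a rough Python 3 equivalent of that lookup, to show the two
    encodings side by side. It is a sketch, not the article's function:
    filename_of(), decode_rfc2047(), part_from() and the sample headers
    are assumptions made for the example.

```python
import email
from email.header import decode_header
from email.utils import collapse_rfc2231_value

def decode_rfc2047(value):
    """Decode RFC 2047 encoded-words, replacing undecodable bytes."""
    chunks = []
    for text, charset in decode_header(value):
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or "ascii", errors="replace")
            except LookupError:  # unknown charset name
                text = text.decode("ascii", errors="replace")
        chunks.append(text)
    return "".join(chunks)

def filename_of(part):
    """Look in Content-Disposition first, then in Content-Type 'name'."""
    value = part.get_param("filename", None, "content-disposition")
    if not value:
        value = part.get_param("name", None)  # default header is Content-Type
    if not value:
        return None
    value = collapse_rfc2231_value(value).strip()  # the correct encoding
    return decode_rfc2047(value)                   # the common, wrong one

def part_from(header_line):
    """Hypothetical helper: a part made of one header and a dummy body."""
    return email.message_from_string(header_line + "\n\nbody\n")

rfc2231 = part_from("Content-Disposition: attachment; filename*=UTF-8''caf%C3%A9.txt")
rfc2047 = part_from('Content-Disposition: attachment; filename="=?ISO-8859-1?Q?caf=E9.txt?="')
in_type = part_from('Content-Type: image/png; name="smile.png"')

for part in (rfc2231, rfc2047, in_type):
    print(repr(filename_of(part)))
```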

    When you have a filename you still need to sanitize it to be sure you
    can save the file on your filesystem. Some characters are forbidden,
    like '/' on Un*x and '\' on Windows, and characters must match your
    filesystem charset (if any is in use). Also, Windows doesn't accept
    some names like "COM1" or "NUL".
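
    The sanitizing step is left as a TODO in the code of this post. One
    possible approach, sketched in Python 3 under my own assumptions (the
    function name sanitize_filename(), the '_' replacement and the
    'attachment' fallback are all arbitrary choices), reuses the same
    forbidden characters and reserved Windows names:

```python
import re

# Reserved device names on Windows, compared case-insensitively.
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)
# Same characters as invalid_chars_in_filename, plus control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*%\'\x00-\x1f]')

def sanitize_filename(filename, replacement="_", default="attachment"):
    """Return a name that should be safe to create on common filesystems."""
    name = INVALID_CHARS.sub(replacement, filename).strip(" .")
    if not name:
        return default
    stem = name.split(".", 1)[0]
    if stem.upper() in WINDOWS_RESERVED:
        name = replacement + name  # 'NUL.txt' becomes '_NUL.txt'
    return name

print(sanitize_filename("../etc/passwd"))  # _etc_passwd
print(sanitize_filename("NUL.txt"))        # _NUL.txt
print(sanitize_filename("a<b>:c.txt"))     # a_b__c.txt
```

    This does not handle filename collisions or the filesystem charset;
    both still need care before saving.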

    My function get_mail_contents() returns a list of Attachment objects
    with related attributes. When an attachment is of type text/*, the
    payload must be decoded using the charset if set. Be careful, the
    charset is not always accurate! Use the function decode_text().
    Attachments holding the message content can be found using the
    is_body attribute.

    The code

    The code includes pieces from the first part and can be downloaded here.
    Run without parameters, it parses the embedded sample. The path of a
    saved raw email can be given as an argument. Here is the output for the
    embedded sample :

    Subject: u'Dr. Pointcarr\xe9'
    From: (u'Alain Spineux', '')
    To: [(u'Dr. Pointcarr\xe9', '')]
    filename=None is_body=text/plain type=text/plain charset=ISO-8859-1 desc=None size=12
    Hello World
    filename=None is_body=text/html type=text/html charset=ISO-8859-1 desc=None size=21
    filename=u'smile.png' is_body=None type=image/png charset=None desc=None size=473

    And some explanation :

    def search_message_bodies(mail)

    This function navigates the MIME tree of the mail to retrieve the
    parts that contain the message, along with their format. It returns
    something like this :

    { 'text/plain' : <email.message.Message instance at 0xXXXX>,
      'text/html' : <email.message.Message instance at 0xYYYY> }

    And now the code

    import sys, os, re, StringIO
    import email, mimetypes

    invalid_chars_in_filename='<>:"/\\|?*\%\''+reduce(lambda x,y:x+chr(y), range(32), '')
    invalid_windows_name='CON PRN AUX NUL COM1 COM2 COM3 COM4 COM5 COM6 COM7 COM8 COM9 LPT1 LPT2 LPT3 LPT4 LPT5 LPT6 LPT7 LPT8 LPT9'.split()

    # email address REGEX matching the RFC 2822 spec from perlfaq9
    # my $atom = qr{[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+};
    # my $dot_atom = qr{$atom(?:\.$atom)*};
    # my $quoted = qr{"(?:\\[^\r\n]|[^\\"])*"};
    # my $local = qr{(?:$dot_atom|$quoted)};
    # my $domain_lit = qr{\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]};
    # my $domain = qr{(?:$dot_atom|$domain_lit)};
    # my $addr_spec = qr{$local\@$domain};
    #
    # Python's translation

    atom_rfc2822=r"[a-zA-Z0-9_!#\$\%&'*+/=?\^`{}~|\-]+"
    atom_posfix_restricted=r"[a-zA-Z0-9_#\$&'*+/=?\^`{}~|\-]+" # without '!' and '%'
    atom=atom_rfc2822
    dot_atom=atom + r"(?:\." + atom + ")*"
    quoted=r'"(?:\\[^\r\n]|[^\\"])*"'
    local="(?:" + dot_atom + "|" + quoted + ")"
    domain_lit=r"\[(?:\\\S|[\x21-\x5a\x5e-\x7e])*\]"
    domain="(?:" + dot_atom + "|" + domain_lit + ")"
    addr_spec=local + "\@" + domain

    email_address_re=re.compile('^'+addr_spec+'$')

    class Attachment:
        def __init__(self, part, filename=None, type=None, payload=None,
                     charset=None, content_id=None, description=None,
                     disposition=None, sanitized_filename=None, is_body=None):
            self.part=part # original python part
            self.filename=filename # filename in unicode (if any)
            self.type=type # the mime-type
            self.payload=payload # the MIME decoded content
            self.charset=charset # the charset (if any)
            self.description=description # if any
            self.disposition=disposition # 'inline', 'attachment' or None
            self.sanitized_filename=sanitized_filename # cleanup your filename here (TODO)
            self.is_body=is_body # usually in (None, 'text/plain' or 'text/html')
            self.content_id=content_id # if any
            if self.content_id:
                # strip '<>' to ease search and replace in "root" content (TODO)
                if self.content_id.startswith('<') and self.content_id.endswith('>'):
                    self.content_id=self.content_id[1:-1]

    def getmailheader(header_text, default="ascii"):
        """Decode header_text if needed"""
        try:
            headers=email.Header.decode_header(header_text)
        except email.Errors.HeaderParseError:
            # This already happened in email.base64mime.decode();
            # return a sanitized ascii string instead.
            # It fails on this header value (one string, the three
            # encoded words follow each other without any space):
            #   '=?UTF-8?B?15HXmdeh15jXqNeVINeY15DXpteUINeTJ9eV16jXlSDXkdeg15XXldeUINem15PXpywg15TXptei16bXldei15nXnSDXqdecINek15zXmdeZ?=
            #    =?UTF-8?B?157XldeR15nXnCwg157Xldek16Ig157Xl9eV15wg15HXodeV15bXnyDXk9ec15DXnCDXldeh15gg157Xl9eR16rXldeqINep15wg15HXmdeQ?=
            #    =?UTF-8?B?15zXmNeZ?='
            return header_text.encode('ascii', 'replace').decode('ascii')
        else:
            for i, (text, charset) in enumerate(headers):
                try:
                    headers[i]=unicode(text, charset or default, errors='replace')
                except LookupError:
                    # if the charset is unknown, force default
                    headers[i]=unicode(text, default, errors='replace')
            return u"".join(headers)

    def getmailaddresses(msg, name):
        """retrieve addresses from header, 'name' supposed to be from, to, ..."""
        addrs=email.utils.getaddresses(msg.get_all(name, []))
        for i, (name, addr) in enumerate(addrs):
            if not name and addr:
                # only one string! Is it the address or is it the name ?
                # use the same for both and see later
                name=addr

            try:
                # address must be ascii only
                addr=addr.encode('ascii')
            except UnicodeError:
                addr=''
            else:
                # address must match address regex
                if not email_address_re.match(addr):
                    addr=''
            addrs[i]=(getmailheader(name), addr)
        return addrs

    def get_filename(part):
        """Many mail user agents send attachments with the filename in
        the 'name' parameter of the 'content-type' header instead
        of in the 'filename' parameter of the 'content-disposition' header.
        """
        filename=part.get_param('filename', None, 'content-disposition')
        if not filename:
            filename=part.get_param('name', None) # default is 'content-type'

        if filename:
            # RFC 2231 must be used to encode parameters inside MIME header
            filename=email.Utils.collapse_rfc2231_value(filename).strip()

        if filename and isinstance(filename, str):
            # But a lot of MUA erroneously use RFC 2047 instead of RFC 2231
            # in fact almost everybody misuses RFC 2047 here !!!
            filename=getmailheader(filename)

        return filename

    def _search_message_bodies(bodies, part):
        """recursive search of the multiple versions of the 'message' inside
        the message structure of the email, used by search_message_bodies()"""

        type=part.get_content_type()
        if type.startswith('multipart/'):
            # explore only True 'multipart/*'
            # because 'message/rfc822' are also python 'multipart'
            if type=='multipart/related':
                # the first part or the one pointed by start
                start=part.get_param('start', None)
                related_type=part.get_param('type', None)
                for i, subpart in enumerate(part.get_payload()):
                    if (not start and i==0) or (start and start==subpart.get('Content-Id')):
                        _search_message_bodies(bodies, subpart)
                        return
            elif type=='multipart/alternative':
                # all parts are candidates and latest is best
                for subpart in part.get_payload():
                    _search_message_bodies(bodies, subpart)
            elif type in ('multipart/report', 'multipart/signed'):
                # only the first part is candidate
                try:
                    subpart=part.get_payload()[0]
                except IndexError:
                    return
                else:
                    _search_message_bodies(bodies, subpart)
                    return

            elif type=='multipart/signed':
                # cannot handle this
                # (never reached: 'multipart/signed' is handled above)
                return

            else:
                # unknown types must be handled as 'multipart/mixed'
                # This piece of code could probably be improved, I use a heuristic :
                # - if not already found, use first valid non 'attachment' parts found
                for subpart in part.get_payload():
                    tmp_bodies=dict()
                    _search_message_bodies(tmp_bodies, subpart)
                    for k, v in tmp_bodies.iteritems():
                        if not subpart.get_param('attachment', None, 'content-disposition')=='':
                            # if not an attachment, initiate value if not already found
                            bodies.setdefault(k, v)
                return
        else:
            bodies[part.get_content_type().lower()]=part
            return

        return

    def search_message_bodies(mail):
        """search message content into a mail"""
        bodies=dict()
        _search_message_bodies(bodies, mail)
        return bodies

    def get_mail_contents(msg):
        """split an email in a list of attachments"""

        attachments=[]

        # retrieve messages of the email
        bodies=search_message_bodies(msg)
        # reverse bodies dict
        parts=dict((v,k) for k, v in bodies.iteritems())

        # organize the stack to handle depth-first search
        stack=[ msg, ]
        while stack:
            part=stack.pop(0)
            type=part.get_content_type()
            if type.startswith('message/'):
                # ('message/delivery-status', 'message/rfc822', 'message/disposition-notification'):
                # I don't want to explore the tree deeper here and just save source using msg.as_string()
                # but I don't use msg.as_string() because I want to use mangle_from_=False
                from email.Generator import Generator
                fp = StringIO.StringIO()
                g = Generator(fp, mangle_from_=False)
                g.flatten(part, unixfrom=False)
                payload=fp.getvalue()
                filename='mail.eml'
                attachments.append(Attachment(part, filename=filename,
                    type=type, payload=payload, charset=part.get_param('charset'),
                    description=part.get('Content-Description')))
            elif part.is_multipart():
                # insert new parts at the beginning of the stack (depth-first search)
                stack[:0]=part.get_payload()
            else:
                payload=part.get_payload(decode=True)
                charset=part.get_param('charset')
                filename=get_filename(part)

                disposition=None
                if part.get_param('inline', None, 'content-disposition')=='':
                    disposition='inline'
                elif part.get_param('attachment', None, 'content-disposition')=='':
                    disposition='attachment'

                attachments.append(Attachment(part, filename=filename,
                    type=type, payload=payload, charset=charset,
                    content_id=part.get('Content-Id'),
                    description=part.get('Content-Description'),
                    disposition=disposition, is_body=parts.get(part)))

        return attachments

    def decode_text(payload, charset, default_charset):
        """return (decoded text, charset used), trying several charsets"""
        if charset:
            try:
                return payload.decode(charset), charset
            except (UnicodeError, LookupError):
                # wrong charset, or charset name unknown to Python
                pass

        if default_charset and default_charset!='auto':
            try:
                return payload.decode(default_charset), default_charset
            except (UnicodeError, LookupError):
                pass

        for chset in [ 'ascii', 'utf-8', 'utf-16', 'windows-1252', 'cp850' ]:
            try:
                return payload.decode(chset), chset
            except UnicodeError:
                pass

        return payload, None

    if __name__ == "__main__":

        raw="""MIME-Version: 1.0
    Received: by 10.229.233.76 with HTTP; Sat, 2 Jul 2011 04:30:31 -0700 (PDT)
    Date: Sat, 2 Jul 2011 13:30:31 +0200
    Delivered-To:
    Message-ID: <CAAJL_=kPAJZ=fryb21wBOALp8-XOEL->
    Subject: =?ISO-8859-1?Q?Dr.=20Pointcarr=E9?=
    From: Alain Spineux <>
    To: =?ISO-8859-1?Q?Dr=2E_Pointcarr=E9?= <>
    Content-Type: multipart/mixed; boundary=mixed

    --mixed
    Content-Type: multipart/alternative; boundary=alternative

    --alternative
    Content-Type: text/plain; charset=ISO-8859-1

    Hello World

    --alternative
    Content-Type: text/html; charset=ISO-8859-1

    Hello World<br>
    <br>

    --alternative--
    --mixed
    Content-Type: image/png; name="smile.png"
    Content-Disposition: attachment; filename="smile.png"
    Content-Transfer-Encoding: base64

    iVBORw0KGgoAAAANSUhEUgAAAA4AAAAOBAMAAADtZjDiAAAAMFBMVEUQEAhaUjlaWlp7e3uMezGU
    hDGcnJy1lCnGvVretTnn5+/3pSn33mP355T39+//75SdwkyMAAAACXBIWXMAAA7EAAAOxAGVKw4b
    AAAAB3RJTUUH2wcJDxEjgefAiQAAAAd0RVh0QXV0aG9yAKmuzEgAAAAMdEVYdERlc2NyaXB0aW9u
    ABMJISMAAAAKdEVYdENvcHlyaWdodACsD8w6AAAADnRFWHRDcmVhdGlvbiB0aW1lADX3DwkAAAAJ
    dEVYdFNvZnR3YXJlAF1w/zoAAAALdEVYdERpc2NsYWltZXIAt8C0jwAAAAh0RVh0V2FybmluZwDA
    G+aHAAAAB3RFWHRTb3VyY2UA9f+D6wAAAAh0RVh0Q29tbWVudAD2zJa/AAAABnRFWHRUaXRsZQCo
    7tInAAAAaElEQVR4nGNYsXv3zt27TzHcPup6XDBmDsOeBvYzLTynGfacuHfm/x8gfS7tbtobEM3w
    n2E9kP5n9N/oPZA+//7PP5D8GSCYA6RPzjlzEkSfmTlz+xkgffbkzDlAuvsMWAHDmt0g0AUAmyNE
    wLAIvcgAAAAASUVORK5CYII=
    --mixed--
    """

        if len(sys.argv)>1:
            raw=open(sys.argv[1]).read()

        msg=email.message_from_string(raw)
        attachments=get_mail_contents(msg)

        subject=getmailheader(msg.get('Subject', ''))
        from_=getmailaddresses(msg, 'from')
        from_=('', '') if not from_ else from_[0]
        tos=getmailaddresses(msg, 'to')

        print 'Subject: %r' % subject
        print 'From: %r' % (from_, )
        print 'To: %r' % (tos, )

        for attach in attachments:
            # don't forget to sanitize 'filename' and to check
            # for filename collisions before saving :
            print '\tfilename=%r is_body=%s type=%s charset=%s desc=%s size=%d' % (
                attach.filename, attach.is_body, attach.type,
                attach.charset, attach.description,
                0 if attach.payload==None else len(attach.payload))

            if attach.is_body=='text/plain':
                # print first 3 lines
                payload, used_charset=decode_text(attach.payload, attach.charset, 'auto')
                for line in payload.split('\n')[:3]:
                    # be careful, the console may be unable to display unicode characters
                    if line:
                        print '\t\t', line
    aspineux, Jul 15, 2011

  2. aspineux wrote:

    > Hello
    >
    > I have written part 2 about parsing email.
    >
    > You can find the article here :
    >
    > http://blog.magiksys.net/parsing-email-using-python-content
    >
    > This part is a lot longer :


    Wow! Well done.

    Thank you for sharing this, this is extremely detailed and useful.



    --
    Steven
    Steven D'Aprano, Jul 16, 2011
