Standard module for parsing emails?

Phillip B Oldham

Is there a standard library for parsing emails that can cope with the
different ways email clients quote?
 
Diez B. Roggisch

Phillip said:
Is there a standard library for parsing emails that can cope with the
different way email clients quote?

AFAIK not, since unfortunately the quoting style is something the user can
configure, and thus no atrocity is unimaginable. It's hard to write a module
for that...

All you can try is to apply a heuristic like "if a run of lines all starts
with the same prefix, and that prefix contains non-alphanumeric characters,
treat those lines as quoted".
But then if the user configures to quote using

XX

you're doomed...



Diez
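
A minimal sketch of the prefix heuristic Diez describes, assuming the message body is already available as plain text. The name `guess_quote_prefix` and the `min_lines` threshold are made up for illustration, not part of any library. As predicted, a purely alphanumeric prefix such as "XX" defeats it.

```python
import re
from collections import Counter

# A leading run of characters that are neither alphanumeric nor whitespace,
# e.g. ">", "|" or "#:". This pattern is an assumption, not a standard.
PREFIX_RE = re.compile(r"^([^\w\s]+)")


def guess_quote_prefix(body, min_lines=2):
    """Return the most common non-alphanumeric line prefix, or None.

    Hypothetical helper: a prefix only counts as a quote marker when at
    least `min_lines` lines start with it.
    """
    counts = Counter()
    for line in body.splitlines():
        match = PREFIX_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    if not counts:
        return None
    prefix, seen = counts.most_common(1)[0]
    return prefix if seen >= min_lines else None
```

Signature separators, bullet characters and similar punctuation at the start of a line are counted too, so this is a starting point rather than a reliable detector.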
 
Thomas Guettler

Phillip said:
Is there a standard library for parsing emails that can cope with the
different way email clients quote?

What do you mean by "quote" here?
1. Encoding utf8/latin1 text as ASCII?
2. The prefix on quoted text, like your text above in my mail?

Thomas
 
Phillip B Oldham

Thomas said:
What do you mean by "quote" here?
2. The prefix on quoted text, like your text above in my mail?

Basically, just be able to parse an email into its new and "quoted"
parts, the quoted parts being lines which have been prefixed to set them
off as text from a previous email.

Most clients use ">" which is easy to check for, but I've seen some
which use "|" and some which *don't* quote at all. It's causing us
nightmares in parsing responses to system-generated emails. I was
hoping someone might've seen the problem previously and released some
code.
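
For the easy cases mentioned here, a rough sketch might look like the following. It assumes the body is plain text and that only ">" and "|" mark quoted lines; `split_reply` and `QUOTE_PREFIXES` are hypothetical names. Clients that don't quote at all will simply yield everything as new text.

```python
# Quote markers taken from the post above; extend as needed.
QUOTE_PREFIXES = (">", "|")


def split_reply(body, prefixes=QUOTE_PREFIXES):
    """Split a plain-text body into (new_text, quoted_text).

    A line counts as quoted when, ignoring leading whitespace, it starts
    with one of `prefixes`. Everything else is treated as new text.
    """
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        if line.lstrip().startswith(prefixes):
            quoted_lines.append(line)
        else:
            new_lines.append(line)
    return "\n".join(new_lines), "\n".join(quoted_lines)
```

The order of lines within each part is preserved, but the interleaving of reply and quote is lost; keep the line indices instead if the position of each answer matters.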
 
Phillip B Oldham

If there isn't a standard library for parsing emails, is there one for
connecting to a pop/imap resource and reading the mailbox?
 
Maric Michaud

On Wednesday 30 July 2008 17:15:07, Phillip B Oldham wrote:
If there isn't a standard library for parsing emails, is there one for
connecting to a pop/imap resource and reading the mailbox?

Both ship with Python: the email module and poplib, both very well
documented in the official docs (with examples and all).

The email module is rather easy to use, and really powerful, but you'll need to
handle for yourself the many ways email clients compose a message, and broken PHP
webmails that don't respect the RFCs (notably about encoding)...
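
As a small illustration of the email module, here is a sketch that pulls the first text/plain part out of a raw message. The helper name `plain_text_body` is made up, and the fallback to UTF-8 for a missing or unknown charset is an assumption meant to cope with the kind of broken encodings mentioned above.

```python
from email import message_from_string


def plain_text_body(raw_message):
    """Return the first text/plain part of a raw message as a str."""
    msg = message_from_string(raw_message)
    # walk() yields the message itself and then every MIME sub-part.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes base64 / quoted-printable and returns bytes.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            try:
                return payload.decode(charset, errors="replace")
            except LookupError:
                # The declared charset is not one Python knows about.
                return payload.decode("utf-8", errors="replace")
    return ""
```

This only gets you the body text; deciding which lines of that text are quoted is still up to you.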
 
Aspersieman

Phillip said:
If there isn't a standard library for parsing emails, is there one for
connecting to a pop/imap resource and reading the mailbox?
The search [1] yielded these results:
1) http://docs.python.org/lib/module-email.html
2) http://www.devshed.com/c/a/Python/Python-Email-Libraries-SMTP-and-Email-Parsing/

I have used the email module very successfully.

Also you can try the following to connect to mailboxes:
1) poplib
2) smtplib (this one only sends mail; for reading an IMAP mailbox there is imaplib)
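
A hedged sketch of reading a mailbox with poplib and handing each message to the email module. The function names are hypothetical, and the host, user name and password are placeholders you would supply yourself. It assumes the server speaks POP3 over SSL on the default port.

```python
import poplib
from email import message_from_bytes


def fetch_messages(conn):
    """Download and parse every message from an open POP3 connection."""
    count, _mailbox_size = conn.stat()
    messages = []
    # POP3 message numbers start at 1.
    for number in range(1, count + 1):
        _response, lines, _octets = conn.retr(number)
        # retr() returns the message as a list of byte strings, one per line.
        messages.append(message_from_bytes(b"\r\n".join(lines)))
    return messages


def read_mailbox(host, user, password):
    """Log in to a POP3-over-SSL server and return its parsed messages."""
    conn = poplib.POP3_SSL(host)
    try:
        conn.user(user)
        conn.pass_(password)
        return fetch_messages(conn)
    finally:
        conn.quit()
```

Keeping the download loop separate from the login makes it possible to try the parsing against a stand-in connection object without touching a real server.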

For parsing the mails I would recommend pyparsing.


[1] http://www.google.com/search?client=opera&rls=en&q=python+email&sourceid=opera&ie=utf-8&oe=utf-8

Regards

Nicolaas

--

The three things to remember about Llamas:
1) They are harmless
2) They are deadly
3) They are made of lava, and thus nice to cuddle.
 
Maric Michaud

On Wednesday 30 July 2008 17:55:35, Aspersieman wrote:
For parsing the mails I would recommend pyparsing.

Why? The email module is a great parser, IMO.
 
Diez B. Roggisch

Maric said:
On Wednesday 30 July 2008 17:55:35, Aspersieman wrote:

Why? The email module is a great parser, IMO.

He is talking about parsing the *content*, not the email envelope and a
possible MIME body.

Diez
 
MRAB

Phillip said:
Basically, just be able to parse an email into its actual and "quoted"
parts - lines which have been prefixed to indent from a previous
email.

Most clients use ">" which is easy to check for, but I've seen some
which use "|" and some which *don't* quote at all. It's causing us
nightmares in parsing responses to system-generated emails. I was
hoping someone might've seen the problem previously and released some
code.

The problem is that sometimes lines might start with ">" for other
reasons, e.g. text copied from an interactive Python session, which
could occur in ... um ... _this_ newsgroup. :)
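
One possible way to reduce that particular false positive is to test whether the text after a ">>> " prompt is plausible Python. This is only a heuristic with a made-up function name: a triple-nested quote containing a single word such as "yes" would still be mistaken for code, because a bare name is valid Python.

```python
import codeop

PROMPT = ">>> "


def looks_like_python_prompt(line):
    """Guess whether `line` was pasted from an interactive Python session.

    True when the line starts with ">>> " and the rest is either a complete
    statement or the start of one (for example a `for` header).
    """
    if not line.startswith(PROMPT):
        return False
    source = line[len(PROMPT):]
    try:
        # Raises SyntaxError for text that cannot be Python; returns a code
        # object or None (incomplete input) otherwise.
        codeop.compile_command(source)
    except (SyntaxError, ValueError, OverflowError):
        return False
    return True
```

Lines that pass this test can then be excluded before applying any ">"-based quote detection.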
 
Maric Michaud

On Wednesday 30 July 2008 19:25:31, Diez B. Roggisch wrote:
He talks about parsing the *content*, not the email envelope and possible
mime-body.

Yes? I don't know what the OP wants to do with the content, but if it's just
filtering the lines beginning with a '>', pyparsing might be a bit
overkill.
 
Steven D'Aprano

Phillip said:
Most clients use ">" which is easy to check for, but I've seen some
which use "|" and some which *don't* quote at all. It's causing us
nightmares in parsing responses to system-generated emails. I was hoping
someone might've seen the problem previously and released some code.

My sympathies.

I've even seen clients that prefix new (unquoted) text with the quote
character ">".

Well, possibly it's not the mail client, but the user. Who knows?

I will sometimes quote text like this:

Something quoted.
[end quote]

But I'm writing for a human audience, not for a program.

The simple answer is that you can catch 90% of cases by checking for ">",
and another 1% by checking for "|". If the email contains HTML, I have
found that quoted text is sometimes in another colour. As for the rest,
well, sometimes even human beings can't easily determine what's quoted
and what isn't. Good luck getting a program to do it.

(Percentages are plucked out of thin air. YMMV.)
 
Steven D'Aprano

My sympathies.

I've even seen clients that prefix new (unquoted) text with the quote
character ">".


Well, this is a new one I've never seen before: found on the python-dev
mailing list, somebody who (apparently) marks quoted text by inserting a
bare quote character on an otherwise empty line after each line of text,
similar to this:

I've even seen clients that prefix new (unquoted) text with the quote
>
character ">".
>

The user in question seems to be using gmail. I suspect a PEBCAK error.
 
