Handling emails

TheSaint

Hello
I wrote a program that worked on Python 2.x. I'd like to move to a newer
version, but I'm having trouble with how emails are parsed.
In particular, I'd like to extract the significant parts of the headers, but
the responses from the servers have turned into lists of bytes.
What method would parse the headers and return them as ASCII if I pass them
in as bytes? I don't even know whether I can pass them in as they arrive in
the program.

For example if I try:

import poplib
_pop = poplib.POP3(srvr)
_pop.user(args[1])
_pop.pass_(args[2])

header = _pop.top(nmuid, 0)

This returns a list of byte strings, and I have no idea how to process
them in order to get a dictionary with
'from', 'to', 'cc', 'bcc', 'date', 'subject', 'reply-to', 'message-id'
as keys.
 
Steven D'Aprano

To parse emails, you should use the email package. It already handles
bytes and strings.
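
As a rough illustration of that suggestion, here is a minimal sketch that feeds header lines to the email package and builds the dictionary asked for above. The sample lines are made up, and they are assumed to have the same shape as the second element of the tuple returned by poplib.POP3.top(), that is, a list of bytes without line endings. The helper name headers_to_dict is hypothetical.

```python
import email
from email import policy

# Made-up sample data, assumed to look like the list of bytes lines
# found in the second element of the tuple from poplib.POP3.top().
sample_lines = [
    b"From: Alice <alice@example.com>",
    b"To: bob@example.com",
    b"Subject: Hello",
    b"Message-ID: <1234@example.com>",
    b"",
]

WANTED = ("from", "to", "cc", "bcc", "date", "subject", "reply-to", "message-id")


def headers_to_dict(lines):
    """Parse a list of bytes header lines into a dict of selected headers."""
    msg = email.message_from_bytes(b"\r\n".join(lines), policy=policy.default)
    # Header lookup on a message object is case-insensitive;
    # headers that are absent are simply left out of the result.
    return {key: str(msg[key]) for key in WANTED if msg[key] is not None}


headers = headers_to_dict(sample_lines)
print(headers["subject"])
```

With a live connection, the list would presumably come from something like _pop.top(nmuid, 0)[1] instead of sample_lines.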

Other than that, I'm not entirely sure I understand your problem. In
general, if you have some bytes, you can decode it into a string by hand:
'To: (e-mail address removed)\n'
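
Decoding by hand, as mentioned above, might look like the following minimal sketch. The byte strings are invented for illustration; the second one shows one possible way to cope with a line that is not valid ASCII.

```python
# A made-up header line as bytes, similar to one element of the list
# that poplib returns.
raw = b"Subject: Hello\r\n"

# Plain ASCII decodes directly.
text = raw.decode("ascii")

# A line containing a raw non-ASCII byte would raise UnicodeDecodeError
# with the call above; errors="replace" substitutes U+FFFD instead.
bad = b"Subject: caf\xe9"
tolerant = bad.decode("ascii", errors="replace")

print(repr(text))
print(repr(tolerant))
```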


If this is not what you mean, perhaps you should give an example of what
header looks like, what you hope to get, and a concrete example of how it
differs in Python 3.
 
TheSaint

First of all: thanks for the reply.

Steven D'Aprano wrote:
To parse emails, you should use the email package. It already handles
bytes and strings.

I've read a lot of information this afternoon, but most of it led to errors.
That could be the fault of my ignorance :)
From what I could work out, I decided to write my own code.

def msg_parser(listOfBytes):
    header = {}
    for lin in listOfBytes:
        try:
            line = lin.decode()
        except UnicodeDecodeError:
            continue
        for key in _FULLhdr:
            if key in line:
                header[key] = line
                continue
    return header

listOfBytes is the header content, which is the second element of the tuple
returned by poplib.POP3.top(num_msg, how_much).

However, some lines fail to decode correctly. I can't imagine why emails
don't comply with a standard.
Steven D'Aprano wrote:
Other than that, I'm not entirely sure I understand your problem.

I see. I haven't learned good English yet :p. I'm Italian :)

Steven D'Aprano wrote:
In general, if you have some bytes, you can decode it into a string by hand:
'To: (e-mail address removed)\n'

I know this, but for the entire message header and envelope it's not
applicable.
The libraries handling emails and their headers seem very confusing to me,
and I suppose I should take a different, smaller approach.

I'll try to show a header (if the content doesn't breach privacy), but unlike
the example above, the result of *_pop.top(nmuid, 0)* doesn't fit your example.

Steven D'Aprano wrote:
If this is not what you mean, perhaps you should give an example of what
header looks like

The difference is that the previous version returned text strings, and the
subsequent processing is based on string manipulation.
Just to mention, my program reads headers from a POP3 or IMAP4 server and
applies some regex filtering in order to remove unwanted emails from the
server. All the filters treat the input as ASCII character strings.

I ran my modules through 2to3 to convert them to the newer Python, but at
the first run it complained that the downloaded header is not a string.
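
One possible way to keep str-based regex filters working is to decode each line before filtering. This is only a sketch under assumptions: the header lines and the filter pattern are made up, and latin-1 is chosen simply because it can decode any byte sequence without raising.

```python
import re

# Made-up header lines, shaped like the list of bytes that poplib returns.
raw_lines = [
    b"From: spammer@example.com",
    b"Subject: Cheap watches",
]

# latin-1 maps every byte to a code point, so this decode never raises.
text_lines = [line.decode("latin-1") for line in raw_lines]

# Hypothetical filter; a str pattern can only be applied to str input.
spam_pattern = re.compile(r"^Subject:.*cheap", re.IGNORECASE)

is_spam = any(spam_pattern.search(line) for line in text_lines)
print(is_spam)
```

The alternative would be to compile bytes patterns (for example rb"^Subject:") and match against the undecoded lines, at the cost of touching every filter.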
 
Nobody

However, some lines fail to decode correctly. I can't imagine why emails
don't comply with a standard.

Headers should be in ASCII; non-ASCII characters should be encoded
using quoted-printable and/or base64 encoding (the RFC 2047 encoded-word
syntax).
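
For headers encoded that way, the standard library can turn them back into readable text. A minimal sketch, using two made-up Subject values (one base64, one quoted-printable):

```python
from email.header import decode_header, make_header

# Made-up header values using RFC 2047 encoded-words.
b64_value = "=?utf-8?b?Q2Fmw6k=?="      # base64, UTF-8
qp_value = "=?iso-8859-1?q?Caf=E9?="    # quoted-printable, ISO-8859-1

# decode_header() splits the value into (bytes, charset) parts;
# make_header() reassembles them into a Header that str() renders as text.
decoded_b64 = str(make_header(decode_header(b64_value)))
decoded_qp = str(make_header(decode_header(qp_value)))

print(decoded_b64, decoded_qp)
```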

Any message with non-ASCII characters in the headers can safely be
discarded as spam (I've never seen this bug in "legitimate" email).
Many MTAs will simply reject such messages.

The message body can be in any encoding, or in multiple encodings (e.g.
for multipart/mixed content), or none (e.g. the body may be binary data
rather than text).
 
