Python:Email and Header Parsing: Some Help


dont bother

Hi,
I have written this small piece of code. I am brand
new to Python. I asked some people for
help, but unfortunately not many could help.
Here is the code I have:

import email
import os
import sys
fread = open('email_message', 'r')
msg=email.message_from_file(fread)
print msg
#fwrite = open('output','w')
#fwrite.write(msg)

This way I am able to print the entire email message
to stdout. The program raises an error if I try
to write the output to a file: it says the argument
(here msg) should be a string, not an instance.
How do I write the message to another file, then?

2. I have so many headers in the email message

To:
From:
X Received:
X Priority:
Subject:
etc etc.
I want to parse the headers and the message body
separately. Does anyone have example code that deals
with the Parser?
I also want to remove redundant words and all HTML
tags. Any advice on that?
I saw some examples using HTMLGen, but I don't have
HTMLGen with Python on my machine. I have Python
2.3.3 on my machine.

All help is greatly appreciated.
Thanks
Dont

 

Paul McGuire

dont bother said:
I want to parse the headers and the message body
separately. Does anyone have example code that deals
with the Parser?
Here is a spam cleaner that I run several times a day. My ISP runs Symantec
on its end and tags suspect e-mails with virus-scan headers. This program
looks for those tags and auto-deletes any Klez- or Swen-infected e-mails.


import poplib, re

# Change this to your needs
POPHOST = "pop-server.austin.rr.com"
POPUSER = "xyzzy"
POPPASS = "ajsdlfjslfkj"

# reg expressions for extracting header data
re_from = re.compile( "^From: (.*)" )
re_to = re.compile( "^To: (.*)" )
re_subject = re.compile( "^Subject: (.*)" )
re_virusresult = re.compile( "^X-Virus-Scan-Result: (.*)" )

def showMessage( msgHdr ):
    out = ( msgHdr["msgnum"], msgHdr["From"], msgHdr["Subject"],
            msgHdr["Virus"] )
    print "%3d. %-30.30s %-24.24s %-24.24s" % out

def scanMailboxMsgs():
    "scan the mailbox and delete virus-tagged messages"
    global deleteCount

    # set before the try block so the check at the end never
    # hits an undefined name if the connection attempt fails
    connected = False

    try:
        # log in to mail box
        pop = poplib.POP3(POPHOST)
        pop.user(POPUSER)
        pop.pass_( POPPASS)
        connected = True

        # retrieve msg headers
        msgCount, msgTotalSize = pop.stat()

        emptyHdr = {
            "From" : "",
            "To" : "",
            "Subject" : "",
            "Virus" : "none"
            }
        matchREs = [
            ( re_from, "From" ),
            ( re_to, "To" ),
            ( re_subject, "Subject" ),
            ( re_virusresult, "Virus" )
            ]

        # for each message, display header info
        for n in range( msgCount ):
            msgnum = n+1 # msg nums are 1-based, not 0-based

            # Retrieve message header
            response, headerLines, bytes = pop.top(msgnum, 0)

            hdr = emptyHdr.copy()
            hdr["msgnum"] = msgnum
            hdr["size"] = bytes
            for line in headerLines:
                for reExpr,hdrField in matchREs:
                    match = reExpr.match( line )
                    if match:
                        hdr[ hdrField ] = match.group(1).strip('"')

            # auto-delete any msgs tagged with the W32.Swen or W32.Klez virus
            if hdr["Virus"].count("W32.Swen") > 0 or \
               hdr["Virus"].count("W32.Klez") > 0:
                showMessage( hdr )
                pop.dele(msgnum)
                deleteCount += 1

    except poplib.error_proto, detail:
        print "POP3 error:", detail

    if connected :
        pop.quit()


# ============= main script ===============
deleteCount = 0
scanMailboxMsgs()
print "Deleted", deleteCount, "messages"

raw_input( "Press <return> to continue" )
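
The header-matching step in the script above can be tried without a mail server. What follows is only a minimal sketch, written for Python 3 rather than the 2.x syntax used in this thread. The names extract_headers and is_infected and the sample header lines are made up for illustration and are not part of the original program. It also assumes the header lines are already text strings; under Python 3, poplib hands back bytes, which would need decoding first.

```python
import re

# Same patterns as the script above, paired with the field each one fills.
HEADER_PATTERNS = [
    (re.compile(r"^From: (.*)"), "From"),
    (re.compile(r"^To: (.*)"), "To"),
    (re.compile(r"^Subject: (.*)"), "Subject"),
    (re.compile(r"^X-Virus-Scan-Result: (.*)"), "Virus"),
]

# Virus names the original script deletes on.
VIRUS_NAMES = ("W32.Swen", "W32.Klez")


def extract_headers(header_lines):
    """Hypothetical helper: collect the four fields from a list of header lines."""
    hdr = {"From": "", "To": "", "Subject": "", "Virus": "none"}
    for line in header_lines:
        for pattern, field in HEADER_PATTERNS:
            match = pattern.match(line)
            if match:
                hdr[field] = match.group(1).strip('"')
    return hdr


def is_infected(hdr):
    """Hypothetical helper: True if the virus tag names Swen or Klez."""
    return any(name in hdr["Virus"] for name in VIRUS_NAMES)


# Made-up header lines standing in for what pop.top() would return.
sample = [
    "From: someone@example.com",
    "To: xyzzy@example.com",
    "Subject: Important update",
    "X-Virus-Scan-Result: Repaired 1234 W32.Swen.A@mm",
]
hdr = extract_headers(sample)
print(hdr["Subject"], is_infected(hdr))
```

Keeping the matching separate from the POP3 calls makes it easy to check the patterns against saved headers before letting the script delete anything.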
 

David M. Cooke

At some point, dont bother said:
Hi,
I have written this small piece of code. I am brand
new to Python. I asked some people for
help, but unfortunately not many could help.
Here is the code I have:

import email
import os
import sys
fread = open('email_message', 'r')
msg=email.message_from_file(fread)
print msg
#fwrite = open('output','w')
#fwrite.write(msg)

This way I am able to print the entire email message
to stdout. The program raises an error if I try
to write the output to a file: it says the argument
(here msg) should be a string, not an instance.
How do I write the message to another file, then?

msg here isn't a string; it's an email.Message object. The print
statement works because print calls str() on the objects passed to it.

You want

    fwrite = open('output', 'w')
    fwrite.write( msg.as_string() )
    fwrite.close()

I didn't use str(msg) here, as that defaults to
msg.as_string(unixfrom=True). Which one to use depends on whether or not
you want the Unix "From " envelope line at the top of the output.
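
To make the difference between the two calls concrete, here is a small sketch. It is written for Python 3, not the 2.3 interpreter discussed in this thread, and the sample message and the save_message helper are invented for illustration. It parses a message from a string, flattens it with as_string(), and writes the result to a file. As far as I know, str(msg) on Python 3 no longer adds the envelope line by default, so passing unixfrom explicitly is the safer way to get the behaviour you want.

```python
import email

# Made-up message used only for this example.
RAW = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hi\n"
    "\n"
    "Hello Bob.\n"
)


def save_message(raw_text, path, unixfrom=False):
    """Hypothetical helper: parse raw_text and write the flattened message to path."""
    msg = email.message_from_string(raw_text)
    # as_string() returns a plain string, which is what file.write() expects.
    text = msg.as_string(unixfrom=unixfrom)
    with open(path, "w") as out:
        out.write(text)
    return text
```

Calling save_message(RAW, "output") writes the headers and body as they were parsed. With unixfrom=True the output gains a leading "From " envelope line, which mbox-style mail files use to separate messages.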
2. I have so many headers in the email message

To:
From:
X Received:
X Priority:
Subject:
etc etc.
I want to parse the headers and the message body
separately. Does anyone have example code that deals
with the Parser?

I'm not sure what you want -- email.message_from_file produces a Message
object, which already splits out the headers from the body. You can
then iterate over the headers. For example, to strip out the optional
headers (those starting with 'X-'):

    for hdr in msg.keys():
        if hdr.startswith('X-'):
            del msg[hdr]
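
As a fuller illustration of the headers/body split, here is a minimal sketch for Python 3 (the thread itself uses 2.3). The sample message and the helper names split_message and strip_x_headers are assumptions made up for this example. It relies on Message.items() for the headers and get_payload() for the body, which returns a string for a simple, non-multipart message and a list of sub-messages for a multipart one.

```python
import email

# Made-up message used only for this example.
RAW = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "X-Priority: 3\n"
    "Subject: Hi\n"
    "\n"
    "Hello Bob.\n"
)


def split_message(raw_text):
    """Hypothetical helper: return the Message, its headers, and its body."""
    msg = email.message_from_string(raw_text)
    headers = msg.items()      # list of (name, value) pairs, in original order
    body = msg.get_payload()   # a string here, because the message is not multipart
    return msg, headers, body


def strip_x_headers(msg):
    """Hypothetical helper: delete every header whose name starts with 'X-'."""
    # Copy the names first so deletion never interferes with the iteration.
    for name in list(msg.keys()):
        if name.lower().startswith("x-"):
            del msg[name]
    return msg


msg, headers, body = split_message(RAW)
for name, value in headers:
    print(name, "=", value)
print(body)
```

The lower() call is a small extension of the loop above so that headers written as "x-priority" are caught as well.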
I also want to remove redundant words and all HTML
tags. Any advice on that?
I saw some examples using HTMLGen, but I don't have
HTMLGen with Python on my machine. I have Python
2.3.3 on my machine.

HTMLGen won't work, as that generates HTML (hence the name...). To
strip out the HTML tags, probably a regular expression would be
sufficient. Otherwise, have a look at HTMLParser (in the standard library).
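
For the HTMLParser route, a rough sketch might look like the following. It targets Python 3, where the module lives at html.parser (on Python 2 it is the top-level HTMLParser module), and the TagStripper class and strip_tags function are names invented for this example. It simply collects the text between tags, so the contents of script or style elements would still come through as text.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps the text between tags and drops the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found outside of tags.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def strip_tags(html_text):
    """Return html_text with the markup removed."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_tags("<p>Hello <b>world</b> &amp; all</p>"))
```

A regular expression is shorter, but a parser copes better with tags split across lines or attributes that contain a ">" character.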
 
