Hey,
I want to map my documents to vectors in an
n-dimensional feature space.
The job has two parts:
Part 1. Take emails, parse them, and generate a
dictionary of words from the spam messages.
Part 2. When a new email arrives, parse it and look up
each of its words in my dictionary.
For example, suppose my dictionary has these words:
1. Hi
2. Bye
3. you
4. cola
5. pepsi
6. viagra
7. weight
Suppose a new email arrives, and after parsing it I get
the following body in a file:
"Hi How is viagra. Loose weight"
Total number of words = 6
I have to compare this body with the words in the
dictionary and create a feature vector like this:
Index --> word in the email
1 --> 'Hi'
6 --> 'viagra'
7 --> 'weight'
Feature vector for the email:
[1:1/6 6:1/6 7:1/6]
since each of these words appears only once and the
total number of words is 6.
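A minimal sketch of the mapping described above, to make the arithmetic concrete. The function name feature_vector is hypothetical, and two things are assumptions rather than part of the description: words are compared case-insensitively, and punctuation attached to a word (as in "viagra.") is trimmed before the lookup.

```python
import string
from collections import Counter


def feature_vector(text, dictionary):
    """Map text to {dictionary index: count / total words in text}.

    `dictionary` is a list of words; indexes are 1-based, as in the example.
    Assumption: matching ignores case and surrounding punctuation.
    """
    # Word -> 1-based index lookup table
    index = {word.lower(): i for i, word in enumerate(dictionary, 1)}
    # Split on whitespace, trim punctuation, drop tokens that end up empty
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return {}
    # Count only the words that appear in the dictionary
    counts = Counter(w for w in words if w in index)
    total = float(len(words))
    return {index[w]: n / total for w, n in counts.items()}


dictionary = ["Hi", "Bye", "you", "cola", "pepsi", "viagra", "weight"]
vec = feature_vector("Hi How is viagra. Loose weight", dictionary)
for i in sorted(vec):
    print("%d:%.4f" % (i, vec[i]))
```

For the example email this prints 1:0.1667, 6:0.1667 and 7:0.1667, one per line, which is 1/6 for each of the three matching indexes.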
-----------------------------------------------------
I have been able to do Part 1: get an email, parse it,
strip the HTML tags, get the payload and generate a
dictionary.
What I don't know is this:
a) How to strip blank spaces and characters like ^M
from my dictionary
b) How to remove numbers from my dictionary
c) How to remove the To, From and Message-ID headers
from my dictionary
The important ones, where I really need help, are:
d) How to compare the words from the payload of the
new email message, which I write to another file,
with the dictionary indexes
e) How to create the feature vector I described in
Part 2 above
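For points (a) and (b), one possible approach is sketched here. The helper name clean_tokens and the sample text are made up for illustration, and "remove numbers" is read as "drop any token that contains a digit", which is an assumption. Calling split() with no arguments splits on any run of whitespace, and that includes the carriage return that shows up as ^M.

```python
import re


def clean_tokens(text):
    """Split text on whitespace and drop tokens that contain a digit."""
    tokens = []
    # split() with no argument discards blanks, tabs, \n and \r (^M)
    for token in text.split():
        # Assumption: a token with any digit in it counts as a "number"
        if re.search(r"\d", token):
            continue
        tokens.append(token)
    return tokens


sample = "cheap  pills\r\ncall 5551234 now\r\n win 100k\t today "
print(clean_tokens(sample))
```

With the sample above, the tokens 5551234 and 100k are dropped and no token keeps a leading blank or a trailing ^M.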
I know these are not difficult; it is just a matter of
ignorance, because I am a newbie with Python. I chose
Python instead of Java because I heard parsing emails
is really easy in it, which turned out to be true.
I would really appreciate it if someone could give me a
hand with this.
Thanks
Doon't
-------------------------------------------------
My code for parsing emails and generating the dictionary
is below: emailparser.py and dictionary.py
--------------------------------------------------
#emailparser.py
#!/usr/local/bin/py
# Python 2 script: uses the print statement and mailbox.UnixMailbox
import sys
import mailbox, email, email.Errors, re

def parse_mail(msg):
    """Return the body of a non-multipart message with HTML tags removed."""
    if msg.is_multipart():
        # Multipart messages are not handled; give back an empty body
        return ''
    # Get the body of the message
    body = msg.get_payload()
    # Delete headers that are not wanted
    for hdr in msg.keys():
        if hdr.startswith('From'):
            del msg[hdr]
        if hdr.startswith('To'):
            del msg[hdr]
        if hdr.startswith('Received'):
            del msg[hdr]
        if hdr.startswith('X-'):
            del msg[hdr]
    # Strip HTML tags (anything between < and >) from the body
    body = re.sub(r'<[^>]*>', '', body)
    return body

if __name__ == '__main__':
    if len(sys.argv) == 1:
        print """
        emailparser.py MBOX_FILE
        """
        sys.exit(0)
    f = open(sys.argv[1], 'r')
    mbox = mailbox.UnixMailbox(f, email.message_from_file)
    f1 = open('output', 'w')
    num = 0
    while 1:
        num = num + 1
        try:
            msg = mbox.next()
        except email.Errors.HeaderParseError:
            print ('Current mail (num = ' + str(num) +
                   ') seems to have a parse error. Skipping')
            continue
        # next() returns None at the end of the mailbox; "not msg" would
        # also stop on a message that has no headers
        if msg is None:
            break
        if msg.is_multipart():
            print 'Skipping a multipart email (num = ' + str(num) + ')'
            continue
        s = parse_mail(msg)
        f1.write(s)
    f1.close()
    f.close()
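A small check related to point (c), written for a current Python 3 interpreter rather than the Python 2 of the listing above. The message text is invented for illustration. It shows that get_payload() on a non-multipart message returns only the body, so the From, To and Message-ID header lines are not part of what gets written out.

```python
import email

# Made-up single-part message: headers, a blank line, then the body
raw = (
    "From: sender@example.com\n"
    "To: receiver@example.com\n"
    "Message-ID: <123@example.com>\n"
    "Subject: hello\n"
    "\n"
    "Hi How is viagra. Loose weight\n"
)

msg = email.message_from_string(raw)
body = msg.get_payload()    # body text only, headers are not included
print(repr(body))
print(msg["Subject"])       # headers stay available through the message
```

If header-like lines still turn up in the dictionary, a likely (but unconfirmed) cause is that they occur inside the body text itself, for example in a forwarded or bounced message.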
#------------------------------------------------------
#dictionary.py
# Python 2 code for creating a dictionary of words
import sys

try:
    fread = open(sys.argv[1], 'r')
except IOError:
    print "Can't open file for reading"
    sys.exit(1)
print 'Okay reading the file'

# Find the file size so the loop knows where to stop
fread.seek(0, 2)
c = fread.tell()
fread.seek(0)

# Open the output once instead of once per word
fwrite = open('dictionary', 'a')
s = ""
a = fread.read(1)
while fread.tell() != c:
    s = s + a
    a = fread.read(1)
    # Note: the separator read here is still in "a", so it becomes the
    # first character of the next word on the next pass
    if a == '\012':    # newline: the word collected so far is complete
        fwrite.write(s)
        s = ""
    if a == '\040':    # blank: write the word on its own line
        fwrite.write(s)
        fwrite.write("\n")
        s = ""
print 'Wrote to Dictionary\n'
fwrite.close()
fread.close()
#------------------------------------------------------
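For comparison, a shorter way to get one word per dictionary entry is sketched below. It is not the listing above: the function name build_dictionary is hypothetical, and keeping each word only once, in first-seen order, is an assumption about what the dictionary should contain.

```python
def build_dictionary(text):
    """Return the unique whitespace-separated words of text, in first-seen order."""
    seen = set()
    words = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


words = build_dictionary("Hi you\r\nBye  you cola\n")
# Number the entries from 1, like the dictionary in the example
for i, word in enumerate(words, 1):
    print("%d. %s" % (i, word))
```

The position of a word in the returned list, plus one, can then serve as its dictionary index.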