Almost Done: Need some Help in Generating FEATURE VECTORS

dont bother

Hey,

I want to map my documents to vectors in an n-dimensional
feature space.

All I have to do is:

Part 1. Take emails, parse them, and generate a dictionary
of words from the spam messages.

Part 2. When a new email arrives, parse it and look up each
of its words in my dictionary.

For example: My dictionary has words like this:

1. Hi
2. Bye
3. you
4. cola
5. pepsi
6. viagra
7. weight


Suppose a new email arrives and I parse it and get the
following body in a file.

"Hi How is viagra. Loose weight"
total number of words=6

I have to compare this body with the words in
dictionary and create a feature vector like this:

Index: --> words in the email

1:--> for 'Hi'
6:--> for 'viagra'
7:--> for 'weight'

Feature Vector for the email:

[1:1/6 6:1/6 7:1/6 ]

since each of these words appears only once and the total
number of words is 6.

-----------------------------------------------------

I have been able to do Part 1: get an email, parse it,
strip the HTML, get the payload and generate a
dictionary.

What I don't know is this:


a)- How to strip blank spaces and characters like
^M from my dictionary
b)- How to remove numbers from my dictionary
c)- How to remove the To, From and Message-ID headers from
my dictionary

The really important ones, where I need the most help, are:

d)- How to compare the words from the payload of the
new email message, which I write to another file,
with the dictionary indexes

e)- How to create the feature vector I talked about in
Part 2 above
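
Items a) to c) above can be handled while the dictionary words are being collected. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used in the scripts below. The names `clean_words` and `HEADER_WORDS` are made up for illustration, and treating "To", "From" and "Message-ID" as plain words to skip is an assumption about how the headers end up in the dictionary.

```python
import re

# Hypothetical set of header names to drop; extend as needed.
HEADER_WORDS = {"to", "from", "message-id"}

def clean_words(raw_words):
    """Return unique, lowercased dictionary words in first-seen order."""
    cleaned = []
    seen = set()
    for word in raw_words:
        # a) strip() removes blanks, tabs, newlines and '\r' (shown as ^M)
        word = word.strip().lower()
        # c) drop header names; a trailing ':' is ignored for the comparison
        if word.rstrip(":") in HEADER_WORDS:
            continue
        # b) keep purely alphabetic tokens only, which drops numbers
        if not re.fullmatch(r"[a-z]+", word):
            continue
        if word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned

raw = ["Hi\r", "  Bye ", "2004", "From:", "viagra", "Message-ID:", "weight\r\n", "hi"]
print(clean_words(raw))  # ['hi', 'bye', 'viagra', 'weight']
```

Lowercasing everything is a design choice here: it makes "Hi" and "hi" the same dictionary entry, at the cost of losing capitalisation as a feature.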

I know these are not difficult, but I am a newbie with
Python. I chose Python instead of Java because I heard
that parsing emails is really easy in it, which is true.

I would really appreciate it if someone could give me a
hand with this.

Thanks

Doon't


-------------------------------------------------
My code for parsing emails and generating the dictionary
is here: emailparser.py and dictionary.py
--------------------------------------------------
#emailparser.py

#!/usr/local/bin/py


import string, StringIO, sys
import mailbox, email, re


def parse_mail(msg):
    # multipart messages are not handled; return an empty body for them
    if msg.is_multipart():
        return ''

    # Get the body of the message
    body = msg.get_payload()

    # drop the headers we are not interested in
    for hdr in msg.keys():
        if hdr.startswith('From'):
            del msg[hdr]
        elif hdr.startswith('To'):
            del msg[hdr]
        elif hdr.startswith('Received'):
            del msg[hdr]
        elif hdr.startswith('X-'):
            del msg[hdr]

    # process the body to remove html tags <...>
    body=re.sub(r'<[^>]*>','',body)

    return(body)

if __name__ == '__main__':

    if len(sys.argv) == 1:
        print """
emailparser.py MBOX_FILE

"""
        sys.exit(0)

    f = open(sys.argv[1],'r')
    mbox = mailbox.UnixMailbox(f,email.message_from_file)
    f1 = open('output','w')

    num = 0
    while 1:
        num = num+1
        try:
            msg = mbox.next()
        except email.Errors.HeaderParseError:
            print 'Current mail (num = '+str(num)+') seems to have a parse error. Skipping'
            continue

        if not msg: break

        if msg.is_multipart():
            print 'Skipping a multipart email (num '+str(num)+')'
            continue
        s = parse_mail(msg)

        f1.write(s)
    f1.close()
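
For readers on Python 3, where `mailbox.UnixMailbox` is no longer available, the same loop can be sketched with `mailbox.mbox`. This is only an illustration under stated assumptions: the function name `extract_bodies` and the sample message are made up, and multipart messages are skipped just as in the script above.

```python
import mailbox
import os
import re
import tempfile

def extract_bodies(mbox_path):
    """Return the tag-stripped bodies of all non-multipart messages."""
    bodies = []
    box = mailbox.mbox(mbox_path)
    try:
        for num, msg in enumerate(box, start=1):
            if msg.is_multipart():
                print("Skipping a multipart email (num %d)" % num)
                continue
            body = msg.get_payload()
            # remove html tags <...>, as in parse_mail()
            bodies.append(re.sub(r"<[^>]*>", "", body))
    finally:
        box.close()
    return bodies

# Tiny made-up mbox file, only to exercise the function.
sample = (
    "From spammer@example.com Thu Jan  1 00:00:00 2004\n"
    "From: spammer@example.com\n"
    "Subject: test\n"
    "\n"
    "<b>Hi</b> How is viagra. Loose weight\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(sample)
    bodies = extract_bodies(path)
print(bodies)
```

Note that `get_payload()` returns only the body, so the To, From and Message-ID headers never reach the output unless they also appear inside the message text.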


#------------------------------------------------------


#dictionary.py

# python code for creating dictionary of words

import os
import sys
import re

try:
    fread = open(sys.argv[1], 'r')
except IOError:
    print 'Cant open file for reading'
    sys.exit(0)
print 'Okay reading the file'

s=""
# find the file size, then rewind to the start
fread.seek(0,2)
c=fread.tell()
fread.seek(0)

# open the output once instead of re-opening it for every word
fwrite=open('dictionary', 'a')

a=fread.read(1)

while(fread.tell()!=c):

    s=s+a
    a=fread.read(1)
    if(a=='\012'): #newline
        #print s
        #print 'The Line Ends'
        fwrite.write(s)
        s=""

    if(a=='\040'): #blank character
        #print s
        fwrite.write(s)
        fwrite.write("\n")
        s=""

print 'Wrote to Dictionary\n'
fwrite.close()
fread.close()

#------------------------------------------------------


Diez B. Roggisch

You could create a dict that maps every word to its feature index. Then, to
create the feature vector, you simply create a list whose size is the number
of words in the dict and set the entries at the corresponding indices.
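
A minimal sketch of that suggestion, in Python 3, using the seven dictionary words from the question. The names `word_to_index` and `dense_vector` are invented for illustration, the indices are 0-based because they address a Python list, and the email is assumed to be already stripped of punctuation.

```python
# The seven dictionary words from the question, in order.
dictionary_words = ["hi", "bye", "you", "cola", "pepsi", "viagra", "weight"]

# Map every word to its position in the feature vector (0-based).
word_to_index = {word: i for i, word in enumerate(dictionary_words)}

def dense_vector(email_words):
    """Return one slot per dictionary word, holding count / total words."""
    vector = [0.0] * len(word_to_index)
    if not email_words:
        return vector
    for word in email_words:
        index = word_to_index.get(word.lower())
        if index is not None:
            vector[index] += 1.0
    total = len(email_words)
    return [value / total for value in vector]

email_words = "Hi How is viagra Loose weight".split()
print(dense_vector(email_words))
```

With a large dictionary most slots stay at zero, which is why a sparse index-to-value mapping may be the more practical representation.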
 
Josiah Carlson

#First, normalize the line breaks:
email_source = email_source.replace('\r\n', '\n').replace('\r', '\n')

#toss the headers:
pos = email_source.find('\n\n')
if pos != -1:
    email_body = email_source[pos:]
else:
    email_body = email_source

#clean out html:
#(use the method given at http://flangy.com/dev/python/striphtml.html )

#get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = email_body.translate(65*' '+lower+6*' '+lower+133*' ')

words_in_body = fixed_body.split()

#load up external dictionary (lowercased, to match the lowercased body):
words = open('dictionary', 'r').read().lower().split()
dct = {}
for i in xrange(len(words)):
    dct[words[i]] = i

#make vector, keyed by dictionary index:
vector = {}
a = float(len(words_in_body))
for i in words_in_body:
    if i in dct:
        try:
            vector[dct[i]] += 1
        except KeyError:
            vector[dct[i]] = 1

#turn the counts into relative frequencies:
for i in vector:
    vector[i] /= a


I know the above doesn't fit with what you have, but you should be able
to adapt it.

- Josiah
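
One possible adaptation, sketched in Python 3, that produces the `index:value` layout from the question. The function names are hypothetical, the indices are 1-based to match the example dictionary, and values are rounded to four decimals purely for display.

```python
import re
from collections import Counter

def load_dictionary(words):
    """Map each dictionary word to a 1-based index, as in the question."""
    return {word.lower(): i for i, word in enumerate(words, start=1)}

def feature_vector(body, word_to_index):
    """Return {index: relative frequency} for dictionary words in body."""
    # Letters only, lowercased: the same effect as the translate() step.
    tokens = re.findall(r"[a-z]+", body.lower())
    if not tokens:
        return {}
    counts = Counter(word_to_index[t] for t in tokens if t in word_to_index)
    total = len(tokens)
    return {index: count / total for index, count in sorted(counts.items())}

def format_vector(vector):
    """Render the vector as 'index:value' pairs separated by spaces."""
    return " ".join("%d:%.4f" % (index, value)
                    for index, value in sorted(vector.items()))

dct = load_dictionary(["Hi", "Bye", "you", "cola", "pepsi", "viagra", "weight"])
vec = feature_vector("Hi How is viagra. Loose weight", dct)
print(format_vector(vec))  # 1:0.1667 6:0.1667 7:0.1667
```

The denominator is the total number of words in the email, not only the words found in the dictionary, which matches the 1/6 values in the question.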
 
