Almost Done: Need some Help in Generating FEATURE VECTORS

dont bother · Mar 5, 2004

Hey,

I want to map my documents to vectors in an n-dimensional
feature space.

All I have to do is:

Part 1. Take emails, parse them, and generate a dictionary
of words from the spam messages.

Part 2. When a new email arrives, I have to do this:

Parse the new email. Check all the words of this new
email against my dictionary.

For example: My dictionary has words like this:

1. Hi
2. Bye
3. you
4. cola
5. pepsi
6. viagra
7. weight

Suppose a new email arrives and I parse it and get the
following body in a file.

"Hi How is viagra. Loose weight"
total number of words=6

I have to compare this body with the words in
dictionary and create a feature vector like this:

Index: --> words in the email

1:--> for 'Hi'
6:--> for 'viagra'
7:--> for 'weight'

Feature Vector for the email:

[1:1/6 6:1/6 7:1/6 ]

since each word appears only once and the total
number of words is 6.
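
The mapping described above can be sketched as follows. This is only a minimal sketch in Python 3 syntax, under assumptions the post does not state: words are matched case-insensitively, a "word" is any run of letters (so punctuation is dropped), and the names `DICTIONARY`, `feature_vector` and `format_vector` are hypothetical.

```python
import re
from fractions import Fraction

# The example dictionary from the post, in order; indexes start at 1.
DICTIONARY = ["Hi", "Bye", "you", "cola", "pepsi", "viagra", "weight"]

def feature_vector(body, dictionary=DICTIONARY):
    """Return {index: relative frequency} for dictionary words found in body."""
    # Assumption: matching is case-insensitive, so both sides are lowercased.
    index_of = {word.lower(): i for i, word in enumerate(dictionary, start=1)}
    # Assumption: a word is a run of letters; punctuation such as '.' is dropped.
    tokens = re.findall(r"[A-Za-z]+", body.lower())
    total = len(tokens)
    vector = {}
    for token in tokens:
        if token in index_of:
            idx = index_of[token]
            # Each occurrence contributes 1/total to the word's entry.
            vector[idx] = vector.get(idx, 0) + Fraction(1, total)
    return vector

def format_vector(vector):
    """Format the sparse vector in the 'index:value' style used in the post."""
    return "[" + " ".join(f"{i}:{v}" for i, v in sorted(vector.items())) + "]"

print(format_vector(feature_vector("Hi How is viagra. Loose weight")))
# prints: [1:1/6 6:1/6 7:1/6]
```

`Fraction` is used only so the output matches the 1/6 notation; it reduces fractions, so a word that appears twice in a six-word email would print as 1/3. Plain floats would work equally well.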

-----------------------------------------------------

I have been able to do Part1: Get an email, parse it,
remove html headers, get payload and generate a
dictionary.

What I don't know is this:

a)- How to strip blank spaces and characters like
^M from my dictionary
b)- How to remove numbers from my dictionary
c)- How to remove the To, From and Message-ID headers from
my dictionary
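
For points a) to c), one possible approach is sketched below in Python 3 syntax. It assumes that a dictionary entry should be lowercase and consist of letters only, which the post does not state; `clean_words` and `body_words` are hypothetical helper names. Rather than filtering headers out of the dictionary afterwards, the sketch tokenizes only the payload returned by the standard `email` package.

```python
import email
import string

def clean_words(text):
    """Split text into lowercase, letters-only words."""
    words = []
    # split() with no argument splits on any whitespace, including the
    # carriage return that shows up as ^M, and never yields empty strings.
    for token in text.split():
        token = token.strip(string.punctuation).lower()
        # isalpha() rejects numbers and any token containing a non-letter.
        if token.isalpha():
            words.append(token)
    return words

def body_words(raw_message):
    """Return the cleaned words from the body of a non-multipart message."""
    msg = email.message_from_string(raw_message)
    if msg.is_multipart():
        return []  # assumption: multipart mail is skipped, as in emailparser.py
    # get_payload() returns only the body, so To, From, Message-ID and the
    # other headers never reach the dictionary.
    return clean_words(msg.get_payload())
```

Note that this also drops tokens such as "you2" or "don't" entirely; whether that is desirable depends on how the dictionary is meant to be used.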

The really important ones, where I really need help, are:

d)- How to compare the words from the payload of the
new email message, which I write to another file,
with the dictionary indexes

e)- How to create the feature vector I talked about in
Part 2 above

I know these are not difficult, but it's a matter of
ignorance because I am a newbie with Python. I chose
Python instead of Java because I heard parsing emails
is really easy, which is true.

I would really appreciate it if someone could give me a
hand with that.

Thanks

Doon't

-------------------------------------------------
My code for parsing emails and generating the dictionary
is here: emailparser.py and dictionary.py
--------------------------------------------------
#emailparser.py

#!/usr/local/bin/py

import string, StringIO, sys
import mailbox, email, re

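def parse_mail(msg):
    # Multipart messages are skipped by the caller; return an empty
    # body here so that 'body' is never left undefined.
    if msg.is_multipart():
        return ''

    # Get the body of the message
    body = msg.get_payload()

    # Drop headers that are not needed. get_payload() never includes
    # the headers, so this does not change the returned body.
    for hdr in msg.keys():
        if hdr.startswith('From'):
            del msg[hdr]
        if hdr.startswith('To'):
            del msg[hdr]
        if hdr.startswith('Received'):
            del msg[hdr]
        if hdr.startswith('X-'):
            del msg[hdr]

    # process the body to remove html tags <...>
    body=re.sub(r'<[^>]*>','',body)

    return(body)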

if __name__ == '__main__':

    if len(sys.argv) == 1:
        print """
emailparser.py MBOX_FILE

"""
        sys.exit(0)

    f = open(sys.argv[1],'r')
    mbox = mailbox.UnixMailbox(f,email.message_from_file)
    f1 = open('output','w')

    num = 0
    while 1:
        num = num+1
        try:
            msg = mbox.next()
        except email.Errors.HeaderParseError:
            print 'Current mail (num = '+str(num)+') seems to have a parse error. Skipping'
            continue

        if not msg: break

        if msg.is_multipart():
            print 'Skipping a multipart email (num '+str(num)+')'
            continue
        s = parse_mail(msg)

        f1.write(s)
    f1.close()

#------------------------------------------------------

#dictionary.py

# python code for creating dictionary of words

import os
import sys
import re

try:
    fread = open(sys.argv[1], 'r')
except IOError:
    print 'Cant open file for reading'
    sys.exit(0)
print 'Okay reading the file'
s=""
# Find the file size, then go back to the start
fread.seek(0,2)
c=fread.tell()

fread.seek(0)
d=fread.tell()

# Open the dictionary once, in append mode, instead of on every word
fwrite=open('dictionary', 'a')

a=fread.read(1)

while(fread.tell()!=c):

    # Note: the separator just read is still in 'a' here, so it is
    # prepended to the next word
    s=s+a
    b=fread.tell()

    a=fread.read(1)
    if(a=='\012'): #newline
        #print s
        #print 'The Line Ends'
        fwrite.write(s)
        s=""

    if(a=='\040'): #blank character
        #print s
        fwrite.write(s)
        fwrite.write("\n")
        s=""

print 'Wrote to Dictionary\n'
fwrite.close()
fread.close()

#------------------------------------------------------


Diez B. Roggisch · Mar 5, 2004

You could create a dict that maps every word to its feature index. Then, to
create the feature vector, you simply create a list whose size is the number
of words in the dict and set the entry at the corresponding index for each
dictionary word found in the email.
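
A minimal sketch of this suggestion, in Python 3 syntax. The function names are hypothetical, and storing relative frequencies (each occurrence counts as 1 divided by the number of words in the email, as in the original question) is an assumption, since the reply does not say what value to set. Indexes here are 0-based list positions.

```python
def build_index(dictionary_words):
    """Map each distinct word to a feature index, in first-seen order."""
    index_of = {}
    for word in dictionary_words:
        if word not in index_of:
            index_of[word] = len(index_of)
    return index_of

def dense_vector(email_words, index_of):
    """Return a list with one slot per dictionary word."""
    vector = [0.0] * len(index_of)
    if not email_words:
        return vector  # empty email: all slots stay at zero
    share = 1.0 / len(email_words)
    for word in email_words:
        idx = index_of.get(word)
        if idx is not None:  # words outside the dictionary are ignored
            vector[idx] += share
    return vector

index_of = build_index(["hi", "bye", "you", "cola", "pepsi", "viagra", "weight"])
vec = dense_vector(["hi", "how", "is", "viagra", "loose", "weight"], index_of)
```

With a large dictionary most slots stay at zero, which is why a sparse index-to-value mapping is often preferred over a full list.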

Josiah Carlson · Mar 5, 2004

#First, normalize the line breaks:
email_source = email_source.replace('\r\n', '\n').replace('\r', '\n')

#toss the headers:
pos = email_source.find('\n\n')
if pos != -1:
    email_body = email_source[pos:]
else:
    email_body = email_source

#clean out html:
#(use the method given at http://flangy.com/dev/python/striphtml.html )

#get rid of anything that isn't a letter, and make it all lowercase:
lower = ''.join(map(chr, range(97, 123)))
fixed_body = email_body.translate(65*' '+lower+6*' '+lower+133*' ')

words_in_body = fixed_body.split()

#load up external dictionary:
words = open('dictionary', 'r').read().split()
dct = {}
for i in xrange(len(words)):
    dct[words[i]] = i

#make vector (keyed by word; dct[i] gives that word's index):
vector = {}
a = float(len(words_in_body))
for i in words_in_body:
    if i in dct:
        try:
            vector[i] += 1
        except KeyError:
            vector[i] = 1

for i in vector:
    vector[i] /= a

I know the above doesn't fit with what you have, but you should be able
to adapt it.

- Josiah
