Building a word list from multiple files

Manu

Hi,

Here's what I want to accomplish.
I want to make a list of frequently occurring words in a group of
files, along with the number of occurrences of each.
The brute-force method would be to read a file as a string, split it,
and load the words into a dict with the word as key and the number of
occurrences as value.
Then load the next file and iterate through its words, incrementing the
value if there is a match or adding a new key/value pair if there is none.
Repeat for all files.
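
The steps above can be sketched roughly as follows. This is only an illustration, assuming plain-text files small enough to read into memory; `count_words` is a made-up name, and `collections.Counter` stands in for the hand-maintained dict.

```python
from collections import Counter
from pathlib import Path


def count_words(paths):
    """Return a Counter mapping each whitespace-separated word to its total count."""
    counts = Counter()
    for path in paths:
        # Read the whole file as one string, then split on any whitespace.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        # Counter.update() increments existing keys and adds missing ones.
        counts.update(text.split())
    return counts
```

Calling `count_words(paths).most_common(20)` would then give the 20 most frequent words with their counts.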

Is there a better way?


Thanks in advance.
Manu
 
Larry Bates

Manu,

There are some things we would need to know to specifically
answer your question. I've tried to answer it with some
"assumptions" about your data/usage:

1) How large are the files you are reading (e.g. can they
fit in memory)?

If not, you will need to read the file a line at a time
and process each line individually.

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

If not, you will probably need to use regular expressions
to handle all different punctuations that might separate
the words. Things like quotes, commas, periods, colons,
semi-colons, etc. Simple string split won't handle these
properly.

3) Do the "files" change a lot?

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you
add/change one of the files run this process to recreate
and shelve the new dictionary. In your main program
get the shelved dictionary from the preprocess program
so that you don't have to process all the files every
time.
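
Point 3 could look roughly like the sketch below. The shelf key `"counts"` and both function names are invented for illustration, and `new_counts` is assumed to be a dict or `Counter` produced by whatever per-file counting you settle on.

```python
import shelve
from collections import Counter


def update_shelf(shelf_path, new_counts):
    """Merge new_counts into the totals stored in the shelf and return them."""
    with shelve.open(shelf_path) as shelf:
        totals = shelf.get("counts", Counter())
        totals.update(new_counts)
        # Reassign the key so the modified object is written back to disk.
        shelf["counts"] = totals
    return totals


def load_counts(shelf_path):
    """Load the stored totals without reprocessing any files."""
    with shelve.open(shelf_path) as shelf:
        return shelf.get("counts", Counter())
```

The preprocess step would call `update_shelf()` whenever files are added, and the main program would only ever call `load_counts()`.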

Hope this info helps,
Larry Bates
Syscon, Inc.
 
Steven Bethard

Larry said:
2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

If not, you will probably need to use regular expressions
to handle all different punctuations that might separate
the words. Things like quotes, commas, periods, colons,
semi-colons, etc. Simple string split won't handle these
properly.

If you go this way, you probably ought to read this thread:

http://mail.python.org/pipermail/python-list/2004-November/250520.html

which suggests finding words with a regexp something like r'[^\W\d_]+'.
(If you're not concerned about internationalization, it could be simpler.)
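
A rough sketch of how that pattern might be used. The `words` helper name is made up here, and lowercasing is an extra assumption so that "The" and "the" are counted as the same word.

```python
import re

# Runs of characters that are word characters but neither digits nor
# underscore, i.e. runs of letters (Unicode-aware for str in Python 3).
WORD_RE = re.compile(r"[^\W\d_]+")


def words(text):
    """Return the lowercased alphabetic words found in text."""
    return WORD_RE.findall(text.lower())
```

The resulting list can be fed straight into whatever dict or counter is accumulating the totals.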

STeve
 
Manu

hi,
1) How large are the files you are reading (e.g. can they
fit in memory)?

The files are email messages.
I will be using the built-in email module to extract only the parts whose
content type is plain text or HTML. So no line-by-line processing is
possible unless I write my own parser for email.

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

In the case of HTML mail I only extract the text and strip off the
tags.
Since this is regular text I expect no special separators and, as I
understand it, split() by default takes any whitespace character as a
delimiter. This will work fine for my purposes.

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you

This is what I was planning to do. Once the processing is done for a
set of files, they are never processed again. I was going to store the
dict as a string in a file and then use eval() to get it back.


Thanks
Manu
 
Jeff Shannon

Manu said:
The files are email messages.
I will be using the built-in email module to extract only the parts whose
content type is plain text or HTML. So no line-by-line processing is
possible unless I write my own parser for email.

The email package can do that parsing for you -- it's not too difficult
to feed it a raw message file and get back only the text and/or html
payload.
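
As a minimal sketch of that, assuming Python 3 and a message already read from disk as bytes (`text_parts` is an invented helper name, and text attachments are not filtered out):

```python
import email
from email import policy


def text_parts(raw_bytes):
    """Yield the decoded text of every text/plain and text/html part."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # walk() visits the message itself and, for multipart mail, each subpart.
    for part in msg.walk():
        if part.get_content_type() in ("text/plain", "text/html"):
            # get_content() decodes the transfer encoding and charset.
            yield part.get_content()
```

HTML parts come back with their tags intact, so they would still need stripping before counting words.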

Manu said:
This is what I was planning to do. Once the processing is done for a
set of files, they are never processed again. I was going to store the
dict as a string in a file and then use eval() to get it back.

Use the shelve module instead of eval()ing it yourself -- the shelve
authors have already done all of the hard work for you. It'll act
almost like a regular dictionary, but is extremely easy to save to disk
and reload later.

This is why Python is called "batteries included". :)

Jeff Shannon
Technician/Programmer
Credit International
 
Larry Bates

Email messages should be small enough that reading them into
memory isn't an issue, so line-by-line processing isn't
indicated here.

Email messages have LOTS of punctuation in them other than
whitespace between words. Just look at your email message
below. It contains:
> greater-than symbols
) parentheses
. periods
? question marks
, commas

Even text like "html.So no line.." is a problem: with periods
and no whitespace, string split would return "html.So" as a
single word.

I really think you are going to need to use a regex to
split this into "words", and even then the words may
be of questionable origin. See another response for
an example regex that might work. Constructs
like e.g. will return two words, "e" and "g" (which
might be OK for your application).
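
To make the difference concrete, here is a small comparison. The sample string is made up from fragments of the message, and the pattern is the letters-only one suggested elsewhere in this thread.

```python
import re

sample = "type which is plain text or in html.So no line by line (e.g. this)"

# Plain whitespace splitting keeps punctuation glued to the words.
split_words = sample.split()

# The letters-only pattern breaks on every non-letter character.
regex_words = re.findall(r"[^\W\d_]+", sample)
```

Here `split_words` contains "html.So" and "(e.g." as single items, while `regex_words` contains "html", "So", "e" and "g" as separate words.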

Hope this feedback at least helps.

Larry Bates
 
