Building a word list from multiple files

Manu · Nov 18, 2004

Hi,

Here's what I want to accomplish.
I want to make a list of frequently occurring words in a group of
files, along with the number of occurrences of each.
The brute-force method would be to read a file as a string, split it,
and load the words into a dict with the words as keys and the number
of occurrences as values.
Then load the next file and iterate through its words, incrementing the
value if there is a match or adding a new key/value pair if there is none.
Repeat for all files.

Is there a better way?

Thanks in advance.
Manu
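
For reference, the brute-force approach described above might look roughly like this. It is only a sketch: the function names `count_words` and `most_frequent` are made up for illustration, and the caller is assumed to have already read each file into a string.

```python
def count_words(texts):
    """Count whitespace-separated words across an iterable of strings."""
    counts = {}
    for text in texts:
        for word in text.split():
            # Increment on a match, or start a new key at 1.
            counts[word] = counts.get(word, 0) + 1
    return counts


def most_frequent(counts, n=10):
    """Return the n most common (word, count) pairs, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


counts = count_words(["a b a", "b c"])
print(most_frequent(counts, 2))
```

Each element of `texts` would be the contents of one file, so the same dict accumulates counts over the whole group.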

Larry Bates · Nov 18, 2004

Manu,

There are some things we would need to know to specifically
answer your question. I've tried to answer it with some
"assumptions" about your data/usage:

1) How large are the files you are reading (e.g. can they
fit in memory)?

If not, you will need to read the file a line at a time
and process each line individually.

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

If not, you will probably need to use regular expressions
to handle all different punctuations that might separate
the words. Things like quotes, commas, periods, colons,
semi-colons, etc. Simple string split won't handle these
properly.

3) Do the "files" change a lot?

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you
add/change one of the files run this process to recreate
and shelve the new dictionary. In your main program
get the shelved dictionary from the preprocess program
so that you don't have to process all the files every
time.

Hope info helps,
Larry Bates
Syscon, Inc.

Steven Bethard · Nov 18, 2004

Larry said:
2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

If not, you will probably need to use regular expressions
to handle all different punctuations that might separate
the words. Things like quotes, commas, periods, colons,
semi-colons, etc. Simple string split won't handle these
properly.

If you go this way, you probably ought to read this thread:

http://mail.python.org/pipermail/python-list/2004-November/250520.html

which suggests finding words with a regexp something like r'[^\W\d_]+'.
(If you're not concerned about internationalization, it could be simpler.)
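
A rough sketch of how that pattern might be used for counting, assuming Python 3, where str patterns match Unicode letters by default. The helper name `count_words_regex` and the lowercasing step are illustrative choices, not something taken from the linked thread.

```python
import re

# Runs of letters only: word characters minus digits and underscore.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words_regex(texts, counts=None):
    """Add letter-only word counts from each string in `texts` to `counts`."""
    if counts is None:
        counts = {}
    for text in texts:
        # Lowercase first so "Hello" and "hello" share one key (an assumption).
        for word in WORD_RE.findall(text.lower()):
            counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words_regex(["Hello, world. Hello?", "hello_world 2004 café"])
print(counts)
```

Note that digits and underscores act as separators here, so "hello_world" is counted as two words and "2004" is dropped.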

STeve

Manu · Nov 18, 2004

hi,

1) How large are the files you are reading (e.g. can they
fit in memory)?

The files are email messages.
I will be using the built-in email module to extract only the parts
whose content type is plain text or HTML. So no line-by-line processing
is possible unless I write my own parser for email.

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

In the case of HTML mail I only extract the text and strip off the
tags.
Since this is regular text I expect no special separators, and as I
understand it split() by default takes any whitespace character as a
delimiter. This will work fine for my purposes.

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you

This is what I was planning to do. Once the processing is done for a
set of files, they are never processed again. I was going to store the
dict as a string in a file and then use eval() to get it back.

Thanks
Manu

Jeff Shannon · Nov 18, 2004

Manu said:
hi,

The files are email messages.
I will be using the built-in email module to extract only the parts
whose content type is plain text or HTML. So no line-by-line processing
is possible unless I write my own parser for email.

The email package can do that parsing for you -- it's not too difficult
to feed it a raw message file and get back only the text and/or html
payload.
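
Something along these lines, as a sketch only: the function name `extract_text_parts` and the sample message are invented for illustration, and the fallback to UTF-8 when a part declares no usable charset is an assumption.

```python
import email


def extract_text_parts(raw_message):
    """Return (content_type, text) pairs for the text/plain and text/html parts."""
    msg = email.message_from_string(raw_message)
    parts = []
    for part in msg.walk():  # visits the message itself and every sub-part
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the header
            text = payload.decode("utf-8", errors="replace")
        parts.append((ctype, text))
    return parts


SAMPLE = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "hello plain world\n"
    "--XYZ\n"
    "Content-Type: text/html; charset=us-ascii\n"
    "\n"
    "<p>hello html world</p>\n"
    "--XYZ--\n"
)

for ctype, text in extract_text_parts(SAMPLE):
    print(ctype, repr(text))
```

The HTML part still contains its tags at this point, so stripping them remains a separate step.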

This is what I was planning to do. Once the processing is done for a
set of files, they are never processed again. I was going to store the
dict as a string in a file and then use eval() to get it back.

Use the shelve module instead of eval()ing it yourself -- the shelve
authors have already done all of the hard work for you. It'll act
almost like a regular dictionary, but is extremely easy to save to disk
and reload later.
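
A minimal sketch of what that could look like, assuming one shelf entry per word. The helper names `add_counts` and `load_counts` and the temporary shelf location are hypothetical, chosen only for the example.

```python
import os
import shelve
import tempfile


def add_counts(shelf_path, new_counts):
    """Merge `new_counts` into the word counts stored in the shelf at `shelf_path`."""
    with shelve.open(shelf_path) as shelf:
        for word, n in new_counts.items():
            shelf[word] = shelf.get(word, 0) + n


def load_counts(shelf_path):
    """Read the whole shelf back into an ordinary dict."""
    with shelve.open(shelf_path) as shelf:
        return dict(shelf)


# Hypothetical location; a real program would use a fixed path.
shelf_path = os.path.join(tempfile.mkdtemp(), "wordcounts")
add_counts(shelf_path, {"hello": 2, "world": 1})   # first batch of files
add_counts(shelf_path, {"hello": 1, "python": 4})  # a later batch
print(load_counts(shelf_path))
```

Since each batch is merged into what is already on disk, files that have been processed once never need to be read again. Shelve stores values with pickle, so a shelf should only be opened if it comes from a trusted source.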

This is why Python is called "batteries included".

Jeff Shannon
Technician/Programmer
Credit International

Larry Bates · Nov 19, 2004

Email messages should be small enough that reading them
into memory isn't an issue, so line-by-line processing
isn't indicated here.

Email messages have LOTS of punctuation other than
whitespace between words. Just look at your email message
below. It contains:

> greater-than symbols
) parentheses
. periods
? question marks
, commas

Even text like "html.So no line.." will be a problem:
with no whitespace after the period, string split would
return "html.So" as a word.

I really think you are going to need to use regex to
split this into "words" and even then the words may
be of questionable origin. See another response for
an example regular expression that might work. Constructs
like e.g. will return two words "e" and "g" (which
might be OK for your application).
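
To make the difference concrete, here is a small comparison. The sample string is made up for illustration, and the pattern is the one suggested earlier in this thread.

```python
import re

sample = "html.So no line.. e.g. (like this), right?"

# Whitespace splitting leaves punctuation glued to the words.
by_split = sample.split()

# Matching runs of letters drops the punctuation but also breaks up "e.g.".
by_regex = re.findall(r"[^\W\d_]+", sample)

print(by_split)  # ['html.So', 'no', 'line..', 'e.g.', '(like', 'this),', 'right?']
print(by_regex)  # ['html', 'So', 'no', 'line', 'e', 'g', 'like', 'this', 'right']
```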

Hope feedback at least helps.

Larry Bates
