On Sun, 3 May 2009 22:21:13 +0530, SUBHABRATA BANERJEE
<[email protected]> declaimed the following in
gmane.comp.python.general:
from decimal import*
#SAMPLE TEST PROGRAM FOR FILE
def sample_file_test(n):
Is that "n" supposed to mean something? I don't see it used anywhere
in the following stuff.
#FILE FOR STORING PROBABILITY VALUES
open_file=open("/python26/Newfile1.txt","r+")
This is your output file, no? So why open it with "r+" (read plus
update, which neither creates nor truncates the file) rather than "w"?
Oh, and just out of curiosity -- that isn't the Python install
directory you are using for your data files, is it?
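To make the mode question concrete, here is a small sketch of how "r+" and "w" behave. It is written for Python 3 (an assumption; the thread itself is on 2.6), and the temporary directory and the file name probs.txt are made up for illustration.

```python
import os
import shutil
import tempfile

# Hypothetical scratch location; not the poster's actual path.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "probs.txt")

# "r+" requires the file to exist already; it never creates it.
try:
    open(path, "r+")
    created_by_r_plus = True
except IOError:  # alias of OSError in Python 3; FileNotFoundError is a subclass
    created_by_r_plus = False

# "w" creates the file (or truncates an existing one).
with open(path, "w") as fout:
    fout.write("0.123456\n")

# "r+" on an existing file starts writing at offset 0 and does NOT truncate,
# so a shorter write leaves the old contents behind it.
with open(path, "r+") as fout:
    fout.write("9")

with open(path, "r") as fin:
    leftover = fin.read()

print(created_by_r_plus)   # False
print(repr(leftover))      # '9.123456\n'

shutil.rmtree(workdir)     # clean up the scratch directory
```

For a file that is only ever written from scratch, "w" is the simpler choice.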
#OPENING OF ENGLISH CORPUS
open_corp_eng=open("/python26/TOTALENGLISHCORPUS1.txt","r")
Same comment about the directory.
#READING THE ENGLISH CORPUS
corp_read=open_corp_eng.read()
Let's see, read the entire file into one large string...
#CONVERTING THE CORPUS FILE IN WORDS
corp_word=corp_read.split()
Then create a list of words (so you now have, essentially, two
copies of the text in memory). And forgive me, but those names aren't
the most illuminating: open_corp_eng sounds more like a function that is
meant to open something, not the result of opening it...
#EXTRACTING WORDS FROM CORPUS FILE OF WORDS
for word in corp_word:
#COUNTING THE WORD
count1=corp_word.count(word)
If any given word appears in the text more than once, you end up
repeating this counting operation each time it appears.
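The repetition can be avoided by tallying every word in a single pass over the list. A minimal sketch, using a made-up sample sentence in place of the corpus file:

```python
# Made-up sample text standing in for the corpus.
corp_word = "the cat saw the dog and the bird".split()

# One pass over the list: each occurrence bumps its word's tally by one.
counts = {}
for word in corp_word:
    counts[word] = counts.get(word, 0) + 1

# Same numbers as calling corp_word.count(word) for every word,
# but without rescanning the whole list each time.
for word in counts:
    assert counts[word] == corp_word.count(word)
```

Because dictionary keys are unique, each word also ends up with exactly one entry.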
#COUNTING TOTAL NUMBER OF WORDS
count2=len(corp_word)
This value should not change and should be obtained outside the word
loop.
#COUNTING PROBABILITY OF WORD
count1_dec=Decimal(count1)
count2_dec=Decimal(count2)
getcontext().prec = 6
prob_count=count1_dec/count2_dec
Is there some particular reason for using the decimal package here?
(And again, count2_dec should be computed outside the loop.)
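For a simple ratio like this, plain float division is usually enough, and the total only needs converting once. A small sketch with made-up counts (Python 3 is assumed for the print calls):

```python
from decimal import Decimal, getcontext

# Made-up numbers: a word seen 3 times in a corpus of 7 words.
count1 = 3
count2 = 7

# The decimal route from the original post, with 6 significant digits.
getcontext().prec = 6
prob_decimal = Decimal(count1) / Decimal(count2)

# The float route: convert the total once, outside any loop.
total = float(count2)
prob_float = count1 / total

print(prob_decimal)           # 0.428571
print("%.6f" % prob_float)    # 0.428571
```

Rounding only when formatting the output keeps the stored values at full float precision.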
print prob_count
string_of_prob_count=str(prob_count)
file_input_val=open_file.write(string_of_prob_count)
Does .write() even return a value? What do you expect
"file_input_val" to contain after that statement? And, as others have
mentioned, .write() does not add newlines or other whitespace, so all
your output would be one long string.
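A short sketch of the newline point. It uses an in-memory io.StringIO in place of the real output file (an assumption, made so the example is self-contained) and is written for Python 3. For what it's worth, file.write() returns None in Python 2.6 and the number of characters written in Python 3.

```python
import io

probs = ["0.5", "0.25"]   # made-up probability strings

# Without an explicit newline, everything runs together.
run_together = io.StringIO()
for p in probs:
    run_together.write(p)

# Appending "\n" puts each value on its own line.
one_per_line = io.StringIO()
for p in probs:
    one_per_line.write(p + "\n")

print(repr(run_together.getvalue()))   # '0.50.25'
print(repr(one_per_line.getvalue()))   # '0.5\n0.25\n'
```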
Uhm, as written, the output file is closed right after the first
word's probability is calculated and written -- I suspect the close
should be outside the for loop.
Also note that you will be repeating words in the output as there is
no provision to create unique entries.
Does the following seem to do what you need?
-=-=-=-=-=-=-=-=-
"""
    wordprob.py     relative probabilities for word appearance in text
    may 3 2009      dennis lee bieber

    an alternate approach to a problem posted on C.L.P
"""
import sys

def loadData(fid):
    words = {}
    fin = open(fid, "r")
    #only read one line at a time
    for ln in fin:
        #treat upper and lower case words as same
        for wd in ln.lower().split():
            #increment count of specific word
            words[wd] = words.get(wd, 0) + 1
    fin.close()
    return words

def computeProbabilities(wordCounts):
    #get total count of words (and make it float)
    total = float(sum(wordCounts.values()))
    probs = {}
    #for each word, compute the probability
    for (wd, wc) in wordCounts.items():
        probs[wd] = wc / total
    return probs

def writeResults(fid, probs):
    #if fid is a string, treat it as a file name to open
    if isinstance(fid, str):
        fout = open(fid, "w")
    else:
        #otherwise assume supplied fid is an open stream
        fout = fid
    #convert dictionary to list for sorting
    ordered = list(probs.items())
    #sort into descending probability (most common first)
    ordered.sort(key=lambda x: x[1], reverse=True)
    for (wd, wp) in ordered:
        #write the word followed by probability, newline
        fout.write("%s : %s\n" % (wd, wp))
    #only close the stream if it was opened here
    if fout is not fid:
        fout.close()

if __name__ == "__main__":
    theData = loadData(YOUR_FILE_NAME_HERE)
    theProbs = computeProbabilities(theData)
    writeResults(sys.stdout, theProbs)
    ## use a file name if desired output is other than screen
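
On a newer Python than the 2.6 used in this thread, the standard library's collections.Counter (added in 2.7) can do the tallying. This is a swapped-in alternative rather than the script above, sketched on made-up sample text; word_probabilities is a hypothetical name.

```python
from collections import Counter

def word_probabilities(text):
    """Return {word: relative frequency} for whitespace-separated words."""
    # Lower-case first so "The" and "the" are counted as the same word.
    counts = Counter(text.lower().split())
    total = float(sum(counts.values()))
    return dict((wd, wc / total) for wd, wc in counts.items())

# Made-up sample text, not the poster's corpus.
sample = "The cat saw the dog"
probs = word_probabilities(sample)

# Most common first, the same ordering writeResults uses.
for wd, wp in sorted(probs.items(), key=lambda x: x[1], reverse=True):
    print("%s : %s" % (wd, wp))
```

Here "the" accounts for two of the five words, so it is listed first with probability 0.4.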
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/