Finding repeated words in text documents: which algorithm?

Daniele Menozzi

Hi all, I have this problem: I have some documents (10, 20, 30...) and I have
to find the words that are repeated most often.
Can you suggest an algorithm that can be used in this case?

Thank you :)
Daniele
 
JosephWu

Daniele Menozzi said:
Hi all, I have this problem: I have some documents (10, 20, 30...) and I have
to find the words that are repeated most often.
Can you suggest an algorithm that can be used in this case?

Thank you :)
Daniele


huffman???
 
Stefan Schulz

JosephWu said:
huffman???

Huffman coding, IIRC, needs to have the frequencies as its input.

Why not just make a list of all the words occurring in your documents, and
whenever you encounter a word, increment its frequency by one?
 
Hemal Pandya

Daniele said:
Hi all, I have this problem: I have some documents (10, 20, 30...) and I have
to find the words that are repeated most often.
Can you suggest an algorithm that can be used in this case?

initialize collection word-frequency
for each document
    for each word in the document
        if the word exists in word-frequency
            bump frequency
        else
            add the word to word-frequency with frequency 1

initialize top-frequency to 0, top-word to null
for each word in word-frequency
    if its frequency is greater than top-frequency
        assign the frequency to top-frequency, word to top-word

the top-word is the word that repeats most
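
In Java, the pseudocode above might look roughly like the sketch below. The class name TopWord, the method names, and the sample sentences are made up for illustration. Treating a "word" as a run of letters and ignoring case are my assumptions; the original question doesn't say how words should be split.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TopWord {
    // Counts every word across all documents. Splitting on runs of
    // non-letters and lower-casing are assumptions about what a "word" is.
    static Map<String, Integer> countWords(List<String> documents) {
        Map<String, Integer> wordFrequency = new HashMap<>();
        for (String document : documents) {
            for (String word : document.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (word.isEmpty()) {
                    continue; // split yields "" when the text starts with a separator
                }
                // Insert with frequency 1, or bump the existing frequency.
                wordFrequency.merge(word, 1, Integer::sum);
            }
        }
        return wordFrequency;
    }

    // Linear scan for the highest count; returns null if there are no words.
    static String topWord(Map<String, Integer> wordFrequency) {
        int topFrequency = 0;
        String topWord = null;
        for (Map.Entry<String, Integer> entry : wordFrequency.entrySet()) {
            if (entry.getValue() > topFrequency) {
                topFrequency = entry.getValue();
                topWord = entry.getKey();
            }
        }
        return topWord;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the real documents.
        List<String> documents = List.of("the cat sat on the mat", "The dog chased the cat");
        System.out.println(topWord(countWords(documents))); // prints "the"
    }
}
```

If two words tie for the highest count, this returns whichever one the map happens to iterate first, so collect all words with the top frequency if ties matter.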
 
George Cherry

Hemal Pandya said:
initialize collection word-frequency
for each document
    for each word in the document
        if the word exists in word-frequency
            bump frequency
        else
            add the word to word-frequency with frequency 1

initialize top-frequency to 0, top-word to null
for each word in word-frequency
    if its frequency is greater than top-frequency
        assign the frequency to top-frequency, word to top-word

the top-word is the word that repeats most

Maybe the the OP meant successive repetitions, as in
"the the" at the beginning of this sentence? My
spelling checker detects this, btw, and warns me.
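
If successive repetitions are what is wanted, a regular expression with a backreference is one way to catch them. This is only a sketch: the class name DoubledWords and the method findDoubledWords are invented here, and I'm assuming that words are runs of letters separated by whitespace and that "The the" should count as a repeat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoubledWords {
    // A whole word, some whitespace, then the same word again (backreference \1).
    // CASE_INSENSITIVE makes "The the" match as well (an assumption, see above).
    private static final Pattern DOUBLED = Pattern.compile(
            "\\b(\\p{L}+)\\s+\\1\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Returns the first word of each doubled pair, in order of appearance.
    static List<String> findDoubledWords(String text) {
        List<String> doubled = new ArrayList<>();
        Matcher matcher = DOUBLED.matcher(text);
        while (matcher.find()) {
            doubled.add(matcher.group(1));
        }
        return doubled;
    }
}
```

The trailing \b keeps "the then" from being reported, since the second "the" there is only a prefix of a longer word.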

George
 
Roedy Green

Daniele Menozzi said:
Hi all, I have this problem: I have some documents (10, 20, 30...) and I have
to find the words that are repeated most often.
Can you suggest an algorithm that can be used in this case?


There are two commonly used approaches:


1. Sort and look for adjacent duplicates.

2. Build a HashSet. If the word is already in there, you have a dup.
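
Both approaches can be sketched in a few lines of Java. The class and method names below are placeholders, and the input is assumed to be already split into words. Note that either one only tells you which words occur more than once, not how often.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DuplicateWords {
    // Approach 1: sort a copy, so that equal words end up next to each other.
    static Set<String> duplicatesBySorting(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted);
        Set<String> duplicates = new LinkedHashSet<>();
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).equals(sorted.get(i - 1))) {
                duplicates.add(sorted.get(i));
            }
        }
        return duplicates;
    }

    // Approach 2: Set.add returns false when the word was already present.
    static Set<String> duplicatesByHashSet(List<String> words) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String word : words) {
            if (!seen.add(word)) {
                duplicates.add(word);
            }
        }
        return duplicates;
    }
}
```

Sorting costs O(n log n) comparisons, while the HashSet version is a single pass with expected constant-time lookups, at the price of extra memory for the set.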

--
Bush crime family lost/embezzled $3 trillion from Pentagon.
Complicit Bush-friendly media keeps mum. Rumsfeld confesses on video.
http://www.infowars.com/articles/us/mckinney_grills_rumsfeld.htm

Canadian Mind Products, Roedy Green.
See http://mindprod.com/iraq.html photos of Bush's war crimes
 
Wibble

Roedy said:
There are two most commonly used:

1. sort and look for adjacent duplicates.

2. build a HashSet. If word is already in there, you have a dup.

Find all the words that aren't the most duplicated. The remaining one
is your answer.
 
Roedy Green

Wibble said:
Find all the words that aren't the most duplicated. The remaining one
is your answer.

In that case use a HashMap and add to the count for every hit. Then
sort the entries by hit count, and you can find your least and most
duplicated words, or just do a linear scan looking for the one you
want.
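
A rough sketch of the count-then-sort idea follows. RankedWords and rankByCount are names picked for the example, and the words are assumed to be already extracted from the documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RankedWords {
    // Returns the distinct words ordered from least to most frequent.
    static List<Map.Entry<String, Integer>> rankByCount(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum); // add to the count for every hit
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.comparingByValue()); // ascending by hit count
        return ranked;
    }
}
```

The first element of the returned list is then a least duplicated word and the last element a most duplicated one. Words with equal counts come out in no particular order relative to each other.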

 