remove common words from a string

S

sfu

Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

Then if i take a String "I want to visit the shops today and buy some
clothes", i would like to strip out the common words to become "visit
shops today buy clothes"

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

Any help is much appreciated, thanks
 
E

Erwin Moller

sfu said:
Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

Then if i take a String "I want to visit the shops today and buy some
clothes", i would like to strip out the common words to become "visit
shops today buy clothes"

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

Any help is much appreciated, thanks

Hi,


I am bored today anyway, so I made this little program.
Is this what you want?
Who is that chap Boyer Moore by the way?

Regards,
Erwin



public class WordFilter {

public WordFilter() {
}

public static void main(String[] args) {
String strOrg = "I want to visit the shops today and buy some clothes and";
String filterThis[] = {" at " , " the " , " and "};
System.out.println("before: "+strOrg);

for(int iFilt = 0; iFilt<filterThis.length; iFilt++) {
int nWordLength = filterThis[iFilt].length();
int nFoundPos = strOrg.indexOf(filterThis[iFilt]);
while (nFoundPos != -1){
System.out.println("found "+filterThis[iFilt]+" at position:"+nFoundPos);
strOrg = strOrg.substring(0,nFoundPos)+"
"+strOrg.substring(nFoundPos+nWordLength);
nFoundPos = strOrg.indexOf(filterThis[iFilt]);
}
}
System.out.println("after: "+strOrg);
}
}
 
D

Danny Woods

sfu said:
Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

The regular expressions package in java.util.regex sounds like what
you're after. Use Pattern.compile to create a Pattern object for a
given regular expression. Create a Matcher from that Pattern with
Pattern.matcher, and then replaceAll on that Matcher (replacing with
""). That matcher will come back with a suitably formed String.

A regex like "\b((and)|(or)|(at))\b" is a starting point.

Hope that helps,

Danny
 
W

William Brogden

sfu said:
Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

Then if i take a String "I want to visit the shops today and buy some
clothes", i would like to strip out the common words to become "visit
shops today buy clothes"

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

Any help is much appreciated, thanks

If I had that problem I would use a StringTokenizer to take apart
the String and a HashMap filled with the common words to detect
which to discard. Reassemble the words that pass in a StringBuffer.

WBB
 
S

sfu

Erwin Moller said:
sfu said:
Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

Then if i take a String "I want to visit the shops today and buy some
clothes", i would like to strip out the common words to become "visit
shops today buy clothes"

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

Any help is much appreciated, thanks

Hi,


I am bored today anyway, so I made this little program.
Is this what you want?
Who is that chap Boyer Moore by the way?

Regards,
Erwin



public class WordFilter {

public WordFilter() {
}

public static void main(String[] args) {
String strOrg = "I want to visit the shops today and buy some clothes and";
String filterThis[] = {" at " , " the " , " and "};
System.out.println("before: "+strOrg);

for(int iFilt = 0; iFilt<filterThis.length; iFilt++) {
int nWordLength = filterThis[iFilt].length();
int nFoundPos = strOrg.indexOf(filterThis[iFilt]);
while (nFoundPos != -1){
System.out.println("found "+filterThis[iFilt]+" at position:"+nFoundPos);
strOrg = strOrg.substring(0,nFoundPos)+"
"+strOrg.substring(nFoundPos+nWordLength);
nFoundPos = strOrg.indexOf(filterThis[iFilt]);
}
}
System.out.println("after: "+strOrg);
}
}

Cool thats great, Boyer-Moore is a fast string searching algorithm
which i could replace indexOf to improve search times.

Thanks for everyones ideas and help
 
R

Roedy Green

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

I would think the way you would implement this is to tokenise, break
the text into words.

You can do this with a Regex.split or a StringTokenizer, or with a
simple indexOf( ' ' );


Then you look up each word in a HashSet. Only the words you want to
expunge are there. Then write the words you keep to a new file.

See http://mindprod.com/jgloss/hashtable.html

and http://mindprod.com/fileio.html
 
S

S.A. Samokhodkin

sfu said:
Hi
...
I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But

Using BM for searching the "at" wouldn't speed up things. Actually, it
would significantly slow them down. If the speed is of concern, you
should look into building a DFA from the words to search.

Regards
 
S

sfu

S.A. Samokhodkin said:
Using BM for searching the "at" wouldn't speed up things. Actually, it
would significantly slow them down. If the speed is of concern, you
should look into building a DFA from the words to search.

Regards

it would slow things down? doh! i thought it was meant to be one of
the fastest string searching algorithms, especially over java's
indexOf. :(
 
S

sfu

S.A. Samokhodkin said:
Using BM for searching the "at" wouldn't speed up things. Actually, it
would significantly slow them down. If the speed is of concern, you
should look into building a DFA from the words to search.

Regards

yeah im looking for a fast way of doing this because i want to strip
out all the common words from an essay which could be > 2000 words for
example.

excuse me for my ignorance but what is a DFA?

thanks for your suggestions thou!
 
G

Gordon Beaton

it would slow things down? doh! i thought it was meant to be one of
the fastest string searching algorithms, especially over java's
indexOf. :(

Like many things, it depends.

BM *is* fast, for long search strings. The longer the better in fact.

For search strings of length 1 it has the same complexity as
String.indexOf(), but with a higher constant factor due to some added
complexity in the search loop.

BM also has a high initial cost associated with setting up the two
delta tables, so for short strings it loses unless you are searching
for the same string many times in a relatively large body of text, and
have an implementation that lets you reuse the delta tables (to take
advantage of the fact that the search string hasn't changed).

String.indexOf() is fine when the body of text isn't huge and the
search string is relatively short.

/gordon
 
R

Roedy Green

it would slow things down? doh! i thought it was meant to be one of
the fastest string searching algorithms, especially over java's
indexOf. :(

See http://mindprod.com/products.html#BOYER

It is only faster in circumstance where the overhead to set it up pays
off.

There is a variant that can search for many words at once.

Note you are not looking for "the" but " the ". If you looked for
"the" you would also hit "there".
 
R

Roedy Green

yeah im looking for a fast way of doing this because i want to strip
out all the common words from an essay which could be > 2000 words for
example.

No matter which recommended technique you select, it will be over in a
blink of an eye. The only time consuming part will be reading and
writing and that will be under a second.

By far the most time consuming part of the task will be firing up
java.exe.
 
S

sfu

William Brogden said:
If I had that problem I would use a StringTokenizer to take apart
the String and a HashMap filled with the common words to detect
which to discard. Reassemble the words that pass in a StringBuffer.

WBB

while(essayTokenizer.hasMoreTokens()) {
String currentToken = essayTokenizer.nextToken();
if (currentToken.equals(hashTable.get(currentToken))) {
System.out.println("removed: " + currentToken);
} else {
strippedText += currentToken + " ";
}
}

does anyone know if its faster to use
strippedText += currentToken + " ";
or
text = new StringBuffer().append(currentToken + " ")
to create the stripped body of text over say 2000 tokens for example?
 
S

sfu

Roedy Green said:
No matter which recommended technique you select, it will be over in a
blink of an eye. The only time consuming part will be reading and
writing and that will be under a second.

By far the most time consuming part of the task will be firing up
java.exe.

oh right, cool!
 
C

Chris Uppal

Roedy said:
By far the most time consuming part of the task will be firing up
java.exe.

By the sound of it, by far the most time consuming thing will be writing the
blasted program in the first place...

(Which, btw, is a serious point for sfu -- trying to make a program run fast is
nearly always a time-wasting academic exercise unless the program *actually*
runs too slowly for your needs.)

-- chris
 
W

William Brogden

sfu said:
"William Brogden" <[email protected]> wrote in message

while(essayTokenizer.hasMoreTokens()) {
String currentToken = essayTokenizer.nextToken();
if (currentToken.equals(hashTable.get(currentToken))) {
System.out.println("removed: " + currentToken);
} else {
strippedText += currentToken + " ";
}
}

does anyone know if its faster to use
strippedText += currentToken + " ";
or
text = new StringBuffer().append(currentToken + " ")
to create the stripped body of text over say 2000 tokens for example?

GAH! why would you want to do that? Create one StringBuffer before entering
the loop, use the size of the input String so StringBuffer won't have to
enlarge itself.

Also, as Roedy suggested, put your words to be dropped in a HashSet, then
you can use the very fast contains( obj ) method. You may want to use
toLowerCase on your string tokens and have all lower case words in the
HashSet.

Another thought - if your input string includes punctuation, be sure to
create
the StringTokenizer with the punctuation as tokens - otherwise you will
get tokens consisting of words plus punctuation and the match will fail.

Bill
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top