remove common words from a string

sfu · Sep 12, 2003

Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

Then if i take a String "I want to visit the shops today and buy some
clothes", i would like to strip out the common words to become "visit
shops today buy clothes"

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

Any help is much appreciated, thanks

Erwin Moller · Sep 12, 2003

sfu said:
Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

Then if i take a String "I want to visit the shops today and buy some
clothes", i would like to strip out the common words to become "visit
shops today buy clothes"

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

Any help is much appreciated, thanks

Hi,

I am bored today anyway, so I made this little program.
Is this what you want?
Who is that chap Boyer Moore by the way?

Regards,
Erwin

public class WordFilter {

public WordFilter() {
}

public static void main(String[] args) {
String strOrg = "I want to visit the shops today and buy some clothes and";
String filterThis[] = {" at " , " the " , " and "};
System.out.println("before: "+strOrg);

for(int iFilt = 0; iFilt<filterThis.length; iFilt++) {
int nWordLength = filterThis[iFilt].length();
int nFoundPos = strOrg.indexOf(filterThis[iFilt]);
while (nFoundPos != -1){
System.out.println("found "+filterThis[iFilt]+" at position:"+nFoundPos);
strOrg = strOrg.substring(0,nFoundPos)+"
"+strOrg.substring(nFoundPos+nWordLength);
nFoundPos = strOrg.indexOf(filterThis[iFilt]);
}
}
System.out.println("after: "+strOrg);
}
}

Danny Woods · Sep 12, 2003

sfu said:
Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

The regular expressions package in java.util.regex sounds like what
you're after. Use Pattern.compile to create a Pattern object for a
given regular expression. Create a Matcher from that Pattern with
Pattern.matcher, and then replaceAll on that Matcher (replacing with
""). That matcher will come back with a suitably formed String.

A regex like "\b((and)|(or)|(at))\b" is a starting point.

Hope that helps,

Danny

William Brogden · Sep 12, 2003

sfu said:
Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

Then if i take a String "I want to visit the shops today and buy some
clothes", i would like to strip out the common words to become "visit
shops today buy clothes"

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

Any help is much appreciated, thanks

If I had that problem I would use a StringTokenizer to take apart
the String and a HashMap filled with the common words to detect
which to discard. Reassemble the words that pass in a StringBuffer.

WBB

sfu · Sep 12, 2003

Erwin Moller said:
sfu said:

Hi
I'd like to import a text file containing common words such as
"at","the","and" etc.

Then if i take a String "I want to visit the shops today and buy some
clothes", i would like to strip out the common words to become "visit
shops today buy clothes"

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

Any help is much appreciated, thanks

Click to expand...

Hi,

I am bored today anyway, so I made this little program.
Is this what you want?
Who is that chap Boyer Moore by the way?

Regards,
Erwin

public class WordFilter {

public WordFilter() {
}

public static void main(String[] args) {
String strOrg = "I want to visit the shops today and buy some clothes and";
String filterThis[] = {" at " , " the " , " and "};
System.out.println("before: "+strOrg);

for(int iFilt = 0; iFilt<filterThis.length; iFilt++) {
int nWordLength = filterThis[iFilt].length();
int nFoundPos = strOrg.indexOf(filterThis[iFilt]);
while (nFoundPos != -1){
System.out.println("found "+filterThis[iFilt]+" at position:"+nFoundPos);
strOrg = strOrg.substring(0,nFoundPos)+"
"+strOrg.substring(nFoundPos+nWordLength);
nFoundPos = strOrg.indexOf(filterThis[iFilt]);
}
}
System.out.println("after: "+strOrg);
}
}

Cool thats great, Boyer-Moore is a fast string searching algorithm
which i could replace indexOf to improve search times.

Thanks for everyones ideas and help

Roedy Green · Sep 12, 2003

I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But
im unsure how i would use this to remove all occurances of.

I would think the way you would implement this is to tokenise, break
the text into words.

You can do this with a Regex.split or a StringTokenizer, or with a
simple indexOf( ' ' );

Then you look up each word in a HashSet. Only the words you want to
expunge are there. Then write the words you keep to a new file.

See http://mindprod.com/jgloss/hashtable.html

and http://mindprod.com/fileio.html

S.A. Samokhodkin · Sep 13, 2003

sfu said:
Hi
...
I know to speed up searching for "at" within the string i can use a
Boyer Moore algorithm which returns the index of the first "at". But

Using BM for searching the "at" wouldn't speed up things. Actually, it
would significantly slow them down. If the speed is of concern, you
should look into building a DFA from the words to search.

Regards

sfu · Sep 13, 2003

S.A. Samokhodkin said:
Using BM for searching the "at" wouldn't speed up things. Actually, it
would significantly slow them down. If the speed is of concern, you
should look into building a DFA from the words to search.

Regards

it would slow things down? doh! i thought it was meant to be one of
the fastest string searching algorithms, especially over java's
indexOf.

sfu · Sep 13, 2003

S.A. Samokhodkin said:
Using BM for searching the "at" wouldn't speed up things. Actually, it
would significantly slow them down. If the speed is of concern, you
should look into building a DFA from the words to search.

Regards

yeah im looking for a fast way of doing this because i want to strip
out all the common words from an essay which could be > 2000 words for
example.

excuse me for my ignorance but what is a DFA?

thanks for your suggestions thou!

Gordon Beaton · Sep 13, 2003

it would slow things down? doh! i thought it was meant to be one of
the fastest string searching algorithms, especially over java's
indexOf.

Like many things, it depends.

BM *is* fast, for long search strings. The longer the better in fact.

For search strings of length 1 it has the same complexity as
String.indexOf(), but with a higher constant factor due to some added
complexity in the search loop.

BM also has a high initial cost associated with setting up the two
delta tables, so for short strings it loses unless you are searching
for the same string many times in a relatively large body of text, and
have an implementation that lets you reuse the delta tables (to take
advantage of the fact that the search string hasn't changed).

String.indexOf() is fine when the body of text isn't huge and the
search string is relatively short.

/gordon

Roedy Green · Sep 13, 2003

it would slow things down? doh! i thought it was meant to be one of
the fastest string searching algorithms, especially over java's
indexOf.

See http://mindprod.com/products.html#BOYER

It is only faster in circumstance where the overhead to set it up pays
off.

There is a variant that can search for many words at once.

Note you are not looking for "the" but " the ". If you looked for
"the" you would also hit "there".

Roedy Green · Sep 13, 2003

yeah im looking for a fast way of doing this because i want to strip
out all the common words from an essay which could be > 2000 words for
example.

No matter which recommended technique you select, it will be over in a
blink of an eye. The only time consuming part will be reading and
writing and that will be under a second.

By far the most time consuming part of the task will be firing up
java.exe.

sfu · Sep 13, 2003

William Brogden said:
If I had that problem I would use a StringTokenizer to take apart
the String and a HashMap filled with the common words to detect
which to discard. Reassemble the words that pass in a StringBuffer.

WBB

while(essayTokenizer.hasMoreTokens()) {
String currentToken = essayTokenizer.nextToken();
if (currentToken.equals(hashTable.get(currentToken))) {
System.out.println("removed: " + currentToken);
} else {
strippedText += currentToken + " ";
}
}

does anyone know if its faster to use
strippedText += currentToken + " ";
or
text = new StringBuffer().append(currentToken + " ")
to create the stripped body of text over say 2000 tokens for example?

sfu · Sep 13, 2003

Roedy Green said:
No matter which recommended technique you select, it will be over in a
blink of an eye. The only time consuming part will be reading and
writing and that will be under a second.

By far the most time consuming part of the task will be firing up
java.exe.

oh right, cool!

Chris Uppal · Sep 14, 2003

Roedy said:
By far the most time consuming part of the task will be firing up
java.exe.

By the sound of it, by far the most time consuming thing will be writing the
blasted program in the first place...

(Which, btw, is a serious point for sfu -- trying to make a program run fast is
nearly always a time-wasting academic exercise unless the program *actually*
runs too slowly for your needs.)

-- chris

William Brogden · Sep 14, 2003

sfu said:
"William Brogden" <[email protected]> wrote in message

while(essayTokenizer.hasMoreTokens()) {
String currentToken = essayTokenizer.nextToken();
if (currentToken.equals(hashTable.get(currentToken))) {
System.out.println("removed: " + currentToken);
} else {
strippedText += currentToken + " ";
}
}

does anyone know if its faster to use
strippedText += currentToken + " ";
or
text = new StringBuffer().append(currentToken + " ")
to create the stripped body of text over say 2000 tokens for example?

GAH! why would you want to do that? Create one StringBuffer before entering
the loop, use the size of the input String so StringBuffer won't have to
enlarge itself.

Also, as Roedy suggested, put your words to be dropped in a HashSet, then
you can use the very fast contains( obj ) method. You may want to use
toLowerCase on your string tokens and have all lower case words in the
HashSet.

Another thought - if your input string includes punctuation, be sure to
create
the StringTokenizer with the punctuation as tokens - otherwise you will
get tokens consisting of words plus punctuation and the match will fail.

Bill

Finding K most common words from a collection of Documents.	0	Feb 19, 2010
Copy string from 2D array to a 1D array in C	1	Nov 1, 2023
Replace stop words (remove words from a string)	6	Jan 17, 2008
Remove repeated words from a file	3	Sep 18, 2009
String and list error while running a Markov Chain	1	Aug 26, 2020
remove certain words from a c++ string	18	Oct 23, 2005
Remove short words from a string	6	Oct 25, 2006
Initial letters of the words in a string	9	Sep 8, 2011

remove common words from a string

sfu

Erwin Moller

Danny Woods

William Brogden

sfu

Roedy Green

S.A. Samokhodkin

sfu

sfu

Gordon Beaton

Roedy Green

Roedy Green

sfu

sfu

Chris Uppal

William Brogden

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads