Searching and replacing multiple strings in text

C

Chris

Suppose I have a String like:

"The quick brown fox jumped over the lazy dog's back."

Suppose I wanted to make the following multiple replacements in the
String:

"quick" -> "slow"
"brown" -> "red"
"jump" -> "walk"
"over" -> "under"
"back" -> "stomach"

After making the replacements, the String becomes:

"The slow red fox walked under the lazy dog's stomach."

It would be easy to just scan the string N times if we're replacing N
words but I'd like a way that scans the string only once regardless of
how many words we're replacing. I'm sure tons of programmers have had
need for this and I'm wondering if there's already some existing code
on the net for it.

If not, then what are the best data structures and classes to use for
this? For example, should the collection of words to replace just be a
String[] and the collection of replacements just be a String[], so
that e.g. wordsToReplace is replaced by replacement, and then
you do the search and replace algorithm something loosely along the
idea of:

0. assume an input String, an output StringBuffer
1. set c = current input text character
2. if c is the first character of any of the words in wordsToReplace[]
then
2a. check if the next input character is the next character of
any of the words in our input string and so forth....
if the input word is a match, append the replacement
to the output StringBuffer.
otherwise, append the character c from step 1 to the
StringBuffer and
go back to step 1. with the input character
advanced to the next character.

I'm being very sloppy and crude here and only trying to illustrate the
idea, because actually programming the above function while making
sure you handle all the out-of-bounds cases, etc., would be tricky and
time-consuming, and I want to post here for a pre-canned solution or
something easy I haven't thought of yet before going to all the
trouble.

Thanks very much for any replies.

Chris
 
?

=?ISO-8859-1?Q?Daniel_Sj=F6blom?=

Chris said:
It would be easy to just scan the string N times if we're replacing N
words but I'd like a way that scans the string only once regardless of
how many words we're replacing. I'm sure tons of programmers have had
need for this and I'm wondering if there's already some existing code
on the net for it.

If not, then what are the best data structures and classes to use for
this? For example, should the collection of words to replace just be a
String[] and the collection of replacements just be a String[], so
that e.g. wordsToReplace is replaced by replacement, and then
you do the search and replace algorithm something loosely along the
idea of:

0. assume an input String, an output StringBuffer
1. set c = current input text character
2. if c is the first character of any of the words in wordsToReplace[]
then
2a. check if the next input character is the next character of
any of the words in our input string and so forth....
if the input word is a match, append the replacement
to the output StringBuffer.
otherwise, append the character c from step 1 to the
StringBuffer and
go back to step 1. with the input character
advanced to the next character.


This is a horribly inefficient algorithm, almost as bad as looping
through the string multiple times. You should use an algorithm that is
designed to find multiple strings in one pass, instead of trying to
modify a brute-force algorithm to do what it is not supposed to do. A
finite state machine will find multiple strings at the cost of one. You
need to build the FSM of course, which takes some time, but you only
need to do it once. The searching is done in linear time. Try googling
for Aho Corasick (or Aho Korasick) bibliographic search algorithm for
more information.
 
R

Roedy Green

Suppose I wanted to make the following multiple replacements in the
String:

"quick" -> "slow"
"brown" -> "red"
"jump" -> "walk"
"over" -> "under"
"back" -> "stomach"

We had this question a few weeks back and many different solutions
were posted. I suggested creating a Hashtable of the word pairs to be
replaced.

Then use a regex to split the sentence into words. Look up each word,
and if you find it, append the transform to a StringBuffer. If not
append the original word (and a space).

See http://mindprod.com/jgloss/regex.html

For how to split.

Alternatively write your own mini parser that uses String.charAt to
compose words in StringBuffers by whatever rules you want.

For an original estimate of the StringBuffer length use the original
sentence length + 10% where 10% is a tweaking factor (optimised by
experiment if you are anally retentive.)

See http://mindprod.com/jgloss/tweakable.html

If you want to find the original thread, look for Boyer-Moore.
Someone likely mentioned that approach.
 
E

Eric Sosman

Roedy said:
We had this question a few weeks back and many different solutions
were posted. I suggested creating a Hashtable of the word pairs to be
replaced.

Then use a regex to split the sentence into words. Look up each word,
and if you find it, append the transform to a StringBuffer. If not
append the original word (and a space).

This wouldn't quite work in the present instance, because
of the O.P.'s illustrative transformation:
"The quick brown fox jumped over the lazy dog's back."
"The slow red fox walked under the lazy dog's stomach."

Note the change of "jumped" to "walked" even though
this pair of words is not present in the replacement list.
 
R

Roedy Green

Note the change of "jumped" to "walked" even though
this pair of words is not present in the replacement list.

In that case you want a multiple boyer moore than can search for a
list of words in one pass.
 
C

Christopher R. Barry

Daniel Sjöblom said:
This is a horribly inefficient algorithm, almost as bad as looping
through the string multiple times. You should use an algorithm that is
designed to find multiple strings in one pass, instead of trying to
modify a brute-force algorithm to do what it is not supposed to do. A
finite state machine will find multiple strings at the cost of one. You
need to build the FSM of course, which takes some time, but you only
need to do it once. The searching is done in linear time. Try googling
for Aho Corasick (or Aho Korasick) bibliographic search algorithm for
more information.

I did actually find mention of the Aho-Corasick algorithm last night
and also the Commentz-Walter algorithm which achieves the same thing.

I have found Java implementations of neither though. Commentz-Walter
is supposed to be substantially faster than Aho-Corasick in practice.
There's a C implementation of Commentz-Walter in GNU fgrep 2.0 and
above that I could use as a starting point but it still would be a
waste of time if there's already a Java implementation floating around
the net.

So my original question about not reinventing the wheel and going to
all the effort to get an efficient multiple-search and replace still
stands.

---Chris
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top