Searching and replacing multiple strings in text

Discussion in 'Java' started by Chris, Oct 9, 2003.

  1. Chris

    Chris Guest

    Suppose I have a String like:

    "The quick brown fox jumped over the lazy dog's back."

    Suppose I wanted to make the following multiple replacements in the
    String:

    "quick" -> "slow"
    "brown" -> "red"
    "jump" -> "walk"
    "over" -> "under"
    "back" -> "stomach"

    After making the replacements, the String becomes:

    "The slow red fox walked under the lazy dog's stomach."

    It would be easy to just scan the string N times if we're replacing N
    words but I'd like a way that scans the string only once regardless of
    how many words we're replacing. I'm sure tons of programmers have had
    need for this and I'm wondering if there's already some existing code
    on the net for it.

    If not, then what are the best data structures and classes to use for
    this? For example, should the collection of words to replace just be a
    String[] and the collection of replacements just be a String[], so
    that e.g. wordsToReplace is replaced by replacement, and then
    you do the search and replace algorithm something loosely along the
    idea of:

    0. assume an input String, an output StringBuffer
    1. set c = current input text character
    2. if c is the first character of any of the words in wordsToReplace[]
    then
    2a. check if the next input character is the next character of
    any of the words in our input string and so forth....
    if the input word is a match, append the replacement
    to the output StringBuffer.
    otherwise, append the character c from step 1 to the
    StringBuffer and
    go back to step 1. with the input character
    advanced to the next character.

    I'm being very sloppy and crude here and only trying to illustrate the
    idea, because actually programming the above function while making
    sure you handle all the out-of-bounds cases, etc., would be tricky and
    time-consuming, and I want to post here for a pre-canned solution or
    something easy I haven't thought of yet before going to all the
    trouble.

    Thanks very much for any replies.

    Chris
     
    Chris, Oct 9, 2003
    #1
    1. Advertising

  2. Chris wrote:

    > It would be easy to just scan the string N times if we're replacing N
    > words but I'd like a way that scans the string only once regardless of
    > how many words we're replacing. I'm sure tons of programmers have had
    > need for this and I'm wondering if there's already some existing code
    > on the net for it.
    >
    > If not, then what are the best data structures and classes to use for
    > this? For example, should the collection of words to replace just be a
    > String[] and the collection of replacements just be a String[], so
    > that e.g. wordsToReplace is replaced by replacement, and then
    > you do the search and replace algorithm something loosely along the
    > idea of:
    >
    > 0. assume an input String, an output StringBuffer
    > 1. set c = current input text character
    > 2. if c is the first character of any of the words in wordsToReplace[]
    > then
    > 2a. check if the next input character is the next character of
    > any of the words in our input string and so forth....
    > if the input word is a match, append the replacement
    > to the output StringBuffer.
    > otherwise, append the character c from step 1 to the
    > StringBuffer and
    > go back to step 1. with the input character
    > advanced to the next character.
    >


    This is a horribly inefficient algorithm, almost as bad as looping
    through the string multiple times. You should use an algorithm that is
    designed to find multiple strings in one pass, instead of trying to
    modify a brute-force algorithm to do what it is not supposed to do. A
    finite state machine will find multiple strings at the cost of one. You
    need to build the FSM of course, which takes some time, but you only
    need to do it once. The searching is done in linear time. Try googling
    for Aho Corasick (or Aho Korasick) bibliographic search algorithm for
    more information.

    --
    Daniel Sjöblom
     
    =?ISO-8859-1?Q?Daniel_Sj=F6blom?=, Oct 10, 2003
    #2
    1. Advertising

  3. Chris

    Roedy Green Guest

    On 9 Oct 2003 15:37:28 -0700, (Chris) wrote or
    quoted :

    >Suppose I wanted to make the following multiple replacements in the
    >String:
    >
    >"quick" -> "slow"
    >"brown" -> "red"
    >"jump" -> "walk"
    >"over" -> "under"
    >"back" -> "stomach"


    We had this question a few weeks back and many different solutions
    were posted. I suggested creating a Hashtable of the word pairs to be
    replaced.

    Then use a regex to split the sentence into words. Look up each word,
    and if you find it, append the transform to a StringBuffer. If not
    append the original word (and a space).

    See http://mindprod.com/jgloss/regex.html

    For how to split.

    Alternatively write your own mini parser that uses String.charAt to
    compose words in StringBuffers by whatever rules you want.

    For an original estimate of the StringBuffer length use the original
    sentence length + 10% where 10% is a tweaking factor (optimised by
    experiment if you are anally retentive.)

    See http://mindprod.com/jgloss/tweakable.html

    If you want to find the original thread, look for Boyer-Moore.
    Someone likely mentioned that approach.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 10, 2003
    #3
  4. Chris

    Eric Sosman Guest

    Roedy Green wrote:
    >
    > On 9 Oct 2003 15:37:28 -0700, (Chris) wrote or
    > quoted :
    >
    > >Suppose I wanted to make the following multiple replacements in the
    > >String:
    > >
    > >"quick" -> "slow"
    > >"brown" -> "red"
    > >"jump" -> "walk"
    > >"over" -> "under"
    > >"back" -> "stomach"

    >
    > We had this question a few weeks back and many different solutions
    > were posted. I suggested creating a Hashtable of the word pairs to be
    > replaced.
    >
    > Then use a regex to split the sentence into words. Look up each word,
    > and if you find it, append the transform to a StringBuffer. If not
    > append the original word (and a space).


    This wouldn't quite work in the present instance, because
    of the O.P.'s illustrative transformation:

    > "The quick brown fox jumped over the lazy dog's back."
    > "The slow red fox walked under the lazy dog's stomach."


    Note the change of "jumped" to "walked" even though
    this pair of words is not present in the replacement list.

    --
     
    Eric Sosman, Oct 10, 2003
    #4
  5. Chris

    Roedy Green Guest

    On Fri, 10 Oct 2003 15:50:32 -0400, Eric Sosman <>
    wrote or quoted :

    >
    > Note the change of "jumped" to "walked" even though
    >this pair of words is not present in the replacement list.


    In that case you want a multiple boyer moore than can search for a
    list of words in one pass.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Oct 10, 2003
    #5
  6. Daniel Sjöblom <_NOSPAM> wrote in message news:<3f86d5e5$0$16834$>...

    > This is a horribly inefficient algorithm, almost as bad as looping
    > through the string multiple times. You should use an algorithm that is
    > designed to find multiple strings in one pass, instead of trying to
    > modify a brute-force algorithm to do what it is not supposed to do. A
    > finite state machine will find multiple strings at the cost of one. You
    > need to build the FSM of course, which takes some time, but you only
    > need to do it once. The searching is done in linear time. Try googling
    > for Aho Corasick (or Aho Korasick) bibliographic search algorithm for
    > more information.


    I did actually find mention of the Aho-Corasick algorithm last night
    and also the Commentz-Walter algorithm which achieves the same thing.

    I have found Java implementations of neither though. Commentz-Walter
    is supposed to be substantially faster than Aho-Corasick in practice.
    There's a C implementation of Commentz-Walter in GNU fgrep 2.0 and
    above that I could use as a starting point but it still would be a
    waste of time if there's already a Java implementation floating around
    the net.

    So my original question about not reinventing the wheel and going to
    all the effort to get an efficient multiple-search and replace still
    stands.

    ---Chris
     
    Christopher R. Barry, Oct 11, 2003
    #6
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Christopher R. Barry

    Re: Searching/replacing multiple words in input.

    Christopher R. Barry, Oct 11, 2003, in forum: Java
    Replies:
    1
    Views:
    2,065
    John Kordyback
    Oct 12, 2003
  2. Robert
    Replies:
    9
    Views:
    457
    nice.guy.nige
    Dec 1, 2004
  3. Oltmans

    Searching and replacing text ?

    Oltmans, May 2, 2008, in forum: Python
    Replies:
    2
    Views:
    252
  4. lles
    Replies:
    0
    Views:
    282
  5. Rob Meade

    Replacing - and not Replacing...

    Rob Meade, Apr 5, 2005, in forum: ASP General
    Replies:
    5
    Views:
    297
    Chris Hohmann
    Apr 11, 2005
Loading...

Share This Page