Help: what is the quickest way to find out whether a string contains another string?

Discussion in 'Java' started by tuweiwen@gmail.com, Dec 5, 2005.

  1. Guest

    Hi all, can anyone give me a hint on what approach I should take to
    this simple problem: what is the quickest way to find out whether a
    string contains another string? For example,

    String a = "a very long string, UP TO 100k";
    String b = "a string of two or three words";

    //---- now I need to write this function
    boolean a_contains_b (String a, String b) {
    //I don't need to worry about the location of b, just want to know
    //whether a has b inside it. Returns true or false
    }

    This function will be called Thousands of times in an app.

    What's best way to handle this to achieve best performance?

    Currently I am using "if (a.indexOf(b) !=-1) " in the above
    "a_contains_b()" function. But I don't like it at all. It feels dumb,
    isn't it? Do I have the wrong approach at the outset?

    Thanks a lot. I appreciate your help.
    , Dec 5, 2005
    #1
    1. Advertising

  2. Re: Help: what is the quickest way to find out whether a string containsanother string?

    wrote:
    > Hi all, can anyone give me a hint on what approach I should take to
    > this simple problem: what is the quickest way to find out whether a
    > string contains another string? For example,
    >
    > String a = "a very long string, UP TO 100k";
    > String b = "a string of two or three words";
    >
    > //---- now I need to write this function
    > boolean a_contains_b (String a, String b) {
    > //I don't need to worry about the location of b, just want to know
    > //whether a has b inside it. Returns true or false
    > }
    >
    > This function will be called Thousands of times in an app.
    >
    > What's best way to handle this to achieve best performance?
    >
    > Currently I am using "if (a.indexOf(b) !=-1) " in the above
    > "a_contains_b()" function. But I don't like it at all. It feels dumb,
    > isn't it? Do I have the wrong approach at the outset?
    >
    > Thanks a lot. I appreciate your help.
    >


    Unless I'm missing something wouldn't the contains() method of String
    class be what you want??

    boolean contains(CharSequence s)
    Returns true if and only if this string contains the
    specified sequence of char values.
    Brandon McCombs, Dec 5, 2005
    #2
    1. Advertising

  3. Rhino Guest

    Re: what is the quickest way to find out whether a string contains another string?

    <> wrote in message
    news:...
    > Hi all, can anyone give me a hint on what approach I should take to
    > this simple problem: what is the quickest way to find out whether a
    > string contains another string? For example,
    >
    > String a = "a very long string, UP TO 100k";
    > String b = "a string of two or three words";
    >
    > //---- now I need to write this function
    > boolean a_contains_b (String a, String b) {
    > //I don't need to worry about the location of b, just want to know
    > //whether a has b inside it. Returns true or false
    > }
    >
    > This function will be called Thousands of times in an app.
    >
    > What's best way to handle this to achieve best performance?
    >
    > Currently I am using "if (a.indexOf(b) !=-1) " in the above
    > "a_contains_b()" function. But I don't like it at all. It feels dumb,
    > isn't it? Do I have the wrong approach at the outset?
    >
    > Thanks a lot. I appreciate your help.
    >

    One approach would be to use "regular expressions", often abbreviated
    "regexp". If you look at the Pattern class in the J2SE API, you'll find an
    overview describing how to use them. A full tutorial on regular expressions
    can be found in the online version of the Java Tutorial at
    http://java.sun.com/docs/books/tutorial/extra/regex/index.html.

    Rhino
    Rhino, Dec 5, 2005
    #3
  4. Roedy Green Guest

    On 4 Dec 2005 18:47:20 -0800, wrote, quoted or
    indirectly quoted someone who said :

    >Hi all, can anyone give me a hint on what approach I should take to
    >this simple problem: what is the quickest way to find out whether a
    >string contains another string? For example,


    an approach is to do a quick check before you scan.

    You break your big string into words with a regex split and put each
    word into a HashSet.

    Then you break you short string into words with that same precompiled
    regex and look up each word in turn in the HashSet. If any word is
    not found, there is no point in further processing. If you do find all
    three, then use indexOf or contains.

    If you want to get fancy, use a hashMap to track the offset of the
    FIRST ocurrence of each word. Then you take the min of the three
    offsets for starting off your indexof search and the max of the three
    for your stopping search point.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Dec 5, 2005
    #4
  5. Guest

    Thanks a lot, guys. Roedy, I will try that approach later.

    I did a little testing on the performance and got an interesting
    result. Here is the program.


    import java.util.*;
    /*
    This is an un-scientific, probably way off-mark, testing
    comparing the speed of three functions:
    String.contains(), String.indexOf() and String.matches().
    Note that String.matches() use regular expression approach
    according to the java doc

    Result: the speeds of indexOf() and contains() are about the same,
    but matches() is at least ten times slower.
    Is regular expression bad in terms of performance?

    Problems with this tesing:
    1. The target string a is not long enough. As I said, in my app, it
    can be up to 100k. The performance may change if string a gets bigger.
    2. In all loops, string a and string b are the same. In my app, at the
    least string a is always changing, such as newly downloaded web pages.
    3. Right now, it always matches inside the loop. should do a randomized
    String a and string b.

    Testing machine: WinXP Pro, P3 500, 256MB Ram, JDK 1.5.0_05
    */

    public class Test {

    public static void main(String args[]) {

    int loop = 1000000;

    int matches = 0;

    String a = "This is a testing string. It is not very long.";
    String b = "is not";

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < loop; i++) {
    if (a.contains(b)) matches++; //took about 1620 miliseconds
    //if (a.matches(".*not.*")) matches++; //took about 25200
    miliseconds
    //if (a.indexOf(b) != -1) matches++; //took about 1520 miliseconds
    }

    long endTime = System.currentTimeMillis();

    System.out.println("\nRunning " + loop + " times took " + (endTime -
    startTime) + " miliseconds.");
    }

    }



    right now, this part of my app (processing the downloaded web pages)
    takes lots of time. so i am looking for ways to speed it up.
    , Dec 5, 2005
    #5
  6. Guest

    This time I use explicit regular expression instead of
    String.matches(), and pre-compile the regex. Speed gets better a little
    bit.
    Just replace the old main() method with this new one. Also need import
    java.util.regex.*; at the top.


    public static void main(String args[]) {
    int loop = 1000000;
    int matches = 0;
    String a = "This is a testing string. It is not very long.";
    String b = "is not";

    Pattern pat = Pattern.compile(".*is not.*");
    Matcher mat;

    long startTime = System.currentTimeMillis();

    for (int i = 0; i < loop; i++) {
    //if (a.contains(b)) matches++; //took about 1620 miliseconds
    //if (a.indexOf(b) != -1) matches++; //took about 1520
    miliseconds
    if (pat.matcher(a).find()) matches++; //took about 18400 milis

    }
    long endTime = System.currentTimeMillis();
    System.out.println("\nRunning " + loop + " times took " + (endTime -
    startTime) + " miliseconds.");
    }
    , Dec 5, 2005
    #6
  7. Re: Help: what is the quickest way to find out whether a string containsanother string?

    wrote:
    > What's best way to handle this to achieve best performance?


    "best" always depends on the context, the word as such is rather
    meaningless. I assume you mean fast. Then, contrary to other
    suggestions, regexps are not a good choice at all. We have discussed
    this in the past.

    Your problem is a standard problem in CS, it has been extensively
    studied and many textbooks discuss it. Look for the following keywords
    in books, Google, etc.:

    Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.

    There are for sure more such algorithms.

    /Thomas
    --
    The comp.lang.java.gui FAQ:
    ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
    http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
    Thomas Weidenfeller, Dec 5, 2005
    #7
  8. Chris Uppal Guest

    wrote:

    > Result: the speeds of indexOf() and contains() are about the same,
    > but matches() is at least ten times slower.
    > Is regular expression bad in terms of performance?


    Yes.

    Regexp matching is an order of magnitude more complex than a simple scan of the
    string. Although the time (for normal regexps) is linear in the length of the
    String, just as for naive scanning, the constants are significantly greater --
    as you have found.

    Another problem, by the way, is that if you are looking for a string like "this
    is *not* easy", then you'll have to pre-process the search string to escape the
    special characters.

    You've already been pointed at better algorithms for searching strings than a
    naive scan. I just wanted to add a few warnings. One is that you should
    verify that string matching is /actually/ the bottleneck in your application
    before wasing too much time looking for alternatives. Another is that all
    these algorithms have significant start-up time, so you should be careful to
    measure using representative data -- a string literal that you can type into a
    Java program is /not/ representative of a 100K String. (BTW it's distinctly
    unusual to have such long strings in a program, how do they come above ?) The
    last is that if you are looking at Roedy's BMP implementation, that is
    artificially restricted[*] to chararacters in the range 0...255 in both the
    search and target strings.

    ([*] Or it was when I last looked, and the restriction is -- in fact --
    justifiable, but it may not rule out that implementation for your purposes.)

    -- chris
    Chris Uppal, Dec 5, 2005
    #8
  9. zero Guest

    Thomas Weidenfeller <> wrote in news:dn0u58$ika$1
    @news.al.sw.ericsson.se:

    >
    > Your problem is a standard problem in CS, it has been extensively
    > studied and many textbooks discuss it. Look for the following keywords
    > in books, Google, etc.:
    >
    > Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.
    >


    I find my knowledge of such standard algorithms to be quite limited. Can
    you suggest a good source - online or in print - to broaden my knowledge?

    TIA
    zero

    --
    Beware the False Authority Syndrome
    zero, Dec 5, 2005
    #9
  10. Chris Uppal Guest

    zero wrote:

    > > Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.


    > I find my knowledge of such standard algorithms to be quite limited. Can
    > you suggest a good source - online or in print - to broaden my knowledge?


    The wikipedia article:

    http://en.wikipedia.org/wiki/String_searching_algorithm

    seems like a fair starting point on string searching. It has links to the
    wikipedia entries on BMP and RK algorithms, and som external links too.

    If you'd rather read a book (or would like to read a book too), then I
    recommend Robert Sedgewick's algorithm books as particularly readable surveys
    of algorithms in common use. His 1-volumne "Algorithms in <language>" books
    have several pages on string searching, and /lots/ of other stuff that you
    might find interesting. He's also in the process of putting out an expanded
    N-volume version of "Algorithms in <language>", unfortunately only two volumes
    have come out so far, and I think his expanded coverage of string searching
    won't appear until in vol.3 comes out (which Amazon seem to think will be in
    next July).

    Knuth is, of course, the ultimate algorithm book, but Sedgewick is also an
    authoritative and knowledgable specialist -- and his books are a /lot/ easier
    to read being (a) better explained and motivated, (b) aimed at practising
    programmers rather than (near-)mathematicians, (c) much shorter...

    -- chris
    Chris Uppal, Dec 5, 2005
    #10
  11. Guest

    Thomas thanks for the tip. I went on to download and run Roedy's Boyer
    program and it works great. I used quite large strings of Chinese
    characters (target string of size 7KB and pattern string of 50
    character-long, to avoid it falling back to String.indexOf() . I am
    running a Chinese version of WinXP) to test it, and it works too.
    Thanks Roedy.

    Chris, the strings are downloaded web pages in memory. What my app (a
    spider) does is to find out if a page contains one of several pattern
    strings, if it is, then it's saved to disk. I would like to know if my
    current approach is an efficient one.
    , Dec 5, 2005
    #11
  12. Roedy Green Guest

    On 4 Dec 2005 23:50:06 -0800, wrote, quoted or
    indirectly quoted someone who said :

    >
    >Result: the speeds of indexOf() and contains() are about the same,
    >but matches() is at least ten times slower.
    >Is regular expression bad in terms of performance?


    Don't you believe your own experiment?

    In general the more specialised a tool is to the task at hand the
    faster it will be.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Dec 5, 2005
    #12
  13. Roedy Green Guest

    On Mon, 5 Dec 2005 10:27:51 -0000, "Chris Uppal"
    <-THIS.org> wrote, quoted or indirectly
    quoted someone who said :

    >You've already been pointed at better algorithms for searching strings than a
    >naive scan. I just wanted to add a few warnings.

    see http://mindprod.com/products1.html#BOYER
    for a faster indexOf
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Dec 5, 2005
    #13
  14. Chris Smith Guest

    Re: what is the quickest way to find out whether a string contains another string?

    Rhino <> wrote:
    > One approach would be to use "regular expressions", often abbreviated
    > "regexp". If you look at the Pattern class in the J2SE API, you'll find an
    > overview describing how to use them.


    Why would anyone want to use regular expressions to look for a plain
    String? Regular expressions are a tool for deciding whether a String or
    some subset thereof matches a defined regular language. Although string
    searching is a trivial degeneration of the problem addressed by regular
    expressions, there's no need to introduce the added complexity.

    If you are set on using regular expressions, converting the arbitrary
    input string into a pattern that matches only itself is the first task.
    Prior to Java 1.5, this was very difficult. In 1.5, it's encapsulated
    in a method called Pattern.quote(String). Once that's done, you can
    search for that pattern. It's far easier, though, to use String.indexOf
    (prior to 1.5) or String.contains (as of 1.5).

    --
    www.designacourse.com
    The Easiest Way To Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
    Chris Smith, Dec 5, 2005
    #14
  15. Chris Smith Guest

    <> wrote:
    > Chris, the strings are downloaded web pages in memory. What my app (a
    > spider) does is to find out if a page contains one of several pattern
    > strings, if it is, then it's saved to disk. I would like to know if my
    > current approach is an efficient one.


    (Realizing your addressing your comment to Chris Uppal, but I'll respond
    anyway. Hope you don't mind.)

    In this case, you can be practically guaranteed that String matching is
    not your bottleneck. It will be much faster than retrieving those same
    strings over a network connection.

    --
    www.designacourse.com
    The Easiest Way To Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
    Chris Smith, Dec 5, 2005
    #15
  16. Rhino Guest

    Re: what is the quickest way to find out whether a string contains another string?

    "Chris Smith" <> wrote in message
    news:...
    > Rhino <> wrote:
    >> One approach would be to use "regular expressions", often abbreviated
    >> "regexp". If you look at the Pattern class in the J2SE API, you'll find
    >> an
    >> overview describing how to use them.

    >
    > Why would anyone want to use regular expressions to look for a plain
    > String? Regular expressions are a tool for deciding whether a String or
    > some subset thereof matches a defined regular language. Although string
    > searching is a trivial degeneration of the problem addressed by regular
    > expressions, there's no need to introduce the added complexity.
    >
    > If you are set on using regular expressions, converting the arbitrary
    > input string into a pattern that matches only itself is the first task.
    > Prior to Java 1.5, this was very difficult. In 1.5, it's encapsulated
    > in a method called Pattern.quote(String). Once that's done, you can
    > search for that pattern. It's far easier, though, to use String.indexOf
    > (prior to 1.5) or String.contains (as of 1.5).
    >

    I didn't say regular expressions were the _best_ way, just that they were
    an option.

    The original poster mentioned that he was searching very long strings for
    other strings. I don't know if regular contains() or similar methods have
    length limitations and I've never looked into whether the more traditional
    methods are more or less efficient than techniques based on regular
    expressions. Offhand, I suspected that regular expressions might be more
    efficient on longer strings but the other replies to this thread indicate
    that regular expressions are actually _less_ efficient that contains() or
    its brethren.

    Therefore, I learned something along the way, too.

    Rhino
    Rhino, Dec 5, 2005
    #16
  17. Alan Moore Guest

    Re: what is the quickest way to find out whether a string contains another string?

    I'd like to add a note about the inefficiency of regexes. The regex
    ".*is not.*" is inherently slow, and not a fair test of regex speed.
    Adding the dot-stars before and after the real search string is a hack
    for when you can only use String#matches(). When using the find()
    method, you can get rid of them and the search will go much faster.
    When I ran the OP's second test program with the regex "is not", it
    took 1052ms (compared to 4486ms for ".*is not.*"). That was still
    three times as long as the contains() or indexOf() approaches, so I'm
    not pushing for regex use here. Just setting the record straight.
    Alan Moore, Dec 6, 2005
    #17
  18. "Thomas Weidenfeller" <> schrieb im Newsbeitrag
    news:dn0u58$ika$...
    > wrote:
    >> What's best way to handle this to achieve best performance?

    >
    > "best" always depends on the context, the word as such is rather
    > meaningless. I assume you mean fast. Then, contrary to other suggestions,
    > regexps are not a good choice at all. We have discussed this in the past.


    Agree, best is ambiguous. But the message subject line contains 'quickest',
    so the question must be runtime.

    Hiran
    Hiran Chaudhuri, Dec 6, 2005
    #18
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Steve Edwards
    Replies:
    13
    Views:
    466
    Kai-Uwe Bux
    Mar 1, 2006
  2. Borse, Ganesh
    Replies:
    0
    Views:
    220
    Borse, Ganesh
    Oct 30, 2007
  3. Borse, Ganesh
    Replies:
    2
    Views:
    1,596
    Dennis Lee Bieber
    Oct 30, 2007
  4. Replies:
    3
    Views:
    83
    pinki
    Mar 13, 2006
  5. dblock
    Replies:
    2
    Views:
    645
    Simon Krahnke
    Oct 9, 2011
Loading...

Share This Page