Speeding up matching Strings in a set

Discussion in 'Java' started by softwarepearls_com, Oct 24, 2008.

  1. Let's say you had a set of Strings (I deliberately use lower case, so
    not necessarily java.util.Set), and you wanted to know if String x was
    in this set.. what would be the fastest way?

    Anyone have any experience comparing the performance of
    HashSet<String> against regular expressions?
     
    softwarepearls_com, Oct 24, 2008
    #1
    1. Advertising

  2. softwarepearls_com

    Lew Guest

    On Oct 24, 12:26 pm, softwarepearls_com <>
    wrote:
    > Let's say you had a set of Strings (I deliberately use lower case, so
    > not necessarily java.util.Set), and you wanted to know if String x was
    > in this set.. what would be the fastest way?
    >
    > Anyone have any experience comparing the performance of
    > HashSet<String> against regular expressions?


    I recommend that you write a benchmark to compare the results.

    I predict that the HashSet will win hands down. First question I
    have, how would you implement a structure that used regular
    expressions (regexes) to retrieve the String?

    Any use of regexes must be slower. Strings hashCodes, if computed on
    demand and not cached in the String, require a computation based on
    each char in the String. Likewise for a regex comparison. The
    hashCode need not be parsed, unlike the regex, and comparisons based
    on hashCode use int comparison, unlike regexes, and a contains() call
    into a HashSet is O(1), which I predict no regex-based comparison will
    beat.

    But really, I need to know what you mean by using regexes to tell if a
    String is in the collection. How would you do that?

    If I see an SSCCE, I can tell if it is O(1) for contains() or not.

    Please provide an SSCCE of a regex-based 'contains()' operation.

    --
    Lew
     
    Lew, Oct 24, 2008
    #2
    1. Advertising

  3. On 24.10.2008 18:50, Lew wrote:
    > On Oct 24, 12:26 pm, softwarepearls_com <>
    > wrote:
    >> Let's say you had a set of Strings (I deliberately use lower case, so
    >> not necessarily java.util.Set), and you wanted to know if String x was
    >> in this set.. what would be the fastest way?
    >>
    >> Anyone have any experience comparing the performance of
    >> HashSet<String> against regular expressions?

    >
    > I recommend that you write a benchmark to compare the results.


    Absolutely. Results of this exercise is often educational.

    > I predict that the HashSet will win hands down. First question I
    > have, how would you implement a structure that used regular
    > expressions (regexes) to retrieve the String?


    I am speculating that the OP just would create a regexp that matches for
    _all_ strings. You can do some optimization for Java's NFA (i.e.
    combine matching prefixes) but I doubt this will work well for large sets.

    > Any use of regexes must be slower. Strings hashCodes, if computed on
    > demand and not cached in the String, require a computation based on
    > each char in the String.


    Last time I checked there was an optimization for larger strings that
    did not look at all the characters.

    > Likewise for a regex comparison. The
    > hashCode need not be parsed, unlike the regex, and comparisons based
    > on hashCode use int comparison, unlike regexes, and a contains() call
    > into a HashSet is O(1), which I predict no regex-based comparison will
    > beat.


    I would be careful with prediction when it comes to performance. I have
    often seen surprising results. And they may even change with JVM
    version. :)

    I'd say, this has to happen:

    HashSet<String>:

    Preparation
    1. create HashSet
    2. for all Strings add to set

    Test
    1. calculate input.hashCode() O(n)
    2. lookup bucket O(1)
    3. if empty => not found O(1)
    4. for all String s in bucket: O(n*m)
    if calculate s.equal(input) yields true => found
    5. not found O(1)


    Pattern:

    Preparation
    1. create a trie structure
    2. for all Strings add to trie
    3. traverse trie and generate pattern (will cause less backtracing!)

    1. create Matcher object O(1)
    2. traverse all chars in input and check pattern O(n)

    So it does not sound too unrealistic that regexp is not slower. But
    there are a lot of open questions:

    > But really, I need to know what you mean by using regexes to tell if a
    > String is in the collection. How would you do that?
    >
    > If I see an SSCCE, I can tell if it is O(1) for contains() or not.


    Big O is a good hint but the ultimate answer can only be found via
    benchmarking.

    I have more questions: how often is the set built? Is it static or
    rather dynamic? How many strings are expected to be in that set? How
    many tests are done? etc.

    > Please provide an SSCCE of a regex-based 'contains()' operation.


    Pseudo code

    // set is "barfly, barton, foo"
    private static final Pattern pat =
    Pattern.compile("(?:bar(?:fly|ton))|foo");
    ....

    public boolean check(CharSequence input) {
    return pat.matcher(input).matches();
    }

    Kind regards

    robert
     
    Robert Klemme, Oct 24, 2008
    #3
  4. On 24.10.2008 19:17, Robert Klemme wrote:
    > On 24.10.2008 18:50, Lew wrote:
    >> On Oct 24, 12:26 pm, softwarepearls_com <>
    >> wrote:
    >>> Let's say you had a set of Strings (I deliberately use lower case, so
    >>> not necessarily java.util.Set), and you wanted to know if String x was
    >>> in this set.. what would be the fastest way?
    >>>
    >>> Anyone have any experience comparing the performance of
    >>> HashSet<String> against regular expressions?

    >>
    >> I recommend that you write a benchmark to compare the results.


    PS: Forgot to add one thing: given the simplicity of the HashSet
    approach I'd go with that unless performance is extremely critical and
    the regexp solution is _significant_ faster. If performance of the
    HashSet is ok then I'd probably not even bother to benchmark the other
    solution...

    robert
     
    Robert Klemme, Oct 24, 2008
    #4
  5. softwarepearls_com

    Lew Guest

    softwarepearls_com <> wrote:
    > >> Let's say you had a set of Strings (I deliberately use lower case, so
    > >> not necessarily java.util.Set), and you wanted to know if String x was
    > >> in this set.. what would be the fastest way?


    Lew wrote:
    >> I predict that the HashSet will win hands down.  First question I
    >> have, how would you implement a structure that used regular
    >> expressions (regexes) to retrieve the String?

    ....
    >> Any use of regexes must be slower.  Strings hashCodes, if computed on
    >> demand and not cached in the String, require a computation based on
    >> each char in the String.


    Robert Klemme wrote:
    > Last time I checked there was an optimization for larger strings that
    > did not look at all the characters.


    How does that work?

    Lew wrote:
    >> Likewise for a regex comparison.  The
    >> hashCode need not be parsed, unlike the regex, and comparisons based
    >> on hashCode use int comparison, unlike regexes, and a contains() call
    >> into a HashSet is O(1), which I predict no regex-based comparison will
    >> beat.


    > I would be careful with prediction when it comes to performance.  I have
    > often seen surprising results.  And they may even change with JVM
    > version. :)


    I am aware of the risks, and made the prediction anyway.

    > I'd say, this has to happen:
    >
    > HashSet<String>:
    >
    > Preparation
    > 1. create HashSet
    > 2. for all Strings add to set


    The OP asked about the speed of determination if the String is in the
    set. Creation and fill of the set is not relevant.

    Creation of a HashSet, if you know the size ahead of time, is O(1).
    If not, O(log r) where 'r' is the number of resize events. Or do you
    count that as part of the 'add' time?

    > Test
    > 1. calculate input.hashCode() O(n)


    O(1) after the first time, in Sun's implementation.

    > 2. lookup bucket O(1)
    > 3. if empty => not found O(1)
    > 4. for all String s in bucket: O(n*m)


    'm' is the number of entries in the bucket, yes?

    Most of the time 'm' is 1 or close to it. Chances are also high that
    two Strings with the same hash will differ in the first one or two
    characters. So probably O(1).

    >        if calculate s.equal(input) yields true => found
    > 5. not found O(1)


    > Pattern:
    >
    > Preparation
    > 1. create a trie structure


    Not relevant, ibid.

    > 2. for all Strings add to trie


    That is likely more expensive than the add to the HashSet, yes? More
    allocations? Comparisons to find the location? Re-arrangement of
    pointers?

    Let's call it a wash, and it's moot anyway since only 'contains()'
    performance is at issue.

    > 3. traverse trie and generate pattern (will cause less backtracing!)


    Please explain this step. I do not understand it.

    > 1. create Matcher object O(1)
    > 2. traverse all chars in input and check pattern O(n)


    Here's where I see the trie approach losing to HashSet, which is
    overwhelmingly likely to approach O(1) for randomly distributed data.

    > So it does not sound too unrealistic that regexp is not slower.  But
    > there are a lot of open questions:


    Indeed. Such as the load factor of the HashSet, whether input Strings
    compute their hash only once or use the cached value often, whether
    inputs are randomly distributed. I still stand by my prediction until
    measurement proves me wrong in the general case.

    >> But really, I need to know what you mean by using regexes to tell if a
    >> String is in the collection.  How would you do that?

    ....
    >> If I see an SSCCE, I can tell if it is O(1) for contains() or not.


    > Big O is a good hint but the ultimate answer can only be found via
    > benchmarking.


    Exactly my point.

    > I have more questions: how often is the set built?  Is it static or
    > rather dynamic?  How many strings are expected to be in that set?  How
    > many tests are done? etc.


    Again, the question was specifically about a 'contains()' operation.
    If build and fill are important that of course changes the balance.
    My prediction about performance was specific to the 'contains()'
    operation.

    As it happens, HashSet is reasonably performant about inserts, too.

    Construction of a HashSet and a trie could be different in timing
    also. A whole HashSet is created in a single allocation.

    >> Please provide an SSCCE of a regex-based 'contains()' operation.


    > Pseudo code


    Pseudocode is not an SSCCE. Frankly, I'm interested in the OP
    providing an SSCCE with their own benchmark measurements as the first
    point of discussion. Let's see their commitment to their answer.

    --
    Lew
     
    Lew, Oct 24, 2008
    #5
  6. softwarepearls_com

    Lew Guest

    On Oct 24, 1:38 pm, Lew <> wrote:
    > Creation of a HashSet, if you know the size ahead of time, is O(1).
    > If not, O(log r) where 'r' is the number of resize events.  


    Er, O(r), which I estimate at around log n.

    --
    Lew
     
    Lew, Oct 24, 2008
    #6
  7. softwarepearls_com wrote:
    > Let's say you had a set of Strings (I deliberately use lower case, so
    > not necessarily java.util.Set), and you wanted to know if String x was
    > in this set.. what would be the fastest way?


    It depends. Some sets, such as the set of all strings beginning with
    "A", correspond to very rapidly checkable regular expressions. Other
    sets correspond to regular expressions that would be equivalent to
    checking against each word in the set.

    The HashSet technique can be very fast. The String implementation caches
    the hashCode, so that it only gets calculated once for each instance.

    Unless this is really performance critical in your application, I
    suggest using whichever is simpler. If it is performance critical, I
    suggest capturing typical sets and lists of probe strings from the
    application, and benchmarking each approach.

    Patricia
     
    Patricia Shanahan, Oct 24, 2008
    #7
  8. softwarepearls_com wrote:

    > Let's say you had a set of Strings (I deliberately use lower case, so
    > not necessarily java.util.Set), and you wanted to know if String x was
    > in this set.. what would be the fastest way?


    Depending on
    - character set and case sensitivity
    - length and distribution of strings
    - sparseness and size of set
    - count and hit/(hit+miss) ratio of lookups
    - memory footprint requirements
    you might be best off using java.util.HashSet, java.util.TreeSet,
    or a homebrew specialized structure. And for very small sets, even
    java.util.ArrayList might do...

    For example, if the strings to look up are non-constant and rarely
    re"used", TreeSet's tree traversal might be cheaper than HashSet's
    hash code calculation.

    Whatever you do, remember Three Golden Rules of Performance Tuning:
    1. Don't do it.
    2. Don't do it yet.
    3. To find the fastet solution, you need at least one.

    --

    "I'm a doctor, not a mechanic." Dr Leonard McCoy <>
    "I'm a mechanic, not a doctor." Volker Borchert <>
     
    Volker Borchert, Oct 25, 2008
    #8
  9. Volker Borchert wrote:
    >
    > Whatever you do, remember Three Golden Rules of Performance Tuning:
    > 1. Don't do it.
    > 2. Don't do it yet.
    > 3. To find the fastet solution, you need at least one.


    4. If something seems obviously inefficient, then:
    A. If you can remove it without making the code more complex, do that.
    B. If not, make a note of it as something to look at later, if
    performance tuning is necessary.

    A long time ago, I worked on a real-time monitoring system that was
    taking about two minutes to process a minute's worth of inputs, which
    turned out to be a problem. When we profiled it, we found a lot of
    really dumb stuff, like opening and closing files in a tight loop
    instead of keeping them open until the loop ended. There's no harm in
    avoiding the dumb stuff the first time.
     
    Mike Schilling, Oct 25, 2008
    #9
  10. On Oct 24, 11:51 pm, Eric Sosman <> wrote:
    > softwarepearls_com wrote:
    > > Let's say you had a set of Strings (I deliberately use lower case, so
    > > not necessarily java.util.Set), and you wanted to know if String x was
    > > in this set.. what would be the fastest way?

    >
    >      The fastest way would be to know the answer already so
    > you don't even need to ask the question.  See also "The
    > fastest I/O operation is the one you don't do."
    >
    >      ... and although I've given you a flip answer, there's
    > really not much more that can be said usefully.  Does this
    > set hold ten Strings or ten million?  Are they short or long?
    > How common is it that distinct Strings have long identical
    > prefixes?  Are you looking up just one "String x" or many
    > thousands of them?


    The set would hold very few Strings.. maybe a maximum of two dozen.
    The Strings would average a length of 20-30 chars. Many of them could
    share prefixes that have lengths of maybe half the String's length.
    And finally, I would be doing tens, hundreds of thousands of lookups.

    >      You're (presumably) thinking of applying the regular
    > expression somehow, either to the "String x" (with set
    > membership determined by what the Matcher says),


    Correct.
     
    softwarepearls_com, Oct 25, 2008
    #10
  11. On Oct 24, 9:01 pm, Patricia Shanahan <> wrote:
    > softwarepearls_com wrote:
    > > Let's say you had a set of Strings (I deliberately use lower case, so
    > > not necessarily java.util.Set), and you wanted to know if String x was
    > > in this set.. what would be the fastest way?

    >
    > It depends. Some sets, such as the set of all strings beginning with
    > "A", correspond to very rapidly checkable regular expressions. Other
    > sets correspond to regular expressions that would be equivalent to
    > checking against each word in the set.
    >
    > The HashSet technique can be very fast. The String implementation caches
    > the hashCode, so that it only gets calculated once for each instance.


    Yes, I know. But the incoming Strings, they may or may not benefit
    from the cacheing of their hashCode. Depends on whether anyone has
    already called hashCode().

    > Unless this is really performance critical in your application, I
    > suggest using whichever is simpler. If it is performance critical, I
    > suggest capturing typical sets and lists of probe strings from the
    > application, and benchmarking each approach.


    Will need to do the latter, as the logic will very much be on the
    (performance) critical path.
     
    softwarepearls_com, Oct 25, 2008
    #11
  12. On Oct 24, 7:35 pm, Christian <> wrote:
    > Robert Klemme schrieb:
    >
    > > On 24.10.2008 19:17, Robert Klemme wrote:
    > >> On 24.10.2008 18:50, Lew wrote:
    > >>> On Oct 24, 12:26 pm, softwarepearls_com <>
    > >>> wrote:
    > >>>> Let's say you had a set of Strings (I deliberately use lower case, so
    > >>>> not necessarily java.util.Set), and you wanted to know if String x was
    > >>>> in this set.. what would be the fastest way?

    >
    > >>>> Anyone have any experience comparing the performance of
    > >>>> HashSet<String> against regular expressions?

    >
    > >>> I recommend that you write a benchmark to compare the results.

    >
    > > PS: Forgot to add one thing: given the simplicity of the HashSet
    > > approach I'd go with that unless performance is extremely critical and
    > > the regexp solution is _significant_ faster.  If performance of the
    > > HashSet is ok then I'd probably not even bother to benchmark the other
    > > solution...

    >
    > >     robert

    >
    > Even without benchmarking I can tell you that using regexps will
    > probably be by a factor of 100 to 10000 slower than HashSet..
    > First academic point:
    > HashSet  O(1)
    > Regexp   O(n)
    >
    > Just compiling the regexp is probably already a hundred times slower
    > that searching in the HashSet... and thats just the constant overhead of
    > the regexp method...


    The setup costs of any matching approach are not relevant for my
    application.. because the matching will be done a huge (10^[5..7])
    number of times..

    The reason I'm interested to know of any experience with regular
    expression matching is that I was thinking of a regex-like expensive-
    to-implement, but potentially "optimal" approach whereby I would
    generate some logic, at run time, wrap it all in a new class, and
    ClassLoader-it. The generated logic would consist of a cascade of ifs,
    with all characters from the set of Strings embedded as bipush
    constants. So the logic would have no object or array accesses (apart
    from accessing the chars of the input String), no iteration..

    Let's say the set of Strings was "hello world", "higher than", and
    "maybe not".

    The totally customized "hand coded" logic to match against this set
    would be something like

    // note: set of Strings is implicit in code that follows, so no
    argument for the set!
    public static boolean matchesAnyOf(String input) {

    char first = input.charAt(0);
    if (first == 'h') {
    char second = input.charAt(1);
    if (second == 'e') {
    etc..
    }
    if (second == 'i') {
    etc..
    }
    return false;
    }
    if (first == 'm') {
    etc..
    }

    return false;
    }


    Because the length of the Strings in the set is known to be small (max
    two dozen chars), the maximum nesting of the ifs would be manageable,
    and because the number of Strings is likewise small, the total amount
    of generated code would probably not blow the 64K limit for method
    bytecode.

    The HashSet<String>.contains(input) approach is very attractive mainly
    because it requires no custom code at all, and it will be fast. But if
    I thought I could get 50%+ more speed out of a custom approach, I'd
    invest the time in it.
     
    softwarepearls_com, Oct 25, 2008
    #12
  13. Just for the record, I once used such a technique, and for those who
    like stories from the Stone Age, here goes.

    The project was a commercial game. The language was Motorola 68K
    assembler. The problem was how to render horizontal lines (depicting
    laser canon output) as fast as possible. Any vanilla line rendering
    call was hopelessly slow, so I ended up with a code-generation
    approach which exploited the fact that only the start and end points
    of the line needed any pixel masking operations. The bulk of the
    pixels completely filled the 16-bit words in video RAM, so for those,
    a plain (constant) write was sufficient.

    The final solution was a switch-like routine that picked one of
    several dozen generated routines. The whole thing cost about 16K of
    RAM for the generated code, but the resulting horizontal line renderer
    was "probably" one of the fastest ever written for the Amiga and Atari
    ST. (My apologies to any other game coder who has a similar claim ;-)
     
    softwarepearls_com, Oct 25, 2008
    #13
  14. "softwarepearls_com" <> wrote in message
    news:...
    On Oct 24, 11:51 pm, Eric Sosman <> wrote:
    > softwarepearls_com wrote:
    > > Let's say you had a set of Strings (I deliberately use lower case, so
    > > not necessarily java.util.Set), and you wanted to know if String x was
    > > in this set.. what would be the fastest way?

    >
    > The fastest way would be to know the answer already so
    > you don't even need to ask the question. See also "The
    > fastest I/O operation is the one you don't do."
    >
    > ... and although I've given you a flip answer, there's
    > really not much more that can be said usefully. Does this
    > set hold ten Strings or ten million? Are they short or long?
    > How common is it that distinct Strings have long identical
    > prefixes? Are you looking up just one "String x" or many
    > thousands of them?


    The set would hold very few Strings.. maybe a maximum of two dozen.
    The Strings would average a length of 20-30 chars. Many of them could
    share prefixes that have lengths of maybe half the String's length.
    And finally, I would be doing tens, hundreds of thousands of lookups.

    > You're (presumably) thinking of applying the regular
    > expression somehow, either to the "String x" (with set
    > membership determined by what the Matcher says),


    Correct.

    ********************************************
    Find and use a database that has regular expression support, so you can do
    something like

    SELECT COUNT(*) FROM codes where REGEXP_LIKE (code_descr, '^abc(1|2)');

    [Oracle specific here] or

    SELECT COUNT(*) FROM codes where code_descr !~ '[0-9]+xxx$';

    [PostgreSQL]

    I'm being somewhat flippant, but not by much. Oracle can do it, so can
    PostgreSQL and MySQL (same implementation for those two), Firebird doesn't
    yet (I think) but likely soon will, SQL Server can do it etc.

    The advantage here is that someone else took care of it for you. After some
    months now of maintaining and fixing code where the original developers
    never saw an API implementation that they couldn't rewrite themselves (I
    honestly believe they'd have rewritten the J2EE server given time) I am a
    zealous evangelist for re-use. :)

    However, the disadvantage of using a database is that you may not need one
    otherwise. I put it out there, though - consider it.

    If you're considering your own approach, Robert already mentioned tries. See
    links like

    http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
    oak.cs.ucla.edu/~cho/papers/cho-regex.pdf
    http://www.di.unipi.it/didadoc/asw/grossidir/Sahni/tries.htm
    and so forth.

    You'll likely find an implementation in your language of choice, and only
    have to integrate it.

    AHS
     
    Arved Sandstrom, Oct 25, 2008
    #14
  15. >Find and use a database that has regular expression support, so you can do
    >something like
    >
    >SELECT COUNT(*) FROM codes where REGEXP_LIKE (code_descr, '^abc(1|2)');


    >I'm being somewhat flippant, but not by much. Oracle can do it, so can
    >PostgreSQL and MySQL (same implementation for those two), Firebird doesn't
    >yet (I think) but likely soon will, SQL Server can do it etc.
    >
    >The advantage here is that someone else took care of it for you. After some
    >months now of maintaining and fixing code where the original developers
    >never saw an API implementation that they couldn't rewrite themselves (I
    >honestly believe they'd have rewritten the J2EE server given time) I am a
    >zealous evangelist for re-use. :)


    So am I.. maybe even more zealous than yourself :-]. (BTW The litmus
    test I use to determine whether you're a true re-user, or a fake one,
    is to ask you how large your personal library of reusable Java types
    is. Mine is about half the size of the original 1.0 JDK)

    Anyway, my set matching requirement really requires maximum
    performance, and inserting anything as heavy as a database layer is
    completely out of the question... as is any other approach where I
    can't measure the entire logic's length in a handful of JVM
    instructions.

    Thx for the original suggestion anyway.
     
    softwarepearls_com, Oct 25, 2008
    #15
  16. On 24.10.2008 19:38, Lew wrote:
    > softwarepearls_com <> wrote:
    >>>> Let's say you had a set of Strings (I deliberately use lower case, so
    >>>> not necessarily java.util.Set), and you wanted to know if String x was
    >>>> in this set.. what would be the fastest way?

    >
    > Lew wrote:
    >>> I predict that the HashSet will win hands down. First question I
    >>> have, how would you implement a structure that used regular
    >>> expressions (regexes) to retrieve the String?

    > ...
    >>> Any use of regexes must be slower. Strings hashCodes, if computed on
    >>> demand and not cached in the String, require a computation based on
    >>> each char in the String.

    >
    > Robert Klemme wrote:
    >> Last time I checked there was an optimization for larger strings that
    >> did not look at all the characters.

    >
    > How does that work?


    Well, since it's a hash value, there is no requirement to read all the
    chars. I just checked again and in the current JDK hashCode() actually
    reads all the characters.

    >> I'd say, this has to happen:
    >>
    >> HashSet<String>:
    >>
    >> Preparation
    >> 1. create HashSet
    >> 2. for all Strings add to set

    >
    > The OP asked about the speed of determination if the String is in the
    > set. Creation and fill of the set is not relevant.


    That depends. He did not state whether the set was static or not and
    what the relationship of preparation to tests was etc.

    > Creation of a HashSet, if you know the size ahead of time, is O(1).
    > If not, O(log r) where 'r' is the number of resize events. Or do you
    > count that as part of the 'add' time?


    Yes. Where is the number of strings and their average length in your
    calculation?

    >> Test
    >> 1. calculate input.hashCode() O(n)

    >
    > O(1) after the first time, in Sun's implementation.
    >
    >> 2. lookup bucket O(1)
    >> 3. if empty => not found O(1)
    >> 4. for all String s in bucket: O(n*m)

    >
    > 'm' is the number of entries in the bucket, yes?


    Right.

    > Most of the time 'm' is 1 or close to it. Chances are also high that
    > two Strings with the same hash will differ in the first one or two
    > characters. So probably O(1).


    Yep.

    >> if calculate s.equal(input) yields true => found
    >> 5. not found O(1)

    >
    >> Pattern:
    >>
    >> Preparation
    >> 1. create a trie structure

    >
    > Not relevant, ibid.
    >
    >> 2. for all Strings add to trie

    >
    > That is likely more expensive than the add to the HashSet, yes? More
    > allocations? Comparisons to find the location? Re-arrangement of
    > pointers?


    Probably.

    > Let's call it a wash, and it's moot anyway since only 'contains()'
    > performance is at issue.
    >
    >> 3. traverse trie and generate pattern (will cause less backtracing!)

    >
    > Please explain this step. I do not understand it.


    I could dig up a bit of Ruby code that does exactly this. Basically you
    create a pattern which leads to less backtracking by representing the
    tree like structure with non capturing groups (see my small example
    pattern).

    >> 1. create Matcher object O(1)
    >> 2. traverse all chars in input and check pattern O(n)

    >
    > Here's where I see the trie approach losing to HashSet, which is
    > overwhelmingly likely to approach O(1) for randomly distributed data.


    Did you take the cost of equals() into account? This has to do a char
    by char comparison. You could argue of course that for misses chances
    are that lengths for strings with identical hash code but different
    content also differ which would make the failure test O(1) as well.

    > Pseudocode is not an SSCCE. Frankly, I'm interested in the OP
    > providing an SSCCE with their own benchmark measurements as the first
    > point of discussion. Let's see their commitment to their answer.


    Right. I didn't want to do the OP's job as well. :)

    Kind regards

    robert
     
    Robert Klemme, Oct 25, 2008
    #16
  17. On 24.10.2008 19:35, Christian wrote:

    > Just compiling the regexp is probably already a hundred times slower
    > that searching in the HashSet... and thats just the constant overhead of
    > the regexp method...


    That's an unfair comparison: building the regular expression (Pattern)
    is part of preparation, not containment checking. So, if you want to
    compare the pattern compilation to something you need to compare to the
    HashSet creation with all the strings.

    Regards

    robert
     
    Robert Klemme, Oct 25, 2008
    #17
  18. On Sat, 25 Oct 2008 08:59:08 -0400, Lew <> wrote:

    >What do your benchmarks show for this approach vs. HashSet?


    Well, that the precise data inputs matter a lot (see benchmark code
    below). For the test as written, the custom logic approach is roughly
    twice as fast. But.. I saw that the length of input Strings matter a
    lot, which is to be expected since the longer the Strings, the deeper
    the if nesting gets. It's a race between HashSet's O(1) logic vs my
    O(n) logc.

    I also had a quick look at the generated bytecode (using Eclipse's
    compiler, javac fans YMMV).. and found some recurring sillyness which
    would not feature in the hand generated logic, so the custom approach
    would give a significant performance advantage over HashSet.contains,
    but only for really short Strings. Mine are about 3-5x longer than
    those in the benchmark.. so for me, HashSet.contains() and String's
    clever hashCode() wins.


    import java.util.HashSet;
    import java.util.Set;

    public class MatchStringSetBenchmark {

    private static final int TIMES_TO_RUN_TEST = 10;
    private static final int HAMMER_SIZE = 1000000;
    private static final double TIMES_UNROLLED = 6;

    public static void main(String[] args) {

    String[] stringsToMatch = { "hello", "highe", "maybe" };

    Set<String> realSet = fillHashSet(stringsToMatch);

    for (int t = 0; t < TIMES_TO_RUN_TEST; t++) {

    long then = System.nanoTime();
    for (int i = 0; i < HAMMER_SIZE; i++) {
    // three matching inputs and three failing inputs to
    fairly distribute
    // match/fail bias in algorithm
    // assert realSet.contains("hello");
    // assert realSet.contains("hello");
    // assert realSet.contains("hello");
    // assert !realSet.contains("hella");
    // assert !realSet.contains("hella");
    // assert !realSet.contains("hella");
    realSet.contains("hello");
    realSet.contains("hello");
    realSet.contains("hello");
    realSet.contains("hella");
    realSet.contains("hella");
    realSet.contains("hella");
    }
    long now = System.nanoTime();

    long elapsed = now - then;
    double costOfSearch = ((double) elapsed) / HAMMER_SIZE *
    TIMES_UNROLLED;

    System.out.println("ns per HashSet.contains(): " +
    costOfSearch);
    }

    for (int t = 0; t < TIMES_TO_RUN_TEST; t++) {

    long then = System.nanoTime();
    for (int i = 0; i < HAMMER_SIZE; i++) {
    // assert contains("hello");
    // assert contains("hello");
    // assert contains("hello");
    // assert ! contains("hella");
    // assert ! contains("hella");
    // assert ! contains("hella");
    contains("hello");
    contains("hello");
    contains("hello");
    contains("hella");
    contains("hella");
    contains("hella");
    }
    long now = System.nanoTime();

    long elapsed = now - then;
    double costOfSearch = ((double) elapsed) / HAMMER_SIZE *
    TIMES_UNROLLED;

    System.out.println("ns per CUSTOM.contains(): " +
    costOfSearch);
    }
    }

    private static Set<String> fillHashSet(String[] stringsToMatch) {

    Set<String> realSet = new HashSet<String>();
    for (String string : stringsToMatch) {

    realSet.add(string);
    }

    return realSet;
    }

    public static boolean contains(String input) {

    char first = input.charAt(0);
    if (first == 'h') {
    char second = input.charAt(1);
    if (second == 'e') {
    char third = input.charAt(2);
    if (third == 'l') {
    char fourth = input.charAt(3);
    if (fourth == 'l') {
    char fifth = input.charAt(4);
    return (fifth == 'o');
    } else {
    return false;
    }
    } else {
    return false;
    }
    }
    if (second == 'i') {
    char third = input.charAt(2);
    if (third == 'g') {
    char fourth = input.charAt(3);
    if (fourth == 'h') {
    char fifth = input.charAt(4);
    return (fifth == 'e');
    } else {
    return false;
    }
    } else {
    return false;
    }
    }
    return false;
    }
    if (first == 'm') {
    char second = input.charAt(1);
    if (second == 'a') {
    char third = input.charAt(2);
    if (third == 'y') {
    char fourth = input.charAt(3);
    if (fourth == 'b') {
    char fifth = input.charAt(4);
    return (fifth == 'e');
    } else {
    return false;
    }
    } else {
    return false;
    }
    }
    }

    return false;
    }
    }
     
    softwarepearls_com, Oct 25, 2008
    #18
  19. On Sat, 25 Oct 2008 16:16:46 +0200, Robert Klemme
    <> wrote:

    >Well, since it's a hash value, there is no requirement to read all the
    >chars. I just checked again and in the current JDK hashCode() actually
    >reads all the characters.


    That's because the original (JDK 1.0) approach, which only read say
    the first 16 chars, ran into problems.. creating the same hashcode for
    longer strings.. and that meant terrible Hashtable performance.
    There's a Sun Bug Parade entry on this, for more detail.
     
    softwarepearls_com, Oct 25, 2008
    #19
  20. On 25.10.2008 16:51, softwarepearls_com wrote:
    > On Sat, 25 Oct 2008 16:16:46 +0200, Robert Klemme
    > <> wrote:
    >
    >> Well, since it's a hash value, there is no requirement to read all the
    >> chars. I just checked again and in the current JDK hashCode() actually
    >> reads all the characters.

    >
    > That's because the original (JDK 1.0) approach, which only read say
    > the first 16 chars, ran into problems.. creating the same hashcode for
    > longer strings.. and that meant terrible Hashtable performance.
    > There's a Sun Bug Parade entry on this, for more detail.


    Actually I believe I have seen an implementation that does not read all
    chars but jumps for longer strings. I would have to do more research
    though to find out which version of JDK it was or whether I mixed this
    up with a string hashing function in another library / programming language.

    Thanks for the info!

    Kind regards

    robert
     
    Robert Klemme, Oct 25, 2008
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Ben

    Strings, Strings and Damned Strings

    Ben, Jun 22, 2006, in forum: C Programming
    Replies:
    14
    Views:
    813
    Malcolm
    Jun 24, 2006
  2. anonym
    Replies:
    1
    Views:
    1,074
    Knute Johnson
    Jan 15, 2009
  3. Karin Lagesen

    matching strings in a large set of strings

    Karin Lagesen, Apr 29, 2010, in forum: Python
    Replies:
    13
    Views:
    472
    Bryan
    May 3, 2010
  4. Helmut Jarausch
    Replies:
    3
    Views:
    345
    Dave Angel
    Apr 30, 2010
  5. Joel Goldstick
    Replies:
    1
    Views:
    765
    Arnaud Delobelle
    Aug 31, 2010
Loading...

Share This Page