Speeding up matching Strings in a set

softwarepearls_com · Oct 24, 2008

Let's say you had a set of Strings (I deliberately use lower case, so
not necessarily java.util.Set), and you wanted to know if String x was
in this set.. what would be the fastest way?

Anyone have any experience comparing the performance of
HashSet<String> against regular expressions?

Lew · Oct 24, 2008

Let's say you had a set of Strings (I deliberately use lower case, so
not necessarily java.util.Set), and you wanted to know if String x was
in this set.. what would be the fastest way?

Anyone have any experience comparing the performance of
HashSet<String> against regular expressions?

I recommend that you write a benchmark to compare the results.

I predict that the HashSet will win hands down. First question I
have, how would you implement a structure that used regular
expressions (regexes) to retrieve the String?

Any use of regexes must be slower. Strings hashCodes, if computed on
demand and not cached in the String, require a computation based on
each char in the String. Likewise for a regex comparison. The
hashCode need not be parsed, unlike the regex, and comparisons based
on hashCode use int comparison, unlike regexes, and a contains() call
into a HashSet is O(1), which I predict no regex-based comparison will
beat.

But really, I need to know what you mean by using regexes to tell if a
String is in the collection. How would you do that?

If I see an SSCCE, I can tell if it is O(1) for contains() or not.

Please provide an SSCCE of a regex-based 'contains()' operation.

Robert Klemme · Oct 24, 2008

I recommend that you write a benchmark to compare the results.

Absolutely. Results of this exercise is often educational.

I predict that the HashSet will win hands down. First question I
have, how would you implement a structure that used regular
expressions (regexes) to retrieve the String?

I am speculating that the OP just would create a regexp that matches for
_all_ strings. You can do some optimization for Java's NFA (i.e.
combine matching prefixes) but I doubt this will work well for large sets.

Any use of regexes must be slower. Strings hashCodes, if computed on
demand and not cached in the String, require a computation based on
each char in the String.

Last time I checked there was an optimization for larger strings that
did not look at all the characters.

Likewise for a regex comparison. The
hashCode need not be parsed, unlike the regex, and comparisons based
on hashCode use int comparison, unlike regexes, and a contains() call
into a HashSet is O(1), which I predict no regex-based comparison will
beat.

I would be careful with prediction when it comes to performance. I have
often seen surprising results. And they may even change with JVM
version.

I'd say, this has to happen:

HashSet<String>:

Preparation
1. create HashSet
2. for all Strings add to set

Test
1. calculate input.hashCode() O(n)
2. lookup bucket O(1)
3. if empty => not found O(1)
4. for all String s in bucket: O(n*m)
if calculate s.equal(input) yields true => found
5. not found O(1)

Pattern:

Preparation
1. create a trie structure
2. for all Strings add to trie
3. traverse trie and generate pattern (will cause less backtracing!)

1. create Matcher object O(1)
2. traverse all chars in input and check pattern O(n)

So it does not sound too unrealistic that regexp is not slower. But
there are a lot of open questions:

But really, I need to know what you mean by using regexes to tell if a
String is in the collection. How would you do that?

If I see an SSCCE, I can tell if it is O(1) for contains() or not.

Big O is a good hint but the ultimate answer can only be found via
benchmarking.

I have more questions: how often is the set built? Is it static or
rather dynamic? How many strings are expected to be in that set? How
many tests are done? etc.

Please provide an SSCCE of a regex-based 'contains()' operation.

Pseudo code

// set is "barfly, barton, foo"
private static final Pattern pat =
Pattern.compile("(?:bar(?:fly|ton))|foo");
....

public boolean check(CharSequence input) {
return pat.matcher(input).matches();
}

Kind regards

robert

Robert Klemme · Oct 24, 2008

PS: Forgot to add one thing: given the simplicity of the HashSet
approach I'd go with that unless performance is extremely critical and
the regexp solution is _significant_ faster. If performance of the
HashSet is ok then I'd probably not even bother to benchmark the other
solution...

robert

Lew · Oct 24, 2008

Robert said:
Last time I checked there was an optimization for larger strings that
did not look at all the characters.

How does that work?

I would be careful with prediction when it comes to performance. I have
often seen surprising results. And they may even change with JVM
version.

I am aware of the risks, and made the prediction anyway.

I'd say, this has to happen:

HashSet<String>:

Preparation
1. create HashSet
2. for all Strings add to set

The OP asked about the speed of determination if the String is in the
set. Creation and fill of the set is not relevant.

Creation of a HashSet, if you know the size ahead of time, is O(1).
If not, O(log r) where 'r' is the number of resize events. Or do you
count that as part of the 'add' time?

Test
1. calculate input.hashCode() O(n)

O(1) after the first time, in Sun's implementation.

2. lookup bucket O(1)
3. if empty => not found O(1)
4. for all String s in bucket: O(n*m)

'm' is the number of entries in the bucket, yes?

Most of the time 'm' is 1 or close to it. Chances are also high that
two Strings with the same hash will differ in the first one or two
characters. So probably O(1).

if calculate s.equal(input) yields true => found
5. not found O(1)

Pattern:

Preparation
1. create a trie structure

Not relevant, ibid.

2. for all Strings add to trie

That is likely more expensive than the add to the HashSet, yes? More
allocations? Comparisons to find the location? Re-arrangement of
pointers?

Let's call it a wash, and it's moot anyway since only 'contains()'
performance is at issue.

3. traverse trie and generate pattern (will cause less backtracing!)

Please explain this step. I do not understand it.

1. create Matcher object O(1)
2. traverse all chars in input and check pattern O(n)

Here's where I see the trie approach losing to HashSet, which is
overwhelmingly likely to approach O(1) for randomly distributed data.

So it does not sound too unrealistic that regexp is not slower. But
there are a lot of open questions:

Indeed. Such as the load factor of the HashSet, whether input Strings
compute their hash only once or use the cached value often, whether
inputs are randomly distributed. I still stand by my prediction until
measurement proves me wrong in the general case.

Big O is a good hint but the ultimate answer can only be found via
benchmarking.

Exactly my point.

I have more questions: how often is the set built? Is it static or
rather dynamic? How many strings are expected to be in that set? How
many tests are done? etc.

Again, the question was specifically about a 'contains()' operation.
If build and fill are important that of course changes the balance.
My prediction about performance was specific to the 'contains()'
operation.

As it happens, HashSet is reasonably performant about inserts, too.

Construction of a HashSet and a trie could be different in timing
also. A whole HashSet is created in a single allocation.

Pseudo code

Pseudocode is not an SSCCE. Frankly, I'm interested in the OP
providing an SSCCE with their own benchmark measurements as the first
point of discussion. Let's see their commitment to their answer.

Lew · Oct 24, 2008

Creation of a HashSet, if you know the size ahead of time, is O(1).
If not, O(log r) where 'r' is the number of resize events.

Er, O(r), which I estimate at around log n.

Patricia Shanahan · Oct 24, 2008

softwarepearls_com said:
Let's say you had a set of Strings (I deliberately use lower case, so
not necessarily java.util.Set), and you wanted to know if String x was
in this set.. what would be the fastest way?

It depends. Some sets, such as the set of all strings beginning with
"A", correspond to very rapidly checkable regular expressions. Other
sets correspond to regular expressions that would be equivalent to
checking against each word in the set.

The HashSet technique can be very fast. The String implementation caches
the hashCode, so that it only gets calculated once for each instance.

Unless this is really performance critical in your application, I
suggest using whichever is simpler. If it is performance critical, I
suggest capturing typical sets and lists of probe strings from the
application, and benchmarking each approach.

Patricia

Volker Borchert · Oct 25, 2008

softwarepearls_com said:
Let's say you had a set of Strings (I deliberately use lower case, so
not necessarily java.util.Set), and you wanted to know if String x was
in this set.. what would be the fastest way?

Depending on
- character set and case sensitivity
- length and distribution of strings
- sparseness and size of set
- count and hit/(hit+miss) ratio of lookups
- memory footprint requirements
you might be best off using java.util.HashSet, java.util.TreeSet,
or a homebrew specialized structure. And for very small sets, even
java.util.ArrayList might do...

For example, if the strings to look up are non-constant and rarely
re"used", TreeSet's tree traversal might be cheaper than HashSet's
hash code calculation.

Whatever you do, remember Three Golden Rules of Performance Tuning:
1. Don't do it.
2. Don't do it yet.
3. To find the fastet solution, you need at least one.

Mike Schilling · Oct 25, 2008

Volker said:
Whatever you do, remember Three Golden Rules of Performance Tuning:
1. Don't do it.
2. Don't do it yet.
3. To find the fastet solution, you need at least one.

4. If something seems obviously inefficient, then:
A. If you can remove it without making the code more complex, do that.
B. If not, make a note of it as something to look at later, if
performance tuning is necessary.

A long time ago, I worked on a real-time monitoring system that was
taking about two minutes to process a minute's worth of inputs, which
turned out to be a problem. When we profiled it, we found a lot of
really dumb stuff, like opening and closing files in a tight loop
instead of keeping them open until the loop ended. There's no harm in
avoiding the dumb stuff the first time.

softwarepearls_com · Oct 25, 2008

The fastest way would be to know the answer already so
you don't even need to ask the question. See also "The
fastest I/O operation is the one you don't do."

... and although I've given you a flip answer, there's
really not much more that can be said usefully. Does this
set hold ten Strings or ten million? Are they short or long?
How common is it that distinct Strings have long identical
prefixes? Are you looking up just one "String x" or many
thousands of them?

The set would hold very few Strings.. maybe a maximum of two dozen.
The Strings would average a length of 20-30 chars. Many of them could
share prefixes that have lengths of maybe half the String's length.
And finally, I would be doing tens, hundreds of thousands of lookups.

You're (presumably) thinking of applying the regular
expression somehow, either to the "String x" (with set
membership determined by what the Matcher says),

Correct.

softwarepearls_com · Oct 25, 2008

It depends. Some sets, such as the set of all strings beginning with
"A", correspond to very rapidly checkable regular expressions. Other
sets correspond to regular expressions that would be equivalent to
checking against each word in the set.

The HashSet technique can be very fast. The String implementation caches
the hashCode, so that it only gets calculated once for each instance.

Yes, I know. But the incoming Strings, they may or may not benefit
from the cacheing of their hashCode. Depends on whether anyone has
already called hashCode().

Unless this is really performance critical in your application, I
suggest using whichever is simpler. If it is performance critical, I
suggest capturing typical sets and lists of probe strings from the
application, and benchmarking each approach.

Will need to do the latter, as the logic will very much be on the
(performance) critical path.

softwarepearls_com · Oct 25, 2008

Even without benchmarking I can tell you that using regexps will
probably be by a factor of 100 to 10000 slower than HashSet..
First academic point:
HashSet O(1)
Regexp O(n)

Just compiling the regexp is probably already a hundred times slower
that searching in the HashSet... and thats just the constant overhead of
the regexp method...

The setup costs of any matching approach are not relevant for my
application.. because the matching will be done a huge (10^[5..7])
number of times..

The reason I'm interested to know of any experience with regular
expression matching is that I was thinking of a regex-like expensive-
to-implement, but potentially "optimal" approach whereby I would
generate some logic, at run time, wrap it all in a new class, and
ClassLoader-it. The generated logic would consist of a cascade of ifs,
with all characters from the set of Strings embedded as bipush
constants. So the logic would have no object or array accesses (apart
from accessing the chars of the input String), no iteration..

Let's say the set of Strings was "hello world", "higher than", and
"maybe not".

The totally customized "hand coded" logic to match against this set
would be something like

// note: set of Strings is implicit in code that follows, so no
argument for the set!
public static boolean matchesAnyOf(String input) {

char first = input.charAt(0);
if (first == 'h') {
char second = input.charAt(1);
if (second == 'e') {
etc..
}
if (second == 'i') {
etc..
}
return false;
}
if (first == 'm') {
etc..
}

return false;
}

Because the length of the Strings in the set is known to be small (max
two dozen chars), the maximum nesting of the ifs would be manageable,
and because the number of Strings is likewise small, the total amount
of generated code would probably not blow the 64K limit for method
bytecode.

The HashSet<String>.contains(input) approach is very attractive mainly
because it requires no custom code at all, and it will be fast. But if
I thought I could get 50%+ more speed out of a custom approach, I'd
invest the time in it.

softwarepearls_com · Oct 25, 2008

Just for the record, I once used such a technique, and for those who
like stories from the Stone Age, here goes.

The project was a commercial game. The language was Motorola 68K
assembler. The problem was how to render horizontal lines (depicting
laser canon output) as fast as possible. Any vanilla line rendering
call was hopelessly slow, so I ended up with a code-generation
approach which exploited the fact that only the start and end points
of the line needed any pixel masking operations. The bulk of the
pixels completely filled the 16-bit words in video RAM, so for those,
a plain (constant) write was sufficient.

The final solution was a switch-like routine that picked one of
several dozen generated routines. The whole thing cost about 16K of
RAM for the generated code, but the resulting horizontal line renderer
was "probably" one of the fastest ever written for the Amiga and Atari
ST. (My apologies to any other game coder who has a similar claim ;-)

Arved Sandstrom · Oct 25, 2008

The fastest way would be to know the answer already so
you don't even need to ask the question. See also "The
fastest I/O operation is the one you don't do."

... and although I've given you a flip answer, there's
really not much more that can be said usefully. Does this
set hold ten Strings or ten million? Are they short or long?
How common is it that distinct Strings have long identical
prefixes? Are you looking up just one "String x" or many
thousands of them?

The set would hold very few Strings.. maybe a maximum of two dozen.
The Strings would average a length of 20-30 chars. Many of them could
share prefixes that have lengths of maybe half the String's length.
And finally, I would be doing tens, hundreds of thousands of lookups.

You're (presumably) thinking of applying the regular
expression somehow, either to the "String x" (with set
membership determined by what the Matcher says),

Correct.

********************************************
Find and use a database that has regular expression support, so you can do
something like

SELECT COUNT(*) FROM codes where REGEXP_LIKE (code_descr, '^abc(1|2)');

[Oracle specific here] or

SELECT COUNT(*) FROM codes where code_descr !~ '[0-9]+xxx$';

[PostgreSQL]

I'm being somewhat flippant, but not by much. Oracle can do it, so can
PostgreSQL and MySQL (same implementation for those two), Firebird doesn't
yet (I think) but likely soon will, SQL Server can do it etc.

The advantage here is that someone else took care of it for you. After some
months now of maintaining and fixing code where the original developers
never saw an API implementation that they couldn't rewrite themselves (I
honestly believe they'd have rewritten the J2EE server given time) I am a
zealous evangelist for re-use.

However, the disadvantage of using a database is that you may not need one
otherwise. I put it out there, though - consider it.

If you're considering your own approach, Robert already mentioned tries. See
links like

http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
oak.cs.ucla.edu/~cho/papers/cho-regex.pdf
http://www.di.unipi.it/didadoc/asw/grossidir/Sahni/tries.htm
and so forth.

You'll likely find an implementation in your language of choice, and only
have to integrate it.

AHS

softwarepearls_com · Oct 25, 2008

Find and use a database that has regular expression support, so you can do

something like

SELECT COUNT(*) FROM codes where REGEXP_LIKE (code_descr, '^abc(1|2)');

I'm being somewhat flippant, but not by much. Oracle can do it, so can
PostgreSQL and MySQL (same implementation for those two), Firebird doesn't
yet (I think) but likely soon will, SQL Server can do it etc.

The advantage here is that someone else took care of it for you. After some
months now of maintaining and fixing code where the original developers
never saw an API implementation that they couldn't rewrite themselves (I
honestly believe they'd have rewritten the J2EE server given time) I am a
zealous evangelist for re-use.

So am I.. maybe even more zealous than yourself :-]. (BTW The litmus
test I use to determine whether you're a true re-user, or a fake one,
is to ask you how large your personal library of reusable Java types
is. Mine is about half the size of the original 1.0 JDK)

Anyway, my set matching requirement really requires maximum
performance, and inserting anything as heavy as a database layer is
completely out of the question... as is any other approach where I
can't measure the entire logic's length in a handful of JVM
instructions.

Thx for the original suggestion anyway.

Robert Klemme · Oct 25, 2008

How does that work?

Well, since it's a hash value, there is no requirement to read all the
chars. I just checked again and in the current JDK hashCode() actually
reads all the characters.

The OP asked about the speed of determination if the String is in the
set. Creation and fill of the set is not relevant.

That depends. He did not state whether the set was static or not and
what the relationship of preparation to tests was etc.

Creation of a HashSet, if you know the size ahead of time, is O(1).
If not, O(log r) where 'r' is the number of resize events. Or do you
count that as part of the 'add' time?

Yes. Where is the number of strings and their average length in your
calculation?

O(1) after the first time, in Sun's implementation.

'm' is the number of entries in the bucket, yes?
Right.

Most of the time 'm' is 1 or close to it. Chances are also high that
two Strings with the same hash will differ in the first one or two
characters. So probably O(1).
Yep.

Not relevant, ibid.

That is likely more expensive than the add to the HashSet, yes? More
allocations? Comparisons to find the location? Re-arrangement of
pointers?
Probably.

Let's call it a wash, and it's moot anyway since only 'contains()'
performance is at issue.

Please explain this step. I do not understand it.

I could dig up a bit of Ruby code that does exactly this. Basically you
create a pattern which leads to less backtracking by representing the
tree like structure with non capturing groups (see my small example
pattern).

Here's where I see the trie approach losing to HashSet, which is
overwhelmingly likely to approach O(1) for randomly distributed data.

Did you take the cost of equals() into account? This has to do a char
by char comparison. You could argue of course that for misses chances
are that lengths for strings with identical hash code but different
content also differ which would make the failure test O(1) as well.

Pseudocode is not an SSCCE. Frankly, I'm interested in the OP
providing an SSCCE with their own benchmark measurements as the first
point of discussion. Let's see their commitment to their answer.

Right. I didn't want to do the OP's job as well.

Kind regards

robert

Robert Klemme · Oct 25, 2008

Just compiling the regexp is probably already a hundred times slower
that searching in the HashSet... and thats just the constant overhead of
the regexp method...

That's an unfair comparison: building the regular expression (Pattern)
is part of preparation, not containment checking. So, if you want to
compare the pattern compilation to something you need to compare to the
HashSet creation with all the strings.

Regards

robert

softwarepearls_com · Oct 25, 2008

What do your benchmarks show for this approach vs. HashSet?

Well, that the precise data inputs matter a lot (see benchmark code
below). For the test as written, the custom logic approach is roughly
twice as fast. But.. I saw that the length of input Strings matter a
lot, which is to be expected since the longer the Strings, the deeper
the if nesting gets. It's a race between HashSet's O(1) logic vs my
O(n) logc.

I also had a quick look at the generated bytecode (using Eclipse's
compiler, javac fans YMMV).. and found some recurring sillyness which
would not feature in the hand generated logic, so the custom approach
would give a significant performance advantage over HashSet.contains,
but only for really short Strings. Mine are about 3-5x longer than
those in the benchmark.. so for me, HashSet.contains() and String's
clever hashCode() wins.

import java.util.HashSet;
import java.util.Set;

public class MatchStringSetBenchmark {

private static final int TIMES_TO_RUN_TEST = 10;
private static final int HAMMER_SIZE = 1000000;
private static final double TIMES_UNROLLED = 6;

public static void main(String[] args) {

String[] stringsToMatch = { "hello", "highe", "maybe" };

Set<String> realSet = fillHashSet(stringsToMatch);

for (int t = 0; t < TIMES_TO_RUN_TEST; t++) {

long then = System.nanoTime();
for (int i = 0; i < HAMMER_SIZE; i++) {
// three matching inputs and three failing inputs to
fairly distribute
// match/fail bias in algorithm
// assert realSet.contains("hello");
// assert realSet.contains("hello");
// assert realSet.contains("hello");
// assert !realSet.contains("hella");
// assert !realSet.contains("hella");
// assert !realSet.contains("hella");
realSet.contains("hello");
realSet.contains("hello");
realSet.contains("hello");
realSet.contains("hella");
realSet.contains("hella");
realSet.contains("hella");
}
long now = System.nanoTime();

long elapsed = now - then;
double costOfSearch = ((double) elapsed) / HAMMER_SIZE *
TIMES_UNROLLED;

System.out.println("ns per HashSet.contains(): " +
costOfSearch);
}

for (int t = 0; t < TIMES_TO_RUN_TEST; t++) {

long then = System.nanoTime();
for (int i = 0; i < HAMMER_SIZE; i++) {
// assert contains("hello");
// assert contains("hello");
// assert contains("hello");
// assert ! contains("hella");
// assert ! contains("hella");
// assert ! contains("hella");
contains("hello");
contains("hello");
contains("hello");
contains("hella");
contains("hella");
contains("hella");
}
long now = System.nanoTime();

long elapsed = now - then;
double costOfSearch = ((double) elapsed) / HAMMER_SIZE *
TIMES_UNROLLED;

System.out.println("ns per CUSTOM.contains(): " +
costOfSearch);
}
}

private static Set<String> fillHashSet(String[] stringsToMatch) {

Set<String> realSet = new HashSet<String>();
for (String string : stringsToMatch) {

realSet.add(string);
}

return realSet;
}

public static boolean contains(String input) {

char first = input.charAt(0);
if (first == 'h') {
char second = input.charAt(1);
if (second == 'e') {
char third = input.charAt(2);
if (third == 'l') {
char fourth = input.charAt(3);
if (fourth == 'l') {
char fifth = input.charAt(4);
return (fifth == 'o');
} else {
return false;
}
} else {
return false;
}
}
if (second == 'i') {
char third = input.charAt(2);
if (third == 'g') {
char fourth = input.charAt(3);
if (fourth == 'h') {
char fifth = input.charAt(4);
return (fifth == 'e');
} else {
return false;
}
} else {
return false;
}
}
return false;
}
if (first == 'm') {
char second = input.charAt(1);
if (second == 'a') {
char third = input.charAt(2);
if (third == 'y') {
char fourth = input.charAt(3);
if (fourth == 'b') {
char fifth = input.charAt(4);
return (fifth == 'e');
} else {
return false;
}
} else {
return false;
}
}
}

return false;
}
}

softwarepearls_com · Oct 25, 2008

Well, since it's a hash value, there is no requirement to read all the
chars. I just checked again and in the current JDK hashCode() actually
reads all the characters.

That's because the original (JDK 1.0) approach, which only read say
the first 16 chars, ran into problems.. creating the same hashcode for
longer strings.. and that meant terrible Hashtable performance.
There's a Sun Bug Parade entry on this, for more detail.

Robert Klemme · Oct 25, 2008

That's because the original (JDK 1.0) approach, which only read say
the first 16 chars, ran into problems.. creating the same hashcode for
longer strings.. and that meant terrible Hashtable performance.
There's a Sun Bug Parade entry on this, for more detail.

Actually I believe I have seen an implementation that does not read all
chars but jumps for longer strings. I would have to do more research
though to find out which version of JDK it was or whether I mixed this
up with a string hashing function in another library / programming language.

Thanks for the info!

Kind regards

robert

JSF feeds a Set<String> into a Set<Foo>	2	Jun 2, 2011
matching strings in a large set of strings	13	Apr 29, 2010
java-code ... how 'matching' are 2 strings ?	3	Feb 11, 2006
Speeding up multiple regex matches	3	Nov 18, 2005
WIN32 - Update Text in a Window in order to show its size in Pixels and coordinates	0	Oct 4, 2023
How to set up a fast correct java build?	41	Jan 8, 2010
Get all strings matching given RegExp	3	Apr 3, 2008
efficiently splitting up strings based on substrings	7	Sep 5, 2009

Speeding up matching Strings in a set

softwarepearls_com

Lew

Robert Klemme

Robert Klemme

Lew

Lew

Patricia Shanahan

Volker Borchert

Mike Schilling

softwarepearls_com

softwarepearls_com

softwarepearls_com

softwarepearls_com

Arved Sandstrom

softwarepearls_com

Robert Klemme

Robert Klemme

softwarepearls_com

softwarepearls_com

Robert Klemme

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads