Speeding up matching Strings in a set

V

Volker Borchert

softwarepearls_com wrote:

|> Well, that the precise data inputs matter a lot (see benchmark code
|> below). For the test as written, the custom logic approach is roughly
|> twice as fast. But.. I saw that the length of input Strings matter a
|> lot, which is to be expected since the longer the Strings, the deeper
|> the if nesting gets. It's a race between HashSet's O(1) logic vs my
|> O(n) logc.

Only if the strings to test are re"used". The String.hashCode() takes
time proportional to String.length() the first time it is called for
any given String instance.

|> I also had a quick look at the generated bytecode (using Eclipse's
|> compiler, javac fans YMMV).. and found some recurring sillyness which
|> would not feature in the hand generated logic, so the custom approach
|> would give a significant performance advantage over HashSet.contains,
|> but only for really short Strings. Mine are about 3-5x longer than
|> those in the benchmark.. so for me, HashSet.contains() and String's
|> clever hashCode() wins.

The key might be code and data locality. How do the JIT compiled code
and the hashtable fit into instruction and data caches and vm pages?

BTW, have you tested switch against elsif?
 
D

Daniel Pitts

softwarepearls_com said:
Let's say you had a set of Strings (I deliberately use lower case, so
not necessarily java.util.Set), and you wanted to know if String x was
in this set.. what would be the fastest way?

Anyone have any experience comparing the performance of
HashSet<String> against regular expressions?

HashSet is plenty fast in my experience. If you really wanted to get
"the fastest", you'd have to do some analysis.

All String comparison's (HashSet or FSA, or other) have the worst case
of O(n), which is when all but the last character checked matches. If
*all* of your strings start with "foo", but end unpredictably, then you
would be better matching backwards. If every string is completely
random, then a FSA or HashSet would be similar in performance.

There may be some other heuristic to help the worst case.

Now, after you've done that, you might also do some more analysis and
find that all your strings have characters that are between 32 and 196.

Then, you can use a FSA where each node contains an array of length 164
that holds the transition state. Depending on the number of strings
(and the uniformity of strings), you might have enough memory to
construct such an FSA which includes all your strings. This would
result in an algorithm that for every character in the string-to-check
uses two compares (to ensure the char is in range), a subtraction, an
array index lookup, another compare (for null), and a loop for the next
char. Then one final comparison for the final "isTerminal" state.

The alternative is: calculate the hash of the string-to-check (which
itself is n addition/multiplication operations), and then at least one
array access, and then n*m comparisons, where n is the length of the
string to check, and m is the number of hash collisions.

So, depending on the set size, data size, number of checks per second,
CPU power, and memory, you should try both approaches and see which
matches your requirements best.

Hope this helps.
 
M

Mark Space

Robert said:
Actually I believe I have seen an implementation that does not read all
chars but jumps for longer strings. I would have to do more research
though to find out which version of JDK it was or whether I mixed this
up with a string hashing function in another library / programming
language.

Joshua Bloch talks about this in Effective Java, 2nd ed. It was 16
characters max, and the algorithm jumped around for strings that were
longer than 16 characters, iirc. I don't recall in what version of Java
this was fixed, however.
 
M

Mark Space

softwarepearls_com said:
The setup costs of any matching approach are not relevant for my
application.. because the matching will be done a huge (10^[5..7])
number of times..

Considering this and your other post (the one talking about optimizing
68k programming), have you considered JNI? How about a database with
some well choosen indexes? (How big is this set here?)

For JNI, you could possibly construct the HashSet first with Java, then
export the internal arrays to JNI C structs. Then write some custom
code to do the quick look-up.

Similar idea with using a database: eliminate as much Java as possible
and use (presumably) faster code to do the work that needs optimizing.
 
M

Mark Space

softwarepearls_com said:
The set would hold very few Strings.. maybe a maximum of two dozen.
The Strings would average a length of 20-30 chars. Many of them could
share prefixes that have lengths of maybe half the String's length.
And finally, I would be doing tens, hundreds of thousands of lookups.

OK, if it's only like 2 dozen, I'd consider something like a hash on
each character, so you don't have to even scan the whole input string
(to compute a hash).

OTOH, I have a hard time believing that this will be your actual
bottleneck. I'd write the program with just a HashSet, then profile it
and make sure the whole thing isn't IO bound or something.

Remember how to optimize:

1. Don't do it.
2. Don't do it yet.
3. Make sure you have a correct, working, clear solution first.
 
D

Daniel Pitts

softwarepearls_com said:
Let's say you had a set of Strings (I deliberately use lower case, so
not necessarily java.util.Set), and you wanted to know if String x was
in this set.. what would be the fastest way?

Anyone have any experience comparing the performance of
HashSet<String> against regular expressions?

I just did an experiment. I actually had a similar situation, but
instead of "contains", I needed a map (from name to an object). I
originally had used a HashMap, which was fast enough, but this thread
got me thinking. I created a StringMap class which is *almost* a Map
(no iteration). The StringMap class utilized a Deterministic Finite
State Automata.

I ran my benchmarks and the speed as nearly the same between both the
StringMap<E> and the HashMap<String, E>. Surprisingly the
HashMap<String, E> performed slightly better. Not by enough that it
would sway my decision, except the fact that HashMap is existing and
well supported.

Now, in your situation there may be other factors, but I would surprised
if it was that much of a difference for you either.

Note, the size of the String keyset was in the 9000 range, so that may
affect the results.

Hope this helps,
Daniel.
 
P

Patricia Shanahan

Daniel said:
I just did an experiment. I actually had a similar situation, but
instead of "contains", I needed a map (from name to an object). I
originally had used a HashMap, which was fast enough, but this thread
got me thinking. I created a StringMap class which is *almost* a Map
(no iteration). The StringMap class utilized a Deterministic Finite
State Automata.

I ran my benchmarks and the speed as nearly the same between both the
StringMap<E> and the HashMap<String, E>. Surprisingly the
HashMap<String, E> performed slightly better. Not by enough that it
would sway my decision, except the fact that HashMap is existing and
well supported.
>
Note, the size of the String keyset was in the 9000 range, so that may affect the results.


Modern processors do computation very fast, but achieve that partly by
having a lot of things going on at once, including work on instructions
that depend on conditional branches. Two things tend to slow them down.
Cache misses are expensive because access to memory takes a lot longer
than transfers inside the chip. Conditional branches are expensive,
unless correctly predicted, because the processor has to throw away a
lot of partly done work and switch to processing the correct next
instruction.

The number of cache misses is probably more-or-less independent of the
choice of method. However, the state machine and similar methods may
have a lot more conditional branches, and a lot more opportunities for
the processor to guess wrong, than the HashSet method.

Of course, the results will be influenced by data dependent issues such
as how quickly the state machine can reject non-matching inputs.

Patricia
 
J

John B. Matthews

Mark Space said:
Joshua Bloch talks about this in Effective Java, 2nd ed. It was 16
characters max, and the algorithm jumped around for strings that were
longer than 16 characters, iirc. I don't recall in what version of Java
this was fixed, however.

At the end of item 9, he mentions: "The String hash function implemented
in all releases prior to 1.2 examined at most sixteen characters, evenly
spaced throughout the string..." The even spacing was a problem for
hierarchical strings like URLs and, I suppose, package names.
 
R

Robert Klemme

At the end of item 9, he mentions: "The String hash function implemented
in all releases prior to 1.2 examined at most sixteen characters, evenly
spaced throughout the string..." The even spacing was a problem for
hierarchical strings like URLs and, I suppose, package names.

Mark, John, thanks for helping my feeble memory!

Kind regards

robert
 
S

softwarepearls_com

softwarepearls_com wrote:

|> Well, that the precise data inputs matter a lot (see benchmark code
|> below). For the test as written, the custom logic approach is roughly
|> twice as fast. But.. I saw that the length of input Strings matter a
|> lot, which is to be expected since the longer the Strings, the deeper
|> the if nesting gets. It's a race between HashSet's O(1) logic vs my
|> O(n) logc.

Only if the strings to test are re"used". The String.hashCode() takes
time proportional to String.length() the first time it is called for
any given String instance.

Sure.. but in my situation they would very much be reused. In fact, I
strongly suspect that they're even interned Strings.. which, if true,
would open up an even faster optimization.
The key might be code and data locality. How do the JIT compiled code
and the hashtable fit into instruction and data caches and vm pages?

BTW, have you tested switch against elsif?

I haven;t delved deeper than the benchmark code I wrote. Like another
poster suggested, I can't lose track of the remainder of the
application... so I settled with the HashSet approach, for the time
being.
 
S

softwarepearls_com

softwarepearls_com said:
The setup costs of any matching approach are not relevant for my
application.. because the matching will be done a huge (10^[5..7])
number of times..

Considering this and your other post (the one talking about optimizing
68k programming), have you considered JNI? How about a database with
some well choosen indexes? (How big is this set here?)

Like I wrote elsewhere, inserting any heavy-duty layers is out of the
question. I have some JNI experience from many years ago, and at least
then, the performance hit to bridge the language gap was shockingly
painful. I don't know what the situation is these days, but let's just
say I'm no JNI fan and will always try to avoid if possible.

IMO JNI only makes sense when you can pass a very large amount of data
to some non-Java code that can process it let's say an order of
magnitude faster than Java can (eg by using Intel/AMD streaming
instructions). My Strings-in-set matching requirement doesn't fall in
this scenario.
 
P

Patricia Shanahan

softwarepearls_com said:
Sure.. but in my situation they would very much be reused. In fact, I
strongly suspect that they're even interned Strings.. which, if true,
would open up an even faster optimization.

If you can be sure they are all interned, you could use an
IdentityHashMap. It does pointer comparison and uses the identity-based
hash code method.

Patricia
 
D

Daniel Pitts

Patricia said:
Modern processors do computation very fast, but achieve that partly by
having a lot of things going on at once, including work on instructions
that depend on conditional branches. Two things tend to slow them down.
Cache misses are expensive because access to memory takes a lot longer
than transfers inside the chip. Conditional branches are expensive,
unless correctly predicted, because the processor has to throw away a
lot of partly done work and switch to processing the correct next
instruction.

The number of cache misses is probably more-or-less independent of the
choice of method. However, the state machine and similar methods may
have a lot more conditional branches, and a lot more opportunities for
the processor to guess wrong, than the HashSet method.

Of course, the results will be influenced by data dependent issues such
as how quickly the state machine can reject non-matching inputs.

Patricia
Well, the state machine was implemented with an array where the choice
of the next state was an element at the specific index, so the branches
were minimal (branch if outside the range of array, branch is data at
array index is null).

Granted, my benchmark was designed to represent *my* typical use-cases
(expected to match much more often than not match). That may be
different than the OP's use-case.
 
Z

zerg

Patricia said:
Conditional branches are expensive, unless correctly predicted,
because the processor has to throw away a lot of partly done work
and switch to processing the correct next instruction.

I see, with my crystal ball, a new processor a few years from now ...
When it sees a conditional branch, it follows both branches in parallel
until it has enough information to drop one of them ... unless it's
already doing that with an earlier branch ... (*that* will have to wait
for a quantum computer!)
 
P

Patricia Shanahan

zerg said:
I see, with my crystal ball, a new processor a few years from now ...
When it sees a conditional branch, it follows both branches in parallel
until it has enough information to drop one of them ... unless it's
already doing that with an earlier branch ... (*that* will have to wait
for a quantum computer!)

Processing both sides in parallel has been considered for at least 10
years. See "Threaded Multiple Path Execution", Steven Wallace, Brad
Calder, and Dean Tullsen, 25th International Symposium on Computer
Architecture, pages 238-249, June 1998.
http://www-cse.ucsd.edu/~calder/papers/ISCA-98-TME.pdf

In addition to the problem you note, reaching another hard-to-predict
branch execution before an earlier one has been resolved, there is a
price to pay in wasted cache and memory accesses.

Patricia
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top