find words that contain some specific letters

Giovanni Azua

Patricia Shanahan said:
Yes, that's the sort of thing I meant.

The Long hash code is also not perfect, and cannot be, because there are
more longs than ints.

To achieve the ideal, collision-free hash table (a "perfect hash table"),
two factors play a role:

1-. the "perfect" property of the hash function, as defined in the wiki page
cited above
2-. the size of the element set

If #2 grows arbitrarily large, meaning that allocating a hash table the size
of the element set becomes prohibitive memory-wise, then it does not really
matter how perfect the hash function is: there will be collisions anyway, and
an additional search algorithm will be needed to disambiguate the colliding
elements that fall under the same bucket. This is what I think John
Matthews hasn't realized yet about his implementation of the Jumble
algorithm. For a realistic dataset he will be returning elements that do not
match the input (the String built on top of the sorted character input) but
merely happen to collide under the same bucket.

e.g. trying to allocate a hash table of size (Integer.MAX_VALUE -
Integer.MIN_VALUE) would be prohibitive unless you place unrealistic
requirements; so you have Integer with a perfect hash function, but you
will not get a "perfect hash table". It is of course worse for the String
class, where neither #1 nor #2 is fulfilled.

The Character class offers a perfect hash function, and since the element set
is small and finite, i.e. 26 elements http://en.wikipedia.org/wiki/English_alphabet
you can build a perfect hash structure like the one I described and edited
by Tom McGlynn.
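
For illustration, a minimal sketch of the general idea (my hypothetical
LetterTable, assuming lowercase ASCII input; the index c - 'a' is the
perfect hash):

public class LetterTable
{
    // One slot per letter of the English alphabet: no collisions possible.
    private final int[] counts = new int[26];

    public void add( char c )
    {
        counts[c - 'a']++; // c - 'a' is a perfect hash over 'a'..'z'
    }

    public int count( char c )
    {
        return counts[c - 'a'];
    }
}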

Best regards,
Giovanni
 
Giovanni Azua

Hi John,

Please find my comments below.

John B. Matthews said:
The OP's requirements are not so clear to me. Perhaps the OP
can elucidate whether the algorithm addresses the
problem.
Now that I have read it carefully, to my understanding the Jumble problem is
exactly what the OP was asking for. I apologize that I did not completely read
the wiki Jumble description the first time. You did a good job matching
the problem to the OP's description.

John B. Matthews said:
The code [1] correctly implements the first algorithm described in the
article cited [2]. In particular, it uses the sorted letters of each
word as the key in a HashMap.
This is why I think your solution does not solve the Jumble problem:

Assume you have two sorted-character input words, say:

String A
String B

Assume input A has the following word matches: mA1, mA2
Assume input B has the following word matches: mB1, mB2, mB3

Since the String class does not offer a perfect hash function, then assume
too that the following holds hashCode(A) == hashCode(B). Under this scenario
the Hash table will place all mA* and mB* under the same bucket i.e.

hashTable.get(A) == Set of { mA1, mA2, mB1, mB2, mB3 }
hashTable.get(B) == Set of { mA1, mA2, mB1, mB2, mB3 }

Therefore your implementation will return wrong values for A, namely the
elements that match B, and the same for B too. I would expect in a real-life
dataset, e.g. Oxford dictionary 350k words and phrases, to have way more
collisions than this, not only from the lack of perfection of the String
hash function but also because the hash table cannot realistically offer a
size that corresponds to the number of all the distinct sorted-character
input words.

Best regards,
Giovanni
 
Giovanni Azua

Giovanni Azua said:
Since the String class does not offer a perfect hash function, then assume
too that the following holds hashCode(A) == hashCode(B). Under this
scenario
To be more precise, the correct assumption would be:

[snip] that the following holds:

1-. hashCode(A) == hashCode(B)
2-. A.equals(B) == false; so A and B are not equal

Best regards,
Giovanni
 
Lew

Giovanni said:
Hi John,

Please find my comments below.

John B. Matthews said:
The OP's requirements are not so clear to me. Perhaps the OP
can elucidate whether the algorithm addresses the
problem.
Now that I have read it carefully, to my understanding the Jumble problem is
exactly what the OP was asking for. I apologize that I did not completely read
the wiki Jumble description the first time. You did a good job matching
the problem to the OP's description.

John B. Matthews said:
The code [1] correctly implements the first algorithm described in the
article cited [2]. In particular, it uses the sorted letters of each
word as the key in a HashMap.
This is why I think your solution does not solve the Jumble problem:

Assume you have two sorted-character input words, say:

String A
String B

Assume input A has the following word matches: mA1, mA2
Assume input B has the following word matches: mB1, mB2, mB3

Since the String class does not offer a perfect hash function, then assume
too that the following holds hashCode(A) == hashCode(B). Under this scenario
the Hash table will place all mA* and mB* under the same bucket i.e.

hashTable.get(A) == Set of { mA1, mA2, mB1, mB2, mB3 }
hashTable.get(B) == Set of { mA1, mA2, mB1, mB2, mB3 }

Therefore your implementation will return wrong values for A, namely the
That is not true. HashMaps and Hashtables do not use hashCode() for equality
checks; they use equals(). String#equals() and Set#equals() are value-based,
so do not worry that you'll get wrong results.
elements that match B, and the same for B too. I would expect
in a real-life dataset, e.g. Oxford dictionary 350k words and phrases,

350K words and phrases (I didn't know you meant phrases, too) seems low.
Let's say there are about 512 KiPhrases of interest, of an average length for
English of sixteen characters.
to have way more collisions than this, not only from the lack of

A regular HashMap for those data at the default .75f load factor will most
likely be about 475 KiBuckets long, have about a quarter of those empty,
nearly all of the rest will contain one entry, and a few will have two, three,
or much less likely, more.
perfection of the String hash function but also because
the hash table cannot realistically offer a size that
corresponds to the number of all the distinct sorted-character input words.

Can it not?

I forget the overhead of a String, but for the Strings let's say 32 bytes for
address, lengths - I'm just too lazy to look it up right now, and the new
version of Java for 64-bit just cut its pointer overhead in half, so I know
the numbers can change anyway.

That's 16 MiB for String overhead, another 16 MiB for the Set values, another
16 MiB for Map.Entry instances, and another 16 MiB for the character data in
the Strings. That's 64 MiB, well within the capacity of nearly all JVM
installations. Hell, let's double it to 128 MiB. If RAM is that tight, you
can use clever compression techniques to lower the overhead. Nevertheless,
you can clearly see that a 128 MiB data structure is certainly realistic.

That there are collisions does not affect the accuracy of the match. Hash
codes are used merely as an optimization, to bring searches down near constant
time. Once the hash identifies a bucket, the Map then uses equals() to
distinguish between the k candidates at that bucket. That's why a good hash
code, such as String's, is desirable. It keeps the search down at O(k) rather
than O(m). Since in practice the String hash code is very good, the number of
dictionary Strings can vary widely without materially changing the performance
of HashMap#get(). The chances are very good that no bucket will contain more
than one Map.Entry<String, Set<String>>. A few might contain two or three.
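
To make this concrete, take the classic colliding pair "Aa" and "BB"
('A' * 31 + 'a' == 'B' * 31 + 'B' == 2112); a HashMap keeps them straight
anyway:

import java.util.HashMap;
import java.util.Map;

public class CollisionDemo
{
    public static void main( String[] args )
    {
        Map<String, String> map = new HashMap<String, String>();
        map.put( "Aa", "first" );  // hashCode() 2112
        map.put( "BB", "second" ); // hashCode() 2112, same bucket
        // equals() keeps the entries distinct despite the collision:
        System.out.println( map.get( "Aa" ) ); // prints "first"
        System.out.println( map.get( "BB" ) ); // prints "second"
    }
}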

You are correct that the worst case is O(m), or perhaps O(m log m) if the Map
implementor keeps the bucket list properly sorted. If the hash really ever
does just happen to land all the keys in one bucket, then, yes, you are up
that infamous creek without the ameliorative paddle. Write back when this
actually happens to you.
 
John B. Matthews

Patricia Shanahan said:
Yes, that's the sort of thing I meant.

The Long hash code is also not perfect, and cannot be, because there
are more longs than ints.

Ah, thank you. I see Long also uses an exclusive-or fold, similar to the
one seen in Double: (int)(value ^ (value >>> 32)).
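
The pigeonhole is easy to exhibit; for instance, 0L and 0x100000001L
(2^32 + 1) fold to the same int:

public class LongFold
{
    public static void main( String[] args )
    {
        long a = 0L;
        long b = 0x100000001L; // 2^32 + 1
        System.out.println( (int) (a ^ (a >>> 32)) ); // prints 0
        System.out.println( (int) (b ^ (b >>> 32)) ); // prints 0
    }
}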

[Newsgroups trimmed; Followup-To comp.lang.java.programmer.]
 
Giovanni Azua

Lew said:
That is not true. HashMaps and Hashtables do not use hashCode() for
equality checks; they use equals(). String#equals() and Set#equals() are
value-based, so do not worry that you'll get wrong results.
Oops, you are right! I am so sorry! I made a terrible mistake here :( ...
another Ariane goes down.

I should have tested it before posting, but laziness hit me.

Sorry John B. Matthews for casting doubt on your implementation and Lew
thank you for clarifying.

Meanwhile I wrote another algorithm, but it falls short in complexity compared
to Matthews's.

I think Matthews's is O(n log n), though there is still a log x somewhere else
corresponding to the TreeSet binary search. That extra variable x would be,
on average, the number of matching words for a sorted input, but I guess this
will be nearly constant.

Mine, even using concurrency, could get at best O(log n) * O(m), and having
the m in the equation is a blow.
Lew said:
350K words and phrases (I didn't know you meant phrases, too) seems low.
http://www.amazon.com/Oxford-Dictio...=sr_1_1?ie=UTF8&s=books&qid=1244329339&sr=1-1

Best regards,
Giovanni
 
Giovanni Azua

Giovanni Azua said:
elements that fall under the same buckets. This is what I think John
Matthews hasn't realized yet about his implementation of the Jumble
algorithm. For a realistic dataset he will be returning elements that do
not match the of input (String built on top of the sorted character input)
but matches that happen to collide under the same bucket.
I am sorry, this is wrong. Matthews's Jumble implementation is correct and
does not have the problem I described here.

Best regards,
Giovanni
 
Tom Anderson

John, I think you've missed a trick there - doesn't Rotate_Left do a
barrel shift, where the bits that fall off the left-hand end get shoved
back in at the right, rather than a plain shift? If so, the basic
algebraic structure is actually a bit different to Java's hash.
The Java String hash is just about verbatim from Knuth. It has the virtue of
being a well-studied algorithm that has been shown to have good distribution
characteristics.

I'm not so sure about the Ada one. I haven't studied this in a while,
but I seem to recall that power-of-two multipliers mod 2**32 don't have
as good a distribution. Perhaps someone could correct me on that.

If Rotate_Left is a plain shift, then the Ada hash is based only on the
last 10 characters of the string. Multiplying by eight is like shifting
left three places; modding by 2**32 is like anding with 0xffffffff (if
you'll pardon my verbs). When you add in a character, the rightmost bit
you can affect is bit 0, the LSB; after ten shifts of three bits, that is
now at bit thirty, and after one more, it's at bit thirty-three, and has
vanished.

If Rotate_Left is a barrel shift, then our bit of interest doesn't vanish,
because it comes back to the right-hand end of the hash after the eleventh
shift in position 1. Because 3 is coprime with 32, it will then proceed to
make a complete circuit of all positions in the hash over the following
twenty-one shifts, winding up back in position 0 after the thirty-second.
I have a vague worry that this means that strings built from the same
32-character blocks arranged in any order will have the same hash, but I
suspect that this is not the case, because of carries. My maths isn't up
to proving it, though.

I still think this is a shoddy hash. It has no diffusion: if you hash two
strings which differ at one character position, the difference between the
hashes will always be confined to the sixteen bits which correspond to
where that character ended up in the rotating hash. Java's
multiply-and-mod approach quickly spreads the bits of each character out
over the whole hash.
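
In Java terms, the two shapes look roughly like this (an untested sketch of
the Ada one, going by my reading of it):

public class HashShapes
{
    // The Ada-style hash, as I understand it (untested sketch):
    static int rotatingHash( String s )
    {
        int h = 0;
        for ( int i = 0; i < s.length(); i++ )
            h = Integer.rotateLeft( h, 3 ) + s.charAt( i ); // barrel shift, then add
        return h;
    }

    // Java's multiply-and-mod shape, for comparison:
    static int multiplicativeHash( String s )
    {
        int h = 0;
        for ( int i = 0; i < s.length(); i++ )
            h = 31 * h + s.charAt( i ); // odd multiplier diffuses bits quickly
        return h;
    }
}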

tom
 
John B. Matthews

"Giovanni Azua said:
Now that I read it carefully, to my understanding the Jumble problem
is exactly what the OP was asking for. I apologize I did not
completely read the wiki Jumble description the first time. You did a
good job identifying the problem to the OP description.

Thank you. I am hopeful; if this is not the OP's desired solution, the
implementation may suggest another approach.
John B. Matthews said:
The code [1] correctly implements the first algorithm described in
the article cited [2]. In particular, it uses the sorted letters of
each word as the key in a HashMap.
This is why I think your solution does not solve the Jumble problem:

Assume you have two sorted-character input words, say:

String A
String B

Assume input A has the following word matches: mA1, mA2
Assume input B has the following word matches: mB1, mB2, mB3

Since the String class does not offer a perfect hash function, then
assume too that the following holds hashCode(A) == hashCode(B). Under
this scenario the Hash table will place all mA* and mB* under the
same bucket i.e.

hashTable.get(A) == Set of { mA1, mA2, mB1, mB2, mB3 }
hashTable.get(B) == Set of { mA1, mA2, mB1, mB2, mB3 }

Yes, the matching hash codes put A and B in the same bucket, but
collision resolution separates A and B, as well as their respective
permutation lists. My platform's implementation appears to use direct
chaining.

[...] I would expect in a real-life dataset, e.g. Oxford dictionary
350k words and phrases, to have way more collisions than this, not
only from the lack of perfection of the String hash function but also
because the hash table cannot realistically offer a size that
corresponds to the number of all the distinct sorted-character input
words.

Empirically, Java's hashCode() implementation for String generates just
six potential collisions in my ~250K word list. That number surely
increases when the hash code is used to index the implementation's
bucket list, but performance is still excellent.
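
For anyone who wants to reproduce that kind of count, something along these
lines works (a quick sketch; substitute your own word list for the path):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

public class CollisionCount
{
    public static void main( String[] args ) throws Exception
    {
        Set<Integer> seen = new HashSet<Integer>();
        int collisions = 0;
        BufferedReader in = new BufferedReader(
            new FileReader( "/usr/share/dict/words" ) );
        for ( String word; (word = in.readLine()) != null; )
        {
            // a repeated hashCode() is a potential collision
            if ( !seen.add( word.hashCode() ) ) collisions++;
        }
        in.close();
        System.out.println( collisions + " potential collisions" );
    }
}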
 
Tom Anderson

Lew said:
350K words and phrases (I didn't know you meant phrases, too) seems low.
Let's say there are about 512 KiPhrases of interest, of an average length for
English of sixteen characters.


A regular HashMap for those data at the default .75f load factor will most
likely be about 475 KiBuckets long, have about a quarter of those empty,
nearly all of the rest will contain one entry, and a few will have two,
three, or much less likely, more.

I thought I'd put my compiler where Lew's mouth is. I used this code (with
the relevant bits commented out or edited for each test):

http://urchin.earth.li/~twic/tmp/Collisions.java

And the /usr/share/dict/words file on my machine, which contains 234937
words and 2486825 characters including carriage returns, which makes for
an average of 9.6 characters per word.

I used nominal load factors of 0.75 and 1.00. My code follows HashMap in
rounding up to the nearest power of two; this gave true load factors of
0.45 and 0.90 respectively, with tables containing 524288 and 262144
buckets.

My code also follows HashMap in that it stirs the hash up before using it.
For completeness, I give results with and without stirring, and also using
the Ada hash function John posted.
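
For reference, the stirring that Sun's 1.6 HashMap applies looks like this
(from memory, so check your own JDK's source):

// Supplemental hash: spreads the impact of higher bits downward,
// since the table index uses only the low-order bits.
static int hash( int h )
{
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

// Index within a power-of-two-sized table:
static int indexFor( int h, int length )
{
    return h & (length - 1);
}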

The following tables give a number of collided hashes, followed by the
count of buckets with that number of hashes in them, followed by that
count as a percentage of the total number of buckets.

Standard String hash, no stirring, load factor 0.45:

0: 334747 (63.847923278808594)
1: 150391 (28.68480682373047)
2: 33573 (6.403541564941406)
3: 4958 (0.9456634521484375)
4: 571 (0.10890960693359375)
5: 46 (0.0087738037109375)
6: 2 (3.814697265625E-4)

Standard String hash, no stirring, load factor 0.90:

0: 106807 (40.74363708496094)
1: 96064 (36.6455078125)
2: 43199 (16.479110717773438)
3: 12535 (4.7817230224609375)
4: 2940 (1.12152099609375)
5: 494 (0.188446044921875)
6: 96 (0.03662109375)
7: 8 (0.0030517578125)
8: 1 (3.814697265625E-4)

Standard string hash, stirred, load factor 0.45:

0: 334883 (63.873863220214844)
1: 150294 (28.666305541992188)
2: 33387 (6.368064880371094)
3: 5077 (0.9683609008789062)
4: 600 (0.11444091796875)
5: 44 (0.008392333984375)
6: 3 (5.7220458984375E-4)

Standard string hash, stirred, load factor 0.90:

0: 106949 (40.79780578613281)
1: 95944 (36.5997314453125)
2: 42971 (16.392135620117188)
3: 12775 (4.8732757568359375)
4: 2903 (1.1074066162109375)
5: 510 (0.194549560546875)
6: 82 (0.031280517578125)
7: 8 (0.0030517578125)
8: 2 (7.62939453125E-4)

Ada string hash, no stirring, load factor 0.45:

0: 391319 (74.63817596435547)
1: 88824 (16.94183349609375)
2: 24581 (4.688453674316406)
3: 8784 (1.6754150390625)
4: 4115 (0.7848739624023438)
5: 2183 (0.41637420654296875)
6: 1296 (0.2471923828125)
7: 822 (0.1567840576171875)
8: 562 (0.1071929931640625)
9: 374 (0.0713348388671875)
10: 278 (0.0530242919921875)
11: 202 (0.0385284423828125)
12: 161 (0.03070831298828125)
13: 140 (0.026702880859375)
14: 113 (0.02155303955078125)
15: 81 (0.01544952392578125)
16: 57 (0.01087188720703125)
17: 62 (0.0118255615234375)
18: 51 (0.00972747802734375)
19: 42 (0.0080108642578125)
20: 24 (0.00457763671875)
21: 32 (0.006103515625)
22: 27 (0.00514984130859375)
23: 30 (0.0057220458984375)
24: 11 (0.00209808349609375)
25: 9 (0.00171661376953125)
26: 12 (0.002288818359375)
27: 10 (0.0019073486328125)
28: 13 (0.00247955322265625)
29: 12 (0.002288818359375)
30: 5 (9.5367431640625E-4)
31: 5 (9.5367431640625E-4)
32: 6 (0.0011444091796875)
33: 5 (9.5367431640625E-4)
34: 5 (9.5367431640625E-4)
35: 3 (5.7220458984375E-4)
36: 1 (1.9073486328125E-4)
37: 5 (9.5367431640625E-4)
38: 2 (3.814697265625E-4)
39: 2 (3.814697265625E-4)
40: 1 (1.9073486328125E-4)
41: 3 (5.7220458984375E-4)
42: 2 (3.814697265625E-4)
43: 2 (3.814697265625E-4)
44: 1 (1.9073486328125E-4)
45: 2 (3.814697265625E-4)
49: 3 (5.7220458984375E-4)
51: 1 (1.9073486328125E-4)
52: 2 (3.814697265625E-4)
54: 1 (1.9073486328125E-4)
55: 2 (3.814697265625E-4)
62: 1 (1.9073486328125E-4)
73: 1 (1.9073486328125E-4)

Ada string hash, no stirring, load factor 0.90:

0: 156382 (59.654998779296875)
1: 59970 (22.876739501953125)
2: 22198 (8.467864990234375)
3: 9565 (3.6487579345703125)
4: 4742 (1.808929443359375)
5: 2682 (1.023101806640625)
6: 1775 (0.6771087646484375)
7: 1147 (0.4375457763671875)
8: 791 (0.3017425537109375)
9: 547 (0.2086639404296875)
10: 403 (0.1537322998046875)
11: 336 (0.128173828125)
12: 231 (0.0881195068359375)
13: 203 (0.0774383544921875)
14: 156 (0.05950927734375)
15: 132 (0.05035400390625)
16: 118 (0.045013427734375)
17: 90 (0.034332275390625)
18: 80 (0.030517578125)
19: 63 (0.0240325927734375)
20: 64 (0.0244140625)
21: 42 (0.016021728515625)
22: 48 (0.018310546875)
23: 36 (0.01373291015625)
24: 29 (0.0110626220703125)
25: 28 (0.01068115234375)
26: 24 (0.0091552734375)
27: 33 (0.0125885009765625)
28: 20 (0.00762939453125)
29: 16 (0.006103515625)
30: 10 (0.003814697265625)
31: 14 (0.005340576171875)
32: 12 (0.00457763671875)
33: 10 (0.003814697265625)
34: 16 (0.006103515625)
35: 12 (0.00457763671875)
36: 10 (0.003814697265625)
37: 10 (0.003814697265625)
38: 8 (0.0030517578125)
39: 7 (0.0026702880859375)
40: 6 (0.002288818359375)
41: 5 (0.0019073486328125)
42: 8 (0.0030517578125)
43: 3 (0.0011444091796875)
44: 8 (0.0030517578125)
45: 3 (0.0011444091796875)
46: 1 (3.814697265625E-4)
47: 7 (0.0026702880859375)
48: 4 (0.00152587890625)
49: 3 (0.0011444091796875)
50: 2 (7.62939453125E-4)
51: 2 (7.62939453125E-4)
52: 4 (0.00152587890625)
53: 2 (7.62939453125E-4)
54: 2 (7.62939453125E-4)
55: 1 (3.814697265625E-4)
56: 2 (7.62939453125E-4)
57: 1 (3.814697265625E-4)
58: 3 (0.0011444091796875)
61: 1 (3.814697265625E-4)
64: 1 (3.814697265625E-4)
65: 1 (3.814697265625E-4)
66: 1 (3.814697265625E-4)
67: 1 (3.814697265625E-4)
69: 1 (3.814697265625E-4)
70: 1 (3.814697265625E-4)
73: 1 (3.814697265625E-4)
74: 1 (3.814697265625E-4)
79: 1 (3.814697265625E-4)
80: 1 (3.814697265625E-4)
83: 1 (3.814697265625E-4)
87: 1 (3.814697265625E-4)
93: 1 (3.814697265625E-4)
110: 1 (3.814697265625E-4)
116: 1 (3.814697265625E-4)
122: 1 (3.814697265625E-4)

Ada string hash, stirred, load factor 0.45:

0: 336927 (64.26372528076172)
1: 147039 (28.04546356201172)
2: 33984 (6.48193359375)
3: 5527 (1.0541915893554688)
4: 719 (0.13713836669921875)
5: 83 (0.01583099365234375)
6: 7 (0.00133514404296875)
8: 2 (3.814697265625E-4)

Ada string hash, stirred, load factor 0.90:

0: 108112 (41.241455078125)
1: 94683 (36.11869812011719)
2: 42466 (16.199493408203125)
3: 13109 (5.0006866455078125)
4: 3023 (1.1531829833984375)
5: 625 (0.2384185791015625)
6: 110 (0.041961669921875)
7: 12 (0.00457763671875)
8: 2 (7.62939453125E-4)
9: 2 (7.62939453125E-4)

So, Lew's description was rather optimistic; there are more empty buckets
than he predicted, and also more collided buckets. Still, in what we can
consider the normal case - standard String hash, stirred, 0.45 load factor
- 64% of the entries were in a bucket of their own.

Also, the Ada hash function is indeed complete crap. And HashMap's
hash-stirring function is remarkably effective at saving the day!

tom
 
Lew

John said:
Empirically, Java's hashCode() implementation for String generates just
six potential collisions in my ~250K word list. That number surely
increases when the hash code is used to index the implementation's
bucket list, but performance is still excellent.

So a hash map even at a load factor of 1.0f with 250K entries would only have
six buckets with more than one entry. A table for those entries at the default
.75f, about 340K buckets, would be there, too.

There should be an optimization in the JVM for statically-generated,
non-mutable reference structures like the Map<String, Set<String>> reference.
Oh, yes, I'm sure there are some, but I wonder whether the JVM can optimize
such a complex space on its own, or whether there's more opportunity in this
area. I'm thinking of a sort of constant pool that gets built and robotically
compacted at class-initialization time, so that it can optimize a
statically-initialized dictionary structure or whatever.

Idioms that would benefit would include our jumble use case:

<code status="{untested, uncompiled, scratchpad}" complete="false" >
public class Jungle
{
    ...
    /** Thread-safe read-only lookup for word jungles. */
    private static final Map <String, Set <String>> jungles;
    static
    {
        Map <String, Set <String>> builder =
            new HashMap <String, Set <String>> ( NWORDS * 4 / 3 + 1 );
        for ( String word : inputWords() )
        {
            final String sorted = sort( word ); // standard word sort

            Set <String> matches = builder.get( sorted ); // look up by the sorted key
            if ( matches == null )
            {
                matches = new HashSet <String> (); // or concurrent, ...
                builder.put( sorted, matches );
            }
            matches.add( word );
        }
        for ( Map.Entry <String, Set <String>> entry
            : builder.entrySet() )
        {
            Set <String> immutable = Collections.unmodifiableSet(
                entry.getValue() );
            entry.setValue( immutable );
        }
        jungles = Collections.unmodifiableMap( builder );
    }
    ...
}
</code>

HashSet<E>
<http://java.sun.com/javase/6/docs/api/java/util/HashSet.html>

HashMap<K,V>
<http://java.sun.com/javase/6/docs/api/java/util/HashMap.html>

ConcurrentHashMap<K,V>
<http://java.sun.com/javase/6/docs/api/java/util/concurrent/ConcurrentHashMap.html>

Collections methods:

public static <K,V> Map<K,V>
synchronizedMap( Map<K,V> m )
<http://java.sun.com/javase/6/docs/api/java/util/Collections.html#synchronizedMap(java.util.Map)>

public static <T> Set<T>
unmodifiableSet(Set<? extends T> s)
<http://java.sun.com/javase/6/docs/api/java/util/Collections.html#unmodifiableSet(java.util.Set)>

public static <K,V> Map<K,V>
unmodifiableMap(Map<? extends K,? extends V> m)
<http://java.sun.com/javase/6/docs/api/java/util/Collections.html#unmodifiableMap(java.util.Map)>

The Collections methods can be static imported.
 
Lew

Tom said:
> actual hard data!

I admire your patience in giving us hard data.
...
Standard string hash, stirred, load factor 0.45:

0: 334883 (63.873863220214844)
1: 150294 (28.666305541992188)
2: 33387 (6.368064880371094)
3: 5077 (0.9683609008789062)
4: 600 (0.11444091796875)
5: 44 (0.008392333984375)
6: 3 (5.7220458984375E-4)

Standard string hash, stirred, load factor 0.90:

0: 106949 (40.79780578613281)
1: 95944 (36.5997314453125)
2: 42971 (16.392135620117188)
3: 12775 (4.8732757568359375)
4: 2903 (1.1074066162109375)
5: 510 (0.194549560546875)
6: 82 (0.031280517578125)
7: 8 (0.0030517578125)
8: 2 (7.62939453125E-4)
...

We might care about maximum bucket list length and the skew toward shorter or
longer bucket list lengths.

For some uses, say, solving a jumbled-word problem, it may be that (nearly)
all map get()s will succeed, with (nearly) never a miss. The bucket list at
the hash index will (nearly) never be empty. So performance will (mostly) not
be affected by the empty-list test. The ratio of short-list buckets to those
with any entries at all will affect performance.

From Tom Anderson's results, the map at 0.45f load used almost 64% of
capacity for (nearly) never-reached empty bucket lists. Of the rest,
one-entry buckets made up over 79%, and buckets with two or fewer entries
about 97%.

The map at .90f load used almost 41% of capacity for (nearly) never-reached
empty bucket lists. Of the rest, one-entry buckets made up less than 62%, and
two-or-unders about 89.5%.

From upthread:

private static final Map <String, Set <String>> jungles;

For the jumbled-word application all key Strings at a particular bucket will
be the same length. So will all the Strings in the value Set once the right
entry is found. The length check in String#equals() will not help performance.

To optimize searches, intern the Strings. This benefits both the Map and the
Set, one-entry buckets the most and halving value comparisons in two-entry
buckets. It also reduces the impact of load factor on expected search times.
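
A quick sketch of why interning pays: String#equals() begins with an identity
check, which interned strings hit immediately:

public class InternDemo
{
    public static void main( String[] args )
    {
        String built = new String( "aelpp" ); // e.g. sorted letters of "apple"
        String literal = "aelpp";             // literals are interned already

        System.out.println( built == literal );        // false: full compare needed
        System.out.println( built.equals( literal ) ); // true, but O(length)

        built = built.intern();
        System.out.println( built == literal );        // true: identity fast path
    }
}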

The read-only Sets, once retrieved, are likely meant for iteration, perhaps a
lot of it. If so, LinkedHashSet could be a good implementation:

Iteration over a LinkedHashSet requires time proportional to the size
of the set, regardless of its capacity. Iteration over a HashSet is
likely to be more expensive, requiring time proportional to its capacity.

As a general rule, the default load factor (.75)
offers a good tradeoff between time and space costs.

I've lost any will to deviate from the default .75f load factor.
 
John B. Matthews

[QUOTE="Lew said:
Empirically, Java's hashCode() implementation for String generates
just six potential collisions in my ~250K word list. That number
surely increases when the hash code is used to index the
implementation's bucket list, but performance is still excellent.

So a hash map even at a load factor of 1.0f with 250K entries would
only have six buckets with more than one entry.

Not exactly. I was merely counting (potential) collisions generated by
calling the hashCode() method on each distinct key:

$ java -jar dist/jumble.jar zzzz | sort -n | wc -l
218961
$ java -jar dist/jumble.jar zzzz | sort -n | uniq | wc -l
218955

Tom's illuminating code actually looks at the distribution among buckets
in an array managed as HashMap typically does internally. His result is
dispositive for such an implementation. In particular, the index
calculation relies on the capacity being an integral power of two, which
may not apply in other implementations.

<http://urchin.earth.li/~twic/tmp/Collisions.java>

[Thoughtful proposal for optimizing static structures like the
Map<String, Set<String>> elided.]
 
John B. Matthews

Tom Anderson said:
John, I think you've missed a trick there - doesn't Rotate_Left do a
barrel shift, where the bits that fall off the left-hand end get
shoved back in at the right, rather than a plain shift? If so, the
basic algebraic structure is actually a bit different to Java's hash.

D'oh, you're absolutely right. I overlooked this when I was tinkering
with my own hash function, too. Thank you for looking at this.

[...]
I still think this is a shoddy hash.

Well, it's a legacy from GNAT.HTable.Hash, but it does very well in the
context of Ada.Containers.Hashed_Maps. Java uses a bucket list whose
size is an integral power of two, while (this implementation of) Ada
uses the next largest prime. Modeling an Ada collision counter on your
example <http://urchin.earth.li/~twic/tmp/Collisions.java>, I got these
results for collision count, number of buckets having that count, and
the percentage of the total:

$ ./collisions
0: 218586 (55.6%)
1: 126250 (32.1%)
2: 38432 (9.8%)
3: 8362 (2.1%)
4: 1354 (0.3%)
5: 229 (0.1%)
6: 24 (0.0%)
7: 4 (0.0%)
8: 1 (0.0%)

The code is here: <http://home.roadrunner.com/~jbmatthews/jumble.html>

[Followup-To set to comp.lang.ada.]
 
John B. Matthews

Giovanni Azua said:
[...]
Sorry John B. Matthews for casting doubt on your implementation and Lew
thank you for clarifying.

No problem, Giovanni. Your questions, Lew's insights, Tom's exemplary
code, and others' comments have all improved my understanding of this
interesting problem.
 
Lew

John B. Matthews said:
No problem, Giovanni. Your questions, Lew's insights, Tom's exemplary
code, and others' comments have all improved my understanding of this
interesting problem.

I thank you, too, Giovanni - molto grazie - and the rest of you.

Hmm, maybe there's a way to use a different hash that increases the likelihood
of single-entry buckets for the most common letter combinations.

Then I smack myself in the forehead and tell myself not to worry, it works
well enough with the default hashes and load factors, and intern()ed Strings
in this case. Your contributions have taught me a ton more about why the
defaults work, and when they might not.
 
Giovanni Azua

Hi Lew,

Lew said:
I thank you, too, Giovanni - molto grazie - and the rest of you.
Prego! But it would actually be "muchas gracias" because I'm from Cuba :)
Lew said:
Hmm, maybe there's a way to use a different hash that increases the
likelihood of single-entry buckets for the most common letter
combinations.
This is not an answer to that question, but it triggered another, slightly
related one. Off the top of my head, I recall that choosing an appropriate
prime number for the hash table size was a good way to help distribute the
elements more evenly, especially for perfect (or near-perfect) hashCode
implementations. I don't expect the Java hash implementation's automatic
table-growth method to do this (I think it is expensive) ... but I need to
double-check tomorrow; it is now too late here in Switzerland.

I would expect that changing the initial table size you defined before,
"NWORDS * 4 / 3 + 1", into the closest prime number to that value would
yield a better-quality hash structure where the values are more evenly
distributed.

AFAIR computing prime numbers is a complexity problem on its own, but I
think doing it statically will pay off if it is computed only once for the
data size; the structure will remain immutable, so no rehashes or
re-computations of this prime number will happen.

Best regards,
Giovanni
 
Lew

Giovanni said:
I would expect that changing the initial table size you defined before,
"NWORDS * 4 / 3 + 1", into the closest prime number to that value would
yield a better-quality hash structure where the values are more evenly
distributed.

Muchas gracias.
AFAIR computing prime numbers is a complexity problem on its own, but I
think doing it statically will pay off if it is computed only once for the
data size; the structure will remain immutable, so no rehashes or
re-computations of this prime number will happen.

Here's a rough-cut, untried fragment. Maybe you can do better.

private static final Map <String, Set<String>> jungle;
static
{
    final int GOAL = (int) (NWORDS / 0.75f) + 1;

    // smallest (probable) prime at or above the goal capacity
    final int sz = BigInteger.valueOf( GOAL )
        .nextProbablePrime().intValue();

    Map <String, Set<String>> builder =
        new HashMap <String, Set<String>> ( sz );
    ...
    jungle = Collections.unmodifiableMap( builder );
}
 
John B. Matthews

[QUOTE="Lew said:
I would expect that changing the initial table size you defined
before, "NWORDS * 4 / 3 + 1", into the closest prime number to that
value would yield a better-quality hash structure where the values
are more evenly distributed.

Muchas gracias.
AFAIR computing prime numbers is a complexity problem on its own,
but I think doing it statically will pay off if it is computed only
once for the data size; the structure will remain immutable, so no
rehashes or re-computations of this prime number will happen.

Here's a rough-cut, untried fragment. Maybe you can do better.

[Method using BigInteger.nextProbablePrime elided.]

This is an appealing approach [1, 2], and it is used in a widely
available implementation of hashed maps [3, 4]. Unfortunately, it might
run afoul of common implementations of java.util.HashMap, which rely on
the bucket list being an integral power of two [5]. Of course, a custom
hashCode() or a custom HashMap implementation is a possible alternative.

[1]<http://en.wikipedia.org/wiki/Hash_table>
[2]<http://www.concentric.net/~Ttwang/tech/primehash.htm>
[3]<http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-coprnu.ads>
[4]<http://gcc.gnu.org/viewcvs/trunk/gcc/ada/a-chtgop.adb>
[5]<http://www.docjar.org/html/api/java/util/HashMap.java.html>
 
Giovanni Azua

Giovanni Azua said:
[snip]
Thank you both for elucidating this. If I understand big O notation
correctly, the factorial term dominates the complexity, giving O(n!)
overall. Of course, I would welcome clarification.
I have to double-check, but I am actually not sure it is only O(n!). The
reason is that 'n' and 'm' are two different measures here and will be
wildly different, so I am not sure off the top of my head that it will be
O(n!) taking over rather than O(n! * log m). If the log were on top of the
'n', then by all means the factorial would dominate, but not for the 'm'.
I was right: the m is not dominated in this case, but the expression:

O(n! * log m)

becomes [1]:

O(n!) * O(log m)

[1] http://www.ugrad.cs.ubc.ca/~cs260/2008W1/chnotes/ch5/Ch5Cov01.html

Best regards,
Giovanni
 
