Siemel said:
I don't think I'm in that 1%.
I wouldn't dare claim to be either, without doing some research first.
A naive guess is to add up all the letters
and modulo by the bucket size,
I think you mean table size. Each bucket has a separate size determined
by the number of items that hash to that index.
Or, perhaps you meant "the size measured in number of buckets" and I was
just reading it wrong. I could see that.
which they say should be prime, but why is
this?
I'm not sure I ever completely understood that. The one obvious reason
doesn't apply to most implementations. In "probing" collision-resolution
schemes it makes sense, but most use "chaining" or buckets. What I'm
calling "probing" is the situation where a collision (i.e., a hash to a
location that's already taken) is handled by trying other locations. For
example, a naive approach is to keep trying the next slot until an empty
one is found. A better technique (I think) would be double hashing, where a
second hash function is used to determine the sequence of slots to try.
So if h1() is the primary hash function and h2() is the secondary,
suppose h1(x) = n, but slot n is already taken. Then you would try
n + h2(x), n + 2*h2(x), n + 3*h2(x), etc., in all cases modding by the
table size. Here, having the table size prime ensures that you will
eventually try every table slot: a prime shares no factor with any step
h2(x) that isn't a multiple of it, so the probe sequence cycles through
the whole table -- if there's an open slot, you'll find it.
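
In code, the probe loop would look something like this (h1 and h2 below
are throwaway stand-ins I made up; the interesting part is the loop, not
the hash functions):

#include <stddef.h>

enum { SIZE = 101 };              /* prime table size */

static const char *table[SIZE];   /* NULL marks an empty slot */

/* Throwaway primary and secondary hashes, just for illustration. */
static size_t h1(const char *s) {
    size_t h = 0;
    for ( ; *s; ++s) h += (unsigned char)*s;
    return h;
}

static size_t h2(const char *s) {
    size_t h = 0;
    for ( ; *s; ++s) h = 37 * h + (unsigned char)*s;
    return h;
}

/* Double-hashing probe. Because SIZE is prime, any step in 1..SIZE-1
   is coprime to it, so the loop visits every slot exactly once before
   giving up. Returns an open slot's index, or -1 if the table is full. */
static int find_slot(const char *x) {
    size_t step = h2(x) % SIZE;
    if (step == 0) step = 1;      /* the step must be nonzero */
    size_t i = h1(x) % SIZE;
    for (int tries = 0; tries < SIZE; ++tries) {
        if (table[i] == NULL)
            return (int)i;
        i = (i + step) % SIZE;
    }
    return -1;
}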
But hardly anyone seems to use schemes like this, and it seems
obvious why. Why would you want to take up additional slots in the
table, increasing the chances of future collisions, when you can just
chain things outside the table (in buckets)? Lookup complexity for the
element is no worse (may even be better), and the table is less cluttered.
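
For comparison, the skeleton of a chained table is about this simple
(the names and the 101 are just mine for illustration):

#include <stddef.h>

enum { NBUCKETS = 101 };

struct node {
    const char *key;
    struct node *next;
};

static struct node *chains[NBUCKETS];   /* each slot heads a bucket's list */

/* Any hash reduced modulo NBUCKETS will do here. */
static size_t bucket_of(const char *s) {
    size_t h = 0;
    for ( ; *s; ++s) h += (unsigned char)*s;
    return h % NBUCKETS;
}

/* A collision just lengthens one chain; it never consumes another slot. */
static void insert(struct node *n) {
    size_t i = bucket_of(n->key);
    n->next = chains[i];                /* push onto the front of the chain */
    chains[i] = n;
}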
So why the prime table size? I can't see an obvious reason. If hash
values are approximately uniformly distributed over 0..N-1, I'd think
that ideal table sizes would be numbers that evenly divide N, because
any remainder splits the table into two regions with different
probabilities of a given value hashing into them.
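
To check that intuition with a toy example: if hash values are uniform
over 0..255 and the table has 10 slots, slots 0 through 5 each receive
26 of the 256 possible values, while slots 6 through 9 receive only 25.
A few lines will confirm it (N and TABLE_SIZE are made-up numbers):

#include <stdio.h>

int main(void) {
    enum { N = 256, TABLE_SIZE = 10 };  /* 10 does not divide 256 */
    int counts[TABLE_SIZE] = {0};

    /* Tally how many of the N possible hash values land in each slot. */
    for (int v = 0; v < N; ++v)
        counts[v % TABLE_SIZE]++;
    for (int i = 0; i < TABLE_SIZE; ++i)
        printf("slot %d: %d values\n", i, counts[i]);
    return 0;
}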
#include <stddef.h>

/* Naive hash: sum the character values, then reduce modulo the
   table size measured in buckets. */
size_t hash(const char *s, int buckets) {
    size_t h = 0;
    for ( ; *s; ++s) h += (unsigned char)*s;  /* unsigned char avoids sign trouble */
    return h % buckets;
}
Kernighan & Pike's _The Practice of Programming_ mentioned a similar
hash function, but it included (if I recall correctly) multiplying the
total by a constant before adding in the next character value. This
constant has been experimentally determined to be effective for ASCII
text. I'm not sure if anyone really knows why it works. I don't recall
what the value was, but I'm sure someone can tell us.
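
The shape, at least, was something like this; the multiplier below (31,
the value Java's String.hashCode happens to use) is only a placeholder,
since I don't remember the constant they chose:

#include <stddef.h>

enum { MULTIPLIER = 31 };   /* placeholder; not necessarily K&P's constant */

/* Multiply-and-add string hash of the kind K&P describe. */
size_t kp_style_hash(const char *s, int buckets) {
    size_t h = 0;
    for ( ; *s; ++s)
        h = MULTIPLIER * h + (unsigned char)*s;
    return h % buckets;
}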
But we need to hash only the first maxchars characters.
Actually, I think K&P also talked about this, and suggested it was a bad
idea. This is from memory again, so I may be wrong, but I believe they
mentioned Java using a scheme like this originally. It turned out not to
work well (reportedly because long, similar strings, such as URLs,
collided when only a sample of the characters was hashed), and it was
changed to use the entire string.
-Kevin