Eric said:
[...]
There are a lot of classic mistakes made during implementation of hash
tables that can reduce their performance. 1) If you use a prime-sized
table (as some books recommend) you will pay a horrendous penalty for
the % operator on every access; always use a power-of-2-sized table
(and an odd, or in fact prime, skip value for the open-address
hash table.)
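A minimal Python sketch of the two indexing strategies being compared
(the hash value and table sizes are illustrative):

```python
def index_mod(h, size):
    # Prime-sized table: the % operator compiles to a hardware divide.
    return h % size

def index_mask(h, size):
    # Power-of-2 table: a single AND with (size - 1) does the same job.
    return h & (size - 1)

h = 0xDEADBEEF
assert index_mask(h, 1024) == h % 1024  # identical result, cheaper operation
```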
The horrendousness of the penalty for division should be
assessed in light of what else is afoot. It may be drowned
by the computation of the hash function itself, for example.
True, but in those cases, using a hash instead of a balanced tree is
probably not the real performance issue anyway. Hash tables are about
hard-core "to the metal" optimizing -- if not, then you are on
theoretically very shaky ground anyway.
Also, there's a trade-off: If the division costs you N cycles
more than an AND would but spreads the data better and saves
you an average of C comparisons, it may be worth while.
Ok, but versus a good design, this cannot be the case.
[...] Note,
too, that an AND simply discards bits from the hash value;
if you engage in a more complex "folding" operation to let
all the bits participate, the cycle count starts creeping up
toward that of the division.
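For concreteness, one common folding step (an illustrative sketch, not
anyone's specific hash function): XOR the high half of the value into
the low half before masking, so the bits a bare AND would discard still
influence the index -- at the cost of extra cycles.

```python
def masked_index(h, size):
    # Bare AND: the high bits of h are simply thrown away.
    return h & (size - 1)

def folded_index(h, size):
    # Fold: XOR the high half into the low half first, so all 32 bits
    # participate in the final index.
    h ^= h >> 16
    return h & (size - 1)

# A hash whose information lives entirely in the high bits:
h = 0x12340000
assert masked_index(h, 16) == 0   # high bits lost
assert folded_index(h, 16) != 0   # high bits still matter
```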
Real hash functions don't require folding.
If you choose a power-of-two table size, it doesn't matter
whether the skip value is prime or not: as long as it's odd,
it'll work just fine.
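The reason an odd skip suffices with a power-of-2 table is that
gcd(odd, 2^k) = 1, so the probe sequence visits every slot before
repeating. A quick check (table size chosen for illustration):

```python
from math import gcd

def probe_slots(start, skip, size):
    # Slots visited by open-address probing with a fixed skip value.
    return {(start + i * skip) % size for i in range(size)}

size = 16
assert gcd(5, size) == 1
assert probe_slots(3, 5, size) == set(range(size))  # odd skip: full cycle
assert len(probe_slots(3, 6, size)) == size // 2    # even skip: half the slots
```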
This is not correct. The reason takes a little thought. By using
primes, you reduce the probability that any distinct pair of skip
values will have a low gcd(). So if you happen to have a few overly
long chains "near" each other, the probability that they interfere with
each other multiple times within a chained sequence is minimized. This
is seen most clearly by picking the numbers 3 and 9, and assuming that
two chains are at the same offset modulo 3. If the chain with the skip of
3 is really long (15 items, say) then one third of them (5 items) will
intersect with the chain with the skip of 9. If the skip values are
distinct primes, then successive interference can only happen at most
every skip1 * skip2 items. Thus using prime skip values will tend,
in fact, to spread the distribution of entries better and consequently
reduce the maximum chain length on average.
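The 3-versus-9 example above can be checked numerically (the table size
and starting offsets here are made up for illustration):

```python
T = 64

def chain_slots(start, skip, length, T):
    # Slots occupied by a probe chain of the given length.
    return {(start + i * skip) % T for i in range(length)}

# Same offset mod 3, skips 3 and 9: repeated interference.
a = chain_slots(0, 3, 15, T)   # the long chain (15 items)
b = chain_slots(0, 9, 6, T)
assert len(a & b) == 5         # one third of the long chain collides

# Distinct prime skips 5 and 7: collisions only every 5 * 7 = 35 slots.
c = chain_slots(0, 5, 15, T)
d = chain_slots(0, 7, 6, T)
assert len(c & d) == 2         # just the shared start and slot 35
```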
[...] More generally, if you choose a table
of size T, the skip distances Si should be relatively prime to
T but need not be prime. The attraction of a prime T is that
it's easy to find Si that are relatively prime to it!
Easy but costly. To find small primes for skip values (it turns out
that you never need more than 512 or so) one is typically at worst
going to do a table lookup that fits in a tiny corner of the L1 cache.
There turns out to be a perfectly balanced way of doing this if you
have a good 32-bit (or higher) hash function -- you use the lower bits
of the hash for the initial index, and the upper bits for the index
into the prime table to determine the skip value. So you only need a
single hash function.
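A sketch of that single-hash scheme (the skip-prime table contents here
are illustrative, not Eric's actual table):

```python
# Small table of candidate skip primes (contents illustrative).
SKIP_PRIMES = [3, 5, 7, 11, 13, 17, 19, 23]

def index_and_skip(h, size):
    # One 32-bit hash does double duty: the low bits pick the initial
    # slot, the high bits pick the skip prime from the table.
    index = h & (size - 1)
    skip = SKIP_PRIMES[(h >> 16) % len(SKIP_PRIMES)]
    return index, skip

idx, skip = index_and_skip(0xDEADBEEF, 1024)
assert 0 <= idx < 1024 and skip % 2 == 1  # odd skip, so the probe cycles fully
```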
This is another interesting trade-off. In an open-addressed
table that holds pointers to the "payload" data (rather than the
data itself), holding on to the computed hash alongside the data
pointer will probably double the size of the table.
I don't know what kind of thing you want to hash where the index and
stored hash are not dominated by the actual raw data that you are
hashing in the first place.
[...] And there's
where the interesting part comes in: For the same expenditure of
storage, you could have a pointers-only table with twice as many
entries,
Huh? In a chained hash solution, you don't have chains that point at
NULL. And aliased entries will map to the same place. That means you
*can't* have twice as many entries if you are memory limited. The
worst case scenario is if you were hashing an integer or a float or
something, but in those cases you would want a special design and are
likely to have a specially crafted hash function (the identity
function, for example). Besides such marginal cases, your reasoning
fails kindergarten arithmetic.
[...] or half the "load factor" for the same number of stored
items. Search and insertion times rise steeply at high load
factors (in ways that depend on the algorithms), so the potential
reduction in probe count might make up for the increased number
of item comparisons per probe.
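The "rise steeply" claim matches the classic uniform-hashing estimate
of average probes per successful search in an open-addressed table,
(1/alpha) * ln(1/(1 - alpha)) at load factor alpha -- a quick check:

```python
import math

def expected_probes(alpha):
    # Classic uniform-hashing estimate: average probes per successful
    # search in an open-addressed table at load factor alpha.
    return (1 / alpha) * math.log(1 / (1 - alpha))

assert expected_probes(0.5) < 1.5   # roughly 1.39 probes at half full
assert expected_probes(0.9) > 2.5   # roughly 2.56 probes at 90% full
```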
You must be smoking pot. You are positing that the cost of additional
hash value storage (keep in mind that in a hash table, there *IS* no
locality, unless you do repeated requests, in which case, there is
trivial locality) is in some way comparable to calling the element
comparison function. That's generally just going to be ridiculous.