Tony said:
(Aside: Use of complete sentences on USENET is a virtue).
Designing and implementing a hashing library is more difficult (say 5X to
10X) than a linked list. There are many more choices to consider for
implementation and more research and thought is therefore required (indeed,
one may need linked list techniques to implement hashtables if chaining is
chosen as the solution to collisions). The first hashtable one creates
will surely not be the last nor will there be only one hashtable design to
satisfy all needs. It will take a few (or a dozen) iterations and use over
time to settle and be satisfied with some "final" designs. The OP may want
to take a look at a C++ library for useful algorithms and design ideas if
that language is not a deterrent to him, as those are evolved and robust
implementations. A container and algorithm library should have
cohesiveness and developing one container in isolation is a "myopic"
approach. Finally, a hashtable should not be the first container a noob
should try to create.
IMO, it is nowhere near that difficult...
hash tables are (conceptually) intimidating, but in practice are damn near
one of the simplest things to implement (well, ok, linked lists are maybe
simpler, but this is not always the case...).
well, ok, there are 3 major varieties of hash (my own terms here):
single-index hashes;
multi-index/iterative hashes;
chained hashes.
a single-index hash is what I could also call a "cache hash", basically, one
computes the index, and checks the item at the index. if it is a match, the
lookup succeeded, so do whatever. if it is not a match, do whatever and put
the result in this spot in the hash (so that it will be found next time).
IMO, this approach is very simple (maybe only a few lines of code in many
cases), and does speed up many operations (primarily lookups), but is not
good for "exact" lookups (since collisions will typically replace whatever
key was there before...).
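for example, a minimal sketch of such a cache hash (names are mine, and strlen stands in for whatever the "real" slow lookup would be):

```c
#include <stdlib.h>
#include <string.h>

#define CACHE_SIZE 256               /* power of 2, so we can mask */

static char *cache_key[CACHE_SIZE];  /* cached keys (NULL if slot empty) */
static int   cache_val[CACHE_SIZE];

/* stand-in for the "real" (slow) lookup; here just the string length */
static int lookup_slow(const char *s) { return (int)strlen(s); }

static unsigned hash_str(const char *s)
{
    unsigned i = 0;
    while (*s) i = i*4093 + (unsigned char)*s++;
    return i;
}

int cached_lookup(const char *s)
{
    unsigned j = (hash_str(s)>>12) & (CACHE_SIZE-1);
    if (cache_key[j] && !strcmp(cache_key[j], s))
        return cache_val[j];          /* hit: found it on the first try */
    /* miss: do the slow lookup and overwrite whatever key was here */
    free(cache_key[j]);
    cache_key[j] = malloc(strlen(s)+1);
    strcpy(cache_key[j], s);
    cache_val[j] = lookup_slow(s);
    return cache_val[j];
}
```

note that a collision simply evicts the prior key, which is exactly why this is fine as a cache but no good for "exact" lookups.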
iterative hashing:
an iterative hash basically differs from the above in that:
usually the hash table is larger (since now the hash actually needs to store
stuff);
usually, whenever there is a collision, some means is needed to calculate a
new "subsequent" key.
however, it has a few added costs:
the hash table can fill up;
resizing the table typically requires an awkward "re-hash" process;
if this is done, any existing literal indices into the table will be
invalidated.
this leads to 2 solutions:
make the hash larger than the maximum number of items it may reasonably
contain (the hash-table filling up can then be regarded as a "fatal" error);
don't hold literal indices into the table, thus resizing/rehashing is safe.
as an advantage, this approach tends to be fastest IME, but is a little more
effort to make work well (in one of my more complicated uses, I have a table
of this sort that is around 500 loc).
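a minimal sketch of the iterative approach, assuming string keys, a fixed oversized table, and the multiply-by-a-prime re-probe described further down (names and error handling are mine):

```c
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 16384   /* sized well above the expected item count */

static char *ht_key[HASH_SIZE];   /* NULL marks an empty slot */
static int   ht_val[HASH_SIZE];

static unsigned hash_str(const char *s)
{
    unsigned i = 0;
    while (*s) i = i*4093 + (unsigned char)*s++;
    return i;
}

/* on a collision, advance the hash to get the next index to try */
static unsigned next_hash(unsigned i) { return i*4093 + 1; }

int ht_set(const char *key, int val)
{
    unsigned i = hash_str(key);
    int tries;
    for (tries = 0; tries < HASH_SIZE; tries++) {
        unsigned j = (i>>12) & (HASH_SIZE-1);
        if (!ht_key[j]) {                 /* empty slot: claim it */
            ht_key[j] = malloc(strlen(key)+1);
            strcpy(ht_key[j], key);
            ht_val[j] = val;
            return 0;
        }
        if (!strcmp(ht_key[j], key)) {    /* already present: update */
            ht_val[j] = val;
            return 0;
        }
        i = next_hash(i);                 /* collision: re-probe */
    }
    return -1;                            /* table full: "fatal" error */
}

int ht_get(const char *key, int *val)
{
    unsigned i = hash_str(key);
    int tries;
    for (tries = 0; tries < HASH_SIZE; tries++) {
        unsigned j = (i>>12) & (HASH_SIZE-1);
        if (!ht_key[j]) return -1;        /* empty slot: key is absent */
        if (!strcmp(ht_key[j], key)) { *val = ht_val[j]; return 0; }
        i = next_hash(i);
    }
    return -1;
}
```

since the probe sequence is deterministic per key, lookup can safely stop at the first empty slot.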
chaining:
this can have the advantages of the first and second approaches:
the hash table can be smaller and need not be resized;
one can hold literal indices to items (just not in the hash table itself).
the alteration is to have a sideband array (or, typically, several
associated arrays), which contain:
the keys; the values;
index chains (each item holds the array index of the next item, or a
terminal value such as 0 or -1).
so, one calculates the hash, and jumps to the item in question, and steps
along the list.
indices can be held into this array, and the array can be resized as needed
without any other ill-effects.
the cost of chaining, however, is that if the hash table is too small chains
may become long and performance is reduced.
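a minimal sketch of the sideband-array scheme (my names; -1 as the terminal value, and fixed-size arrays in place of resizing to keep it short):

```c
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 4096    /* can be smaller than the item count */
#define MAX_ITEMS 65536

/* the hash table proper: head index of each chain, -1 if empty */
static int  item_hash[HASH_SIZE];
static int  init_done;

/* sideband arrays: keys, values, and the index chains */
static char *item_key[MAX_ITEMS];
static int   item_val[MAX_ITEMS];
static int   item_next[MAX_ITEMS];
static int   item_count;

static unsigned hash_str(const char *s)
{
    unsigned i = 0;
    while (*s) i = i*4093 + (unsigned char)*s++;
    return (i>>12) & (HASH_SIZE-1);
}

/* returns the item's sideband index, which stays valid across resizes */
int item_set(const char *key, int val)
{
    unsigned j;
    int n;
    if (!init_done) { for (j=0; j<HASH_SIZE; j++) item_hash[j] = -1; init_done = 1; }
    j = hash_str(key);
    for (n = item_hash[j]; n >= 0; n = item_next[n])    /* walk the chain */
        if (!strcmp(item_key[n], key)) { item_val[n] = val; return n; }
    n = item_count++;                  /* not found: append a new item */
    item_key[n] = malloc(strlen(key)+1);
    strcpy(item_key[n], key);
    item_val[n] = val;
    item_next[n] = item_hash[j];       /* link at the head of the chain */
    item_hash[j] = n;
    return n;
}

int item_get(const char *key)
{
    unsigned j = hash_str(key);
    int n;
    if (!init_done) return -1;
    for (n = item_hash[j]; n >= 0; n = item_next[n])
        if (!strcmp(item_key[n], key)) return n;
    return -1;   /* absent */
}
```

growing the sideband arrays only moves the items, not their indices, which is what makes handing out those indices safe.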
hash functions:
some people like shifts and xors, but I dislike this approach, as I have
found that in general it is difficult to get good behavior from this
(typically, a combination which works well for one type of input works
poorly for another).
so, my typical approach is to use primes:
either normal primes or mersenne primes (note that, numerically, they behave
rather differently).
for example, to hash a string:
i=0; while(*s)i=(i*4093)+(*s++);
this uses a normal prime. IME, the most effective strategy here is to shift
right and mask to the table size when getting the table index:
j=(i>>12)&16383;
my usual method for iteration then is to multiply the original value by the
prime:
i=i*4093+1; //the +1 is for the rare case i==0
which, when combined with the above, generates a pseudorandom sequence...
mersenne primes are similar, but different:
i=0; while(*s)i=(i*127)+(*s++);
and:
i=i*127+1;
but differ in that usually one wants the low bits:
j=i&16383;
why this is the case is subtle number magic IMO, but mixing these up (taking
the low bits of a normal prime, or the high bits of a mersenne prime) leads
to poor results.
also, typically with normal primes, one wants the right-shift to be near the
log2 of the prime (4093 is roughly 2^12, hence the shift by 12 above), and
the prime itself should be within a few orders of magnitude of the hash size.
for mersenne primes, the prime should be around 2^((log2 n)/2) if possible
(this seems to give the best results), but this is problematic as there are
not so many usable mersenne primes (whereas normal primes are far more
common...).
as such, I typically use normal primes, except for with small hashes (for
example <=1024 spots) where mersenne primes seem to work better...
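putting the two recipes together, a sketch of the two index functions, one per prime flavor (names are mine; both assume a power-of-2 table size):

```c
/* normal prime: mix with 4093, then take the higher bits for the index */
unsigned hash_index_prime(const char *s, unsigned size)
{
    unsigned i = 0;
    while (*s) i = i*4093 + (unsigned char)*s++;
    return (i>>12) & (size-1);
}

/* mersenne prime: mix with 127 (2^7-1), then take the low bits */
unsigned hash_index_mersenne(const char *s, unsigned size)
{
    unsigned i = 0;
    while (*s) i = i*127 + (unsigned char)*s++;
    return i & (size-1);
}
```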
multiplying against a prime can also hash many other types of data (not just
strings); for example, if done right it is similarly effective with
pointers, ...
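for example, a sketch of hashing a pointer this way (shifting past the low bits, which are often zero due to alignment, is my assumption here, not from the post):

```c
#include <stdint.h>

/* hash a pointer by multiplying its bits by a prime, then taking
   higher bits; size is assumed to be a power of 2 */
unsigned hash_ptr(void *p, unsigned size)
{
    uintptr_t i = (uintptr_t)p;
    i = i * 4093;
    return (unsigned)((i>>12) & (size-1));
}
```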