How to implement a Hash Table in C

M

Malcolm McLean

As pointed out by Philip Potter, you can't be so cavalier about such
things. If the index is 0 or 1, then squaring it is going to give you
0 and 1 as the new index all the time. Furthermore, depending on the
table size, you can get all sorts of "short cycles". For example 2*2
== 4 (mod 7) and 4*4 == 16 == 2 (mod 7). So 2 and 4 just cycle back
and forth in a table of sized 7.
It doesn't work like that.
The index returned from the hash is squared, made modulus table size, and
used as a linear probe increment. It's effectively secondary hashing where
function 2 is hash 1, squared. It does a reasonable job of busting clusters,
since apart from the first few keys are scattered all over the place.
However I should square the hash, not the index, to get a truly distributed
secondary hash. As it is collisions will create chains of collisions.
It was an accident, anyway.
 
W

websnarf

It doesn't work like that.
The index returned from the hash is squared, made modulus table size, and
used as a linear probe increment.

Well, 0*0 == 0, so you still have a problem. Also this does nothing
special about the general problems that linear probes have then.

Ok, so lets think about this again using the examples I showed above.
Table size of 7, on offsets 2 and 4. If you land on offset 2, then
you will skip in steps of 4. If you land of offset 4 you will go in
steps of 2. In both cases, 2+4 = 4+2 = 6, so they both collide with
their first probe. Furthermore, 2+4+4 = 3 (mod 7) and 4+2+2+2 = 3
(mod 7) so again they collide. Sorry but no, this is a terrible
strategy.
[...] It's effectively secondary hashing where
function 2 is hash 1, squared. It does a reasonable job of busting clusters,
since apart from the first few keys are scattered all over the place.

This is not analysis. Try doing some math, or collecting real data to
back up your claim. I am almost certain that this is actually an
unusually *BAD* strategy -- even worse than CBFalconer's "just hash it
again, and mask it down to a quarter the size" kind of strategy.
However I should square the hash, not the index, to get a truly distributed
secondary hash. As it is collisions will create chains of collisions.
It was an accident, anyway.

Well this fixes one anomaly (the same chain sequence always coming out
of the same slot) bit it does not settle the other standard numeric
collision anomaly (stepping on each other just due to a lack of
coprimality of the two chains).
 
B

Ben Bacarisse

As pointed out by Philip Potter, you can't be so cavalier about such
things.
Quite.

If the index is 0 or 1, then squaring it is going to give you
0 and 1 as the new index all the time. Furthermore, depending on the
table size, you can get all sorts of "short cycles". For example 2*2
== 4 (mod 7) and 4*4 == 16 == 2 (mod 7). So 2 and 4 just cycle back
and forth in a table of sized 7.

That is not what is done. I don't know if what he does is better or
worse than that, but it is in effect using (idx * idx) % size as an
_increment_ (with 0 mapped to 1) -- similar to basic linear probing.
The % size is pointless (since the he repeats the reduction to do the
index operation after adding the increment) and I specifically say idx
rather hash because what gets used is the index that has caused the
collision, not the full hash of the key.

As I understand it, most linear probing schemes that use an increment
other than 1, choose one based on the full hash (or even another hash)
so as to avoid generating the same sequence of probes for all elements
that map a particular initial index.

Of course, there may be some curious reason why this form of linear
probing *is* a good idea in practice, but it certainly looks bad on
paper and it most definitely is *not* quadratic probing as promised in
the text.
 
J

Julienne Walker

It is a good description. The writing is rather dry and my car park metaphor
is more interesting.

My writing fluctuates between excessively dry, overly playful, and
just pain awful. ;-) As for the car park metaphor, I don't think it's
appropriate when valets can park cars in the first available slot and
write the slot number on a key tag. Your readers are very likely to
wonder why this particular car park is going to all of that trouble
with registration numbers for something that could be solved with the
equivalent of a list of pointers.

I'm thinking that a library catalogue would make more sense for this
purpose. The Dewey Decimal system is well known (by anyone over 25
most likely) and can act as a gentle introduction into hashing
concepts.
Also I describe how to do deletions with linear probing.

True. I glossed over that technique in favor of what you referred to
as "lazy deletion". I've found marking items as "occupied but deleted"
and deferring physical removal until a more expensive but far less
frequent rebuild more practical. And my tutorials focus heavily on
practicality. ;-)
The code works on global variables. Whilst this is acceptable for
demonstration code, it also tends to encourage bad programming habits.

I have no excuse, but I can defend the decision if you'd like. In
order to keep the code examples short and highly nutritious I cut out
all of the fluff I could manage. That includes showing the definition
of the variables used with as little framework as possible. Now that
I've been outed on clc, I suppose I'll have to add a section about
good practices that covers my butt. ;-)
This looks like a bug to me

In complete examples rather than minimal snippest, I'd agree that it's
an issue. Calling global variables a bug is a tad extreme though. I
welcome any suggestions for improvement, so please don't hesitate to
send me an email if you have any. I want my tutorials to be as
accurate and useful as possible. :)
(Quote recommended hash table website)

Insertion and deletion are equally simple relative to linear probing:

1void jsw_insert ( void *key, int len )
2...{
3 unsigned h = hash ( key, len ) % N;
4 unsigned step;
5
6 for ( step = 1; table[h] != NULL; ++step )
7 h = ( h + ( step * step - step ) / 2 ) % N;
8
9 table[h] = key;
10}
1void jsw_remove ( void *key, int len )
2...{
3 unsigned h = hash ( key, len ) % N;
4 unsigned step;
5
6 for ( step = 1; table[h] != NULL; ++step )
7 h = ( h + ( step * step - step ) / 2 ) % N;
8
9 table[h] = DELETED;
10}

You need to call the comparison function and break if equal. As it is the
code marks a null at the end of the cluster as deleted.

Ouch. That's what I get for writing on autopilot. I do the same thing
for all of the probing examples. How embarrassing. I'll fix it
straight away. :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,772
Messages
2,569,593
Members
45,111
Latest member
KetoBurn
Top