How to implement a Hash Table in C


Malcolm McLean

CBFalconer said:
Nonsense. "ptr = malloc(sizeof *ptr);" is automatically the right
size. No checking of any form is required.
It's hard to read. Purely syntactical. The indirection operator happens to
be the same as a multiply. All functions take parentheses, except sizeof,
which is a function in the sense that matters - it returns a value based on
its argument, but not of course in terms of generated code. So you've got
two quirks in one expression.

In compiler terms it is better, of course, because if you change ptr's type
the call will automatically update. But that's an overly narrow
understanding of maintainability which doesn't address the real issue.
 

Malcolm McLean

CBFalconer said:
For contrast, look at my hashlib. It was published complete with
the test suite and test programs that evaluated it during and after
development. I have had two and a half bug reports since the
initial publication, and no more for years.
And there was one I found in Dann's recommended text, admittedly in a code
fragment, and one in Ben Pfaff's circular buffer. Two in my two hash
tables, though one could be described as a feature. So assuming
Poisson-distributed bugs, we're doing about equally. However I am blamed.
 

Flash Gordon

CBFalconer wrote, On 18/08/07 19:40:
Nonsense. "ptr = malloc(sizeof *ptr);" is automatically the right
size. No checking of any form is required.

Unless you need enough space for two objects ;-)

Unsurprisingly, I agree with you on this.
 

santosh

Richard said:
Malcolm McLean said:


Does anyone else here apart from Malcolm have difficulty reading it?
Hands up please.

To be honest this construct threw me the first time I encountered it, which
is of course in clc, but it hasn't bothered me in the slightest after I
looked it up in the FAQ.

C's expressions have to be understood with C's rules and quirks in mind, not
by comparing them to mathematical expressions. I suppose someone new to C
will find it a bit strange, but then this is true for almost all
programming languages. There's a reason they are not classified as natural
languages.
 

Flash Gordon

Malcolm McLean wrote, On 18/08/07 21:18:

Actually chapter 2 is untypical. Most present a module of related
functions. So I could present a header for each chapter.

Yes, that would be reasonable.
What I am reluctant to do is to present too much of the code as
"finished".

They don't have to be finished, at least not as generic implementations.
However it is in my opinion *very* important, *especially* in a basic
book for beginners, that you make it clear that they are not finished,
that you point out where the holes are, and give an indication as to
routes to plug them. Otherwise the naive beginner will assume that they
are complete and usable.
It is not meant to be a library with documentation, although
some of the chapters get that way. It's meant to show how things work,
to demystify the functions that people call in everyday programming, or
applications they use.

Which is a fine aim, but they still need to be good enough otherwise
they will not demystify anything but instead add to the confusion.

You should also get your book reviewed somewhere like comp-programming
where the algorithms side will get properly reviewed, instead of the C
issues people concentrate on here.

I would also suggest that your chapter ordering does not make sense to
me. Stacks, queues, linked lists, trees and hash tables are all methods
of organising and storing data, yet the chapters are not grouped
together. Hell, a few chapters after doing trees you have a chapter on a
specific type of tree. I believe that related topics should be grouped
together, not scattered about.
 

Keith Thompson

Malcolm McLean said:
It's hard to read. Purely syntactical. The indirection operator
happens to be the same as a multiply. All functions take parentheses,
except sizeof, which is a function in the sense that matters - it
returns a value based on its argument, but not of course in terms of
generated code. So you've got two quirks in one expression.

No, sizeof is not a function. It's a unary operator, just like "-",
"*", and "&". (If "sizeof" is a function in some abstract sense, then
so is "-", but C doesn't call them functions.) The confusion is
probably caused by the fact that it's the only operator whose symbol
looks like an identifier rather than punctuation.

Pretend that 'sizeof' had been spelled '$', and you'll see what I mean.

But if you find
ptr = malloc(sizeof *ptr);
too confusing because it looks too much like a multiplication, you can
write
ptr = malloc(sizeof(*ptr));
applying the operator to a parenthesized expression.
In compiler terms it is better, of course, because if you change ptr's
type the call will automatically update. But that's an overly narrow
understanding of maintainability which doesn't address the real issue.

The real issue is that the clc-approved form (a) is perfectly
understandable once it's explained (something you could do in your
book if you chose), and (b) makes maintenance easier, because you
don't have to keep the target type and the size expression
synchronized manually (with no warning from the compiler if you get it
wrong).
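
For anyone following along, here is a minimal sketch of that maintenance
point (the struct name is made up purely for illustration):

    #include <stdlib.h>

    struct widget { int id; double weight; };   /* hypothetical example type */

    int main(void)
    {
        struct widget *ptr;

        /* Idiomatic form: the size is taken from what ptr points at, so
           if ptr's type ever changes, this line stays correct without
           being edited. */
        ptr = malloc(sizeof *ptr);

        /* Equivalent but fragile form: the type is repeated by hand, and
           the compiler gives no warning if the two ever drift apart.
           ptr = malloc(sizeof(struct widget)); */

        if (ptr == NULL)
            return EXIT_FAILURE;

        free(ptr);
        return 0;
    }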

You're right that 'ptr = malloc(sizeof *ptr)' is not common. That's a
pity, because it's *better* than the alternatives, and you have an
opportunity to help teach that.
 

Flash Gordon

Malcolm McLean wrote, On 18/08/07 22:02:
It's hard to read.

A lot of newbies here don't find it hard once it is pointed out to them.
Purely syntactical.

Only if you consider "ptr = malloc(sizeof(type))" to be purely syntactical.

The indirection operator happens
to be the same as a multiply.

I very much doubt that people would mistake it for being multiplication
in this usage.
All functions take parentheses, except
sizeof, which is a function in the sense that matters - it returns a
value based on its argument, but not of course in terms of generated
code.

It is only a function in the same sense that a unary minus is a
function. It has one operand and yields a result. The lack of
parentheses in the "clc recommended" form, if anything, makes it clearer
that it is not a function.
So you've got two quirks in one expression.

You have more quirks than that.
In compiler terms it is better, of course, because if you change ptr's
type the call will automatically update. But that's an overly narrow
understanding of maintainability which doesn't address the real issue.

Most of the regulars here have a good few years of experience in
maintaining code written by themselves and others. I personally also
have experience of passing code on (not C code) to be maintained by
others and still being around myself to hear what their issues are. So I
think many of us have reason to believe we *do* understand what the
maintainability issues are, and many of us disagree with you on this.
 

Richard

Richard Heathfield said:
Malcolm McLean said:


Does anyone else here apart from Malcolm have difficulty reading it?
Hands up please.

<snip>

IMO, it is fairly uncommon to see

(sizeof *p)

in legacy code bases and is at odds with "normal" C operators as was
well described by Keith in another post in this thread. I think MM has
a point.

The people reading his algorithms shouldn't need to be C experts to
understand his code.

But don't let that stop you taking another opportunity to proclaim your
genius.
 

pete

Richard said:
Malcolm McLean said (in reply to me):

No, it's a good clc-ism.

(ptr = malloc(N * sizeof *ptr)) is an absolutely great expression;
I use it a lot.
Right. As Sturgeon says, at least 90% of everything is crud, so you'd
expect to find malloc(sizeof *ptr) in only 10% of cases (or fewer).
That doesn't mean it's bad. It just means most people haven't come
across it yet.


If the reader has to perform a mental dereference, he (or she) doesn't
understand sizeof. No dereferencing occurs, because sizeof does not
evaluate its argument. Furthermore, the size is *bound* to be correct
for any object pointer type.

For strings, I frequently call malloc this way:
char *string = malloc(length + 1);
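
A small self-contained sketch of both points above (the helper name is
made up): sizeof *p needs no valid pointer because the operand is not
evaluated, and string allocations need the extra byte for the terminator.

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: duplicate a string, illustrating the
       length + 1 allocation for the terminating '\0'. */
    char *dup_string(const char *src)
    {
        size_t length = strlen(src);
        char *string = malloc(length + 1);   /* +1 for the '\0' */
        if (string != NULL)
            memcpy(string, src, length + 1);
        return string;
    }

    int main(void)
    {
        double *p = NULL;
        size_t n = sizeof *p;   /* fine: sizeof does not evaluate *p, so
                                   the null pointer is never dereferenced */
        char *copy = dup_string("hash table");
        free(copy);
        return n ? 0 : 1;
    }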
 

websnarf

The question is not whether AND is faster than MOD; it's whether
hash-tables implemented with AND as the reduction method are faster
than broadly-similar hashtables using MOD.

If I recall correctly, MOD is said to be better than AND because
more of the bits of the hash contribute to the index.

This is an incorrect way to solve the problem.
[...] If the hash
function is really good, this won't matter -- but we don't always
get to use "really good" hash functions.

Why don't we get to use a "really good" hash function? If you are
using prime modulo then you are trying to use it as a *substitute* for
a good hash function. This is essentially "non-analysis".

Better hash functions plus an AND function have lower cost than
MOD. Prime sized tables don't offer anything else of value.
I'm not saying that tables-using-AND are no faster than
tables-using-MOD; I'm saying that "AND is faster than MOD" doesn't
justify the claim that tuA is faster than tuM.

Well that's nice because I *AM* saying the converse. Use modulo
prime, and you are just wasting your time and causing yourself
analytical headaches in order to achieve a slow solution.
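
For readers who haven't seen the two reduction styles side by side, a
minimal sketch (the table sizes and the hash value are placeholders):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t hash = 0x9E3779B9u;        /* some 32-bit hash value */

        /* Prime-sized table: reduce with modulo.  Every bit of the hash
           can influence the index, but the division is comparatively
           slow on most hardware. */
        uint32_t prime_size = 769;
        uint32_t index_mod  = hash % prime_size;

        /* Power-of-two table: reduce with a mask.  Only the low bits are
           used, so this relies on the hash mixing well, but it is cheap. */
        uint32_t pow2_size  = 1024;
        uint32_t index_and  = hash & (pow2_size - 1);

        printf("mod index: %u, and index: %u\n",
               (unsigned)index_mod, (unsigned)index_and);
        return 0;
    }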
 

websnarf

... snip ...
[...] that always using the same probe increment in a
power of two table size also has bad properties. His
solution is to use the unused bits of the full hash to
select an increment size from a table of primes.
That is dangerous. The purpose of the multiple hash is to
have different search patterns for different objects that
have the same primary hash pattern.
No, it's to have a different search pattern for different objects
that have the same primary hash *INDEX* (i.e., post-masking).
This is obvious since, for any good hash function, index
collisions will be many many times more common than full hash
function collisions.
Mathematically speaking, on average a single 32-bit (good) hash
function collision will occur roughly when 65536 entries appear,
while a hash index collision will occur roughly once the number of
entries reaches the square root of the table size (so for a 9
million entry table, that's just 3000 entries).

No, you are ignoring the size of the hash table itself. For
example, the starting size for hashlib is 17, so no more than 17
hash values (modulo 17) are useful. This changes as the table
enlarges itself.

Huh? Well of course I am ignoring small hash tables -- *ALL*
solutions are basically equivalent for small enough hash tables. The
problem is only interesting when there is a performance impact, which
will happen as the hash table holds thousands of entries.

I think you are misreading my language (I am exercising serious
restraint here): a hash function collision is not the same as a hash
index collision. This distinction matters for completely optimized
hash table implementations.
Again, you are ignoring the effect of varying table size.

I am not. I am concerned with the asymptotic case: hash tables which
are large enough to matter (a few thousand entries); otherwise the
problem is too small for the details to be worth worrying about (you
just need to make sure it works).

Perhaps you don't understand the basic mathematics. If you increase
the size of the table, the probability that two randomly chosen skip
values will not be coprime does *NOT* tend towards zero. The number
of pathological cases increases with the increase in table size. Thus
a strategy that just picks random skip values via another hash or
whatever, does not directly address the inherent inter-collision
problem.

You need to pick coprime (between each other) skip values if you use
linear probing. The most sensible way of doing this is to choose from
a set of primes.
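
One common way to meet the coprimality requirement without a prime table
size, sketched here under the assumption of a power-of-two table: force
the probe increment to be odd, since every odd number is coprime to a
power of two. (Names and values below are placeholders.)

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 1024u                /* power of two (assumption) */

    /* Double-hashing probe sketch: h1 picks the starting slot, h2 picks
       the increment.  Forcing the increment to be odd guarantees it is
       coprime to the power-of-two table size, so the probe sequence
       visits every slot before it repeats. */
    uint32_t probe(uint32_t h1, uint32_t h2, uint32_t attempt)
    {
        uint32_t start = h1 & (TABLE_SIZE - 1);
        uint32_t step  = h2 | 1u;                 /* force an odd step */
        return (start + attempt * step) & (TABLE_SIZE - 1);
    }

    int main(void)
    {
        uint32_t h1 = 0xDEADBEEFu, h2 = 0x12345678u;
        printf("probes: %u %u %u\n",
               (unsigned)probe(h1, h2, 0),
               (unsigned)probe(h1, h2, 1),
               (unsigned)probe(h1, h2, 2));
        return 0;
    }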
 

Chris Dollin

Richard said:
Malcolm McLean said:


Does anyone else here apart from Malcolm have difficulty reading it?
Hands up please.

I'll put my hand up, just to say "Not me sir! Easy peasy sir!
Fits neatly on the eyeball and can be checked without context!
Win win, sir!"

Of course anyone who's using even a remotely syntax-colouring editor
won't think that `sizeof` is an ordinary left operand of a
multiplication; nor will anyone who already knows what `sizeof`
means.

It would be really nice to have empirical evidence either way,
of course, since we're well into anecdotage here.
 

Malcolm McLean

Flash Gordon said:
A lot of newbies here don't find it hard once it is pointed out to them.


Only if you consider "ptr = malloc(sizeof(type))" to be purely
syntactical.
The objection to

ptr = malloc(sizeof *ptr);

is purely syntactical. As Keith said, if sizeof were spelt $ and given
normal operator rules for whitespace

ptr = malloc($*ptr);

would be OK. The * wouldn't look like multiplication of an identifier.
 

Chris Dollin

This is an incorrect way to solve the problem.

That is certainly possible.
[...] If the hash
function is really good, this won't matter -- but we don't always
get to use "really good" hash functions.

Why don't we get to use a "really good" hash function?

Because (a) we don't always get to choose the one that's used,
(b) "really good" hash functions likely to be more expensive
than not-so-good hash functions.
Better hash functions plus an AND function have lower cost than
MOD. Prime sized tables don't offer anything else of value.

It's a three-point comparison, not a two-point one. The middle
point -- the one you're excluding -- would be the one where
prime-sized tables made a difference: where the hash function
didn't manage to distribute the key evenly into all the bits
of the hash, but the MOD-prime step pulled in more of that
variability than AND-pow2 did.
Well that's nice

Thank you: I'm glad my precision is appreciated.
because I *AM* saying the converse. Use modulo
prime, and you are just wasting your time and causing yourself
analytical headaches in order to achieve a slow solution.

Yes, I understood your claim the first time round. I don't
think you've justified your premises yet. Again, I'm not
saying that you're /wrong/; I'm saying that I don't think
your argument is complete.
 

Malcolm McLean

Flash Gordon said:
You should also get your book reviewed somewhere like comp-programming
where the algorithms side will get properly reviewed, instead of the C
issues people concentrate on here.
That will happen.
The practical problem is that people focus on little things like ] instead
of ) for a close, or a missing check for null, because they are unambiguous.
However they are relatively easy to fix. The more serious criticisms come
out later - the car park maybe isn't the best metaphor for a hash table.
So it's best to get C language issues and size_t wars out of the way first.
 

websnarf

This is an incorrect way to solve the problem.

That is certainly possible.
[...] If the hash
function is really good, this won't matter -- but we don't always
get to use "really good" hash functions.
Why don't we get to use a "really good" hash function?

Because (a) we don't always get to choose the one that's used,
(b) "really good" hash functions likely to be more expensive
than not-so-good hash functions.

I thought Bob Jenkins and I made this point moot:

http://www.azillionmonkeys.com/qed/hash.html

Or at least people at Apple (Webkit; i.e., Safari), Adobe (Flash; at
least an old version used it) and Google (Google's open source
sparsehash project) seem to think I've made a point anyway.
It's a three-point comparison, not a two-point one. The middle
point -- the one you're excluding -- would be the one where
prime-sized tables made a difference: where the hash function
didn't manage to distribute the key evenly into all the bits
of the hash, but the MOD-prime step pulled in more of that
variability than AND-pow2 did.

Again, a moot point:

http://www.concentric.net/~Ttwang/tech/inthash.htm

These integer hashes run faster than a modulo operation, so even if
someone hands you a lame hash, using the functions above will do a
better and faster job of further hashing than MOD prime will.
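
As one concrete illustration of that idea, here is a MurmurHash3-style
32-bit finaliser used as a stand-in mixing step (it is a representative
example, not claimed to be the exact function from either link): mix the
raw value first, then mask down to a power-of-two table.

    #include <stdint.h>
    #include <stdio.h>

    /* Cheap shift-and-multiply mixing spreads entropy into the low bits,
       so a power-of-two mask afterwards is safe even for weak keys. */
    uint32_t mix32(uint32_t h)
    {
        h ^= h >> 16;
        h *= 0x85EBCA6Bu;
        h ^= h >> 13;
        h *= 0xC2B2AE35u;
        h ^= h >> 16;
        return h;
    }

    int main(void)
    {
        uint32_t key = 12345u;
        uint32_t table_size = 1024u;            /* power of two */
        printf("index: %u\n", (unsigned)(mix32(key) & (table_size - 1)));
        return 0;
    }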
Thank you: I'm glad my precision is appreciated.


Yes, I understood your claim the first time round. I don't
think you've justified your premises yet. Again, I'm not
saying that you're /wrong/; I'm saying that I don't think
your argument is complete.

Well, much of the industry seems to have accepted the analysis shown
above, so I don't know what else to tell you.
 

Richard Harter

Richard said:
Eric Sosman said:
James Dow Allen wrote:

[...]
Speaking of Falconer, his method (linear probing with second
hash value used for probe increment) *requires* (even more so
than quadratic probing, see upthread) prime table size.

Well, not exactly: It requires only that the increment
between probes be relatively prime to the table size. A prime
table size makes this criterion easy to satisfy, but it's not
the only way. A table size equal to a power of two, for example,
works just fine if the probe increment is odd. (Degenerate
special case: pure linear probing with an increment of unity
works with any table size at all, prime or composite.)

In a thread in comp.programming (IIRC) Paul Hsieh pointed out
that always using the same probe increment in a power of two
table size also has bad properties. His solution is to use the
unused bits of the full hash to select an increment size from a
table of primes.

That is dangerous. The purpose of the multiple hash is to have
different search patterns for different objects that have the same
primary hash pattern. This avoids clustering, and tends to keep
the search chain short. The very iffy thing in hashlib is the
reduction of the range of the secondary hash, which encourages
light clustering in nearly empty tables. The system attempts to
maintain tables at between 44 and 88% of capacity.

Also, I want the table action to be independent of the hash
functions (apart from speed), so the functions are computed every
time needed. The system functions with all the hash functions
replaced by the constant 1 (unity) (tested in the test suite) at a
considerable speed penalty. So the action is hash function
independent, but the speed is highly dependent.

Here is a little clue: Discussions of hash tables are not
confined to your particular implementation.
 

Richard Harter

Richard said:
<OT>
James Dow Allen wrote:
[...]
Speaking of Falconer, his method (linear probing
with second hash value used for probe increment)
*requires* (even more so than quadratic probing,
see upthread) prime table size.
Well, not exactly: It requires only that the increment
between probes be relatively prime to the table size. A prime
table size makes this criterion easy to satisfy, but it's not
the only way. A table size equal to a power of two, for example,
works just fine if the probe increment is odd. (Degenerate
special case: pure linear probing with an increment of unity
works with any table size at all, prime or composite.)

In a thread in comp.programming (IIRC) Paul Hsieh pointed out
that always using the same probe increment in a power of two
table size also has bad properties. His solution is to use the
unused bits of the full hash to select an increment size from a
table of primes.

I did not and do not imply that the probe increment must
be the same for all keys, nor even for all keys that probe
the same slot on the first attempt.

I didn't say that you said or implied any such foolish thing, and I
certainly don't think it. If what I wrote in any way appears to
imply such a thing you have my apologies.
This is one of the reasons
twin primes are popular as table sizes: you use the hash code
mod N for the first probe, and then if necessary you use the
hash mod (N-2) plus 1 as the increment for further probes. The
primality of N guarantees that the increment is relatively prime
to it; the primality of N-2 helps scramble groups of keys that
have simple relationships (sum1, sum2, sum3, ...).

I'm not clear as to why the other sibling in a twin-prime pair
should be superior for scrambling as compared to using some other
prime.
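
For reference, a minimal sketch of the twin-prime scheme quoted above (N
and the hash value are placeholders):

    #include <stdint.h>
    #include <stdio.h>

    #define N 31u   /* table size: 29 and 31 are twin primes (placeholder) */

    int main(void)
    {
        uint32_t hash = 123456789u;              /* some hash code */

        uint32_t first = hash % N;               /* first probe slot */
        uint32_t step  = hash % (N - 2) + 1;     /* increment in 1..N-2 */

        /* Because N is prime and step < N, step is relatively prime to N,
           so first, first+step, first+2*step, ... (mod N) visits every
           slot. */
        printf("first probe %u, increment %u\n",
               (unsigned)first, (unsigned)step);
        return 0;
    }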

Be all of that as it may, the point of Hsieh's scheme is that it
avoids all modulo operations while retaining the advantage of
using different increments. The disadvantage is that the number
of different increment sizes is fixed. The advantage is that all
increments are guaranteed to be primes. I have no idea which
factor outweighs the other, but both are minor.
[snip]

re: counterargument to storing the hash code.
- If the table stores pointers to items rather than item
instances, storing hash values in the table roughly
doubles the amount of memory required. If the same
amount of memory were used for a pointers-only table,
the number of table slots would double and the load
factor would be halved. This should lead to shorter
searches -- and it is not at all clear _a priori_ that
more fast probes will outrace fewer slow probes.

This doesn't sound right - ordinarily a table has to hold a
key/value pair. Adding a hash would increase the load factor by
50% - for a given amount of memory.

I think you said the same thing I did; perhaps we have a
confusion of terminology. (Not surprising, since "hash table"
itself is not "a" thing, but a wide and varied family of things.)

My point is that the choice between a pointer/hash pair and
a pointer alone is not clear-cut. Storing the pair allows one
to short-circuit the (possibly expensive) comparisons when the
search key's hash and the stored hash disagree. For the same
expenditure of memory, a pointer-only table could have twice as
many slots and a corresponding reduction in collision frequencies,
leading to fewer probes per search.

Hmmm: Maybe our disagreement is about where the key lives.
I've been assuming that the key is part of the item, so the
choice is between pointer/hash and pointer. But maybe you're
assuming the key is a separate entity, so the choice is between
key_pointer/data_pointer and key_pointer/data_pointer/hash. The
argument still holds with a different expansion factor: Storing
the hash now adds 50% to the table size, while leaving it out
allows 33% more slots and a load factor reduced by one-third.

That is what I was assuming. Mea culpa; I should have seen that
you were doing it the other way around.
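
To make the two layouts being compared concrete, a sketch of the slot
structures (field names are made up for illustration only):

    #include <stdint.h>
    #include <stdio.h>

    /* Pointer-only slot: the key lives inside the item, so a probe must
       follow the pointer and compare keys to decide whether it has a hit. */
    struct slot_ptr {
        void *item;
    };

    /* Pointer-plus-hash slot: roughly twice the memory per slot, but a
       mismatching stored hash lets the probe skip the (possibly
       expensive) key comparison entirely. */
    struct slot_ptr_hash {
        void    *item;
        uint32_t hash;
    };

    int main(void)
    {
        /* For a fixed memory budget, the pointer-only layout buys roughly
           twice as many slots, i.e. about half the load factor. */
        printf("pointer-only slot: %zu bytes, pointer+hash slot: %zu bytes\n",
               sizeof(struct slot_ptr), sizeof(struct slot_ptr_hash));
        return 0;
    }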
 

Kelsey Bjarnason

[snips]

And there was one I found in Dann's recommended text, admittedly in a code
fragment, and one in Ben Pfaff's circular buffer. Two in my two hash
tables, though one could be described as a feature. So assuming
Poisson-distributed bugs, we're doing about equally. However I am blamed.

Do you see Dann or Ben using ints when size_t is the correct tool to be
used, defending this on the basis that they don't want to code in C and
that, in effect, C should be changed to suit their weird - and objectively
wrong - views?

No, you don't. They have bugs. You have concept failures *plus* bugs.
All non-trivial code has bugs. Code written by people who don't know what
they're doing has bugs *plus* assorted other issues. You never did, for
example, explain the magic values used by squnch, or what magic occurs
when it is passed length values of -1, -137 or -32767. Nor did you
explain how to make it work on a perfectly legitimately-sized buffer
too large to have its size specified by a signed int.

Their code has bugs, so does yours. Yours adds in umpteen different sorts
of unpredictability and confusion, without justification.
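
For what it's worth, a minimal illustration of the size_t point (the 3 GiB
figure is a placeholder, and the sketch assumes a platform where int is 32
bits but size_t is 64):

    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        /* A perfectly legitimate object size that cannot be represented
           in a signed int on such a platform. */
        size_t big = (size_t)3 * 1024 * 1024 * 1024;   /* 3 GiB */

        unsigned char *buf = malloc(big);   /* size_t is the right type */
        if (buf == NULL) {
            puts("allocation failed (which is fine for the example)");
            return 0;
        }
        free(buf);
        return 0;
    }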
 
