Tony said:
(Aside: Use of complete sentences on USENET is a virtue).
Designing and implementing a hashing library is more difficult (say 5X to
10X) than a linked list. There are many more choices to consider for
implementation and more research and thought is therefore required (indeed,
one may need linked list techniques to implement hashtables if chaining is
chosen as the solution to collisions). The first hashtable one creates
will surely not be the last nor will there be only one hashtable design to
satisfy all needs. It will take a few (or a dozen) iterations and use over
time to settle and be satisfied with some "final" designs. The OP may want
to take a look at a C++ library for useful algorithms and design ideas if
that language is not a deterrent to him, as those are evolved and robust
implementations. A container and algorithm library should have
cohesiveness and developing one container in isolation is a "myopic"
approach. Finally, a hashtable should not be the first container a noob
should try to create.
IMO, it is nowhere near that difficult...
hash tables are (conceptually) intimidating, but in practice are damn near
one of the simplest things to implement (well, ok, linked lists are maybe
simpler, but this is not always the case...).
well, ok, there are 3 major varieties of hash (my own terms here):
single-index hashes;
multi-index/iterative hashes;
chained hashes.
a single-index hash is what I could also call a "cache hash", basically, one
computes the index, and checks the item at the index. if it is a match, the
lookup succeeded, so do whatever. if it is not a match, do whatever and put
the result in this spot in the hash (so that it will be found next time).
IMO, this approach is very simple (maybe only a few lines of code in many
cases), and does speed up many operations (primarily lookups), but is not
good for "exact" lookups (since collisions will typically replace whatever
key was there before...).
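for example, a minimal sketch of such a cache hash (names are mine, and strlen stands in for whatever the "real" slow lookup would be):

```c
#include <stdlib.h>
#include <string.h>

#define CACHE_SIZE 256               /* power of 2, so we can mask */

static char *cache_key[CACHE_SIZE];  /* cached keys (NULL if slot empty) */
static int   cache_val[CACHE_SIZE];

/* stand-in for the "real" (slow) lookup; here just the string length */
static int lookup_slow(const char *s) { return (int)strlen(s); }

static unsigned hash_str(const char *s)
{
    unsigned i = 0;
    while (*s) i = i*4093 + (unsigned char)*s++;
    return i;
}

int cached_lookup(const char *s)
{
    unsigned j = (hash_str(s)>>12) & (CACHE_SIZE-1);
    if (cache_key[j] && !strcmp(cache_key[j], s))
        return cache_val[j];          /* hit: found it on the first try */
    /* miss: do the slow lookup and overwrite whatever key was here */
    free(cache_key[j]);
    cache_key[j] = malloc(strlen(s)+1);
    strcpy(cache_key[j], s);
    cache_val[j] = lookup_slow(s);
    return cache_val[j];
}
```

note that a collision simply evicts the prior key, which is exactly why this is fine as a cache but no good for "exact" lookups.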
iterative hashing:
an iterative hash basically differs from the above in that:
usually the hash table is larger (since now the hash actually needs to store
stuff);
usually, whenever there is a collision, some means is needed to calculate a
new "subsequent" key.
however, it has a few added costs:
the hash table can fill up;
resizing the table typically requires an awkward "re-hash" process;
if this is done, any existing literal indices into the table will be
invalidated.
this leads to 2 solutions:
make the hash larger than the maximum number of items it may reasonably
contain (the hash-table filling up can then be regarded as a "fatal" error);
don't hold literal indices into the table, thus resizing/rehashing is safe.
as an advantage, this approach tends to be fastest IME, but is a little more
effort to make work well (in one of my more complicated uses, I have a table
of this sort that is around 500 loc).
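a minimal sketch of the iterative approach, assuming string keys, a fixed oversized table, and the multiply-by-a-prime re-probe described further down (names and error handling are mine):

```c
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 16384   /* sized well above the expected item count */

static char *ht_key[HASH_SIZE];   /* NULL marks an empty slot */
static int   ht_val[HASH_SIZE];

static unsigned hash_str(const char *s)
{
    unsigned i = 0;
    while (*s) i = i*4093 + (unsigned char)*s++;
    return i;
}

/* on a collision, advance the hash to get the next index to try */
static unsigned next_hash(unsigned i) { return i*4093 + 1; }

int ht_set(const char *key, int val)
{
    unsigned i = hash_str(key);
    int tries;
    for (tries = 0; tries < HASH_SIZE; tries++) {
        unsigned j = (i>>12) & (HASH_SIZE-1);
        if (!ht_key[j]) {                 /* empty slot: claim it */
            ht_key[j] = malloc(strlen(key)+1);
            strcpy(ht_key[j], key);
            ht_val[j] = val;
            return 0;
        }
        if (!strcmp(ht_key[j], key)) {    /* already present: update */
            ht_val[j] = val;
            return 0;
        }
        i = next_hash(i);                 /* collision: re-probe */
    }
    return -1;                            /* table full: "fatal" error */
}

int ht_get(const char *key, int *val)
{
    unsigned i = hash_str(key);
    int tries;
    for (tries = 0; tries < HASH_SIZE; tries++) {
        unsigned j = (i>>12) & (HASH_SIZE-1);
        if (!ht_key[j]) return -1;        /* empty slot: key is absent */
        if (!strcmp(ht_key[j], key)) { *val = ht_val[j]; return 0; }
        i = next_hash(i);
    }
    return -1;
}
```

since the probe sequence is deterministic per key, lookup can safely stop at the first empty slot.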
chaining:
this can have the advantages of the first and second approaches:
the hash table can be smaller and need not be resized;
one can hold literal indices to items (just not in the hash table itself).
the alteration is to have a sideband array (or, typically, several
associated arrays), which contain:
the keys; the values;
index chains (each item holds the array index of the next item, or a
terminal value such as 0 or -1).
so, one calculates the hash, and jumps to the item in question, and steps
along the list.
indices can be held into this array, and the array can be resized as needed
without any other ill-effects.
the cost of chaining, however, is that if the hash table is too small chains
may become long and performance is reduced.
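a minimal sketch of the sideband-array scheme (my names; -1 as the terminal value, and fixed-size arrays in place of resizing to keep it short):

```c
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 4096    /* can be smaller than the item count */
#define MAX_ITEMS 65536

/* the hash table proper: head index of each chain, -1 if empty */
static int  item_hash[HASH_SIZE];
static int  init_done;

/* sideband arrays: keys, values, and the index chains */
static char *item_key[MAX_ITEMS];
static int   item_val[MAX_ITEMS];
static int   item_next[MAX_ITEMS];
static int   item_count;

static unsigned hash_str(const char *s)
{
    unsigned i = 0;
    while (*s) i = i*4093 + (unsigned char)*s++;
    return (i>>12) & (HASH_SIZE-1);
}

/* returns the item's sideband index, which stays valid across resizes */
int item_set(const char *key, int val)
{
    unsigned j;
    int n;
    if (!init_done) { for (j=0; j<HASH_SIZE; j++) item_hash[j] = -1; init_done = 1; }
    j = hash_str(key);
    for (n = item_hash[j]; n >= 0; n = item_next[n])    /* walk the chain */
        if (!strcmp(item_key[n], key)) { item_val[n] = val; return n; }
    n = item_count++;                  /* not found: append a new item */
    item_key[n] = malloc(strlen(key)+1);
    strcpy(item_key[n], key);
    item_val[n] = val;
    item_next[n] = item_hash[j];       /* link at the head of the chain */
    item_hash[j] = n;
    return n;
}

int item_get(const char *key)
{
    unsigned j = hash_str(key);
    int n;
    if (!init_done) return -1;
    for (n = item_hash[j]; n >= 0; n = item_next[n])
        if (!strcmp(item_key[n], key)) return n;
    return -1;   /* absent */
}
```

growing the sideband arrays only moves the items, not their indices, which is what makes handing out those indices safe.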
hash functions:
some people like shifts and xors, but I dislike this approach, as I have
found that in general it is difficult to get good behavior from this
(typically, a combination which works well for one type of input works
poorly for another).
so, my typical approach is to use primes:
either normal primes or mersenne primes (note that, numerically, they behave
rather differently).
for example, to hash a string:
i=0; while(*s)i=(i*4093)+(*s++);
this uses a normal prime. IME, the most effective strategy here is to shift
right and mask to the table size when getting the table index:
j=(i>>12)&16383;
my usual method for iteration then is to multiply the original value by the
prime:
i=i*4093+1; //the +1 is for the rare case i==0
which, when combined with the above, generates a pseudorandom sequence...
mersenne primes are similar, but different:
i=0; while(*s)i=(i*127)+(*s++);
and:
i=i*127+1;
but differ in that usually one wants the low bits:
j=i&16383;
why this is the case is subtle number magic IMO, but mixing these up (taking
the low bits of a normal prime, or the high bits of a mersenne prime) leads
to poor results.
also, typically with normal primes, one wants the right-shift to be near the
log2 of the prime (4093 is roughly 2^12, hence the shift by 12 above), and
the prime itself should be within a few orders of magnitude of the hash size.
for mersenne primes, the prime should be around 2^((log2 n)/2) if possible
(this seems to give the best results), but this is problematic as there are
not so many usable mersenne primes (whereas normal primes are far more
common...).
as such, I typically use normal primes, except for with small hashes (for
example <=1024 spots) where mersenne primes seem to work better...
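putting the two recipes together, a sketch of the two index functions, one per prime flavor (names are mine; both assume a power-of-2 table size):

```c
/* normal prime: mix with 4093, then take the higher bits for the index */
unsigned hash_index_prime(const char *s, unsigned size)
{
    unsigned i = 0;
    while (*s) i = i*4093 + (unsigned char)*s++;
    return (i>>12) & (size-1);
}

/* mersenne prime: mix with 127 (2^7-1), then take the low bits */
unsigned hash_index_mersenne(const char *s, unsigned size)
{
    unsigned i = 0;
    while (*s) i = i*127 + (unsigned char)*s++;
    return i & (size-1);
}
```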
multiplying against a prime can also hash many other types of data (not just
strings); for example, if done right it is similarly effective with
pointers, ...
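for example, a sketch of hashing a pointer this way (shifting past the low bits, which are often zero due to alignment, is my assumption here, not from the post):

```c
#include <stdint.h>

/* hash a pointer by multiplying its bits by a prime, then taking
   higher bits; size is assumed to be a power of 2 */
unsigned hash_ptr(void *p, unsigned size)
{
    uintptr_t i = (uintptr_t)p;
    i = i * 4093;
    return (unsigned)((i>>12) & (size-1));
}
```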