Which hashing function to use?

J

January Weiner

Hello,

I need to use a hashing function for relatively short strings (roughly 16
characters or less). The data that needs to be accessed via hash is just a
simple int. I just need to look up values in two dimensional matrices each
time that I encounter a pair of these strings.

I would like to keep the code as simple and standard as
possible, but at the same time to have a reasonable performance.

While looking through the docs on my system I found a few different
implementations:
- hsearch from <search.h>
- tsearch from <search.h>
- bsearch from <stdlib.h>

Which one would you recommend? Or maybe something completly different?
Right now I just use Perl to do the hashing :) this is nice, but I would
love to have a small, efficient, ANSI C program.

Regards,

January
 
U

user923005

Hello,

I need to use a hashing function for relatively short strings (roughly 16
characters or less). The data that needs to be accessed via hash is just a
simple int. I just need to look up values in two dimensional matrices each
time that I encounter a pair of these strings.

I would like to keep the code as simple and standard as
possible, but at the same time to have a reasonable performance.

While looking through the docs on my system I found a few different
implementations:
- hsearch from <search.h>
- tsearch from <search.h>
- bsearch from <stdlib.h>

Which one would you recommend? Or maybe something completly different?
Right now I just use Perl to do the hashing :) this is nice, but I would
love to have a small, efficient, ANSI C program.

Your post is more topical in {follow-ups set}.

I do not think that your problem has an answer yet. It will depend on
lots of things.
For instance, are the strings a static set that never changes? If so,
then a perfect hash is your choice.
There are string searching algorithms faster than a hash table (See
Sedgewick).

Do you need to preserve the data (e.g in a key/value pair set on
disk)?

"I need a hash, which should I use?" is such an open question that any
simple answer will probably be lacking.
 
C

CBFalconer

January said:
I need to use a hashing function for relatively short strings
(roughly 16 characters or less). The data that needs to be
accessed via hash is just a simple int. I just need to look up
values in two dimensional matrices each time that I encounter a
pair of these strings.

I would like to keep the code as simple and standard as
possible, but at the same time to have a reasonable performance.

While looking through the docs on my system I found a few
different implementations:
- hsearch from <search.h>
- tsearch from <search.h>
- bsearch from <stdlib.h>

Which one would you recommend? Or maybe something completly
different? Right now I just use Perl to do the hashing :) this
is nice, but I would love to have a small, efficient, ANSI C
program.

There are a couple of very simple, but efficient, string hashing
routines available (in source form) in hashlib.zip. Although
hashlib is copyright and released under GPL, those functions are
pretty common knowledge and there will be no objections to
including them in anything else. You can find hashlib.zip at:

<http://cbfalconer.home.att.net/download/>
 
J

January Weiner

user923005 said:
Your post is more topical in news:comp.programming {follow-ups set}.

Well, it is not. Not 100%. The reason being that I am looking for
something that is part of the C standard and that is simple to use in C.
What do you think happens when I ask a question like that in
comp.programming? Some wise guy will set the Followup-To to comp.lang.c
:p, that's what would happen. Finally, the reason is that I am not a
computer scientist and need, hm, very practical advice. Practical advice
is something that I always got plenty on comp.lang.c :) Note that you will
find a discussion on different hashing algorithms in the FAQ of this group
-- however, I do not fully understand it (otherwise I would not be asking
the questions).
I do not think that your problem has an answer yet. It will depend on
lots of things.
For instance, are the strings a static set that never changes? If so,
then a perfect hash is your choice.
There are string searching algorithms faster than a hash table (See
Sedgewick).

OK, as I am not a computer scientist, I will try to look it up somewhere.
Meanwhile, which of that is a part of standard C libraries?
Do you need to preserve the data (e.g in a key/value pair set on
disk)?

Nope, I would go for DBM or SQL if I needed that.
"I need a hash, which should I use?" is such an open question that any
simple answer will probably be lacking.

"I need a hash, which of the ones in the standard C libraries should I use"
is a more specific question, is it not?

Kind regards,

January
 
K

Keith Thompson

January said:
user923005 said:
Your post is more topical in news:comp.programming {follow-ups set}. [...]
"I need a hash, which should I use?" is such an open question that any
simple answer will probably be lacking.

"I need a hash, which of the ones in the standard C libraries should I use"
is a more specific question, is it not?

Yes, it is. And the answer is, there are no hash functions in the
standard C library.
 
R

Richard Heathfield

Keith Thompson said:
January said:
user923005 said:
Your post is more topical in news:comp.programming {follow-ups set}. [...]
"I need a hash, which should I use?" is such an open question that any
simple answer will probably be lacking.

"I need a hash, which of the ones in the standard C libraries should I
use" is a more specific question, is it not?

Yes, it is. And the answer is, there are no hash functions in the
standard C library.

Cases could be made for some of the math functions, and perhaps rand. Not
particularly good cases, admittedly, but theoretically...
 
M

Malcolm McLean

January Weiner said:
Hello,

I need to use a hashing function for relatively short strings (roughly 16
characters or less). The data that needs to be accessed via hash is just
a
simple int. I just need to look up values in two dimensional matrices each
time that I encounter a pair of these strings.

I would like to keep the code as simple and standard as
possible, but at the same time to have a reasonable performance.

While looking through the docs on my system I found a few different
implementations:
- hsearch from <search.h>
- tsearch from <search.h>
- bsearch from <stdlib.h>

Which one would you recommend? Or maybe something completly different?
Right now I just use Perl to do the hashing :) this is nice, but I would
love to have a small, efficient, ANSI C program.
There's a general-purpose hash function on my website - under Basic
Algorithms, sample chapter on hash tables.
Whilst it is not specially designed for strings, it's pretty good. Use it
for now, and replace it only if performance is unacceptable.
 
J

January Weiner

CBFalconer said:
There are a couple of very simple, but efficient, string hashing
routines available (in source form) in hashlib.zip. Although
hashlib is copyright and released under GPL, those functions are
pretty common knowledge and there will be no objections to
including them in anything else. You can find hashlib.zip at:

Thank you very much, that is precisely the answer that I needed.

January
 
J

January Weiner

Malcolm McLean said:
There's a general-purpose hash function on my website - under Basic
Algorithms, sample chapter on hash tables.
Whilst it is not specially designed for strings, it's pretty good. Use it
for now, and replace it only if performance is unacceptable.

Thanks!

January
 
P

Peter Nilsson

Keith Thompson said:
Yes, it is. And the answer is, there are no hash functions
in the standard C library.

There are at least two that fit the raw definition: strlen()
and strtoul(..., ..., 36). Not necessarily useful, but forms
of hashing nonetheless.
 
U

user923005

Well, it is not. Not 100%. The reason being that I am looking for
something that is part of the C standard and that is simple to use in C.

The C standard does not contain a hash function. Quite a pity, too.
What hash function you should use will be a function of the
requirements.

The SDBM and Bernstein are good, simple, fast hashes used to operate
on strings when you need something like a join or lookup and a few
rare collisions are OK.
If you need a much lower chance of collision, then a 64 bit hash like
UMAC is probably what is desired.
If you need 0% chance of collisions, then generate a perfect hash
(obviously, this only works if the inputs are distinct).
What do you think happens when I ask a question like that in
comp.programming? Some wise guy will set the Followup-To to comp.lang.c
:p, that's what would happen.

Well, he would be a wise guy. True, the group is not specifically for algorithm discussion, but it's the closest
thing we've got.
Finally, the reason is that I am not a
computer scientist and need, hm, very practical advice. Practical advice
is something that I always got plenty on comp.lang.c :) Note that you will
find a discussion on different hashing algorithms in the FAQ of this group
-- however, I do not fully understand it (otherwise I would not be asking
the questions).

The practical advice you get here will be of the same quality as
Indeed, many of the c.l.c regulars read and post in
both places.
OK, as I am not a computer scientist, I will try to look it up somewhere.
Meanwhile, which of that is a part of standard C libraries?

None of them.
Nope, I would go for DBM or SQL if I needed that.


"I need a hash, which of the ones in the standard C libraries should I use"
is a more specific question, is it not?

Yes. Unfortunately, the answer is not the one you would like to hear.
I guess that this site might be helpful:
http://www.partow.net/programming/hashfunctions/

I guess further that if you state your requirements precisely you will
get better answers.
 
P

Paul Hsieh

Hello,

I need to use a hashing function for relatively short strings (roughly
16 characters or less). The data that needs to be accessed via hash is
just a simple int. I just need to look up values in two dimensional
matrices each time that I encounter a pair of these strings.

There is a google group dedicated to this sort of thing:

http://groups.google.com/group/hash-functions
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,042
Latest member
icassiem

Latest Threads

Top