Why MULT 31 (hash function for string)?

G

gokkog

Hi there,


There's a classic hash function to hash strings, where MULT is defined
as "31":

//from programming pearls
unsigned int hash(char *ptr)
{ unsigned int h = 0;
unsigned char *p = ptr;
int n;
for (n = k; n > 0; p++) {
h = MULT * h + *p;
if (*p == 0)
n--;
}
return h % NHASH;
}

Why MULT defined as 31? ( How about 29? 24? or 26? )


Thanks,
Wenjie
 
R

Richard Heathfield

(e-mail address removed) said:
Hi there,


There's a classic hash function to hash strings, where MULT is defined
as "31":

Why MULT defined as 31? ( How about 29? 24? or 26? )

Why not find out yourself? Generic hashing routines are intended to get you
a decent spread of hashes for arbitrary data. So cook up a few million
records of arbitrary data, and see what kind of spreads you get for various
multipliers.

If you call it research, maybe you can even fool someone into paying you.
 
E

Eric Sosman

Hi there,


There's a classic hash function to hash strings, where MULT is defined
as "31":

//from programming pearls
unsigned int hash(char *ptr)
{ unsigned int h = 0;
unsigned char *p = ptr;
int n;
for (n = k; n > 0; p++) {
h = MULT * h + *p;
if (*p == 0)
n--;
}
return h % NHASH;
}

It looks like this code was probably mis-transcribed.
There's this undefined quantity `k' floating around, and
the behavior at end-of-string can only be called broken.
But that doesn't seem central to your question:
Why MULT defined as 31? ( How about 29? 24? or 26? )

It's a mixture of superstition and good sense.

First, the superstition: People who write code having
to do with hash tables apparently recall that prime numbers
are particularly "good" for them. It seems they don't always
remember just what the "goodness" was or in what connection,
but they'll throw prime numbers into the mix whenever they
can. They'll throw in prime numbers even if they're not too
sure what a prime number is! A colleague of mine once ran
across this little coding gem:

#define HASHSIZE 51 /* a smallish prime */

Second, the good sense: Suppose MULT were 26, and consider
hashing a hundred-character string. How much influence does
the string's first character have on the final value of `h',
just before the mod operation? The first character's value
will have been multiplied by MULT 99 times, so if the arithmetic
were done in infinite precision the value would consist of some
jumble of bits followed by 99 low-order zero bits -- each time
you multiply by MULT you introduce another low-order zero, right?
The computer's finite arithmetic just chops away all the excess
high-order bits, so the first character's actual contribution to
`h' is ... precisely zero! The `h' value depends only on the
rightmost 32 string characters (assuming a 32-bit int), and even
then things are not wonderful: the first of those final 32 bytes
influences only the leftmost bit of `h' and has no effect on
the remaining 31. Clearly, an even-valued MULT is a poor idea.

Need MULT be prime? Not as far as I know (I don't know
everything); any odd value ought to suffice. 31 may be attractive
because it is close to a power of two, and it may be easier for
the compiler to replace a possibly slow multiply instruction with
a shift and subtract (31*x == (x << 5) - x) on machines where it
makes a difference. Setting MULT one greater than a power of two
(e.g., 33) would also be easy to optimize, but might produce too
"simple" an arrangement: mostly a juxtaposition of two copies
of the original set of bits, with a little mixing in the middle.
So you want an odd MULT that has plenty of one-bits.

You'd also like a MULT that "smears" its operand bits across
as much of `h' as you can manage, so MULT shouldn't be too small
(consider the degenerate case of MULT==1). Also, MULT shouldn't
be too large -- to put it differently, UINT_MAX-MULT shouldn't be
too small. How small is "too small" depends to some extent on
the expected lengths of the strings; I suspect 31 is "too small"
if the strings are short (the values won't have time to invade
the high-order part of `h'). I think it would be wiser to choose
MULT somewhere between sqrt(UINT_MAX) and UINT_MAX-sqrt(UINT_MAX),
making sure it's odd and has lots of one-bits. Primality doesn't
seem important here -- but perhaps someone else may offer a good
reason to choose a prime. Sometimes superstition has valid roots.
 
A

Ancient_Hacker

Thanks Eric for a good rundown of hash multipliers. A lot of us don't
have a good feel for the tradeoffs in this area.


FWIW: Many of the Borland compilers (wickedly fast if you don't already
know), used a multiplier of "13" for hashing.
 
R

Richard Heathfield

Eric Sosman said:
A colleague of mine once ran
across this little coding gem:

#define HASHSIZE 51 /* a smallish prime */

<snorfl>

1 is prime, 3 is prime, 5 is prime, 7 is prime, 9 is prime, 11 is prime...
 
G

gokkog

Eric said:
It looks like this code was probably mis-transcribed.
There's this undefined quantity `k' floating around, and
the behavior at end-of-string can only be called broken.
But that doesn't seem central to your question:
int k =2; /*defined before the function, from the book: programming
pearl*/
It's a mixture of superstition and good sense.

<snip>

Thanks for your comments Eric. It is very helpful. Complementary to
your comments on prime number hash table size, Indeed I just found a
web page:
http://www.concentric.net/~Ttwang/tech/primehash.htm
 
G

gokkog

Richard said:
(e-mail address removed) said:




Why not find out yourself? Generic hashing routines are intended to get you
a decent spread of hashes for arbitrary data. So cook up a few million
records of arbitrary data, and see what kind of spreads you get for various
multipliers.

There's an alternative way to discuss it on the usenet.
If you call it research, maybe you can even fool someone into paying you.

Sometimes "Usenet is a strange place."
 
C

Chris Uppal

Eric said:
You'd also like a MULT that "smears" its operand bits across
as much of `h' as you can manage, so MULT shouldn't be too small
(consider the degenerate case of MULT==1). Also, MULT shouldn't
be too large -- to put it differently, UINT_MAX-MULT shouldn't be
too small. How small is "too small" depends to some extent on
the expected lengths of the strings; I suspect 31 is "too small"
if the strings are short (the values won't have time to invade
the high-order part of `h').

I have had some success using the multiplier from some "approved"
linear-conguential PRNG before -- even with "difficult" input data such
as words which are all 3-5 letters and all capitals. I think of it as
a "peturbed" random number generator, in that the input values (chars,
or whatever) replace the additive step in the PRNG. I don't know how
much justification there really is for that way of thinking, but I find
it reassuring...

E.g. (digging out some old C code)

static HASH
lc_hash(uchar *ptr)
{
HASH hash;

for (hash = 0; *ptr; ptr++)
{
hash += *ptr;
hash *= 597485621;
}

return hash;
}

(HASH is some kind of unsigned integer)

-- chris
 
C

Chris Uppal

I said:
I have had some success using the multiplier from some "approved"
linear-conguential PRNG before -- even with "difficult" input data
such as words which are all 3-5 letters and all capitals.

I forgot to add that by "some success" I meant that I measured the
distributions of final values for a range of table sizes and found (a)
no apparent significant divergence from randomness and (b) no apparent
tendency to "resonate" with any particular sizes. That was using
real-world sized samples of real-world data.

So I used power-of-two sized hashtables, which I'm sure that Eric, as a
founder member and leading light of the Campaign To Stamp Out Primes,
would find pleasing -- were it not that he is also active their sister
organisation: the Campaign To Stamp Out Powers Of Two. ;-)

-- chris
 
A

Ancient_Hacker

Richard said:
Eric Sosman said:


<snorfl>

1 is prime, 3 is prime, 5 is prime, 7 is prime, 9 is prime, 11 is prime...

When picking a prime for a middling-size table, I just recall that
Knuth's MIX language (a horribly ancient IBM 650 emu-alike) also
denotes a prime in roman numerals. If that's not big enough, the
International Motherhood of Multiple Marine Mammal Machinists is also a
prime.
 
B

Ben Pfaff

Ancient_Hacker said:
When picking a prime for a middling-size table, I just recall that
Knuth's MIX language (a horribly ancient IBM 650 emu-alike) also
denotes a prime in roman numerals.

I just use a decent hash algorithm. Then I can use a
power-of-two size table, saving an expensive remaindering
operation on every table search.
 
C

Charles Richmond

Richard said:
(e-mail address removed) said:




Why not find out yourself? Generic hashing routines are intended to get you
a decent spread of hashes for arbitrary data. So cook up a few million
records of arbitrary data, and see what kind of spreads you get for various
multipliers.

If you call it research, maybe you can even fool someone into paying you.
Hey, Richard, you are the one who said:

"Give me a couple of years and a large research grant,
and I'll give you a receipt."

;-)
 
C

Charles Richmond

Ancient_Hacker said:
When picking a prime for a middling-size table, I just recall that
Knuth's MIX language (a horribly ancient IBM 650 emu-alike) also
denotes a prime in roman numerals. If that's not big enough, the
International Motherhood of Multiple Marine Mammal Machinists is also a
prime.
"If 91 were prime, it would be a counterexample to your conjecture."
-- Bruce Wheeler
 
E

Eric Sosman

Chris said:
I wrote:




I forgot to add that by "some success" I meant that I measured the
distributions of final values for a range of table sizes and found (a)
no apparent significant divergence from randomness and (b) no apparent
tendency to "resonate" with any particular sizes. That was using
real-world sized samples of real-world data.

So I used power-of-two sized hashtables, which I'm sure that Eric, as a
founder member and leading light of the Campaign To Stamp Out Primes,
would find pleasing -- were it not that he is also active their sister
organisation: the Campaign To Stamp Out Powers Of Two. ;-)

Superstition! Fight it everywhere! Join the Society to
Suppress Multiples of Unity!
 
L

Logan Shaw

Ben said:
I just use a decent hash algorithm. Then I can use a
power-of-two size table, saving an expensive remaindering
operation on every table search.

Computing the remainder is expensive? Yeah, sure, true on 8086.
But on anything recent? I thought on any pipelined processor
(which mean most processors in use today) essentially all
register-to-register integer operations could be completed in
only one cycle.

To put it another way, is there any modern processor (even an
embedded processor) where "mod" is actually more expensive than
"and"? I agree the former takes more silicon, but it seems
like most chips have gone ahead and devoted the silicon to it
to make it fast.

- Logan
 
R

Roland Csaszar

Richard said:
1 is prime, 3 is prime, 5 is prime, 7 is prime, 9 is prime, 11 is prime...

<OT>1 is not prime, the first prime number is 2, the only even prime. The
definition of a prime number is: "has excatly 2 distinct divisors, 1 and
itself".</OT>
 
G

gokkog

Logan said:
Computing the remainder is expensive? Yeah, sure, true on 8086.
But on anything recent? I thought on any pipelined processor
(which mean most processors in use today) essentially all
register-to-register integer operations could be completed in
only one cycle.

To put it another way, is there any modern processor (even an
embedded processor) where "mod" is actually more expensive than
"and"? I agree the former takes more silicon, but it seems
like most chips have gone ahead and devoted the silicon to it
to make it fast.

- Logan

I made a quick measurement of "modulus" v.s. "and" or "&". It shows
that modulus is expensive than "and" and "&" on my "modern" PC
generally(there are cases when the output is similar):
%: 1.09143201161e-006
&: 4.56217200789e-007
and: 4.06694146882e-007

That's not significant though (IMHO).
---
def testMOD():
"""Modulus Timing"""
aha = 100%3


def testAND():
"""AND Timing"""
aha = 100 & 3

def testAND2():
"""AND Timing"""
aha = 100 and 3

if __name__=='__main__':
from timeit import Timer

t = Timer("testMOD()", "from __main__ import testMOD")
print "%: ", t.timeit(number=100000)/100000

t = Timer("testAND()", "from __main__ import testAND")
print "&: ", t.timeit(number=100000)/100000

t = Timer("testAND2()", "from __main__ import testAND2")
print "and: ", t.timeit(number=100000)/100000
 
S

Spoon

Logan said:
Computing the remainder is expensive? Yeah, sure, true on 8086.
But on anything recent? I thought on any pipelined processor
(which mean most processors in use today) essentially all
register-to-register integer operations could be completed in
only one cycle.

Seems you thought wrong. Division is expensive.
To put it another way, is there any modern processor (even an
embedded processor) where "mod" is actually more expensive than
"and"? I agree the former takes more silicon, but it seems
like most chips have gone ahead and devoted the silicon to it
to make it fast.

Is NetBurst modern enough?

http://en.wikipedia.org/wiki/NetBurst

DIV takes 66-80 cycles on Northwood, 56-70 cycles on Prescott.

AND takes only half a cycle (!) on Prescott.

http://www.intel.com/design/pentium4/manuals/248966.htm

As far as AMD is concerned, 32-bit DIV takes ~40 cycles.
 
M

Michael Wojcik

A colleague of mine once ran
across this little coding gem:

#define HASHSIZE 51 /* a smallish prime */

Well, at least it's smallish, for suitable definitions of "ish".
How small is "too small" depends to some extent on
the expected lengths of the strings; I suspect 31 is "too small"
if the strings are short (the values won't have time to invade
the high-order part of `h').

Depends on the size of the hash value, of course. If this is used
to calculate a 16-bit hash, then any string of at least two printable
ASCII characters affects both bytes of the value, any string of
at least three printable ASCII characters affects at least the lower
15 bits, and any string of at least three ASCII non-whitespace
printable characters affects all 16 bits.

If it's used to calculate a 32-bit hash, then even relatively short
strings like "abcdef" (in ASCII) affect all 32 bits.
I think it would be wiser to choose
MULT somewhere between sqrt(UINT_MAX) and UINT_MAX-sqrt(UINT_MAX),
making sure it's odd and has lots of one-bits. Primality doesn't
seem important here -- but perhaps someone else may offer a good
reason to choose a prime.

Well, if you're not using the whole range of hash values - if you're
taking the value modulo hash table size - then you want MULT to be
relatively prime to the modulus. For example, say MULT is 6 and your
table size is 15; you'll tend to get clumping around every third
bucket, because 3 is a common factor.

Sometimes the hash table size is fixed, but if it's variable, then
you'll want to choose MULT so that it's unlikely to share factors
with the modulus. Making MULT a prime that's not too small will do
the trick; for most applications, it seems unlikely that the hash
table size will be a multiple of 31.

That's my guess, anyway.

That said, I suspect you're right, and using a prime in this sort
of application may often owe as much to cargo-cult programming as
to real analysis. I note that this construct is basically a
linear-congruential PRNG with a variable increment, and primality
isn't a requirement for the multiplier in a good LCPRNG. Most of
the multipliers listed in the table of "good" LCPRNG parameters
in Schneier's _Applied Cryptography_ are composite.
 
E

Eric Sosman

Eric Sosman wrote:



int k =2; /*defined before the function, from the book: programming
pearl*/

I still think it's mis-transcribed. Either that, or it's
not intended as a hash function for strings, but for groups of
k strings stored consecutively (first character of second string
right after the '\0' of the first, and so on).

Next time you post a code snippet, consider including a
brief description of what it's supposed to do. If you don't,
people will have to guess about the intent of the code and
may guess wrong. "//from programming pearls" does not qualify
as a specification ...
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,432
Messages
2,571,680
Members
48,796
Latest member
Greg L.

Latest Threads

Top