General Hash Functions In Python


Arash Partow

Hi all,

I've ported various hash functions to python if anyone is interested:



def RSHash(key):
    # Robert Sedgwick's "Algorithms in C" hash
    a = 378551
    b = 63689
    hash = 0
    for i in range(len(key)):
        hash = hash * a + ord(key[i])
        a = a * b
    return (hash & 0x7FFFFFFF)


def JSHash(key):
    # Justin Sobel's bitwise hash
    hash = 1315423911
    for i in range(len(key)):
        hash ^= ((hash << 5) + ord(key[i]) + (hash >> 2))
    return (hash & 0x7FFFFFFF)


def PJWHash(key):
    # Peter J. Weinberger's hash (from the Dragon book)
    BitsInUnsignedInt = 4 * 8
    ThreeQuarters = long((BitsInUnsignedInt * 3) / 4)
    OneEighth = long(BitsInUnsignedInt / 8)
    HighBits = (0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth)
    hash = 0
    test = 0

    for i in range(len(key)):
        hash = (hash << OneEighth) + ord(key[i])
        test = hash & HighBits
        if test != 0:
            hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits))
    return (hash & 0x7FFFFFFF)


def ELFHash(key):
    # Unix ELF object-file hash (a 32-bit variant of PJW)
    hash = 0
    x = 0
    for i in range(len(key)):
        hash = (hash << 4) + ord(key[i])
        x = hash & 0xF0000000
        if x != 0:
            hash ^= (x >> 24)
            hash &= ~x
    return (hash & 0x7FFFFFFF)


def BKDRHash(key):
    # Kernighan & Ritchie-style multiplicative hash
    seed = 131 # 31 131 1313 13131 131313 etc..
    hash = 0
    for i in range(len(key)):
        hash = (hash * seed) + ord(key[i])
    return (hash & 0x7FFFFFFF)


def SDBMHash(key):
    # hash used in the sdbm database library
    hash = 0
    for i in range(len(key)):
        hash = ord(key[i]) + (hash << 6) + (hash << 16) - hash
    return (hash & 0x7FFFFFFF)


def DJBHash(key):
    # Daniel J. Bernstein's hash
    hash = 5381
    for i in range(len(key)):
        hash = ((hash << 5) + hash) + ord(key[i])
    return (hash & 0x7FFFFFFF)


def DEKHash(key):
    # Donald E. Knuth's hash (TAOCP vol. 3)
    hash = len(key)
    for i in range(len(key)):
        hash = ((hash << 5) ^ (hash >> 27)) ^ ord(key[i])
    return (hash & 0x7FFFFFFF)


def APHash(key):
    # Arash Partow's hash
    hash = 0
    for i in range(len(key)):
        if ((i & 1) == 0):
            hash ^= ((hash << 7) ^ ord(key[i]) ^ (hash >> 3))
        else:
            hash ^= (~((hash << 11) ^ ord(key[i]) ^ (hash >> 5)))
    return (hash & 0x7FFFFFFF)


print RSHash ('abcdefghijklmnopqrstuvwxyz1234567890')
print JSHash ('abcdefghijklmnopqrstuvwxyz1234567890')
print PJWHash ('abcdefghijklmnopqrstuvwxyz1234567890')
print ELFHash ('abcdefghijklmnopqrstuvwxyz1234567890')
print BKDRHash('abcdefghijklmnopqrstuvwxyz1234567890')
print SDBMHash('abcdefghijklmnopqrstuvwxyz1234567890')
print DJBHash ('abcdefghijklmnopqrstuvwxyz1234567890')
print DEKHash ('abcdefghijklmnopqrstuvwxyz1234567890')
print APHash ('abcdefghijklmnopqrstuvwxyz1234567890')





Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
 

Paul Rubin

Arash Partow said:
I've ported various hash functions to python if anyone is interested:

Are these useful for any particular interoperability purposes?
Otherwise I'd say just one such hash is plenty, or maybe don't even
bother, since Python's built-in dicts are sufficient for most
applications that call for hashing, and the cryptographic hashes are
available for when you really care about uniformity of the hash
output.
for i in range(len(key)):
    hash = hash * a + ord(key[i])
    a = a * b


As a stylistic issue, I'd write this:

for c in key:
    hash = hash * a + ord(c)
    a *= b

and similarly for the other similar loops.
 

Arash Partow

Hi Paul,

For different data types, different hash functions work
better or worse, i.e. produce fewer or more collisions.

I believe the more choice people have, and the more ways
people see how a particular thing can be done, the easier
it will be for them to come up with their own specific,
efficient solutions.

That said, I believe at least one (most likely more) of
the hash functions in the group above will almost always work
better (i.e. fewer collisions) than the standard default hash
available in the built-in dict, for any random set of strings.

Please feel free to prove me wrong :)
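
Something like the following is the kind of test I have in mind (a rough
sketch; it assumes the hash functions from my first post are in scope, and
that you supply your own word list):

def count_collisions(hash_func, keys):
    # Count keys whose hash value has already been produced
    # by an earlier key.
    seen = {}
    n = 0
    for k in keys:
        h = hash_func(k)
        if h in seen:
            n += 1
        seen[h] = 1
    return n

# keys = [line.strip() for line in open('words.txt')]   # your own list
# for f in (RSHash, JSHash, DJBHash, hash):             # hash = the built-in
#     print f.__name__, count_collisions(f, keys)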




Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net


 

Paul Rubin

Arash Partow said:
For different data types, different hash functions work
better or worse, i.e. produce fewer or more collisions.

But you give no indication of which of those hashes works best for
what kind of data. How is the user supposed to figure out which one
to choose?
 

Robert Kern

Arash said:
That said, I believe at least one (most likely more) of
the hash functions in the group above will almost always work
better (i.e. fewer collisions) than the standard default hash
available in the built-in dict, for any random set of strings.

Please feel free to prove me wrong :)

You have it backwards. You are the one making the positive assertion. You get to
prove yourself correct.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Arash Partow

Trial and error - how else? :)

I believe finding a perfect hash function for
every kind and combination of data is a very,
very time-consuming operation. Also, there is
the common case where the data is online
(i.e. stateless); that said, it doesn't mean you
can't make assumptions about the kind of data.



Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
 

Arash Partow

That is true, but I'm not about to do something
that might potentially prove my point wrong... :)



Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
 

John Machin

Hi Paul,

For different data types, different hash functions work
better or worse, i.e. produce fewer or more collisions.

I believe the more choice people have, and the more ways
people see how a particular thing can be done, the easier
it will be for them to come up with their own specific,
efficient solutions.

Who is likely to bother? In timbot we trust. Have you read the comments
at the start of Objects/dictobject.c?
That said, I believe at least one (most likely more) of
the hash functions in the group above will almost always work
better (i.e. fewer collisions) than the standard default hash
available in the built-in dict, for any random set of strings.

[Aside: it's not "in the built-in dict". Any type which wants its
instances to be hashable defines its own hash method, one that suits the
type.]

This belief would be based on:
(a) actual testing by you
or (b) a refereed academic paper which did such tests on hash functions
(including the Python "standard default hash")
or (c) ...???

What is the Python "standard default hash", anyway? It is written in C.
Wouldn't it have been a good idea to provide a Python function for it,
so that people who are going to "come up with their own specific
efficient solutions" had something to compare with? Or perhaps they are
intended to "port" your functions back into C?
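
For the record, that algorithm is short enough to transcribe. A pure-Python
sketch of string_hash() from CPython 2's Objects/stringobject.c, assuming a
32-bit C long (the real thing also maps a result of -1 to -2):

def python_string_hash(key):
    # Sketch of CPython 2's string hash, masked to 32 bits
    # per iteration so it can be compared with the functions above.
    if not key:
        return 0
    x = ord(key[0]) << 7
    for c in key:
        x = ((1000003 * x) ^ ord(c)) & 0xFFFFFFFF
    x ^= len(key)
    return x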

A few more questions:

Why have you truncated the answers to 31 bits?

Have you tested that these functions produce the same output (apart from
the 31-bit thing) as the originals that you worked from? The reason for
asking is that Python unlike C doesn't lose any bits in the <<
operation; if this is followed by a >> you may be shifting some unwanted
1-bits back in again.
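
For example:

h = 0x80000001
print hex((h << 4) >> 8)                 # Python longs: 0x8000000L
print hex(((h << 4) & 0xFFFFFFFF) >> 8)  # 32-bit unsigned C: 0x0

The high bit that C would have discarded comes straight back in.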

Talking about not losing bits: For your 36-byte example input, the
SDBMHash (with its << 16) is up to about 566 bits before you truncate it
to 31. A little over the top, perhaps. Maybe not the fastest way of
doing it.
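
You can watch it happen with an untruncated variant:

def SDBMHashFull(key):
    # Hypothetical variant: SDBMHash without the final
    # 31-bit truncation, just to show how large the long grows.
    hash = 0
    for c in key:
        hash = ord(c) + (hash << 6) + (hash << 16) - hash
    return hash

h = SDBMHashFull('abcdefghijklmnopqrstuvwxyz1234567890')
bits = 0
while h:
    bits += 1
    h >>= 1
print bits   # around 566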

What is the purpose of the calls to long() in PJWHash?

And the $64K question: What is the quintessential difference between
PJWHash and ELFHash?

Cheers,
John
 

Arash Partow

John said:
Who is likely to bother? In timbot we trust. Have you read the comments
at the start of Objects/dictobject.c?
No I haven't, and probably won't be anytime soon. As for time:
people who are interested (which is how my original port started)
would be more than willing to try/assess the routines for the sets
of strings that they wish to hash. This site may help explain, plus
it has some code snippets that may help you understand what I mean.

http://www.partow.net/programming/hashfunctions/index.html

[Aside: it's not "in the built-in dict". Any type which wants its
instances to be hashable defines its own hash method, one that suits the
type.]

This belief would be based on:
(a) actual testing by you
or (b) a refereed academic paper which did such tests on hash functions
(including the Python "standard default hash")
or (c) ...???

Experience has shown me that one "default" way of
hashing is generally not optimal for the overwhelming
majority of data, but then again....


What is the Python "standard default hash", anyway? It is written in C.
Wouldn't it have been a good idea to provide a Python function for it,
so that people who are going to "come up with their own specific
efficient solutions" had something to compare with? Or perhaps they are
intended to "port" your functions back into C?

The above link provides a very simple hash test framework; the
"standard default hash" can be placed in there and tested amongst
the other algorithms.

A few more questions:

Why have you truncated the answers to 31 bits?
The algorithms were adapted from C code using unsigned int; I should
have removed that final truncation in the Python versions.

Have you tested that these functions produce the same output (apart from
the 31-bit thing) as the originals that you worked from? The reason for
asking is that Python unlike C doesn't lose any bits in the <<
operation; if this is followed by a >> you may be shifting some unwanted
1-bits back in again.

Some of them do, others don't (not really important unless you are
trying to be compatible with other implementations, which this is
not). I am aware of Python not truncating/wrapping values under
various operations; I believe it's a nice little side effect of
Python which gives more bits to play with, as long as you don't
truncate them as I have.

Talking about not losing bits: For your 36-byte example input, the
SDBMHash (with its << 16) is up to about 566 bits before you truncate it
to 31. A little over the top, perhaps. Maybe not the fastest way of
doing it.
Possibly. Do you have a better solution? I'm very keen to learn...

What is the purpose of the calls to long() in PJWHash?
They were trying to cast to long; looking at it now, it's rather superfluous.

And the $64K question: What is the quintessential difference between
PJWHash and ELFHash?
Nothing. ELF is essentially PJW, just optimised for 32-bit systems
in that the calculations for the ThreeQuarters/OneEighth shift values
etc. are static, whereas PJW required sizeof to calculate them, which
I couldn't find a way of doing, so I fudged it in the hope that maybe
sometime in the future a workaround of sorts could be developed.

Again, this was just a simple posting. I hoped to maybe get some
comments and present some ideas to people. I don't expect anyone
to drop everything and start using these; I just thought it might be
something interesting for this NG. BTW, I'm not a very good Python
programmer ATM.




Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
 

Arash Partow

I would like to, but I think my lack of language skills and time will
be a barrier. Also, I don't believe there is enough material there
to warrant a technical write-up.



Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net


Stefan said:
Arash said:
I've ported various hash functions to python if anyone is interested:
[snip]

Ok, so if you think they are useful, what about writing up an article for the
Python Cookbook that describes their usage and specific advantages/disadvantages?

http://aspn.activestate.com/ASPN/Python/Cookbook/

Stefan
 

John Machin

No I haven't, and probably won't be anytime soon.

Perhaps you should, if you profess an interest in hashed lookup -- it
gives some interesting commentary on the second aspect: collision
handling. What matters is the *total* time to return an answer.
As for time: people who are interested (which is how my original
port started) would be more than willing to try/assess the routines
for the sets of strings that they wish to hash. This site may help
explain, plus it has some code snippets that may help you understand
what I mean.

http://www.partow.net/programming/hashfunctions/index.html

That JSHash allegedly written by "Justin Sobel": by coincidence, there's
a Melbourne academic named Justin Zobel who has done something amazingly
similar:

http://goanna.cs.rmit.edu.au/~hugh/zhw-ipl.html

Searching for "Justin Sobel" did lead me to a Russian website which
apart from repeating your typo/reado/whatevero did propose (and test)
some more hash functions.

http://vak.ru/doku.php/proj/hash/efficiency-en

In particular look at the "rot13" function which was right up near the
front as far as number of collisions goes, and which would appear (my
guess based on reading the source) to be very fast (with the right
compiler (e.g. gcc 4)).
Some of them do, others don't (not really important unless you are
trying to be compatible with other implementations, which this is
not).

I would have thought it important, especially in the case of well-known
functions whose properties have been discussed in the literature, that
you should not publish a version that gives a different answer without
noting that fact prominently.
I am aware of Python not truncating/wrapping values under
various operations; I believe it's a nice little side effect of
Python which gives more bits to play with, as long as you don't
truncate them as I have.


Possibly. Do you have a better solution? I'm very keen to learn...

You can't avoid using Python longs if you want to simulate unsigned
32-bit arithmetic. However judicious truncation can be used to stop the
longs becoming longer and slower. General rules:
1. Avoid exceeding 32 bits where possible. E.g. instead of
    hash <<= 8
do
    hash = (hash & 0xFFFFFF) << 8
2. Where unavoidable (e.g. hash *= constant), use hash &= 0xFFFFFFFF to
chop back to 32 bits, once per iteration.
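
E.g. the DJBHash above, constrained to 32 bits (a sketch):

def DJBHash32(key):
    # Mask once per iteration (rule 2) so the Python long never grows.
    hash = 5381
    for c in key:
        hash = (((hash << 5) + hash) + ord(c)) & 0xFFFFFFFF
    return hash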
They were trying to cast to long; looking at it now, it's rather superfluous.


Nothing. ELF is essentially PJW, just optimised for 32-bit systems
in that the calculations for the ThreeQuarters/OneEighth shift values
etc. are static, whereas PJW required sizeof to calculate them

You've found a C compiler where sizeof(unsigned int) is not static,
i.e. not calculated by the compiler at compile time???
which I couldn't find a way of doing, so I fudged it in the hope that
maybe sometime in the future a workaround of sorts could be developed.

Google is a wonderful thing:

http://users.physik.tu-muenchen.de/gammel/matpack/html/LibDoc/Strings/strings.html
"""
Thanks to Josh Bloch ([email protected]) who also informed me about
another fault that is found in Aho, Sethi and Ullman's book: The line
with h ^= (g >> 28) now replaces the original h ^= (g >> 24). According
to a correspondence of Josh Bloch with Ravi Sethi this correction will
be made in the next edition of the book. Including these two changes
this hash function is now comparable to Vo's, Torek's and WAIS's hash
functions.
"""

(1) Whoops! (2) Vo? Torek? WAIS? Could these be possible additions to
your website?

http://www.math.columbia.edu/~bayer/annote/root/root.html
"""
Peter J. Weinberger hash function; see e.g. 21st Century Compilers, by
Alfred V. Aho, Ravi Sethi, Monica Lam, Jeffrey D. Ullman, ISBN 0321131436.

Hash unsigned X into H, using the temporary variable G. G and H are
unsigned variables; X may be an expression. G is nonzero e.g. 92% of
time, so a conditional expression would be slower. As noted by Josh
Bloch, 28 is the correct replacement for the frequent misprint 24.

#define HASH(G,H,X) ( H <<= 4, H += (X), G = H & 0xf0000000, H ^= G >> 28, H ^= G )
"""

Cheers,
John
 

Arash Partow

John said:
Perhaps you should, if you profess an interest in hashed lookup -- it
gives some interesting commentary on the second aspect: collision
handling. What matters is the *total* time to return an answer.
Time, or algorithm complexity, is merely one aspect of hash function
design. There are many others.
That JSHash allegedly written by "Justin Sobel": by coincidence, there's
a Melbourne academic named Justin Zobel who has done something amazingly
similar:

http://goanna.cs.rmit.edu.au/~hugh/zhw-ipl.html
Same guy; he was a lecturer during my uni days. As for his surname,
that is another issue altogether.

Searching for "Justin Sobel" did lead me to a Russian website which
apart from repeating your typo/reado/whatevero did propose (and test)
some more hash functions.

http://vak.ru/doku.php/proj/hash/efficiency-en

In particular look at the "rot13" function which was right up near the
front as far as number of collisions goes, and which would appear (my
guess based on reading the source) to be very fast (with the right
compiler (e.g. gcc 4)).
I've already spoken to the guy that did those measurements; there
are some flaws in the way he represents the data which could lead one
to make inaccurate assumptions about the "quality" of the hash
functions. One of the many issues that I took up with him is that
he only used one set of data instead of having multiple sets and
aggregating the results. Whether or not he decides to re-do his
analysis is another issue altogether. TAOCP has a nice section
on how such analysis can be done.
I would have thought it important, especially in the case of well-known
functions whose properties have been discussed in the literature, that
you should not publish a version that gives a different answer without
noting that fact prominently.
True, the C versions work fine; I guess the Python versions require
a bit more work. Feel free to re-post modified versions :)

You can't avoid using Python longs if you want to simulate unsigned
32-bit arithmetic. However judicious truncation can be used to stop the
longs becoming longer and slower. General rules:
1. Avoid exceeding 32 bits where possible. E.g. instead of
    hash <<= 8
do
    hash = (hash & 0xFFFFFF) << 8
2. Where unavoidable (e.g. hash *= constant), use hash &= 0xFFFFFFFF to
chop back to 32 bits, once per iteration.

That is something I thought about, but I also considered, as
I mentioned before, the extra bits. The more bits you have to
avalanche with, the better (in a very general and oversimplified
way) your hash function "can" be.
Thanks to Josh Bloch ([email protected]) who also informed me about
another fault that is found in Aho, Sethi and Ullman's book: The line
with h ^= (g >> 28) now replaces the original h ^= (g >> 24). According
to a correspondence of Josh Bloch with Ravi Sethi this correction will
be made in the next edition of the book. Including these two changes
this hash function is now comparable to Vo's, Torek's and WAIS's hash
functions.
"""

(1) Whoops! (2) Vo? Torek? WAIS? Could these be possible additions to
your website?
Indeed! I had read about WAIS a long time ago. I'll be putting them up
very soon; thanks for the input.

http://www.math.columbia.edu/~bayer/annote/root/root.html
"""
Peter J. Weinberger hash function; see e.g. 21st Century Compilers, by
Alfred V. Aho, Ravi Sethi, Monica Lam, Jeffrey D. Ullman, ISBN 0321131436.

Hash unsigned X into H, using the temporary variable G. G and H are
unsigned variables; X may be an expression. G is nonzero e.g. 92% of
time, so a conditional expression would be slower. As noted by Josh
Bloch, 28 is the correct replacement for the frequent misprint 24.

#define HASH(G,H,X) ( H <<= 4, H += (X), G = H & 0xf0000000, H ^= G >> 28, H ^= G )
"""

I'm planning on adding various integer hash functions as well; I just
haven't had the time. The above seems like one, so I'll give it a go.





Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
 
