crc32 to be used as hash

W

Weiguang Shi

Hi there,

I'm thinking of using binascii.crc32 as a hash-function when I read in
the reference
http://www.python.org/doc/current/lib/module-binascii.html:

crc32( data[, crc])

Compute CRC-32, the 32-bit checksum of data, starting with an
initial crc. This is consistent with the ZIP file checksum.
Since the algorithm is designed for use as a checksum algorithm,
it is not suitable for use as a general hash algorithm.

CRC32 has been shown in the (Internetworking) literature that it can
be used as a good hash function. I'm wondering what's the concern
here.

Thanks
Weiguang
 
T

Tim Peters

[Weiguang Shi]
I'm thinking of using binascii.crc32 as a hash-function when I read in
the reference
http://www.python.org/doc/current/lib/module-binascii.html:

crc32( data[, crc])

Compute CRC-32, the 32-bit checksum of data, starting with an
initial crc. This is consistent with the ZIP file checksum.
Since the algorithm is designed for use as a checksum algorithm,
it is not suitable for use as a general hash algorithm.

CRC32 has been shown in the (Internetworking) literature that it can
be used as a good hash function. I'm wondering what's the concern
here.

Sorry, I've no idea what "(Internetworking) literature" means.

PythonLabs ran large-scale statistical tests of string hashing
algorithms in 2000, using millions of strings taken from a large email
corpus. binascii.crc32 delivered collision rates hundreds of standard
deviations worse than a "truly random" 32-bit hash would have
delivered. That's the concern.

In the same tests, Python's builtin hash() was statistically
indistinguishable from a random 32-bit hash function wrt # of
collisions.

BTW, it's possible to design CRC polynomials with better hash
statistics, but that's not what binascii.crc32 was designed for. Like
most CRC polynomials in wide use, it was designed to catch common
forms of data corruption (single-bit errors, bursts of 0 or 1 bits at
either end, stuff like that).
 
W

Weiguang Shi

Thanks for the explanation.

The literature I was refering to includes:

.. R. Jain, A Comparison of Hashing Schemes for Address Lookup in
Computer Networks, IEEE Trans. on Communications, 1992, v40, n3.

.. Z. Cao, Z. Wang, E. Zegura, Performance of Hashing-based Schemes for
Internet Load Balancing, IEEE INFOCOM 2000, p332-341.

Weiguang
 
T

Tim Peters

[Weiguang Shi]
The literature I was refering to includes:

. R. Jain, A Comparison of Hashing Schemes for Address Lookup in
Computer Networks, IEEE Trans. on Communications, 1992, v40, n3.

. Z. Cao, Z. Wang, E. Zegura, Performance of Hashing-based Schemes for
Internet Load Balancing, IEEE INFOCOM 2000, p332-341.

So they're doing hashing of "small" address-related inputs, squashing
them down a lot, and using a variety of different 32- and 16-bit CRC
algorithms. In one case, they found that simply xor'ing fields
together did very well -- which implies that they weren't facing a
very demanding problem.

There's no reason to suppose, a priori, that's got much to do with the
32-bit hash distribution of strings-in-email PythonLabs tested.

Since you haven't said what it is you're trying to hash, it's anyone's
guess how relevant what anyone has done to what you want to do. So
the thing for you to do is to write a little app and *try* different
ways. Details can matter, a lot.

For example, we didn't run any tests on subfields of 32-bit hashes,
and some bits may be "more random" than others. One of the papers you
cited found that it wasn't so in what they were testing. But here's
an example showing that's not a reliable conclusion in general:

"""
def drive():
from binascii import crc32
mask = 0x1fffff
d = {}
count = 0
bytes = map(chr, range(256))
for i in bytes[-32:]:
for j in bytes:
prefix = i+j
count += len(bytes)
for k in bytes:
d[crc32(prefix + k) & mask] = 1
print mask+1, "bins", count, "balls", len(d), "occupied"

drive()
"""

This is the output:

2097152 bins 2097152 balls 1048576 occupied

Amazingly enough, crc32 on this data, and taking the last 21 bits,
managed to miss half the bins entirely. For a truly random hash
yielding 21 bits, the expected # of occupied bins is about 1325653
when tossing 2**21 balls into 2**21 buckets, with a standard deviation
of about 452. So this is more than 600 sdevs "worse than random".

If we replace crc32 with hash, the output changes to

2097152 bins 2097152 balls 1490432 occupied

which is a few hundred sdevs "better than random". The reason
Python's hash can be "better than random" for bin problems is
explained in the comments in dictobject.c -- some of the internal hash
functions are specifically trying to give better-than-random collision
statistics when fed dense inputs, because hashing is used in Python
mostly to drive dict lookups.

This isn't 100% reliable either, though, and you can fiddle the above
to find subfields and inputs where crc32 does better than the builtin
hash. For example, in the above, if the hash is shifted right by 11
before taking the last 21 bits, the builtin hash does horribly,
filling only 8K bins, but crc32 doesn't do any worse then than it did
before.
 
N

Nick Craig-Wood

Tim Peters said:
BTW, it's possible to design CRC polynomials with better hash
statistics, but that's not what binascii.crc32 was designed for.
Like most CRC polynomials in wide use, it was designed to catch
common forms of data corruption (single-bit errors, bursts of 0 or
1 bits at either end, stuff like that).

[Digression] Amongst other properties, data with CRC32(data) appended
is guaranteed a Hamming distance of 3. You can actually use the CRC32
to correct single bit errors (at the cost of some of its error
detection properties). That isn't usually worth it though! (I spent a
long time (a long time ago now) analysing and coding CRC (and FEC)
polynomials ;-)

Anyway, another reason why you'd not want to use it as a hash is that
its linear and its trivial to find things which have the same crc32
value. Its really easy to run backwards and forwards. This may or may
not be important for the application depending on whether you are
expecting malicious attack on the hashes. Probably better safe than
sorry though!
 
W

Weiguang Shi

Tim said:
...
together did very well -- which implies that they weren't facing a
very demanding problem.
I don't know about that. But all the experiments were trace-driven and
showed CRC32 is an excellent hash function (for the problem). Since
there haven't been counter-results reported, we might be able to
reject the null hypothesis: CRC32 is _not_ good for hashing address
traces.
There's no reason to suppose, a priori, that's got much to do with the
32-bit hash distribution of strings-in-email PythonLabs tested.
...
And I didn't say it did. Different workloads.

It should be apparent by now I was trying to do the similar things the
cited works did. As I gladly found there is a CRC32 function, upon my
first program in Python, I was alerted by the reference page saying
it's not a good hash function and wondered for what particular
reasons.
...

This isn't 100% reliable either, though, and you can fiddle the above
to find subfields and inputs where crc32 does better than the builtin
hash. For example, in the above, if the hash is shifted right by 11
before taking the last 21 bits, the builtin hash does horribly,
filling only 8K bins, but crc32 doesn't do any worse then than it did
before.

It's good to know. According to Jain's paper, CRC32 _is_ relatively
bit-wise random (again, for the network workload). That's the property
I'm after.

Thanks for the discussion and the program.

Weiguang
 
M

Mike Meyer

It's good to know. According to Jain's paper, CRC32 _is_ relatively
bit-wise random (again, for the network workload). That's the property
I'm after.

Since no one else has mentioned it - have you considered using md5,
which is also built into Python?

<mike
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,767
Messages
2,569,570
Members
45,045
Latest member
DRCM

Latest Threads

Top