Shorter checksum than MD5

Mercuro

Hello

I'm looking for a simple way to checksum my data.
The data is 70 bytes long per record, so a 32-byte
hex md5sum would increase the size of my
MySQL db a lot.

I'm looking for something that is 5 bytes long;
for the moment I'm just taking a part of the hex
md5 sum (like this: checksum = md5sum[3:8]). I
don't have any duplicates, and I have over 100000
records, but I'm not sure about the future...


Can anybody give me something better? Or point me
to some website?


thx!!


PS: I use this checksum to periodically compare 2
versions of this DB, which are on 2 sides of a
slow internet connection. My hope is to keep down
unneeded traffic between the 2 servers.
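For reference, the truncated-digest idea in today's Python (hashlib replaces the old md5 module; the record bytes below are made-up sample data):

```python
import hashlib

record = b"x" * 70                           # made-up stand-in for one 70-byte record
checksum = hashlib.md5(record).digest()[:5]  # 5 raw bytes = 40 bits

print(len(checksum))                         # 5 -- fits a BINARY(5) column
```

Note that 5 raw digest bytes carry 40 bits, while 5 hex characters (md5sum[3:8]) carry only 20 bits in the same storage.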
 
Joachim Bauch


Hi,

Mercuro wrote:
| i'm looking for a simple way to checksum my data. The data is 70 bytes
| long per record, so a 32 byte hex md5sum would increase the size of my
| mysql db a lot.
|
| I'm looking for something that is 5 bytes long, for the moment i'm just
| taking a part of the hex md5 sum (like this: checksum = md5sum[3:8]). I
| don't have any duplicates, and I have over 100000 records, but i'm not
| sure for the future...
|
|
| Can anybody give me something better? Or point me to some website?

You could use binascii.crc32() which generates a 4 byte checksum.

From the Python docs:
crc32( data[, crc])

Compute CRC-32, the 32-bit checksum of data, starting with an initial crc.
This is consistent with the ZIP file checksum. Since the algorithm is
designed for use as a checksum algorithm, it is not suitable for use as a
general hash algorithm. Use as follows:

    print binascii.crc32("hello world")
    # Or, in two pieces:
    crc = binascii.crc32("hello")
    crc = binascii.crc32(" world", crc)
    print crc
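(For later readers: in today's Python the same function lives in zlib as well as binascii, returns an unsigned int, and packs into 4 bytes with struct; the incremental form works exactly as the docs above describe:)

```python
import struct
import zlib

data = b"hello world"
crc = zlib.crc32(data)             # unsigned 32-bit int in Python 3
packed = struct.pack(">I", crc)    # 4 bytes, ready to store in the db

# incremental use, as in the docs above
crc2 = zlib.crc32(b" world", zlib.crc32(b"hello"))
print(crc == crc2)                 # True
```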

Cheers,
Joachim

--
Joachim Bauch
struktur AG, (e-mail address removed)
 
Brian Gough

Mercuro said:
> I'm looking for something that is 5 bytes long, for the moment i'm
> just taking a part of the hex md5 sum (like this: checksum =
> md5sum[3:8]). I don't have any duplicates, and I have over 100000
> records, but i'm not sure for the future... Can anybody give me
> something better? Or point me to some website?

For making a smaller hash, I think your approach is a good one since
you can easily increase the length if you need to.

For comparing two databases, maybe there are other options not using a
hash though (e.g. keeping a log of which records have changed since
the last comparison).
 
Paul Rubin

Mercuro said:
> i'm looking for a simple way to checksum my data. The data is 70 bytes
> long per record, so a 32 byte hex md5sum would increase the size of my
> mysql db a lot.

If the data is binary, the md5 checksum is 16 bytes, not 32.

> I'm looking for something that is 5 bytes long, for the moment i'm
> just taking a part of the hex md5 sum (like this: checksum =
> md5sum[3:8]). I don't have any duplicates, and I have over 100000
> records, but i'm not sure for the future...

Using 5 hex digits would give you just 20 bits of hash, so you would
almost certainly get collisions with that many records.

> PS: I use this checksum to periodically compare 2 versions of this DB,
> which are on 2 sides of a slow internet connection. My hope is to
> keep down unneeded traffic between the 2 servers.

How about putting a timestamp in each record, so you only have to
compare the records that have been updated since the last periodic
comparison?

Or, if you expect only occasional changes, you could compare hashes of
long runs of records, then narrow down the comparisons to locate the
records that actually differ. You could straightforwardly put a tree
structure over the hashes, but maybe there's some even better way.
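The "20 bits" warning above is easy to quantify with the usual birthday estimate (n(n-1)/2 pairs, each colliding with probability 2**-bits); a quick sketch of the arithmetic:

```python
import math

n = 100000     # records
bits = 20      # 5 hex digits of md5

# expected number of colliding pairs: n*(n-1)/2 pairs, each with chance 2**-bits
expected = n * (n - 1) / 2 / 2**bits
p_any = 1 - math.exp(-expected)

print(round(expected))   # ~4768 colliding pairs expected
print(p_any)             # effectively 1.0: collisions are guaranteed
```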
 
Mercuro

Joachim said:
> You could use binascii.crc32() which generates a 4 byte checksum.
> [...]

thx
 
Mercuro

Paul Rubin wrote:

How about putting a timestamp in each record, so you only have to
compare the records that have been updated since the last period
comparison.

OK, I will give some more information:

I have a proprietary system, which I can't modify.
But it uses FoxPro DBF files, which I can read.
I have put all the data I want to have into a
MySQL table. (This table will be used to look up
prices and to find other information about articles.)

Since I'm not able to put timestamps on changed
records, I got the idea to put a checksum on each
record and save it in the MySQL table. Every night
I would 'SELECT' all checksums together with the
article numbers and then compare them one by one
with newly calculated checksums from the DBF file.
Only the changed records would be 'UPDATED' and
missing numbers would be 'INSERTED'.

This is the code I have for now:
(I will probably replace md5 with crc32)

import sys, os, string, dbfreader, md5
from string import strip

# import MySQL module
import MySQLdb

# connect
db = MySQLdb.connect( .... )

# create a cursor
cursor = db.cursor()

cursor.execute("SELECT ID, md5sum, 0 FROM ARTIKEL;")
resultaat = list(cursor.fetchall())

f = dbfreader.DBFFile("ARTIKEL.DBF")
f.open()
i = 0
while 1:
    i += 1
    updated = 0
    rec = f.get_next_record()
    if rec is None:
        break
    pr_kassa = str(rec["PR_KASSA"])
    ID = rec["ID"]
    IDs = str(ID)
    assortiment = strip(str(rec["ASSORTIMENT"]))[0:1]
    pr_tarief = str(rec["PR_TARIEF"])
    status = strip(str(rec["STATUS"]))[0:1]
    pr_aank = str(rec["PR_AANK"])
    # escape single quotes for SQL
    benaming = string.join(string.split(str(rec["BENAMING"]), "'"), "\\'")

    md5sum = md5.new(pr_kassa + IDs + assortiment + pr_tarief +
                     status + pr_aank + benaming).hexdigest()[3:8]

    if (i % 100) == 0:
        print "record %i: ID %s" % (i, IDs)
        # rotate the list so recently matched records
        # come first (make the list more optimal to search through)
        tmp = resultaat[:90]
        resultaat = resultaat[90:]
        resultaat.extend(tmp)

    for record in resultaat:
        if record[0] == ID:
            if record[1] != md5sum:
                print "update record (ID: %s)" % IDs
                # update existing record, md5 sum does not match
                cursor.execute(
                    "UPDATE ARTIKEL SET benaming='%s', status=%s, "
                    "assortiment='%s', pr_aank=%s, pr_tarief=%s, "
                    "pr_kassa=%s, md5sum='%s' WHERE ID=%s;"
                    % (benaming, status, assortiment, pr_aank,
                       pr_tarief, pr_kassa, md5sum, IDs))
            updated = 1
            break

    if updated == 0 and ID < 8000000:
        # new record
        print "nieuw record (ID: %s)" % IDs
        cursor.execute(
            "INSERT INTO ARTIKEL (ID, benaming, status, assortiment, "
            "pr_aank, pr_tarief, pr_kassa, md5sum) VALUES "
            "(%s, '%s', %s, '%s', %s, %s, %s, '%s');"
            % (IDs, benaming, status, assortiment, pr_aank,
               pr_tarief, pr_kassa, md5sum))

f.close()

#############################################



If anybody has any better ideas, I'm happy to hear
them!
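One small idea: keying the stored checksums by ID in a dict makes each lookup O(1) instead of scanning and rotating a list. A minimal sketch in today's Python (hashlib stands in for the old md5 module; the table rows and DBF records are made-up sample data, and the 5-hex-char sum can collide in rare cases):

```python
import hashlib

def record_sum(fields):
    """5 hex chars of the md5 of the concatenated fields, as in the script above."""
    joined = "".join(str(f) for f in fields)
    return hashlib.md5(joined.encode()).hexdigest()[3:8]

# what "SELECT ID, md5sum FROM ARTIKEL" might return (made-up values)
stored = {1: record_sum(("10.50", "A")), 2: record_sum(("7.25", "B"))}

# records as read from the DBF file; record 2 changed, record 3 is new
dbf = [(1, ("10.50", "A")), (2, ("7.99", "B")), (3, ("1.99", "C"))]

to_update, to_insert = [], []
for rec_id, fields in dbf:
    new = record_sum(fields)
    old = stored.get(rec_id)        # O(1) dict lookup instead of a list scan
    if old is None:
        to_insert.append(rec_id)
    elif old != new:
        to_update.append(rec_id)

print(to_update, to_insert)         # lists of IDs to UPDATE / INSERT
```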
 
Mercuro

Brian said:
For making a smaller hash, I think your approach is a good one since
you can easily increase the length if you need to.

For comparing two databases, maybe there are other options not using a
hash though (e.g. keeping a log of which records have changed since
the last comparison).

thx

please read the reply under Paul Rubin for more
information about the compare thingy. It's a bit
too long to post twice.
 
Jeff Epler

If you have a good cryptographic hash function then picking any N bits
from it should be just as good as any other N bits.

sha and md5 were designed with cryptographic needs in mind, while CRC
wasn't (it's designed to detect spurious bit errors in noisy transmission
media, which tend to have particular properties), but collisions have
now been found in md5 which means most people would not recommend its
use in new software. I would choose SHA at this point, even though new
research implies it may be weak too. (an attack on a simplified version
of SHA can produce collisions) On the other hand, if your database
replication app doesn't have a threat model where an attacker would want
to deliberately make the two sites out of sync, a cryptographically weak
hash might still be acceptable.

For your application, you should consider the total number of records
you ever expect to have, and use more than 2 * lg(records) bits of hash.
Due to the so-called "birthday paradox", when you have N possible hash
values, two will be identical with 50% probability with around sqrt(N)
items. You'd probably prefer that the probability be much lower in your
application, since a collision will result in incorrect results.
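For the numbers in this thread, that rule of thumb works out as:

```python
import math

records = 100000
bits_needed = 2 * math.log2(records)    # the 2 * lg(records) rule of thumb

print(bits_needed)   # ~33.2 bits
# A 4-byte (32-bit) hash falls just short; a 5-byte (40-bit) one clears it.
```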

You might consider some way to group records together, so that it's not
a hash per 70-byte record, but a hash per N 70-byte records. Or you
might skip this approach and implement transactions which can be stored
and played to the second server when they sync up.
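A sketch of the grouping idea (toy data and a hypothetical group size; sha1 via hashlib stands in for whichever hash is chosen):

```python
import hashlib

def h(data):
    return hashlib.sha1(data).digest()[:5]

def group_hashes(records, group_size=100):
    """One hash per run of group_size records."""
    return [h(b"".join(records[i:i + group_size]))
            for i in range(0, len(records), group_size)]

# 1000 toy 70-byte records on two "servers"; one record differs
a = [bytes([i % 256]) * 70 for i in range(1000)]
b = list(a)
b[437] = b"X" * 70

# first pass compares 10 group hashes instead of 1000 record hashes
differing = [i for i, (x, y) in enumerate(zip(group_hashes(a), group_hashes(b)))
             if x != y]
print(differing)   # [4] -- only group 4 needs record-by-record comparison
```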

Jeff

 
Paul Rubin

Mercuro said:
> I have a proprietary system, which I can't modify. But, it uses Foxpro
> DBF files which I can read. I have found all the data I want to have
> in a MySQL table. (this table will be used to look up prices and to
> find other information about articles)
>
> Since I'm not able to put some timestamps on changed records, I got
> the idea to put a checksum on each record and save it in the MySQL
> table. Every night I would 'SELECT' all checksums together with the
> article numbers and then compare them one by one with newly calculated
> checksums from the DBF file. Only the changed checksums shall be
> 'UPDATED' and missing numbers would be 'INSERTED'.

I'm a little confused. Is only the DBF file getting updated? If you
can put a checksum on each record, why can't you put a timestamp on
each record? Or why can't you just migrate all the data from the DBF
into another file every night, and then just scan the file to find the
changes from the previous night's version?

> This is the code I have for now:
> (I will probably replace md5 with crc32)

Where are the updates coming from? Note that if you use a 32-bit
checksum, with 100000 records you will probably have some records with
the same checksum by accident. Is that a problem? Also, with CRC32,
it's very easy to create a record on purpose that has any given
checksum. Is THAT a problem? For example, it means that if someone
can change the price of an article, he can choose a new price so that
the record will have the same checksum as the old price and the change
won't get noticed. Could he buy something for $1.00, change the price
to $11.73 or something, then return the item and get an $11.73 refund
because you didn't notice the update?
 
Dan Bishop

Paul Rubin said:
> Where are the updates coming from? Note that if you use a 32-bit
> checksum, with 100000 records you will probably have some records with
> the same checksum by accident.

Only if you use a checksum algorithm with really bad clustering
problems.

If all 2**32 checksums are equally likely, the probability of a
collision is only about 0.0000232828.
 
Paul Rubin

Dan Bishop said:
> Only if you use a checksum algorithm with really bad clustering problems.
>
> If all 2**32 checksums are equally likely, the probability of a
> collision is only about 0.0000232828.

That's incorrect, the probability is much higher. It's more like 0.7.

If you have 30 people in a room, do you know how to find the
probability that some two have the same birthday?
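Both figures are quick to check with the standard birthday approximation 1 - exp(-k(k-1)/2N). Dan's 0.0000232828 matches 100000/2**32 instead, which is not the pairwise collision probability:

```python
import math

def p_collision(k, n):
    """Birthday approximation: chance that any two of k items share one of n values."""
    return 1 - math.exp(-k * (k - 1) / (2.0 * n))

print(p_collision(30, 365))         # ~0.70: two people in a room of 30
print(p_collision(100000, 2**32))   # ~0.69: 100000 records, 32-bit checksum
print(100000 / 2**32)               # ~0.0000233: Dan's figure
```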
 
Paul Rubin

Paul Rubin said:
> That's incorrect, the probability is much higher. It's more like 0.7.

I should be clear: the claim was, if you have 100000 records and
32-bit hash, with probability around 0.7, some two records will have
the same hash. That makes some uses of the hashes inappropriate.

However, if you only care about record #n on system A accidentally
colliding with record #n on system B, rather than about it colliding
with an arbitrary other record, then you're in much better shape.

If you're worried about intentional collisions that an attacker might
construct, rather than only about accidental collisions, you need a
longer hash regardless of the number of records.
 
Mercuro

Paul said:
> I'm a little confused. Is only the DBF file getting updated?

The MySQL table should be updated from the DBF file.

> If you can put a checksum on each record, why can't you put a
> timestamp on each record?

Not in the DBF file, which is read-only for me.

> Or why can't you just migrate all the data from the DBF into another
> file every night, and then just scan the file to find the changes
> from the previous night's version?

That I can give a try.

> Where are the updates coming from?

From a proprietary system, which uses DBF files.
It is updated once a day, over a modem connection.
 
Elbert Lev

Jeff Epler wrote:
> For your application, you should consider the total number of records
> you ever expect to have, and use more than 2 * lg(records) bits of hash.
> Due to the so-called "birthday paradox", when you have N possible hash
> values, two will be identical with 50% probability with around sqrt(N)
> items. You'd probably prefer that the probability be much lower in your
> application, since a collision will result in incorrect results.

Wrong! "birthday paradox" is not applicable here.

If you want an analogy with this combinatorial problem, imagine 2 rows
with N objects in each, and a "measure" (checksum) of each object.
When an object is modified, with probability 1/2**32 the measure does
not change. Objects in the SAME POSITION in each row are compared by
comparing their measures.

After M objects are modified, what is the probability that at least
one modification will be "missed" by the comparison process? I don't
think that in the foreseeable future (if M and N are not too high)
such a collision will occur.
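In this same-position model the miss probability is easy to put a number on (M modified records, each keeping its old 32-bit checksum with probability 2**-32):

```python
def p_missed(m, bits=32):
    """Chance that at least one of m modified records keeps its old checksum."""
    return 1 - (1 - 2.0**-bits)**m

print(p_missed(1000))     # ~2.3e-07 even with 1000 changed records
print(p_missed(100000))   # ~2.3e-05 if every record changed
```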
 
Elbert Lev

Paul Rubin said:
> Also, with CRC32, it's very easy to create a record on purpose that
> has any given checksum. Is THAT a problem? For example, it means that
> if someone can change the price of an article, he can choose a new
> price so that the record will have the same checksum as the old price
> and the change won't get noticed. Could he buy something for $1.00,
> change the price to $11.73 or something, then return the item and get
> an $11.73 refund because you didn't notice the update?

Really? If this is "very easy" please modify the string to have the same CRC32:

"The probability of the changed dada having the same CRC32 is 1/2**32"

crc32 = de6acdf9

To make the task even easier, I will not give you the "salt" value :)
 
Tim Peters

[Paul Rubin claimed it is very easy to create a record with any given CRC32.]

[Elbert Lev]
> Really?

Yup.

> If this is "very easy" please modify the string to have the same CRC32:

I will, but I'm going to lecture first <wink>. CRC algorithms were
designed to catch common (especially in older days) kinds of
transmission corruption, not to resist attacks. CRC wants to guard
against things like single-bit inversions, double-bit inversions, and
bursts of zero or one bits. There are physical causes for those kinds
of corruption.

Now suppose your messages were big integers instead of strings. A
decent checksum algorithm would be to send, along with your big
integer N, its remainder modulo a fixed prime P. For concreteness,
say P is 1003. Then

def pseudocrc(N):
    return N % 1003

That will catch many kinds of corruption. But it's dead easy to fool
if you want to! For any given "message" N, the set of messages with
the same pseudocrc is exactly the set of all integers of the form
N+P*i for some integer i. Given N, it's not only easy to find *a*
distinct message with the same pseudocrc, it's easy to find *all*
messages with the same pseudocrc.

Now "real" CRCs have essentially the same story! They use a different
form of arithmetic ("binary polynomials", informally speaking), but
it's the same thing: within that arithmetic, the CRC of a string is
just the remainder of the string (viewed as a polynomial) modulo a
fixed polynomial P.

Real CRC algorithms are more complicated than just that, playing goofy
but shallow tricks with bit inversions and byte reversals, and playing
seriously cool tricks with optimization. That can make it hard to
read a chunk of CRC code and realize what it's doing. Still, at
heart, it's just computing a remainder.
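That remainder structure has a directly checkable consequence in today's Python: over equal-length strings, CRC-32 is affine, so crc(a xor b) == crc(a) ^ crc(b) ^ crc(zeros). This is exactly the linearity that makes engineered collisions easy:

```python
import zlib

a = b"The probability of the changed dada having the same CRC32 is 1/2**32"
b = a[::-1]                                   # any second string of the same length
xored = bytes(x ^ y for x, y in zip(a, b))    # bytewise xor of the two strings
zeros = bytes(len(a))                         # all-zero string of the same length

lhs = zlib.crc32(xored)
rhs = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(zeros)
print(lhs == rhs)   # True
```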

To show a string with the same CRC as yours above, we need some
utility functions first:

import binascii

# View a string as a giant base-256 integer, and return the latter.
def str2long(s):
    h = binascii.hexlify(s)
    return long(h, 16)

# View a long as a base-256 integer, and return the string.
# This is str2long's inverse.
def long2str(n):
    h = hex(n)
    assert h.startswith('0x')
    h = h[2:]
    if h.endswith('L'):
        h = h[:-1]
    if len(h) & 1:
        h = '0' + h
    return binascii.unhexlify(h)

# Compute binascii.crc32 on string s with given salt,
# and return as a hex string.
def crc(s, salt):
    return hex(binascii.crc32(s, salt))

If you don't like binascii.crc32 (there are many CRC polynomials in
use), you'll have to figure out a different P. For binascii.crc32,
the magic value is

P = 0x196300777

Now a driver function:

# Compute the crc (with given salt) on s, and on s again
# after xor'ing toxor into it.
def checkit(s, toxor, salt=0):
    t = long2str(str2long(s) ^ toxor)
    print repr(s)
    print repr(t)
    print crc(s, salt), crc(t, salt)

and your string:

s = "The probability of the changed dada having the same CRC32 is 1/2**32"

Let's try it, xor'ing in 0 first:

'The probability of the changed dada having the same CRC32 is 1/2**32'
'The probability of the changed dada having the same CRC32 is 1/2**32'
0x5a91f29 0x5a91f29

That's not amazing, because xor'ing 0 into your string didn't change
anything. Xor'ing in P itself does change the string, but not the crc:

'The probability of the changed dada having the same CRC32 is 1/2**32'
'The probability of the changed dada having the same CRC32 is 1/3\xbc\x1a4E'
0x5a91f29 0x5a91f29

More, xor'ing in "multiples" of P also doesn't change the crc. The
underlying arithmetic uses xor instead of addition, and for obscure
technical reasons (having to do with details of the binascii.crc32
implementation) we have to restrict shift counts to multiples of 8.
So I'll pick one out of my hat:

'The probability of the changed dada having the same CRC32 is 1/2**32'
"The probability of the chao\xf0\xc3SP\x13ada having the s`\xfbU'4RC33\xb6\x10n\x04 1/2**32"
0x5a91f29 0x5a91f29

And so on -- we can systematically generate an unbounded number of
distinct strings with the same crc as your string.
...
To make the task even easier, I will not give you the "salt" value :)

Surprise again: that doesn't make any difference. The pairs of
strings above have the same crc for all salt values -- adding a salt
doesn't do any more good for CRCs than adding this to our pseudocrc
function:

def pseudocrc(N, salt):
return (N + salt) % 1003

That is, it just shifts the result by a constant, and so has no effect
on whether two CRCs collide. The same pair as before, computed with a
nonzero salt, still matches:

'The probability of the changed dada having the same CRC32 is 1/2**32'
'The probability of the changed dada having the same CRC32 is 1/3\xbc\x1a4E'
0x75a90579 0x75a90579

The real purpose of a CRC salt (it doesn't, and can't, improve
"security") is to let you compute a CRC of a very large blob of data
in chunks, feeding the result of one chunk to the next crc(chunk) call
as "the salt".
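Both of these closing points check out in today's Python, where zlib.crc32 takes the same optional second argument:

```python
import zlib

# 1. The second argument exists for computing a CRC in chunks:
whole = zlib.crc32(b"hello world")
chunked = zlib.crc32(b" world", zlib.crc32(b"hello"))
print(whole == chunked)   # True

# 2. A "salt" shifts every CRC the same way, so it hides nothing:
a, b = b"spam", b"eggs"
plain = zlib.crc32(a) ^ zlib.crc32(b)
salted = zlib.crc32(a, 12345) ^ zlib.crc32(b, 12345)
print(plain == salted)    # True: the xor-difference is salt-independent
```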
 
Tim Peters

[Elbert Lev]
> Very good. But I believe he [Paul Rubin] did not mention a "hostile
> environment".

You quoted this qualification from Paul, which I quoted too:

I suppose that's not a hostile environment if you believe that when
someone deliberately tries to fool your checksum, they're trying to
express love. I don't personally believe that. Well, not in general.
Some people express their emotions in peculiar ways. But I assume
too that Paul wasn't really talking about abnormal psychology, in
which case his point here was quite clear.

then-again-perhaps-love-&-hostility-are-the-same-thing<wink>-ly y'rs - tim
 
