Fast 1 -> 1 non md5 hash algorithm

T

thecodemachine

Hi,

I'm looking for a fast and simple one to one hash function, suitable
for longer strings (up to 2048 in length). I'd like keys to be
relatively short, I doubt I'd be creating more than 256 keys..

I could use md5 but even that is more than I need, and I assume not the
fastest algorithm (?), so I'm reluctant to use it.

I've been looking around a lot, not finding much anything come to mind?

thanks!
Calin
 
E

Eric Sosman

Hi,

I'm looking for a fast and simple one to one hash function, suitable
for longer strings (up to 2048 in length). I'd like keys to be
relatively short, I doubt I'd be creating more than 256 keys..

I could use md5 but even that is more than I need, and I assume not the
fastest algorithm (?), so I'm reluctant to use it.

I've been looking around a lot, not finding much anything come to mind?

The first thing that comes to mind is that you've found
the wrong newsgroup. comp.lang.c is about the C programming
language, not about algorithms.

The second thing that comes to mind is that a one-to-one
hash function can be extremely simple and extremely fast:
just use the hashed object as its own hash code. (Note that
a code with either fewer or more bits than the original cannot
be one-to-one.)

The third thing that comes to mind is that since the
second thing didn't come to your mind, you may not have a
clear idea of what you're trying to do.
 
W

Walter Roberson

I'm looking for a fast and simple one to one hash function, suitable
for longer strings (up to 2048 in length). I'd like keys to be
relatively short, I doubt I'd be creating more than 256 keys..
I could use md5 but even that is more than I need, and I assume not the
fastest algorithm (?), so I'm reluctant to use it.

There are lot of different hash functions. In order to be able to
recommend one, we would have to know:

- do you need a hash function that is *certain* not to produce
the same hash for values you wish to distinguish; or

- are you using a "bucket" hashing system; or

- are you just going to fill the next available bucket when you have
a hash collision; or

- are you going to use a series of rehash functions to attempt to
resolve hash collisions;

- are you truly looking for a hash function, or are you looking
for a digital signature?

- is this a case in which you can take your time hashing during the
data analysis phase, but then expect to retrieve the data over and
over again, so theoretical read performance is much more important than
time to insert into hash structures? If so, then have you considered
"perfect hash" ?

- would it be acceptable to put an arbitrary limit on the number of
keys if by doing so you could create a noticably more efficient hash?
For example, would either of 251 or 257 be acceptable limits?

- if you find out later that you need more than 256 keys, must it be
easy to expand the maximum number of keys, or would it be acceptable for
noticable "work" (by the computer) to be involved? Though in this
particular case it sounds like you won't be needing more than 512 Kb for
your hash, in the general case someone might be hashing over a million
entries each of which might be a couple of hundred megabytes, and
so for them, it could be quite important that as the number of keys
expanded that the previously processed objects continue to hash into
exactly the same disk storage location that they did before.
 
T

thecodemachine

Thanks for your reply.
- do you need a hash function that is *certain* not to produce
the same hash for values you wish to distinguish; or
Yes

- are you using a "bucket" hashing system; or
No

- are you just going to fill the next available bucket when you have
a hash collision; or
Yes

- are you going to use a series of rehash functions to attempt to
resolve hash collisions;
No

- are you truly looking for a hash function, or are you looking
for a digital signature?

More a digital signature i suppose, not a hash function as is normally
used with chained hash table & buckets where some collisions are
acceptable etc.
- is this a case in which you can take your time hashing during the
data analysis phase, but then expect to retrieve the data over and
over again, so theoretical read performance is much more important than
time to insert into hash structures?

Yes...but time to insert can't be "slow" ...anything that might take
more than 1 sec would be too slow..though obviously that'd vary widely
on the machine

If so, then have you considered
"perfect hash" ?

I noticed that but wasn't sure about it, I'll look into it further
thank you.
- would it be acceptable to put an arbitrary limit on the number of
keys if by doing so you could create a noticably more efficient hash?
For example, would either of 251 or 257 be acceptable limits?

definately


- if you find out later that you need more than 256 keys, must it be
easy to expand the maximum number of keys, or would it be acceptable for
noticable "work" (by the computer) to be involved?

Noticable work would be fine, that's an unlikely case and speed is
paramount.


thanks very much!
Calin
 
M

Michael Wojcik

[Please preserve attributions for quoted material.]
- do you need a hash function that is *certain* not to produce
the same hash for values you wish to distinguish; or
Yes

- are you just going to fill the next available bucket when you have
a hash collision; or

Yes

This is inconsistent with your first answer. If the hash function is
guaranteed to avoid collisions, then the bucket will always be
available.
More a digital signature i suppose, not a hash function as is normally
used with chained hash table & buckets where some collisions are
acceptable etc.

A hash function is not a digital signature. They are completely
different operations. Many digital signature algorithms are built
on top of a cryptographic hash (aka "message digest"), but not all
are (eg encrypt-and-sign modes), and a digital signature built on
a cryptographic hash is much more than just a hash.

I would submit that if you think you may be looking for any sort
of cryptographic primitive, you should know *exactly* what you
need, and where to find it. Otherwise, either you don't need crypto,
or you're going to use it incorrectly. The history of correct
crypto implemented by non-experts is very thin.
I noticed that but wasn't sure about it, I'll look into it further
thank you.

Perfect hashing may be a solution for you.

In your first message you mentioned cryptographic hashes such as
MD5 and SHA-1. While they're very unlikely to accidentally produce
a collision, in practice it's now quite feasible to deliberately
produce MD5 collisions, and SHA-1 may not be far behind. And they
are not fast - in fact, they're deliberately slow, to make it more
difficult to mount certain kinds of attacks.
 
D

Dave Thompson

On 12 Jan 2006 18:13:31 GMT, (e-mail address removed) (Michael Wojcik)
wrote:
I would submit that if you think you may be looking for any sort
of cryptographic primitive, you should know *exactly* what you
need, and where to find it. Otherwise, either you don't need crypto,
or you're going to use it incorrectly. The history of correct
crypto implemented by non-experts is very thin.
Concur strongly.
In your first message you mentioned cryptographic hashes such as
MD5 and SHA-1. While they're very unlikely to accidentally produce
a collision, in practice it's now quite feasible to deliberately
produce MD5 collisions, and SHA-1 may not be far behind. And they
are not fast - in fact, they're deliberately slow, to make it more
difficult to mount certain kinds of attacks.

Crypto-hashes are not deliberately slow, and in fact modern
(implementations of) them are nearly as fast as CRCs even though they
(must!) "mix" the input data much "better". Where you want deliberate
slowness, like in PKCS5 (and 12) and IIRC in password "stretching"
protocols like SRP et al, you iterate the hash many times.

Some crypto operations like RSA (which is already pretty slow)
sometimes need to be implemented as constant-speed to prevent
"side-channel" (power and timing) attacks, but no hashes AFAIK.
Although, Bernstein has recently claimed a timing attack on a popular
speedup for the new US standard symmetric cipher AES or Rijndael,
although I am among those skeptical his prerequisites hold widely.

- David.Thompson1 at worldnet.att.net
 
M

Michael Wojcik

Crypto-hashes are not deliberately slow, and in fact modern
(implementations of) them are nearly as fast as CRCs even though they
(must!) "mix" the input data much "better". Where you want deliberate
slowness, like in PKCS5 (and 12) and IIRC in password "stretching"
protocols like SRP et al, you iterate the hash many times.

Right, right. I was actually thinking of older digest algorithms
like Snefru and MD2 that used iteration as a strengthening factor,
and key expansion, and somehow generalized that into "digests are
deliberately slow to increase the work factor of brute-force image
searches", which clearly doesn't make much sense.
Some crypto operations like RSA (which is already pretty slow)
sometimes need to be implemented as constant-speed to prevent
"side-channel" (power and timing) attacks, but no hashes AFAIK.

I imagine that if a digest used operations with variable power or
time costs, side-channel attacks would be an issue; but since the
data-dependent operations of MD5, SHA-*, and RIPE-MD (which represent
most real-world commercial use) are bitwise operations or integer
addition on fixed-sized quantitites, all general-purpose CPUs will
perform them at constant time and dissipation.

Digest functions built from block ciphers or asymmetric ciphers could
be vulnerable to side-channel attacks if the underlying ciphers are,
assuming the channel was modulated by the data being digested.

But a side-channel attack against a hash probably has a smaller
attack surface than one for encryption does; leaking some information
about the preimage seems to be of limited use for most hashing
applications, for example.

Anyway, this has drifted off-topic for the group, though if the OP
is still reading he may by now have gotten the impression that this
is not a trivial do-it-yourself task.

--
Michael Wojcik (e-mail address removed)

There are some, who do not understand true enjoyment, will tell you that
rules spoil convivial meetings, and that a merry company becomes a dull
committee as soon as it is called a club. Do not believe them: the
precedents are all against them. -- Arthur Ransome
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,764
Messages
2,569,565
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top