Generate unique ID for URL

J

Johannes Bauer

The next step is to reduce the number of bits you are encoding. You
said in another post that "1 collision in 10 million hashes would be
tolerable". So you need:

23.25349666421154

24 bits worth of key.

Nope :)
Base64 encoded, that's only 4 characters.
Actually, I probably just proved that I don't really understand how
probabilities work, so maybe what you really need is 32 or 48 or 64
bits.

:))

When doing these calculations, it's important to keep the birthday
paradox in mind (this is kind of counter-intuitive): The chance of a
collission raises tremendously when we're looking for *any* arbitrary
two hashes colliding within a certain namespace. The probability you've
calculated is the pre-image probability (which you also again need to
multiply with a factor of two, because when trying to collide one given
hash, in the mean case you'll only have to search *half* the namespace
before finding a collision).

There are three things you need to know before you can give an estimate:
1. The permissible probability of a collision (1e-7 in this case)
2. The hash size
3. The worst-case number of elements in the namespace

You neglected 3 completely -- but knowing this is really important. This
becomes obvious when looking at the extreme cases: Let's say you have a
hash of arbitrary size, but only hash one element. The chance of a
collision is *always* zero. Or look at a hash of size 2^n. Then put 2^n
+ 1 elements in the namespace. The chance of a collision is *always* one.

Doing the calculations (formulas can be found on wikipedia on the site
of the birthday phaenomenon), you can come up with these following
bitlenghts of the hash with a 1e-7 probability of collision in respect
to the worst-case number of elements

10k elements: 49 bit
100k elements: 56 bit
1e6 elements: 63 bit
100e6 elements: 76 bit
1e9 elements: 83 bit
1e12 elements: 102 bit

Best regards,
Johannes

--
Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <[email protected]>
 
D

Dave Angel

<snip>

When doing these calculations, it's important to keep the birthday
paradox in mind (this is kind of counter-intuitive): The chance of a
collission raises tremendously when we're looking for *any* arbitrary
two hashes colliding within a certain namespace. The probability you've
calculated is the pre-image probability (which you also again need to
multiply with a factor of two, because when trying to collide one given
hash, in the mean case you'll only have to search *half* the namespace
before finding a collision).
<snip>

Te birthday paradox could have been important had the OP stated his goal
differently. What he said was:

"""Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable."""

That means that he's willing to do the necessary overhead of collision
resolution, once in every 10 million lookups. That's not the same as
saying that he wants only one chance in 10 million of having ANY
collisions among his data items.
 
J

Johannes Bauer

Te birthday paradox could have been important had the OP stated his goal
differently. What he said was:

"""Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable."""

That means that he's willing to do the necessary overhead of collision
resolution, once in every 10 million lookups. That's not the same as
saying that he wants only one chance in 10 million of having ANY
collisions among his data items.

Since he stated in a later post that he actually went with MD5, the
calculations are indeed relevant. They give the number of bits a perfect
hash needs to have in order to get the desired low probablility of
collision resolutions. And for that the birthday paradox probability
must be considered instead of the (much lower) pre-image probability.

In any case, it appeared to me as if the OP was rather looking for ideas
and wasn't sure himself what approach to take -- so I find it quite
appropriate to give suggestions one way or another (even if they might
not fit the exact phrasing of one of his postings).

Best regards,
Johannes

--
Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <[email protected]>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,778
Messages
2,569,605
Members
45,238
Latest member
Top CryptoPodcasts

Latest Threads

Top