Generating valid identifiers


Laszlo Nagy

I have a program that creates various database objects in PostgreSQL.
There is a DOM, and for each element in the DOM, a database object is
created (schema, table, field, index and tablespace).

I do not want this program to generate very long identifiers. It would
increase SQL parsing time, and long names don't look good. Let's just
say that the limit should be 32 characters. But I also want to be able
to recognize the identifiers when I look at their modified/truncated
names.

So I have come up with this solution:

- I have restricted original identifiers not to contain the dollar sign.
They can only contain [A-Z], [a-z], [0-9] and the underscore (a quick
check for this restriction is sketched below, after the example output).
Here is a valid example:

"group1_group2_group3_some_field_name"

- I'm trying to use a hash function to reduce the length of the
identifier when it is too long:

import base64
import hashlib

class Connection(object):
    # ... more code here

    @classmethod
    def makename(cls, basename):
        if len(basename) > 32:
            # Replace the tail of a long name with part of the
            # base64-encoded SHA-256 digest of the full name.
            h = hashlib.sha256()
            h.update(basename)
            # "_" and "$" are used as the base64 altchars so the tail
            # only contains characters allowed in the identifiers.
            tail = base64.b64encode(h.digest(), "_$")[:10]
            return basename[:30] + "$" + tail
        else:
            return basename

Here is the result:

print repr(Connection.makename("some_field_name"))
'some_field_name'
print repr(Connection.makename("group1_group2_group3_some_field_name"))
'group1_group2_group3_some_fiel$AyQVQUXoyf'

So, if the identifier is too long, then I use a modified version that
should be unique and similar to the original name. Let's suppose that
nobody wants to crack this modified hash on purpose.

And now, the questions:

* Would it be a problem to use CRC32 instead of SHA? (Since security is
not a problem, and CRC32 is faster.)
* I'm truncating the digest value to 10 characters. Is it safe enough?
I don't want to use more than 10 characters, because then it wouldn't be
possible to recognize the original name.
* Can somebody think of a better algorithm that would give a bigger
chance of recognizing the original identifier from the modified one?

Thanks,

Laszlo
 

Steven D'Aprano

I do not want this program to generate very long identifiers. It would
increase SQL parsing time,

Will that increase in SQL parsing time be more, or less, than the time it
takes to generate CRC32 or SHA hashsums and append them to a truncated
identifier?

* Would it be a problem to use CRC32 instead of SHA? (Since security is
not a problem, and CRC32 is faster.)

What happens if you get a collision?

That is, you have two different long identifiers:

a.b.c.d...something
a.b.c.d...anotherthing

which by bad luck both hash to the same value:

a.b.c.d.$AABB99
a.b.c.d.$AABB99

(or whatever).


* I'm truncating the digest value to 10 characters. Is it safe enough?
I don't want to use more than 10 characters, because then it wouldn't be
possible to recognize the original name.
* Can somebody think of a
better algorithm that would give a bigger chance of recognizing the
original identifier from the modified one?

Rather than truncating the most significant part of the identifier, the
field name, you should truncate the least important part, the middle.

a.b.c.d.e.f.g.something

goes to:

a.b...g.something

or similar.
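
Something along these lines, assuming the components are joined with
underscores as in the original example (shorten_middle is a
hypothetical helper; it does not guarantee uniqueness):

def shorten_middle(name, maxlen=32, sep="_"):
    # Drop components from the middle of the name, keeping the first
    # component and the most significant last component.  Only a
    # sketch: the result can still be too long if the first and last
    # components alone exceed maxlen.
    if len(name) <= maxlen:
        return name
    parts = name.split(sep)
    head, middle, tail = parts[:1], parts[1:-1], parts[-1:]
    while middle and len(sep.join(head + middle + tail)) > maxlen:
        middle.pop(len(middle) // 2)
    return sep.join(head + middle + tail)

print(shorten_middle("group1_group2_group3_some_field_name"))
# -> group1_group2_group3_field_name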
 

Laszlo Nagy

What happens if you get a collision?

That is, you have two different long identifiers:

a.b.c.d...something
a.b.c.d...anotherthing

which by bad luck both hash to the same value:

a.b.c.d.$AABB99
a.b.c.d.$AABB99

(or whatever).

Yes, that was the question. How do I avoid that? (Of course I can avoid
that by using a full sha256 hash value.)
Rather than truncating the most significant part of the identifier, the
field name, you should truncate the least important part, the middle.

a.b.c.d.e.f.g.something

goes to:

a.b...g.something

or similar.

Yes, this is a good idea. Thank you.
 

Ian Kelly

What happens if you get a collision?

That is, you have two different long identifiers:

a.b.c.d...something
a.b.c.d...anotherthing

which by bad luck both hash to the same value:

a.b.c.d.$AABB99
a.b.c.d.$AABB99

(or whatever).

The odds of a given pair of identifiers having the same digest to 10
hex digits are 1 in 16^10, or approximately 1 in a trillion. If you
bought one lottery ticket a day at those odds, you would win
approximately once every 3 billion years. But it's not enough just to
have a hash collision, they also have to match exactly in the first 21
(or 30, or whatever) characters of their actual names, and they have
to both be long enough to invoke the truncating scheme in the first
place.

The Oracle backend for Django uses this same approach with an MD5 sum
to ensure that identifiers will be no more than 30 characters long (a
hard limit imposed by Oracle). It actually truncates the hash to 4
digits, though, not 10. This hasn't caused any problems that I'm
aware of.
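
Something in the same spirit (only a sketch modeled on the description
above; shorten_with_md5 and its defaults are made up here, this is not
Django's actual code):

import hashlib

def shorten_with_md5(name, maxlen=30, hash_len=4):
    # Keep names that already fit; otherwise keep the leading part of
    # the name and replace the rest with a short hex digest of the
    # full name.
    if len(name) <= maxlen:
        return name
    digest = hashlib.md5(name).hexdigest()[:hash_len]
    return name[:maxlen - hash_len] + digest

print(shorten_with_md5("group1_group2_group3_some_field_name"))
# -> the first 26 characters followed by 4 hex digits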
 

Ian Kelly

The odds of a given pair of identifiers having the same digest to 10
hex digits are 1 in 16^10, or approximately 1 in a trillion. If you
bought one lottery ticket a day at those odds, you would win
approximately once every 3 billion years. But it's not enough just to
have a hash collision, they also have to match exactly in the first 21
(or 30, or whatever) characters of their actual names, and they have
to both be long enough to invoke the truncating scheme in the first
place.

The Oracle backend for Django uses this same approach with an MD5 sum
to ensure that identifiers will be no more than 30 characters long (a
hard limit imposed by Oracle). It actually truncates the hash to 4
digits, though, not 10. This hasn't caused any problems that I'm
aware of.

As a side note, the odds of having at least one hash collision among
multiple tables are known as the birthday problem. At 4 hex digits
there are 65536 possible digests, and it turns out that at 302 tables
there is a >50% chance that at least one pair of those names have the
same 4-digit digest. That doesn't mean you should be concerned if you
have 302 tables in your Django Oracle database, though, because those
colliding tables also have to match completely in the first 26
characters of their generated names, which is not that common. If a
collision ever did occur, the resolution would be simple: manually set
the name of one of the offending tables in the model definition.

With 16 ** 10 possible digests, the probability of collision hits 50%
at 1234605 tables.
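
Those figures can be reproduced with a direct birthday-problem
calculation (a small sketch; collision_probability is just an
illustration):

def collision_probability(n, d):
    # Probability that at least two of n random digests collide when
    # there are d equally likely digest values (birthday problem).
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= float(d - i) / d
    return 1.0 - p_distinct

print(collision_probability(302, 16 ** 4))       # ~0.50 for 4 hex digits
print(collision_probability(1234605, 16 ** 10))  # ~0.50 for 10 hex digits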
 

Steven D'Aprano

The odds of a given pair of identifiers having the same digest to 10 hex
digits are 1 in 16^10, or approximately 1 in a trillion.

Unless an attacker can select the field names, in which case they may be
able to improve those odds significantly. In the case of MD5, they can
possibly improve those odds to 1 in 1, since MD5 is vulnerable to
collision attacks. Not so for some (all?) of the SHA hashes, at least not
yet, but they're much more expensive to calculate.

If the OP sticks with his intention to use CRC32, the odds won't be
anywhere near that low. CRC32 is neither collision-resistant nor
cryptographically random, and only generates eight hex digits, not ten.
 

Laszlo Nagy

As a side note, the odds of having at least one hash collision among
multiple tables are known as the birthday problem. At 4 hex digits
there are 65536 possible digests, and it turns out that at 302 tables
there is a >50% chance that at least one pair of those names have the
same 4-digit digest. That doesn't mean you should be concerned if you
have 302 tables in your Django Oracle database, though, because those
colliding tables also have to match completely in the first 26
characters of their generated names, which is not that common. If a
collision ever did occur, the resolution would be simple: manually set
the name of one of the offending tables in the model definition.

With 16 ** 10 possible digests, the probability of collision hits 50%
at 1234605 tables.

Thank you for the precise explanation. :) Well, if Django and Oracle
use this, then it can't be a very bad idea. :)
 

Laszlo Nagy

Unless an attacker can select the field names, in which case they may be
able to improve those odds significantly. In the case of MD5, they can
possibly improve those odds to 1 in 1, since MD5 is vulnerable to
collision attacks. Not so for some (all?) of the SHA hashes, at least not
yet, but they're much more expensive to calculate.

If the OP sticks with his intention to use CRC32, the odds won't be
anywhere near that low. CRC32 is neither collision-resistant nor
cryptographically random, and only generates eight hex digits, not ten.

I'm not afraid of attackers. As I said, nobody will want to "crack" the
hash. This is for an in-house project. Only the DBA can create new
database objects. If the DBA wants to do something wrong, he can simply
delete the whole database. He doesn't need to crack any hash value. :)

So yes, CRC32 is not collision-resistant and not cryptographically
random. But in my case, sha256 is not collision-resistant either
(because I'm only using the first few characters of the digest value).
And cryptographic randomness is not a requirement.

Given these circumstances, maybe using CRC32 would be fine too.
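
A CRC32-based variant could look like this (only a sketch to illustrate
the idea; zlib.crc32 can return a negative value on Python 2, so it is
masked to 32 bits before formatting):

import zlib

def makename_crc(basename, maxlen=32):
    # Same scheme as makename, but with an 8-hex-digit CRC32 tail
    # instead of the truncated SHA-256 digest.
    if len(basename) <= maxlen:
        return basename
    tail = "%08x" % (zlib.crc32(basename) & 0xffffffff)
    return basename[:maxlen - len(tail) - 1] + "$" + tail

print(makename_crc("group1_group2_group3_some_field_name"))
# -> group1_group2_group3_so$ followed by 8 hex digits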

I wonder what kind of hash Django uses for Oracle.
 

Laszlo Nagy

With 16 ** 10 possible digests, the probability of collision hits 50%
at 1234605 tables


Actually, I'm using base64 encoding, so it is 64**10. I guess using 6
characters will be enough.
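
A quick sanity check of that guess, using the birthday bound (the
names_for_50_percent helper is just an illustration):

from math import log, sqrt

def names_for_50_percent(tail_chars, alphabet=64):
    # Approximate number of names needed for a ~50% chance of at least
    # one collision in the tail (birthday bound: about 1.18 * sqrt(d)).
    d = float(alphabet) ** tail_chars
    return int(sqrt(2.0 * d * log(2)))

print(names_for_50_percent(6))   # about 308 thousand names
print(names_for_50_percent(10))  # about 1.26 billion names

So a 6-character tail only becomes a realistic risk when hundreds of
thousands of long identifiers share the same truncated prefix.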
 
