Generating valid identifiers

Discussion in 'Python' started by Laszlo Nagy, Jul 26, 2012.

  1. Laszlo Nagy

    Laszlo Nagy Guest

    I have a program that creates various database objects in PostgreSQL.
    There is a DOM, and for each element in the DOM, a database object is
    created (schema, table, field, index and tablespace).

    I do not want this program to generate very long identifiers. It would
    increase SQL parsing time, and don't look good. Let's just say that the
    limit should be 32 characters. But I also want to recognize the
    identifiers when I look at their modified/truncated names.

    So I have come up with this solution:

    - I have restricted original identifiers not to contain the dollar sign.
    They can only contain [A-Z] or [a-z] or [0-9] and the underscore. Here
    is a valid example:

    "group1_group2_group3_some_field_name"

    - I'm trying to use a hash function to reduce the length of the
    identifier when it is too long:

    class Connection(object):
    # ... more code here
    @classmethod
    def makename(cls, basename):
    if len(basename)>32:
    h = hashlib.sha256()
    h.update(basename)
    tail = base64.b64encode(h.digest(),"_$")[:10]
    return basename[:30]+"$"+tail
    else:
    return basename

    Here is the result:

    print repr(Connection.makename("some_field_name"))
    'some_field_name'
    print repr(Connection.makename("group1_group2_group3_some_field_name"))
    'group1_group2_group3_some_fiel$AyQVQUXoyf'

    So, if the identifier is too long, then I use a modified version, that
    should be unique, and similar to the original name. Let's suppose that
    nobody wants to crack this modified hash on purpose.

    And now, the questions:

    * Would it be a problem to use CRC32 instead of SHA? (Since security is
    not a problem, and CRC32 is faster.)
    * I'm truncating the digest value to 10 characters. Is it safe enough?
    I don't want to use more than 10 characters, because then it wouldn't be
    possible to recognize the original name.
    * Can somebody think of a better algorithm, that would give a bigger
    chance of recognizing the original identifier from the modified one?

    Thanks,

    Laszlo
     
    Laszlo Nagy, Jul 26, 2012
    #1
    1. Advertising

  2. On Thu, 26 Jul 2012 14:26:16 +0200, Laszlo Nagy wrote:

    > I do not want this program to generate very long identifiers. It would
    > increase SQL parsing time,


    Will that increase in SQL parsing time be more, or less, than the time it
    takes to generate CRC32 or SHA hashsums and append them to a truncated
    identifier?


    > * Would it be a problem to use CRC32 instead of SHA? (Since security is
    > not a problem, and CRC32 is faster.)


    What happens if you get a collision?

    That is, you have two different long identifiers:

    a.b.c.d...something
    a.b.c.d...anotherthing

    which by bad luck both hash to the same value:

    a.b.c.d.$AABB99
    a.b.c.d.$AABB99

    (or whatever).



    > * I'm truncating the digest value to 10 characters. Is it safe enough?
    > I don't want to use more than 10 characters, because then it wouldn't be
    > possible to recognize the original name.


    > * Can somebody think of a
    > better algorithm, that would give a bigger chance of recognizing the
    > original identifier from the modified one?


    Rather than truncating the most significant part of the identifier, the
    field name, you should truncate the least important part, the middle.

    a.b.c.d.e.f.g.something

    goes to:

    a.b...g.something

    or similar.



    --
    Steven
     
    Steven D'Aprano, Jul 26, 2012
    #2
    1. Advertising

  3. Laszlo Nagy

    Laszlo Nagy Guest


    >> * Would it be a problem to use CRC32 instead of SHA? (Since security is
    >> not a problem, and CRC32 is faster.)

    > What happens if you get a collision?
    >
    > That is, you have two different long identifiers:
    >
    > a.b.c.d...something
    > a.b.c.d...anotherthing
    >
    > which by bad luck both hash to the same value:
    >
    > a.b.c.d.$AABB99
    > a.b.c.d.$AABB99
    >
    > (or whatever).

    Yes, that was the question. How do I avoid that? (Of course I can avoid
    that by using a full sha256 hash value.)
    >> * Can somebody think of a
    >> better algorithm, that would give a bigger chance of recognizing the
    >> original identifier from the modified one?

    > Rather than truncating the most significant part of the identifier, the
    > field name, you should truncate the least important part, the middle.
    >
    > a.b.c.d.e.f.g.something
    >
    > goes to:
    >
    > a.b...g.something
    >
    > or similar.

    Yes, this is a good idea. Thank you.
     
    Laszlo Nagy, Jul 26, 2012
    #3
  4. Laszlo Nagy

    Ian Kelly Guest

    On Thu, Jul 26, 2012 at 9:30 AM, Steven D'Aprano
    <> wrote:
    > What happens if you get a collision?
    >
    > That is, you have two different long identifiers:
    >
    > a.b.c.d...something
    > a.b.c.d...anotherthing
    >
    > which by bad luck both hash to the same value:
    >
    > a.b.c.d.$AABB99
    > a.b.c.d.$AABB99
    >
    > (or whatever).


    The odds of a given pair of identifiers having the same digest to 10
    hex digits are 1 in 16^10, or approximately 1 in a trillion. If you
    bought one lottery ticket a day at those odds, you would win
    approximately once every 3 billion years. But it's not enough just to
    have a hash collision, they also have to match exactly in the first 21
    (or 30, or whatever) characters of their actual names, and they have
    to both be long enough to invoke the truncating scheme in the first
    place.

    The Oracle backend for Django uses this same approach with an MD5 sum
    to ensure that identifiers will be no more than 30 characters long (a
    hard limit imposed by Oracle). It actually truncates the hash to 4
    digits, though, not 10. This hasn't caused any problems that I'm
    aware of.
     
    Ian Kelly, Jul 26, 2012
    #4
  5. Laszlo Nagy

    Ian Kelly Guest

    On Thu, Jul 26, 2012 at 1:28 PM, Ian Kelly <> wrote:
    > The odds of a given pair of identifiers having the same digest to 10
    > hex digits are 1 in 16^10, or approximately 1 in a trillion. If you
    > bought one lottery ticket a day at those odds, you would win
    > approximately once every 3 billion years. But it's not enough just to
    > have a hash collision, they also have to match exactly in the first 21
    > (or 30, or whatever) characters of their actual names, and they have
    > to both be long enough to invoke the truncating scheme in the first
    > place.
    >
    > The Oracle backend for Django uses this same approach with an MD5 sum
    > to ensure that identifiers will be no more than 30 characters long (a
    > hard limit imposed by Oracle). It actually truncates the hash to 4
    > digits, though, not 10. This hasn't caused any problems that I'm
    > aware of.


    As a side note, the odds of having at least one hash collision among
    multiple tables are known as the birthday problem. At 4 hex digits
    there are 65536 possible digests, and it turns out that at 302 tables
    there is a >50% chance that at least one pair of those names have the
    same 4-digit digest. That doesn't mean you should be concerned if you
    have 302 tables in your Django Oracle database, though, because those
    colliding tables also have to match completely in the first 26
    characters of their generated names, which is not that common. If a
    collision ever did occur, the resolution would be simple: manually set
    the name of one of the offending tables in the model definition.

    With 16 ** 10 possible digests, the probability of collision hits 50%
    at 1234605 tables.
     
    Ian Kelly, Jul 26, 2012
    #5
  6. On Thu, 26 Jul 2012 13:28:26 -0600, Ian Kelly wrote:

    > The odds of a given pair of identifiers having the same digest to 10 hex
    > digits are 1 in 16^10, or approximately 1 in a trillion.


    Unless an attacker can select the field names, in which case they may be
    able to improve those odds significantly. In the case of MD5, they can
    possibly improve those odds to 1 in 1, since MD5 is vulnerable to
    collision attacks. Not so for some (all?) of the SHA hashes, at least not
    yet, but they're much more expensive to calculate.

    If the OP sticks with his intention to use CRC32, the odds won't be
    anywhere near that low. CRC32 is neither collision-resistant nor
    cryptographically random, and only generates eight hex digits, not ten.


    --
    Steven
     
    Steven D'Aprano, Jul 27, 2012
    #6
  7. Laszlo Nagy

    Laszlo Nagy Guest


    > As a side note, the odds of having at least one hash collision among
    > multiple tables are known as the birthday problem. At 4 hex digits
    > there are 65536 possible digests, and it turns out that at 302 tables
    > there is a >50% chance that at least one pair of those names have the
    > same 4-digit digest. That doesn't mean you should be concerned if you
    > have 302 tables in your Django Oracle database, though, because those
    > colliding tables also have to match completely in the first 26
    > characters of their generated names, which is not that common. If a
    > collision ever did occur, the resolution would be simple: manually set
    > the name of one of the offending tables in the model definition.
    >
    > With 16 ** 10 possible digests, the probability of collision hits 50%
    > at 1234605 tables.

    Thank you for the precise explanation. :) Well, if Django and Oracle
    uses this, then it can't be a very bad idea. :)
     
    Laszlo Nagy, Jul 27, 2012
    #7
  8. Laszlo Nagy

    Laszlo Nagy Guest


    > Unless an attacker can select the field names, in which case they may be
    > able to improve those odds significantly. In the case of MD5, they can
    > possibly improve those odds to 1 in 1, since MD5 is vulnerable to
    > collision attacks. Not so for some (all?) of the SHA hashes, at least not
    > yet, but they're much more expensive to calculate.
    >
    > If the OP sticks with his intention to use CRC32, the odds won't be
    > anywhere near that low. CRC32 is neither collision-resistant nor
    > cryptographically random, and only generates eight hex digits, not ten.
    >
    >

    I'm not affraid of attackers. As I said, nobody will want to "crack" the
    hash. This is for an in-house project. Only the dba can create new
    database objects. If the dba wants to do something wrong, he can simply
    delete whole database. He doesn't need to crack any hash value. :)

    So yes, CRC32 is not collision-resistant, and not cryptographically
    random. But in my case, sha256 is not collision resistant either
    (because I'm using the first few chars of the digest value only). And
    cryptographic randomness is not a requirement.

    Given these circumstances, maybe using CRC32 would be fine too.

    I wonder what kind of hash Django uses for Oracle.
     
    Laszlo Nagy, Jul 27, 2012
    #8
  9. Laszlo Nagy

    Laszlo Nagy Guest

    > With 16 ** 10 possible digests, the probability of collision hits 50%
    at 1234605 tables


    Actually, I'm using base64 encoding. So it is 64**10. I guess using 6
    characters will enough.
     
    Laszlo Nagy, Jul 27, 2012
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Replies:
    1
    Views:
    355
    Roedy Green
    Apr 22, 2008
  2. Arnaud Delobelle

    Re: Generating valid identifiers

    Arnaud Delobelle, Jul 26, 2012, in forum: Python
    Replies:
    0
    Views:
    190
    Arnaud Delobelle
    Jul 26, 2012
  3. Terry Reedy

    Re: Generating valid identifiers

    Terry Reedy, Jul 26, 2012, in forum: Python
    Replies:
    0
    Views:
    198
    Terry Reedy
    Jul 26, 2012
  4. Peter Otten

    Re: Generating valid identifiers

    Peter Otten, Jul 26, 2012, in forum: Python
    Replies:
    0
    Views:
    152
    Peter Otten
    Jul 26, 2012
  5. Emile van Sebille

    Re: Generating valid identifiers

    Emile van Sebille, Jul 26, 2012, in forum: Python
    Replies:
    0
    Views:
    172
    Emile van Sebille
    Jul 26, 2012
Loading...

Share This Page