Generating a unique identifier

  • Thread starter Steven D'Aprano

Steven D'Aprano

I have an application that will be producing many instances, using them
for a while, then tossing them away, and I want each one to have a unique
identifier that won't be re-used for the lifetime of the Python session.

I can't use the id() of the object, because that is only guaranteed to be
unique during the lifetime of the object.

For my application, it doesn't matter if the ids are predictable, so I
could do something as simple as this:

def unique_id():
    n = 1234567890
    while True:
        yield n
        n += 1

unique_id = unique_id()

while Application_Is_Running:
    make_an_object(id=unique_id.next())
    do_stuff_with_objects()
    delete_some_of_them()

which is easy enough, but I thought I'd check if there was an existing
solution in the standard library that I missed. Also, for other
applications, I might want them to be rather less predictable.
 

Marc 'BlackJack' Rintsch

Steven D'Aprano said:
For my application, it doesn't matter if the ids are predictable, so I
could do something as simple as this:

def unique_id():
    n = 1234567890
    while True:
        yield n
        n += 1

[...]

which is easy enough, but I thought I'd check if there was an existing
solution in the standard library that I missed.

For that easy solution you can use `itertools.count()`.
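
For example:

import itertools

unique_id = itertools.count(1234567890)

print unique_id.next()  # 1234567890
print unique_id.next()  # 1234567891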

Ciao,
Marc 'BlackJack' Rintsch
 

kyosohma

Steven D'Aprano said:
which is easy enough, but I thought I'd check if there was an existing
solution in the standard library that I missed. Also, for other
applications, I might want them to be rather less predictable.

You could always use the md5 module along with time.time() to generate
a unique id.
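
Something like this (untested sketch; note that two calls within the
same clock tick would give the same digest, so you might want to mix in
a counter as well):

import md5, time

def unique_id():
    # hex digest of the current timestamp
    return md5.new(str(time.time())).hexdigest()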

Mike
 

Wildemar Wildenburger

Will Maier said:
2.5 includes the uuid module for RFC 4122 universally-unique IDs:

http://docs.python.org/lib/module-uuid.html

Jesus Christ! What in all the world *doesn't* Python do already?!

Can't we now just write up a module for the standard lib that takes care
of writing all the other code as well and just forget about getting our
hands dirty with programming altogether?

/W
(just in case: ;)!)
 

Paul Rubin

Steven D'Aprano said:
def unique_id():
    n = 1234567890
    while True:
        yield n
        n += 1

import itertools

unique_id = itertools.count(1234567890)

Steven D'Aprano said:
Also, for other applications, I might want them to be rather less
predictable.

import os

def unique_id():
    return os.urandom(10).encode('hex')
 

Paul Rubin

Paul Rubin said:
def unique_id():
    return os.urandom(10).encode('hex')

Sorry, make that 32 or 40 instead of 10, if the number of id's is large,
to make birthday collisions unlikely.

If you don't want the id's to be that large, you can implement a
Feistel cipher using md5 or sha as the round function pretty
straightforwardly, then just feed successive integers through it.
That also guarantees uniqueness, at least within one run of the
program. I have some sample code around for that, let me know if you
need it.
 

Ben Finney

Will Maier said:
2.5 includes the uuid module for RFC 4122 universally-unique IDs:

http://docs.python.org/lib/module-uuid.html

I second this recommendation. If you want arbitrary unique IDs that
are not a function of the data they identify, the simplest solution is
a monotonically-increasing serial number. If you want more than that,
you might as well go straight to the standard UUIDs.

Even if you're not using Python 2.5, grab the external 'python-uuid'
package for your operating system and use that. If you use the options
for including random elements, the generated UUIDs will even satisfy
your "unpredictable" requirement.
 

Steven D'Aprano

Paul Rubin said:
unique_id = itertools.count(1234567890)

Sweet!

I really must make itertools second-nature. I always forget it.

Paul Rubin said:
def unique_id():
    return os.urandom(10).encode('hex')

Any time I see something using a random number to generate IDs, I worry
about collisions. Am I being paranoid? (But even paranoids write code
with bugs...)


Here's something which is a little less predictable than a straight
counter:

import random

def unpredictable_counter(n=12345678):
    while True:
        n += random.randint(1, 68)
        yield n

def more_unpredictable_counter(n=1234567):
    uc = unpredictable_counter(n)
    pool = []
    while True:
        if not pool:
            pool = [None]*99
            for i in range(99):
                pool[i] = uc.next()
            random.shuffle(pool)
        yield pool.pop()
 

Paul Rubin

Ben Finney said:
I second this recommendation. If you want arbitrary unique IDs that
are not a function of the data they identify, the simplest solution is
a monotonically-increasing serial number. If you want more than that,
you might as well go straight to the standard UUIDs.

The standard UUID's as described at that URL don't make any serious
attempt to be unguessable. It's better to use a string from os.urandom().

To Steven: double oops, I was thinking of hex digits rather than bytes
when I suggested reading 32 (length of an md5 hash) or 40 (length of
an sha1 hash) from os.urandom. That should have said 16 or 20 bytes.
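
So a corrected version would look something like:

import os

def unique_id():
    # 16 random bytes = 128 bits, hex-encoded to a 32-character string
    return os.urandom(16).encode('hex')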

If k is the number of unique id's you'll actually use, and N is the
number of possible random uid's (so for 16 bytes, N=2**128, since
16 bytes is 128 bits), the probability of avoiding a collision is
approximately exp(-k**2 / (2*N)), assuming k is small compared
with sqrt(N).
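
As a quick sanity check of that arithmetic:

import math

k = 2.0**30   # about a billion ids actually used
N = 2.0**128  # possible ids for 16 random bytes
print math.exp(-k*k / (2*N))  # probability of no collision, ~1.0
# collision probability is about k**2/(2*N) = 2**-69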
 

Paul Rubin

Steven D'Aprano said:
Sweet!

I really must make itertools second-nature. I always forget it.

Beware that count() output (like enumerate) is always <= sys.maxint.
It raises an exception if it overflows. IMO this is a bug.

Steven D'Aprano said:
Any time I see something using a random number to generate IDs, I worry
about collisions. Am I being paranoid? (But even paranoids write code
with bugs...)

Well, the idea is to make the string long enough (I shouldn't have
picked 10 bytes) to make the probability of collisions acceptably low.
The probability is about exp(-k**2/(2*N)) where k is the number of
id's you use and N is the number of possible ID's. So with
os.urandom(20), if you use 1 billion (= approx 2**30) id's,
the probability is about exp(-(2**60 / 2*2**160)) = 1/exp(2**101)
which is extremely small. Using random strings is probably safer
against collisions than maintaining distinct initial counters
across multiple runs of a program, on multiple computers, etc.

The classic example is ethernet cards. Each one is supposed to have a
unique hardware 48-bit MAC address, with routers etc. relying on the
uniqueness. Some organization is supposed to assign a unique code to
each manufacturer, and then the manufacturer uses that code as a
prefix and assigns addresses in sequence within that space. That
works fine until sales go up, manufacturers start opening multiple
factories operating out of the same space, low-rent manufacturers
start making up their own codes, etc. So companies that buy large
numbers of cards often find multiple cards with the same address,
causing various hassles. If they just assigned 128-bit MAC addresses
at random it's very unlikely that there would ever be a collision.

Steven D'Aprano said:
Here's something which is a little less predictable than a straight
counter:

It's still very easy to generate valid id's from that, or guess the
next one with non-negligible probability. IMO it's almost never worth
messing with a "little less predictable". If you're trying to prevent
an actual opponent from guessing something, use full-strength
cryptography.

You could try something like this:

import sha, os, itertools

radix = 2**32 # keep this == 2**(some even number <= 80)
secret_key = os.urandom(20)

def prp(key, n):
    # pseudo-random permutation (8-round Feistel network);
    # transforms n to a unique number < radix**2
    assert 0 <= n < radix**2

    def F(i, x):
        return int(sha.new('%s,%x,%x' % (key, i, x)).hexdigest(), 16) % radix

    a, b = divmod(n, radix)
    for i in xrange(8):
        a ^= F(i, b)
        a, b = b, a
    return radix*a + b

unique_id = (prp(secret_key, i) for i in itertools.count())

It should be pretty secure and (unless I made an error) is a
permutation from [0:radix**2] to [0:radix**2], similar to DES.
It's invertible if you know the secret key (I'll leave that as an
exercise). 8 rounds is probably overkill for radix 2**32.
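
Drawing a few ids from the generator then looks like:

ids = [unique_id.next() for i in xrange(3)]
# three hard-to-predict, guaranteed-distinct numbers below radix**2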
 

Paul Rubin

Paul Rubin said:
the probability is about exp(-(2**60 / 2*2**160)) = 1/exp(2**101)

Bah. Got that backwards and calculated wrong. The probability of no
collisions is

exp(-(2**60) / (2*2**160)) = exp(-(2**-101)) = approx (1 - 2**-101)

which is very close to 1. The probability of collision is about
2**-101 which is close to 0. Proofread first, then post. I keep
getting that in the wrong order.
 

Steve Holden

Paul Rubin said:
Bah. Got that backwards and calculated wrong. The probability of no
collisions is

exp(-(2**60) / (2*2**160)) = exp(-(2**-101)) = approx (1 - 2**-101)

which is very close to 1. The probability of collision is about
2**-101 which is close to 0. Proofread first, then post. I keep
getting that in the wrong order.

Right. The words "birthday attack" should have been in the back of your
mind while you were writing that message.

regards
Steve
 

Steven D'Aprano

Paul Rubin said:
Sorry, make that 32 or 40 instead of 10, if the number of id's is large,
to make birthday collisions unlikely.

I did a small empirical test, and with 16 million ids, I found no
collisions.

However, I did find that trying to dispose of a set of 16 million short
strings caused my Python session to lock up for twenty minutes until I
got fed up and killed the process. Should garbage-collecting 16 million
strings really take 20+ minutes?

Paul Rubin said:
If you don't want the id's to be that large, you can implement a Feistel
cipher using md5 or sha as the round function pretty straightforwardly,
then just feed successive integers through it. That also guarantees
uniqueness, at least within one run of the program. I have some sample
code around for that, let me know if you need it.

I'm not sure that I need it, but I would certainly be curious to see it.


Thanks,
 

Steven D'Aprano

Paul Rubin said:
It's still very easy to generate valid id's from that, or guess the next
one with non-negligible probability.

Well, in my specific application, the only consequence of guessing a
valid id is that you get to congratulate yourself for guessing a valid
id :)

As it turned out, the only reason I had for avoiding consecutive ids was
that they didn't look random, and since no human being (including
myself) was actually going to look at them except during debugging, that
didn't really matter. So I went with the itertools.count() solution.


Thanks to all who replied.
 

Paul Rubin

Steven D'Aprano said:
I did a small empirical test, and with 16 million ids, I found no
collisions.

16 million 32-byte ids? With string and dictionary overhead that's
probably on the order of 1 GB. Anyway, 16 bytes is enough, as
mentioned elsewhere.

Steven D'Aprano said:
However, I did find that trying to dispose of a set of 16 million short
strings caused my Python session to lock up for twenty minutes until I
got fed up and killed the process. Should garbage-collecting 16 million
strings really take 20+ minutes?

Maybe your system was thrashing, or maybe the GC was happening during
allocation (there was some discussion of that a while back).

Steven D'Aprano said:
I'm not sure that I need it, but I would certainly be curious to see it.

I posted some code elsewhere in the thread.
 

Hrvoje Niksic

Steven D'Aprano said:
Should garbage-collecting 16 million strings really take 20+
minutes?

It shouldn't. For testing purposes I've created a set of 16 million
strings like this:

s = set()
for n in xrange(16000000):
    # the prefix makes the strings a bit larger
    s.add('somerandomprefix' + str(n))

It takes maybe about 20 seconds to create the set. Quitting Python
takes 4-5 seconds. This is stock Python 2.5.1.
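
A rough way to time the teardown itself (continuing from the snippet
above, where s holds the only reference to the set):

import time

start = time.time()
s = None  # drop the last reference; CPython frees the set immediately
print 'teardown took %.1f seconds' % (time.time() - start)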
 

stephen.lewitowski

Steven D'Aprano said:
I have an application that will be producing many instances, using them
for a while, then tossing them away, and I want each one to have a unique
identifier that won't be re-used for the lifetime of the Python session.

[...]

This is very simple. Use a class variable that increments every time a
new object is created. If you need it in a number of different types of
object, you could use the code below in a metaclass and tag all of your
classes with the metaclass. Unless you are making billions of objects, I
think this should suffice.

class MyObj:
    id_count = 0
    def __init__(self):
        self.id = self.id_count
        MyObj.id_count += 1
        print self.id, MyObj.id_count

MyObj()
MyObj()
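
A sketch of the metaclass variant (untested; AutoIdMeta is just an
illustrative name, not from the thread):

import itertools

class AutoIdMeta(type):
    # illustrative metaclass: one shared counter, so ids are
    # unique across all tagged classes
    _counter = itertools.count()

    def __call__(cls, *args, **kwargs):
        obj = super(AutoIdMeta, cls).__call__(*args, **kwargs)
        obj.id = AutoIdMeta._counter.next()
        return obj

class MyObj(object):
    __metaclass__ = AutoIdMeta

class Other(object):
    __metaclass__ = AutoIdMeta

print MyObj().id, Other().id, MyObj().id  # prints 0 1 2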
 
