testing for uniqueness in a large list


Lol McBride

Hi All,
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time it takes to a minimum?

Thanks for taking the time to read this and doubly so if you're able to
help.
Cheers,
Lol McBride

#!/usr/bin/python
from random import *
# seq is the pool to choose from
seq = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
# this is how many times to choose a unique selection
rlen = 200000
counter = 100
ints = [0]
while 1:
    # if this is the last value for rlen then we stop
    if rlen == 0:
        break
    intLen = len(ints)
    cnt = 0
    testVal = sample(seq, 12)
    testVal.sort()
    if len(ints) >= counter:
        print len(ints)
        counter = counter + 100
    while 1:
        if ints.count(testVal) == 1:
            cnt = cnt + 1
            break
        intLen = intLen - 1
        if intLen == 0:
            break
        if cnt == 1:
            continue
        else:
            ints.append(testVal)
            rlen = rlen - 1
 

Cliff Wells

Hi All,
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time it takes to a minimum?

Thanks for taking the time to read this and doubly so if you're able to
help.
Cheers,
Lol McBride

#!/usr/bin/python
from random import *
# seq is the pool to choose from
seq = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
# this is how many times to choose a unique selection
rlen = 200000
counter = 100
ints = [0]
while 1:
    # if this is the last value for rlen then we stop
    if rlen == 0:
        break
    intLen = len(ints)
    cnt = 0
    testVal = sample(seq, 12)
    testVal.sort()
    if len(ints) >= counter:
        print len(ints)
        counter = counter + 100
    while 1:
        if ints.count(testVal) == 1:
            cnt = cnt + 1
            break
        intLen = intLen - 1
        if intLen == 0:
            break
        if cnt == 1:
            continue
        else:
            ints.append(testVal)
            rlen = rlen - 1

Disregarding more efficient methods of generating your list (of which
I'm sure there are many), part of the "interminable time" is probably
taken up in the infinite loop you have ;) Set rlen to a much smaller
number and you'll see that the program still never terminates.

You'll notice that:

ints = [0]
while 1:
    ...
    intLen = len(ints) # aka 1
    ...
    while 1:
        ...
        intLen = intLen - 1 # aka 0
        if intLen == 0: # always true
            break
        ...



Basically this means that nothing is ever appended to ints nor is rlen
ever decremented (and these are the tests by which the loops terminate).
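
For what it's worth, a minimal repair of that loop, assuming the inner
while was only ever meant to be a linear membership test (Python's "in"
operator does that scan for you), reusing seq, rlen and sample from the
original script:

ints = []                       # no [0] placeholder needed
while rlen > 0:
    testVal = sample(seq, 12)
    testVal.sort()
    if testVal not in ints:     # linear scan -- still the slow part
        ints.append(testVal)
        rlen = rlen - 1

This version terminates, but only if rlen doesn't exceed the number of
distinct 12-item subsets of seq, and each candidate still costs a scan
of the whole list.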

I hear that Python 3000 will complete infinite loops in half the time
however ;)

Regards,
Cliff
 

Alex Martelli

Lol McBride said:
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time it takes to a minimum?

One word: dictionaries!

Untested, but should work...:


import random   # don't use from...import *, on general grounds

def picks(seq=xrange(1, 21), rlen=200000, picklen=12):
    results = {}
    while len(results) < rlen:
        pick = random.sample(seq, picklen)
        pick.sort()
        pick = tuple(pick)
        results[pick] = 1
    return results.keys()

this returns a list of tuples; if you need a list of lists,

    return [list(pick) for pick in results]

In 2.4, you could use "results = set()" instead of {}, results.add(pick) to
maybe add a new pick, and possibly "return list(results)" if you need a
list of tuples as the function's overall return value (the list
comprehension for the case in which you need a list of lists stays OK).
But that's just an issue of readability, not speed. Slight speedups can
be had by hoisting some lookups out of the while loop, but nothing
major, I think -- e.g. sample = random.sample just before the while, and
use sample(seq, picklen) within the while. Putting all the code in a
function and using local variables IS a major speed win -- local
variable setting and access is WAY faster than globals'.
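
Putting those two ideas together, a minimal sketch of the 2.4 variant
with the lookups hoisted might look like:

import random

def picks(seq=xrange(1, 21), rlen=200000, picklen=12):
    sample = random.sample      # hoisted: one global lookup, not one per pick
    results = set()             # 2.4 built-in set instead of a dict
    add = results.add
    while len(results) < rlen:
        pick = sample(seq, picklen)
        pick.sort()
        add(tuple(pick))        # duplicates are silently absorbed
    return list(results)        # a list of tuples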


Alex
 

Lol McBride

Lol McBride said:
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time it takes to a minimum?

One word: dictionaries!

Untested, but should work...:


import random   # don't use from...import *, on general grounds

def picks(seq=xrange(1, 21), rlen=200000, picklen=12):
    results = {}
    while len(results) < rlen:
        pick = random.sample(seq, picklen)
        pick.sort()
        pick = tuple(pick)
        results[pick] = 1
    return results.keys()

this returns a list of tuples; if you need a list of lists,

    return [list(pick) for pick in results]

In 2.4, you could use "results = set()" instead of {}, results.add(pick) to
maybe add a new pick, and possibly "return list(results)" if you need a
list of tuples as the function's overall return value (the list
comprehension for the case in which you need a list of lists stays OK).
But that's just an issue of readability, not speed. Slight speedups can
be had by hoisting some lookups out of the while loop, but nothing
major, I think -- e.g. sample = random.sample just before the while, and
use sample(seq, picklen) within the while. Putting all the code in a
function and using local variables IS a major speed win -- local
variable setting and access is WAY faster than globals'.


Alex
Thank you Alex - I'll try your method and see how I get on.
Lol
 

bearophile

Alex Martelli's solution is *very* good, but there is a sampling
problem: put a simple print inside his program:

if not (len(results) % 5000):
    print len(results)

You can see that it slows down a lot when the dictionary contains
about 100000-120000 different sequences, because there are many
collisions, and it keeps slowing down. Probably a little speed-up of
this code won't help; a different algorithm may be needed.
I don't know... direct sequence generation doesn't seem possible.
Maybe a partially direct generation can be okay.

Hugs,
bearophile
 

Josiah Carlson

Alex Martelli's solution is *very* good, but there is a sampling
problem: put a simple print inside his program:

if not (len(results) % 5000):
    print len(results)

You can see that it slows down a lot when the dictionary contains
about 100000-120000 different sequences, because there are many
collisions, and it keeps slowing down. Probably a little speed-up of
this code won't help; a different algorithm may be needed.
I don't know... direct sequence generation doesn't seem possible.
Maybe a partially direct generation can be okay.

There are only 125,970 unique sequences of 12 items from 20 where order
does not matter, 20!/(12!8!), which is what Alex was generating, and
what "Lol McBride" asked for.

Removing the "ordering does not matter" constraint (by removing the
list.sort() call) will let it complete much faster, as the space to select
from becomes 60,339,831,552,000 (20!/8!) in size.
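
A quick sanity check of those figures (a minimal sketch; the helper is
just illustrative):

def n_choose_k(n, k):
    # k-item subsets of an n-item pool: n!/(k!*(n-k)!), built up
    # incrementally so each intermediate division is exact
    result = 1
    for i in range(k):
        result = result * (n - i) / (i + 1)
    return result

print n_choose_k(20, 12)          # 125970 unordered subsets

ordered = 1
for i in range(12):
    ordered = ordered * (20 - i)  # 20*19*...*9, i.e. 20!/8!
print ordered                     # 60339831552000 ordered selections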

Math saves the day!

- Josiah
 

bearophile

Josiah:
There are only 125,970 unique sequences of 12 items
from 20 where order does not matter (20!)/(12!8!),

Right! I was thinking the same thing *after* writing my last post ^_-

With random sampling, the generation of the last 10 combinations is a
bit slow; it requires a couple of minutes.
There are direct ways to compute them:
http://www.netlib.org/toms/382
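
One such direct method is combination unranking: map an integer index
straight to the index-th subset in lexicographic order, so distinct
indices can never collide. A minimal sketch of the idea (the helper
names are just illustrative):

import random

def n_choose_k(n, k):
    # n!/(k!*(n-k)!), built up incrementally so each division is exact
    result = 1
    for i in range(k):
        result = result * (n - i) / (i + 1)
    return result

def unrank_combo(idx, n, k):
    # return the idx-th k-subset of [1..n] in lexicographic order,
    # for 0 <= idx < n_choose_k(n, k)
    combo = []
    x = 1
    while k > 0:
        c = n_choose_k(n - x, k - 1)   # subsets whose smallest item is x
        if idx < c:
            combo.append(x)
            k = k - 1
        else:
            idx = idx - c
        x = x + 1
    return combo

# all 125970 subsets in random order, with no duplicate testing at all:
total = n_choose_k(20, 12)
combos = [unrank_combo(i, 20, 12) for i in random.sample(xrange(total), total)]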

Bye,
bearophile
 

Paul McGuire

bearophile said:
Alex Martelli's solution is *very* good, but there is a sampling
problem: put a simple print inside his program:

if not (len(results) % 5000):
    print len(results)

You can see that it slows down a lot when the dictionary contains
about 100000-120000 different sequences, because there are many
collisions, and it keeps slowing down. Probably a little speed-up of
this code won't help; a different algorithm may be needed.
I don't know... direct sequence generation doesn't seem possible.
Maybe a partially direct generation can be okay.

Hugs,
bearophile

Is part of this slowdown the fact that the loop will exit only when
len(results) reaches 200,000, when you have just shown that it can never be
greater than 125,970?

Perhaps some heuristic would opt for using sets instead of random.sample if
rlen approaches some threshold fraction of
len(seq)!/(picklen!*(len(seq)-picklen)!), perhaps 0.8 or so. There should be a
crossover point at which it is cheaper to build a set of all 125,970
combinations, and then select 100,000 of them using set.pop() (which selects
an arbitrary element from the set, and removes it).
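
A rough sketch of that crossover branch (here drawing with random.sample
over the enumerated list rather than set.pop(), which comes to the same
thing; the recursive generator just stands in for a library combinations
function):

import random

def all_combos(seq, k):
    # recursively yield every k-item subset of seq, in lexicographic order
    if k == 0:
        yield ()
        return
    for i in range(len(seq) - k + 1):
        for rest in all_combos(seq[i + 1:], k - 1):
            yield (seq[i],) + rest

def picks_exhaustive(seq, rlen, picklen):
    # enumerate the whole space once, then draw rlen subsets without
    # replacement -- no per-pick collision testing at all
    universe = list(all_combos(list(seq), picklen))
    if rlen > len(universe):
        raise ValueError("rlen exceeds the number of distinct subsets")
    return random.sample(universe, rlen)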

-- Paul
 

Paul McGuire

Paul McGuire said:
Is part of this slowdown the fact that the loop will exit only when
len(results) reaches 200,000, when you have just shown that it can never be
greater than 125,970?

Perhaps some heuristic would opt for using sets instead of random.sample if
rlen approaches some threshold fraction of
len(seq)!/(picklen!*(len(seq)-picklen)!), perhaps 0.8 or so. There should be a
crossover point at which it is cheaper to build a set of all 125,970
combinations, and then select 100,000 of them using set.pop() (which selects
an arbitrary element from the set, and removes it).

-- Paul
.... and of course, raising a MathError if rlen >
len(seq)!/(picklen!*(len(seq)-picklen)!)

-- Paul
 
