testing for uniqueness in a large list


Lol McBride

Hi All,
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time it takes to a minimum?

Thanks for taking the time to read this and doubly so if you're able to
help.
Cheers,
Lol McBride

#!/usr/bin/python
from random import *
# seq is the pool to choose from
seq = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
# this is how many times to choose a unique selection
rlen = 200000
counter = 100
ints = [0]
while 1:
    # if this is the last value for rlen then we stop
    if rlen == 0:
        break
    intLen = len(ints)
    cnt = 0
    testVal = sample(seq, 12)
    testVal.sort()
    if len(ints) >= counter:
        print len(ints)
        counter = counter + 100
    while 1:
        if ints.count(testVal) == 1:
            cnt = cnt + 1
            break
        intLen = intLen - 1
        if intLen == 0:
            break
        if cnt == 1:
            continue
        else:
            ints.append(testVal)
            rlen = rlen - 1
 

Cliff Wells

Hi All,
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time it takes to a minimum?

Thanks for taking the time to read this and doubly so if you're able to
help.
Cheers,
Lol McBride

#!/usr/bin/python
from random import *
# seq is the pool to choose from
seq = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
# this is how many times to choose a unique selection
rlen = 200000
counter = 100
ints = [0]
while 1:
    # if this is the last value for rlen then we stop
    if rlen == 0:
        break
    intLen = len(ints)
    cnt = 0
    testVal = sample(seq, 12)
    testVal.sort()
    if len(ints) >= counter:
        print len(ints)
        counter = counter + 100
    while 1:
        if ints.count(testVal) == 1:
            cnt = cnt + 1
            break
        intLen = intLen - 1
        if intLen == 0:
            break
        if cnt == 1:
            continue
        else:
            ints.append(testVal)
            rlen = rlen - 1

Disregarding more efficient methods of generating your list (of which
I'm sure there are many), part of the "interminable time" is probably
taken up in the infinite loop you have ;) Set rlen to a much smaller
number and you'll see that the program still never terminates.

You'll notice that:

ints = [0]
while 1:
    ...
    intLen = len(ints) # aka 1
    ...
    while 1:
        ...
        intLen = intLen - 1 # aka 0
        if intLen == 0: # always true
            break
        ...



Basically this means that nothing is ever appended to ints nor is rlen
ever decremented (and these are the tests by which the loops terminate).
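
For what it's worth, a minimal repair of that loop, assuming the inner
while was only ever meant to be a linear membership test (Python's "in"
operator does that scan for you), reusing seq, rlen and sample from the
original script:

ints = []                       # no [0] placeholder needed
while rlen > 0:
    testVal = sample(seq, 12)
    testVal.sort()
    if testVal not in ints:     # linear scan -- still the slow part
        ints.append(testVal)
        rlen = rlen - 1

This version terminates, but only if rlen doesn't exceed the number of
distinct 12-item subsets of seq, and each candidate still costs a scan
of the whole list.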

I hear that Python 3000 will complete infinite loops in half the time
however ;)

Regards,
Cliff
 

Alex Martelli

Lol McBride said:
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time it takes to a minimum?

One word: dictionaries!

Untested, but should work...:


import random   # don't use from...import *, on general grounds

def picks(seq=xrange(1, 21), rlen=200000, picklen=12):
    results = {}
    while len(results) < rlen:
        pick = random.sample(seq, picklen)
        pick.sort()
        pick = tuple(pick)
        results[pick] = 1
    return results.keys()

this returns a list of tuples; if you need a list of lists,

    return [list(pick) for pick in results]

In 2.4, you could use "results = set()" instead of {}, results.add(pick) to
maybe add a new pick, and possibly "return list(results)" if you need a
list of tuples as the function's overall return value (the list
comprehension for the case in which you need a list of lists stays OK).
But that's just an issue of readability, not speed. Slight speedups can
be had by hoisting some lookups out of the while loop, but nothing
major, I think -- e.g. sample = random.sample just before the while, and
use sample(seq, picklen) within the while. Putting all the code in a
function and using local variables IS a major speed win -- local
variable setting and access is WAY faster than globals'.
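
Putting those two ideas together, a minimal sketch of the 2.4 variant
with the lookups hoisted might look like:

import random

def picks(seq=xrange(1, 21), rlen=200000, picklen=12):
    sample = random.sample      # hoisted: one global lookup, not one per pick
    results = set()             # 2.4 built-in set instead of a dict
    add = results.add
    while len(results) < rlen:
        pick = sample(seq, picklen)
        pick.sort()
        add(tuple(pick))        # duplicates are silently absorbed
    return list(results)        # a list of tuples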


Alex
 

Lol McBride

Lol McBride said:
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time it takes to a minimum?

One word: dictionaries!

Untested, but should work...:


import random   # don't use from...import *, on general grounds

def picks(seq=xrange(1, 21), rlen=200000, picklen=12):
    results = {}
    while len(results) < rlen:
        pick = random.sample(seq, picklen)
        pick.sort()
        pick = tuple(pick)
        results[pick] = 1
    return results.keys()

this returns a list of tuples; if you need a list of lists,

    return [list(pick) for pick in results]

In 2.4, you could use "results = set()" instead of {}, results.add(pick) to
maybe add a new pick, and possibly "return list(results)" if you need a
list of tuples as the function's overall return value (the list
comprehension for the case in which you need a list of lists stays OK).
But that's just an issue of readability, not speed. Slight speedups can
be had by hoisting some lookups out of the while loop, but nothing
major, I think -- e.g. sample = random.sample just before the while, and
use sample(seq, picklen) within the while. Putting all the code in a
function and using local variables IS a major speed win -- local
variable setting and access is WAY faster than globals'.


Alex
Thank you Alex - I'll try your method and see how I get on.
Lol
 

bearophile

Alex Martelli's solution is *very* good, but there is a sampling
problem: put a simple print inside his program:

if not (len(results) % 5000):
    print len(results)

You can see that it slows down a lot when the dictionary contains
about 100000-120000 different sequences, because there are many
collisions, and it keeps slowing down. Probably a little speed-up of
this code won't help; a different algorithm may be needed.
I don't know... direct sequence generation doesn't seem possible.
Maybe a partially direct generation can be okay.

Hugs,
bearophile
 

Josiah Carlson

Alex Martelli's solution is *very* good, but there is a sampling
problem: put a simple print inside his program:

if not (len(results) % 5000):
    print len(results)

You can see that it slows down a lot when the dictionary contains
about 100000-120000 different sequences, because there are many
collisions, and it keeps slowing down. Probably a little speed-up of
this code won't help; a different algorithm may be needed.
I don't know... direct sequence generation doesn't seem possible.
Maybe a partially direct generation can be okay.

There are only 125,970 unique sequences of 12 items from 20 where order
does not matter, 20!/(12!8!), which is what Alex was generating, and
what "Lol McBride" asked for.

Removing the "ordering does not matter" constraint (by removing the
list.sort() call) will let it complete much faster, as the space to select
from becomes 60,339,831,552,000 (20!/8!) in size.
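
A quick sanity check of those figures (a minimal sketch; the helper is
just illustrative):

def n_choose_k(n, k):
    # k-item subsets of an n-item pool: n!/(k!*(n-k)!), built up
    # incrementally so each intermediate division is exact
    result = 1
    for i in range(k):
        result = result * (n - i) / (i + 1)
    return result

print n_choose_k(20, 12)          # 125970 unordered subsets

ordered = 1
for i in range(12):
    ordered = ordered * (20 - i)  # 20*19*...*9, i.e. 20!/8!
print ordered                     # 60339831552000 ordered selections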

Math saves the day!

- Josiah
 

bearophile

Josiah:
There are only 125,970 unique sequences of 12 items
from 20 where order does not matter (20!)/(12!8!),

Right! I was thinking the same thing *after* writing my last post ^_-

With random sampling, the generation of the last 10 combinations is a
bit slow; it requires a couple of minutes.
There are direct ways to compute them:
http://www.netlib.org/toms/382
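
One such direct method is combination unranking: map an integer index
straight to the index-th subset in lexicographic order, so distinct
indices can never collide. A minimal sketch of the idea (the helper
names are just illustrative):

import random

def n_choose_k(n, k):
    # n!/(k!*(n-k)!), built up incrementally so each division is exact
    result = 1
    for i in range(k):
        result = result * (n - i) / (i + 1)
    return result

def unrank_combo(idx, n, k):
    # return the idx-th k-subset of [1..n] in lexicographic order,
    # for 0 <= idx < n_choose_k(n, k)
    combo = []
    x = 1
    while k > 0:
        c = n_choose_k(n - x, k - 1)   # subsets whose smallest item is x
        if idx < c:
            combo.append(x)
            k = k - 1
        else:
            idx = idx - c
        x = x + 1
    return combo

# all 125970 subsets in random order, with no duplicate testing at all:
total = n_choose_k(20, 12)
combos = [unrank_combo(i, 20, 12) for i in random.sample(xrange(total), total)]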

Bye,
bearophile
 

Paul McGuire

bearophile said:
Alex Martelli's solution is *very* good, but there is a sampling
problem: put a simple print inside his program:

if not (len(results) % 5000):
    print len(results)

You can see that it slows down a lot when the dictionary contains
about 100000-120000 different sequences, because there are many
collisions, and it keeps slowing down. Probably a little speed-up of
this code won't help; a different algorithm may be needed.
I don't know... direct sequence generation doesn't seem possible.
Maybe a partially direct generation can be okay.

Hugs,
bearophile

Is part of this slowdown the fact that the loop will exit only when
len(results) reaches 200,000, when you have just shown that it can never be
greater than 125,970?

Perhaps some heuristic would opt for using sets instead of random.sample if
rlen approaches some threshold fraction of
len(seq)!/(picklen!*(len(seq)-picklen)!), perhaps 0.8 or so. There should be a
crossover point at which it is cheaper to build a set of all 125,970
combinations, and then select 100,000 of them using set.pop() (which selects
an arbitrary element from the set, and removes it).
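
A rough sketch of that crossover branch (here drawing with random.sample
over the enumerated list rather than set.pop(), which comes to the same
thing; the recursive generator just stands in for a library combinations
function):

import random

def all_combos(seq, k):
    # recursively yield every k-item subset of seq, in lexicographic order
    if k == 0:
        yield ()
        return
    for i in range(len(seq) - k + 1):
        for rest in all_combos(seq[i + 1:], k - 1):
            yield (seq[i],) + rest

def picks_exhaustive(seq, rlen, picklen):
    # enumerate the whole space once, then draw rlen subsets without
    # replacement -- no per-pick collision testing at all
    universe = list(all_combos(list(seq), picklen))
    if rlen > len(universe):
        raise ValueError("rlen exceeds the number of distinct subsets")
    return random.sample(universe, rlen)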

-- Paul
 

Paul McGuire

Paul McGuire said:
Is part of this slowdown the fact that the loop will exit only when
len(results) reaches 200,000, when you have just shown that it can never be
greater than 125,970?

Perhaps some heuristic would opt for using sets instead of random.sample if
rlen approaches some threshold fraction of
len(seq)!/(picklen!*(len(seq)-picklen)!), perhaps 0.8 or so. There should be a
crossover point at which it is cheaper to build a set of all 125,970
combinations, and then select 100,000 of them using set.pop() (which selects
an arbitrary element from the set, and removes it).

-- Paul
.... and of course, raising a MathError if rlen >
len(seq)!/(picklen!*(len(seq)-picklen)!)

-- Paul
 
