finding items that occur more than once in a list


Simon Forman

Is there a more efficient way to do this?

def f(L):
    '''Return a set of the items that occur more than once in L.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)


>>> f([0, 0, 1, 1, 2, 2, 3])
set([0, 1, 2])
 

Chris

Is there a more efficient way to do this?

def f(L):
    '''Return a set of the items that occur more than once in L.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)

>>> f([0, 0, 1, 1, 2, 2, 3])
set([0, 1, 2])

def f(L):
    D = dict()
    for item in L:
        if item in D:
            D[item] += 1
        else:
            D[item] = 1
    return [i for i, j in D.items() if j > 1]

That would be my way to do it; you'd need to test it over several
thousand iterations to see which one is more efficient, though.
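For instance, a minimal timeit harness along these lines could run that comparison (the function names, data size, and repeat count here are illustrative, not from the thread):

import random
import timeit

def f_remove(L):
    '''Simon's version: remove one copy of each distinct item.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)

def f_count(L):
    '''Chris's version: count occurrences in a dict.'''
    D = dict()
    for item in L:
        if item in D:
            D[item] += 1
        else:
            D[item] = 1
    return [i for i, j in D.items() if j > 1]

data = [random.randrange(1000) for _ in range(10000)]

for fn in (f_remove, f_count):
    t = timeit.timeit(lambda: fn(data), number=10)
    print("%s: %.3f seconds for 10 runs" % (fn.__name__, t))

On a list with many repeats the dict-counting version should come out ahead, since the remove-based version does a linear list.remove() per distinct item.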
 

Bryan Olson

Simon said:
Is there a more efficient way to do this?

def f(L):
    '''Return a set of the items that occur more than once in L.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)

That's neat, but quadratic time because list.remove() requires
a linear search. We can make an efficient variant by using
remove on a set rather than a list:

def multiples(lst):
    singles = set(lst)
    mults = set()
    for x in lst:
        if x in singles:
            singles.remove(x)
        else:
            mults.add(x)
    return mults

Though probably better is:

def multiples(lst):
    seen = set()
    mults = set()
    for x in lst:
        if x in seen:
            mults.add(x)
        else:
            seen.add(x)
    return mults


I've typically used dicts for such things, as in:

def multiples(lst):
    h = {}
    for x in lst:
        h[x] = h.get(x, 0) + 1
    return set([x for x in h if h[x] > 1])
 

Ninereeds

Just to throw in one more alternative, if you sort your list, you only
need to test adjacent items for equality rather than needing a search
for each unique item found. You should get O(n log n) rather than
O(n^2), since the performance bottleneck is now the sorting rather
than the searching for duplicates. Also, since the sort method is
built into the list, sorting performance should be very good.
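A minimal sketch of that sort-and-scan idea (the function name and details are mine, not from the thread):

def duplicates_sorted(lst):
    '''Return the items that occur more than once, via sorting.'''
    slst = sorted(lst)
    dups = []
    for prev, cur in zip(slst, slst[1:]):
        # equal neighbours mean a repeated item; record each one only once
        if prev == cur and (not dups or dups[-1] != cur):
            dups.append(cur)
    return dups

>>> duplicates_sorted([0, 0, 1, 1, 2, 2, 3])
[0, 1, 2]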

The dictionary version Chris suggests (and the essentially equivalent
set-based approach) is doing essentially the same thing in a way, but
using hashing rather than ordering to organise the list and spot
duplicates. This is *not* O(n) due to the rate of collisions
increasing as the hash table fills. If collisions are handled by
building a linked list for each hash table entry, the performance
overall is still O(n^2) since (for large amounts of data) there is
still a linear search on each collision. If collisions are handled by
building binary trees, the performance is back to O(n log n). That is,
for large enough datasets, the performance of hash tables is basically
the performance of the collision handling data structure (ignoring
scaling constants).

I suspect Python handles collisions using linked lists, since using
trees would require that all dictionary keys support ordered
comparisons. Using a dictionary or set to eliminate duplicates
therefore gives O(n^2) performance.
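To make the linked-list-per-bucket picture concrete, here is a toy separate-chaining counter with a fixed number of buckets (entirely illustrative; as the discussion below notes, CPython's dict actually uses open addressing rather than chaining):

class ChainedTable(object):
    '''Toy hash table using separate chaining: one Python list per bucket.'''

    def __init__(self, nbuckets=64):
        self.buckets = [[] for _ in range(nbuckets)]

    def increment(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for entry in chain:              # linear search within the chain
            if entry[0] == key:
                entry[1] += 1
                return
        chain.append([key, 1])

    def counts(self):
        for chain in self.buckets:
            for key, n in chain:
                yield key, n

def multiples_chained(lst):
    t = ChainedTable()
    for x in lst:
        t.increment(x)
    return set(k for k, n in t.counts() if n > 1)

With the bucket count fixed, the chains grow linearly with the input, which is exactly the behaviour the paragraph above worries about.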

That said, hash tables have very good scaling constants to their
performance, so the dictionary technique will be a very good performer
in general. And if the lists are reasonably small the performance will
often seem like O(n) in practice.

The sort-then-filter approach should still be competitive, but of
course it requires that the contents of the list can be ordered
consistently. An inappropriate hash function can similarly cause
problems for the dictionary approach, causing that theoretical O(n^2)
performance to become apparent very quickly. This is probably one
reason why the cookbook recipe recommends an O(n^2) searching-based
approach.
 

Hrvoje Niksic

Ninereeds said:
The dictionary version Chris suggests (and the essentially
equivalent set-based approach) is doing essentially the same thing
in a way, but using hashing rather than ordering to organise the
list and spot duplicates. This is *not* O(n) due to the rate of
collisions increasing as the hash table fills. If collisions are
handled by building a linked list for each hash table entry, the
performance overall is still O(n^2)

This doesn't apply to Python, which implements dict storage as an
open-addressed table and automatically (and exponentially) grows the
table when the number of entries approaches 2/3 of the table size.
Assuming a good hash function, filling the dict should yield amortized
constant time for individual additions.

I suspect Python handles collisions using linked lists,

Why suspect? It's trivial to check. dictobject.c says:

The basic lookup function used by all operations.
This is based on Algorithm D from Knuth Vol. 3, Sec. 6.4.
Open addressing is preferred over chaining since the link overhead for
chaining would be substantial (100% with typical malloc overhead).
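For anyone who wants to see how open addressing resolves collisions without per-entry links, a stripped-down sketch with plain linear probing and growth at 2/3 full might look like this (CPython's real probe sequence also mixes in higher hash bits and handles deletion with dummy entries, none of which is shown here):

class ToyDict(object):
    '''Toy open-addressing hash table with linear probing (illustration only).'''

    def __init__(self):
        self.size = 8
        self.used = 0
        self.slots = [None] * self.size   # each slot is (key, value) or None

    def _probe(self, key):
        # walk from the hash position until we find the key or an empty slot
        i = hash(key) % self.size
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % self.size
        return i

    def __setitem__(self, key, value):
        i = self._probe(key)
        if self.slots[i] is None:
            self.used += 1
        self.slots[i] = (key, value)
        if self.used * 3 >= self.size * 2:   # grow when about 2/3 full
            self._resize()

    def __getitem__(self, key):
        entry = self.slots[self._probe(key)]
        if entry is None:
            raise KeyError(key)
        return entry[1]

    def _resize(self):
        old = [e for e in self.slots if e is not None]
        self.size *= 4                       # grow geometrically
        self.used = 0
        self.slots = [None] * self.size
        for k, v in old:
            self[k] = v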
 

Ninereeds

Hrvoje said:
This doesn't apply to Python, which implements dict storage as an
open-addressed table and automatically (and exponentially) grows the
table when the number of entries approaches 2/3 of the table size.
Assuming a good hash function, filling the dict should yield amortized
constant time for individual additions.

OK. I obviously need to look up open-addressed tables. I thought this
was just, in effect, implicit linked listing, i.e. it still needs a
linear search to handle collisions; it just avoids the need for
explicitly stored link fields. Perhaps I'm mistaken.

As for the growth pattern, each time you grow the table you have to
redistribute all the items previously inserted to new locations.
Resizes would get rarer as more items are added due to the exponential
growth, but every table resize would take longer too since there are
more items to move. Inserting n items still intuitively looks like
O(n^2) to me.

That said, it does remind me of that old exponential realloc trick for
array resizing. Same thing, I suppose, since a hash table is basically
an array. Maybe my math "intuition" is just wrong.
 

Raymond Hettinger

Is there a more efficient way to do this?

def f(L):
    '''Return a set of the items that occur more than once in L.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)

>>> f([0, 0, 1, 1, 2, 2, 3])
set([0, 1, 2])


def f(L):
    seen = set()
    dups = set()
    for e in L:
        if e in seen:
            dups.add(e)
        else:
            seen.add(e)
    return dups



Raymond
 

Hrvoje Niksic

Ninereeds said:
As for the growth pattern, each time you grow the table you have to
redistribute all the items previously inserted to new locations.
Resizes would get rarer as more items are added due to the
exponential growth, but every table resize would take longer too
since there are more items to move. Inserting n items still
intuitively looks like O(n^2) to me.

That said, it does remind me of that old exponential realloc trick
for array resizing. Same thing, I suppose, since a hash table is
basically an array. Maybe my math "intuition" is just wrong.

That's it. While table resizes grow linearly in complexity, they
become geometrically rarer. This is exactly what happens when
resizing dynamic arrays such as Python lists. Applying your
intuition, appending n elements to a Python list would also have
O(n^2) complexity, which it doesn't. See, for example,
http://en.wikipedia.org/wiki/Dynamic_array#Geometric_expansion_and_amortized_cost
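The arithmetic: with doubling, appending n items copies at most n + n/2 + n/4 + ... < 2n elements across all resizes, so the amortized cost per append is constant. A quick simulation of a doubling policy shows this (the doubling factor is for illustration; CPython's actual over-allocation factor is smaller, but the principle is the same):

def total_copies(n, initial=8):
    '''Count element copies done by a doubling resize policy over n appends.'''
    capacity = initial
    size = 0
    copies = 0
    for _ in range(n):
        if size == capacity:
            copies += size       # a resize copies every existing element once
            capacity *= 2
        size += 1
    return copies

for n in (10**3, 10**4, 10**5, 10**6):
    c = total_copies(n)
    print("n = %d, total copies = %d, copies per append = %.2f" % (n, c, float(c) / n))

The copies-per-append figure stays bounded (below 2) no matter how large n gets.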
 

sturlamolden

def f(L):
    '''Return a set of the items that occur more than once in L.'''
    L = list(L)
    for item in set(L):
        L.remove(item)
    return set(L)


def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t : not(t[0]-t[1]), zip(slst[:-1],slst[1:]))]))
 

sturlamolden

def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t : not(t[0]-t[1]), zip(slst[:-1],slst[1:]))]))


Or perhaps better:

def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t : t[0] != t[1], zip(slst[:-1],slst[1:]))]))
 

Paul Rubin

sturlamolden said:
def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t : t[0] != t[1], zip(slst[:-1],slst[1:]))]))

The items are all comparable and you're willing to take them out of order?

from collections import defaultdict
def nonunique(lst):
    d = defaultdict(int)
    for x in lst: d[x] += 1
    return [x for x, n in d.iteritems() if n > 1]
 

sturlamolden

def nonunique(lst):
    slst = sorted(lst)
    return list(set([s[0] for s in
        filter(lambda t : t[0] != t[1], zip(slst[:-1],slst[1:]))]))

Obviously that should be 'lambda t : t[0] == t[1]'. Instead of using
the set function, there is more structure to exploit when the list is
sorted:


def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
        filter(lambda t : t[0] == t[1], zip(slst[:-1],slst[1:]))]
    return [dups[0]] + [s[1] for s in
        filter(lambda t : t[0] != t[1], zip(dups[:-1],dups[1:]))]
 

Arnaud Delobelle

def nonunique(lst):
   slst = sorted(lst)
   return list(set([s[0] for s in
       filter(lambda t : t[0] != t[1], zip(slst[:-1],slst[1:]))]))

Obviously that should be 'lambda t : t[0] == t[1]'. Instead of using
the set function, there is more structure to exploit when the list is
sorted:

def nonunique(lst):
   slst = sorted(lst)
   dups = [s[0] for s in
        filter(lambda t : t[0] == t[1], zip(slst[:-1],slst[1:]))]
   return [dups[0]] + [s[1] for s in
        filter(lambda t : t[0] != t[1], zip(dups[:-1],dups[1:]))]

Argh! What's wrong with something like:

def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k

>>> list(duplicates([1, 2, 3, 2, 2, 4, 1]))
[1, 2]
 

sturlamolden

def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
        filter(lambda t : t[0] == t[1], zip(slst[:-1],slst[1:]))]
    return [dups[0]] + [s[1] for s in
        filter(lambda t : t[0] != t[1], zip(dups[:-1],dups[1:]))]

Argh! What's wrong with something like:

def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k


Nice, and more readable. But I'd use Paul Rubin's solution. It is O(N),
as opposed to ours, which are O(N log N).
 

Arnaud Delobelle

Isn't this average constant time rather than amortized?

OK. I obviously need to look up open-addressed tables. I thought this
was just, in effect, implicit linked listing, i.e. it still needs a
linear search to handle collisions; it just avoids the need for
explicitly stored link fields. Perhaps I'm mistaken.

I don't think you are mistaken, but if I'm wrong I'd be grateful for a
link to further details.

Thanks
 

castironpi

Isn't this average constant time rather than amortized?


I don't think you are mistaken, but if I'm wrong I'd be grateful for a
link to further details.

Is there a known algorithm for which, given keys A and B such that
h(A) = h(B) for hash function h, we get h(A, 1) != h(B, 1), where
h(x, i + 1) is searched when h(x, i) is filled? That is, is the
search sequence unique (or at least orthogonal to the hash value)?

Could you just scramble the bits and hash key -> hash table -> hash
table?

Is deletion (resizing down) a strong bottleneck? Can you store an
object's search sequence as it looks for "a home", and when one of
its earlier slots becomes vacant, promote it?
 

Bryan Olson

Arnaud said:
Isn't this average constant time rather than amortized?

This is expected amortized constant time. Is that really the same
thing as average constant time? Hmmm... that's subtle.


The amortized doubling breaks that.
I don't think you are mistaken, but if I'm wrong I'd be grateful for a
link to further details.

Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".
 

castironpi

This is expected amortized constant time. Is that really the same
thing as average constant time? Hmmm... that's subtle.

It's the same as expected constant time.
The amortized doubling breaks that.
I don't think you are mistaken, but if I'm wrong I'd be grateful for a
link to further details.

Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".
>>> [(i, hash(2**i)) for i in range(0, 5000, 32)]
[(0, 1), (32, 1), (64, 1), (96, 1), (128, 1), (160, 1), (192, 1), (224, 1),
(256, 1), (288, 1), (320, 1), (352, 1), (384, 1), (416, 1), (448, 1), (480, 1),
(512, 1), (544, 1), (576, 1), (608, 1), (640, 1), (672, 1), (704, 1), (736, 1),
(768, 1), (800, 1), (832, 1), (864, 1), (896, 1), (928, 1), (960, 1), (992, 1),
(1024, 1), (1056, 1), (1088, 1), (1120, 1), (1152, 1), (1184, 1), (1216, 1),
(1248, 1), (1280, 1), (1312, 1), (1344, 1), (1376, 1), (1408, 1), (1440, 1),
(1472, 1), (1504, 1), (1536, 1), (1568, 1), (1600, 1), (1632, 1), (1664, 1),
(1696, 1), ...

Not so farfetched, all. Don't forget big-theta and big-omega.
 
