Removing duplicates from a list

Rubinho · Sep 14, 2005

I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

Method 1 (the traditional approach)

for x in mylist:
    if mylist.count(x) > 1:
        mylist.remove(x)

Method 2 (not so traditional)

mylist = set(mylist)
mylist = list(mylist)

Converting to a set drops all the duplicates and converting back to a
list, well, gets it back to a list which is what I want.

I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.

What do you think?

Cheers,

Robin

Thomas Guettler · Sep 14, 2005

On Wed, 14 Sep 2005 04:38:35 -0700, Rubinho wrote:

I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

mylist = set(mylist)
mylist = list(mylist)

Converting to a set drops all the duplicates and converting back to a
list, well, gets it back to a list which is what I want.

I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.

What do you think?

Hi,

I would use "set":

mylist=list(set(mylist))

Thomas

Will McGugan · Sep 14, 2005

Rubinho said:
I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

Method 1 (the traditional approach)

for x in mylist:
    if mylist.count(x) > 1:
        mylist.remove(x)

Method 2 (not so traditional)

mylist = set(mylist)
mylist = list(mylist)

Converting to a set drops all the duplicates and converting back to a
list, well, gets it back to a list which is what I want.

I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.

I would imagine that 2 would be significantly faster. Method 1 uses
'count' which must make a pass through every element of the list, which
would be slower than the efficient hashing that set does. I'm also not
sure about removing an element whilst iterating; I think that's a no-no.

Will McGugan

Peter Otten · Sep 14, 2005

Rubinho said:
I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

Method 1 (the traditional approach)

for x in mylist:
    if mylist.count(x) > 1:
        mylist.remove(x)

That would be an odd tradition:

>>> mylist = [1, 2, 1, 3, 2, 3]
>>> for x in mylist:
...     if mylist.count(x) > 1:
...         mylist.remove(x)
...
>>> mylist
[2, 1, 2, 3] # oops!

See "Unexpected Behavior Iterating over a Mutating Object"
http://mail.python.org/pipermail/python-list/2005-September/298993.html
thread for the most recent explanation.
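
A minimal sketch of why the loop goes wrong and one way around it, written for Python 3 (the thread's own code is Python 2). The function names are made up for illustration. The list iterator keeps an index into the list, so each remove() shifts the later items left and the next item is skipped; looping over a copy avoids that.

```python
def dedupe_while_iterating(items):
    """Buggy: removes from the same list it is iterating over."""
    items = list(items)          # work on a copy of the caller's list
    for x in items:
        if items.count(x) > 1:
            items.remove(x)      # later items shift left, so one gets skipped
    return items


def dedupe_over_snapshot(items):
    """Loops over a snapshot, so removals cannot disturb the iteration."""
    items = list(items)
    for x in items[:]:           # items[:] is an independent copy
        if items.count(x) > 1:
            items.remove(x)      # removes the earliest remaining occurrence
    return items


data = [1, 2, 1, 3, 2, 3]
print(dedupe_while_iterating(data))  # [2, 1, 2, 3]
print(dedupe_over_snapshot(data))    # [1, 2, 3]
```

The snapshot version is correct but still quadratic, since count() and remove() each scan the list, and it keeps the last occurrence of each value rather than the first.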

Rather, the traditional approach for an algorithmic problem in Python is to
ask Tim Peters, see his recipe at
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560/
(which predates Python's set class).
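
What follows is not the linked recipe, only a rough Python 3 sketch of one idea often used for this problem: try the fast hash-based route first and fall back to plain comparison when the elements turn out to be unhashable. The name unique_any is hypothetical.

```python
def unique_any(items):
    """Return distinct items in first-seen order, hashable or not."""
    items = list(items)
    try:
        # Fast path: expected linear time, needs hashable items.
        seen = set()
        result = []
        for item in items:
            if item not in seen:     # raises TypeError for unhashable items
                seen.add(item)
                result.append(item)
        return result
    except TypeError:
        # Slow path: quadratic, but only needs == to work.
        result = []
        for item in items:
            if item not in result:
                result.append(item)
        return result


print(unique_any([3, 1, 3, 2, 1]))      # [3, 1, 2]
print(unique_any([[1], [2], [1]]))      # [[1], [2]]
```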

Peter

Rubinho · Sep 14, 2005

Peter said:
That would be an odd tradition:

By tradition I wasn't really talking Python tradition; what I meant was
that the above pattern is similar to what would be generated by people
used to traditional programming languages.

>>> mylist = [1, 2, 1, 3, 2, 3]
>>> for x in mylist:
...     if mylist.count(x) > 1:
...         mylist.remove(x)
...
>>> mylist
[2, 1, 2, 3] # oops!

But you're absolutely right, it doesn't work! Oops indeed

I've gone with Thomas's suggestion above of: mylist=list(set(mylist))

Thanks,

Robin

martijn · Sep 14, 2005

I do this:

def unique(keys):
    unique = []
    for i in keys:
        if i not in unique: unique.append(i)
    return unique

I don't know what is faster at the moment.

Christian Stapfer · Sep 14, 2005

martijn said:
I do this:

def unique(keys):
    unique = []
    for i in keys:
        if i not in unique: unique.append(i)
    return unique

I don't know what is faster at the moment.

This is quadratic, O(n^2), in the length n of the list
if all keys are unique.
Conversion to a set just might use a better sorting
algorithm than this (i.e. n*log(n)) and throwing out
duplicates (which, after sorting, are positioned
next to each other) is O(n). If conversion
to a set should turn out to be slower than O(n*log(n))
[depending on the implementation], then you are well
advised to sort the list first (n*log(n)) and then
throw out the duplicate keys with a single walk over
the list. In this case you know at least what to
expect for large n...
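
The sort-then-walk idea can be sketched like this in Python 3. The name unique_sorted is made up, and the approach assumes the elements can be compared with each other. As far as I know, CPython's set is a hash table rather than anything sort-based, so list(set(x)) is expected to run in roughly linear time; the sorted variant is mainly useful when the elements are sortable but not hashable, or when sorted output is wanted anyway.

```python
def unique_sorted(items):
    """Distinct items in sorted order: O(n log n) sort plus one O(n) walk."""
    ordered = sorted(items)              # duplicates end up next to each other
    result = []
    for item in ordered:
        # Keep an item only if it differs from the last one kept.
        if not result or item != result[-1]:
            result.append(item)
    return result


print(unique_sorted([3, 1, 2, 3, 1]))    # [1, 2, 3]
```

Note that the original order of the list is lost.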

Regards,
Christian

Rocco Moretti · Sep 14, 2005

Rubinho said:
I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.

To add to what others said, I'd imagine that the technique that's going
to be fastest is going to depend not only on the length of the list, but
also the estimated redundancy. (i.e. a technique that gives good
performance with a list that has only one or two elements duplicated
might be painfully slow when there are 10-100 copies of each element.)

There really is no substitute for profiling with representative data sets.
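
One rough way to try this out in Python 3 is to time two approaches on data with different amounts of redundancy. Everything here (make_data, best_time, the sizes chosen) is an illustrative assumption, not something from the thread, and the numbers will vary from machine to machine.

```python
import random
import timeit


def unique_list_scan(keys):
    """Order-preserving; 'in' scans the result list on every step."""
    result = []
    for k in keys:
        if k not in result:
            result.append(k)
    return result


def unique_set(keys):
    """Hash-based; does not preserve order."""
    return list(set(keys))


def make_data(n, distinct, seed=0):
    """Build n items drawn from `distinct` possible values."""
    rng = random.Random(seed)
    return [rng.randrange(distinct) for _ in range(n)]


def best_time(func, data, repeat=3, number=10):
    """Best-of-`repeat` time in seconds for one call of func(data)."""
    timer = timeit.Timer(lambda: func(data))
    return min(timer.repeat(repeat=repeat, number=number)) / number


for distinct in (10, 1000):              # heavy vs. light redundancy
    data = make_data(1000, distinct)
    for func in (unique_list_scan, unique_set):
        usec = 1e6 * best_time(func, data)
        print("%4d distinct  %-16s %9.1f usec" % (distinct, func.__name__, usec))
```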

Steven D'Aprano · Sep 14, 2005

I would imagine that 2 would be significantly faster.

Don't imagine, measure.

Resist the temptation to guess. Write some test functions and time the two
different methods. But first test that the functions do what you expect:
there is no point having a blindingly fast bug.

Method 1 uses
'count' which must make a pass through every element of the list, which
would be slower than the efficient hashing that set does.

But count passes through the list in C and is also very fast. Is that
faster or slower than the hashing code used by sets? I don't know, and
I'll bet you don't either.

Will McGugan · Sep 14, 2005

Steven said:
Don't imagine, measure.

Resist the temptation to guess. Write some test functions and time the two
different methods. But first test that the functions do what you expect:
there is no point having a blindingly fast bug.

That is absolutely correct. Although I think you do sometimes have to
guess. Otherwise you would write multiple versions of every line of code.

But count passes through the list in C and is also very fast. Is that
faster or slower than the hashing code used by sets? I don't know, and
I'll bet you don't either.

Sure. But if I'm not currently optimizing I would go for the method with
the best behaviour, which usually means hashing rather than searching.
Even if it is actually slower, it's not likely to be _very_ slow.

Will McGugan

przemek drochomirecki · Sep 14, 2005

I've a list with duplicate members and I need to make each entry
unique.

I've come up with two ways of doing it and I'd like some input on what
would be considered more pythonic (or at least best practice).

Method 1 (the traditional approach)

for x in mylist:
    if mylist.count(x) > 1:
        mylist.remove(x)

Method 2 (not so traditional)

mylist = set(mylist)
mylist = list(mylist)

Converting to a set drops all the duplicates and converting back to a
list, well, gets it back to a list which is what I want.

I can't imagine one being much faster than the other except in the case
of a huge list and mine's going to typically have less than 1000
elements.

What do you think?

Cheers,

Robin

Hi,

Try this:

def unique(s):
    e = {}
    for x in s:
        if not e.has_key(x):
            e[x] = 1
    return e.keys()

Regards
Przemek

tcc.chapman · Sep 14, 2005

This works too, if speed isn't your thing..

>>> a = [ 1,2,3,2,6,1,3,4,1,7,5,6,7]
>>> a = dict( ( (i,None) for i in a)).keys()
>>> a
[1, 2, 3, 4, 5, 6, 7]

Steven Bethard · Sep 15, 2005

przemek said:
def unique(s):
    e = {}
    for x in s:
        if not e.has_key(x):
            e[x] = 1
    return e.keys()

This is basically identical in functionality to the code:

def unique(s):
    return list(set(s))

And with the new-and-improved C implementation of sets coming in Python
2.5, there's even more of a reason to use them when you can.

STeVe

drochom · Sep 15, 2005

Rubinho wrote:

I've a list with duplicate members and I need to make each entry
unique.

hi,

another possibility (my newest discovery):

>>> a = [1,2,2,4,2,1,3,4]
>>> unique = dict.fromkeys(a).keys()
>>> unique
[1, 2, 3, 4]

regards
przemek

martijn · Sep 15, 2005

Look at the code below

def unique(s):
    return list(set(s))

def unique2(keys):
    unique = []
    for i in keys:
        if i not in unique: unique.append(i)
    return unique

tmp = [0,1,2,4,2,2,3,4,1,3,2]
print tmp
print unique(tmp)
print unique2(tmp)
--------------------------
[0, 1, 2, 4, 2, 2, 3, 4, 1, 3, 2]
[0, 1, 2, 3, 4]
[0, 1, 2, 4, 3]

As you can see the end result is not the same.
I must get the end result [0, 1, 2, 4, 3] and not [0, 1, 2, 3, 4].
That's why I use unique2().
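
For readers on current Python: since Python 3.7 a dict keeps its keys in insertion order, so an order-preserving version can be sketched in one line. This goes beyond what was available in 2005, and the name unique_ordered is made up for illustration.

```python
def unique_ordered(items):
    """Distinct hashable items, in order of first appearance (Python 3.7+)."""
    # dict keys are unique and remember the order they were inserted in.
    return list(dict.fromkeys(items))


tmp = [0, 1, 2, 4, 2, 2, 3, 4, 1, 3, 2]
print(unique_ordered(tmp))   # [0, 1, 2, 4, 3]
```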

drochom · Sep 15, 2005

there wasn't any information about ordering...
maybe I'll find something better which doesn't destroy the original ordering

regards
przemek

drochom · Sep 15, 2005

I suppose this one is faster (but in most cases efficiency doesn't
matter):

def unique(s):
    e = {}    # elements seen so far
    ret = []  # unique elements, in original order
    for x in s:
        if not e.has_key(x):
            e[x] = 1
            ret.append(x)
    return ret

cheers,
przemek

martijn · Sep 15, 2005

Oh, thanks. I'm a newbie and I did this test (I don't know if this is
the best way to do a small speed test):

import timeit

def unique2(keys):
    unique = []
    for i in keys:
        if i not in unique: unique.append(i)
    return unique

def unique3(s):
    e = {}
    ret = []
    for x in s:
        if not e.has_key(x):
            e[x] = 1
            ret.append(x)
    return ret

tmp = [0,1,2,4,2,2,3,4,1,3,2]
s = """\
try:
    str.__nonzero__
except AttributeError:
    pass
"""
t = timeit.Timer(stmt=s)
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
print tmp
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
print unique2(tmp)
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
print unique3(tmp)
print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
---------------------
5.80 usec/pass
[0, 1, 2, 4, 2, 2, 3, 4, 1, 3, 2]
7.51 usec/pass
[0, 1, 2, 4, 3]
6.93 usec/pass
[0, 1, 2, 4, 3]
6.45 usec/pass <--- your code unique2(s):
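
Note that the statement handed to timeit.Timer above is the try/except snippet stored in s, so the four timings all measure that snippet and none of them measures unique2 or unique3. A hedged Python 3 sketch of timing the functions themselves could look like this; unique3 is rewritten with "not in" because has_key no longer exists in Python 3, and no timings are quoted since they depend on the machine.

```python
import timeit


def unique2(keys):
    unique = []
    for i in keys:
        if i not in unique:
            unique.append(i)
    return unique


def unique3(s):
    e = {}
    ret = []
    for x in s:
        if x not in e:       # Python 3 spelling of "not e.has_key(x)"
            e[x] = 1
            ret.append(x)
    return ret


tmp = [0, 1, 2, 4, 2, 2, 3, 4, 1, 3, 2]
passes = 100000
for func in (unique2, unique3):
    # Passing a callable makes timeit call the function under test.
    seconds = timeit.timeit(lambda: func(tmp), number=passes)
    print("%s: %.2f usec/pass" % (func.__name__, 1000000 * seconds / passes))
```

With an 11-element list the two are likely to come out close; the difference only shows up on longer lists with many distinct values.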

drochom · Sep 15, 2005

thanks, nice job. but this benchmark is pretty deceptive:

try this:
(definition of unique2 and unique3 as above)
51.52945844819817

unique2 has quadratic complexity
unique3 has amortized linear complexity
What does that mean?
It means that the speed of unique2 strongly depends on
len(unique2(a)): the more distinct elements there are in a, the greater
the difference in execution time between the two implementations.
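
To make the quadratic behaviour concrete without relying on a stopwatch, here is a small Python 3 sketch that mimics the list-scanning approach of unique2 but counts element comparisons. The function name is hypothetical. For n distinct elements the count is n*(n-1)/2, while a list made of one repeated value needs only n-1 comparisons.

```python
def count_list_scan_comparisons(keys):
    """Same result as the list-scan approach, plus a comparison count."""
    result = []
    comparisons = 0
    for k in keys:
        found = False
        for seen in result:          # what "k not in result" does internally
            comparisons += 1
            if seen == k:
                found = True
                break
        if not found:
            result.append(k)
    return result, comparisons


for n in (10, 100, 1000):
    _, c = count_list_scan_comparisons(range(n))   # all elements distinct
    print(n, c)                                    # 45, 4950, 499500

_, c = count_list_scan_comparisons([0] * 1000)     # one distinct element
print(c)                                           # 999
```

A dict- or set-based version does roughly one hash lookup per element regardless of how many distinct values there are.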

regards
przemek

martijn · Sep 16, 2005

Thanks for all the information.
And now I understand the timeit module

GC-Martijn
