delete duplicates in list

christof hoeke

hello,
this must have come up before, so i am already sorry for asking but a
quick googling did not give me any answer.

i have a list from which i want a simpler list without the duplicates
an easy but somehow contrived solution would be
>>> a = [1, 2, 2, 3]
>>> d = {}.fromkeys(a)
>>> b = d.keys()
>>> print b
[1, 2, 3]

there should be an easier or more intuitive solution, maybe with a list
comprehension?

something like
>>> b = [x for x in a if x not in b]
>>> print b
[]

does not work though.

thanks for any help
chris
 
Diez B. Roggisch

Hi,
> this must have come up before, so i am already sorry for asking but a
> quick googling did not give me any answer.
>
> i have a list from which i want a simpler list without the duplicates
> an easy but somehow contrived solution would be

In python 2.3, this should work:

import sets
l = [1,2,2,3]
s = sets.Set(l)

A set is a container that maintains the invariant that no object appears
in it twice. So it suits your needs.
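
If a plain list is wanted back, wrapping the set in list() should do it (a
small sketch -- note the original order is not guaranteed):

>>> import sets
>>> list(sets.Set([1, 2, 2, 3]))
[1, 2, 3]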

Regards,

Diez
 
Alex Martelli

christof hoeke wrote:
...
> i have a list from which i want a simpler list without the duplicates

Canonical is:

import sets
simplerlist = list(sets.Set(thelist))

if you're all right with destroying order, as your example solution suggests.
But dict.fromkeys(a).keys() is probably faster. Your assertion:

> there should be an easier or more intuitive solution, maybe with a list
> comprehension?

doesn't seem self-evident to me. A list-comprehension might be, e.g:

[ x for i, x in enumerate(a) if i==a.index(x) ]

and it does have the advantages of (a) keeping order AND (b) not
requiring hashable (nor even inequality-comparable!) elements -- BUT
it has the non-indifferent cost of being O(N*N) while the others
are about O(N). If you really want something similar to your approach:
> b = [x for x in a if x not in b]

you'll have, o horrors!-), to do a loop, so name b is always bound to
"the result list so far" (in the LC, name b is only bound at the end):

b = []
for x in a:
    if x not in b:
        b.append(x)

However, this is O(N*N) too. In terms of "easier or more intuitive",
I suspect only this latter solution might qualify.
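
For completeness: a plain loop can keep order and stay roughly O(N) by
tracking already-seen elements in a dict -- a minimal sketch, at the price
of requiring hashable elements again:

seen = {}
b = []
for x in a:
    if x not in seen:      # average O(1) membership test via hashing
        seen[x] = True
        b.append(x)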


Alex
 
Alex Martelli

Bernard said:
> > >>> a = [1, 2, 2, 3]
> > >>> d = {}.fromkeys(a)
> > >>> b = d.keys()
> > >>> print b
> > [1, 2, 3]
>
> That, or using a Set (python 2.3+). Actually I seem to recall that
> "The Python CookBook" still advises building a dict as the fastest
> solution - if your elements can be hashed. See the details at
>
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560

Yep, but a Set is roughly as fast as a dict -- it has a small
penalty wrt a dict, but, emphasis on small. Still, if you're
trying to squeeze every last cycle out of a bottleneck, it's
worth measuring, and perhaps moving from Set to dict.
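
For instance, a rough comparison along these lines (a sketch -- the exact
numbers will of course depend on your data and build):

timeit.py -c -s'import sets' -s'a=range(1000)*2' 'list(sets.Set(a))'
timeit.py -c -s'a=range(1000)*2' '{}.fromkeys(a).keys()'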


Alex
 
Alex Martelli

Diez B. Roggisch wrote:

> > this must have come up before, so i am already sorry for asking but a
> > quick googling did not give me any answer.
> >
> > i have a list from which i want a simpler list without the duplicates
> > an easy but somehow contrived solution would be
>
> In python 2.3, this should work:
>
> import sets
> l = [1,2,2,3]
> s = sets.Set(l)
>
> A set is a container that maintains the invariant that no object appears
> in it twice. So it suits your needs.

...except it's not a LIST, which was part of the specifications given
by the original poster. It may, of course, be that you've read his
mind correctly and that, despite his words, he doesn't really care
whether he gets a list or a very different container:).


Alex
 
Diez B. Roggisch

Hi,
> ...except it's not a LIST, which was part of the specifications given
> by the original poster. It may, of course, be that you've read his
> mind correctly and that, despite his words, he doesn't really care
> whether he gets a list or a very different container:).

You're right - mathematically. However, there is no such thing as a set in
computers - you always end up with some sort of list :)

So - he'll have a list anyway. But does it respect the order of the list
parameter... <just checking, standby>

... nope:

>>> l = [1,2,3,2]
>>> import sets
>>> sets.Set(l)
Set([1, 2, 3])
>>> l = [2,1,2,3,2]
>>> sets.Set(l)
Set([1, 2, 3])

Which is of course reasonable, as the check for existence in the set might
be performed in O(ln n) instead of O(n).

Regards,

Diez
 
Hannu Kankaanpää

Diez B. Roggisch said:
> Hi,
>
> You're right - mathematically. However, there is no such thing as a set in
> computers - you always end up with some sort of list :)

Those are just implementation details. There could be a group of monkeys
emulating Python under the hood, and their implementation of a set
would be a neural network instead of any kind of sequence, but you
still wouldn't care as a programmer. The only thing that matters is
whether the interface stays the same. And nope, the items in a set are
no longer accessible by their index (among other differences).
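
For instance, a quick sketch of that difference (the exact error message
may vary):

>>> import sets
>>> s = sets.Set([1, 2, 3])
>>> s[0]
Traceback (most recent call last):
  ...
TypeError: ...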

> So - he'll have a list anyway.

So I can't agree with this. You don't know if his Python virtual machine
is a group of monkeys. Python is supposed to be a high level language.

> But does it respect the order of the list
> parameter... <just checking, standby>
>
> ... nope:
>
> >>> l = [1,2,3,2]
> >>> import sets
> >>> sets.Set(l)
> Set([1, 2, 3])
> >>> l = [2,1,2,3,2]
> >>> sets.Set(l)
> Set([1, 2, 3])
>
> Which is of course reasonable, as the check for existence in the set might
> be performed in O(ln n) instead of O(n).

Actually the Set in sets module has an average lookup of O(1), worst
case O(n) (not 100% sure of worst case, but 99% sure). It's been
implemented with dictionaries, which in turn are hash tables.
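
You can even peek at the underlying dict -- an implementation detail (the
_data attribute of the 2.3 sets module), so nothing to rely on:

>>> import sets
>>> s = sets.Set([1, 2, 3])
>>> s._data
{1: True, 2: True, 3: True}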
 
Duncan Booth

> Your assertion:
>
> > there should be an easier or more intuitive solution, maybe with a list
> > comprehension?
>
> doesn't seem self-evident to me. A list-comprehension might be, e.g:
>
> [ x for i, x in enumerate(a) if i==a.index(x) ]
>
> and it does have the advantages of (a) keeping order AND (b) not
> requiring hashable (nor even inequality-comparable!) elements -- BUT
> it has the non-indifferent cost of being O(N*N) while the others
> are about O(N). If you really want something similar to your approach:
>
> > b = [x for x in a if x not in b]
>
> you'll have, o horrors!-), to do a loop, so name b is always bound to
> "the result list so far" (in the LC, name b is only bound at the end):
>
> b = []
> for x in a:
>     if x not in b:
>         b.append(x)
>
> However, this is O(N*N) too. In terms of "easier or more intuitive",
> I suspect only this latter solution might qualify.

I dunno about more intuitive, but here's a fairly simple list comprehension
solution which is O(N) and preserves the order. Of course it's back to
requiring hashable elements again:

>>> startList = [5,1,2,1,3,4,2,5,3,4]
>>> d = {}
>>> [ d.setdefault(x,x) for x in startList if x not in d ]
[5, 1, 2, 3, 4]

And for the 'I must do it on one line' freaks, here's the single expression
variant of the above: :^)
>>> [ d.setdefault(x,x) for d in [{}] for x in startList if x not in d ]
[5, 1, 2, 3, 4]
 
Alex Martelli

Hannu Kankaanpää wrote:
...
> So I can't agree with this. You don't know if his Python virtual machine
> is a group of monkeys. Python is supposed to be a high level language.

So, in that case those monkeys would surely be perched up on tall trees.

> Actually the Set in sets module has an average lookup of O(1), worst
> case O(n) (not 100% sure of worst case, but 99% sure). It's been

Hmmm -- could you give an example of that worst case...? A _full_
hashtable would give such behavior, but Python's dicts always ensure
the underlying hashtables aren't too full...

> implemented with dictionaries, which in turn are hash tables.

Yep - check it out...:

[alex@lancelot bo]$ timeit.py -c -s'import sets' -s'x=sets.Set(xrange(100))'
'7 in x'
100000 loops, best of 3: 2.3 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(1000))' '7 in x'
100000 loops, best of 3: 2.3 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(10000))' '7 in x'
100000 loops, best of 3: 2.2 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(100000))' '7 in x'
100000 loops, best of 3: 2.2 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(1000000))' '7 in x'
100000 loops, best of 3: 2.3 usec per loop

see the pattern...?-)


Same when something is NOT there, BTW:

[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(1000000))' 'None in x'
100000 loops, best of 3: 2.3 usec per loop

etc, etc.
 
Michael Hudson

Alex Martelli said:
> Hmmm -- could you give an example of that worst case...? A _full_
> hashtable would give such behavior, but Python's dicts always ensure
> the underlying hashtables aren't too full...

Try inserting a bunch of instances of

class C:
    def __hash__(self): return 0

into a dictionary.
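
A rough sketch of the effect (every instance hashes to the same value, so
each insert has to probe past all the previous entries):

import time

class C:
    def __hash__(self): return 0

d = {}
t0 = time.time()
for i in xrange(1000):
    d[C()] = None       # all keys collide, so this probe chain keeps growing
print time.time() - t0  # total time grows quadratically in the number of keys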

Cheers,
mwh
 
Corrado Gioannini

> there should be an easier or more intuitive solution, maybe with a list
> comprehension?
>
> something like
>
> >>> b = [x for x in a if x not in b]
> >>> print b
> []
>
> does not work though.

if you want a list comp solution, similar to the one you proposed and valid
also with python < 2.3, you could do:

>>> a = [1, 2, 2, 3]
>>> b = []
>>> [b.append(x) for x in a if x not in b]
[None, None, None]
>>> b
[1, 2, 3]

or even:
>>> a = [1, 2, 2, 3]
>>> [b.append(x) for b in [[]] for x in a if x not in b]
[None, None, None]
>>> b
[1, 2, 3]

Corrado.
 
Thomas Heller

Michael Hudson said:
> Try inserting a bunch of instances of
>
> class C:
>     def __hash__(self): return 0
>
> into a dictionary.

I've thought about using something like this in production code
to be able to store mutable instances in a dict.
Performance problems aside (since there are only a couple of key/value
pairs in the dict), is it such a bad idea?

Thomas
 
Michael Hudson

Thomas Heller said:
> I've thought about using something like this in production code
> to be able to store mutable instances in a dict.

Eek!

> Performance problems aside (since there are only a couple of key/value
> pairs in the dict), is it such a bad idea?

Um, I don't know actually. It pretty much defeats the point of using
a hashtable. If your class has some non-mutable parts it'd be an
idea to hash them, of course.

And the performance problems are severe -- inserting just a thousand
instances of the class C() into a dictionary takes noticeable time.

(What *is* a bad idea is giving such classes an __eq__ method that
mutates the dict being inserted into -- I found a bunch of such bugs a
couple of years back, and I think it's still possible to core Python (via
stack overflow) by being even more devious).

Cheers,
mwh
 
Bernhard Herzog

Thomas Heller said:
> I've thought about using something like this in production code
> to be able to store mutable instances in a dict.
> Performance problems aside (since there are only a couple of key/value
> pairs in the dict), is it such a bad idea?

IMO it depends on what equality means for instances. E.g. if two
instances are only equal if they're identical, i.e. a == b is equivalent
to a is b, then defining __hash__ can be very useful, because then you
can use them in dictionaries and mutability doesn't really matter
because no change to one instance can make it equal to another
instance.

I'd define __hash__ to return id(self), though, so that the hash values
are different for different instances to reduce collisions.
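
A minimal sketch of that idiom (a hypothetical class, just for illustration):

class Mutable:
    # equality here is identity, so an id()-based hash stays valid
    # even when the instance mutates
    def __hash__(self):
        return id(self)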

This also seems to be what class objects in Python do:

>>> class C:
...     pass
...
>>> hash(C) == id(C)
True
Bernhard
 
Michael Hudson

Bernhard Herzog said:
> IMO it depends on what equality means for instances. E.g. if two
> instances are only equal if they're identical, i.e. a == b is equivalent
> to a is b, then defining __hash__ can be very useful, because then you
> can use them in dictionaries and mutability doesn't really matter
> because no change to one instance can make it equal to another
> instance.

Um. In that case why define __hash__ or __eq__ at all?

Cheers,
mwh
 
Bernhard Herzog

Michael Hudson said:
> Um. In that case why define __hash__ or __eq__ at all?

Good question. I didn't know that that was the default implementation of
__hash__. It's not documented AFAICT, but it does seem to be the case. I
wonder why, even though it's useful.

The language reference for __hash__ says:

    If a class does not define a __cmp__() method it should not define a
    __hash__() operation either;

Seems a bit outdated in that it only mentions __cmp__ here and not
__eq__ as well, as it does right after the semicolon, but that would not
indicate to me that hash works for an instance without a __hash__ method.
Of course the sentence above could be taken to mean that if you define
neither method the defaults do the right thing, but that would seem very
obscure to me.

Anyway, I've found that if you end up comparing objects very often it
can be a substantial performance gain to define __eq__ even though it
ends up doing the same as the default comparison. At least for old-style
classes, when __eq__ is not defined Python tries __cmp__ and __coerce__,
and __getattr__, if defined, is called for all of them.

Then, once you have defined __eq__ you also need to have __hash__ if you
want your objects to be hashable.
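
A minimal sketch of that pairing (a hypothetical Point class):

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return isinstance(other, Point) and \
               (self.x, self.y) == (other.x, other.y)
    def __hash__(self):
        # hash exactly the state that __eq__ compares, so that
        # equal objects always hash equal
        return hash((self.x, self.y))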

Bernhard
 
Alex Martelli

Bernhard Herzog wrote:
...
> Good question. I didn't know that that was the default implementation of
> __hash__. It's not documented AFAICT, but it does seem to be the case. I

I don't know why the official docs don't mention this fact -- I do
mention it in "Python in a Nutshell", of course (p. 132).

> wonder why, even though it's useful.

The reason this is done is because it's useful.

> Anyway, I've found that if you end up comparing objects very often it
> can be a substantial performance gain to define __eq__ even though it
> ends up doing the same as the default comparison. At least for old-style
> classes, when __eq__ is not defined Python tries __cmp__ and __coerce__,
> and __getattr__, if defined, is called for all of them.

Yes, but making your classes newstyle would be a MUCH greater performance
boost in this situation. See:

[alex@lancelot bo]$ timeit.py -c -s'class X:pass' -s'x=X()' 'x==None'
100000 loops, best of 3: 3.5 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'class X:
    def __eq__(self, other): return id(self)==id(other)' -s'x=X()' 'x==None'
100000 loops, best of 3: 2.1 usec per loop

[alex@lancelot bo]$ timeit.py -c -s'class X(object):pass' -s'x=X()'
'x==None'
1000000 loops, best of 3: 0.42 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'class X(object):
    def __eq__(self, other): return id(self)==id(other)' -s'x=X()' 'x==None'
100000 loops, best of 3: 2.1 usec per loop

Instead of saving 40% of the comparison time by defining __eq__, why not
save 80+% of it, instead, just by making the class newstyle? Note that
the comparison time is avoud equal for both old and new style classes if
you do define an __eq__ -- but while it's a 40% speedup for oldstyle
ones, it's a slowdown by 5 times for newstyle ones.

> Then, once you have defined __eq__ you also need to have __hash__ if you
> want your objects to be hashable.

Another good performance reason to make the class newstyle and thus
avoid having to define __eq__, IMHO.

These days the only case where I use oldstyle classes (except in Q&D
scripts where I simply forget to say '(object)', but that's getting
rare as the good habit becomes ingrained:) is for exception classes,
which (for reasons that escape me) ARE supposed to be oldstyle. The
exception that proves the rule?-)


Alex
 
Michael Hudson

Bernhard Herzog said:
> Good question. I didn't know that that was the default implementation of
> __hash__. It's not documented AFAICT, but it does seem to be the case. I
> wonder why, even though it's useful.

Well, I didn't so much know that this was the default implementation
of __hash__ as know that instances of classes that don't do anything
to the contrary compare equal iff identical and are usefully hashable.
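
For instance, a quick check (addresses elided):

>>> class C: pass
...
>>> a, b = C(), C()
>>> a == a, a == b
(True, False)
>>> {a: 1, b: 2}
{<__main__.C instance at 0x...>: 1, <__main__.C instance at 0x...>: 2}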

(Python Mystery Theatre question: for such a class, when can you have
id(ob) != hash(ob)?)

> The language reference for __hash__ says:
>
>     If a class does not define a __cmp__() method it should not define a
>     __hash__() operation either;

> Seems a bit outdated in that it only mentions __cmp__ here and not
> __eq__ as well, as it does right after the semicolon, but that would not
> indicate to me that hash works for an instance without a __hash__ method.

I hadn't had the distraction of reading this section of the language
reference :) Did you seriously think that you *had* to define
__hash__ to make instances of your classes hashable?

> Of course the sentence above could be taken to mean that if you
> define neither method the defaults do the right thing, but that
> would seem very obscure to me.

Well, this is Python after all. I would have read

    if it defines __cmp__() or __eq__() but not __hash__(), its
    instances will not be usable as dictionary keys.

to imply that in other situations that instances would be usable as
dictionary keys.

> Anyway, I've found that if you end up comparing objects very often it
> can be a substantial performance gain to define __eq__ even though it
> ends up doing the same as the default comparison. At least for old-style
> classes, when __eq__ is not defined Python tries __cmp__ and __coerce__,
> and __getattr__, if defined, is called for all of them.

Old-style classes? What are they? :) Didn't know about that.

> Then, once you have defined __eq__ you also need to have __hash__ if you
> want your objects to be hashable.

Yes. I'm glad that bit of silliness is now dead (ish).

Cheers,
mwh
 
Thomas Heller

[about object.__hash__]
> Bernhard Herzog wrote:
> ...
>
> I don't know why the official docs don't mention this fact -- I do
> mention it in "Python in a Nutshell", of course (p. 132).
>
> The reason this is done is because it's useful.

Well, Guido seems to disagree. When I complained about the default
__hash__() method, he asked me to enter a bug about it:
http://www.python.org/sf/660098. But it doesn't seem easy at all to fix
this...

Thomas
 
