delete duplicates in list

christof hoeke

hello,
this must have come up before, so i am already sorry for asking but a
quick googling did not give me any answer.

i have a list from which i want a simpler list without the duplicates
an easy but somehow contrived solution would be
>>> a = [1, 2, 2, 3]
>>> d = {}.fromkeys(a)
>>> b = d.keys()
>>> print b
[1, 2, 3]

there should be an easier or more intuitive solution, maybe with a list
comprehension?

something like
>>> b = [x for x in a if x not in b]
>>> print b
[]

does not work though.

thanks for any help
chris
 
Diez B. Roggisch

Hi,
> this must have come up before, so i am already sorry for asking but a
> quick googling did not give me any answer.
>
> i have a list from which i want a simpler list without the duplicates
> an easy but somehow contrived solution would be

In python 2.3, this should work:

import sets
l = [1,2,2,3]
s = sets.Set(l)

A set is a container that maintains the invariant that no object appears
in it twice. So it suits your needs.
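
If a plain list is wanted back, wrapping the set in list() should do it (a
small sketch -- note the original order is not guaranteed):

>>> import sets
>>> list(sets.Set([1, 2, 2, 3]))
[1, 2, 3]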

Regards,

Diez
 
Alex Martelli

christof hoeke wrote:
...
> i have a list from which i want a simpler list without the duplicates

Canonical is:

import sets
simplerlist = list(sets.Set(thelist))

if you're all right with destroying order, as your example solution suggests.
But dict.fromkeys(a).keys() is probably faster. Your assertion:

> there should be an easier or more intuitive solution, maybe with a list
> comprehension?

doesn't seem self-evident to me. A list-comprehension might be, e.g:

[ x for i, x in enumerate(a) if i==a.index(x) ]

and it does have the advantages of (a) keeping order AND (b) not
requiring hashable (nor even inequality-comparable!) elements -- BUT
it has the non-indifferent cost of being O(N*N) while the others
are about O(N). If you really want something similar to your approach:
> b = [x for x in a if x not in b]

you'll have, o horrors!-), to do a loop, so name b is always bound to
"the result list so far" (in the LC, name b is only bound at the end):

b = []
for x in a:
    if x not in b:
        b.append(x)

However, this is O(N*N) too. In terms of "easier or more intuitive",
I suspect only this latter solution might qualify.
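
For completeness: a plain loop can keep order and stay roughly O(N) by
tracking already-seen elements in a dict -- a minimal sketch, at the price
of requiring hashable elements again:

seen = {}
b = []
for x in a:
    if x not in seen:      # average O(1) membership test via hashing
        seen[x] = True
        b.append(x)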


Alex
 
Alex Martelli

Bernard said:
> > >>> a = [1, 2, 2, 3]
> > >>> d = {}.fromkeys(a)
> > >>> b = d.keys()
> > >>> print b
> > [1, 2, 3]
>
> That, or using a Set (python 2.3+). Actually I seem to recall that
> "The Python CookBook" still advises building a dict as the fastest
> solution - if your elements can be hashed. See the details at
>
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560

Yep, but a Set is roughly as fast as a dict -- it has a small
penalty wrt a dict, but, emphasis on small. Still, if you're
trying to squeeze every last cycle out of a bottleneck, it's
worth measuring, and perhaps moving from Set to dict.
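
For instance, a rough comparison along these lines (a sketch -- the exact
numbers will of course depend on your data and build):

timeit.py -c -s'import sets' -s'a=range(1000)*2' 'list(sets.Set(a))'
timeit.py -c -s'a=range(1000)*2' '{}.fromkeys(a).keys()'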


Alex
 
Alex Martelli

Diez B. Roggisch wrote:

> > this must have come up before, so i am already sorry for asking but a
> > quick googling did not give me any answer.
> >
> > i have a list from which i want a simpler list without the duplicates
> > an easy but somehow contrived solution would be
>
> In python 2.3, this should work:
>
> import sets
> l = [1,2,2,3]
> s = sets.Set(l)
>
> A set is a container that maintains the invariant that no object appears
> in it twice. So it suits your needs.

...except it's not a LIST, which was part of the specifications given
by the original poster. It may, of course, be that you've read his
mind correctly and that, despite his words, he doesn't really care
whether he gets a list or a very different container:).


Alex
 
Diez B. Roggisch

Hi,
> ...except it's not a LIST, which was part of the specifications given
> by the original poster. It may, of course, be that you've read his
> mind correctly and that, despite his words, he doesn't really care
> whether he gets a list or a very different container:).

You're right - mathematically. However, there is no such thing as a set in
computers - you always end up with some sort of list :)

So - he'll have a list anyway. But does it respect the order of the list
parameter... <just checking, standby>

... nope:

>>> l = [1,2,3,2]
>>> import sets
>>> sets.Set(l)
Set([1, 2, 3])
>>> l = [2,1,2,3,2]
>>> sets.Set(l)
Set([1, 2, 3])

Which is of course reasonable, as the check for existence in the set might
be performed in O(ln n) instead of O(n).

Regards,

Diez
 
Hannu Kankaanpää

Diez B. Roggisch said:
> Hi,
>
> You're right - mathematically. However, there is no such thing as a set in
> computers - you always end up with some sort of list :)

Those are just implementation details. There could be a group of monkeys
emulating Python under the hood, and their implementation of a set
would be a neural network instead of any kind of sequence, but you
still wouldn't care as a programmer. The only thing that matters is
whether the interface stays the same. And nope, the items in a set are
no longer accessible by their index (among other differences).
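
For instance, a quick sketch of that difference (the exact error message
may vary):

>>> import sets
>>> s = sets.Set([1, 2, 3])
>>> s[0]
Traceback (most recent call last):
  ...
TypeError: ...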

> So - he'll have a list anyway.

So I can't agree with this. You don't know if his Python virtual machine
is a group of monkeys. Python is supposed to be a high level language.

> But does it respect the order of the list
> parameter... <just checking, standby>
>
> ... nope:
>
> >>> l = [1,2,3,2]
> >>> import sets
> >>> sets.Set(l)
> Set([1, 2, 3])
> >>> l = [2,1,2,3,2]
> >>> sets.Set(l)
> Set([1, 2, 3])
>
> Which is of course reasonable, as the check for existence in the set might
> be performed in O(ln n) instead of O(n).

Actually the Set in sets module has an average lookup of O(1), worst
case O(n) (not 100% sure of worst case, but 99% sure). It's been
implemented with dictionaries, which in turn are hash tables.
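
You can even peek at the underlying dict -- an implementation detail (the
_data attribute of the 2.3 sets module), so nothing to rely on:

>>> import sets
>>> s = sets.Set([1, 2, 3])
>>> s._data
{1: True, 2: True, 3: True}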
 
Duncan Booth

> Your assertion:
>
> > there should be an easier or more intuitive solution, maybe with a list
> > comprehension?
>
> doesn't seem self-evident to me. A list-comprehension might be, e.g:
>
> [ x for i, x in enumerate(a) if i==a.index(x) ]
>
> and it does have the advantages of (a) keeping order AND (b) not
> requiring hashable (nor even inequality-comparable!) elements -- BUT
> it has the non-indifferent cost of being O(N*N) while the others
> are about O(N). If you really want something similar to your approach:
>
> > b = [x for x in a if x not in b]
>
> you'll have, o horrors!-), to do a loop, so name b is always bound to
> "the result list so far" (in the LC, name b is only bound at the end):
>
> b = []
> for x in a:
>     if x not in b:
>         b.append(x)
>
> However, this is O(N*N) too. In terms of "easier or more intuitive",
> I suspect only this latter solution might qualify.

I dunno about more intuitive, but here's a fairly simple list comprehension
solution which is O(N) and preserves the order. Of course it's back to
requiring hashable elements again:

>>> startList = [5,1,2,1,3,4,2,5,3,4]
>>> d = {}
>>> [ d.setdefault(x,x) for x in startList if x not in d ]
[5, 1, 2, 3, 4]

And for the 'I must do it on one line' freaks, here's the single expression
variant of the above: :^)
>>> [ d.setdefault(x,x) for d in [{}] for x in startList if x not in d ]
[5, 1, 2, 3, 4]
 
Alex Martelli

Hannu Kankaanpää wrote:
...
> So I can't agree with this. You don't know if his Python virtual machine
> is a group of monkeys. Python is supposed to be a high level language.

So, in that case those monkeys would surely be perched up on tall trees.

> Actually the Set in sets module has an average lookup of O(1), worst
> case O(n) (not 100% sure of worst case, but 99% sure). It's been

Hmmm -- could you give an example of that worst case...? A _full_
hashtable would give such behavior, but Python's dicts always ensure
the underlying hashtables aren't too full...

> implemented with dictionaries, which in turn are hash tables.

Yep - check it out...:

[alex@lancelot bo]$ timeit.py -c -s'import sets' -s'x=sets.Set(xrange(100))'
'7 in x'
100000 loops, best of 3: 2.3 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(1000))' '7 in x'
100000 loops, best of 3: 2.3 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(10000))' '7 in x'
100000 loops, best of 3: 2.2 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(100000))' '7 in x'
100000 loops, best of 3: 2.2 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(1000000))' '7 in x'
100000 loops, best of 3: 2.3 usec per loop

see the pattern...?-)


Same when something is NOT there, BTW:

[alex@lancelot bo]$ timeit.py -c -s'import sets'
-s'x=sets.Set(xrange(1000000))' 'None in x'
100000 loops, best of 3: 2.3 usec per loop

etc, etc.
 
Michael Hudson

Alex Martelli said:
> Hmmm -- could you give an example of that worst case...? A _full_
> hashtable would give such behavior, but Python's dicts always ensure
> the underlying hashtables aren't too full...

Try inserting a bunch of instances of

class C:
    def __hash__(self): return 0

into a dictionary.
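
A rough sketch of the effect (every instance hashes to the same value, so
each insert has to probe past all the previous entries):

import time

class C:
    def __hash__(self): return 0

d = {}
t0 = time.time()
for i in xrange(1000):
    d[C()] = None       # all keys collide, so this probe chain keeps growing
print time.time() - t0  # total time grows quadratically in the number of keys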

Cheers,
mwh
 
Corrado Gioannini

> there should be an easier or more intuitive solution, maybe with a list
> comprehension?
>
> something like
>
> >>> b = [x for x in a if x not in b]
> >>> print b
> []
>
> does not work though.

if you want a list comp solution, similar to the one you proposed and valid
also with python < 2.3, you could do:

>>> a = [1, 2, 2, 3]
>>> b = []
>>> [b.append(x) for x in a if x not in b]
[None, None, None]
>>> b
[1, 2, 3]

or even:
>>> a = [1, 2, 2, 3]
>>> [b.append(x) for b in [[]] for x in a if x not in b]
[None, None, None]
>>> b
[1, 2, 3]

Corrado.
 
Thomas Heller

Michael Hudson said:
> Try inserting a bunch of instances of
>
> class C:
>     def __hash__(self): return 0
>
> into a dictionary.

I've thought about using something like this in production code
to be able to store mutable instances in a dict.
Performance problems aside (since there are only a couple of key/value
pairs in the dict), is it such a bad idea?

Thomas
 
Michael Hudson

Thomas Heller said:
> I've thought about using something like this in production code
> to be able to store mutable instances in a dict.

Eek!

> Performance problems aside (since there are only a couple of key/value
> pairs in the dict), is it such a bad idea?

Um, I don't know actually. It pretty much defeats the point of using
a hashtable. If your class has some non-mutable parts it'd be an
idea to hash them, of course.

And the performance problems are severe -- inserting just a thousand
instances of the class C() into a dictionary takes noticeable time.

(What *is* a bad idea is giving such classes an __eq__ method that
mutates the dict being inserted into -- I found a bunch of such bugs a
couple of years back, and I think it's still possible to core Python (via
stack overflow) by being even more devious).

Cheers,
mwh
 
Bernhard Herzog

Thomas Heller said:
> I've thought about using something like this in production code
> to be able to store mutable instances in a dict.
> Performance problems aside (since there are only a couple of key/value
> pairs in the dict), is it such a bad idea?

IMO it depends on what equality means for instances. E.g. if two
instances are only equal if they're identical, i.e. a == b is equivalent
to a is b, then defining __hash__ can be very useful, because then you
can use them in dictionaries and mutability doesn't really matter
because no change to one instance can make it equal to another
instance.

I'd define __hash__ to return id(self), though, so that the hash values
are different for different instances to reduce collisions.
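
A minimal sketch of that idiom (a hypothetical class, just for illustration):

class Mutable:
    # equality here is identity, so an id()-based hash stays valid
    # even when the instance mutates
    def __hash__(self):
        return id(self)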

This also seems to be what class objects in Python do:

>>> class C:
...     pass
...
>>> hash(C) == id(C)
True
Bernhard
 
Michael Hudson

Bernhard Herzog said:
> IMO it depends on what equality means for instances. E.g. if two
> instances are only equal if they're identical, i.e. a == b is equivalent
> to a is b, then defining __hash__ can be very useful, because then you
> can use them in dictionaries and mutability doesn't really matter
> because no change to one instance can make it equal to another
> instance.

Um. In that case why define __hash__ or __eq__ at all?

Cheers,
mwh
 
Bernhard Herzog

Michael Hudson said:
> Um. In that case why define __hash__ or __eq__ at all?

Good question. I didn't know that that was the default implementation of
__hash__. It's not documented AFAICT, but it does seem to be the case. I
wonder why, even though it's useful.

The language reference for __hash__ says:

    If a class does not define a __cmp__() method it should not define a
    __hash__() operation either;

Seems a bit outdated in that it only mentions __cmp__ here and not
__eq__ as well, as it does right after the semicolon, but that would not
indicate to me that hash works for an instance without a __hash__ method.
Of course the sentence above could be taken to mean that if you define
neither method the defaults do the right thing, but that would seem very
obscure to me.

Anyway, I've found that if you end up comparing objects very often it
can be a substantial performance gain to define __eq__ even though it
ends up doing the same as the default comparison. At least for old-style
classes, when __eq__ is not defined Python tries __cmp__ and __coerce__,
and __getattr__, if defined, is called for all of them.

Then, once you have defined __eq__ you also need to have __hash__ if you
want your objects to be hashable.
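
A minimal sketch of that pairing (a hypothetical Point class):

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return isinstance(other, Point) and \
               (self.x, self.y) == (other.x, other.y)
    def __hash__(self):
        # hash exactly the state that __eq__ compares, so that
        # equal objects always hash equal
        return hash((self.x, self.y))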

Bernhard
 
Alex Martelli

Bernhard Herzog wrote:
...
> Good question. I didn't know that that was the default implementation of
> __hash__. It's not documented AFAICT, but it does seem to be the case. I

I don't know why the official docs don't mention this fact -- I do
mention it in "Python in a Nutshell", of course (p. 132).

> wonder why, even though it's useful.

The reason this is done is because it's useful.

> Anyway, I've found that if you end up comparing objects very often it
> can be a substantial performance gain to define __eq__ even though it
> ends up doing the same as the default comparison. At least for old-style
> classes, when __eq__ is not defined Python tries __cmp__ and __coerce__,
> and __getattr__, if defined, is called for all of them.

Yes, but making your classes newstyle would be a MUCH greater performance
boost in this situation. See:

[alex@lancelot bo]$ timeit.py -c -s'class X:pass' -s'x=X()' 'x==None'
100000 loops, best of 3: 3.5 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'class X:
    def __eq__(self, other): return id(self)==id(other)' -s'x=X()' 'x==None'
100000 loops, best of 3: 2.1 usec per loop

[alex@lancelot bo]$ timeit.py -c -s'class X(object):pass' -s'x=X()'
'x==None'
1000000 loops, best of 3: 0.42 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'class X(object):
    def __eq__(self, other): return id(self)==id(other)' -s'x=X()' 'x==None'
100000 loops, best of 3: 2.1 usec per loop

Instead of saving 40% of the comparison time by defining __eq__, why not
save 80+% of it, instead, just by making the class newstyle? Note that
the comparison time is avoud equal for both old and new style classes if
you do define an __eq__ -- but while it's a 40% speedup for oldstyle
ones, it's a slowdown by 5 times for newstyle ones.

> Then, once you have defined __eq__ you also need to have __hash__ if you
> want your objects to be hashable.

Another good performance reason to make the class newstyle and thus
avoid having to define __eq__, IMHO.

These days the only case where I use oldstyle classes (except in Q&D
scripts where I simply forget to say '(object)', but that's getting
rare as the good habit becomes ingrained:) is for exception classes,
which (for reasons that escape me) ARE supposed to be oldstyle. The
exception that proves the rule?-)


Alex
 
Michael Hudson

Bernhard Herzog said:
> Good question. I didn't know that that was the default implementation of
> __hash__. It's not documented AFAICT, but it does seem to be the case. I
> wonder why, even though it's useful.

Well, I didn't so much know that this was the default implementation
of __hash__ as know that instances of classes that don't do anything
to the contrary compare equal iff identical and are usefully hashable.
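
For instance, a quick check (addresses elided):

>>> class C: pass
...
>>> a, b = C(), C()
>>> a == a, a == b
(True, False)
>>> {a: 1, b: 2}
{<__main__.C instance at 0x...>: 1, <__main__.C instance at 0x...>: 2}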

(Python Mystery Theatre question: for such a class, when can you have
id(ob) != hash(ob)?)

> The language reference for __hash__ says:
>
>     If a class does not define a __cmp__() method it should not define a
>     __hash__() operation either;

> Seems a bit outdated in that it only mentions __cmp__ here and not
> __eq__ as well, as it does right after the semicolon, but that would not
> indicate to me that hash works for an instance without a __hash__ method.

I hadn't had the distraction of reading this section of the language
reference :) Did you seriously think that you *had* to define
__hash__ to make instances of your classes hashable?

> Of course the sentence above could be taken to mean that if you
> define neither method the defaults do the right thing, but that
> would seem very obscure to me.

Well, this is Python after all. I would have read

    if it defines __cmp__() or __eq__() but not __hash__(), its
    instances will not be usable as dictionary keys.

to imply that in other situations that instances would be usable as
dictionary keys.

> Anyway, I've found that if you end up comparing objects very often it
> can be a substantial performance gain to define __eq__ even though it
> ends up doing the same as the default comparison. At least for old-style
> classes, when __eq__ is not defined Python tries __cmp__ and __coerce__,
> and __getattr__, if defined, is called for all of them.

Old-style classes? What are they? :) Didn't know about that.

> Then, once you have defined __eq__ you also need to have __hash__ if you
> want your objects to be hashable.

Yes. I'm glad that bit of silliness is now dead (ish).

Cheers,
mwh
 
Thomas Heller

[about object.__hash__]
> Bernhard Herzog wrote:
> ...
>
> I don't know why the official docs don't mention this fact -- I do
> mention it in "Python in a Nutshell", of course (p. 132).
>
> The reason this is done is because it's useful.

Well, Guido seems to disagree. When I complained about the default
__hash__() method, he asked me to enter a bug about it:
http://www.python.org/sf/660098. But it doesn't seem easy at all to fix
this...

Thomas
 
