remove duplicates from list *preserving order*

Steven Bethard

I'm sorry, I assume this has been discussed somewhere already, but I
found only a few hits in Google Groups... If you know where there's a
good summary, please feel free to direct me there.


I have a list[1] of objects from which I need to remove duplicates. I
have to maintain the list order though, so solutions like set(lst), etc.
will not work for me. What are my options? So far, I can see:

def filterdups(iterable):
    result = []
    for item in iterable:
        if item not in result:
            result.append(item)
    return result

def filterdups(iterable):
    result = []
    seen = set()
    for item in iterable:
        if item not in seen:
            result.append(item)
            seen.add(item)
    return result

def filterdups(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

Does anyone have a better[2] solution?

STeve

[1] Well, actually it's an iterable of objects, but I can convert it to
a list if that's helpful.

[2] Yes I know, "better" is ambiguous. If it helps any, for my
particular situation, speed is probably more important than memory, so
I'm leaning towards the second or third implementation.
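Since footnote [2] makes speed the main criterion, a small timing harness
is one way to choose among the three on representative data. A minimal
sketch, assuming the three versions above are saved as filterdups1,
filterdups2, and filterdups3 in a module dups.py (hypothetical names),
with illustrative sample data:

import timeit

# dups.py is assumed to contain the three implementations above under
# the hypothetical names filterdups1, filterdups2, filterdups3.
setup = ("import dups\n"
         "data = range(1000) * 5")   # 5000 items, each value 5 times

for name in ('filterdups1', 'filterdups2', 'filterdups3'):
    timer = timeit.Timer('list(dups.%s(data))' % name, setup)
    print name, min(timer.repeat(3, 100))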
 
Carl Banks

Steven said:
I have a list[1] of objects from which I need to remove duplicates. I
have to maintain the list order though, so solutions like set(lst), etc.
will not work for me. What are my options?

[snip the three filterdups implementations]

Does anyone have a better[2] solution?

[2] Yes I know, "better" is ambiguous. If it helps any, for my
particular situation, speed is probably more important than memory, so
I'm leaning towards the second or third implementation.


from itertools import *
[ x for (x,s) in izip(iterable,repeat(set()))
if (x not in s,s.add(x))[0] ]


that's-one-ambiguously-better-solution-ly yr's,
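For readers decoding Carl's one-liner: izip(iterable, repeat(set()))
pairs every item with the same set object, and since set.add returns
None, the tuple (x not in s, s.add(x)) records the membership test made
before the add mutates the set. A more common spelling of the same trick,
as a sketch:

def filterdups_oneliner(iterable):
    # set.add returns None, so "not seen.add(x)" is always True and,
    # thanks to short-circuiting, only runs when x is not yet in seen.
    seen = set()
    return [x for x in iterable if x not in seen and not seen.add(x)]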
 
wittempj

You could do it with a class, like this. I guess it is a bit faster
than option 1, although I'm no connoisseur of Python internals...
class uniquelist(list):
    def __init__(self, l):
        for item in l:
            self.append(item)

    def append(self, item):
        if item not in self:
            list.append(self, item)

l = [1, 1, 2, 3, 3, 4]
print l
m = uniquelist(l)
print m
 
Steven Bethard

Carl said:
from itertools import *
[ x for (x,s) in izip(iterable,repeat(set()))
if (x not in s,s.add(x))[0] ]

Wow, that's evil! Pretty cool, but for the sake of readers of my code,
I think I'll have to opt against it. ;)

STeVe
 
wittempj

You could create a class based on a list which takes a list as
argument, like this:
class uniquelist(list):
    def __init__(self, l):
        for item in l:
            self.append(item)

    def append(self, item):
        if item not in self:
            list.append(self, item)

l = [1, 1, 2, 3, 3, 4]
print l
m = uniquelist(l)
print m

will print
[1, 1, 2, 3, 3, 4]
[1, 2, 3, 4]
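One caveat on the uniquelist approach: "item not in self" is a linear
scan of the list, so building a uniquelist from n items is O(n**2), much
like Steven's first option. A sketch of the same subclass backed by a set
for constant-time membership tests (the _seen attribute name is
illustrative):

class uniquelist2(list):
    def __init__(self, l):
        self._seen = set()      # mirrors the list's contents
        for item in l:
            self.append(item)

    def append(self, item):
        # O(1) membership test instead of scanning the list
        if item not in self._seen:
            self._seen.add(item)
            list.append(self, item)

print uniquelist2([1, 1, 2, 3, 3, 4])   # prints [1, 2, 3, 4]

Like the original, this only guards append; extend, insert, and slice
assignment would still bypass the check.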
 
Michael Spencer

Steven said:
I have a list[1] of objects from which I need to remove duplicates. I
have to maintain the list order though, so solutions like set(lst), etc.
will not work for me. What are my options?

[snip the three filterdups implementations]

Does anyone have a better[2] solution?

How about:

>>> import itertools
>>> def filterdups3(iterable):
...     seen = set()
...     def _seen(item):
...         return item in seen or seen.add(item)
...     return itertools.ifilterfalse(_seen, iterable)
...
>>> list(filterdups3([1,2,2,3,3,3,4,4,4,2,2,5]))
[1, 2, 3, 4, 5]
>>>

Michael
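A side note on why Michael's predicate works: set.add returns None, so
"item in seen or seen.add(item)" is truthy only for items already in the
set; a first occurrence fails the membership test, falls through to the
add (which inserts it and returns None), and comes out falsy, and
ifilterfalse keeps exactly those falsy cases. A tiny illustration of the
predicate on its own (a sketch, not from the thread):

seen = set()
print [item in seen or seen.add(item) for item in [1, 1, 2]]
# prints [None, True, None]: falsy for each first occurrence (which is
# also the moment it gets inserted), truthy for the duplicate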
 
Francis Girard

Hi,

I think your last solution is not good unless your "list" is sorted (in
which case the solution is trivial), since you certainly do have to see
all the elements in the list before deciding that a given element is not
a duplicate. You have to exhaust the iterable before yielding anything.

Besides that, I was thinking that the best solutions presented here
proceed in two steps:

1- Check that the element is not already in some data structure of
already-seen elements

2- If not, put the element in that structure

That's probably twice the work.

There might be some way to do both at the same time, i.e.

- Put the element in the data structure only if it is not already there,
and tell me, with some boolean, whether you did put the element.

Then you have to do the work of finding the right place where to insert
(or fetch) the element only once. I don't see any easy way to do this in
Python, other than rolling up your sleeves and coding your own data
structure, which would be slower than the provided C implementations of
dict and set.

I think this is a hole in the Python libs.

Regards

Francis Girard

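Francis's "insert and tell me whether you did" primitive can in fact be
approximated in pure Python without writing a new data structure: insert
unconditionally and watch whether the set's size changed, which hashes
each element only once. A minimal sketch of that idea (Alex Martelli
posts a timed variant of it later in the thread):

def filterdups_onehash(iterable):
    seen = set()
    size = 0
    for item in iterable:
        seen.add(item)              # insert unconditionally
        if len(seen) != size:       # the add actually inserted item
            size = len(seen)
            yield item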
 
Steven Bethard

Francis said:
I think your last solution is not good unless your "list" is sorted (in which
case the solution is trivial) since you certainly do have to see all the
elements in the list before deciding that a given element is not a duplicate.
You have to exhaust the iterable before yielding anything.

I don't follow. The first time I see an element, it is not a duplicate,
so I yield it. After that, the element should show up in the 'seen'
set, which means I won't yield any of the remaining equivalent
elements. Could you explain your logic here?

Steve
 
John Machin

Francis said:
Hi,

I think your last solution is not good unless your "list" is sorted (in which
case the solution is trivial) since you certainly do have to see all the
elements in the list before deciding that a given element is not a duplicate.
You have to exhaust the iterable before yielding anything.

Last solution? All of them have essentially the same logic to decide
which items to reject.

Further, what you say is true only if you are interpreting Steven's
ambiguous(?) requirement as: remove ALL instances of a bunch of
duplicates. Such a requirement would seem to be (a) just a tad unlikely
and (b) contrary to the output of solutions already shown, which have
brought no protest from Steven.

So, just to remove ambiguity, WHICH one of the bunch should be
retained? Short answer: "the first seen" is what the proverbial "man in
the street" would expect -- "Have you voted already today?" -- it's a
practical requirement and doesn't need any more complicated data
structure than a set of folk who've already voted and no
sleeve-rolling-up.
 
Steven Bethard

John said:
So, just to remove ambiguity, WHICH one of the bunch should be
retained? Short answer: "the first seen" is what the proverbial "man in
the street" would expect

For my purposes, it doesn't matter which instance is retained and which
are removed, so yes, retaining the first one is quite acceptable. (That
was also the simplest thing to do, so that's why all my solutions did that.)

Steve
 
Alex Martelli

Steven Bethard said:
I have a list[1] of objects from which I need to remove duplicates. I
have to maintain the list order though, so solutions like set(lst), etc.
will not work for me. What are my options? So far, I can see:

I think the recipe under that subject in the 1st Edition Python Cookbook is
pretty exhaustive wrt the possibilities that were available up to Python
2.2 under various hypotheses of: are the items hashable, do you want to
keep the first one of several duplicates or the last one or some one
chosen by arbitrary criteria, etc, etc. In 2.3/2.4, the addition of
`set' simplifies many solutions, and itertools simplify others; besides,
it appears that you don't care which duplicate is kept, need use
equality only (not an equivalence-function), and do have hashable items,
so most of the ideas in that recipe aren't needed anyway.

def filterdups(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

Does anyone have a better[2] solution?

STeve

[1] Well, actually it's an iterable of objects, but I can convert it to
a list if that's helpful.

[2] Yes I know, "better" is ambiguous. If it helps any, for my
particular situation, speed is probably more important than memory, so
I'm leaning towards the second or third implementation.

I doubt you can do _substantially_ better than filterdups, unless some
particularly special conditions apply, e.g., the items are hashable but
hashing them is very time-consuming. In that case you might avoid
hashing items twice, as filterdups does, by some hack such as:

def filterdups_hackish(iterable):
    seen = set()
    olen = 0
    for item in iterable:
        seen.add(item)
        nlen = len(seen)
        if nlen > olen:
            yield item
            olen = nlen

Here's a toy example of how this might help, given costly hashing:

class examp(object):
    def __init__(self, i): self.i = i//2
    def __hash__(self): return self.i**7 % 999999

dups = [examp(x) for x in range(1000)]

def fd1():
    for x in filterdups(dups): pass

def fd2():
    for x in filterdups_hackish(dups): pass


kallisti:~/cb alex$ python -mtimeit -s'import fid' 'fid.fd1()'
10 loops, best of 3: 55.6 msec per loop
kallisti:~/cb alex$ python -mtimeit -s'import fid' 'fid.fd2()'
10 loops, best of 3: 33.4 msec per loop


Probably not worth the bother even for this level of hashing costs and
density of duplicates, as you see, but maybe an idea worth keeping in mind if
the duplicates are few and hashing is really costly: having both the
membership test AND the insertion means you hash the item twice; the
hackish alternative is to insert anyway (hashing but once) and see if
that made a difference (taking the len is always cheap).


Alex
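Another way to pay an expensive hash only once, if you control the class,
is to memoize __hash__ on the instance, so even the plain test-then-add
filterdups computes each object's hash a single time. A sketch along the
lines of Alex's toy class (examp_cached and _cached_hash are illustrative
names, not from the thread):

class examp_cached(object):
    def __init__(self, i):
        self.i = i // 2
        self._cached_hash = None    # computed lazily, then reused
    def __hash__(self):
        if self._cached_hash is None:
            self._cached_hash = self.i**7 % 999999
        return self._cached_hash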
 
