remove duplicates from list *preserving order*

Steven Bethard

I'm sorry, I assume this has been discussed somewhere already, but I
found only a few hits in Google Groups... If you know where there's a
good summary, please feel free to direct me there.


I have a list[1] of objects from which I need to remove duplicates. I
have to maintain the list order though, so solutions like set(lst), etc.
will not work for me. What are my options? So far, I can see:

def filterdups(iterable):
    result = []
    for item in iterable:
        if item not in result:
            result.append(item)
    return result

def filterdups(iterable):
    result = []
    seen = set()
    for item in iterable:
        if item not in seen:
            result.append(item)
            seen.add(item)
    return result

def filterdups(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

Does anyone have a better[2] solution?

STeve

[1] Well, actually it's an iterable of objects, but I can convert it to
a list if that's helpful.

[2] Yes I know, "better" is ambiguous. If it helps any, for my
particular situation, speed is probably more important than memory, so
I'm leaning towards the second or third implementation.
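Since footnote [2] makes speed the main criterion, a small timing harness
is one way to choose among the three on representative data. A minimal
sketch, assuming the three versions above are saved as filterdups1,
filterdups2, and filterdups3 in a module dups.py (hypothetical names),
with illustrative sample data:

import timeit

# dups.py is assumed to contain the three implementations above under
# the hypothetical names filterdups1, filterdups2, filterdups3.
setup = ("import dups\n"
         "data = range(1000) * 5")   # 5000 items, each value 5 times

for name in ('filterdups1', 'filterdups2', 'filterdups3'):
    timer = timeit.Timer('list(dups.%s(data))' % name, setup)
    print name, min(timer.repeat(3, 100))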
 
Carl Banks

Steven said:
I have a list[1] of objects from which I need to remove duplicates. I
have to maintain the list order though, so solutions like set(lst), etc.
will not work for me. What are my options?

[snip the three filterdups implementations]

Does anyone have a better[2] solution?

[2] Yes I know, "better" is ambiguous. If it helps any, for my
particular situation, speed is probably more important than memory, so
I'm leaning towards the second or third implementation.


from itertools import *
[ x for (x,s) in izip(iterable,repeat(set()))
if (x not in s,s.add(x))[0] ]


that's-one-ambiguously-better-solution-ly yr's,
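For readers decoding Carl's one-liner: izip(iterable, repeat(set()))
pairs every item with the same set object, and since set.add returns
None, the tuple (x not in s, s.add(x)) records the membership test made
before the add mutates the set. A more common spelling of the same trick,
as a sketch:

def filterdups_oneliner(iterable):
    # set.add returns None, so "not seen.add(x)" is always True and,
    # thanks to short-circuiting, only runs when x is not yet in seen.
    seen = set()
    return [x for x in iterable if x not in seen and not seen.add(x)]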
 
wittempj

You could do it with a class, like this. I guess it is a bit faster
than option 1, although I'm no connoisseur of Python internals...
class uniquelist(list):
    def __init__(self, l):
        for item in l:
            self.append(item)

    def append(self, item):
        if item not in self:
            list.append(self, item)

l = [1, 1, 2, 3, 3, 4]
print l
m = uniquelist(l)
print m
 
Steven Bethard

Carl said:
from itertools import *
[ x for (x,s) in izip(iterable,repeat(set()))
if (x not in s,s.add(x))[0] ]

Wow, that's evil! Pretty cool, but for the sake of readers of my code,
I think I'll have to opt against it. ;)

STeVe
 
wittempj

You could create a class based on a list which takes a list as
argument, like this:
class uniquelist(list):
    def __init__(self, l):
        for item in l:
            self.append(item)

    def append(self, item):
        if item not in self:
            list.append(self, item)

l = [1, 1, 2, 3, 3, 4]
print l
m = uniquelist(l)
print m

will print
[1, 1, 2, 3, 3, 4]
[1, 2, 3, 4]
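One caveat on the uniquelist approach: "item not in self" is a linear
scan of the list, so building a uniquelist from n items is O(n**2), much
like Steven's first option. A sketch of the same subclass backed by a set
for constant-time membership tests (the _seen attribute name is
illustrative):

class uniquelist2(list):
    def __init__(self, l):
        self._seen = set()      # mirrors the list's contents
        for item in l:
            self.append(item)

    def append(self, item):
        # O(1) membership test instead of scanning the list
        if item not in self._seen:
            self._seen.add(item)
            list.append(self, item)

print uniquelist2([1, 1, 2, 3, 3, 4])   # prints [1, 2, 3, 4]

Like the original, this only guards append; extend, insert, and slice
assignment would still bypass the check.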
 
Michael Spencer

Steven said:
I have a list[1] of objects from which I need to remove duplicates. I
have to maintain the list order though, so solutions like set(lst), etc.
will not work for me. What are my options?

[snip the three filterdups implementations]

Does anyone have a better[2] solution?

How about:

>>> import itertools
>>> def filterdups3(iterable):
...     seen = set()
...     def _seen(item):
...         return item in seen or seen.add(item)
...     return itertools.ifilterfalse(_seen, iterable)
...
>>> list(filterdups3([1,2,2,3,3,3,4,4,4,2,2,5]))
[1, 2, 3, 4, 5]
>>>

Michael
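A side note on why Michael's predicate works: set.add returns None, so
"item in seen or seen.add(item)" is truthy only for items already in the
set; a first occurrence fails the membership test, falls through to the
add (which inserts it and returns None), and comes out falsy, and
ifilterfalse keeps exactly those falsy cases. A tiny illustration of the
predicate on its own (a sketch, not from the thread):

seen = set()
print [item in seen or seen.add(item) for item in [1, 1, 2]]
# prints [None, True, None]: falsy for each first occurrence (which is
# also the moment it gets inserted), truthy for the duplicate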
 
Francis Girard

Hi,

I think your last solution is not good unless your "list" is sorted (in
which case the solution is trivial), since you certainly do have to see
all the elements in the list before deciding that a given element is not
a duplicate. You have to exhaust the iterable before yielding anything.

Besides that, I was thinking that the best solutions presented here
proceed in two steps:

1- Check that the element is not already in some data structure of
already-seen elements

2- If not, put the element in that structure

That's probably twice the work.

There might be some way to do both at the same time, i.e.

- Put the element in the data structure only if it is not already there,
and tell me, with some boolean, whether you did put the element.

Then you have to do the work of finding the right place where to insert
(or fetch) the element only once. I don't see any easy way to do this in
Python, other than rolling up your sleeves and coding your own data
structure, which would be slower than the provided C implementations of
dict and set.

I think this is a hole in the Python libs.

Regards

Francis Girard

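Francis's "insert and tell me whether you did" primitive can in fact be
approximated in pure Python without writing a new data structure: insert
unconditionally and watch whether the set's size changed, which hashes
each element only once. A minimal sketch of that idea (Alex Martelli
posts a timed variant of it later in the thread):

def filterdups_onehash(iterable):
    seen = set()
    size = 0
    for item in iterable:
        seen.add(item)              # insert unconditionally
        if len(seen) != size:       # the add actually inserted item
            size = len(seen)
            yield item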
 
Steven Bethard

Francis said:
I think your last solution is not good unless your "list" is sorted (in which
case the solution is trivial) since you certainly do have to see all the
elements in the list before deciding that a given element is not a duplicate.
You have to exhaust the iterable before yielding anything.

I don't follow. The first time I see an element, it is not a duplicate,
so I yield it. After that, the element should show up in the 'seen'
set, which means I won't yield any of the remaining equivalent
elements. Could you explain your logic here?

Steve
 
John Machin

Francis said:
Hi,

I think your last solution is not good unless your "list" is sorted (in which
case the solution is trivial) since you certainly do have to see all the
elements in the list before deciding that a given element is not a duplicate.
You have to exhaust the iterable before yielding anything.

Last solution? All of them have essentially the same logic to decide
which items to reject.

Further, what you say is true only if you are interpreting Steven's
ambiguous(?) requirement as: remove ALL instances of a bunch of
duplicates. Such a requirement would seem to be (a) just a tad unlikely
and (b) contrary to the output of solutions already shown, which have
brought no protest from Steven.

So, just to remove ambiguity, WHICH one of the bunch should be
retained? Short answer: "the first seen" is what the proverbial "man in
the street" would expect -- "Have you voted already today?" -- it's a
practical requirement and doesn't need any more complicated data
structure than a set of folk who've already voted and no
sleeve-rolling-up.
 
Steven Bethard

John said:
So, just to remove ambiguity, WHICH one of the bunch should be
retained? Short answer: "the first seen" is what the proverbial "man in
the street" would expect

For my purposes, it doesn't matter which instance is retained and which
are removed, so yes, retaining the first one is quite acceptable. (That
was also the simplest thing to do, so that's why all my solutions did that.)

Steve
 
Alex Martelli

Steven Bethard said:
I have a list[1] of objects from which I need to remove duplicates. I
have to maintain the list order though, so solutions like set(lst), etc.
will not work for me. What are my options? So far, I can see:

I think the recipe under that subject in the 1st Edition Python Cookbook is
pretty exhaustive wrt the possibilities that were available up to Python
2.2 under various hypotheses of: are the items hashable, do you want to
keep the first one of several duplicates or the last one or some one
chosen by arbitrary criteria, etc, etc. In 2.3/2.4, the addition of
`set' simplifies many solutions, and itertools simplify others; besides,
it appears that you don't care which duplicate is kept, need use
equality only (not an equivalence-function), and do have hashable items,
so most of the ideas in that recipe aren't needed anyway.

def filterdups(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

Does anyone have a better[2] solution?

STeve

[1] Well, actually it's an iterable of objects, but I can convert it to
a list if that's helpful.

[2] Yes I know, "better" is ambiguous. If it helps any, for my
particular situation, speed is probably more important than memory, so
I'm leaning towards the second or third implementation.

I doubt you can do _substantially_ better than filterdups, unless some
particularly special conditions apply, e.g., the items are hashable but
hashing them is very time-consuming. In that case you might avoid
hashing items twice, as filterdups does, by some hack such as:

def filterdups_hackish(iterable):
    seen = set()
    olen = 0
    for item in iterable:
        seen.add(item)
        nlen = len(seen)
        if nlen > olen:
            yield item
            olen = nlen

Here's a toy example of how this might help, given costly hashing:

class examp(object):
    def __init__(self, i): self.i = i//2
    def __hash__(self): return self.i**7 % 999999

dups = [examp(x) for x in range(1000)]

def fd1():
    for x in filterdups(dups): pass

def fd2():
    for x in filterdups_hackish(dups): pass


kallisti:~/cb alex$ python -mtimeit -s'import fid' 'fid.fd1()'
10 loops, best of 3: 55.6 msec per loop
kallisti:~/cb alex$ python -mtimeit -s'import fid' 'fid.fd2()'
10 loops, best of 3: 33.4 msec per loop


Probably not worth the bother even for this level of hashing costs and
density of duplicates, as you see, but maybe an idea worth keeping in mind if
the duplicates are few and hashing is really costly: having both the
membership test AND the insertion means you hash the item twice; the
hackish alternative is to insert anyway (hashing but once) and see if
that made a difference (taking the len is always cheap).


Alex
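Another way to pay an expensive hash only once, if you control the class,
is to memoize __hash__ on the instance, so even the plain test-then-add
filterdups computes each object's hash a single time. A sketch along the
lines of Alex's toy class (examp_cached and _cached_hash are illustrative
names, not from the thread):

class examp_cached(object):
    def __init__(self, i):
        self.i = i // 2
        self._cached_hash = None    # computed lazily, then reused
    def __hash__(self):
        if self._cached_hash is None:
            self._cached_hash = self.i**7 % 999999
        return self._cached_hash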
 
