Yet another unique() function...

P

Paul Rubin

MonkeeSage said:
Here's yet another take on a unique() function for sequences. It's
more terse than others I've seen and works for all the common use
cases (please report any errors on the recipe page):

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/502263

That looks pretty messy, and it's a quadratic time algorithm (or maybe
worse) because of all the list.index and deletion operations.

This version is also quadratic and passes your test suite, but
might differ in some more complicated cases:

def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if (c not in seen, seen.append(c))[0])
 
P

Paul Rubin

Paul Rubin said:
def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if (c not in seen, seen.append(c))[0])

Preferable:

def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if not (c in seen or seen.append(c)))
 
M

MonkeeSage

Paul Rubin said:
def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if (c not in seen, seen.append(c))[0])

Preferable:

def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if not (c in seen or seen.append(c)))

Wow, nice! Very cool. :)

Regards,
Jordan
 
M

MonkeeSage

Paul Rubin said:
def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if (c not in seen, seen.append(c))[0])
Preferable:

def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if not (c in seen or seen.append(c)))

Wow, nice! Very cool. :)

Regards,
Jordan

I posted this (attributed to you of course) in the comments section
for the recipe.
 
P

Paul McGuire

Paul Rubin said:
def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if (c not in seen, seen.append(c))[0])

Preferable:

def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if not (c in seen or seen.append(c)))


Any reason not to use a set for the 'seen' variable? Avoids searching
through a linear list. The input order is preserved because the
return value is created in the generator expression, not by using the
seen variable directly.

def unique2(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = set()
return t(c for c in seq if not (c in seen or seen.add(c)))

-- Paul
 
P

Paul Rubin

Paul McGuire said:
Any reason not to use a set for the 'seen' variable?

Yes, the sequence can contain non-hashable elements. See the test
vectors for examples.
 
B

bearophileHUGS

MonkeeSage:
Here's yet another take on a unique() function for sequences. It's
more terse than others I've seen and works for all the common use
cases (please report any errors on the recipe page):

It's more terse, but my version is built to be faster in the more
common cases of all hashable or/and all sortable items (while working
in other cases too).
Try your unique on an unicode string, that's probably a bug (keepstr
is being ignored).
Version by Paul Rubin is very short, but rather unreadable too.

Bye,
bearophile
 
P

Paul Rubin

It's more terse, but my version is built to be faster in the more
common cases of all hashable or/and all sortable items (while working
in other cases too).
Try your unique on an unicode string, that's probably a bug (keepstr
is being ignored).
Version by Paul Rubin is very short, but rather unreadable too.

Bye,
bearophile

Unicode fix (untested):

def unique(seq, keepstr=True):
t = type(seq)
if t in (unicode, str):
t = (list, t('').join)[bool(keepstr)]
seen = []
return t(c for c in seq if not (c in seen or seen.append(c)))

Case by case optimization (untested):

def unique(seq, keepstr=True):
t = type(seq)
if t in (unicode, str):
t = (list, t('').join)[bool(keepstr)]
try:
remaining = set(seq)
seen = set()
return t(c for c in seq if (c in remaining and
not remaining.remove(c)))
except TypeError: # hashing didn't work, see if seq is sortable
try:
from itertools import groupby
s = sorted(enumerate(seq),key=lambda (i,v):(v,i))
return t(g.next() for k,g in groupby(s, lambda (i,v): v))
except: # not sortable, use brute force
seen = []
return t(c for c in seq if not (c in seen or seen.append(c)))

I don't have Python 2.4 available right now to try either of the above.

Note that all the schemes fail if seq is some arbitrary iterable,
rather one of the built-in sequence types.

I think these iterator approaches get more readable as one becomes
used to them.
 
M

MonkeeSage

Unicode fix (untested):

def unique(seq, keepstr=True):
t = type(seq)
if t in (unicode, str):
t = (list, t('').join)[bool(keepstr)]
seen = []
return t(c for c in seq if not (c in seen or seen.append(c)))

This definitely works. I tried to post a message about this a few
hours ago, but I guess it didn't go through. I've already updated the
recipe and comments.
I think these iterator approaches get more readable as one becomes
used to them.

I agree. I understood your code in a matter of seconds.
 
M

MonkeeSage

Paul,

In your case optimized version, in the second try clause using
itertools, it should be like this, shouldn't it?

return t(g.next()[1] for k,g in groupby(s, lambda (i,v): v))
^^^

Regards,
Jordan
 
P

Paul Rubin

MonkeeSage said:
In your case optimized version, in the second try clause using
itertools, it should be like this, shouldn't it?

return t(g.next()[1] for k,g in groupby(s, lambda (i,v): v))

I didn't think so but I can't conveniently test it for now. Maybe
tonight.
 
M

MonkeeSage

I didn't think so but I can't conveniently test it for now. Maybe
tonight.

After playing with it a bit (only have 2.5 on this box), it looks like
you do need to subscript the next() call. For example, the return from
"unique( [[1], [2]] )" is "[(0, [1]), (1, [2])]". I'm not overly
familiar with the functional paradigm or itertools though, so I might
have missed something.

Regards,
Jordan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,772
Messages
2,569,593
Members
45,113
Latest member
KetoBurn
Top