finding items that occur more than once in a list

John Machin

def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
            filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
    return [dups[0]] + [s[1] for s in
            filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
Argh! What's wrong with something like:
def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k

Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
as opposed to ours which are O(N log N).

I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.
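
(Raymond Hettinger's actual code isn't quoted anywhere in this thread; for
context, a set-based duplicate finder along the lines being discussed might
look like the sketch below. The function name is mine, not his.)

def duplicates_via_sets(lst):
    # One pass: 'seen' holds every element met so far,
    # 'dups' collects anything met at least twice.
    seen = set()
    dups = set()
    for item in lst:
        if item in seen:
            dups.add(item)
        else:
            seen.add(item)
    return list(dups)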
 
sturlamolden

I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.

Is a Python set implemented using a hash table?
 
Justin Bozonier

def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
            filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
    return [dups[0]] + [s[1] for s in
            filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
Argh! What's wrong with something like:
def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k
Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
as opposed to ours which are O(N log N).

I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.

It's not as much O(N)... Paul Robin's uses a sort first which is
definitely not O(N). Paul's could be prettied up a bit but the general
principle is sound.
 
John Machin

def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
            filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
    return [dups[0]] + [s[1] for s in
            filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
Argh! What's wrong with something like:
def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k
Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
as opposed to ours which are O(N log N).
I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.

It's not as much O(N)... Paul Robin's uses a sort first which is
definitely not O(N). Paul's could be prettied up a bit but the general
principle is sound.

"""
from collections import defaultdict
def nonunique(lst):
d = defaultdict(int)
for x in lst: d[x] += 1
return [x for x,n in d.iterkeys() if n > 1]
"""

I see no sort here.
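
(For concreteness, a small usage example of that counting approach; this is
my illustration rather than part of the original post, and the order of the
result follows dict iteration order, not input order.)

from collections import defaultdict

def nonunique(lst):
    d = defaultdict(int)
    for x in lst:
        d[x] += 1          # count every occurrence in one pass
    return [x for x, n in d.iteritems() if n > 1]

print nonunique([1, 2, 2, 3, 3, 3, 4])   # -> [2, 3] (order not guaranteed)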
 
sturlamolden

What don't you understand about the comments in the first two
screenfuls of Objects/setobject.c?

I had not looked at it, but now I have. It seems Hettinger is the
author :) OK, so sets are implemented as hash tables. Then I agree,
use Raymond Hettinger's solution.
 
Arnaud Delobelle

This is expected amortized constant time. Is that really the same
thing as average constant time? Hmmm... that's subtle.

I am not sure what the difference between expected amortized time
complexity and average time complexity is (I know that they are
defined as different notions but I don't know whether they reduce to
the same thing or not). Anyway both are average case complexities and
AFAIK worst case time complexity of insertion / lookup in a hashtable
is still O(n).
The amortized doubling breaks that.
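
(One way to see what the doubling buys, as a sketch only and not CPython's
real resize policy: count the element copies a grow-by-doubling table makes
over n insertions. The total stays proportional to n, so the average cost per
insert is constant even though the occasional resizing insert costs O(current
size).)

def total_moves(n, initial_capacity=8):
    # Simplified model: the table doubles whenever it is full,
    # and a resize re-places every element already stored.
    capacity = initial_capacity
    size = moves = 0
    for _ in range(n):
        if size == capacity:
            moves += size        # re-place the existing elements
            capacity *= 2
        moves += 1               # place the new element
        size += 1
    return moves

for n in (1000, 10000, 100000):
    m = total_moves(n)
    print n, m, float(m) / n     # the ratio stays small and bounded (below 3)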


Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".

Hrvoje Niksic provided the link :). I still think two unrelated
things are being compared in this thread when people say that method A
(using dictionaries / sets) is O(n) and method B (sorting the list)
O(nlogn).

Isn't it the case that:

         | Worst case | Average case
---------|------------|-------------
Method A | O(n^2)     | O(n)
Method B | O(nlogn)   | O(nlogn)

Which method is the best would then be determined by the distribution
of the hash values of your data, and whether you want to have a
guarantee the method will have a less-than-quadratic behaviour.
 
Bryan Olson

Arnaud said:
Bryan Olson wrote: [...]
Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".

Hrvoje Niksic provided the link :).

Oops, careless thread-following. Hrvoje Niksic it was.
I still think two unrelated
things are being compared in this thread when people say that method A
(using dictionaries / sets) is O(n) and method B (sorting the list)
O(nlogn).

Isn't it the case that:

         | Worst case | Average case
---------|------------|-------------
Method A | O(n^2)     | O(n)
Method B | O(nlogn)   | O(nlogn)

Which method is the best would then be determined by the distribution
of the hash values of your data, and whether you want to have a
guarantee the method will have a less-than-quadratic behaviour.

If we exclude the case where an adversary is choosing the keys, the
chance of a seriously degenerate case in the hashing method is so
remote that we should not worry about it. Those who insist on
ranking by worst-case behavior have probably abandoned Python long
ago, as it uses those hashed 'dict' things everywhere. Of course
they've also abandoned OS's with demand-paged virtual memory and
such.
 
castironpi

Arnaud said:
Bryan Olson wrote: [...]
Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".
Hrvoje Niksic provided the link :).

Oops, careless thread-following. Hrvoje Niksic it was.
I still think two unrelated
things are being compared in this thread when people say that method A
(using dictionaries / sets) is O(n) and method B (sorting the list)
O(nlogn).
Isn't it the case that:
         | Worst case | Average case
---------|------------|-------------
Method A | O(n^2)     | O(n)
Method B | O(nlogn)   | O(nlogn)
Which method is the best would then be determined by the distribution
of the hash values of your data, and whether you want to have a
guarantee the method will have a less-than-quadratic behaviour.

If we exclude the case where an adversary is choosing the keys, the
chance of a seriously degenerate case in the hashing method is so
remote that we should not worry about it. Those who insist on
ranking by worst-case behavior have probably abandoned Python long
ago, as it uses those hashed 'dict' things everywhere. Of course
they've also abandoned OS's with demand-paged virtual memory and
such.

A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]
 
John Machin

Arnaud said:
Bryan Olson wrote: [...]
Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".
Hrvoje Niksic provided the link :).
Oops, careless thread-following. Hrvoje Niksic it was.
If we exclude the case where an adversary is choosing the keys, the
chance of a seriously degenerate case in the hashing method is so
remote that we should not worry about it. Those who insist on
ranking by worst-case behavior have probably abandoned Python long
ago, as it uses those hashed 'dict' things everywhere. Of course
they've also abandoned OS's with demand-paged virtual memory and
such.

A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]

[1, 1, 1, 1, 1, 1, 1, 1]

Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."
 
Bryan Olson

John said:
A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]
Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."

Some adversary. What, you mean, my boss or my customers?

We mean that the party supplying the keys deliberately chose
them to make the hash table inefficient. In this thread the goal
is efficiency; a party working toward an opposing goal is an
adversary.

If you find real-world data sets that tend to induce bad-case
behavior in Python's hash, please do tell. It would be reason
enough to adjust the hash function. The hashes in popular
software such as Python are already quite well vetted.

Even a hash function that behaves as a random oracle has
worst-case quadratic-time in the algorithm here, but that's
an issue in theory and not in serving customers.
 
Arnaud Delobelle

John said:
On Mar 22, 1:11 am, (e-mail address removed) wrote:
A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]
Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."
Some adversary.  What, you mean, my boss or my customers?

We mean that the party supplying the keys deliberately chose
them to make the hash table inefficient. In this thread the goal
is efficiency; a party working toward an opposing goal is an
adversary.

There are situations where this can happen, I guess.
If you find real-world data sets that tend to induce bad-case
behavior in Python's hash, please do tell. It would be reason
enough to adjust the hash function. The hashes in popular
software such as Python are already quite well vetted.

As Python allows you to make your own hash functions for your own
classes, another danger is that you are dealing with objects with a
bad hash function. I've actually tried it (as I am *not* a pro!) and
FWIW here are the results.

-------- badhash.py ----------

class BadHash(object):
    def __hash__(self):
        return 1 # that's a bad hash function!

from time import time

def test(n):
    t0 = time()
    # Create a hash table of size n
    s = set(BadHash() for x in xrange(n))
    t1 = time()
    # Add an object to it
    obj = BadHash()
    s.add(obj)
    t2 = time()
    # Find that object
    obj in s
    t3 = time()
    print "%s\t%.3e\t%.3e\t%.3e" % (n, (t1-t0)/n**2, (t2-t1)/n, (t3-t2)/n)

print "n\tcreate/n^2\tadd/n\t\tfind/n"
for k in range(8, 17):
    # I'm hoping that adding an element to a set of size (1 << k) + 1
    # will not trigger a resize of the hash table.
    test((1 << k) + 1)

--------------------------------

marigold:junk arno$ python badhash.py
n      create/n^2  add/n      find/n
257    3.917e-08   6.958e-08  7.051e-08
513    3.641e-08   8.598e-08  7.018e-08
1025   3.522e-08   7.118e-08  6.722e-08
2049   3.486e-08   6.935e-08  6.982e-08
4097   3.480e-08   7.053e-08  6.931e-08
8193   3.477e-08   6.897e-08  6.981e-08
16385  3.441e-08   6.963e-08  7.075e-08
32769  3.720e-08   7.672e-08  7.885e-08
65537  3.680e-08   7.159e-08  7.510e-08

So theory and practice agree! In this case the hash table behaves
just like a linked list: create time, normalised by n^2, stays roughly
constant, so building the set is quadratic, while add and find, normalised
by n, show that a single insert or lookup is linear in the size of the set.

Lastly, if one deals with a totally ordered set of objects that are
not hashable (there isn't a good hash function), then Ninereeds' idea
of sorting first is still useful; see the sketch below.
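
(A sketch of that sorting-based approach, using itertools.groupby; the
function name is mine, and it assumes the objects sort so that equal items
end up adjacent.)

from itertools import groupby

def duplicates_by_sorting(lst):
    # O(n log n) because of the sort; needs comparability, not hashability.
    result = []
    for key, group in groupby(sorted(lst)):
        if sum(1 for _ in group) > 1:   # run of equal items longer than 1
            result.append(key)
    return result

print duplicates_by_sorting([3, 1, 2, 3, 2, 3])   # -> [2, 3]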
 
sturlamolden

Even a hash function that behaves as a random oracle has
worst-case quadratic-time in the algorithm here

In which case inserts are not amortized to O(1).
 
castironpi

John said:
On Mar 22, 1:11 am, (e-mail address removed) wrote:
A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]
Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."
Some adversary.  What, you mean, my boss or my customers?

We mean that the party supplying the keys deliberately chose
them to make the hash table inefficient. In this thread the goal
is efficiency; a party working toward an opposing goal is an
adversary.

If you find real-world data sets that tend to induce bad-case
behavior in Python's hash, please do tell. It would be reason
enough to adjust the hash function. The hashes in popular
software such as Python are already quite well vetted.

Even a hash function that behaves as a random oracle has
worst-case quadratic-time in the algorithm here, but that's
an issue in theory and not in serving customers.

Hey Net:

Where -we- left off, all real-world data sets were equally common,
unless you think in peckbird bartenders. Nuh uh at all, unless you've
got a characterize real-world data project underway. Did you know
letter frequency distributions vary by domain? And now for the 2-
squarebar spider.

raise ISay( "Python Science is unusually prone against non-power
multiples of 32 hashing integers", "Unknown value of I" )

parse( 'strangers are perfect' )
parse( 'i think i know someone' )
prevention.value() == 12* cure.value()
raise SolveFor
(raise CureValueSterling)
keep: time.value().encode() > money.value()
assert time.value().encode()== time.encode().value()
perch( Out )

compose( Fluctuate )
compose( Punctuate )
compose( Explain )

@partial( partial, prefix ) (ha, h)
raise Distracts( 'Well, would you look at that?' )
raise LivingConstraint( NoseFalls )
raise Superlative
voice Absolute
raise NonScalar( "Fit: It's apples -over- oranges." )
raise OverPull( "Not on Google." )
raise Drop( "It." )
moment+= instant
raise LeavesImpression
raise GoodMuleTwoHaystacks
raise ScaleDecomposition
raise Bias( 'Discrimination' )
raise BadFor
raise RequestedControlCurrencyExpiration
raise InsufficientTrust
raise Shock
raise EnoughTheory
raise EnoughCustomers
raise EyebrowTwo
raise DataKill
raise Reckless
raise ThenWhat( AfterYou )
raise NotImplemented( 'Offer.refuse' )
raise LaughterUnacceptable
raise Hell
raise DesignUnintelligent
raise HashesPerturbation
raise ProfessionException
yield Tick
yield Tally
yield Ho
yield Gee
yield Whoa
raise ActuallyActuallyActually
raise CurrencyAStream

If you can't fit an analogy...

Bee Dance : Hive Good :: Human This :: Community Good?
What and how is it when to who where?
What does it know? What does it do but sleep?
Does it specialize, momentarily fair?

Hey, what do you think of the spelling of this? "The group you are
posting to is a Usenet group. Messages posted to this group will make
your email address visible to anyone on the Internet." What do you
think of the wording of "spelling"? Take me to your lexicographer:
wake-up; you're up web, Europe web. Pass self to a generator.

Dust is sett'ling;
Tables turn:
Water's drying;
Sugars burn.

Phaser fire!
Shoulders square!
join( TenForward )?
Who is where?

Mushroom's growing;
Mighty... falls.
Cauldron's bubbling;
That is all.
... n= True
... while 1:
... k= yield( n )
... if k is not None: n= k
...
 
Bryan Olson

Arnaud said:
There are situations where this can happen, I guess.

Cormen-Leiserson-Rivest /Introduction to Algorithms/ discusses
it in section 12.3.3 (first edition). They suggest a family of
hash functions, where an individual hash is chosen at random
when creating the table. That still doesn't beat adversaries
who can get on-line timing information and figure out when
they induce bad cases.

More generally, security researchers have started to look at
"algorithmic complexity denial of service attacks".
http://www.cs.rice.edu/~scrosby/hash/
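
(As an illustration of the randomly-chosen-hash idea only, not of how
Python's dict or set actually work: a table can pick a random salt when it
is created, so keys that collide under one salt probably don't under
another. CPython later addressed the linked attack with string-hash
randomization, enabled via PYTHONHASHSEED.)

import random

class SaltedKey(object):
    # Wrap a key so its hash depends on a per-table random salt.
    def __init__(self, key, salt):
        self.key = key
        self.salt = salt
    def __eq__(self, other):
        return self.key == other.key
    def __hash__(self):
        return hash((self.salt, self.key))

salt = random.getrandbits(64)
table = set(SaltedKey(w, salt) for w in ["spam", "eggs", "spam"])
print len(table)   # -> 2; equal keys still match, but an attacker
                   # can't precompute colliding hashes without the salt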

As Python allows you to make your own hash functions for your own
classes, another danger is that you are dealing with objects with a
bad hash function. I've actually tried it (as I am *not* a pro!) and
FWIW here are the results.

-------- badhash.py ----------

class BadHash(object):
    def __hash__(self):
        return 1 # that's a bad hash function!

The sorting algorithm can have the same problem. With user-defined
methods, there's no telling how long __cmp__ will actually take.
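
(A contrived illustration of that point, mine rather than the thread's: a
deliberately slow __cmp__ makes sorting crawl, just as the deliberately bad
__hash__ above makes the set crawl.)

import time

class SlowCmp(object):
    def __init__(self, n):
        self.n = n
    def __cmp__(self, other):
        time.sleep(0.001)                 # pretend each comparison is expensive
        return cmp(self.n, other.n)

items = [SlowCmp(n) for n in [3, 1, 4, 1, 5, 9, 2, 6]]
t0 = time.time()
items.sort()                              # few comparisons, but each one is slow
print "sorted in %.3f seconds" % (time.time() - t0)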
 
