finding items that occur more than once in a list

John Machin

def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
            filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
    return [dups[0]] + [s[1] for s in
            filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
Argh! What's wrong with something like:
def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k

Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
as opposed to ours which are O(N log N).

I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.
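
(Raymond Hettinger's actual code isn't quoted anywhere in this thread; for
context, a set-based duplicate finder along the lines being discussed might
look like the sketch below. The function name is mine, not his.)

def duplicates_via_sets(lst):
    # One pass: 'seen' holds every element met so far,
    # 'dups' collects anything met at least twice.
    seen = set()
    dups = set()
    for item in lst:
        if item in seen:
            dups.add(item)
        else:
            seen.add(item)
    return list(dups)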
 
sturlamolden

I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.

Is a Python set implemented using a hash table?
 
Justin Bozonier

def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
            filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
    return [dups[0]] + [s[1] for s in
            filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
Argh! What's wrong with something like:
def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k
Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
as opposed to ours which are O(N log N).

I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.

It's not as much O(N)... Paul Robin's uses a sort first which is
definitely not O(N). Paul's could be prettied up a bit but the general
principle is sound.
 
John Machin

def nonunique(lst):
    slst = sorted(lst)
    dups = [s[0] for s in
            filter(lambda t: t[0] == t[1], zip(slst[:-1], slst[1:]))]
    return [dups[0]] + [s[1] for s in
            filter(lambda t: t[0] != t[1], zip(dups[:-1], dups[1:]))]
Argh! What's wrong with something like:
def duplicates(l):
    i = j = object()
    for k in sorted(l):
        if i != j == k: yield k
        i, j = j, k
Nice, and more readable. But I'd use Paul Robin's solution. It is O(N)
as opposed to ours which are O(N log N).
I'd use Raymond Hettinger's solution. It is as much O(N) as Paul's,
and is IMHO more readable than Paul's.

It's not as much O(N)... Paul Robin's uses a sort first which is
definitely not O(N). Paul's could be prettied up a bit but the general
principle is sound.

"""
from collections import defaultdict
def nonunique(lst):
d = defaultdict(int)
for x in lst: d[x] += 1
return [x for x,n in d.iterkeys() if n > 1]
"""

I see no sort here.
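
(For concreteness, a small usage example of that counting approach; this is
my illustration rather than part of the original post, and the order of the
result follows dict iteration order, not input order.)

from collections import defaultdict

def nonunique(lst):
    d = defaultdict(int)
    for x in lst:
        d[x] += 1          # count every occurrence in one pass
    return [x for x, n in d.iteritems() if n > 1]

print nonunique([1, 2, 2, 3, 3, 3, 4])   # -> [2, 3] (order not guaranteed)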
 
sturlamolden

What don't you understand about the comments in the first two
screenfuls of Objects/setobject.c?

I had not looked at it, but now I have. It seems Hettinger is the
author :) OK, so sets are implemented as hash tables. Then I agree,
use Raymond Hettinger's solution.
 
Arnaud Delobelle

This is expected amortized constant time. Is that really the same
thing as average constant time? Hmmm... that's subtle.

I am not sure what the difference between expected amortized time
complexity and average time complexity is (I know that they are
defined as different notions but I don't know whether they reduce to
the same thing or not). Anyway both are average case complexities and
AFAIK worst case time complexity of insertion / lookup in a hashtable
is still O(n).
The amortized doubling breaks that.
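
(One way to see what the doubling buys, as a sketch only and not CPython's
real resize policy: count the element copies a grow-by-doubling table makes
over n insertions. The total stays proportional to n, so the average cost per
insert is constant even though the occasional resizing insert costs O(current
size).)

def total_moves(n, initial_capacity=8):
    # Simplified model: the table doubles whenever it is full,
    # and a resize re-places every element already stored.
    capacity = initial_capacity
    size = moves = 0
    for _ in range(n):
        if size == capacity:
            moves += size        # re-place the existing elements
            capacity *= 2
        moves += 1               # place the new element
        size += 1
    return moves

for n in (1000, 10000, 100000):
    m = total_moves(n)
    print n, m, float(m) / n     # the ratio stays small and bounded (below 3)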


Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".

Hrvoje Niksic provided the link :). I still think two unrelated
things are being compared in this thread when people say that method A
(using dictionaries / sets) is O(n) and method B (sorting the list)
O(nlogn).

Isn't it the case that:

         | Worst case | Average case
---------|------------|-------------
Method A | O(n^2)     | O(n)
Method B | O(nlogn)   | O(nlogn)

Which method is the best would then be determined by the distribution
of the hash values of your data, and whether you want to have a
guarantee the method will have a less-than-quadratic behaviour.
 
Bryan Olson

Arnaud said:
Bryan Olson wrote: [...]
Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".

Hrvoje Niksic provided the link :).

Oops, careless thread-following. Hrvoje Niksic it was.
I still think two unrelated
things are being compared in this thread when people say that method A
(using dictionaries / sets) is O(n) and method B (sorting the list)
O(nlogn).

Isn't it the case that:

         | Worst case | Average case
---------|------------|-------------
Method A | O(n^2)     | O(n)
Method B | O(nlogn)   | O(nlogn)

Which method is the best would then be determined by the distribution
of the hash values of your data, and whether you want to have a
guarantee the method will have a less-than-quadratic behaviour.

If we exclude the case where an adversary is choosing the keys, the
chance of a seriously degenerate case in the hashing method is so
remote that we should not worry about it. Those who insist on
ranking by worst-case behavior have probably abandoned Python long
ago, as it uses those hashed 'dict' things everywhere. Of course
they've also abandoned OS's with demand-paged virtual memory and
such.
 
castironpi

Arnaud said:
Bryan Olson wrote: [...]
Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".
Hrvoje Niksic provided the link :).

Oops, careless thread-following. Hrvoje Niksic it was.
I still think two unrelated
things are being compared in this thread when people say that method A
(using dictionaries / sets) is O(n) and method B (sorting the list)
O(nlogn).
Isn't it the case that:
         | Worst case | Average case
---------|------------|-------------
Method A | O(n^2)     | O(n)
Method B | O(nlogn)   | O(nlogn)
Which method is the best would then be determined by the distribution
of the hash values of your data, and whether you want to have a
guarantee the method will have a less-than-quadratic behaviour.

If we exclude the case where an adversary is choosing the keys, the
chance of a seriously degenerate case in the hashing method is so
remote that we should not worry about it. Those who insist on
ranking by worst-case behavior have probably abandoned Python long
ago, as it uses those hashed 'dict' things everywhere. Of course
they've also abandoned OS's with demand-paged virtual memory and
such.

A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]
 
John Machin

Arnaud said:
Bryan Olson wrote: [...]
Arnaud Delobelle offered a good Wikipedia link, and for more background
look up "amortized analysis".
Hrvoje Niksic provided the link :).
Oops, careless thread-following. Hrvoje Niksic it was.
If we exclude the case where an adversary is choosing the keys, the
chance of a seriously degenerate case in the hashing method is so
remote that we should not worry about it. Those who insist on
ranking by worst-case behavior have probably abandoned Python long
ago, as it uses those hashed 'dict' things everywhere. Of course
they've also abandoned OS's with demand-paged virtual memory and
such.

A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]

[1, 1, 1, 1, 1, 1, 1, 1]

Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."
 
Bryan Olson

John said:
A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]
Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."

Some adversary. What, you mean, my boss or my customers?

We mean that the party supplying the keys deliberately chose
them to make the hash table inefficient. In this thread the goal
is efficiency; a party working toward an opposing goal is an
adversary.

If you find real-world data sets that tend to induce bad-case
behavior in Python's hash, please do tell. It would be reason
enough to adjust the hash function. The hashes in popular
software such as Python are already quite well vetted.

Even a hash function that behaves as a random oracle has
worst-case quadratic-time in the algorithm here, but that's
an issue in theory and not in serving customers.
 
Arnaud Delobelle

John said:
On Mar 22, 1:11 am, (e-mail address removed) wrote:
A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]
Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."
Some adversary.  What, you mean, my boss or my customers?

We mean that the party supplying the keys deliberately chose
them to make the hash table inefficient. In this thread the goal
is efficiency; a party working toward an opposing goal is an
adversary.

There are situations where this can happen, I guess.
If you find real-world data sets that tend to induce bad-case
behavior in Python's hash, please do tell. It would be reason
enough to adjust the hash function. The hashes in popular
software such as Python are already quite well vetted.

As Python allows you to make your own hash functions for your own
classes, another danger is that you are dealing with objects with a
bad hash function. I've actually tried it (as I am *not* a pro!) and
FWIW here are the results.

-------- badhash.py ----------

class BadHash(object):
    def __hash__(self):
        return 1 # that's a bad hash function!

from time import time

def test(n):
    t0 = time()
    # Create a hash table of size n
    s = set(BadHash() for x in xrange(n))
    t1 = time()
    # Add an object to it
    obj = BadHash()
    s.add(obj)
    t2 = time()
    # Find that object
    obj in s
    t3 = time()
    print "%s\t%.3e\t%.3e\t%.3e" % (n, (t1-t0)/n**2, (t2-t1)/n, (t3-t2)/n)

print "n\tcreate/n^2\tadd/n\t\tfind/n"
for k in range(8, 17):
    # I'm hoping that adding an element to a set of size (1 << k) + 1
    # will not trigger a resize of the hash table.
    test((1 << k) + 1)

--------------------------------

marigold:junk arno$ python badhash.py
n      create/n^2  add/n      find/n
257    3.917e-08   6.958e-08  7.051e-08
513    3.641e-08   8.598e-08  7.018e-08
1025   3.522e-08   7.118e-08  6.722e-08
2049   3.486e-08   6.935e-08  6.982e-08
4097   3.480e-08   7.053e-08  6.931e-08
8193   3.477e-08   6.897e-08  6.981e-08
16385  3.441e-08   6.963e-08  7.075e-08
32769  3.720e-08   7.672e-08  7.885e-08
65537  3.680e-08   7.159e-08  7.510e-08

So theory and practice agree! In this case the hash table behaves
just like a linked list: create time, normalised by n^2, stays roughly
constant, so building the set is quadratic, while add and find, normalised
by n, show that a single insert or lookup is linear in the size of the set.

Lastly, if one deals with a totally ordered set of objects that are
not hashable (there isn't a good hash function), then Ninereeds' idea
of sorting first is still useful; see the sketch below.
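
(A sketch of that sorting-based approach, using itertools.groupby; the
function name is mine, and it assumes the objects sort so that equal items
end up adjacent.)

from itertools import groupby

def duplicates_by_sorting(lst):
    # O(n log n) because of the sort; needs comparability, not hashability.
    result = []
    for key, group in groupby(sorted(lst)):
        if sum(1 for _ in group) > 1:   # run of equal items longer than 1
            result.append(key)
    return result

print duplicates_by_sorting([3, 1, 2, 3, 2, 3])   # -> [2, 3]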
 
sturlamolden

Even a hash function that behaves as a random oracle has
worst-case quadratic-time in the algorithm here

In which case inserts are not amortized to O(1).
 
castironpi

John said:
On Mar 22, 1:11 am, (e-mail address removed) wrote:
A collision sequence is not so rare.
[ hash( 2**i ) for i in range( 0, 256, 32 ) ]
[1, 1, 1, 1, 1, 1, 1, 1]
Bryan did qualify his remarks: "If we exclude the case where an
adversary is choosing the keys, ..."
Some adversary.  What, you mean, my boss or my customers?

We mean that the party supplying the keys deliberately chose
them to make the hash table inefficient. In this thread the goal
is efficiency; a party working toward an opposing goal is an
adversary.

If you find real-world data sets that tend to induce bad-case
behavior in Python's hash, please do tell. It would be reason
enough to adjust the hash function. The hashes in popular
software such as Python are already quite well vetted.

Even a hash function that behaves as a random oracle has
worst-case quadratic-time in the algorithm here, but that's
an issue in theory and not in serving customers.

Hey Net:

Where -we- left off, all real-world data sets were equally common,
unless you think in peckbird bartenders. Nuh uh at all, unless you've
got a characterize real-world data project underway. Did you know
letter frequency distributions vary by domain? And now for the 2-
squarebar spider.

raise ISay( "Python Science is unusually prone against non-power
multiples of 32 hashing integers", "Unknown value of I" )

parse( 'strangers are perfect' )
parse( 'i think i know someone' )
prevention.value() == 12* cure.value()
raise SolveFor
(raise CureValueSterling)
keep: time.value().encode() > money.value()
assert time.value().encode()== time.encode().value()
perch( Out )

compose( Fluctuate )
compose( Punctuate )
compose( Explain )

@partial( partial, prefix ) (ha, h)
raise Distracts( 'Well, would you look at that?' )
raise LivingConstraint( NoseFalls )
raise Superlative
voice Absolute
raise NonScalar( "Fit: It's apples -over- oranges." )
raise OverPull( "Not on Google." )
raise Drop( "It." )
moment+= instant
raise LeavesImpression
raise GoodMuleTwoHaystacks
raise ScaleDecomposition
raise Bias( 'Discrimination' )
raise BadFor
raise RequestedControlCurrencyExpiration
raise InsufficientTrust
raise Shock
raise EnoughTheory
raise EnoughCustomers
raise EyebrowTwo
raise DataKill
raise Reckless
raise ThenWhat( AfterYou )
raise NotImplemented( 'Offer.refuse' )
raise LaughterUnacceptable
raise Hell
raise DesignUnintelligent
raise HashesPerturbation
raise ProfessionException
yield Tick
yield Tally
yield Ho
yield Gee
yield Whoa
raise ActuallyActuallyActually
raise CurrencyAStream

If you can't fit an analogy...

Bee Dance : Hive Good :: Human This :: Community Good?
What and how is it when to who where?
What does it know? What does it do but sleep?
Does it specialize, momentarily fair?

Hey, what do you think of the spelling of this? "The group you are
posting to is a Usenet group. Messages posted to this group will make
your email address visible to anyone on the Internet." What do you
think of the wording of "spelling"? Take me to your lexicographer:
wake-up; you're up web, Europe web. Pass self to a generator.

Dust is sett'ling;
Tables turn:
Water's drying;
Sugars burn.

Phaser fire!
Shoulders square!
join( TenForward )?
Who is where?

Mushroom's growing;
Mighty... falls.
Cauldron's bubbling;
That is all.
... n= True
... while 1:
... k= yield( n )
... if k is not None: n= k
...
 
Bryan Olson

Arnaud said:
There are situations where this can happen, I guess.

Cormen-Leiserson-Rivest /Introduction to Algorithms/ discusses
it in section 12.3.3 (first edition). They suggest a family of
hash functions, where an individual hash is chosen at random
when creating the table. That still doesn't beat adversaries
who can get on-line timing information and figure out when
they induce bad cases.

More generally, security researchers have started to look at
"algorithmic complexity denial of service attacks".
http://www.cs.rice.edu/~scrosby/hash/
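
(As an illustration of the randomly-chosen-hash idea only, not of how
Python's dict or set actually work: a table can pick a random salt when it
is created, so keys that collide under one salt probably don't under
another. CPython later addressed the linked attack with string-hash
randomization, enabled via PYTHONHASHSEED.)

import random

class SaltedKey(object):
    # Wrap a key so its hash depends on a per-table random salt.
    def __init__(self, key, salt):
        self.key = key
        self.salt = salt
    def __eq__(self, other):
        return self.key == other.key
    def __hash__(self):
        return hash((self.salt, self.key))

salt = random.getrandbits(64)
table = set(SaltedKey(w, salt) for w in ["spam", "eggs", "spam"])
print len(table)   # -> 2; equal keys still match, but an attacker
                   # can't precompute colliding hashes without the salt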

As Python allows you to make your own hash functions for your own
classes, another danger is that you are dealing with objects with a
bad hash function. I've actually tried it (as I am *not* a pro!) and
FWIW here are the results.

-------- badhash.py ----------

class BadHash(object):
    def __hash__(self):
        return 1 # that's a bad hash function!

The sorting algorithm can have the same problem. With user-defined
methods, there's no telling how long __cmp__ will actually take.
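
(A contrived illustration of that point, mine rather than the thread's: a
deliberately slow __cmp__ makes sorting crawl, just as the deliberately bad
__hash__ above makes the set crawl.)

import time

class SlowCmp(object):
    def __init__(self, n):
        self.n = n
    def __cmp__(self, other):
        time.sleep(0.001)                 # pretend each comparison is expensive
        return cmp(self.n, other.n)

items = [SlowCmp(n) for n in [3, 1, 4, 1, 5, 9, 2, 6]]
t0 = time.time()
items.sort()                              # few comparisons, but each one is slow
print "sorted in %.3f seconds" % (time.time() - t0)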
 
