Clustering the keys of a dict according to its values

bearophileHUGS · Nov 14, 2008

Florian Brucker:

We may assume that all values in the
original dict/list can be used as dict keys.

Are you sure? Is this for school?

Bye,
bearophile

alex23 · Nov 14, 2008

Florian Brucker said:
Given a dictionary, I want to create a clustered version of it,
collecting keys that have the same value:

>>> d = {'a':1, 'b':2, 'c':1, 'd':1, 'e':2, 'f':3}
>>> cluster(d)
{1:['a', 'c', 'd'], 2:['b', 'e'], 3:['f']}

That is, generate a new dict which holds for each value of the old dict
a list of the keys of the old dict that have that very value.

.... if not hasattr(d, 'iteritems'):
.... d = dict((x,y) for x,y in enumerate(d))
.... cluster = defaultdict(list)
.... for key, value in d.iteritems():
.... cluster[value].append(key)
.... return cluster
....

{1: ['a' said:
d1 = {'a':1, 'b':2, 'c':1, 'd':1, 'e':2, 'f':3}
d2 = ['a','a','b','c','b','d','d']
cluster(d1)

Click to expand...

{1: ['a' said:

cluster(d2)

Click to expand...

Click to expand...

defaultdict(<type 'list'>, {'a': [0, 1], 'c': [3], 'b': [2, 4], 'd':
[5, 6]})

defaultdicts act pretty much exactly like dicts, except for the
default-value-for-everything behaviour, but if you'd rather have pure
dicts, just coerce with a dict(cluster) at the end of the function.

My version converts lists to dicts so the same core behaviour will
work. At worst it'll step through a sequence twice, but if it's passed
a dict it'll only do it once, unlike yours' which steps through each
twice no matter what (in order to obtain the unique values you
require).

A side note: if you're using Python2.4+, you can replace your unique()
function with the built-in set().

Arnaud Delobelle · Nov 14, 2008

Hi everybody!

Given a dictionary, I want to create a clustered version of it,
collecting keys that have the same value:
{1:['a', 'c', 'd'], 2:['b', 'e'], 3:['f']}

That is, generate a new dict which holds for each value of the old dict
a list of the keys of the old dict that have that very value.

Another requirement is that it should also work on lists, in that case
with indices instead of keys. We may assume that all values in the
original dict/list can be used as dict keys.

Right now I'm doing it like this:

def cluster(d):
try:
# is d a dict?
values = unique(d.values())
keys = d.keys()
except AttributeError:
# assume d is a list
values = unique(d)
keys = range(len(d))

clusters = {}
for value in values:
clusters[value] = filter(lambda v: d[v] == value, keys)

return clusters

where unique() is from O'Reilly's Python Cookbook and returns a list
containing each item of the given list exactly once.

Now I'm pretty new to Python and chances are rather high that this could
be done prettier and/or more efficient. Therefore any comment on the
above code is greatly appreciated.

Regards,
Florian

Your implementation is pretty inefficient as it iterates over the
whole dictionary as many times as there are distinct values in it.
Here is a simpler solution.

from collections import defaultdict

def cluster(d):
clusters = defaultdict(list)
for key, val in d.iteritems():
clusters[val].append(key)
return clusters

Or without defaultdict:

def cluster(d):
clusters = {}
for key, val in d.iteritems():
clusters.setdefault(val, []).append(key)
return clusters

Florian Brucker · Nov 14, 2008

Hi everybody!

Given a dictionary, I want to create a clustered version of it,
collecting keys that have the same value:
{1:['a', 'c', 'd'], 2:['b', 'e'], 3:['f']}

That is, generate a new dict which holds for each value of the old dict
a list of the keys of the old dict that have that very value.

Another requirement is that it should also work on lists, in that case
with indices instead of keys. We may assume that all values in the
original dict/list can be used as dict keys.

Right now I'm doing it like this:

def cluster(d):
try:
# is d a dict?
values = unique(d.values())
keys = d.keys()
except AttributeError:
# assume d is a list
values = unique(d)
keys = range(len(d))

clusters = {}
for value in values:
clusters[value] = filter(lambda v: d[v] == value, keys)

return clusters

where unique() is from O'Reilly's Python Cookbook and returns a list
containing each item of the given list exactly once.

Now I'm pretty new to Python and chances are rather high that this could
be done prettier and/or more efficient. Therefore any comment on the
above code is greatly appreciated.

Regards,
Florian

bearophileHUGS · Nov 14, 2008

Not much tested:

from collections import defaultdict

def cluster(pairs):
"""

>>> d = {'a':1, 'b':2, 'c':1, 'd':1, 'e':2, 'f':3}
>>> cluster(d) == {1:['a', 'c', 'd'], 2:['b', 'e'], 3:['f']} True
>>> p = [1, 2, 1, 1, 2, 3]
>>> cluster(p) == {1: [0, 2, 3], 2: [1, 4], 3: [5]}

Click to expand...

Click to expand...

True
"""
d = defaultdict(list)
if isinstance(pairs, list):
for k, v in enumerate(pairs):
d[v].append(k)
else:
for k, v in pairs.iteritems():
d[v].append(k)
return d

if __name__ == "__main__":
import doctest
doctest.testmod()
print "Doctests finished.\n"

Bye,
bearophile

bearophileHUGS · Nov 14, 2008

Alternative version:

def cluster(data):
d = defaultdict(list)
pairs = enumerate(data) if isinstance(data, list) else
data.iteritems()
for k, v in pairs:
d[v].append(k)
return d

Bye,
bearophile

alex23 · Nov 14, 2008

Alternative version:

def cluster(data):
d = defaultdict(list)
pairs = enumerate(data) if isinstance(data, list) else
data.iteritems()
for k, v in pairs:
d[v].append(k)
return d

Bye,
bearophile

Very nice, +1.

Florian Brucker · Nov 14, 2008

Are you sure? Is this for school?

Yes, I'm pretty sure (the values of the original dict are integers), and
no, this isn't for school

If the "We may assume ..." made you think
so, I guess that's the way everybody formulates requirements after
having read to many math papers

If it is of interest to you: I'm trying to reproduce a certain sorted
version of the keys of a dict (sorted according to their value) where
the values are not unique, and thus the sorted version is not unique,
either. All I have left of the original sorting is some statistics, so I
can check for every sorted version if it's the same as the original. If
I have the clustering I'm talking about getting all the valid sorted
lists is a matter of concatenating permutations of the clusters.

Regards,
Florian

Bruno Desthuilliers · Nov 14, 2008

Florian Brucker a écrit :

Hi everybody!

Given a dictionary, I want to create a clustered version of it,
collecting keys that have the same value:
{1:['a', 'c', 'd'], 2:['b', 'e'], 3:['f']}

That is, generate a new dict which holds for each value of the old dict
a list of the keys of the old dict that have that very value.
>
Another requirement is that it should also work on lists, in that case
with indices instead of keys. We may assume that all values in the
original dict/list can be used as dict keys.

from collections import defaultdict

def cluster(source):
iteritems = getattr(
source, "iteritems",
lambda : enumerate(source)
)
index = defaultdict(list)
for k, v in iteritems():
index[k].append(v)
return index

HTH

Bruno Desthuilliers · Nov 14, 2008

(e-mail address removed) a écrit :

Alternative version:

def cluster(data):
d = defaultdict(list)
pairs = enumerate(data) if isinstance(data, list) else
data.iteritems()

What is data is another type of sequence or iterable ?-)

for k, v in pairs:
d[v].append(k)
return d

Bye,
bearophile

Mark Wooding · Nov 14, 2008

Florian Brucker said:
That is, generate a new dict which holds for each value of the old
dict a list of the keys of the old dict that have that very value.
Another requirement is that it should also work on lists, in that case
with indices instead of keys. We may assume that all values in the
original dict/list can be used as dict keys.

import itertools

def invert(d):
def hack(d):
try:
return d.iteritems()
except AttributeError:
return itertools.izip(itertools.count(), d)
dd = {}
for k, v in hack(d):
dd.setdefault(v, []).append(k)
return dd

(It's the setdefault trick which is the important part.)

-- [mdw]

Florian Brucker · Nov 15, 2008

Wow, thanks everybody! There's a lot to learn for me from these examples...

Enjoy your weekend!
Florian

Florian said:
Hi everybody!

Given a dictionary, I want to create a clustered version of it,
collecting keys that have the same value:
{1:['a', 'c', 'd'], 2:['b', 'e'], 3:['f']}

That is, generate a new dict which holds for each value of the old dict
a list of the keys of the old dict that have that very value.

Another requirement is that it should also work on lists, in that case
with indices instead of keys. We may assume that all values in the
original dict/list can be used as dict keys.

Right now I'm doing it like this:

def cluster(d):
try:
# is d a dict?
values = unique(d.values())
keys = d.keys()
except AttributeError:
# assume d is a list
values = unique(d)
keys = range(len(d))

clusters = {}
for value in values:
clusters[value] = filter(lambda v: d[v] == value, keys)

return clusters

where unique() is from O'Reilly's Python Cookbook and returns a list
containing each item of the given list exactly once.

Now I'm pretty new to Python and chances are rather high that this could
be done prettier and/or more efficient. Therefore any comment on the
above code is greatly appreciated.

Regards,
Florian

bearophileHUGS · Nov 16, 2008

Bruno Desthuilliers:

What is data is another type of sequence or iterable ?-)<

The original problem statement was:

Florian Brucker:

Given a dictionary, I want to create a clustered version of it, collecting keys that have the same value: [...] Another requirement is that it should also work on lists, in that case with indices instead of keys. We may assume that all values in the original dict/list can be used as dict keys.<

If the problem changes, then the code has to/can change. When you
write code it's better to avoid over-generalization (this is also a
rule in Agile programming practices).

Bye,
bearophile

Gerard flanagan · Nov 16, 2008

Florian said:
Florian said:

Hi everybody!

Given a dictionary, I want to create a clustered version of it,
collecting keys that have the same value:

d = {'a':1, 'b':2, 'c':1, 'd':1, 'e':2, 'f':3}
cluster(d)

Click to expand...

{1:['a', 'c', 'd'], 2:['b', 'e'], 3:['f']}

That is, generate a new dict which holds for each value of the old
dict a list of the keys of the old dict that have that very value.

Another requirement is that it should also work on lists, in that case
with indices instead of keys. We may assume that all values in the
original dict/list can be used as dict keys.

Click to expand...

[...]

> Wow, thanks everybody! There's a lot to learn for me from these examples...
>
>

d = {'a':1, 'b':2, 'c':1, 'd':1, 'e':2, 'f':3}

from itertools import groupby

d2 = dict(
(key, [G[1] for G in g]) for (key, g) in
groupby(
sorted( (val, key) for (key, val) in d.iteritems() ),
lambda X: X[0]
)
)

print d2

{1: ['a', 'c', 'd'], 2: ['b', 'e'], 3: ['f']}

;-) *ducks*

G.

Bruno Desthuilliers · Nov 16, 2008

(e-mail address removed) a écrit :

Bruno Desthuilliers:

The original problem statement was:

I did read it, thanks.

If the problem changes, then the code has to/can change. When you
write code it's better to avoid over-generalization

It's not a problem of "over-generalization", it's a problem of not being
_uselessly_ restrictive. Your solution will fail with any non-list
sequence, for no good reason.

> (this is also a
> rule in Agile programming practices).

The rational behind the YAGNI rule is to avoid wasting time on useless
generalizations. This does not apply here since the more generic
solution doesn't take more time to write or maintain.

Python 3: dict & dict.keys()	4	Jul 24, 2013
Chanelling Guido - dict subclasses	11	Jan 15, 2014
Documentation of dict views change request	2	Jan 19, 2014
List behaviours with Clustering Algorithm	2	Jan 24, 2011
Just curious - I thought dict keys would maintain sequence	0	Mar 18, 2013
set and dict iteration	42	Aug 16, 2012
Duplicate keys in dict?	6	Mar 7, 2010
How to try a range of hex values in C# code ?	0	Nov 19, 2022

Clustering the keys of a dict according to its values

bearophileHUGS

alex23

Arnaud Delobelle

Florian Brucker

bearophileHUGS

bearophileHUGS

alex23

Florian Brucker

Bruno Desthuilliers

Bruno Desthuilliers

Mark Wooding

Florian Brucker

bearophileHUGS

Gerard flanagan

Bruno Desthuilliers

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads