Grouping pairs - suggested tools

Astley Le Jasper · Sep 20, 2010

I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

Alf P. Steinbach /Usenet · Sep 20, 2010

* Astley Le Jasper, on 20.09.2010 23:42:

I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

It seems to be the same problem as "equivalence sets".

This problem was solved early on because e.g. Fortran compilers had to construct
such sets (equivalence partitions of a set).

I though I'd just say "Google it!", because I know there's a standard algorithm
but I can't recall the name. However, googling it I found no mention of that
algorithm. Not even in the Wikipedia articles on equivalence sets.

A number of approaches spring to mind:

Approach 1:
Multi-pass. Originally you assign a distinct set number to each symbol.
In each pass through the symbols you replace one number with another as
per one of the equivalence specs. Stop when no replacements are done.

Approach 2:
Merging. For each new equivalence A=B you check whether A has been assigned
to a set, if not you create a new one, call that S1, and ditto for B, S2.
Merge S1 and S2, e.g. move all elements of S2 to S1.

Approach 3:
In a first pass convert the data to more explicit form, linking each symbol
to the symbols it's directly equivalent to. Then in second pass simply drag
out each equivalence group (recurse on each symbol's list of equivalences).

Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
but I think approach 1 may be a practical option for a small number of symbols.

Cheers & hth.,

- Alf

Alf P. Steinbach /Usenet · Sep 21, 2010

* Alf P. Steinbach /Usenet, on 21.09.2010 01:09:

* Astley Le Jasper, on 20.09.2010 23:42:

I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

Click to expand...

It seems to be the same problem as "equivalence sets".

This problem was solved early on because e.g. Fortran compilers had to construct
such sets (equivalence partitions of a set).

I though I'd just say "Google it!", because I know there's a standard algorithm
but I can't recall the name. However, googling it I found no mention of that
algorithm. Not even in the Wikipedia articles on equivalence sets.

A number of approaches spring to mind:

Approach 1:
Multi-pass. Originally you assign a distinct set number to each symbol.
In each pass through the symbols you replace one number with another as
per one of the equivalence specs. Stop when no replacements are done.

Approach 2:
Merging. For each new equivalence A=B you check whether A has been assigned
to a set, if not you create a new one, call that S1, and ditto for B, S2.
Merge S1 and S2, e.g. move all elements of S2 to S1.

Approach 3:
In a first pass convert the data to more explicit form, linking each symbol
to the symbols it's directly equivalent to. Then in second pass simply drag
out each equivalence group (recurse on each symbol's list of equivalences).

Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
but I think approach 1 may be a practical option for a small number of symbols.

Uhm, thinking about it (it must have been my unconscious mind doing the work, it
just popped into my head), if you first sort each individual equivalence
relation so that you never have e.g. C=A but only A=C and so on, and then sort
the list of equivalences, then it should reduce to walking the list and just
starting a new set whenever a symbol is encountered that isn't yet in a set.

<code language="Py3">
class Attributes: pass

sym = Attributes()
for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
setattr( sym, name, name )

eq_specs = [
(sym.a, sym.d),
(sym.b, sym.a),
(sym.b, sym.c),
(sym.b, sym.d),
(sym.c, sym.d),
(sym.c, sym.a),
(sym.e, sym.f),
(sym.e, sym.g),
(sym.f, sym.g),
(sym.h, sym.i),
]

equalities = []
for eq in eq_specs:
sorted_eq = eq if eq[0] <= eq[1] else (eq[1], eq[0])
equalities.append( sorted_eq )
equalities.sort()

eq_sets = {}
eq_set_ids = {}
current_eq_set_id = 0
for eq in equalities:
if eq_set_ids.get( eq[0] ) is None:
current_eq_set_id += 1
eq_set_ids[eq[0]] = current_eq_set_id
eq_sets[current_eq_set_id] = set( eq[0] )
eq_set_id = eq_set_ids[eq[0]]
eq_set_ids[eq[1]] = eq_set_id
eq_sets[eq_set_id].add( eq[1] )

for eq_set_id in eq_sets.keys():
for sym_name in eq_sets[eq_set_id]:
print( "{}, {}".format( eq_set_id, sym_name ) )
</code>

<output>
1, a
1, c
1, b
1, d
2, e
2, g
2, f
3, i
3, h
</output>

Disclaimer: for me it's pretty late in day/night.

Cheers & hth.,

- Alf

Arnaud Delobelle · Sep 21, 2010

I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

What you are doing is finding the connected components of a graph.
There aren't any tools in the standard library to do this. But for
example python-graph [1] has a connected_components function.
Probably other packages will. If you don't want a dependency, it is
not too hard to implement if you think about it.

[1] http://code.google.com/p/python-graph/

Astley Le Jasper · Sep 21, 2010

Thanks for the feedback both. I'll have a look at these.

One of the frustrating things I find about python is that there are so
many modules. There have been times when I've spend ages doing
something and then some has said, "Oh yeah, there is a module for
that".

Anyway. Nearly finished.

Cheers

ALJ

Arnaud Delobelle · Sep 21, 2010

* Alf P. Steinbach /Usenet, on 21.09.2010 01:09:

* Astley Le Jasper, on 20.09.2010 23:42:

I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.
Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and
Sorry for being so vague.
Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]
Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

Click to expand...

Click to expand...

It seems to be the same problem as "equivalence sets".

Click to expand...

This problem was solved early on because e.g. Fortran compilers had to construct
such sets (equivalence partitions of a set).

Click to expand...

I though I'd just say "Google it!", because I know there's a standard algorithm
but I can't recall the name. However, googling it I found no mention of that
algorithm. Not even in the Wikipedia articles on equivalence sets.

Click to expand...

A number of approaches spring to mind:

Click to expand...

Approach 1:
Multi-pass. Originally you assign a distinct set number to each symbol.
In each pass through the symbols you replace one number with another as
per one of the equivalence specs. Stop when no replacements are done.

Click to expand...

Approach 2:
Merging. For each new equivalence A=B you check whether A has been assigned
to a set, if not you create a new one, call that S1, and ditto for B, S2.
Merge S1 and S2, e.g. move all elements of S2 to S1.

Click to expand...

Approach 3:
In a first pass convert the data to more explicit form, linking each symbol
to the symbols it's directly equivalent to. Then in second pass simply drag
out each equivalence group (recurse on each symbol's list of equivalences).

Click to expand...

Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
but I think approach 1 may be a practical option for a small number of symbols.

Click to expand...

Uhm, thinking about it (it must have been my unconscious mind doing the work, it
just popped into my head), if you first sort each individual equivalence
relation so that you never have e.g. C=A but only A=C and so on, and then sort
the list of equivalences, then it should reduce to walking the list and just
starting a new set whenever a symbol is encountered that isn't yet in a set.

<code language="Py3">
class Attributes: pass

sym = Attributes()
for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
setattr( sym, name, name )

eq_specs = [
(sym.a, sym.d),
(sym.b, sym.a),
(sym.b, sym.c),
(sym.b, sym.d),
(sym.c, sym.d),
(sym.c, sym.a),
(sym.e, sym.f),
(sym.e, sym.g),
(sym.f, sym.g),
(sym.h, sym.i),
]

equalities = []
for eq in eq_specs:
sorted_eq = eq if eq[0] <= eq[1] else (eq[1], eq[0])
equalities.append( sorted_eq )
equalities.sort()

eq_sets = {}
eq_set_ids = {}
current_eq_set_id = 0
for eq in equalities:

This would make the body of the loop easier to read:

for x, y in equalities:

if eq_set_ids.get( eq[0] ) is None:

Why not simply:

if eq[0] in eq_set_ids:

current_eq_set_id += 1
eq_set_ids[eq[0]] = current_eq_set_id
eq_sets[current_eq_set_id] = set( eq[0] )
eq_set_id = eq_set_ids[eq[0]]
eq_set_ids[eq[1]] = eq_set_id
eq_sets[eq_set_id].add( eq[1] )

for eq_set_id in eq_sets.keys():
for sym_name in eq_sets[eq_set_id]:
print( "{}, {}".format( eq_set_id, sym_name ) )
</code>

<output>
1, a
1, c
1, b
1, d
2, e
2, g
2, f
3, i
3, h
</output>

Disclaimer: for me it's pretty late in day/night.

Cheers & hth.,

- Alf

I think this won't work with the following graph:

eq_specs = [('a', 'c'), ('b', 'd'), ('c', 'd')]

Note that it is already sorted according to your sorting method. I
don't have a Python machine to check this but I believe it will
output:

1, a
1, c
1, d
2, b

The flaw is that you fail to consider that the two vertices in the
current pair may already be part of a (partial) connected component.

Alf P. Steinbach /Usenet · Sep 21, 2010

* Arnaud Delobelle, on 21.09.2010 11:13:

* Alf P. Steinbach /Usenet, on 21.09.2010 01:09:

* Astley Le Jasper, on 20.09.2010 23:42:
I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Click to expand...

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Click to expand...

Sorry for being so vague.

Click to expand...

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Click to expand...

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

Click to expand...

It seems to be the same problem as "equivalence sets".

Click to expand...

This problem was solved early on because e.g. Fortran compilers had to construct
such sets (equivalence partitions of a set).

Click to expand...

I though I'd just say "Google it!", because I know there's a standard algorithm
but I can't recall the name. However, googling it I found no mention of that
algorithm. Not even in the Wikipedia articles on equivalence sets.

Click to expand...

A number of approaches spring to mind:

Click to expand...

Approach 1:
Multi-pass. Originally you assign a distinct set number to each symbol.
In each pass through the symbols you replace one number with another as
per one of the equivalence specs. Stop when no replacements are done.

Click to expand...

Approach 2:
Merging. For each new equivalence A=B you check whether A has been assigned
to a set, if not you create a new one, call that S1, and ditto for B, S2.
Merge S1 and S2, e.g. move all elements of S2 to S1.

Click to expand...

Approach 3:
In a first pass convert the data to more explicit form, linking each symbol
to the symbols it's directly equivalent to. Then in second pass simply drag
out each equivalence group (recurse on each symbol's list of equivalences).

Click to expand...

Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
but I think approach 1 may be a practical option for a small number of symbols.

Click to expand...

Uhm, thinking about it (it must have been my unconscious mind doing the work, it
just popped into my head), if you first sort each individual equivalence
relation so that you never have e.g. C=A but only A=C and so on, and then sort
the list of equivalences, then it should reduce to walking the list and just
starting a new set whenever a symbol is encountered that isn't yet in a set.

<code language="Py3">
class Attributes: pass

sym = Attributes()
for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
setattr( sym, name, name )

eq_specs = [
(sym.a, sym.d),
(sym.b, sym.a),
(sym.b, sym.c),
(sym.b, sym.d),
(sym.c, sym.d),
(sym.c, sym.a),
(sym.e, sym.f),
(sym.e, sym.g),
(sym.f, sym.g),
(sym.h, sym.i),
]

equalities = []
for eq in eq_specs:
sorted_eq = eq if eq[0]<= eq[1] else (eq[1], eq[0])
equalities.append( sorted_eq )
equalities.sort()

eq_sets = {}
eq_set_ids = {}
current_eq_set_id = 0
for eq in equalities:

Click to expand...

This would make the body of the loop easier to read:

for x, y in equalities:

if eq_set_ids.get( eq[0] ) is None:

Click to expand...

Why not simply:

if eq[0] in eq_set_ids:

current_eq_set_id += 1
eq_set_ids[eq[0]] = current_eq_set_id
eq_sets[current_eq_set_id] = set( eq[0] )
eq_set_id = eq_set_ids[eq[0]]
eq_set_ids[eq[1]] = eq_set_id
eq_sets[eq_set_id].add( eq[1] )

for eq_set_id in eq_sets.keys():
for sym_name in eq_sets[eq_set_id]:
print( "{}, {}".format( eq_set_id, sym_name ) )
</code>

<output>
1, a
1, c
1, b
1, d
2, e
2, g
2, f
3, i
3, h
</output>

Disclaimer: for me it's pretty late in day/night.

Cheers& hth.,

- Alf

Click to expand...

I think this won't work with the following graph:

eq_specs = [('a', 'c'), ('b', 'd'), ('c', 'd')]

Note that it is already sorted according to your sorting method. I
don't have a Python machine to check this but I believe it will
output:

1, a
1, c
1, d
2, b

The flaw is that you fail to consider that the two vertices in the
current pair may already be part of a (partial) connected component.

Yeah, thanks.

I think the three general methods I listed will nicely work, though.

Moral: don't post insights from dream-world... ;-)

Cheers,

- Alf

Peter Otten · Sep 21, 2010

Astley said:
I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

A straightforward implementation:

$ cat group_edges.py
def find_groups(edges):
lookup = {} # node --> group
groups = {} # id(group) --> group
for a, b in edges:
if a in lookup:
if b in lookup:
ga = lookup[a]
gb = lookup
if ga is not gb:
# merge groups
if len(ga) < len(gb):
# always update the smaller group
ga, gb = gb, ga
del groups[id(gb)]
ga.update(gb)
for k in gb:
lookup[k] = ga
else:
# add b to a's group
ga = lookup[a]
ga.add(b)
lookup = ga
elif b in lookup:
# add a to b's group
gb = lookup
gb.add(a)
lookup[a] = gb
else:
# create new group
g = set((a, b))
groups[id(g)] = g
lookup[a] = lookup = g
return groups.values()

def show_groups(edges):
groups = sorted(sorted(g) for g in find_groups(edges))
for i, group in enumerate(groups, 1):
for node in sorted(group):
print i, node

if __name__ == "__main__":
edges = [
('a', 'b'),
('a', 'c'),
('a', 'd'),
('b', 'c'),
('b', 'd'),
('c', 'd'),
('e', 'f'),
('e', 'g'),
('f', 'g'),
('h', 'i'),
("lonesome john", "lonesome john")]

show_groups(edges)
print

show_groups([('a', 'c'), ('b', 'd'), ('c', 'd')])

$ python group_edges.py
1 a
1 b
1 c
1 d
2 e
2 f
2 g
3 h
3 i
4 lonesome john

1 a
1 b
1 c
1 d

Untested.

Peter

Astley Le Jasper · Sep 21, 2010

Thanks all. I was playing around with this and came up with another
solution using dictionaries (... comfort zone ...!!)

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Code
from operator import itemgetter

def group_stuff(data):

group_dic = {}
group_id = 0

for line in data:
i = 0
for ref in line:
if group_dic.get(ref) is None:
if group_dic.get(line[1-i]) is not None:
group_id = group_dic[line[1-i]]
else:
group_id +=1
group_dic[ref] = group_id
i+=1

group_list = []
for id, group in sorted(group_dic.items(), key=itemgetter(1,0)):
group_list.append((group, id))

return group_list

if __name__ == '__main__':
data = [('a','b'),('a','c'),('a','d'),('b','c'),('b','d'),
('c','d'),('e','f'),('e','g'),('f','g'),('h','i')]
grouped = group_stuff(data)
print grouped

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Code
Output: [(1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (2, 'e'), (2, 'f'),
(2, 'g'), (3, 'h'), (3, 'i')]

Peter Otten · Sep 21, 2010

Astley said:
Thanks all. I was playing around with this and came up with another
solution using dictionaries (... comfort zone ...!!)

from operator import itemgetter

def group_stuff(data):

group_dic = {}
group_id = 0

for line in data:
i = 0
for ref in line:
if group_dic.get(ref) is None:
if group_dic.get(line[1-i]) is not None:
group_id = group_dic[line[1-i]]
else:
group_id +=1
group_dic[ref] = group_id
i+=1

group_list = []
for id, group in sorted(group_dic.items(), key=itemgetter(1,0)):
group_list.append((group, id))

return group_list

if __name__ == '__main__':
data = [('a','b'),('a','c'),('a','d'),('b','c'),('b','d'),
('c','d'),('e','f'),('e','g'),('f','g'),('h','i')]
grouped = group_stuff(data)
print grouped

Output: [(1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (2, 'e'), (2, 'f'),
(2, 'g'), (3, 'h'), (3, 'i')]

I think you have the same bug as Alf's code, you never merge existing
groups. Have you tried Arnaud's counterexample?

By the way, are ('a', 'b') and ('b', 'a') to be considered equivalent for
your problem?

Peter

Astley Le Jasper · Sep 21, 2010

I think you have the same bug as Alf's code, you never merge existing
groups. Have you tried Arnaud's counterexample?

By the way, are ('a', 'b') and ('b', 'a') to be considered equivalent for
your problem?

Peter

Hi Peter,

Yes. I realise that this doesn't take into account existing
relationships/groupings. For the particular situation that I am
looking at this is unlikely to happen and/or not critical. However, I
understand that generally you would want to potentially assign an item
to one or more groups.

In my case the ('a','b') and ('b','a') are equivalent. Perhaps for
anyone else looking at this, I can elaborate on the problem to make it
a bit more concrete.

I have a very long listing of customer details and am trying to clean
the data. In particular I am looking for duplicates:

<< Core Data >>
id, Name, Address
a, Acme Ltd, 1 Main Street
b, Acme Limited, 1 Main St
c, Acme L'td, 1 Main Street
d, Smiths, 22 Upper Road
e, Smyths, 22 Upper Rd
f, Smiths ltd, 22 Upperrd
g, Apple Empire, 222 Lower Way
h, Apple Emp, 222 Lower Way

Obviously this is oversimplified. The actual dataset has thousands of
records. I am using the difflib module and comparing each item against
all those below it, and where the items are similar they are stored in
a paired table

<< Paired Data >>
id1, id2, relationship_strength
a, b, 0.8
a, c, 0.88
b, c, 0.8
d, e, 0.75
d, f, 0.88
e, f, 0.87
g, h, 0.77

However, these pairing aren't so easy to read and I want to include
the full address and account information so it's easier for the lucky
person who is going to clean the data. And this is where the grouping
cluster comes in.

<< Grouped Data >>
group_id, id
1, a
1, b
1, c
2, d
2, e
2, f
3, g
3, h

So in my situation those records that are very similar to each other
will be clustered together.

Thanks again.

ALJ

Arnaud Delobelle · Sep 21, 2010

On 21 Sep said:
A straightforward implementation:

$ cat group_edges.py
def find_groups(edges):
lookup = {} # node --> group
groups = {} # id(group) --> group
for a, b in edges:
if a in lookup:
if b in lookup:
ga = lookup[a]
gb = lookup
if ga is not gb:
# merge groups
if len(ga) < len(gb):
# always update the smaller group
ga, gb = gb, ga
del groups[id(gb)]
ga.update(gb)
for k in gb:
lookup[k] = ga
else:
# add b to a's group
ga = lookup[a]
ga.add(b)
lookup = ga
elif b in lookup:
# add a to b's group
gb = lookup
gb.add(a)
lookup[a] = gb
else:
# create new group
g = set((a, b))
groups[id(g)] = g
lookup[a] = lookup = g
return groups.values()

def show_groups(edges):
groups = sorted(sorted(g) for g in find_groups(edges))
for i, group in enumerate(groups, 1):
for node in sorted(group):
print i, node

I think I would go for the two-step approach of constructing the graph
first and then recursively building connected components. It sounds
more complicated at first but when you implement it it turns out quite
simple:

from collections import defaultdict
from itertools import count

def build_groups(edges):
neighbors = defaultdict(set)
for x, y in edges:
neighbors[x].add(y)
neighbors[y].add(x)

groups, group_indices = {}, count(1)
def set_group(x, group_index):
groups[x] = group_index
for y in neighbors[x]:
if y not in groups:
set_group(y, group_index)

for x in neighbors:
if x not in groups:
set_group(x, group_indices.next())
return groups

if __name__ == "__main__":
def print_groups(edges):
print " ".join(edges)
groups = build_groups(edges)
for x, i in sorted(groups.items()):
print i, x
print
examples = [
['ab', 'bc', 'ad', 'ef', 'gf', 'hi'],
['ac', 'bd', 'cd'],
]
for edges in examples:
print_groups(edges)

$ python connected.py
ab bc ad ef gf hi
1 a
1 b
1 c
1 d
2 e
2 f
2 g
3 h
3 i

ac bd cd
1 a
1 b
1 c
1 d

Peter Otten · Sep 22, 2010

Arnaud said:
I think I would go for the two-step approach of constructing the graph
first and then recursively building connected components. It sounds
more complicated at first but when you implement it it turns out quite
simple:

from collections import defaultdict
from itertools import count

def build_groups(edges):
neighbors = defaultdict(set)
for x, y in edges:
neighbors[x].add(y)
neighbors[y].add(x)

groups, group_indices = {}, count(1)
def set_group(x, group_index):
groups[x] = group_index
for y in neighbors[x]:
if y not in groups:
set_group(y, group_index)

for x in neighbors:
if x not in groups:
set_group(x, group_indices.next())
return groups

That looks a little less like legwork than mine. My only objection is that
you may hit Python's low recursion limit.

Peter

Arnaud Delobelle · Sep 22, 2010

Arnaud said:
Arnaud said:

I think I would go for the two-step approach of constructing the graph
first and then recursively building connected components. It sounds
more complicated at first but when you implement it it turns out quite
simple:

Click to expand...

from collections import defaultdict
from itertools import count

Click to expand...

def build_groups(edges):
neighbors = defaultdict(set)
for x, y in edges:
neighbors[x].add(y)
neighbors[y].add(x)

Click to expand...

groups, group_indices = {}, count(1)
def set_group(x, group_index):
groups[x] = group_index
for y in neighbors[x]:
if y not in groups:
set_group(y, group_index)

Click to expand...

for x in neighbors:
if x not in groups:
set_group(x, group_indices.next())
return groups

Click to expand...

That looks a little less like legwork than mine. My only objection is that
you may hit Python's low recursion limit.

Peter

True. One could setrecursionlimit() temporarily to work around this.
Or here is an alternative set_group() function:

def set_group(x, group_index):
group = [x]
for y in group:
groups[y] = group_index
group.extend(z for z in neighbors[y] if z not in groups])

It's untested!

MRAB · Sep 22, 2010

I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

[snip]
I thought I'd join in. I've used repr to ensure that I have something
that's hashable:

def show_groups(pairs):
groups = {}
items = {}
for x, y in pairs:
x_repr, y_repr = repr(x), repr(y)
items[x_repr] = x
items[y_repr] = y
x_group = groups.setdefault(x_repr, set([x_repr]))
y_group = groups.setdefault(y_repr, set([y_repr]))
union = set()
for m in x_group | y_group:
union |= groups[m]
for m in union:
groups[m] = union

indexes = {}
for s in groups.values():
indexes.setdefault(frozenset(s), len(indexes) + 1)

indexes = sorted(indexes.items(), key=lambda pair: pair[1])

for s, i in indexes:
for m in sorted(s):
print("{0} {1}".format(i, items[m]))

if __name__ == "__main__":
edges = [
('a', 'b'),
('a', 'c'),
('a', 'd'),
('b', 'c'),
('b', 'd'),
('c', 'd'),
('e', 'f'),
('e', 'g'),
('f', 'g'),
('h', 'i'),
]

show_groups(edges)
print()

show_groups([('a', 'c'), ('b', 'd'), ('c', 'd')])

Python3 string slicing	2	Dec 11, 2023
Blue J Ciphertext Program	2	Nov 22, 2023
Sort and count word pairs in a string	6	Jan 29, 2023
My Status, Ciphertext	2	Nov 27, 2023
How to play corresponding sound?	2	Jun 10, 2023
An empty initializer is invalid for an array with unspecified bound	0	Jul 1, 2020
Reverse search for a website	2	Apr 24, 2024
A real math problem.	5	Jan 2, 2024

Grouping pairs - suggested tools

Astley Le Jasper

Alf P. Steinbach /Usenet

Alf P. Steinbach /Usenet

Arnaud Delobelle

Astley Le Jasper

Arnaud Delobelle

Alf P. Steinbach /Usenet

Peter Otten

Astley Le Jasper

Peter Otten

Astley Le Jasper

Arnaud Delobelle

Peter Otten

Arnaud Delobelle

MRAB

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads