Grouping pairs - suggested tools

A

Astley Le Jasper

I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)
 
A

Alf P. Steinbach /Usenet

* Astley Le Jasper, on 20.09.2010 23:42:
I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

It seems to be the same problem as "equivalence sets".

This problem was solved early on because e.g. Fortran compilers had to construct
such sets (equivalence partitions of a set).

I though I'd just say "Google it!", because I know there's a standard algorithm
but I can't recall the name. However, googling it I found no mention of that
algorithm. Not even in the Wikipedia articles on equivalence sets.

A number of approaches spring to mind:

Approach 1:
Multi-pass. Originally you assign a distinct set number to each symbol.
In each pass through the symbols you replace one number with another as
per one of the equivalence specs. Stop when no replacements are done.

Approach 2:
Merging. For each new equivalence A=B you check whether A has been assigned
to a set, if not you create a new one, call that S1, and ditto for B, S2.
Merge S1 and S2, e.g. move all elements of S2 to S1.

Approach 3:
In a first pass convert the data to more explicit form, linking each symbol
to the symbols it's directly equivalent to. Then in second pass simply drag
out each equivalence group (recurse on each symbol's list of equivalences).

Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
but I think approach 1 may be a practical option for a small number of symbols.


Cheers & hth.,

- Alf
 
A

Alf P. Steinbach /Usenet

* Alf P. Steinbach /Usenet, on 21.09.2010 01:09:
* Astley Le Jasper, on 20.09.2010 23:42:
I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

It seems to be the same problem as "equivalence sets".

This problem was solved early on because e.g. Fortran compilers had to construct
such sets (equivalence partitions of a set).

I though I'd just say "Google it!", because I know there's a standard algorithm
but I can't recall the name. However, googling it I found no mention of that
algorithm. Not even in the Wikipedia articles on equivalence sets.

A number of approaches spring to mind:

Approach 1:
Multi-pass. Originally you assign a distinct set number to each symbol.
In each pass through the symbols you replace one number with another as
per one of the equivalence specs. Stop when no replacements are done.

Approach 2:
Merging. For each new equivalence A=B you check whether A has been assigned
to a set, if not you create a new one, call that S1, and ditto for B, S2.
Merge S1 and S2, e.g. move all elements of S2 to S1.

Approach 3:
In a first pass convert the data to more explicit form, linking each symbol
to the symbols it's directly equivalent to. Then in second pass simply drag
out each equivalence group (recurse on each symbol's list of equivalences).

Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
but I think approach 1 may be a practical option for a small number of symbols.

Uhm, thinking about it (it must have been my unconscious mind doing the work, it
just popped into my head), if you first sort each individual equivalence
relation so that you never have e.g. C=A but only A=C and so on, and then sort
the list of equivalences, then it should reduce to walking the list and just
starting a new set whenever a symbol is encountered that isn't yet in a set.


<code language="Py3">
class Attributes: pass

sym = Attributes()
for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
setattr( sym, name, name )

eq_specs = [
(sym.a, sym.d),
(sym.b, sym.a),
(sym.b, sym.c),
(sym.b, sym.d),
(sym.c, sym.d),
(sym.c, sym.a),
(sym.e, sym.f),
(sym.e, sym.g),
(sym.f, sym.g),
(sym.h, sym.i),
]

equalities = []
for eq in eq_specs:
sorted_eq = eq if eq[0] <= eq[1] else (eq[1], eq[0])
equalities.append( sorted_eq )
equalities.sort()

eq_sets = {}
eq_set_ids = {}
current_eq_set_id = 0
for eq in equalities:
if eq_set_ids.get( eq[0] ) is None:
current_eq_set_id += 1
eq_set_ids[eq[0]] = current_eq_set_id
eq_sets[current_eq_set_id] = set( eq[0] )
eq_set_id = eq_set_ids[eq[0]]
eq_set_ids[eq[1]] = eq_set_id
eq_sets[eq_set_id].add( eq[1] )

for eq_set_id in eq_sets.keys():
for sym_name in eq_sets[eq_set_id]:
print( "{}, {}".format( eq_set_id, sym_name ) )
</code>


<output>
1, a
1, c
1, b
1, d
2, e
2, g
2, f
3, i
3, h
</output>


Disclaimer: for me it's pretty late in day/night.


Cheers & hth.,

- Alf
 
A

Arnaud Delobelle

I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups.  An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

What you are doing is finding the connected components of a graph.
There aren't any tools in the standard library to do this. But for
example python-graph [1] has a connected_components function.
Probably other packages will. If you don't want a dependency, it is
not too hard to implement if you think about it.


[1] http://code.google.com/p/python-graph/
 
A

Astley Le Jasper

Thanks for the feedback both. I'll have a look at these.

One of the frustrating things I find about python is that there are so
many modules. There have been times when I've spend ages doing
something and then some has said, "Oh yeah, there is a module for
that".

Anyway. Nearly finished.

Cheers

ALJ
 
A

Arnaud Delobelle

* Alf P. Steinbach /Usenet, on 21.09.2010 01:09:




* Astley Le Jasper, on 20.09.2010 23:42:
I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.
Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and
Sorry for being so vague.
Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)  
(h,i)]
Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)
It seems to be the same problem as "equivalence sets".
This problem was solved early on because e.g. Fortran compilers had to construct
such sets (equivalence partitions of a set).
I though I'd just say "Google it!", because I know there's a standard algorithm
but I can't recall the name. However, googling it I found no mention of that
algorithm. Not even in the Wikipedia articles on equivalence sets.
A number of approaches spring to mind:
Approach 1:
Multi-pass. Originally you assign a distinct set number to each symbol.
In each pass through the symbols you replace one number with another as
per one of the equivalence specs. Stop when no replacements are done.
Approach 2:
Merging. For each new equivalence A=B you check whether A has been assigned
to a set, if not you create a new one, call that S1, and ditto for B, S2.
Merge S1 and S2, e.g. move all elements of S2 to S1.
Approach 3:
In a first pass convert the data to more explicit form, linking each symbol
to the symbols it's directly equivalent to. Then in second pass simply drag
out each equivalence group (recurse on each symbol's list of equivalences).
Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
but I think approach 1 may be a practical option for a small number of symbols.

Uhm, thinking about it (it must have been my unconscious mind doing the work, it
just popped into my head), if you first sort each individual equivalence
relation so that you never have e.g. C=A but only A=C and so on, and then sort
the list of equivalences, then it should reduce to walking the list and just
starting a new set whenever a symbol is encountered that isn't yet in a set.

<code language="Py3">
class Attributes: pass

sym = Attributes()
for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
     setattr( sym, name, name )

eq_specs = [
     (sym.a, sym.d),
     (sym.b, sym.a),
     (sym.b, sym.c),
     (sym.b, sym.d),
     (sym.c, sym.d),
     (sym.c, sym.a),
     (sym.e, sym.f),
     (sym.e, sym.g),
     (sym.f, sym.g),
     (sym.h, sym.i),
     ]

equalities = []
for eq in eq_specs:
     sorted_eq = eq if eq[0] <= eq[1] else (eq[1], eq[0])
     equalities.append( sorted_eq )
equalities.sort()

eq_sets = {}
eq_set_ids = {}
current_eq_set_id = 0
for eq in equalities:

This would make the body of the loop easier to read:

for x, y in equalities:
     if eq_set_ids.get( eq[0] ) is None:

Why not simply:

if eq[0] in eq_set_ids:
         current_eq_set_id += 1
         eq_set_ids[eq[0]] = current_eq_set_id
         eq_sets[current_eq_set_id] = set( eq[0] )
     eq_set_id = eq_set_ids[eq[0]]
     eq_set_ids[eq[1]] = eq_set_id
     eq_sets[eq_set_id].add( eq[1] )

for eq_set_id in eq_sets.keys():
     for sym_name in eq_sets[eq_set_id]:
         print( "{}, {}".format( eq_set_id, sym_name ) )
</code>

<output>
1, a
1, c
1, b
1, d
2, e
2, g
2, f
3, i
3, h
</output>

Disclaimer: for me it's pretty late in day/night.

Cheers & hth.,

- Alf

I think this won't work with the following graph:

eq_specs = [('a', 'c'), ('b', 'd'), ('c', 'd')]

Note that it is already sorted according to your sorting method. I
don't have a Python machine to check this but I believe it will
output:

1, a
1, c
1, d
2, b

The flaw is that you fail to consider that the two vertices in the
current pair may already be part of a (partial) connected component.
 
A

Alf P. Steinbach /Usenet

* Arnaud Delobelle, on 21.09.2010 11:13:
* Alf P. Steinbach /Usenet, on 21.09.2010 01:09:




* Astley Le Jasper, on 20.09.2010 23:42:
I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.
Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and
Sorry for being so vague.
Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]
Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)
It seems to be the same problem as "equivalence sets".
This problem was solved early on because e.g. Fortran compilers had to construct
such sets (equivalence partitions of a set).
I though I'd just say "Google it!", because I know there's a standard algorithm
but I can't recall the name. However, googling it I found no mention of that
algorithm. Not even in the Wikipedia articles on equivalence sets.
A number of approaches spring to mind:
Approach 1:
Multi-pass. Originally you assign a distinct set number to each symbol.
In each pass through the symbols you replace one number with another as
per one of the equivalence specs. Stop when no replacements are done.
Approach 2:
Merging. For each new equivalence A=B you check whether A has been assigned
to a set, if not you create a new one, call that S1, and ditto for B, S2.
Merge S1 and S2, e.g. move all elements of S2 to S1.
Approach 3:
In a first pass convert the data to more explicit form, linking each symbol
to the symbols it's directly equivalent to. Then in second pass simply drag
out each equivalence group (recurse on each symbol's list of equivalences).
Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
but I think approach 1 may be a practical option for a small number of symbols.

Uhm, thinking about it (it must have been my unconscious mind doing the work, it
just popped into my head), if you first sort each individual equivalence
relation so that you never have e.g. C=A but only A=C and so on, and then sort
the list of equivalences, then it should reduce to walking the list and just
starting a new set whenever a symbol is encountered that isn't yet in a set.

<code language="Py3">
class Attributes: pass

sym = Attributes()
for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
setattr( sym, name, name )

eq_specs = [
(sym.a, sym.d),
(sym.b, sym.a),
(sym.b, sym.c),
(sym.b, sym.d),
(sym.c, sym.d),
(sym.c, sym.a),
(sym.e, sym.f),
(sym.e, sym.g),
(sym.f, sym.g),
(sym.h, sym.i),
]

equalities = []
for eq in eq_specs:
sorted_eq = eq if eq[0]<= eq[1] else (eq[1], eq[0])
equalities.append( sorted_eq )
equalities.sort()

eq_sets = {}
eq_set_ids = {}
current_eq_set_id = 0
for eq in equalities:

This would make the body of the loop easier to read:

for x, y in equalities:
if eq_set_ids.get( eq[0] ) is None:

Why not simply:

if eq[0] in eq_set_ids:
current_eq_set_id += 1
eq_set_ids[eq[0]] = current_eq_set_id
eq_sets[current_eq_set_id] = set( eq[0] )
eq_set_id = eq_set_ids[eq[0]]
eq_set_ids[eq[1]] = eq_set_id
eq_sets[eq_set_id].add( eq[1] )

for eq_set_id in eq_sets.keys():
for sym_name in eq_sets[eq_set_id]:
print( "{}, {}".format( eq_set_id, sym_name ) )
</code>

<output>
1, a
1, c
1, b
1, d
2, e
2, g
2, f
3, i
3, h
</output>

Disclaimer: for me it's pretty late in day/night.

Cheers& hth.,

- Alf

I think this won't work with the following graph:

eq_specs = [('a', 'c'), ('b', 'd'), ('c', 'd')]

Note that it is already sorted according to your sorting method. I
don't have a Python machine to check this but I believe it will
output:

1, a
1, c
1, d
2, b

The flaw is that you fail to consider that the two vertices in the
current pair may already be part of a (partial) connected component.

Yeah, thanks.

I think the three general methods I listed will nicely work, though.

Moral: don't post insights from dream-world... ;-)


Cheers,

- Alf
 
P

Peter Otten

Astley said:
I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

A straightforward implementation:

$ cat group_edges.py
def find_groups(edges):
lookup = {} # node --> group
groups = {} # id(group) --> group
for a, b in edges:
if a in lookup:
if b in lookup:
ga = lookup[a]
gb = lookup
if ga is not gb:
# merge groups
if len(ga) < len(gb):
# always update the smaller group
ga, gb = gb, ga
del groups[id(gb)]
ga.update(gb)
for k in gb:
lookup[k] = ga
else:
# add b to a's group
ga = lookup[a]
ga.add(b)
lookup = ga
elif b in lookup:
# add a to b's group
gb = lookup
gb.add(a)
lookup[a] = gb
else:
# create new group
g = set((a, b))
groups[id(g)] = g
lookup[a] = lookup = g
return groups.values()

def show_groups(edges):
groups = sorted(sorted(g) for g in find_groups(edges))
for i, group in enumerate(groups, 1):
for node in sorted(group):
print i, node

if __name__ == "__main__":
edges = [
('a', 'b'),
('a', 'c'),
('a', 'd'),
('b', 'c'),
('b', 'd'),
('c', 'd'),
('e', 'f'),
('e', 'g'),
('f', 'g'),
('h', 'i'),
("lonesome john", "lonesome john")]

show_groups(edges)
print

show_groups([('a', 'c'), ('b', 'd'), ('c', 'd')])

$ python group_edges.py
1 a
1 b
1 c
1 d
2 e
2 f
2 g
3 h
3 i
4 lonesome john

1 a
1 b
1 c
1 d

Untested.

Peter
 
A

Astley Le Jasper

Thanks all. I was playing around with this and came up with another
solution using dictionaries (... comfort zone ...!!)


<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Code
from operator import itemgetter

def group_stuff(data):

group_dic = {}
group_id = 0

for line in data:
i = 0
for ref in line:
if group_dic.get(ref) is None:
if group_dic.get(line[1-i]) is not None:
group_id = group_dic[line[1-i]]
else:
group_id +=1
group_dic[ref] = group_id
i+=1

group_list = []
for id, group in sorted(group_dic.items(), key=itemgetter(1,0)):
group_list.append((group, id))

return group_list

if __name__ == '__main__':
data = [('a','b'),('a','c'),('a','d'),('b','c'),('b','d'),
('c','d'),('e','f'),('e','g'),('f','g'),('h','i')]
grouped = group_stuff(data)
print grouped

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Code
Output: [(1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (2, 'e'), (2, 'f'),
(2, 'g'), (3, 'h'), (3, 'i')]
 
P

Peter Otten

Astley said:
Thanks all. I was playing around with this and came up with another
solution using dictionaries (... comfort zone ...!!)
from operator import itemgetter

def group_stuff(data):

group_dic = {}
group_id = 0

for line in data:
i = 0
for ref in line:
if group_dic.get(ref) is None:
if group_dic.get(line[1-i]) is not None:
group_id = group_dic[line[1-i]]
else:
group_id +=1
group_dic[ref] = group_id
i+=1

group_list = []
for id, group in sorted(group_dic.items(), key=itemgetter(1,0)):
group_list.append((group, id))

return group_list

if __name__ == '__main__':
data = [('a','b'),('a','c'),('a','d'),('b','c'),('b','d'),
('c','d'),('e','f'),('e','g'),('f','g'),('h','i')]
grouped = group_stuff(data)
print grouped
Output: [(1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (2, 'e'), (2, 'f'),
(2, 'g'), (3, 'h'), (3, 'i')]

I think you have the same bug as Alf's code, you never merge existing
groups. Have you tried Arnaud's counterexample?

By the way, are ('a', 'b') and ('b', 'a') to be considered equivalent for
your problem?

Peter
 
A

Astley Le Jasper

I think you have the same bug as Alf's code, you never merge existing
groups. Have you tried Arnaud's counterexample?

By the way, are ('a', 'b') and ('b', 'a') to be considered equivalent for
your problem?

Peter


Hi Peter,

Yes. I realise that this doesn't take into account existing
relationships/groupings. For the particular situation that I am
looking at this is unlikely to happen and/or not critical. However, I
understand that generally you would want to potentially assign an item
to one or more groups.

In my case the ('a','b') and ('b','a') are equivalent. Perhaps for
anyone else looking at this, I can elaborate on the problem to make it
a bit more concrete.

I have a very long listing of customer details and am trying to clean
the data. In particular I am looking for duplicates:

<< Core Data >>
id, Name, Address
a, Acme Ltd, 1 Main Street
b, Acme Limited, 1 Main St
c, Acme L'td, 1 Main Street
d, Smiths, 22 Upper Road
e, Smyths, 22 Upper Rd
f, Smiths ltd, 22 Upperrd
g, Apple Empire, 222 Lower Way
h, Apple Emp, 222 Lower Way

Obviously this is oversimplified. The actual dataset has thousands of
records. I am using the difflib module and comparing each item against
all those below it, and where the items are similar they are stored in
a paired table

<< Paired Data >>
id1, id2, relationship_strength
a, b, 0.8
a, c, 0.88
b, c, 0.8
d, e, 0.75
d, f, 0.88
e, f, 0.87
g, h, 0.77

However, these pairing aren't so easy to read and I want to include
the full address and account information so it's easier for the lucky
person who is going to clean the data. And this is where the grouping
cluster comes in.

<< Grouped Data >>
group_id, id
1, a
1, b
1, c
2, d
2, e
2, f
3, g
3, h

So in my situation those records that are very similar to each other
will be clustered together.

Thanks again.

ALJ
 
A

Arnaud Delobelle

A straightforward implementation:

$ cat group_edges.py
def find_groups(edges):
    lookup = {} # node --> group
    groups = {} # id(group) --> group
    for a, b in edges:              
        if a in lookup:              
            if b in lookup:          
                ga = lookup[a]      
                gb = lookup      
                if ga is not gb:    
                    # merge groups  
                    if len(ga) < len(gb):
                        # always update the smaller group
                        ga, gb = gb, ga                  
                    del groups[id(gb)]                  
                    ga.update(gb)                        
                    for k in gb:                        
                        lookup[k] = ga                  
            else:                                        
                # add b to a's group                    
                ga = lookup[a]
                ga.add(b)
                lookup = ga
        elif b in lookup:
            # add a to b's group
            gb = lookup
            gb.add(a)
            lookup[a] = gb
        else:
            # create new group
            g = set((a, b))
            groups[id(g)] = g
            lookup[a] = lookup = g
    return groups.values()

def show_groups(edges):
    groups = sorted(sorted(g) for g in find_groups(edges))
    for i, group in enumerate(groups, 1):
        for node in sorted(group):
            print i, node


I think I would go for the two-step approach of constructing the graph
first and then recursively building connected components. It sounds
more complicated at first but when you implement it it turns out quite
simple:

from collections import defaultdict
from itertools import count

def build_groups(edges):
neighbors = defaultdict(set)
for x, y in edges:
neighbors[x].add(y)
neighbors[y].add(x)

groups, group_indices = {}, count(1)
def set_group(x, group_index):
groups[x] = group_index
for y in neighbors[x]:
if y not in groups:
set_group(y, group_index)

for x in neighbors:
if x not in groups:
set_group(x, group_indices.next())
return groups

if __name__ == "__main__":
def print_groups(edges):
print " ".join(edges)
groups = build_groups(edges)
for x, i in sorted(groups.items()):
print i, x
print
examples = [
['ab', 'bc', 'ad', 'ef', 'gf', 'hi'],
['ac', 'bd', 'cd'],
]
for edges in examples:
print_groups(edges)


$ python connected.py
ab bc ad ef gf hi
1 a
1 b
1 c
1 d
2 e
2 f
2 g
3 h
3 i

ac bd cd
1 a
1 b
1 c
1 d
 
P

Peter Otten

Arnaud said:
I think I would go for the two-step approach of constructing the graph
first and then recursively building connected components. It sounds
more complicated at first but when you implement it it turns out quite
simple:

from collections import defaultdict
from itertools import count

def build_groups(edges):
neighbors = defaultdict(set)
for x, y in edges:
neighbors[x].add(y)
neighbors[y].add(x)

groups, group_indices = {}, count(1)
def set_group(x, group_index):
groups[x] = group_index
for y in neighbors[x]:
if y not in groups:
set_group(y, group_index)

for x in neighbors:
if x not in groups:
set_group(x, group_indices.next())
return groups

That looks a little less like legwork than mine. My only objection is that
you may hit Python's low recursion limit.

Peter
 
A

Arnaud Delobelle

Arnaud said:
I think I would go for the two-step approach of constructing the graph
first and then recursively building connected components.  It sounds
more complicated at first but when you implement it it turns out quite
simple:
from collections import defaultdict
from itertools import count
def build_groups(edges):
    neighbors = defaultdict(set)
    for x, y in edges:
        neighbors[x].add(y)
        neighbors[y].add(x)
    groups, group_indices = {}, count(1)
    def set_group(x, group_index):
        groups[x] = group_index
        for y in neighbors[x]:
            if y not in groups:
                set_group(y, group_index)
    for x in neighbors:
        if x not in groups:
            set_group(x, group_indices.next())
    return groups

That looks a little less like legwork than mine. My only objection is that
you may hit Python's low recursion limit.

Peter

True. One could setrecursionlimit() temporarily to work around this.
Or here is an alternative set_group() function:

def set_group(x, group_index):
group = [x]
for y in group:
groups[y] = group_index
group.extend(z for z in neighbors[y] if z not in groups])

It's untested!
 
M

MRAB

I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.
[snip]
I thought I'd join in. I've used repr to ensure that I have something
that's hashable:

def show_groups(pairs):
groups = {}
items = {}
for x, y in pairs:
x_repr, y_repr = repr(x), repr(y)
items[x_repr] = x
items[y_repr] = y
x_group = groups.setdefault(x_repr, set([x_repr]))
y_group = groups.setdefault(y_repr, set([y_repr]))
union = set()
for m in x_group | y_group:
union |= groups[m]
for m in union:
groups[m] = union

indexes = {}
for s in groups.values():
indexes.setdefault(frozenset(s), len(indexes) + 1)

indexes = sorted(indexes.items(), key=lambda pair: pair[1])

for s, i in indexes:
for m in sorted(s):
print("{0} {1}".format(i, items[m]))

if __name__ == "__main__":
edges = [
('a', 'b'),
('a', 'c'),
('a', 'd'),
('b', 'c'),
('b', 'd'),
('c', 'd'),
('e', 'f'),
('e', 'g'),
('f', 'g'),
('h', 'i'),
]

show_groups(edges)
print()

show_groups([('a', 'c'), ('b', 'd'), ('c', 'd')])
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,479
Members
44,899
Latest member
RodneyMcAu

Latest Threads

Top