[perl-python] exercise: partition a list by equivalence

B

bearophileHUGS

Bearophile:
I presume the complexity is O(n+a); n (the nodes)
is proportional to the number of pairs, and a
(the arcs) depends on the "intricacy" of the input pairs.

Oops... n (the number of nodes) is the number of different numbers
contained in the pairs :-]

Bearophile
 
D

David Eppstein

David Eppstein:

Is the DFS the best one here (instead of BFS)?

I'm not sure it makes much difference.
With the graph implementation that I've just shown here it becomes:

import graph

def merge(l):
    g = graph.Graph()
    g.addArcs(l)
    return g.connectedComponents()

print merge( [ [1,2], [2,4], [5,6] ] )

I presume the complexity is O(n+a); n (the nodes) is proportional to
the number of pairs, and a (the arcs) depends on the "intricacy" of the
input pairs. For a complete graph, this can become slow (but there is a
high probability that all the pairs are in the same connected component).

It's still linear in the input size, which is the best you could hope
for.
 
X

Xah Lee

The GOTO statement from Perl has been messed up.

This block:
© for group in interm:
©     for newcoup in fin:
©         for k in group.keys():
©             if newcoup.has_key(k):
©                 for kk in group.keys(): newcoup[kk]='x';
©                 break
©             break
©     fin.append(group)
©

should be:

© goto=False
© for group in interm:
©     for newcoup in fin:
©         for k in group.keys():
©             if newcoup.has_key(k):
©                 newcoup.update(group);
©                 goto=True
©                 break
©         if (goto):
©             goto=False
©             break
©     else:
©         fin.append(group)
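
To see the corrected control flow run standalone, here is a minimal
sketch of one pass of the merging phase (hedged: the setup that builds
interm is assumed; interm is taken to be a list of dicts mapping each
member of a tentative group to a dummy value, as in the block above):

# Assumption: interm is a list of dicts, each mapping the members of
# one tentative group to a dummy value (inferred from the block above,
# not from the full script at the URL below).
def merge_phase(interm):
    fin = []
    for group in interm:
        goto = False
        for newcoup in fin:
            for k in group.keys():
                if k in newcoup:           # has_key(k) in the original
                    newcoup.update(group)  # merge group into newcoup
                    goto = True
                    break
            if goto:
                break                      # the emulated GOTO
        else:
            fin.append(group)              # no overlap: start a new group
    return fin

print merge_phase([{1:'x', 2:'x'}, {2:'x', 4:'x'}, {5:'x', 6:'x'}])
# -> [{1:'x', 2:'x', 4:'x'}, {5:'x', 6:'x'}]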

complete code is at:
http://xahlee.org/perl-python/partition_by_equiv.html

Xah
(e-mail address removed)
http://xahlee.org/PageTwo_dir/more.html
 
E

Eric Pederson

From: "Xah Lee said:
The GOTO statement from Perl has been messed up.


hey, I didn't do it!

This block:
© for group in interm:



what do the funny little "©"s stand for?





Eric Pederson
http://www.songzilla.blogspot.com
:::::::::::::::::::::::::::::::::::
domainNot="@something.com"
domainIs=domainNot.replace("s","z")
ePrefix="".join([chr(ord(x)+1) for x in "do"])
mailMeAt=ePrefix+domainIs
:::::::::::::::::::::::::::::::::::
 
B

Brian Beck

Reinhold said:
def merge(pairings):
    sets = {}
    for x1, x2 in pairings:
        newset = (sets.get(x1, frozenset([x1]))
                  | sets.get(x2, frozenset([x2])))
        for i in newset:
            sets[i] = newset

    return [list(aset) for aset in set(sets.itervalues())]


Looks good. I used it as inspiration for this new one, which takes less
time for large datasets, and especially for those where a good number of
merges are expected (at least that looks to be the pattern; with some
large datasets this takes less than a quarter of the time of the one
above). I'm sure it can be improved still -- yours is still faster in
many cases.

def merge2(pairings):
    elements = {}
    sets = []
    for x1, x2 in pairings:
        i = [elements.get(x1, -1), elements.get(x2, -1)]
        i.sort()
        if i[1] == -1:
            i[1] = len(sets)
            sets.append(set([x1, x2]))
        elif i[0] == -1:
            sets[i[1]] |= set([x1, x2])
        elif i[0] == i[1]:
            continue
        else:
            sets[i[1]] |= sets[i[0]]
            sets[i[0]].clear()

        for x in sets[i[1]]:
            elements[x] = i[1]

    return [list(s) for s in sets if s]

# Comparison
import profile
import random

# Play with these values
xMax, nPairs = 1000, 5000

l = [[random.randint(0, xMax), random.randint(0, xMax)]
     for i in range(nPairs)]

print 'merge2:'
profile.run('merge2(l)') # Mine

print 'merge:'
profile.run('merge(l)') # Reinhold's
 
J

John Machin

Reinhold said:
Xah said:
here's the answer to the partition by equivalence exercise.

Your Python solution is, as expected, wrong (try with
[[10, 8], [7, 3], [1, 7], [5, 4], [2, 2], [3, 8], [7, 10], [2, 3], [6, 10], [3, 2]]
for example).

The solution by John Lenton is wrong, too.

The solution by Brian Beck delivers the correct result for most input,
but for some input lists like
[[3, 3], [8, 7], [3, 2], [8, 5], [5, 6], [6, 3], [10, 8], [8, 10], [4, 10], [10, 2]]
it chokes and returns the empty list.

My solution (which may not be the fastest or most effective, but till
now is the shortest <wink> and it works):

def merge(pairings):
    sets = {}
    for x1, x2 in pairings:
        newset = (sets.get(x1, frozenset([x1]))
                  | sets.get(x2, frozenset([x2])))
        for i in newset:
            sets[i] = newset

    return [list(aset) for aset in set(sets.itervalues())]


Reinhold


FWIW, here's a brief UAT report:

Appears to work: Reinhold, David, Xah (v2)
Has bug(s): John L (v*), Brian (v*)
Incomprehensible: Xah (v*)

'Come back after lunch' award goes to Xah v2, which at a glance appears
to be O(N**4) -- a dictionary.update() inside a 3-deep nest of 'for'
statements.

Here's a small input that busts all non-working versions:

input: [[1, 2], [3, 4], [2, 3], [4, 5]]
merge_RB -> [[1, 2, 3, 4, 5]]
merge_DE -> [[1, 2, 3, 4, 5]]
merge_JL -> [[1, 2, 3, 4], [5]]
merge_JL2 -> [[1, 2, 3, 4], [5]]
merge_BB -> []
merge_XL -> [[1, 2, 3, 4, 5], [3, 4, 5]]
merge_XL2 -> [[1, 2, 3, 4, 5]]
 
X

Xah Lee

an interesting problem that has now developed is to write a function
that generates test cases for the purpose of testing performance. (just
for fun)

the design of this function could be interesting. We want to be able to
give parameters to this function so as to spit out all possible screwy
test cases. First of all, a range of n (some millions) numbers. Then, a
fraction that specifies the percentage of these numbers that are
equivalent: 1 being all equivalent, 0 being all "distinct" (each having
only one equivalent member, since the input comes in pairs). Then we
want a parameter that controls something like the sizes of the
equivalence groups. Suppose 50% of the numbers are equivalent among
themselves (i.e. have more than one equivalent member): 1 would mean
all in one group; 0 would mean all partitions have size 3 or 2. (there
is more to it here... since this is a distribution) ... Then, we need
to turn this range of integers into pairs. That's something more to
think about.
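
Here is a minimal hedged sketch of such a generator, under simplifying
assumptions (group sizes drawn uniformly rather than from a tunable
distribution; the name pair_gen and its parameters are invented for
illustration):

import random

def pair_gen(n, frac_equiv, max_group):
    # Hypothetical test-case generator: n numbers, of which a fraction
    # frac_equiv are placed in equivalence groups of random size
    # 2..max_group; the rest are paired only with themselves.
    nums = range(n)
    random.shuffle(nums)
    n_equiv = int(n * frac_equiv)
    pairs = []
    i = 0
    while i < n_equiv - 1:
        size = min(random.randint(2, max_group), n_equiv - i)
        group = nums[i:i + size]
        # chain the members: size-1 pairs suffice to make them equivalent
        pairs.extend([group[j], group[j + 1]] for j in range(len(group) - 1))
        i += size
    for x in nums[i:]:
        pairs.append([x, x])   # "distinct": only equivalent to itself
    random.shuffle(pairs)
    return pairs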

So with this function at hand, we'll be able to say for sure which code
performs better (and under what type of input).

The big-O notion in computing mathematics is fairly useless for finer
details.

PS it is not trivial to design this pair-generating function. I don't
know if the above is on the right track, but essentially we want to
categorize the types of input according to the mathematical/operational
performance aspects of partition by equivalence, and distill them into
parameters.

another func to write is one that canonicalizes the output, such as by
sorting, so that all results can be compared simply by == in Python.
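
A quick sketch of such a canonicalizer (assuming the members are all
comparable, e.g. integers):

def canonicalize(partition):
    # Sort each group, then sort the groups, so two partitions can be
    # compared with == regardless of the order they were produced in.
    groups = [sorted(group) for group in partition]
    groups.sort()
    return groups

assert canonicalize([[4, 2], [1], [3]]) == canonicalize([[3], [1], [2, 4]])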

failing to design an elaborate pair_gen, we can just use pairs of
random numbers. But exactly what the nature of such input is... is more
to think about.

(in my original application, each number represents a computer file;
there are up to tens of thousands of files, and much less than 0.1% are
the same as another, and for files that are the same, each equivalence
group numbers no more than 10 or so.)

Xah
(e-mail address removed)
http://xahlee.org/PageTwo_dir/more.html
 
B

bearophileHUGS

John Machin>FWIW, here's a brief UAT report:

Here is something else for you.
Note: for more correct comparisons, for the following tests I've
disabled the use of Psyco in Graph (usually enabled, if present).
This looks faster than merge2 for this (quite sparse) random pairs
test:

import graph # http://www.fantascienza.net/animalia/graph.zip

def merge4(l):
    g = graph.Graph()
    for n1, n2 in l:
        g.addArc(n1, n2)
    return g.connectedComponents()

Creating a better addArcs produces good results (an addArcs method is
present in Graph, but addArcs2 isn't):

def merge5(l):
    g = graph.Graph()
    g.addArcs2(l)
    return g.connectedComponents()


If you'd like to see the actual code without using Graph, I've made a
very stripped-down version of addArcs2 and connectedComponents:

from collections import deque
from itertools import chain

def merge6(l):
    # graph creation
    gi = {}
    go = {}
    for n1, n2 in l:
        if n1 not in go:
            go[n1] = {}
            gi[n1] = {}
        if n2 not in go:
            go[n2] = {}
            gi[n2] = {}
        go[n1][n2] = 0
        gi[n2][n1] = 0
    # Connected components:
    result = []
    visited = set()
    for root in go.iterkeys():
        if root not in visited:
            component = [root]
            visited.add(root)
            Q = deque() # A queue
            Q.append(root)
            while Q:
                n1 = Q.popleft()
                for n2 in chain(go[n1].iterkeys(), gi[n1].iterkeys()):
                    if n2 not in visited:
                        visited.add(n2)
                        Q.append(n2)
                        component.append(n2)
            result.append(component)
    return result


At the end of the tests I've added this to be sure the results are
always the same:

r2 = set( frozenset(e) for e in merge2(l) )
r4 = set( frozenset(e) for e in merge4(l) )
r5 = set( frozenset(e) for e in merge5(l) )
r6 = set( frozenset(e) for e in merge6(l) )
assert r2 == r4
assert r2 == r5
assert r2 == r6

For xMax, nPairs = 1000, 5000:
merge2: 15750 function calls in 0.939 CPU seconds
merge4: 11029 function calls in 0.560 CPU seconds
merge5: 6030 function calls in 0.303 CPU seconds
merge6: 6011 function calls in 0.297 CPU seconds

For xMax, nPairs = 2000, 10000:
merge2: 31503 function calls in 2.505 CPU seconds
merge6: 12011 function calls in 0.639 CPU seconds

Timings using clock() (much more reliable than the profilers!), for
xMax, nPairs = 2000, 10000:
merge2: 1.222
merge6: 0.201
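
For reference, here is a minimal sketch of how such clock() timings can
be taken (the exact harness isn't shown in the thread, so this is an
assumption):

import time

def bench(fn, l, repeats=3):
    # Run fn(l) a few times with time.clock() and keep the best time,
    # as usual for micro-benchmarks (this reduces scheduling noise).
    best = None
    for _ in xrange(repeats):
        t0 = time.clock()
        fn(l)
        t = time.clock() - t0
        if best is None or t < best:
            best = t
    return best

# print 'merge2: %.3f' % bench(merge2, l)
# print 'merge6: %.3f' % bench(merge6, l)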

Probably merge6 can be improved, reducing memory used...

Bear hugs,
Bearophile
 
J

John Machin

John Machin>FWIW, here's a brief UAT report:

Here is something else for you.
Note: for more correct comparisons, for the following tests I've
disabled the use of Psyco in Graph (usually enabled, if present).
This looks faster than merge2 for this (quite sparse) random pairs
test:
[snip]


Timings using clock() (much more reliable than the profilers!), for
xMax, nPairs = 2000, 10000:
merge2: 1.222
merge6: 0.201

Probably merge6 can be improved, reducing memory used...

Perhaps I'm missing a posting of yours -- what are merge2 and merge4?
What is "this random pairs test"? What is xMax (nPairs isn't hard to
guess), and how are you generating your input data?

FWIW, my offering is attached.
 
X

Xah Lee

when i try to run the following program, Python complains that global
name frozenset is not defined. Is set some new facility in Python 2.4?

©# from Reinhold Birkenfeld
©def merge(pairings):
©    sets = {}
©    for x1, x2 in pairings:
©        newset = (sets.get(x1, frozenset([x1]))
©                  | sets.get(x2, frozenset([x2])))
©        for i in newset:
©            sets[i] = newset
©
©    return [list(aset) for aset in set(sets.itervalues())]

it would be nice if the two working programs did not use some package.
This problem shouldn't need one.

Xah
(e-mail address removed)
http://xahlee.org/PageTwo_dir/more.html
 
J

John Machin

Xah said:
when i try to run the following program, Python complains that global
name frozenset is not defined. Is set some new facility in Python 2.4?

http://www.python.org/doc/2.3/whatsnew/
http://www.python.org/doc/2.4/whatsnew/whatsnew24.html

You must be running 2.3. If you persist in running 2.3, put the
following at the top of the file:

import sys
PY_VERSION = sys.version_info[:2]
if PY_VERSION == (2, 3):
    from sets import Set as set
    from sets import ImmutableSet as frozenset

That will get Reinhold's gadget working under 2.3. Bearophile's
function uses collections.deque, which is built into 2.4.

it would be nice if the two working programs did not use some package.
This problem shouldn't need one.

Look at the newsgroup again. By my count there are apparently-working
versions from SIX people: Reinhold, bear.+, Brian Beck, David Eppstein,
yourself, and me. Only David's uses stuff that's not in the Python 2.4
off-the-shelf distribution.
 
B

bearophileHUGS

John Machin:
Perhaps I'm missing a posting of yours -- what are merge2 and merge4?
What is "this random pairs test"? What is xMax (nPairs isn't hard to
guess), and how are you generating your input data?<

merge2 and this random pairs test come from the post by Brian Beck.
merge4 is the first one in my last post. And merge3 isn't included (nor
tested) because it uses the original (slow) addArcs.
This is the simple test generator taken from his post:

from random import randint

xMax = 3000
nPairs = xMax * 5
l = [[randint(0, xMax), randint(0, xMax)] for i in xrange(nPairs)]

For such a high number of nPairs (arcs), the resulting graph usually
has one big connected component and a few tiny pieces of debris... To
produce a more divided graph, nPairs has to be much smaller, like
xMax*1.5.

FWIW, my offering is attached. <

I'm sorry, I don't see the attachment... (just the different title).


John Machin:
Only David's uses stuff that's not in the Python 2.4 off-the-shelf
distribution.<

Well, using already-made functions and classes is good; it saves you
reinventing things each time. Some of my versions too use my graph
library (for a problem like this I usually don't create stand-alone
versions like merge6 and merge7). And my original point was about
adding a graph data structure to the standard Python library :-]
I haven't tested the speed of David Eppstein's version...

Anyway, this is a simplified version of merge6 that uses an undirected
graph g instead of the directed one, and avoids the use of chain:

from collections import deque

def merge7(l):
    # Undirected graph g creation:
    g = {}
    for n1, n2 in l:
        if n1 not in g: g[n1] = {n2: 0}
        else: g[n1][n2] = 0
        if n2 not in g: g[n2] = {n1: 0}
        else: g[n2][n1] = 0
    # Connected components:
    result = []
    visited = set()
    for root in g.iterkeys():
        if root not in visited:
            component = [root]
            visited.add(root)
            Q = deque() # A queue
            Q.append(root)
            while Q:
                n1 = Q.popleft()
                for n2 in g[n1].iterkeys():
                    if n2 not in visited:
                        visited.add(n2)
                        Q.append(n2)
                        component.append(n2)
            result.append(component)
    return result


It's a bit shorter and a little faster than merge6, and memory used is
similar (there is only one dictionary, so there is only one
amortization overhead, but it contains the arcs in both directions, so
that overhead can be twice as big; so... memory used can be about the
same).
(Usually BFS and DFS memory requirements are a little different; this
implementation uses BFS, and I think it uses a bit less memory than
DFS, but for this problem/test the difference isn't significant.)

Bear hugs,
Bearophile
 
J

John Machin

John Machin:

I'm sorry, I don't see the attachment... (just the different title).

Here it is. I'll reply to the rest tomorrow. It's way past sleep-time
here.

class _Stopper:
    pass

def merge_JM(pairings):
    parent = {}
    link = {}
    size = {}
    # 1. x not in parent => x is unknown
    # 2. parent[x] -> root of cluster containing x
    # 3. As a consequence of (2), parent[root] is root
    # 4. link[x] chains the members of the cluster; 1st is root, 2nd is link[root], ...
    # 5. link[last_member] is _stopper
    # 6. size[root] is the number of members of the cluster
    parentget = parent.get
    _stopper = _Stopper()
    for inxa, inxb in pairings:
        roota = parentget(inxa, _stopper)
        rootb = parentget(inxb, _stopper)
        if roota is not _stopper:
            if rootb is not _stopper:
                if roota is rootb:
                    continue
                if size[rootb] > size[roota]:
                    newroot = rootb
                    joiner = roota
                else:
                    newroot = roota
                    joiner = rootb
                size[newroot] += size[joiner]
                del size[joiner]
                parent[joiner] = newroot
                tail = joiner
                while link[tail] is not _stopper:
                    tail = link[tail]
                    parent[tail] = newroot
                link[tail] = link[newroot]
                link[newroot] = joiner
            else:
                size[roota] += 1
                parent[inxb] = roota
                link[inxb] = link[roota]
                link[roota] = inxb
        else:
            if rootb is not _stopper:
                size[rootb] += 1
                parent[inxa] = rootb
                link[inxa] = link[rootb]
                link[rootb] = inxa
            elif inxa is inxb:
                parent[inxa] = inxa
                link[inxa] = _stopper
                size[inxa] = 1
            else:
                parent[inxa] = inxa
                parent[inxb] = inxa
                link[inxa] = inxb
                link[inxb] = _stopper
                size[inxa] = 2
    olist = []
    for root in size.iterkeys():
        family = [root]
        cousin = link[root]
        while cousin is not _stopper:
            family.append(cousin)
            cousin = link[cousin]
        olist.append(family)
    return olist
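
As a quick sanity check, the small input from the UAT report earlier in
the thread can be run through it:

print merge_JM([[1, 2], [3, 4], [2, 3], [4, 5]])
# -> one group containing 1, 2, 3, 4, 5 (member order may vary)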
 
R

Reinhold Birkenfeld

Reinhold said:
My solution (which may not be the fastest or most effective, but till
now is the shortest <wink> and it works):

def merge(pairings):
    sets = {}
    for x1, x2 in pairings:
        newset = (sets.get(x1, frozenset([x1]))
                  | sets.get(x2, frozenset([x2])))
        for i in newset:
            sets[i] = newset

    return [list(aset) for aset in set(sets.itervalues())]


A recursive solution (around twice as fast as the above, though still
very slow...)

def merge_rb2(pairings):
    def merge(sets):
        res_sets = []
        for tset in sets:
            for item in tset:
                for rset in res_sets:
                    if item in rset:
                        rset.update(item for item in tset)
                        break
                else:
                    continue
                break
            else:
                res_sets.append(set(tset))
        if len(res_sets) == len(sets):
            return res_sets
        else:
            return merge(res_sets)

    return [list(s) for s in merge([set(pair) for pair in pairings])]

Another one:

def merge_rb3(pairings):
    dic = {}
    for x1, x2 in pairings:
        dic.setdefault(x1, []).append(x2)
        dic.setdefault(x2, []).append(x1)

    def red(k, l):
        l.append(k)
        sub = dic[k]
        for i in dic[k]:
            if i not in l:
                red(i, l)
        del dic[k]

    res = []
    while len(dic):
        rl = []
        red(iter(dic).next(), rl)
        res.append(rl)

    return res

Reinhold
 
D

David Eppstein

it would be nice if the two working programs did not use some package.
This problem shouldn't need one.

Look at the newsgroup again. By my count there are apparently-working
versions from SIX people: Reinhold, bear.+, Brian Beck, David Eppstein,
yourself, and me. Only David's uses stuff that's not in the Python 2.4
off-the-shelf distribution.

Where "stuff" means a single 62-line file that I linked to in my post.
 
J

John Machin

[snip RB]
A recursive solution (around twice as fast as the above, though very
slow still...)
[snip RB2]

Another one:

[snip RB3]

Dunno what data you are using for timing, but my tests suggest that RB
is fast enough, RB3 is slightly faster, but RB2 is a real dog and
appears to be quadratic [hint: it has that same for-for-for-update
signature found in phase 2 of Xah's effort]. Not only that, but it
seems to be somewhat irregular. Below are some results on trivial test
data:

prototype input:
python -m timeit -s "import merge;n=3000;inp=((x,x+1)for x in xrange(0,n,2))" "merge.merge_RB3(inp)"
100000 loops, best of 3: 3.98 usec per loop

n=3000;RB2 64.9 msec per loop
n=3000;RB3 3.98 usec per loop
n=3000;RB 3.06 usec per loop
n=3000;JM3 2.73 usec per loop

n=1000;RB2 4.92 usec per loop
n=1250;RB2 5.34 usec per loop
n=1500;RB2 20.4 usec per loop
n=1750;RB2 22.1 msec per loop
n=2000;RB2 28.8 msec per loop
n=2500;RB2 44.9 msec per loop
n=3000;RB2 64.9 msec per loop

[background: Python 2.4, Win2000, 1.4GHz Athlon chip, 768Mb memory]

Here's a function to generate some serious timing data:
def mktimedata(n, lev):
    res = []
    d = 1
    for k in range(lev):
        res.extend(zip(xrange(0, n, 2*d), xrange(d, n, 2*d)))
        d *= 2
    return res

Try that with n=3000 and lev=5 ...

Cheers,
John
 
J

John Machin

Eric said:
what do the funny little "©"s stand for?

....>>> import unicodedata as ucd; ucd.name('©'.decode('cp1252'))
'COPYRIGHT SIGN'

Xah is asserting his right to be recognised as the author of his
artistic creations, line by line.
 
E

Erik Max Francis

John said:
Xah is asserting his right to be recognised as the author of his
artistic creations, line by line.

Not very well, however, since his usage doesn't constitute a valid
copyright notice :).
 
R

Reinhold Birkenfeld

John said:
[snip RB]
A recursive solution (around twice as fast as the above, though very
slow still...)
[snip RB2]

Another one:

[snip RB3]

Dunno what data you are using for timing, but my tests suggest that RB
is fast enough, RB3 is slightly faster, but RB2 is a real dog and
appears to be quadratic [hint: it has that same for-for-for-update
signature found in phase 2 of Xah's effort]. Not only that, but it
seems to be somewhat irregular. Below are some results on trivial test
data:
[snip]

Yes, I don't know exactly how I timed this, and I just posted the
solutions to show that there are very different solutions possible. They
are surely not using the best algorithms, as bearophile's function showed.

Reinhold
 
X

Xah Lee

I started to collect what i believe are the 4 or so solutions by
different people... but it seems it's gonna take an hour or more... So
far the only other one i've run and found alright is Reinhold
Birkenfeld's original. Others worth noting that i'm aware of are David
Eppstein's, improved versions from Reinhold Birkenfeld, the one using
graphs by bearophile ...

since many of you have already collected and tested these... i wonder
if anyone'd be interested to put together all the (working) code in a
single message? (or even a webpage.)

thanks.

Xah
(e-mail address removed)
http://xahlee.org/PageTwo_dir/more.html
 
