List behaviours with Clustering Algorithm

Discussion in 'Python' started by James Ravenscroft, Jan 24, 2011.

  1. Dear All,

    I am currently trying to write a simple Agglomerative Clustering
    algorithm which sorts through my MP3 collection and uses associated
    Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
    some trouble with my algorithm and some tracks are ending up in multiple
    clusters. I have a feeling this is because of (my lack of understanding
    of) list behaviours within Python. This is all purely a learning
    exercise, so any input would be greatly appreciated

    The actual clustering method can be seen described here. Each Cluster is
    stored within the 'clusters' array and has two elements: the data
    containing song metadata such as artist, title and most importantly
    tags, and a weight: that is, the overall Euclidean weight of the cluster
    (based on the song data within).

    In theory, the algorithm runs in a loop until all clusters have been
    merged into a hierarchy. It takes the weight of each cluster and merges
    each cluster with their closest counterpart. My code has a lot of
    comments so it should be fairly easy to understand.

    Please see below:

    #-----------------------------Code Snippet
    -----------------------------------------------#

    def cluster(self):
    '''Run the clustering operation on the files
    '''

    #add a cluster for each track to clusters
    for song in self.music.keys():
    self.clusters.append({ 'data' : [song], 'weight' : long(0) })

    currentLevel = 0 #current level of hierarchy

    #while there is more than one cluster in the sorting bank
    # i.e. we haven't achieved hierachy yet, run the algorithm
    while( len(self.clusters) > 1):

    print "Currently at Level %d" % currentLevel

    print "There are %d clusters at this level" % len(self.clusters)

    #store the level in the levels array
    self.levels.append(self.clusters)

    #empty the clusters list
    self.clusters = []

    #find the weight of each current cluster
    for c in self.levels[currentLevel]:

    self.clusters.append({'data' : c['data'],
    'weight' :
    self.__getClusterWeight(c['data'])})



    #do the actual clustering

    tmp = []

    for c in self.clusters:

    closestCluster = None
    closestDelta = float('inf')

    for c2 in self.clusters:

    #skip if this is the same node
    if(c == c2):
    continue

    delta = abs(c2['weight'] - c['weight'])

    if(delta < closestDelta):
    closestCluster = c2
    closestDelta = delta


    print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
    self.clusters.index(c),
    'c2' :
    self.clusters.index(closestCluster)}
    #now merge the two clusters
    self.clusters.remove(closestCluster)
    self.clusters.remove(c)

    c['data'].extend(closestCluster['data'])

    tmp.append(c)

    #increment the level of the hierarchy

    self.clusters = tmp

    currentLevel += 1

    #--------------------------------End Code Snippet
    ----------------------------#

    Thanks,

    James

    --
    James Ravenscroft
    http://blog.funkymonkeysoftware.com/
    James Ravenscroft, Jan 24, 2011
    #1
    1. Advertising

  2. James Ravenscroft

    Peter Otten Guest

    James Ravenscroft wrote:

    > Dear All,
    >
    > I am currently trying to write a simple Agglomerative Clustering
    > algorithm which sorts through my MP3 collection and uses associated
    > Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
    > some trouble with my algorithm and some tracks are ending up in multiple
    > clusters. I have a feeling this is because of (my lack of understanding
    > of) list behaviours within Python. This is all purely a learning
    > exercise, so any input would be greatly appreciated
    >
    > The actual clustering method can be seen described here. Each Cluster is
    > stored within the 'clusters' array and has two elements: the data
    > containing song metadata such as artist, title and most importantly
    > tags, and a weight: that is, the overall Euclidean weight of the cluster
    > (based on the song data within).
    >
    > In theory, the algorithm runs in a loop until all clusters have been
    > merged into a hierarchy. It takes the weight of each cluster and merges
    > each cluster with their closest counterpart. My code has a lot of
    > comments so it should be fairly easy to understand.


    > tmp = []
    >
    > for c in self.clusters:
    >
    > closestCluster = None
    > closestDelta = float('inf')
    >
    > for c2 in self.clusters:
    >
    > #skip if this is the same node
    > if(c == c2):
    > continue
    >
    > delta = abs(c2['weight'] - c['weight'])
    >
    > if(delta < closestDelta):
    > closestCluster = c2
    > closestDelta = delta
    >
    >
    > print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
    > self.clusters.index(c),
    > 'c2' :
    > self.clusters.index(closestCluster)}
    > #now merge the two clusters
    > self.clusters.remove(closestCluster)
    > self.clusters.remove(c)
    >
    > c['data'].extend(closestCluster['data'])
    >
    > tmp.append(c)


    I can't run your code because you didn't make it standalone, but I believe
    that the problem (at least one of them) is that you iterate over
    self.clusters and modify it from within the loop. That's a big no-no in
    python. A simple example to demonstrate the effects:

    >>> import random
    >>> old = range(10)
    >>> new = []
    >>> for item in old:

    .... closest = random.choice(old)
    .... new.append((item, closest))
    .... old.remove(item)
    .... old.remove(closest)
    ....
    >>> old

    [3, 4]
    >>> new

    [(0, 8), (2, 1), (5, 7), (9, 6)]

    The remedy is normally to iterate over a copy

    for item in list(old):
    ...

    but in your case that is probably not enough.
    Try something along these lines:

    # untested
    while len(self.clusters) > 1:
    c = self.clusters.pop()
    # find closest index
    for i, c2 in enumerate(self.clusters):
    ...
    if ...:
    closest_index = i
    closest = self.clusters.pop(closest_index)
    tmp.append(c + closest)
    if self.clusters:
    tmp.append(self.clusters[0])

    Peter
    Peter Otten, Jan 24, 2011
    #2
    1. Advertising

  3. James Ravenscroft

    Peter Otten Guest

    James Ravenscroft wrote:

    >> > I can't run your code because you didn't make it standalone,

    > Thanks for the heads up, I've made a simple version of the clusterer
    > which you can view on pastebin: http://pastebin.com/7HmAkmfj If you have
    > time to look through my code I would be very grateful!
    >
    >
    >
    >> > but in your case that is probably not enough.
    >> > Try something along these lines:
    >> >
    >> > # untested
    >> > while len(self.clusters) > 1:
    >> > c = self.clusters.pop()
    >> > # find closest index
    >> > for i, c2 in enumerate(self.clusters):
    >> > ...
    >> > if ...:
    >> > closest_index = i
    >> > closest = self.clusters.pop(closest_index)
    >> > tmp.append(c + closest)
    >> > if self.clusters:
    >> > tmp.append(self.clusters[0])

    > I had a go at implementing something along the lines above and I'm still
    > getting very bizarre results. There does appear to be some sort of logic
    > to it though, if you look at the graph chart, you can see that It seems
    > to be doing the clustering right and then forgetting to remove the old
    > groupings providing this "cloning" effect for some cluster groups.


    I'm sorry I can't infer the intended algorithm from the code you provide.
    Perhaps you can give a short description in plain English?

    However, here's the implementation of the algorithm you mention as described
    on wikipedia:

    http://en.wikipedia.org/wiki/Cluster_analysis#Agglomerative_hierarchical_clustering

    Repeatedly merge the two closest clusters until there's only one left.

    To keep track of the two merged clusters I've added a "children" key to your
    node dictionaries. The intermediate states of the tree are also put into the
    levels variable (I suppose that's the purpose of your levels attribute).

    The main difference to your code is that

    (1) list modifications occur only in the outer while loop, so both inner
    loops can become for-loops again.
    (2) test all pairs before choosing the pair

    debug = True
    def cluster(self):
    '''Run the clustering operation on the files
    '''
    clusters = []
    #add a cluster for each track to clusters
    for song in self.music:
    clusters.append({'data' : [song]})
    levels = []
    while len(clusters) > 1:
    # calculate weights
    for c in clusters:
    c["weight"] = self.__getClusterWeight(c['data'])

    # find closest pair
    closestDelta = float('inf')
    closestPair = None
    for i, a in enumerate(clusters):
    for k, b in enumerate(clusters):
    if i == k:
    break
    delta = abs(a['weight'] - b['weight'])
    if delta < closestDelta:
    closestPair = i, k
    closestDelta = delta

    # merge closest pair
    hi, lo = closestPair
    left = clusters[lo]
    right = clusters.pop(hi)
    clusters[lo] = {"data": left["data"] + right["data"],
    "children": [left, right]}

    # keep track of the intermediate tree states
    levels.append(list(clusters))

    if self.debug:
    print ("stage %d" % len(levels)).center(40, "-")
    for c in clusters:
    print c["data"]

    # store tree
    self.clusters = clusters
    self.levels = levels

    Note that there's some potential for optimisation. For example, you could
    move the weight calculation out of the while-loop and just calculate the
    weight for the newly merged node.

    Peter
    Peter Otten, Jan 26, 2011
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Rennie deGraaf

    "Interesting" C behaviours

    Rennie deGraaf, Nov 27, 2004, in forum: C Programming
    Replies:
    6
    Views:
    422
    glen herrmannsfeldt
    Nov 28, 2004
  2. al pacino
    Replies:
    5
    Views:
    383
    al pacino
    Jan 31, 2006
  3. Bart
    Replies:
    2
    Views:
    308
  4. Michael Tsang
    Replies:
    32
    Views:
    1,086
    Richard Bos
    Mar 1, 2010
  5. Michael Tsang
    Replies:
    54
    Views:
    1,170
    Phil Carmody
    Mar 30, 2010
Loading...

Share This Page