J
James Ravenscroft
Dear All,
I am currently trying to write a simple Agglomerative Clustering
algorithm which sorts through my MP3 collection and uses associated
Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
some trouble with my algorithm and some tracks are ending up in multiple
clusters. I have a feeling this is because of (my lack of understanding
of) list behaviours within Python. This is all purely a learning
exercise, so any input would be greatly appreciated
The actual clustering method can be seen described here. Each Cluster is
stored within the 'clusters' array and has two elements: the data
containing song metadata such as artist, title and most importantly
tags, and a weight: that is, the overall Euclidean weight of the cluster
(based on the song data within).
In theory, the algorithm runs in a loop until all clusters have been
merged into a hierarchy. It takes the weight of each cluster and merges
each cluster with their closest counterpart. My code has a lot of
comments so it should be fairly easy to understand.
Please see below:
#-----------------------------Code Snippet
-----------------------------------------------#
def cluster(self):
'''Run the clustering operation on the files
'''
#add a cluster for each track to clusters
for song in self.music.keys():
self.clusters.append({ 'data' : [song], 'weight' : long(0) })
currentLevel = 0 #current level of hierarchy
#while there is more than one cluster in the sorting bank
# i.e. we haven't achieved hierachy yet, run the algorithm
while( len(self.clusters) > 1):
print "Currently at Level %d" % currentLevel
print "There are %d clusters at this level" % len(self.clusters)
#store the level in the levels array
self.levels.append(self.clusters)
#empty the clusters list
self.clusters = []
#find the weight of each current cluster
for c in self.levels[currentLevel]:
self.clusters.append({'data' : c['data'],
'weight' :
self.__getClusterWeight(c['data'])})
#do the actual clustering
tmp = []
for c in self.clusters:
closestCluster = None
closestDelta = float('inf')
for c2 in self.clusters:
#skip if this is the same node
if(c == c2):
continue
delta = abs(c2['weight'] - c['weight'])
if(delta < closestDelta):
closestCluster = c2
closestDelta = delta
print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
self.clusters.index(c),
'c2' :
self.clusters.index(closestCluster)}
#now merge the two clusters
self.clusters.remove(closestCluster)
self.clusters.remove(c)
c['data'].extend(closestCluster['data'])
tmp.append(c)
#increment the level of the hierarchy
self.clusters = tmp
currentLevel += 1
#--------------------------------End Code Snippet
----------------------------#
Thanks,
James
I am currently trying to write a simple Agglomerative Clustering
algorithm which sorts through my MP3 collection and uses associated
Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
some trouble with my algorithm and some tracks are ending up in multiple
clusters. I have a feeling this is because of (my lack of understanding
of) list behaviours within Python. This is all purely a learning
exercise, so any input would be greatly appreciated
The actual clustering method can be seen described here. Each Cluster is
stored within the 'clusters' array and has two elements: the data
containing song metadata such as artist, title and most importantly
tags, and a weight: that is, the overall Euclidean weight of the cluster
(based on the song data within).
In theory, the algorithm runs in a loop until all clusters have been
merged into a hierarchy. It takes the weight of each cluster and merges
each cluster with their closest counterpart. My code has a lot of
comments so it should be fairly easy to understand.
Please see below:
#-----------------------------Code Snippet
-----------------------------------------------#
def cluster(self):
'''Run the clustering operation on the files
'''
#add a cluster for each track to clusters
for song in self.music.keys():
self.clusters.append({ 'data' : [song], 'weight' : long(0) })
currentLevel = 0 #current level of hierarchy
#while there is more than one cluster in the sorting bank
# i.e. we haven't achieved hierachy yet, run the algorithm
while( len(self.clusters) > 1):
print "Currently at Level %d" % currentLevel
print "There are %d clusters at this level" % len(self.clusters)
#store the level in the levels array
self.levels.append(self.clusters)
#empty the clusters list
self.clusters = []
#find the weight of each current cluster
for c in self.levels[currentLevel]:
self.clusters.append({'data' : c['data'],
'weight' :
self.__getClusterWeight(c['data'])})
#do the actual clustering
tmp = []
for c in self.clusters:
closestCluster = None
closestDelta = float('inf')
for c2 in self.clusters:
#skip if this is the same node
if(c == c2):
continue
delta = abs(c2['weight'] - c['weight'])
if(delta < closestDelta):
closestCluster = c2
closestDelta = delta
print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
self.clusters.index(c),
'c2' :
self.clusters.index(closestCluster)}
#now merge the two clusters
self.clusters.remove(closestCluster)
self.clusters.remove(c)
c['data'].extend(closestCluster['data'])
tmp.append(c)
#increment the level of the hierarchy
self.clusters = tmp
currentLevel += 1
#--------------------------------End Code Snippet
----------------------------#
Thanks,
James