help in algorithm

Paolino

I have a self-organizing net whose aim is clustering words.
Let's say the clustering is based on their sets of 2-grams.
Words, then, are instances of this class:

class clusterable(str):
    def __abs__(self):  # the set of q-grams (to be calculated only once)
        return set([(self + self[0])[n:n + 2] for n in range(len(self))])
    def __sub__(self, other):  # the q-grams distance between two words
        set1 = abs(self)
        set2 = abs(other)
        return len(set1 | set2) - len(set1 & set2)

I'm looking for the medium of a set of words: the word which
minimizes the sum of the distances from those words.

Aka: sum([medium - word for word in words])


Thanks for ideas, Paolino

Tom Anderson

> I have a self-organizing net whose aim is clustering words. Let's say
> the clustering is based on their sets of 2-grams. Words, then, are
> instances of this class:
>
> class clusterable(str):
>     def __abs__(self):  # the set of q-grams (to be calculated only once)
>         return set([(self + self[0])[n:n + 2] for n in range(len(self))])
>     def __sub__(self, other):  # the q-grams distance between two words
>         set1 = abs(self)
>         set2 = abs(other)
>         return len(set1 | set2) - len(set1 & set2)

Firstly:

- What do you mean by "to be calculated only once"? The code in __abs__
will run every time anyone calls abs() on the object. Do you mean that
clients should avoid calling abs more than once? If so, how about
memoising the function, or computing the 2-gram set up front, so clients
don't need to worry about it? (There's a sketch after this list.)

- Could I suggest frozenset instead of set, since the 2-gram set of a
string can't change?

- How about making the last line "return len(set1 ^ set2)"? It computes
the same value, since len(set1 | set2) - len(set1 & set2) is just the
size of the symmetric difference.
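Putting the first two suggestions together, it might look like this - a
rough sketch, not tested, with the precomputation moved into __new__ (the
empty-string guard is my addition):

class clusterable(str):
    def __new__(cls, value):
        self = str.__new__(cls, value)
        # compute the wrapped 2-gram set once, at construction time
        wrapped = self + self[0] if self else self
        self._grams = frozenset(wrapped[n:n + 2] for n in range(len(self)))
        return self

    def __abs__(self):
        return self._grams  # no recomputation on later calls

    def __sub__(self, other):
        # same value as len(set1 | set2) - len(set1 & set2)
        return len(abs(self) ^ abs(other))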
> I'm looking for the medium of a set of words: the word which minimizes
> the sum of the distances from those words.

I think I understand. Does the word have to be drawn from the set of words
you're looking at? You can do that straightforwardly like this:

def distance(w, ws):
    return sum([w - x for x in ws])

def medium(ws):
    return min([(distance(w, ws), w) for w in ws])[1]

However, this is not terribly efficient - it's O(N**2) if you're counting
calls to __sub__.

If you want a more efficient algorithm, well, that's tricky. Luckily, I am
one of the most brilliant hackers alive, so here is an O(N) solution:

def distance_(w, counts, h, n):
    "Returns the total distance from the word to the words in the set; the set is specified by its digram counts, horizon and size."
    # counts.get rather than counts[digram], so a word from outside the set
    # (with digrams the set has never seen) doesn't raise a KeyError
    return h + sum([(n - (2 * counts.get(digram, 0))) for digram in abs(w)])

def horizon(counts):
    "Returns the sum of all the digram counts."
    return sum(counts.itervalues())

def countdigrams(ws):
    "Returns a map from digram to the number of words in which that digram appears."
    counts = {}
    for w in ws:
        for digram in abs(w):
            counts[digram] = counts.get(digram, 0) + 1
    return counts

def distance(w, ws):
    "Returns the total distance from the word to the words in the set."
    counts = countdigrams(ws)
    return distance_(w, counts, horizon(counts), len(ws))

def medium(ws):
    "Returns the word in the set with the least total distance to the other words."
    counts = countdigrams(ws)
    h = horizon(counts)
    n = len(ws)
    return min([(distance_(w, counts, h, n), w) for w in ws])[1]


Note that this code calls abs a lot, so you'll want to memoise it. Also,
all of those list comprehensions could be replaced by generator
expressions, which would probably be faster - they certainly wouldn't
allocate as much memory; I'm on 2.3 at the moment, so I don't have
genexps.
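For a quick sanity check, something like this should do (the words are
made up, and it assumes abs has been memoised as sketched above):

words = [clusterable(w) for w in ['banana', 'bandana', 'cabana', 'ananas']]
print(medium(words))  # the word with the least total 2-gram distance to the rest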

I am ashamed to admit that I don't really understand how this code works.
I had a flash of insight into how the problem could be solved, wrote the
skeleton, then set to the details; by the time I'd finished with the
details, I'd forgotten the fundamental idea! I think it's something like
using the counts to represent the ensemble properties of the population of
words, which means measuring the total distance for each word is O(1).
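Spelling that out (my best reconstruction, so treat it with suspicion):
writing G(w) for abs(w) and n for len(ws),

    w - x = len(G(w) | G(x)) - len(G(w) & G(x))
          = len(G(w)) + len(G(x)) - 2 * len(G(w) & G(x))

and summing over every x in ws,

    distance(w, ws) = sum(len(G(x)) for x in ws) + n * len(G(w))
                      - 2 * sum(counts[g] for g in G(w))
                    = h + sum(n - 2 * counts[g] for g in G(w))

since h = horizon(counts) = sum(len(G(x)) for x in ws), and counts[g] is
exactly the number of words whose digram set contains g. Building the
counts is O(N); after that, each word costs only its own few digrams.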
> Aka: sum([medium - word for word in words])

I have no idea what you're trying to do here!

tom
 
Bengt Richter

> I have a self-organizing net whose aim is clustering words.
> Let's say the clustering is based on their sets of 2-grams.
> Words, then, are instances of this class:
>
> class clusterable(str):
>     def __abs__(self):  # the set of q-grams (to be calculated only once)
>         return set([(self + self[0])[n:n + 2] for n in range(len(self))])
>     def __sub__(self, other):  # the q-grams distance between two words
>         set1 = abs(self)
>         set2 = abs(other)
>         return len(set1 | set2) - len(set1 & set2)
>
> I'm looking for the medium of a set of words: the word which
> minimizes the sum of the distances from those words.
>
> Aka: sum([medium - word for word in words])
>
> Thanks for ideas, Paolino
Just wondering if this is a desired result:

>>> clusterable('banana') - clusterable('bananana')
0

i.e., resulting from

>>> abs(clusterable('banana')) - abs(clusterable('bananana'))
set([])
>>> abs(clusterable('banana'))
set(['na', 'ab', 'ba', 'an'])
>>> abs(clusterable('bananana'))
set(['na', 'ab', 'ba', 'an'])

Regards,
Bengt Richter
 
Bill Mill

> this sounds like LSI / singular value decomposition (?)

Why do you think so? I don't see it, but you might see something I
don't. LSI can be used to cluster things, but I see no reason to
believe that he's using LSI for his clustering.

I ask because I've done some LSI [1], and could help him out with that
if he is doing it.

While I'm on the subject, is there any general interest in my Python LSI code?

[1] http://llimllib.f2o.org/files/lsi_paper.pdf

Peace
Bill Mill
 
Paolino

Bengt said:
> > I have a self-organizing net whose aim is clustering words.
> > Let's say the clustering is based on their sets of 2-grams.
> > Words, then, are instances of this class:
> >
> > class clusterable(str):
> >     def __abs__(self):  # the set of q-grams (to be calculated only once)
> >         return set([(self + self[0])[n:n + 2] for n in range(len(self))])
> >     def __sub__(self, other):  # the q-grams distance between two words
> >         set1 = abs(self)
> >         set2 = abs(other)
> >         return len(set1 | set2) - len(set1 & set2)
> >
> > I'm looking for the medium of a set of words: the word which
> > minimizes the sum of the distances from those words.
> >
> > Aka: sum([medium - word for word in words])
> >
> > Thanks for ideas, Paolino
>
> Just wondering if this is a desired result:
>
> 0

Yes, the clustering is the main filter; it's good (I hope) to cut the
space of words down by one or two orders of magnitude.
Final choices must be made with the expensive Levenshtein distance, or
some other edit-type distance.

Right now I'm using an empirical solution where I suppose the best set
has length L equal to the average of the word lengths. Then I choose the
first L 2-grams from the frequency distribution of 2-grams.

I have no clue whether this is the right set, and I'm sure that set is
not a word, as there is no guarantee those 2-grams can be chained to
form a word.
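In rough code (untested, and the function name is just for illustration),
the empirical solution is something like:

def empirical_gram_set(ws):
    # L: the average of the word lengths (truncated to an integer)
    L = sum(len(w) for w in ws) // len(ws)
    # frequency distribution of 2-grams over the whole set of words
    freq = {}
    for w in ws:
        for gram in abs(w):
            freq[gram] = freq.get(gram, 0) + 1
    # take the first L 2-grams by descending frequency
    by_freq = sorted(freq, key=freq.get, reverse=True)
    return frozenset(by_freq[:L])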

Thanks for comments

Paolino