n-gram based & edit distance based comparisons

Ezee

Hi

Can anybody please suggest how to perform n-gram based comparisons
and edit-distance based comparisons between words? (The words are
strings, and they are stored in Vectors.)

Thanks in anticipation :)
 
Harald

Ezee said:
Can anybody please suggest how to perform n-gram based comparisons
and edit-distance based comparisons between words? (The words are
strings, and they are stored in Vectors.)

For n-grams: put each word's n-grams into a java.util.Set and build the
intersection of the two sets.

Your question is not exactly specific. Are you looking for code
examples, pointers to algorithm descriptions, class libraries, ...?

Harald.
 
Ezee

Harald said:
Your question is not exactly specific. Are you looking for code
examples, pointers to algorithm descriptions, class libraries, ...?

I am looking for code examples. Actually I have to compare two words,
e.g. peace & piece, on the basis of n-grams, i.e. if we consider
all 3-grams of these two words (peace = pea pec pee eac eae ace) and
(piece = pie pic pie iec ice). So even if the two words are not
exactly the same, when compared on the basis of n-grams they are
similar to some extent, and this degree of similarity is to be
calculated (maybe as a percentage, like piece is 60% similar to peace).
I am not sure if I am right in calling this an n-gram based comparison...

Ezee
 
Ingo R. Homann

Hi,
I am looking for code examples. Actually I have to compare two words,
e.g. peace & piece, on the basis of n-grams, i.e. if we consider
all 3-grams of these two words (peace = pea pec pee eac eae ace) and
(piece = pie pic pie iec ice). ... I am
not sure if I am right in calling this an n-gram based comparison...

AFAIK, these are only called n-grams if you take adjacent letters (for N=3):

piece = pie iec ece
peace = pea eac ace

(In this particular case, piece and peace share no 3-gram, so they are
not similar at all.)

As for your question: sorry, I don't have source code for that, and I
think there is no official standard package for it, but it should be easy
to implement, or to find by googling for "java" and "ngram".

Ciao,
Ingo
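
The adjacent-letter definition above could be sketched roughly as follows. This is only an illustration: the class name NGrams and the method name ngrams are made up for this example and do not come from any library.

```java
import java.util.ArrayList;
import java.util.List;

class NGrams {
    // Returns the n-grams of word built from adjacent letters,
    // in order of occurrence. A word shorter than n yields an empty list.
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("piece", 3)); // [pie, iec, ece]
        System.out.println(ngrams("peace", 3)); // [pea, eac, ace]
    }
}
```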
 
HK

Ezee said:
I am looking for code examples. Actually I have to compare two words,
e.g. peace & piece, on the basis of n-grams, i.e. if we consider
all 3-grams of these two words (peace = pea pec pee eac eae ace) and
(piece = pie pic pie iec ice). So even if the two words are not
exactly the same, when compared on the basis of n-grams they are
similar to some extent, and this degree of similarity is to be
calculated (maybe as a percentage, like piece is 60% similar to peace).
I am not sure if I am right in calling this an n-gram based comparison...

Well, then at least for the n-gram approach my previous
comment is just right. For each word, create its n-grams
and put them in a Set. Then use set intersection to find
the common ones. Count them to get a ranking. If you
want, you can give weights to different n-grams depending
on how often you find them in the unique words of
a corpus. The more frequent an n-gram is, the less
decisive it should be.

See, for example, http://www.cs.ualberta.ca/~lindek/papers/sim.pdf

Harald.
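
A rough sketch of the Set-intersection idea described above, using adjacent-letter n-grams. The class and method names are invented for this example. Turning the count into a percentage with the Dice coefficient is an assumption of this sketch (the post only says to count the common n-grams), and the corpus-based weighting is left out.

```java
import java.util.HashSet;
import java.util.Set;

class NGramSimilarity {
    // Collects the distinct adjacent-letter n-grams of a word.
    static Set<String> ngramSet(String word, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Number of distinct n-grams that both words contain.
    static int sharedCount(String a, String b, int n) {
        Set<String> common = ngramSet(a, n);
        common.retainAll(ngramSet(b, n)); // in-place set intersection
        return common.size();
    }

    // Assumed normalisation (Dice coefficient) as a percentage:
    // 2 * shared / (|A| + |B|) * 100, where A and B are the n-gram sets.
    static double dicePercent(String a, String b, int n) {
        int total = ngramSet(a, n).size() + ngramSet(b, n).size();
        if (total == 0) {
            return 0.0; // both words are shorter than n
        }
        return 200.0 * sharedCount(a, b, n) / total;
    }

    public static void main(String[] args) {
        System.out.println(sharedCount("piece", "peace", 3)); // 0
        System.out.println(dicePercent("piece", "peace", 2)); // 25.0
    }
}
```

With 3-grams the two words have nothing in common, while with 2-grams they share only "ce", so the choice of n matters a lot for short words.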
 
Ezee

Thanks for the help. I am going to try it, and if I need more help,
perhaps I will bother you again :).
Ezee
 
timjowers

Ezee,

Search SourceForge for this. If you don't find anything, add it to an
appropriate project or create a new project.

Tim
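
The original question also asked about edit-distance based comparison, which none of the replies cover. A minimal sketch, assuming that "edit distance" means the Levenshtein distance (insertions, deletions and substitutions, each costing 1). The class and method names are invented for this example, and the percentage formula is just one possible normalisation.

```java
class EditDistance {
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions that turn a into b.
    // Only two rows of the dynamic-programming table are kept in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a's first i characters into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: 100 * (1 - distance / length of longer word).
    static double similarityPercent(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100.0; // two empty strings are identical
        }
        return 100.0 * (longest - levenshtein(a, b)) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("piece", "peace"));       // 2
        System.out.println(similarityPercent("piece", "peace")); // 60.0
    }
}
```

Since the words are held in Vectors, the same method can simply be called for every pair of elements taken from the two Vectors.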
 
