almost equal strings

Roedy Green · Dec 22, 2008

Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.

You might assign low importance to matching word like "the" and high
importance to words like "democracy".

It might understand plurals. It might have a dictionary of synonyms.

You might use it to detect redundancy in an essay or website, either
to remove it, or ensure both versions are kept up to date.

--
Roedy Green Canadian Mind Products
http://mindprod.com
PM Steven Harper is fixated on the costs of implementing Kyoto, estimated as high as 1% of GDP.
However, he refuses to consider the costs of not implementing Kyoto which the
famous economist Nicholas Stern estimated at 5 to 20% of GDP

Mark Space · Dec 22, 2008

Roedy said:
Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.

You might assign low importance to matching word like "the" and high
importance to words like "democracy".

It might understand plurals. It might have a dictionary of synonyms.

You might use it to detect redundancy in an essay or website, either
to remove it, or ensure both versions are kept up to date.

There's SOUNDEX/SOUNDS-LIKE functions in SQL, as well as DIFFERENCE.
Maybe something like that to get started.

David Segall · Dec 22, 2008

Roedy Green said:
Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.

You might assign low importance to matching word like "the" and high
importance to words like "democracy".

It might understand plurals. It might have a dictionary of synonyms.

You might use it to detect redundancy in an essay or website, either
to remove it, or ensure both versions are kept up to date.

TRE <http://laurikari.net/tre/> is regular expression matching library
that includes approximate matching. In your application, one phrase
could be used to search for a similar one after preprocessing the
first one according to your "importance" rules.

Tom Anderson · Dec 22, 2008

Soundex has already been mentioned. See also
http://en.wikipedia.org/wiki/Levenshtein_distance

Those are good for comparing words, or arbitrary strings, as Roedy asked,
but he reveals that he actually wants to compare phrases. Obviously, you
can still compare phrases as strings, but i wonder if they're actually not
so appropriate here. Edit distance (aka Levenshtein distance) is defined
for a string over any set of symbols; if you symbols were words, you could
apply it at the phrase level. You might need a more refined scoring
method, though - not all insertion, deletions or substitutions are of
equal importance.

That is a lot more complex, unless you go for simple run-length you are
going to need a database of relative importance for each possible word.

Such a think would not be impossible to build - you could get hold of some
sort of corpus (say, a few gigabytes of modern text from Project
Gutenberg) and process it, recording frequencies for each word, and
perhaps second-order probabilities (eg 'raging' is more often followed by
'bull' than 'mouse', even though 'mouse' might be the more common word
overall). You could then apply Bayesian statistics or something to figure
out what the most important words in the processed text were. The database
might not need to be that big - apparently 90% of text is made up of ~7000
words [1], so if you knew those, you could assign any unknown word some
arbitrary 'very rare' frequency.

Anyway, i don't think cljp is the place to ask about this - there is a
huge body of knowlege about text processing which Roedy could tap into,
and there must be a place to ask questions on it.

tom

[1] http://www.askoxford.com/oec/mainpage/oec02/

Daniel Pitts · Dec 22, 2008

Roedy said:
Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.

You might assign low importance to matching word like "the" and high
importance to words like "democracy".

It might understand plurals. It might have a dictionary of synonyms.

You might use it to detect redundancy in an essay or website, either
to remove it, or ensure both versions are kept up to date.

There are many approaches to this problem.

Lucene uses analyzers to break down text into stemmed forms. For example
"Animals" and "animal" could be stemmed from "animal", "Octopus" and
"Octopi" would be stemmed to "octop". This is usually done by
specifying a set of rules that are language specific, and exceptions for
words that don't follow those rules.

Others have already suggested soundex.

There is also the Edit Distance.
<http://en.wikipedia.org/wiki/Edit_distance>

Wikipedia also has a whole section for fuzzy string searching:
<http://en.wikipedia.org/wiki/Fuzzy_string_searching>

Daniel Pitts · Dec 22, 2008

rossum said:
Soundex has already been mentioned. See also
http://en.wikipedia.org/wiki/Levenshtein_distance

That is a lot more complex, unless you go for simple run-length you
are going to need a database of relative importance for each possible
word.

The common approach is that the more frequent a term is in a corpus, the
less important it is to the search result. So you do often have a
"database", but it is derived from your data directly, not (necessarily)
predefined by someone.

jolz · Dec 22, 2008

Have you seen any work on comparing strings for almost equality, or

measuring difference?

Levenshtein distance
Jaro distance

Lew · Dec 22, 2008

There are many approaches to this problem.

Lucene uses analyzers to break down text into stemmed forms. For example
"Animals" and "animal" could be stemmed from "animal", "Octopus" and
"Octopi" would be stemmed to "octop". This is usually done by
specifying a set of rules that are language specific, and exceptions for
words that don't follow those rules.

Others have already suggested soundex.

There is also the Edit Distance.
<http://en.wikipedia.org/wiki/Edit_distance>

Wikipedia also has a whole section for fuzzy string searching:
<http://en.wikipedia.org/wiki/Fuzzy_string_searching>

Latent Semantic Indexing ("LSI") is a powerful technique that maps
word or phrase occurrences from a document base into a metric space,
then projects that space into fewer dimensions. Queries can produce
surprisingly relevant results against that projection. One example
given is the word "driver", which has varied interpretations according
to whether one is speaking of motor vehicle operation, golf, or
computer devices. A query such as "What driver should I use on a
par-5 dogleg?" will produce different hits than "What driver should I
use for an NVIDIA card in Linux?" will.

Martin Gregorie · Dec 22, 2008

Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.

I think you'd end up needing a grammatical parser. Given the phrases:
a) "The cat ate the rat."
b) "The rat ate the cat."
c) "The cat eaten rat."

These are trivially simple single phrase sentences. A grammatical parser
should be able to determine that (a) and (b) have quite different
meanings but (a) and (c) are much closer, but I suspect that even quite
clever comaprisons that neglect grammar would be wrong quite often.

Tom Anderson · Dec 22, 2008

Latent Semantic Indexing ("LSI") is a powerful technique that maps word
or phrase occurrences from a document base into a metric space, then
projects that space into fewer dimensions. Queries can produce
surprisingly relevant results against that projection. One example
given is the word "driver", which has varied interpretations according
to whether one is speaking of motor vehicle operation, golf, or computer
devices. A query such as "What driver should I use on a par-5 dogleg?"
will produce different hits than "What driver should I use for an NVIDIA
card in Linux?" will.

Indeed. For the former, you want a 1 or 2 wood, whereas when dealing with
Linux, really you want a sand wedge or something. Or a baseball bat.

tom

Daniel Pitts · Dec 22, 2008

Martin said:
I think you'd end up needing a grammatical parser. Given the phrases:
a) "The cat ate the rat."
b) "The rat ate the cat."
c) "The cat eaten rat."

d) "The rat the cat ate was brown."
e) "It was a brown rat the cat ate."

Martin Gregorie · Dec 22, 2008

d) Time flies like an arrow.
e) Fruit flies like a banana.

I thought of using that. Then I decided the mutual rodent-feline
consumption was a better example of two phrases that a simplistiC word
matcher might think had the same meaning.

Tom Anderson · Dec 22, 2008

d) "The rat the cat ate was brown."

(z) "There Tate theca"

Correctly spelt, nonsensical, but quite close in edit distance to (b) !

tom

Lew · Dec 22, 2008

rossum said:
d) Time flies like an arrow.
e) Fruit flies like a banana.

Outside of a dog, a book is a man's best friend.
Inside of a dog, it's too dark to read.
- Groucho Marx

Roedy Green · Dec 23, 2008

Have you seen any work on comparing strings for almost equality, or
measuring difference?

Thanks everyone. That was a rich set of responses. I have incorporated
your suggested links into the Java glossary.

I had an idea for another use of such logic.

Lets say you wanted to keep a set of documents on different machines
in sync.

If the master document could be chunked, e.g. into sentences (or
something similar for program text and data records), you could hash
each sentence, then detect which sentences had been added, deleted and
changed, and produce a compact delta file to tend to other machines to
update them.

NOW, if in addition, you could do fuzzy matching, even those sentences
that were just mildly changed could be compacted to send only the
differences
--
Roedy Green Canadian Mind Products
http://mindprod.com
PM Steven Harper is fixated on the costs of implementing Kyoto, estimated as high as 1% of GDP.
However, he refuses to consider the costs of not implementing Kyoto which the
famous economist Nicholas Stern estimated at 5 to 20% of GDP

Martin Gregorie · Dec 23, 2008

I had an idea for another use of such logic.

Why not use CVS? That records (commit) and applies (update) deltas.

John W Kennedy · Dec 24, 2008

d) Time flies like an arrow.
e) Fruit flies like a banana.

It's a pretty little girls school.

Roedy Green · Dec 24, 2008

Why not use CVS? That records (commit) and applies (update) deltas.

CVS does not understand reordering. To it, a paragraph moved is a
paragraph deleted and a paragraph inserted.
--
Roedy Green Canadian Mind Products
http://mindprod.com
PM Steven Harper is fixated on the costs of implementing Kyoto, estimated as high as 1% of GDP.
However, he refuses to consider the costs of not implementing Kyoto which the
famous economist Nicholas Stern estimated at 5 to 20% of GDP

Martin Gregorie · Dec 24, 2008

CVS does not understand reordering. To it, a paragraph moved is a
paragraph deleted and a paragraph inserted.

Does that matter? You said you wanted to sync a set of document copies,
not to track text movement within a document.

Tom Anderson · Dec 27, 2008

I had an idea for another use of such logic.

Lets say you wanted to keep a set of documents on different machines
in sync.

If the master document could be chunked, e.g. into sentences (or
something similar for program text and data records), you could hash
each sentence, then detect which sentences had been added, deleted and
changed, and produce a compact delta file to tend to other machines to
update them.

rsync already does something along these lines, although not, as far as i
know, sentence-based. From the man page:

The rsync remote-update protocol allows rsync to transfer just the dif-
ferences between two sets of files across the network connection, using
an efficient checksum-search algorithm described in the technical
report that accompanies this package.

I'm afraid i can't tell you any more than that!

NOW, if in addition, you could do fuzzy matching, even those sentences
that were just mildly changed could be compacted to send only the
differences

I don't think you need to know about sentences to do this. You can do it
on strings of bytes. A very great deal of work has been done on this sort
of problem (finding the matches, not computing the differences, which is
fairly trivial once you have the matches) in the context of
bioinformatics; if you're interested, start with the Smith-Waterman
algorithm:

http://en.wikipedia.org/wiki/Smith-Waterman_algorithm

tom

case sensitive filenames	62	Jan 11, 2009
Debugging regex	3	Dec 15, 2008
Browser news	4	Dec 17, 2008
Regex Puzzle	5	Dec 14, 2008
Constellations	38	Jan 3, 2009
Smoothing	2	Jan 3, 2009
word_set = set() def should_preceed_with_an(phrase): first_word =	1	Jan 26, 2013
Observations on operator overloading	9	Apr 1, 2010

almost equal strings

Roedy Green

Mark Space

David Segall

Tom Anderson

Daniel Pitts

Daniel Pitts

jolz

Lew

Martin Gregorie

Tom Anderson

Daniel Pitts

Martin Gregorie

Tom Anderson

Lew

Roedy Green

Martin Gregorie

John W Kennedy

Roedy Green

Martin Gregorie

Tom Anderson

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads