almost equal strings

R

Roedy Green

Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.

You might assign low importance to matching word like "the" and high
importance to words like "democracy".

It might understand plurals. It might have a dictionary of synonyms.

You might use it to detect redundancy in an essay or website, either
to remove it, or ensure both versions are kept up to date.

--
Roedy Green Canadian Mind Products
http://mindprod.com
PM Steven Harper is fixated on the costs of implementing Kyoto, estimated as high as 1% of GDP.
However, he refuses to consider the costs of not implementing Kyoto which the
famous economist Nicholas Stern estimated at 5 to 20% of GDP
 
M

Mark Space

Roedy said:
Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.

You might assign low importance to matching word like "the" and high
importance to words like "democracy".

It might understand plurals. It might have a dictionary of synonyms.

You might use it to detect redundancy in an essay or website, either
to remove it, or ensure both versions are kept up to date.

There's SOUNDEX/SOUNDS-LIKE functions in SQL, as well as DIFFERENCE.
Maybe something like that to get started.
 
D

David Segall

Roedy Green said:
Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.

You might assign low importance to matching word like "the" and high
importance to words like "democracy".

It might understand plurals. It might have a dictionary of synonyms.

You might use it to detect redundancy in an essay or website, either
to remove it, or ensure both versions are kept up to date.
TRE <http://laurikari.net/tre/> is regular expression matching library
that includes approximate matching. In your application, one phrase
could be used to search for a similar one after preprocessing the
first one according to your "importance" rules.
 
T

Tom Anderson

Soundex has already been mentioned. See also
http://en.wikipedia.org/wiki/Levenshtein_distance

Those are good for comparing words, or arbitrary strings, as Roedy asked,
but he reveals that he actually wants to compare phrases. Obviously, you
can still compare phrases as strings, but i wonder if they're actually not
so appropriate here. Edit distance (aka Levenshtein distance) is defined
for a string over any set of symbols; if you symbols were words, you could
apply it at the phrase level. You might need a more refined scoring
method, though - not all insertion, deletions or substitutions are of
equal importance.
That is a lot more complex, unless you go for simple run-length you are
going to need a database of relative importance for each possible word.

Such a think would not be impossible to build - you could get hold of some
sort of corpus (say, a few gigabytes of modern text from Project
Gutenberg) and process it, recording frequencies for each word, and
perhaps second-order probabilities (eg 'raging' is more often followed by
'bull' than 'mouse', even though 'mouse' might be the more common word
overall). You could then apply Bayesian statistics or something to figure
out what the most important words in the processed text were. The database
might not need to be that big - apparently 90% of text is made up of ~7000
words [1], so if you knew those, you could assign any unknown word some
arbitrary 'very rare' frequency.

Anyway, i don't think cljp is the place to ask about this - there is a
huge body of knowlege about text processing which Roedy could tap into,
and there must be a place to ask questions on it.

tom

[1] http://www.askoxford.com/oec/mainpage/oec02/
 
D

Daniel Pitts

Roedy said:
Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.

You might assign low importance to matching word like "the" and high
importance to words like "democracy".

It might understand plurals. It might have a dictionary of synonyms.

You might use it to detect redundancy in an essay or website, either
to remove it, or ensure both versions are kept up to date.
There are many approaches to this problem.

Lucene uses analyzers to break down text into stemmed forms. For example
"Animals" and "animal" could be stemmed from "animal", "Octopus" and
"Octopi" would be stemmed to "octop". This is usually done by
specifying a set of rules that are language specific, and exceptions for
words that don't follow those rules.

Others have already suggested soundex.

There is also the Edit Distance.
<http://en.wikipedia.org/wiki/Edit_distance>

Wikipedia also has a whole section for fuzzy string searching:
<http://en.wikipedia.org/wiki/Fuzzy_string_searching>
 
D

Daniel Pitts

rossum said:
Soundex has already been mentioned. See also
http://en.wikipedia.org/wiki/Levenshtein_distance

That is a lot more complex, unless you go for simple run-length you
are going to need a database of relative importance for each possible
word.
The common approach is that the more frequent a term is in a corpus, the
less important it is to the search result. So you do often have a
"database", but it is derived from your data directly, not (necessarily)
predefined by someone.
 
L

Lew

There are many approaches to this problem.

Lucene uses analyzers to break down text into stemmed forms. For example
"Animals" and "animal" could be stemmed from "animal", "Octopus" and
"Octopi" would be stemmed to "octop".  This is usually done by
specifying a set of rules that are language specific, and exceptions for
words that don't follow those rules.

Others have already suggested soundex.

There is also the Edit Distance.
<http://en.wikipedia.org/wiki/Edit_distance>

Wikipedia also has a whole section for fuzzy string searching:
<http://en.wikipedia.org/wiki/Fuzzy_string_searching>

Latent Semantic Indexing ("LSI") is a powerful technique that maps
word or phrase occurrences from a document base into a metric space,
then projects that space into fewer dimensions. Queries can produce
surprisingly relevant results against that projection. One example
given is the word "driver", which has varied interpretations according
to whether one is speaking of motor vehicle operation, golf, or
computer devices. A query such as "What driver should I use on a
par-5 dogleg?" will produce different hits than "What driver should I
use for an NVIDIA card in Linux?" will.
 
M

Martin Gregorie

Have you seen any work on comparing strings for almost equality, or
measuring difference?

So that for example, if you had two proverbs, they would compare equal
or close even if they were worded slightly differently, or punctuated
slightly differently.
I think you'd end up needing a grammatical parser. Given the phrases:
a) "The cat ate the rat."
b) "The rat ate the cat."
c) "The cat eaten rat."

These are trivially simple single phrase sentences. A grammatical parser
should be able to determine that (a) and (b) have quite different
meanings but (a) and (c) are much closer, but I suspect that even quite
clever comaprisons that neglect grammar would be wrong quite often.
 
T

Tom Anderson

Latent Semantic Indexing ("LSI") is a powerful technique that maps word
or phrase occurrences from a document base into a metric space, then
projects that space into fewer dimensions. Queries can produce
surprisingly relevant results against that projection. One example
given is the word "driver", which has varied interpretations according
to whether one is speaking of motor vehicle operation, golf, or computer
devices. A query such as "What driver should I use on a par-5 dogleg?"
will produce different hits than "What driver should I use for an NVIDIA
card in Linux?" will.

Indeed. For the former, you want a 1 or 2 wood, whereas when dealing with
Linux, really you want a sand wedge or something. Or a baseball bat.

tom
 
D

Daniel Pitts

Martin said:
I think you'd end up needing a grammatical parser. Given the phrases:
a) "The cat ate the rat."
b) "The rat ate the cat."
c) "The cat eaten rat."
d) "The rat the cat ate was brown."
e) "It was a brown rat the cat ate."
 
M

Martin Gregorie

d) Time flies like an arrow.
e) Fruit flies like a banana.
I thought of using that. Then I decided the mutual rodent-feline
consumption was a better example of two phrases that a simplistiC word
matcher might think had the same meaning.
 
L

Lew

rossum said:
d) Time flies like an arrow.
e) Fruit flies like a banana.

Outside of a dog, a book is a man's best friend.
Inside of a dog, it's too dark to read.
- Groucho Marx
 
R

Roedy Green

Have you seen any work on comparing strings for almost equality, or
measuring difference?

Thanks everyone. That was a rich set of responses. I have incorporated
your suggested links into the Java glossary.

I had an idea for another use of such logic.

Lets say you wanted to keep a set of documents on different machines
in sync.

If the master document could be chunked, e.g. into sentences (or
something similar for program text and data records), you could hash
each sentence, then detect which sentences had been added, deleted and
changed, and produce a compact delta file to tend to other machines to
update them.

NOW, if in addition, you could do fuzzy matching, even those sentences
that were just mildly changed could be compacted to send only the
differences
--
Roedy Green Canadian Mind Products
http://mindprod.com
PM Steven Harper is fixated on the costs of implementing Kyoto, estimated as high as 1% of GDP.
However, he refuses to consider the costs of not implementing Kyoto which the
famous economist Nicholas Stern estimated at 5 to 20% of GDP
 
R

Roedy Green

Why not use CVS? That records (commit) and applies (update) deltas.

CVS does not understand reordering. To it, a paragraph moved is a
paragraph deleted and a paragraph inserted.
--
Roedy Green Canadian Mind Products
http://mindprod.com
PM Steven Harper is fixated on the costs of implementing Kyoto, estimated as high as 1% of GDP.
However, he refuses to consider the costs of not implementing Kyoto which the
famous economist Nicholas Stern estimated at 5 to 20% of GDP
 
M

Martin Gregorie

CVS does not understand reordering. To it, a paragraph moved is a
paragraph deleted and a paragraph inserted.

Does that matter? You said you wanted to sync a set of document copies,
not to track text movement within a document.
 
T

Tom Anderson

I had an idea for another use of such logic.

Lets say you wanted to keep a set of documents on different machines
in sync.

If the master document could be chunked, e.g. into sentences (or
something similar for program text and data records), you could hash
each sentence, then detect which sentences had been added, deleted and
changed, and produce a compact delta file to tend to other machines to
update them.

rsync already does something along these lines, although not, as far as i
know, sentence-based. From the man page:

The rsync remote-update protocol allows rsync to transfer just the dif-
ferences between two sets of files across the network connection, using
an efficient checksum-search algorithm described in the technical
report that accompanies this package.

I'm afraid i can't tell you any more than that!
NOW, if in addition, you could do fuzzy matching, even those sentences
that were just mildly changed could be compacted to send only the
differences

I don't think you need to know about sentences to do this. You can do it
on strings of bytes. A very great deal of work has been done on this sort
of problem (finding the matches, not computing the differences, which is
fairly trivial once you have the matches) in the context of
bioinformatics; if you're interested, start with the Smith-Waterman
algorithm:

http://en.wikipedia.org/wiki/Smith-Waterman_algorithm

tom
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

case sensitive filenames 62
Debugging regex 3
Browser news 4
Regex Puzzle 5
Constellations 38
Smoothing 2
word_set = set() def should_preceed_with_an(phrase): first_word = 1
Observations on operator overloading 9

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,055
Latest member
SlimSparkKetoACVReview

Latest Threads

Top