almost equal strings

Discussion in 'Java' started by Roedy Green, Dec 22, 2008.

  1. Roedy Green

    Roedy Green Guest

    Have you seen any work on comparing strings for almost equality, or
    measuring difference?

    So that for example, if you had two proverbs, they would compare equal
    or close even if they were worded slightly differently, or punctuated
    slightly differently.

    You might assign low importance to matching word like "the" and high
    importance to words like "democracy".

    It might understand plurals. It might have a dictionary of synonyms.

    You might use it to detect redundancy in an essay or website, either
    to remove it, or ensure both versions are kept up to date.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    PM Steven Harper is fixated on the costs of implementing Kyoto, estimated as high as 1% of GDP.
    However, he refuses to consider the costs of not implementing Kyoto which the
    famous economist Nicholas Stern estimated at 5 to 20% of GDP
     
    Roedy Green, Dec 22, 2008
    #1
    1. Advertising

  2. Roedy Green

    Mark Space Guest

    Roedy Green wrote:
    > Have you seen any work on comparing strings for almost equality, or
    > measuring difference?
    >
    > So that for example, if you had two proverbs, they would compare equal
    > or close even if they were worded slightly differently, or punctuated
    > slightly differently.
    >
    > You might assign low importance to matching word like "the" and high
    > importance to words like "democracy".
    >
    > It might understand plurals. It might have a dictionary of synonyms.
    >
    > You might use it to detect redundancy in an essay or website, either
    > to remove it, or ensure both versions are kept up to date.
    >


    There's SOUNDEX/SOUNDS-LIKE functions in SQL, as well as DIFFERENCE.
    Maybe something like that to get started.
     
    Mark Space, Dec 22, 2008
    #2
    1. Advertising

  3. Roedy Green

    David Segall Guest

    Roedy Green <> wrote:

    >Have you seen any work on comparing strings for almost equality, or
    >measuring difference?
    >
    >So that for example, if you had two proverbs, they would compare equal
    >or close even if they were worded slightly differently, or punctuated
    >slightly differently.
    >
    >You might assign low importance to matching word like "the" and high
    >importance to words like "democracy".
    >
    >It might understand plurals. It might have a dictionary of synonyms.
    >
    >You might use it to detect redundancy in an essay or website, either
    >to remove it, or ensure both versions are kept up to date.

    TRE <http://laurikari.net/tre/> is regular expression matching library
    that includes approximate matching. In your application, one phrase
    could be used to search for a similar one after preprocessing the
    first one according to your "importance" rules.
     
    David Segall, Dec 22, 2008
    #3
  4. Roedy Green

    Tom Anderson Guest

    On Mon, 22 Dec 2008, rossum wrote:

    > On Sun, 21 Dec 2008 18:14:48 -0800, Roedy Green
    > <> wrote:
    >
    >> Have you seen any work on comparing strings for almost equality, or
    >> measuring difference?

    >
    > Soundex has already been mentioned. See also
    > http://en.wikipedia.org/wiki/Levenshtein_distance


    Those are good for comparing words, or arbitrary strings, as Roedy asked,
    but he reveals that he actually wants to compare phrases. Obviously, you
    can still compare phrases as strings, but i wonder if they're actually not
    so appropriate here. Edit distance (aka Levenshtein distance) is defined
    for a string over any set of symbols; if you symbols were words, you could
    apply it at the phrase level. You might need a more refined scoring
    method, though - not all insertion, deletions or substitutions are of
    equal importance.

    >> So that for example, if you had two proverbs, they would compare equal
    >> or close even if they were worded slightly differently, or punctuated
    >> slightly differently.
    >>
    >> You might assign low importance to matching word like "the" and high
    >> importance to words like "democracy".

    >
    > That is a lot more complex, unless you go for simple run-length you are
    > going to need a database of relative importance for each possible word.


    Such a think would not be impossible to build - you could get hold of some
    sort of corpus (say, a few gigabytes of modern text from Project
    Gutenberg) and process it, recording frequencies for each word, and
    perhaps second-order probabilities (eg 'raging' is more often followed by
    'bull' than 'mouse', even though 'mouse' might be the more common word
    overall). You could then apply Bayesian statistics or something to figure
    out what the most important words in the processed text were. The database
    might not need to be that big - apparently 90% of text is made up of ~7000
    words [1], so if you knew those, you could assign any unknown word some
    arbitrary 'very rare' frequency.

    Anyway, i don't think cljp is the place to ask about this - there is a
    huge body of knowlege about text processing which Roedy could tap into,
    and there must be a place to ask questions on it.

    tom

    [1] http://www.askoxford.com/oec/mainpage/oec02/

    --
    Programming is a skill best acquired by practice and example rather than
    from books -- Alan Turing
     
    Tom Anderson, Dec 22, 2008
    #4
  5. Roedy Green

    Daniel Pitts Guest

    Roedy Green wrote:
    > Have you seen any work on comparing strings for almost equality, or
    > measuring difference?
    >
    > So that for example, if you had two proverbs, they would compare equal
    > or close even if they were worded slightly differently, or punctuated
    > slightly differently.
    >
    > You might assign low importance to matching word like "the" and high
    > importance to words like "democracy".
    >
    > It might understand plurals. It might have a dictionary of synonyms.
    >
    > You might use it to detect redundancy in an essay or website, either
    > to remove it, or ensure both versions are kept up to date.
    >

    There are many approaches to this problem.

    Lucene uses analyzers to break down text into stemmed forms. For example
    "Animals" and "animal" could be stemmed from "animal", "Octopus" and
    "Octopi" would be stemmed to "octop". This is usually done by
    specifying a set of rules that are language specific, and exceptions for
    words that don't follow those rules.

    Others have already suggested soundex.

    There is also the Edit Distance.
    <http://en.wikipedia.org/wiki/Edit_distance>

    Wikipedia also has a whole section for fuzzy string searching:
    <http://en.wikipedia.org/wiki/Fuzzy_string_searching>

    --
    Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
     
    Daniel Pitts, Dec 22, 2008
    #5
  6. Roedy Green

    Daniel Pitts Guest

    rossum wrote:
    > On Sun, 21 Dec 2008 18:14:48 -0800, Roedy Green
    > <> wrote:
    >
    >> Have you seen any work on comparing strings for almost equality, or
    >> measuring difference?

    > Soundex has already been mentioned. See also
    > http://en.wikipedia.org/wiki/Levenshtein_distance
    >
    >> So that for example, if you had two proverbs, they would compare equal
    >> or close even if they were worded slightly differently, or punctuated
    >> slightly differently.
    >>
    >> You might assign low importance to matching word like "the" and high
    >> importance to words like "democracy".

    > That is a lot more complex, unless you go for simple run-length you
    > are going to need a database of relative importance for each possible
    > word.
    >

    The common approach is that the more frequent a term is in a corpus, the
    less important it is to the search result. So you do often have a
    "database", but it is derived from your data directly, not (necessarily)
    predefined by someone.

    > rossum
    >
    >
    >> It might understand plurals. It might have a dictionary of synonyms.
    >>
    >> You might use it to detect redundancy in an essay or website, either
    >> to remove it, or ensure both versions are kept up to date.

    >



    --
    Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
     
    Daniel Pitts, Dec 22, 2008
    #6
  7. Roedy Green

    jolz Guest

    > Have you seen any work on comparing strings for almost equality, or
    > measuring difference?


    Levenshtein distance
    Jaro distance
     
    jolz, Dec 22, 2008
    #7
  8. Roedy Green

    Lew Guest

    On Dec 22, 12:20 pm, Daniel Pitts
    <> wrote:
    > Roedy Green wrote:
    > > Have you seen any work on comparing strings for almost equality, or
    > > measuring difference?

    >
    > > So that for example, if you had two proverbs, they would compare equal
    > > or close even if they were worded slightly differently, or punctuated
    > > slightly differently.

    >
    > > You might assign low importance to matching word like "the" and high
    > > importance to words like "democracy".

    >
    > > It might understand plurals.  It might have a dictionary of synonyms.

    >
    > > You might use it to detect redundancy in an essay or website, either
    > > to remove it, or ensure both versions are kept up to date.

    >
    > There are many approaches to this problem.
    >
    > Lucene uses analyzers to break down text into stemmed forms. For example
    > "Animals" and "animal" could be stemmed from "animal", "Octopus" and
    > "Octopi" would be stemmed to "octop".  This is usually done by
    > specifying a set of rules that are language specific, and exceptions for
    > words that don't follow those rules.
    >
    > Others have already suggested soundex.
    >
    > There is also the Edit Distance.
    > <http://en.wikipedia.org/wiki/Edit_distance>
    >
    > Wikipedia also has a whole section for fuzzy string searching:
    > <http://en.wikipedia.org/wiki/Fuzzy_string_searching>


    Latent Semantic Indexing ("LSI") is a powerful technique that maps
    word or phrase occurrences from a document base into a metric space,
    then projects that space into fewer dimensions. Queries can produce
    surprisingly relevant results against that projection. One example
    given is the word "driver", which has varied interpretations according
    to whether one is speaking of motor vehicle operation, golf, or
    computer devices. A query such as "What driver should I use on a
    par-5 dogleg?" will produce different hits than "What driver should I
    use for an NVIDIA card in Linux?" will.

    --
    Lew
     
    Lew, Dec 22, 2008
    #8
  9. On Sun, 21 Dec 2008 18:14:48 -0800, Roedy Green wrote:

    > Have you seen any work on comparing strings for almost equality, or
    > measuring difference?
    >
    > So that for example, if you had two proverbs, they would compare equal
    > or close even if they were worded slightly differently, or punctuated
    > slightly differently.
    >

    I think you'd end up needing a grammatical parser. Given the phrases:
    a) "The cat ate the rat."
    b) "The rat ate the cat."
    c) "The cat eaten rat."

    These are trivially simple single phrase sentences. A grammatical parser
    should be able to determine that (a) and (b) have quite different
    meanings but (a) and (c) are much closer, but I suspect that even quite
    clever comaprisons that neglect grammar would be wrong quite often.


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Dec 22, 2008
    #9
  10. Roedy Green

    Tom Anderson Guest

    On Mon, 22 Dec 2008, Lew wrote:

    > On Dec 22, 12:20 pm, Daniel Pitts
    > <> wrote:
    >> Roedy Green wrote:
    >>> Have you seen any work on comparing strings for almost equality, or
    >>> measuring difference?

    >>
    >>> So that for example, if you had two proverbs, they would compare equal
    >>> or close even if they were worded slightly differently, or punctuated
    >>> slightly differently.

    >>
    >>> You might assign low importance to matching word like "the" and high
    >>> importance to words like "democracy".

    >>
    >>> It might understand plurals.  It might have a dictionary of synonyms.

    >>
    >>> You might use it to detect redundancy in an essay or website, either
    >>> to remove it, or ensure both versions are kept up to date.

    >>
    >> There are many approaches to this problem.
    >>
    >> Lucene uses analyzers to break down text into stemmed forms. For example
    >> "Animals" and "animal" could be stemmed from "animal", "Octopus" and
    >> "Octopi" would be stemmed to "octop".  This is usually done by
    >> specifying a set of rules that are language specific, and exceptions for
    >> words that don't follow those rules.
    >>
    >> Others have already suggested soundex.
    >>
    >> There is also the Edit Distance.
    >> <http://en.wikipedia.org/wiki/Edit_distance>
    >>
    >> Wikipedia also has a whole section for fuzzy string searching:
    >> <http://en.wikipedia.org/wiki/Fuzzy_string_searching>

    >
    > Latent Semantic Indexing ("LSI") is a powerful technique that maps word
    > or phrase occurrences from a document base into a metric space, then
    > projects that space into fewer dimensions. Queries can produce
    > surprisingly relevant results against that projection. One example
    > given is the word "driver", which has varied interpretations according
    > to whether one is speaking of motor vehicle operation, golf, or computer
    > devices. A query such as "What driver should I use on a par-5 dogleg?"
    > will produce different hits than "What driver should I use for an NVIDIA
    > card in Linux?" will.


    Indeed. For the former, you want a 1 or 2 wood, whereas when dealing with
    Linux, really you want a sand wedge or something. Or a baseball bat.

    tom

    --
    It's amazing how often conversations with you have the imaginary sound
    of human bones being crushed to rubble in the background. -- itchyfidget,
    to snowking
     
    Tom Anderson, Dec 22, 2008
    #10
  11. Roedy Green

    Daniel Pitts Guest

    Martin Gregorie wrote:
    > On Sun, 21 Dec 2008 18:14:48 -0800, Roedy Green wrote:
    >
    >> Have you seen any work on comparing strings for almost equality, or
    >> measuring difference?
    >>
    >> So that for example, if you had two proverbs, they would compare equal
    >> or close even if they were worded slightly differently, or punctuated
    >> slightly differently.
    >>

    > I think you'd end up needing a grammatical parser. Given the phrases:
    > a) "The cat ate the rat."
    > b) "The rat ate the cat."
    > c) "The cat eaten rat."

    d) "The rat the cat ate was brown."
    e) "It was a brown rat the cat ate."
    >
    > These are trivially simple single phrase sentences. A grammatical parser
    > should be able to determine that (a) and (b) have quite different
    > meanings but (a) and (c) are much closer, but I suspect that even quite
    > clever comaprisons that neglect grammar would be wrong quite often.

    --
    Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
     
    Daniel Pitts, Dec 22, 2008
    #11
  12. On Mon, 22 Dec 2008 21:22:05 +0000, rossum wrote:

    >
    > d) Time flies like an arrow.
    > e) Fruit flies like a banana.
    >

    I thought of using that. Then I decided the mutual rodent-feline
    consumption was a better example of two phrases that a simplistiC word
    matcher might think had the same meaning.


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Dec 22, 2008
    #12
  13. Roedy Green

    Tom Anderson Guest

    On Mon, 22 Dec 2008, Daniel Pitts wrote:

    > Martin Gregorie wrote:
    >> On Sun, 21 Dec 2008 18:14:48 -0800, Roedy Green wrote:
    >>
    >>> Have you seen any work on comparing strings for almost equality, or
    >>> measuring difference?
    >>>
    >>> So that for example, if you had two proverbs, they would compare equal
    >>> or close even if they were worded slightly differently, or punctuated
    >>> slightly differently.

    >>
    >> I think you'd end up needing a grammatical parser. Given the phrases: a)
    >> "The cat ate the rat."
    >> b) "The rat ate the cat."
    >> c) "The cat eaten rat."

    > d) "The rat the cat ate was brown."


    (z) "There Tate theca"

    Correctly spelt, nonsensical, but quite close in edit distance to (b) !

    tom

    --
    It's amazing how often conversations with you have the imaginary sound
    of human bones being crushed to rubble in the background. -- itchyfidget,
    to snowking
     
    Tom Anderson, Dec 22, 2008
    #13
  14. Roedy Green

    Lew Guest

    rossum wrote:
    > d) Time flies like an arrow.
    > e) Fruit flies like a banana.


    Outside of a dog, a book is a man's best friend.
    Inside of a dog, it's too dark to read.
    - Groucho Marx

    --
    Lew
     
    Lew, Dec 22, 2008
    #14
  15. Roedy Green

    Roedy Green Guest

    On Sun, 21 Dec 2008 18:14:48 -0800, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >Have you seen any work on comparing strings for almost equality, or
    >measuring difference?


    Thanks everyone. That was a rich set of responses. I have incorporated
    your suggested links into the Java glossary.

    I had an idea for another use of such logic.

    Lets say you wanted to keep a set of documents on different machines
    in sync.

    If the master document could be chunked, e.g. into sentences (or
    something similar for program text and data records), you could hash
    each sentence, then detect which sentences had been added, deleted and
    changed, and produce a compact delta file to tend to other machines to
    update them.

    NOW, if in addition, you could do fuzzy matching, even those sentences
    that were just mildly changed could be compacted to send only the
    differences
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    PM Steven Harper is fixated on the costs of implementing Kyoto, estimated as high as 1% of GDP.
    However, he refuses to consider the costs of not implementing Kyoto which the
    famous economist Nicholas Stern estimated at 5 to 20% of GDP
     
    Roedy Green, Dec 23, 2008
    #15
  16. On Tue, 23 Dec 2008 00:45:38 -0800, Roedy Green wrote:

    > I had an idea for another use of such logic.
    >

    Why not use CVS? That records (commit) and applies (update) deltas.


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Dec 23, 2008
    #16
  17. On 12/22/08 4:22 PM, rossum wrote:
    > On Mon, 22 Dec 2008 19:18:14 +0000 (UTC), Martin Gregorie
    > <> wrote:
    >
    >> On Sun, 21 Dec 2008 18:14:48 -0800, Roedy Green wrote:
    >>
    >>> Have you seen any work on comparing strings for almost equality, or
    >>> measuring difference?
    >>>
    >>> So that for example, if you had two proverbs, they would compare equal
    >>> or close even if they were worded slightly differently, or punctuated
    >>> slightly differently.
    >>>

    >> I think you'd end up needing a grammatical parser. Given the phrases:
    >> a) "The cat ate the rat."
    >> b) "The rat ate the cat."
    >> c) "The cat eaten rat."

    >
    > d) Time flies like an arrow.
    > e) Fruit flies like a banana.


    It's a pretty little girls school.
     
    John W Kennedy, Dec 24, 2008
    #17
  18. Roedy Green

    Roedy Green Guest

    On Tue, 23 Dec 2008 12:08:47 +0000 (UTC), Martin Gregorie
    <> wrote, quoted or indirectly
    quoted someone who said :

    >Why not use CVS? That records (commit) and applies (update) deltas.


    CVS does not understand reordering. To it, a paragraph moved is a
    paragraph deleted and a paragraph inserted.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    PM Steven Harper is fixated on the costs of implementing Kyoto, estimated as high as 1% of GDP.
    However, he refuses to consider the costs of not implementing Kyoto which the
    famous economist Nicholas Stern estimated at 5 to 20% of GDP
     
    Roedy Green, Dec 24, 2008
    #18
  19. On Wed, 24 Dec 2008 02:01:30 -0800, Roedy Green wrote:

    > On Tue, 23 Dec 2008 12:08:47 +0000 (UTC), Martin Gregorie
    > <> wrote, quoted or indirectly quoted
    > someone who said :
    >
    >>Why not use CVS? That records (commit) and applies (update) deltas.

    >
    > CVS does not understand reordering. To it, a paragraph moved is a
    > paragraph deleted and a paragraph inserted.


    Does that matter? You said you wanted to sync a set of document copies,
    not to track text movement within a document.


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Dec 24, 2008
    #19
  20. Roedy Green

    Tom Anderson Guest

    On Tue, 23 Dec 2008, Roedy Green wrote:

    > I had an idea for another use of such logic.
    >
    > Lets say you wanted to keep a set of documents on different machines
    > in sync.
    >
    > If the master document could be chunked, e.g. into sentences (or
    > something similar for program text and data records), you could hash
    > each sentence, then detect which sentences had been added, deleted and
    > changed, and produce a compact delta file to tend to other machines to
    > update them.


    rsync already does something along these lines, although not, as far as i
    know, sentence-based. From the man page:

    The rsync remote-update protocol allows rsync to transfer just the dif-
    ferences between two sets of files across the network connection, using
    an efficient checksum-search algorithm described in the technical
    report that accompanies this package.

    I'm afraid i can't tell you any more than that!

    > NOW, if in addition, you could do fuzzy matching, even those sentences
    > that were just mildly changed could be compacted to send only the
    > differences


    I don't think you need to know about sentences to do this. You can do it
    on strings of bytes. A very great deal of work has been done on this sort
    of problem (finding the matches, not computing the differences, which is
    fairly trivial once you have the matches) in the context of
    bioinformatics; if you're interested, start with the Smith-Waterman
    algorithm:

    http://en.wikipedia.org/wiki/Smith-Waterman_algorithm

    tom

    --
    Would you like to remember more?
     
    Tom Anderson, Dec 27, 2008
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Steven
    Replies:
    9
    Views:
    425
    Keith Thompson
    Dec 29, 2005
  2. Ben

    Strings, Strings and Damned Strings

    Ben, Jun 22, 2006, in forum: C Programming
    Replies:
    14
    Views:
    814
    Malcolm
    Jun 24, 2006
  3. peter pilsl
    Replies:
    2
    Views:
    182
    Shawn Corey
    Sep 28, 2004
  4. Stefan
    Replies:
    4
    Views:
    148
    comp.llang.perl.moderated
    Jan 2, 2008
  5. Replies:
    3
    Views:
    145
Loading...

Share This Page