Percentage matching of text

Discussion in 'Python' started by Bruce Eckel, Jul 30, 2004.

  1. Bruce Eckel

    Bruce Eckel Guest

    Background: for the 4th edition of Thinking in Java, I'm trying to
    once again improve the testing scheme for the examples in the book. I
    want to verify that the output I show in the book is "reasonably
    correct." I say "Reasonably" because a number of examples produce
    random numbers or text or the time of day or in general things that do
    not repeat themselves from one execution to the next. So, much of the
    text will be the same between the "control sample" and the "test
    sample," but some of it will be different.

    I will be using Python or Jython for the test framework.

    What I'd like to do is find an algorithm that produces the results of
    a text comparison as a percentage-match. Thus I would be able to
    assert that my test samples must match the control sample by at least
    (for example) 83% for the test to pass. Clearly, this wouldn't be a
    perfect test but it would help flag problems, which is primarily what
    I need.

    Does anyone know of an algorithm or library that would do this? Thanks
    in advance.

    Bruce Eckel
    Bruce Eckel, Jul 30, 2004
    #1

  2. Bruce Eckel wrote:
    > Background: for the 4th edition of Thinking in Java, I'm trying to
    > once again improve the testing scheme for the examples in the book. I
    > want to verify that the output I show in the book is "reasonably
    > correct." I say "Reasonably" because a number of examples produce
    > random numbers or text or the time of day or in general things that do
    > not repeat themselves from one execution to the next. So, much of the
    > text will be the same between the "control sample" and the "test
    > sample," but some of it will be different.
    >
    > I will be using Python or Jython for the test framework.
    >
    > What I'd like to do is find an algorithm that produces the results of
    > a text comparison as a percentage-match. Thus I would be able to
    > assert that my test samples must match the control sample by at least
    > (for example) 83% for the test to pass. Clearly, this wouldn't be a
    > perfect test but it would help flag problems, which is primarily what
    > I need.
    >
    > Does anyone know of an algorithm or library that would do this? Thanks
    > in advance.
    >


    Sorry, this is Perl rather than Python, but I think
    ftp://ftp.funet.fi/pub/languages/perl/CPAN/modules/by-module/String/String-Approx-3.23.tar.gz
    can be tweaked to do that.


    --
    Helmut Jarausch

    Lehrstuhl fuer Numerische Mathematik
    RWTH - Aachen University
    D 52056 Aachen, Germany
    Helmut Jarausch, Jul 30, 2004
    #2
  4. Eddie Corns

    Eddie Corns Guest

    Bruce Eckel writes:

    >What I'd like to do is find an algorithm that produces the results of
    >a text comparison as a percentage-match. Thus I would be able to
    >assert that my test samples must match the control sample by at least
    >(for example) 83% for the test to pass. Clearly, this wouldn't be a
    >perfect test but it would help flag problems, which is primarily what
    >I need.


    How about using the edit distance? That might give you finer control:
    e.g., the maximum edit distance for a date would be X characters if the
    times are close, or XX characters if they are completely random.

    Googling for "python string edit distance" came up with a few matches.
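
    As a rough illustration of the edit-distance idea, here is a minimal
    sketch of the Levenshtein distance in pure Python. The names
    edit_distance and within_distance are hypothetical, not taken from any
    of the packages that a search would turn up, and the allowed number of
    edits is something you would tune per example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far
    # and b[:j]; only one previous row is needed at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def within_distance(control, test, max_edits):
    """Hypothetical helper: pass if test is at most max_edits away."""
    return edit_distance(control, test) <= max_edits
```

    For instance, within_distance("Time: 10:41:07", "Time: 10:41:52", 2)
    passes because only the two seconds digits differ.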

    Eddie
    Eddie Corns, Jul 30, 2004
    #4
  5. Bruce Eckel wrote:
    > Does anyone know of an algorithm or library that would do this? Thanks
    > in advance.


    Maybe you can use crm114 for that. After training it with a few
    examples, and maybe a bit of preprocessing (replacing number literals with a
    special token), it should do the job.

    There is a cmme (crm114 made easy) package available for Python; however, I
    call crm114 as a command-line tool from within Python.
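
    The preprocessing step mentioned above could look roughly like the
    sketch below. It covers only the number replacement, not crm114 itself;
    the function name normalize_numbers and the "[NUM]" token are
    assumptions chosen for illustration, and the pattern handles only plain
    integers and simple decimals.

```python
import re

# Matches integers and simple decimals such as 42, 3.14, or 2004.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def normalize_numbers(text, token="[NUM]"):
    """Replace every number literal with a fixed placeholder token."""
    return NUMBER.sub(token, text)
```

    After this step, two outputs that differ only in their numeric values
    (say "random: 0.731" and "random: 0.052") become identical strings, so
    whatever comparison follows only has to deal with the remaining text.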

    --
    Regards,

    Diez B. Roggisch
    Diez B. Roggisch, Jul 30, 2004
    #5
  6. Dan Bishop

    Dan Bishop Guest

    Bruce Eckel wrote:
    > Background: for the 4th edition of Thinking in Java, I'm trying to
    > once again improve the testing scheme for the examples in the book. I
    > want to verify that the output I show in the book is "reasonably
    > correct." I say "Reasonably" because a number of examples produce
    > random numbers or text or the time of day or in general things that do
    > not repeat themselves from one execution to the next. So, much of the
    > text will be the same between the "control sample" and the "test
    > sample," but some of it will be different.
    >
    > I will be using Python or Jython for the test framework.
    >
    > What I'd like to do is find an algorithm that produces the results of
    > a text comparison as a percentage-match. Thus I would be able to
    > assert that my test samples must match the control sample by at least
    > (for example) 83% for the test to pass. Clearly, this wouldn't be a
    > perfect test but it would help flag problems, which is primarily what
    > I need.
    >
    > Does anyone know of an algorithm or library that would do this? Thanks
    > in advance.


    One of the simpler ones is to calculate the length of the longest
    common subsequence of the test output and the control output.

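    def lcsLength(seqA, seqB):
        # lenTable[i][j] holds the LCS length of seqA[:i] and seqB[:j].
        # The extra row and column of zeros cover the empty prefixes, so
        # i-1 and j-1 never wrap around to the end of the table.
        lenTable = [[0] * (len(seqB) + 1) for _ in range(len(seqA) + 1)]
        for i, a in enumerate(seqA, 1):
            for j, b in enumerate(seqB, 1):
                if a == b:
                    lenTable[i][j] = lenTable[i-1][j-1] + 1
                else:
                    lenTable[i][j] = max(lenTable[i-1][j], lenTable[i][j-1])
        return lenTable[-1][-1]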

    To convert this to a percentage, divide by the length of the control
    output and multiply by 100.
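
    A related ratio is also available from the standard library's difflib
    module. The sketch below is an alternative, not the LCS method itself:
    SequenceMatcher.ratio() returns 2*M/T, where M is the number of matched
    elements and T is the combined length of both inputs, so its scores
    will not equal LCS length divided by control length. The helper names
    percent_match and assert_similar, and the 83.0 default taken from the
    original question, are assumptions for illustration.

```python
import difflib


def percent_match(control, test):
    """Similarity of two strings as a percentage from 0.0 to 100.0."""
    matcher = difflib.SequenceMatcher(None, control, test)
    return 100.0 * matcher.ratio()


def assert_similar(control, test, threshold=83.0):
    """Hypothetical test helper: fail when the match is below threshold."""
    score = percent_match(control, test)
    assert score >= threshold, "only %.1f%% match" % score
```

    On recent Python versions SequenceMatcher applies a heuristic that
    treats very frequent elements as junk in long sequences; passing
    autojunk=False to the constructor turns that off if the scores look
    unexpectedly low on large outputs.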

    Btw, thank you for those footnotes in Thinking in Java that encouraged
    me to try Python :)
    Dan Bishop, Jul 31, 2004
    #6
  7. Hello Bruce,

    Bruce Eckel wrote:

    > ...
    > What I'd like to do is find an algorithm that produces the results
    > of a text comparison as a percentage-match.
    > ...
    > Does anyone know of an algorithm or library that would do this?
    > Thanks in advance.
    >


    I suggest you look at my software, GetReuse, and its SDK:

    http://getreuse.com/
    http://getreuse.com/sdk/

    The formula for calculating the similarity is based on scientific
    research. Any other "good" method of calculation should produce
    results that are equivalent in some terms to the GetReuse results.
    I have not written a paper yet; the formula is an improvement on the
    formula from http://www.cs.ucsb.edu/~mli/sid.ps . Unfortunately,
    I froze the project, but the current code is tested and should work well.

    > Bruce Eckel
    >


    Regards, Oleg
    Oleg Paraschenko, Aug 2, 2004
    #7
  8. Bruce Eckel wrote on Fri, 30 Jul 2004 07:52:39 -0600:
    > Background: for the 4th edition of Thinking in Java, I'm trying to
    > once again improve the testing scheme for the examples in the book. I
    > want to verify that the output I show in the book is "reasonably
    > correct." I say "Reasonably" because a number of examples produce
    > random numbers or text or the time of day or in general things that do
    > not repeat themselves from one execution to the next. So, much of the
    > text will be the same between the "control sample" and the "test
    > sample," but some of it will be different.
    >
    > I will be using Python or Jython for the test framework.
    >
    > What I'd like to do is find an algorithm that produces the results of
    > a text comparison as a percentage-match. Thus I would be able to
    > assert that my test samples must match the control sample by at least
    > (for example) 83% for the test to pass. Clearly, this wouldn't be a
    > perfect test but it would help flag problems, which is primarily what
    > I need.
    >
    > Does anyone know of an algorithm or library that would do this? Thanks
    > in advance.


    Here's an outside-the-box solution: set the random number seed and use
    a fixed date in your tests. Now you can test fixed values, even though
    the application is "random".
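
    A minimal sketch of that approach, written in Python for illustration:
    the function sample_output is a hypothetical stand-in for a program
    under test, and the seed value 42 and the timestamp are arbitrary. The
    point is only that the random source and the clock are passed in, so a
    test can pin both.

```python
import datetime
import random


def sample_output(rng, now):
    """Hypothetical program under test: output depends on a random
    number generator and the current time, both passed in."""
    return "roll=%d at %s" % (rng.randint(1, 6), now.isoformat())


def run_fixed():
    # A seeded generator and a constant timestamp make every run identical.
    rng = random.Random(42)
    now = datetime.datetime(2004, 7, 30, 7, 52, 39)
    return sample_output(rng, now)
```

    With both inputs fixed, run_fixed() returns the same string on every
    call, so the test can compare against the control sample exactly
    instead of by percentage.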

    --
    <a href="http://kuoi.asui.uidaho.edu/~kamikaze/"> Mark Hughes </a>
    "Virtues foster one another; so too, vices.
    Bad English kills trees, consumes energy, and befouls the Earth.
    Good English renews it." -The Underground Grammarian, v1n2
    Mark 'Kamikaze' Hughes, Aug 2, 2004
    #8
  9. Bruce Eckel wrote:
    >
    > What I'd like to do is find an algorithm that produces the results of
    > a text comparison as a percentage-match. Thus I would be able to
    > assert that my test samples must match the control sample by at least
    > (for example) 83% for the test to pass. Clearly, this wouldn't be a
    > perfect test but it would help flag problems, which is primarily what
    > I need.
    >
    > Does anyone know of an algorithm or library that would do this? Thanks
    > in advance.
    >


    Have you come across the following yet?

    Levenshtein C extension module for Python:
    http://trific.ath.cx/resources/python/levenshtein/


    And/or:
    http://hetland.org/python/distance.py


    -Steve
    Steve Christensen, Aug 9, 2004
    #9