Percentage matching of text

Bruce Eckel · Jul 30, 2004

Background: for the 4th edition of Thinking in Java, I'm trying to
once again improve the testing scheme for the examples in the book. I
want to verify that the output I show in the book is "reasonably
correct." I say "Reasonably" because a number of examples produce
random numbers or text or the time of day or in general things that do
not repeat themselves from one execution to the next. So, much of the
text will be the same between the "control sample" and the "test
sample," but some of it will be different.

I will be using Python or Jython for the test framework.

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

Bruce Eckel
(e-mail address removed)

Helmut Jarausch · Jul 30, 2004

Bruce said:
Background: for the 4th edition of Thinking in Java, I'm trying to
once again improve the testing scheme for the examples in the book. I
want to verify that the output I show in the book is "reasonably
correct." I say "Reasonably" because a number of examples produce
random numbers or text or the time of day or in general things that do
not repeat themselves from one execution to the next. So, much of the
text will be the same between the "control sample" and the "test
sample," but some of it will be different.

I will be using Python or Jython for the test framework.

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

Sorry, I don't know of one in Python, only in Perl. I think
ftp://ftp.funet.fi/pub/languages/perl/CPAN/modules/by-module/String/String-Approx-3.23.tar.gz
can be tweaked to do that.

--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany

Eddie Corns · Jul 30, 2004

Bruce Eckel said:
What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

How about using the edit distance? This might give you finer control:
e.g., the maximum edit distance for a date would be within X characters if
the times are close, or XX characters if they are completely random.
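
For reference, a minimal sketch of that idea in plain Python, assuming the Levenshtein variant of edit distance (insertions, deletions, and substitutions each cost 1). The names edit_distance and within_budget are made up for illustration and do not come from any particular library:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def within_budget(control, test, max_edits):
    """Hypothetical helper: pass if test is at most max_edits away."""
    return edit_distance(control, test) <= max_edits
```

With this, a line holding a timestamp could be given a small budget (for example, within_budget(control_line, test_line, 2)), while lines that must not change get a budget of 0.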

Googling for "python string edit distance" came up with a few matches.

Eddie

Diez B. Roggisch · Jul 30, 2004

Bruce said:
Does anyone know of an algorithm or library that would do this? Thanks
in advance.

Maybe you can use crm114 for that. After training it with a few
examples, and maybe a bit of preprocessing (replacing number literals with
a special token), it should do the job.
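
The preprocessing step on its own (not crm114 itself) might look roughly like the sketch below. The regular expression and the <NUM> token are assumptions chosen for illustration; real output may need a wider pattern:

```python
import re

# Matches integers and simple decimals such as 42, 3.14 or -7.
# This pattern is an assumption; widen it if the output contains
# exponents, hex literals, or thousands separators.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def mask_numbers(text, token="<NUM>"):
    """Replace every number literal in text with a placeholder token."""
    return NUMBER.sub(token, text)
```

After masking, two runs that differ only in their numbers produce identical text, so whatever comparison follows sees them as the same.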

There is a cmme (crm114 made easy) package available for Python; however, I
use crm114 as a command-line tool called from Python.

Dan Bishop · Jul 31, 2004

Bruce Eckel said:
Background: for the 4th edition of Thinking in Java, I'm trying to
once again improve the testing scheme for the examples in the book. I
want to verify that the output I show in the book is "reasonably
correct." I say "Reasonably" because a number of examples produce
random numbers or text or the time of day or in general things that do
not repeat themselves from one execution to the next. So, much of the
text will be the same between the "control sample" and the "test
sample," but some of it will be different.

I will be using Python or Jython for the test framework.

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

One of the simpler ones is to calculate the length of the longest
common subsequence of the test output and the control output.

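def lcsLength(seqA, seqB):
    # lenTable[i][j] holds the LCS length of seqA[:i] and seqB[:j].
    # The extra row and column of zeros cover the empty prefixes, so
    # the i-1 and j-1 lookups never wrap around to the end of the table.
    lenTable = [[0] * (len(seqB) + 1) for _ in range(len(seqA) + 1)]
    for i, a in enumerate(seqA, 1):
        for j, b in enumerate(seqB, 1):
            if a == b:
                lenTable[i][j] = lenTable[i-1][j-1] + 1
            else:
                lenTable[i][j] = max(lenTable[i-1][j], lenTable[i][j-1])
    return lenTable[-1][-1]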

To convert this to a percentage value, divide by the length of the
control output and multiply by 100.
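
A self-contained sketch of that conversion follows. The names lcs_length and percent_match, and the handling of an empty control sample, are assumptions made for illustration; this version keeps only two rows of the table to save memory:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    # prev[j] is the LCS length of the previous prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def percent_match(control, test):
    """LCS length as a percentage of the control's length."""
    if not control:
        # Assumption: two empty outputs count as a perfect match.
        return 100.0 if not test else 0.0
    return 100.0 * lcs_length(control, test) / len(control)
```

A test could then assert percent_match(control, test) >= 83. Note that dividing by the control length alone does not penalize extra text in the test output; dividing by the longer of the two lengths is one possible way to account for that. The functions accept any sequences, so passing lists of lines instead of strings keeps the table much smaller.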

Btw, thank you for those footnotes in Thinking in Java that encouraged
me to try Python.

Oleg Paraschenko · Aug 2, 2004

Hello Bruce,

...
What I'd like to do is find an algorithm that produces the results
of a text comparison as a percentage-match.
...
Does anyone know of an algorithm or library that would do this?
Thanks in advance.

I suggest you look at my software, GetReuse, and its SDK:

http://getreuse.com/
http://getreuse.com/sdk/

The formula for calculating the similarity is based on scientific
research. Any other "good" method of calculation should produce
results that are in some sense equivalent to the GetReuse results.
I have not written a paper yet; the formula is an improvement on
the formula from http://www.cs.ucsb.edu/~mli/sid.ps . Unfortunately,
I have frozen the project, but the current code is tested and should
work well.

Regards, Oleg

Mark 'Kamikaze' Hughes · Aug 2, 2004

Bruce Eckel said:
Background: for the 4th edition of Thinking in Java, I'm trying to
once again improve the testing scheme for the examples in the book. I
want to verify that the output I show in the book is "reasonably
correct." I say "Reasonably" because a number of examples produce
random numbers or text or the time of day or in general things that do
not repeat themselves from one execution to the next. So, much of the
text will be the same between the "control sample" and the "test
sample," but some of it will be different.

I will be using Python or Jython for the test framework.

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

Here's an outside-the-box solution: set the random number seed and use
a fixed date in your tests. Now you can test against fixed values, even
though the application is "random".
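
A rough illustration of the idea in Python, assuming the code under test can be handed its random source and its clock. The function sample_report and the constants are hypothetical stand-ins, not taken from the book:

```python
import datetime
import random


def sample_report(rng, now):
    """Hypothetical stand-in for an example whose output depends on
    a random number and on the current time."""
    return "roll=%d at %s" % (rng.randint(1, 6), now.isoformat())


# Fixed inputs: the same seed and the same timestamp on every run.
# Both values are arbitrary choices for this sketch.
FIXED_SEED = 47
FIXED_NOW = datetime.datetime(2004, 8, 2, 12, 0, 0)

first = sample_report(random.Random(FIXED_SEED), FIXED_NOW)
second = sample_report(random.Random(FIXED_SEED), FIXED_NOW)
assert first == second  # identical output, so an exact comparison works
```

The catch is that the examples themselves must accept the seed and the date from outside; output produced from an unseeded generator or the system clock still needs a fuzzy comparison.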

Steve Christensen · Aug 9, 2004

What I'd like to do is find an algorithm that produces the results of
a text comparison as a percentage-match. Thus I would be able to
assert that my test samples must match the control sample by at least
(for example) 83% for the test to pass. Clearly, this wouldn't be a
perfect test but it would help flag problems, which is primarily what
I need.

Does anyone know of an algorithm or library that would do this? Thanks
in advance.

Have you come across the following yet?

Levenshtein C extension module for Python:
http://trific.ath.cx/resources/python/levenshtein/

And/or:
http://hetland.org/python/distance.py

-Steve
