Question about Algorithm::Diff

D

Dilbert

I have a question about the Perl Module Algorithm::Diff.

I want to use Algorithm::Diff to calculate the diff of the two
sequences "ab" and "a1bab", the "LCS" (the "Longest Common Sequence")
is a length of 2.

Theoretically there are 3 solutions with LCS = 2:

Solution 1:
---ab
a1bab

Solution 2:
a-b--
a1bab

Solution 3:
a---b
a1bab

I understand that any of those 3 solutions could be returned by
Algorithm::Diff, but I would argue that solution 1 is "better" than
solution 2 or 3, because solution 1 changes only once between '-' and
[ab], whereas solution 2 and 3 change more than once between '-' and
[ab].

This is my Perl program:

use strict;
use warnings;
use 5.010;
use Algorithm::Diff;
use Data::Dumper;
my @old = split //, 'ab';
my @new = split //, 'a1bab';
my $d = Algorithm::Diff::sdiff(\@old, \@new);
my $line = Dumper($d);
$line =~ s{\s}''xmsg;
say $line;

The output is
$VAR1=[['u','a','a'],['+','','1'],['+','','b'],['+','','a'],
['u','b','b']];

This is in fact Solution 3:
a---b
a1bab

How can I teach Algorithm::Diff to choose Solution 1 (the best of the
3 possibilities) ?
 
K

Kyle T. Jones

Dilbert said:
I have a question about the Perl Module Algorithm::Diff.

I want to use Algorithm::Diff to calculate the diff of the two
sequences "ab" and "a1bab", the "LCS" (the "Longest Common Sequence")
is a length of 2.

Theoretically there are 3 solutions with LCS = 2:

Solution 1:
---ab
a1bab

Solution 2:
a-b--
a1bab

Solution 3:
a---b
a1bab

There is only one LCS solution - 'ab'. You're making too much of it and
you don't get to say there are 3 solutions with LCS = 2 (as if you were
assigning 2, and LCS=3 or LCS=1 were possible with the data you
provide), there is just *the* LCS solution, which, in this case, happens
to be 'ab'.
I understand that any of those 3 solutions could be returned by
Algorithm::Diff, but I would argue that solution 1 is "better" than
solution 2 or 3, because solution 1 changes only once between '-' and
[ab], whereas solution 2 and 3 change more than once between '-' and
[ab].

This is my Perl program:

use strict;
use warnings;
use 5.010;
use Algorithm::Diff;
use Data::Dumper;
my @old = split //, 'ab';
my @new = split //, 'a1bab';
my $d = Algorithm::Diff::sdiff(\@old, \@new);
my $line = Dumper($d);
$line =~ s{\s}''xmsg;
say $line;

The output is
$VAR1=[['u','a','a'],['+','','1'],['+','','b'],['+','','a'],
['u','b','b']];

This is in fact Solution 3:
a---b
a1bab

No, this is the smallest set of instructions for turning the first
sequence into the second sequence (same as diff), with what some find to
be an easier to read "side by side" format:

(unchanged, was 'a', stays 'a')
(add, was nothing(''), adds '1')
(add, was nothing(''), adds 'b')
(add, was nothing(''), adds 'a')
(unchanged, was 'b', stays 'b')


The same set of instructions in diff would just be:

[
['+', 1, '1'],
['+', 2, 'b'],
['+', 3, 'a'],
]
How can I teach Algorithm::Diff to choose Solution 1 (the best of the
3 possibilities) ?

That question doesn't make a lot of sense (to me).

http://search.cpan.org/dist/Algorithm-Diff/lib/Algorithm/Diff.pm

Cheers.
 
K

Kyle T. Jones

Dilbert wrote:

I may have misunderstood what you were asking a second ago... let me try
again (including diff for each of your three solutions):
I have a question about the Perl Module Algorithm::Diff.

I want to use Algorithm::Diff to calculate the diff of the two
sequences "ab" and "a1bab", the "LCS" (the "Longest Common Sequence")
is a length of 2.

Theoretically there are 3 solutions with LCS = 2:

Solution 1:
---ab
a1bab


[
['+', 0, 'a'],
['+', 1, '1'],
['+', 2, 'b'],
]
Solution 2:
a-b--
a1bab


[
['+', 1, '1'],
['+', 3, 'a'],
['+', 4, 'b'],
]
Solution 3:
a---b
a1bab


[
['+', 1, '1'],
['+', 2, 'b'],
['+', 3, 'a'],
]
How can I teach Algorithm::Diff to choose Solution 1 (the best of the
3 possibilities) ?

Why are any of the three better? Any way you crack this nut, the
smallest sequence of operations to turn ab into a1bab involves three
adds... so I'm confused as to why you pick one as "best"?

Cheers.
 
D

Dilbert

Why are any of the three better?  Any way you crack this nut, the
smallest sequence of operations to turn ab into a1bab involves three
adds... so I'm confused as to why you pick one as "best"?

Let me try a more realistic example:

use strict;
use warnings;
use 5.010;
use Algorithm::Diff;
use Data::Dumper;
my @old = split //, 'ab';
my @new = split //, 'Show manual contribution: ab :This word can be
displayed';
my $d = Algorithm::Diff::sdiff(\@old, \@new);
my $line = Dumper($d);
$line =~ s{\s}''xmsg;
say $line;

The output alignment (#al1) is:

-a-----------b---------------------------
manual contribution: ab :can be displayed

But what I want is (#al2):

---------------------ab------------------
manual contribution: ab :can be displayed

Why ?

Because, in case I have more than one optimal solution (and this is
the case here), I want to consider not only the value of a character
('a' eq 'a' and 'b' eq 'b'), but also the context each character is
in.

the 'a' in @old has context: {left => 'start', right => 'b'}
the 'b' in @old has context: {left => 'a', right => 'end'}

the 'a' in @new (#al1) context: {left => 'm', right => 'n'}
the 'b' in @new (#al1) context: {left => 'i', right => 'u'}

the 'a' in @new (#al2) context: {left => ' ', right => 'b'}
the 'b' in @new (#al2) context: {left => 'a', right => ' '}

It can easily be seen that @old and @new(#al1) have no context in
common

Whereas @old and @new(#al2) have two context matches, that is
@old('a')->right eq @new(#al2)'a'->right
@old('b')->left eq @new(#al2)'a'->left

That's why I prefer @new (#al2) over @new (#al1)
 
D

Dilbert

my @new = split //, 'Show manual contribution: ab :This word can be
displayed';

That should be...

my @new = split //, 'manual contribution: ab :can be displayed';
Whereas @old and @new(#al2) have two context matches, that is
  @old('a')->right eq @new(#al2)'a'->right
  @old('b')->left eq @new(#al2)'a'->left

in the last line, replace
"...@new(#al2)'a'->left..." by
"...@new(#al2)'b'->left..."
 
I

Ilya Zakharevich

Solution 1:
---ab
a1bab


[
['+', 0, 'a'],
['+', 1, '1'],
['+', 2, 'b'],
]
Solution 2:
a-b--
a1bab


[
['+', 1, '1'],
['+', 3, 'a'],
['+', 4, 'b'],
]
Solution 3:
a---b
a1bab


[
['+', 1, '1'],
['+', 2, 'b'],
['+', 3, 'a'],
]
Why are any of the three better?

Obviously, the metric the OP wants is: assign the "identity edit"
weight eps, and any other edit weight 1+eps, with the exception that N
consecutive edits of the same type get weight N+eps, not N + N*eps.

Looks reasonable (if one convert it to a cheap algorithm to find the
best match...).

Hope this helps,
Ilya
 
E

Ed

Theoretically there are 3 solutions with LCS = 2:

Solution 1:
---ab
a1bab

Solution 2:
a-b--
a1bab

Solution 3:
a---b
a1bab

I understand that any of those 3 solutions could be returned by
Algorithm::Diff, but I would argue that solution 1 is "better" than
solution 2 or 3, because solution 1 changes only once between '-' and
[ab], whereas solution 2 and 3 change more than once between '-' and
[ab].
my $d = Algorithm::Diff::sdiff(\@old, \@new);
How can I teach Algorithm::Diff to choose Solution 1 (the best of the
3 possibilities) ?

Look at traverse_balanced as a starting point. Basically you'd need
to write your own diff calculator based off the LCS, using whatever
method you feel is appropriate. Once you get to the point where
you're looking for the "best" of the possible solutions, you are in
new territory since you'll have to consider the solution set. I don't
think there's anything in Algorithm::Diff that does that sort of thing
- I believe the code simply finds the first solution that uses the
given LCS.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

Algorithm 1
Can someone pls help me with a little algorithm script 1
[ANN] Diff::LCS 1.1.1 5
[ANN] Diff::LCS 1.0.1 0
Question about my projects 3
Register Question 0
[ANN] Diff::LCS 1.0 0
Question help 4

Members online

No members online now.

Forum statistics

Threads
474,266
Messages
2,571,077
Members
48,772
Latest member
Backspace Studios

Latest Threads

Top