(?{..}) and lexical scoping issues.

  • Thread starter Aronaxis, the Sourceror
  • Start date
A

Aronaxis, the Sourceror

# ind_compare($string1,$string2 [,accuracy] )
# calculates strings "similarity"
# algorithm is cutted from some old Pascal code, and rewritten to use
# perl RE-engine backtracking for speed.

use warnings;
use strict;

sub DEBUG () {1}

our ($cnt, $match);
sub ind_compare ($$;$) {
my $max_len = $_[2];

# numification for security reasons.
$max_len="" unless $max_len +=0;

# WHY NOT?!
# my ($cnt,$match)=(0,0);
($cnt, $match) = (0,0);

use re 'eval'; # because of $max_len interpolation
# in regex below. But we cleaned it.

# loop for comparing $_[0] against $_[1], and $_[1] against $_[0] too
for my $i (0,1) {
$_[$i] =~ m{
( .{1,$max_len} )
(?{
$cnt++;
$match++ if index( $_[1-$i], $1 ) != -1;
})
(?!) # that always fail and force backtracking.
}x;
}

print "$match/$cnt\n" if DEBUG;

return 0 unless $cnt;
$match/$cnt;
}


print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";


__END__
Results:
1) if $match and $cnt in function &ind_compare declared as "our":

72/72
1

64/72
0.888888888888889


2) if $match and $cnt in function &ind_compare declared as "my":

72/72
1

0/0
0

.......................
I found, that if I declare theese vars as "my", then code inside (?{...})
always uses _first_ instances of theese vars, created on _first_ sub
invocation. (strange, they still cleared on sub exit, but on second
invocation $cnt inside regex and $cnt outside are not the same!)..
So it works correctly only once!
dammit, why?!

looks like some scoping issues, but then why there are no problems with
lexical scoped $i, or @_ (which AFAIK is also lexical) ?

oh, yeah, that's an ActiveState port of perl 5.8.0 for Windows;
 
A

Anno Siegel

Aronaxis said:
# ind_compare($string1,$string2 [,accuracy] )
# calculates strings "similarity"
# algorithm is cutted from some old Pascal code, and rewritten to use
# perl RE-engine backtracking for speed.

use warnings;
use strict;

sub DEBUG () {1}

our ($cnt, $match);
sub ind_compare ($$;$) {
my $max_len = $_[2];

# numification for security reasons.
$max_len="" unless $max_len +=0;

# WHY NOT?!
# my ($cnt,$match)=(0,0);
($cnt, $match) = (0,0);

use re 'eval'; # because of $max_len interpolation
# in regex below. But we cleaned it.

# loop for comparing $_[0] against $_[1], and $_[1] against $_[0] too
for my $i (0,1) {
$_[$i] =~ m{
( .{1,$max_len} )
(?{
$cnt++;
$match++ if index( $_[1-$i], $1 ) != -1;
})
(?!) # that always fail and force backtracking.
}x;
}

print "$match/$cnt\n" if DEBUG;

return 0 unless $cnt;
$match/$cnt;
}


print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";


__END__
Results:
1) if $match and $cnt in function &ind_compare declared as "our":

72/72
1

64/72
0.888888888888889


2) if $match and $cnt in function &ind_compare declared as "my":

72/72
1

0/0
0

......................
I found, that if I declare theese vars as "my", then code inside (?{...})
always uses _first_ instances of theese vars, created on _first_ sub
invocation. (strange, they still cleared on sub exit, but on second
invocation $cnt inside regex and $cnt outside are not the same!)..
So it works correctly only once!
dammit, why?!

I don't have a pat explanation, except to point out that "(?{ ... })"
is still experimental. It has had scoping issues from day one and
still has.
looks like some scoping issues, but then why there are no problems with
lexical scoped $i, or @_ (which AFAIK is also lexical) ?

@_ is most definitely a package variable.

The distinction is not with lexical vs. local. If you change "our" to
"my", but keep the declarations out of the sub, you'll see the same
difference.

I also believe you are mistaken about there being no problems with $i.
In fact, its behavior can also be described as it having a different
value inside the regex and outside. This means that in effect you are
doing "index( $_[1], $1)" both times through the loop, though the regex
is first matched against $_[0] and then against $_[1].

Check this by removing "my" in front of "$i" and adding "$i" to the list
of "our" variables outside the loop. You will find that the match count
changes from 64 to 56 in the second case, which, I think, is correct.
(You are counting how many substrings of each string are also substrings
of the other, right?)

On a more general note, why are you doing this? As far as I can see,
there is no advantage in using the regex backtracking mechanism and
(?{ ... }) against a more conventional method of calling Perl code.
What you are really using it for is a mechanism to walk through
all substrings of a string. That's somewhat neat, but not necessarily
a speed gain. If the original Pascal program is anything like I imagine
it to be, it will be hard to beat with a Perl program.

To do it in Perl, I wouldn't employ the regex engine at all. Instead
I'd base it on a procedure to extract all non-empty substrings from
a string. Ignoring the requirement for a maximal string length for
simplicity, something like this would do:

use constant DEBUG => 1;

print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";

sub ind_compare {
my ( $s, $t) = @_;
my ( $count, $match);
for ( 1, 2 ) {
my @sub = all_substr( $s);
$count += @sub;
$match += grep 1 + index( $t, $_), @sub;
( $s, $t) = ( $t, $s);
}

print "$match/$count\n" if DEBUG;
$count ? $match/$count : 0;
}

sub all_substr {
my $s = shift;
my @sub;
while ( length $s ) {
push @sub, map substr( $s, $_), 0 .. length( $s) - 1;
chop $s;
}
@sub;
}

Anno
 
A

Aronaxis, the Sourceror

Aronaxis said:
# ind_compare($string1,$string2 [,accuracy] )
# calculates strings "similarity"
# algorithm is cutted from some old Pascal code, and rewritten to use
# perl RE-engine backtracking for speed.

use warnings;
use strict;

sub DEBUG () {1}

our ($cnt, $match);
#inserted:
our $i;
sub ind_compare ($$;$) {
my $max_len = $_[2];

# numification for security reasons.
$max_len="" unless $max_len +=0;

# WHY NOT?!
# my ($cnt,$match)=(0,0);
($cnt, $match) = (0,0);

use re 'eval'; # because of $max_len interpolation
# in regex below. But we cleaned it.

# loop for comparing $_[0] against $_[1], and $_[1] against $_[0] too
for my $i (0,1) {
#changed:
for $i (0,1) {
$_[$i] =~ m{
( .{1,$max_len} )
(?{
$cnt++;
$match++ if index( $_[1-$i], $1 ) != -1;
})
(?!) # that always fail and force backtracking.
}x;
}

print "$match/$cnt\n" if DEBUG;

return 0 unless $cnt;
$match/$cnt;
}


print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";
{...}

I don't have a pat explanation, except to point out that "(?{ ... })"
is still experimental. It has had scoping issues from day one and
still has.
looks like some scoping issues, but then why there are no problems with
lexical scoped $i, or @_ (which AFAIK is also lexical) ?

@_ is most definitely a package variable.

hm.. I'm sure I've read something claiming that @_ in perl5 behave more as
lexical than as package.. but I think you're right. And it's too hard to
feel the difference in this case... I think, if that function will make a
reference to @_ and store it somewhere, it would break next invocation the
same way as described. Looks like function internal variables tend to use
the same "addresses" on next invocation, if only they can..
("hm.."x2; I tested it now :). \@_ remains the same on several
invocations, but when I pushed \@_ to package array @temp, forcing perl to
change the \@_ on next invocation, regex still works correcly, as before..)
The distinction is not with lexical vs. local. If you change "our" to
"my", but keep the declarations out of the sub, you'll see the same
difference.

I tested this too. for ($cnt,$match) it's correct. for $i - no. It should
be "our". I think that's because perl's "for" behaves slightly different
based on how variable with same name as "for"'s ..er.. iterator declared
in outer scope: as package or as lexical. Package var will be localized in
loop, but if there is lexical $i in scope, then "for $i (...)" works as
"for my $i (...)" does. It's only my assumption.. am I right?
I also believe you are mistaken about there being no problems with $i.
In fact, its behavior can also be described as it having a different
value inside the regex and outside. This means that in effect you are
doing "index( $_[1], $1)" both times through the loop, though the regex
is first matched against $_[0] and then against $_[1].

yeah, thank you. I used badly chosen tests and didn't notice a difference.
Check this by removing "my" in front of "$i" and adding "$i" to the list
of "our" variables outside the loop. You will find that the match count
changes from 64 to 56 in the second case, which, I think, is correct.
(You are counting how many substrings of each string are also substrings
of the other, right?)
right.

On a more general note, why are you doing this? As far as I can see,
there is no advantage in using the regex backtracking mechanism and
(?{ ... }) against a more conventional method of calling Perl code.
What you are really using it for is a mechanism to walk through
all substrings of a string. That's somewhat neat, but not necessarily
a speed gain. If the original Pascal program is anything like I imagine
it to be, it will be hard to beat with a Perl program.

actually, it was my rewrite of another perl program, which made by my
friend from pascal program using "one-to-one" translation. It involved
many length(), substr(), eq, and nested loops, but he didn't ever used
index().
he send me his code and some another variants on other languages with
comparison chart:

Perl - 17.50s
O'Caml raw bytecode - 9.77s
O'Caml funcional bytecode - 9.55s
O'Caml funcional native - 1.68s
O'Caml raw native - 1.63s

his perl program was too lowlevel, and as my experience tells me,
"lowlevel" in perl's case means "slow". and ugly too. And I wonder if I
can use internal regex loop for memory saving (I didn't know how long are
lines he used to compare) and speed.. btw, I never used (?{}) before and
was curious. So you've seen what curiosity made to fox.

BTW, my rewrite took only 5.2s on former test.

To do it in Perl, I wouldn't employ the regex engine at all. Instead
I'd base it on a procedure to extract all non-empty substrings from
a string. Ignoring the requirement for a maximal string length for
simplicity, something like this would do:

use constant DEBUG => 1;

print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";

sub ind_compare {
my ( $s, $t) = @_;
my ( $count, $match);
for ( 1, 2 ) {
my @sub = all_substr( $s);
$count += @sub;
$match += grep 1 + index( $t, $_), @sub;
( $s, $t) = ( $t, $s);
}

print "$match/$count\n" if DEBUG;
$count ? $match/$count : 0;
}

sub all_substr {
my $s = shift;
my @sub;
while ( length $s ) {
push @sub, map substr( $s, $_), 0 .. length( $s) - 1;
chop $s;
}
@sub;
}

Anno

Thank you, it was real pleasure to read such a code.. it's perfect.
only one drawback I can see - it generates list of all possible substrings
before checking it, which can take too much memory on arbitrarily long
strings.. but I'm in doubt if this function would be useful on long
strings comparison. so your version is great.

Alexey
 
A

Anno Siegel

Aronaxis said:
On 20 Jun 2004 17:10:56 GMT, Anno Siegel
<[email protected]> wrote:

[ counting how many substrings of one string $s are also substrings of
another string $t, lots snipped, including some code by yours truly]
Thank you, it was real pleasure to read such a code.. it's perfect.
only one drawback I can see - it generates list of all possible substrings
before checking it, which can take too much memory on arbitrarily long
strings.. but I'm in doubt if this function would be useful on long
strings comparison. so your version is great.

Thanks for your kind words. Programmers suck up compliments about
their code like old ladies suck up compliments about their complexion.

Yes, it generates all substrings ahead of time. All the counting
could be done without moving a single byte of the original strings,
just delimiting substrings by indices. The original Pascal came close
to doing it that way, I'm sure.

But there is another inefficiency. When we detect that a substring of
$s is a substring of $t, we *know* that all substrings of that string
will also be substrings of $t. There is no need to generate them,
we could just add their number to the count. The number of non-empty
substrings of a string of length $l is $l * ( $l + 1) / 2.

This leads to a different algorithm: For each starting position in
$s, find the maximal substring that is also substring of $t. Add
the number of substrings of that string to the match count. There
is no need to count the number of substrings we can use the (unproven)
formula above.

Another observation is that the number of substrings of $s that are
substrings of $t is the same as the number of substrings of $t that
are also substrings of $s. They are both the number of common substrings
of $s and $t. There is no real need for two rounds of counting.

So here is a revised version of ind_compare. Its less pretty, but it
should work for large strings, and be a lot faster when there are
large common substrings. The code is tested, but not debugged. It
may be off in limiting cases.

use constant DEBUG => 1;

sub ind_compare1 {
my ( $s, $t) = @_;

my $match;
my $from = 0;
while ( $from < length $s ) {
my ( $l_match, $l);
for ( 0 .. length( $s) - $from ) {
$l = $_; # last length considered
last unless 1 + index( $t, substr( $s, $from, $_));
$l_match = $_; # last length with match
}
$match += triangle( $l_match);
$from += $l; # this makes it fast (we hope)
}

$match *= 2; # instead of counting again with $s and $t swapped
my $count = triangle( length $s) + triangle( length $t);

print "$match/$count\n" if DEBUG;
$count ? $match/$count : 0;
}

sub triangle { return $_*( $_ + 1)/2 for shift }

Anno
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,756
Messages
2,569,540
Members
45,025
Latest member
KetoRushACVFitness

Latest Threads

Top