(?{..}) and lexical scoping issues.

Discussion in 'Perl Misc' started by Aronaxis, the Sourceror, Jun 20, 2004.

  1. # ind_compare($string1,$string2 [,accuracy] )
    # calculates strings "similarity"
    # algorithm is cutted from some old Pascal code, and rewritten to use
    # perl RE-engine backtracking for speed.

    use warnings;
    use strict;

    sub DEBUG () {1}

    our ($cnt, $match);
    sub ind_compare ($$;$) {
    my $max_len = $_[2];

    # numification for security reasons.
    $max_len="" unless $max_len +=0;

    # WHY NOT?!
    # my ($cnt,$match)=(0,0);
    ($cnt, $match) = (0,0);

    use re 'eval'; # because of $max_len interpolation
    # in regex below. But we cleaned it.

    # loop for comparing $_[0] against $_[1], and $_[1] against $_[0] too
    for my $i (0,1) {
    $_[$i] =~ m{
    ( .{1,$max_len} )
    (?{
    $cnt++;
    $match++ if index( $_[1-$i], $1 ) != -1;
    })
    (?!) # that always fail and force backtracking.
    }x;
    }

    print "$match/$cnt\n" if DEBUG;

    return 0 unless $cnt;
    $match/$cnt;
    }


    print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
    print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";


    __END__
    Results:
    1) if $match and $cnt in function &ind_compare declared as "our":

    72/72
    1

    64/72
    0.888888888888889


    2) if $match and $cnt in function &ind_compare declared as "my":

    72/72
    1

    0/0
    0

    .......................
    I found, that if I declare theese vars as "my", then code inside (?{...})
    always uses _first_ instances of theese vars, created on _first_ sub
    invocation. (strange, they still cleared on sub exit, but on second
    invocation $cnt inside regex and $cnt outside are not the same!)..
    So it works correctly only once!
    dammit, why?!

    looks like some scoping issues, but then why there are no problems with
    lexical scoped $i, or @_ (which AFAIK is also lexical) ?

    oh, yeah, that's an ActiveState port of perl 5.8.0 for Windows;
    Aronaxis, the Sourceror, Jun 20, 2004
    #1
    1. Advertising

  2. Aronaxis, the Sourceror

    Anno Siegel Guest

    Aronaxis, the Sourceror <> wrote in comp.lang.perl.misc:
    > # ind_compare($string1,$string2 [,accuracy] )
    > # calculates strings "similarity"
    > # algorithm is cutted from some old Pascal code, and rewritten to use
    > # perl RE-engine backtracking for speed.
    >
    > use warnings;
    > use strict;
    >
    > sub DEBUG () {1}
    >
    > our ($cnt, $match);
    > sub ind_compare ($$;$) {
    > my $max_len = $_[2];
    >
    > # numification for security reasons.
    > $max_len="" unless $max_len +=0;
    >
    > # WHY NOT?!
    > # my ($cnt,$match)=(0,0);
    > ($cnt, $match) = (0,0);
    >
    > use re 'eval'; # because of $max_len interpolation
    > # in regex below. But we cleaned it.
    >
    > # loop for comparing $_[0] against $_[1], and $_[1] against $_[0] too
    > for my $i (0,1) {
    > $_[$i] =~ m{
    > ( .{1,$max_len} )
    > (?{
    > $cnt++;
    > $match++ if index( $_[1-$i], $1 ) != -1;
    > })
    > (?!) # that always fail and force backtracking.
    > }x;
    > }
    >
    > print "$match/$cnt\n" if DEBUG;
    >
    > return 0 unless $cnt;
    > $match/$cnt;
    > }
    >
    >
    > print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
    > print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";
    >
    >
    > __END__
    > Results:
    > 1) if $match and $cnt in function &ind_compare declared as "our":
    >
    > 72/72
    > 1
    >
    > 64/72
    > 0.888888888888889
    >
    >
    > 2) if $match and $cnt in function &ind_compare declared as "my":
    >
    > 72/72
    > 1
    >
    > 0/0
    > 0
    >
    > ......................
    > I found, that if I declare theese vars as "my", then code inside (?{...})
    > always uses _first_ instances of theese vars, created on _first_ sub
    > invocation. (strange, they still cleared on sub exit, but on second
    > invocation $cnt inside regex and $cnt outside are not the same!)..
    > So it works correctly only once!
    > dammit, why?!


    I don't have a pat explanation, except to point out that "(?{ ... })"
    is still experimental. It has had scoping issues from day one and
    still has.

    > looks like some scoping issues, but then why there are no problems with
    > lexical scoped $i, or @_ (which AFAIK is also lexical) ?


    @_ is most definitely a package variable.

    The distinction is not with lexical vs. local. If you change "our" to
    "my", but keep the declarations out of the sub, you'll see the same
    difference.

    I also believe you are mistaken about there being no problems with $i.
    In fact, its behavior can also be described as it having a different
    value inside the regex and outside. This means that in effect you are
    doing "index( $_[1], $1)" both times through the loop, though the regex
    is first matched against $_[0] and then against $_[1].

    Check this by removing "my" in front of "$i" and adding "$i" to the list
    of "our" variables outside the loop. You will find that the match count
    changes from 64 to 56 in the second case, which, I think, is correct.
    (You are counting how many substrings of each string are also substrings
    of the other, right?)

    On a more general note, why are you doing this? As far as I can see,
    there is no advantage in using the regex backtracking mechanism and
    (?{ ... }) against a more conventional method of calling Perl code.
    What you are really using it for is a mechanism to walk through
    all substrings of a string. That's somewhat neat, but not necessarily
    a speed gain. If the original Pascal program is anything like I imagine
    it to be, it will be hard to beat with a Perl program.

    To do it in Perl, I wouldn't employ the regex engine at all. Instead
    I'd base it on a procedure to extract all non-empty substrings from
    a string. Ignoring the requirement for a maximal string length for
    simplicity, something like this would do:

    use constant DEBUG => 1;

    print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
    print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";

    sub ind_compare {
    my ( $s, $t) = @_;
    my ( $count, $match);
    for ( 1, 2 ) {
    my @sub = all_substr( $s);
    $count += @sub;
    $match += grep 1 + index( $t, $_), @sub;
    ( $s, $t) = ( $t, $s);
    }

    print "$match/$count\n" if DEBUG;
    $count ? $match/$count : 0;
    }

    sub all_substr {
    my $s = shift;
    my @sub;
    while ( length $s ) {
    push @sub, map substr( $s, $_), 0 .. length( $s) - 1;
    chop $s;
    }
    @sub;
    }

    Anno
    Anno Siegel, Jun 20, 2004
    #2
    1. Advertising

  3. On 20 Jun 2004 17:10:56 GMT, Anno Siegel
    <-berlin.de> wrote:

    > Aronaxis, the Sourceror <> wrote in
    > comp.lang.perl.misc:
    >> # ind_compare($string1,$string2 [,accuracy] )
    >> # calculates strings "similarity"
    >> # algorithm is cutted from some old Pascal code, and rewritten to use
    >> # perl RE-engine backtracking for speed.
    >>
    >> use warnings;
    >> use strict;
    >>
    >> sub DEBUG () {1}
    >>
    >> our ($cnt, $match);

    #inserted:
    our $i;
    >> sub ind_compare ($$;$) {
    >> my $max_len = $_[2];
    >>
    >> # numification for security reasons.
    >> $max_len="" unless $max_len +=0;
    >>
    >> # WHY NOT?!
    >> # my ($cnt,$match)=(0,0);
    >> ($cnt, $match) = (0,0);
    >>
    >> use re 'eval'; # because of $max_len interpolation
    >> # in regex below. But we cleaned it.
    >>
    >> # loop for comparing $_[0] against $_[1], and $_[1] against $_[0] too
    >> for my $i (0,1) {

    #changed:
    for $i (0,1) {
    >> $_[$i] =~ m{
    >> ( .{1,$max_len} )
    >> (?{
    >> $cnt++;
    >> $match++ if index( $_[1-$i], $1 ) != -1;
    >> })
    >> (?!) # that always fail and force backtracking.
    >> }x;
    >> }
    >>
    >> print "$match/$cnt\n" if DEBUG;
    >>
    >> return 0 unless $cnt;
    >> $match/$cnt;
    >> }
    >>
    >>
    >> print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
    >> print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";
    >>
    >>


    {...}

    > I don't have a pat explanation, except to point out that "(?{ ... })"
    > is still experimental. It has had scoping issues from day one and
    > still has.
    >
    >> looks like some scoping issues, but then why there are no problems with
    >> lexical scoped $i, or @_ (which AFAIK is also lexical) ?

    >
    > @_ is most definitely a package variable.


    hm.. I'm sure I've read something claiming that @_ in perl5 behave more as
    lexical than as package.. but I think you're right. And it's too hard to
    feel the difference in this case... I think, if that function will make a
    reference to @_ and store it somewhere, it would break next invocation the
    same way as described. Looks like function internal variables tend to use
    the same "addresses" on next invocation, if only they can..
    ("hm.."x2; I tested it now :). \@_ remains the same on several
    invocations, but when I pushed \@_ to package array @temp, forcing perl to
    change the \@_ on next invocation, regex still works correcly, as before..)

    > The distinction is not with lexical vs. local. If you change "our" to
    > "my", but keep the declarations out of the sub, you'll see the same
    > difference.


    I tested this too. for ($cnt,$match) it's correct. for $i - no. It should
    be "our". I think that's because perl's "for" behaves slightly different
    based on how variable with same name as "for"'s ..er.. iterator declared
    in outer scope: as package or as lexical. Package var will be localized in
    loop, but if there is lexical $i in scope, then "for $i (...)" works as
    "for my $i (...)" does. It's only my assumption.. am I right?

    > I also believe you are mistaken about there being no problems with $i.
    > In fact, its behavior can also be described as it having a different
    > value inside the regex and outside. This means that in effect you are
    > doing "index( $_[1], $1)" both times through the loop, though the regex
    > is first matched against $_[0] and then against $_[1].


    yeah, thank you. I used badly chosen tests and didn't notice a difference.

    > Check this by removing "my" in front of "$i" and adding "$i" to the list
    > of "our" variables outside the loop. You will find that the match count
    > changes from 64 to 56 in the second case, which, I think, is correct.
    > (You are counting how many substrings of each string are also substrings
    > of the other, right?)


    right.

    > On a more general note, why are you doing this? As far as I can see,
    > there is no advantage in using the regex backtracking mechanism and
    > (?{ ... }) against a more conventional method of calling Perl code.
    > What you are really using it for is a mechanism to walk through
    > all substrings of a string. That's somewhat neat, but not necessarily
    > a speed gain. If the original Pascal program is anything like I imagine
    > it to be, it will be hard to beat with a Perl program.


    actually, it was my rewrite of another perl program, which made by my
    friend from pascal program using "one-to-one" translation. It involved
    many length(), substr(), eq, and nested loops, but he didn't ever used
    index().
    he send me his code and some another variants on other languages with
    comparison chart:

    Perl - 17.50s
    O'Caml raw bytecode - 9.77s
    O'Caml funcional bytecode - 9.55s
    O'Caml funcional native - 1.68s
    O'Caml raw native - 1.63s

    his perl program was too lowlevel, and as my experience tells me,
    "lowlevel" in perl's case means "slow". and ugly too. And I wonder if I
    can use internal regex loop for memory saving (I didn't know how long are
    lines he used to compare) and speed.. btw, I never used (?{}) before and
    was curious. So you've seen what curiosity made to fox.

    BTW, my rewrite took only 5.2s on former test.


    > To do it in Perl, I wouldn't employ the regex engine at all. Instead
    > I'd base it on a procedure to extract all non-empty substrings from
    > a string. Ignoring the requirement for a maximal string length for
    > simplicity, something like this would do:
    >
    > use constant DEBUG => 1;
    >
    > print ind_compare( "abcdefgh","abcdefgh" ), "\n\n";
    > print ind_compare( "abcdefg!","abcdefgh" ), "\n\n";
    >
    > sub ind_compare {
    > my ( $s, $t) = @_;
    > my ( $count, $match);
    > for ( 1, 2 ) {
    > my @sub = all_substr( $s);
    > $count += @sub;
    > $match += grep 1 + index( $t, $_), @sub;
    > ( $s, $t) = ( $t, $s);
    > }
    >
    > print "$match/$count\n" if DEBUG;
    > $count ? $match/$count : 0;
    > }
    >
    > sub all_substr {
    > my $s = shift;
    > my @sub;
    > while ( length $s ) {
    > push @sub, map substr( $s, $_), 0 .. length( $s) - 1;
    > chop $s;
    > }
    > @sub;
    > }
    >
    > Anno


    Thank you, it was real pleasure to read such a code.. it's perfect.
    only one drawback I can see - it generates list of all possible substrings
    before checking it, which can take too much memory on arbitrarily long
    strings.. but I'm in doubt if this function would be useful on long
    strings comparison. so your version is great.

    Alexey
    Aronaxis, the Sourceror, Jun 21, 2004
    #3
  4. Aronaxis, the Sourceror

    Anno Siegel Guest

    Aronaxis, the Sourceror <> wrote in comp.lang.perl.misc:
    > On 20 Jun 2004 17:10:56 GMT, Anno Siegel
    > <-berlin.de> wrote:


    [ counting how many substrings of one string $s are also substrings of
    another string $t, lots snipped, including some code by yours truly]

    > Thank you, it was real pleasure to read such a code.. it's perfect.
    > only one drawback I can see - it generates list of all possible substrings
    > before checking it, which can take too much memory on arbitrarily long
    > strings.. but I'm in doubt if this function would be useful on long
    > strings comparison. so your version is great.


    Thanks for your kind words. Programmers suck up compliments about
    their code like old ladies suck up compliments about their complexion.

    Yes, it generates all substrings ahead of time. All the counting
    could be done without moving a single byte of the original strings,
    just delimiting substrings by indices. The original Pascal came close
    to doing it that way, I'm sure.

    But there is another inefficiency. When we detect that a substring of
    $s is a substring of $t, we *know* that all substrings of that string
    will also be substrings of $t. There is no need to generate them,
    we could just add their number to the count. The number of non-empty
    substrings of a string of length $l is $l * ( $l + 1) / 2.

    This leads to a different algorithm: For each starting position in
    $s, find the maximal substring that is also substring of $t. Add
    the number of substrings of that string to the match count. There
    is no need to count the number of substrings we can use the (unproven)
    formula above.

    Another observation is that the number of substrings of $s that are
    substrings of $t is the same as the number of substrings of $t that
    are also substrings of $s. They are both the number of common substrings
    of $s and $t. There is no real need for two rounds of counting.

    So here is a revised version of ind_compare. Its less pretty, but it
    should work for large strings, and be a lot faster when there are
    large common substrings. The code is tested, but not debugged. It
    may be off in limiting cases.

    use constant DEBUG => 1;

    sub ind_compare1 {
    my ( $s, $t) = @_;

    my $match;
    my $from = 0;
    while ( $from < length $s ) {
    my ( $l_match, $l);
    for ( 0 .. length( $s) - $from ) {
    $l = $_; # last length considered
    last unless 1 + index( $t, substr( $s, $from, $_));
    $l_match = $_; # last length with match
    }
    $match += triangle( $l_match);
    $from += $l; # this makes it fast (we hope)
    }

    $match *= 2; # instead of counting again with $s and $t swapped
    my $count = triangle( length $s) + triangle( length $t);

    print "$match/$count\n" if DEBUG;
    $count ? $match/$count : 0;
    }

    sub triangle { return $_*( $_ + 1)/2 for shift }

    Anno
    Anno Siegel, Jun 21, 2004
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Matt Barnicle
    Replies:
    10
    Views:
    626
    Bruno Desthuilliers
    Dec 2, 2007
  2. Khookie

    C closures & lexical scoping

    Khookie, Dec 12, 2007, in forum: C Programming
    Replies:
    28
    Views:
    1,358
    cr88192
    Dec 15, 2007
  3. walterbyrd
    Replies:
    16
    Views:
    462
    Steven D'Aprano
    Dec 18, 2008
  4. PerlFAQ Server
    Replies:
    0
    Views:
    339
    PerlFAQ Server
    Jan 6, 2011
  5. PerlFAQ Server
    Replies:
    0
    Views:
    247
    PerlFAQ Server
    Apr 15, 2011
Loading...

Share This Page