How to find all the strings in a long that are at most of n differentcharacters of a test string?

P

Peng Yu

Hi,

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Thanks,
Peng
 
S

sln

Hi,

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Thanks,
Peng

Not easily, and probably fairly slow.
You need some heuristic algorithym.

sln

----------------------------

use strict;
use warnings;


my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;

print @tmp,"\n",$rxpattern,"\n\n";

while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}
}

__END__



abcdef
(?i-xsm:((?:a|(.))(?:b|(.))(?:c|(.))(?:d|(.))(?:e|(.))(?:f|(.))))

2 bad chars, but found a close match: 'aacdxf'
2 bad chars, but found a close match: 'abcrxf'
 
P

Peng Yu

Suppose I have a long string $a, and a test string $b.
I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.
For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.
I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?
Thanks,
Peng

Not easily, and probably fairly slow.
You need some heuristic algorithym.

sln

----------------------------

use strict;
use warnings;

my $str     = 'aacdxfo   sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses  = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
        $pstr .= "(?:$_|(.))";}

$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;

print @tmp,"\n",$rxpattern,"\n\n";

while ($str =~ /$rxpattern/g )
{
        my $cnt = 0;
        for my $i (2..(@tmp+1)) {
                last if (($cnt += defined( $-[$i])) > $misses);
        }
        if ($cnt > $misses) {
                pos($str) = $-[0]+1;
        } else {
                print "$cnt bad chars, but found a close match: '$1'\n";
        }

}

__END__

abcdef
(?i-xsm:((?:a|(.))(?:b|(.))(?:c|(.))(?:d|(.))(?:e|(.))(?:f|(.))))

2 bad chars, but found a close match: 'aacdxf'
2 bad chars, but found a close match: 'abcrxf'

Hi,

I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is its
documentation?

Thanks,
Peng

Thanks,
Peng
 
J

John W. Krahn

Peng said:
use strict;
use warnings;

my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";}

$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;

print @tmp,"\n",$rxpattern,"\n\n";

while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}
}

__END__

I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is its
documentation?

You can find out about the @- array in:

perldoc perlvar



John
 
P

Peng Yu

Peng said:
On Nov 19, 7:55 pm, (e-mail address removed) wrote:
use strict;
use warnings;
my $str     = 'aacdxfo   sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses  = 2;
my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
        $pstr .= "(?:$_|(.))";}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;
print @tmp,"\n",$rxpattern,"\n\n";
while ($str =~ /$rxpattern/g )
{
        my $cnt = 0;
        for my $i (2..(@tmp+1)) {
                last if (($cnt += defined( $-[$i])) > $misses);
        }
        if ($cnt > $misses) {
                pos($str) = $-[0]+1;
        } else {
                print "$cnt bad chars, but found a close match: '$1'\n";
        }
}
__END__
I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is its
documentation?

You can find out about the @- array in:

perldoc perlvar

It seems that the perldoc is not easy to understand by a newbie. While
it explains @-, the perldoc keeps on referring to others things that I
don't understand. Can somebody help explain @- for me with more
examples?

BTW, I'd think that the perldoc be made more readable for newbies. I'm
not sure whether other newbies would have the same feeling.

Thanks,
Peng
 
P

Pilcrow

Hi,

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Thanks,
Peng

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";
}

__END__


matching
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx'
against 'aacdxf'
match 'abcdef' at offset 0
match 'bacdef' at offset 6
match 'aacdxf' at offset 21
match 'aacdef' at offset 28
match 'bacdef' at offset 38
match 'aacdxf' at offset 53
match 'aacdef' at offset 60
match 'aacdxf' at offset 66
match 'aacdfx' at offset 72
 
P

Peng Yu

Suppose I have a long string $a, and a test string $b.
I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.
For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.
I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?
Thanks,
Peng

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
    $cnt = 0;
    for (my $j = 0; $j <= $#b; $j++) {
        $cnt++ unless $a[$i+$j] eq $b[$j];
        next OUTER if $cnt > $limit;
    }
    my $sub = substr($a,$i,$lenb);
#    push @substrings, $sub;            # alternate output
    print "match '$sub' at offset $i\n";

}

__END__

If $a could be very long, say, of the order of MB, will this approach
be much slower than the code given by sln?

Thanks,
Peng
 
C

Charlton Wilbur

PY> BTW, I'd think that the perldoc be made more readable for
PY> newbies. I'm not sure whether other newbies would have the same
PY> feeling.

There's a tradeoff between making the perldocs more readable for newbies
and making them more information-dense for experts. There are other
sources of documentation for newbies, but precious few for experts, and
so I think the perldocs are at a very sweet spot.

Further, you can learn more about Perl by following up on everything in
the documentation that you don't understand. Indeed, that's how many of
us learned as much as we did.

Charlton
 
P

Peng Yu

    PY> BTW, I'd think that the perldoc be made more readable for
    PY> newbies. I'm not sure whether other newbies would have the same
    PY> feeling.

There's a tradeoff between making the perldocs more readable for newbies
and making them more information-dense for experts.  There are other
sources of documentation for newbies, but precious few for experts, and
so I think the perldocs are at a very sweet spot.

Further, you can learn more about Perl by following up on everything in
the documentation that you don't understand.  Indeed, that's how many of
us learned as much as we did.

I total understand what you mean.

At this moment, I only be interested in understanding how @- works. I
tried to read that perldoc page, but I feel it is hard to follow.
Since that section of perldocs is not very readable to newbies, I'm
wondering if there is any more introductory material that explains it
with more examples.

I tried to look the more introductory materials on @- online but I
don't find any.

I think as I understand perl more, I shall be able to read perldoc
more easily.

Thanks,
Peng
 
J

John W. Krahn

Pilcrow said:
Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";
}

Or more simpler as:

#!/usr/bin/perl
use strict;
use warnings;

my $a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my $b = 'aacdxf';

my $limit = 2;
my $lenb = length $b;

print "\nmatching '$a' against '$b'\n";

while ( $a =~ /(?=(.{$lenb}))/sog ) {
next if ( $1 ^ $b ) =~ tr/\0//c > $limit;
print "match '$1' at offset $-[0]\n";
}

__END__




John
 
A

A. Sinan Unur

....
I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is
its documentation?

You can find out about the @- array in:

perldoc perlvar

It seems that the perldoc is not easy to understand by a newbie. While
it explains @-, the perldoc keeps on referring to others things that I
don't understand. Can somebody help explain @- for me with more
examples?

I looked at the perldoc perlvar page to remind myself what it said. I am
unable to figure out how one could make it clearer:

<blockquote>
This array holds the offsets of the beginnings of the last
successful submatches in the currently active dynamic scope.
$-[0] is the offset into the string of the beginning of the
entire match. The *n*th element of this array holds the offset
of the *n*th submatch, so $-[1] is the offset where $1 begins,
$-[2] the offset where $2 begins, and so on.
</blockquote>

Consider the example below:

C:\DOCUME~1\asu1\LOCALS~1\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

my $s = '123abcABC';

if ( $s =~ /^([0-9]+([0-9]))[a-z]+([A-Z]([A-Z]+))$/ ) {
print "@-\n";
print "$_\n" for $1, $2, $3, $4;
}

__END__

In this case, a successful match would produce

$1 = 123 (offset = 0)
$2 = 3 (offset = 2)
$3 = ABC (offset = 6)
$4 = BC (offset = 7)

The entire match starts at offset 0, so we have @- = (0, 0, 2, 6, 7).

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
D

Dr.Ruud

Peng Yu schreef:
Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

Use substr() and the "^" operator and y/\0//c.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

$ perl -wle'
$r = "abcdef" ^ "aacdxf";
print $r =~ y/\0//c
'
2
 
S

sln

Pilcrow said:
Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";
}

Or more simpler as:

#!/usr/bin/perl
use strict;
use warnings;

my $a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my $b = 'aacdxf';

my $limit = 2;
my $lenb = length $b;

print "\nmatching '$a' against '$b'\n";

while ( $a =~ /(?=(.{$lenb}))/sog ) {
next if ( $1 ^ $b ) =~ tr/\0//c > $limit;
print "match '$1' at offset $-[0]\n";
}

__END__




John

Yes, this is by far the fastest way. It benches 1/3 of the time
of using arrays and substr. Although, traversing with substr's and
using the bitwise ^ strings with tr///c might be faster, if only
by a little, its worth investigating. I personally think using the
regex engine for this is faster.

This is a prime example where bit-wise operators rule over string
comparisons... pure math.

If the problem were to spill over to multiple character comparisons
this won't work but the OP didn't say that.

Fast indeed!

I looked over this pretty good. I got a question about that look-ahead
zero-width assertion.

I noticed the capture group is in the assertion and it only advances the
regexp ordinal position by one.

Example:

/(.{$lenb})/ advances 6
/(?=(.{$lenb}))/ advances 1

I looked in perlre,perlretut and there is only one mention of capture within
a "non-capture zero-width assertion" extended expression elements.
That was just a side from some example on (?>..), and that just mentioned,
the ordinal position is adjusted with a (?=(regex)) analogy.

Do you know why this is?


sln
 
D

Dr.Ruud

Peng Yu schreef:
Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Yes, a regular expression is also possible. First write code to create
the regex string, for your example it can be done like this:

perl -wle'
my ($b, @b) = "aacdxf";
for my $i (0 .. 4) {
substr(my $bi = $b, $i, 1, ".");
for my $j ($i+1 .. 5) {
substr(my $bj = $bi, $j, 1, ".");
push @b, $bj;
}
}
my $re = join("\n | ", @b);
$re = qr/
( $re
)
/x;
print $re;
'
(?x-ism:
( ..cdxf
| .a.dxf
| .ac.xf
| .acd.f
| .acdx.
| a..dxf
| a.c.xf
| a.cd.f
| a.cdx.
| aa..xf
| aa.d.f
| aa.dx.
| aac..f
| aac.x.
| aacd..
)
)
 
S

sln

Hi,

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Thanks,
Peng

Not easily, and probably fairly slow.
You need some heuristic algorithym.

sln

----------------------------

use strict;
use warnings;


my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;

print @tmp,"\n",$rxpattern,"\n\n";

while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}
}

__END__



abcdef
(?i-xsm:((?:a|(.))(?:b|(.))(?:c|(.))(?:d|(.))(?:e|(.))(?:f|(.))))

2 bad chars, but found a close match: 'aacdxf'
2 bad chars, but found a close match: 'abcrxf'

Follow-up:

Below is a modified version that chops the time in half.

The fastest solution conforming to the original problem statement
is from John Krahn. Because you can't fight math when posing
1-dimensional problems like this.

However, I'm posting this modified version incase the problem set
expands to multi-character heuristics.

The modifications include staying away from the @- array altogether.
In my opinion, it is just a eval{} in disguise anyway, don't know
don't care. Position is set every time. The pos() function is free,
its virtually no overhead. Reading @-, the patter/sub-pattern matching
position array, on the otherhand, incurrs 5-10 times the overhead.
Its to be avoided at all costs on performance related regexp issues.

Staying away from the @- array but keeping the algorithym means doing
something thats not supposed to be done, but it works:

$i = 1; # is not a reference
$$i # $1

I didn't see any side affects except you have to manage the pos() each time.
See: FAQ 7.29: How can I use a variable as a variable name?

That being said Expanding notation to multi-character substrings
and to add some intelligence (possibly) to dynamically generated regexp's,
might be a help in the future.

For instance, you could expand the definition of matching items like so:

@terms = (
# string - required - generated rxterm
# ----------------------------------------
'Merry ' , 1, # (?:Merry )
'C', , 0, # (?:C|(.))
'h', , 0, # (?:h|(.))
'r', , 0, # (?:r|(.))
'i', , 0, # (?:i|(.))
's', , 0, # (?:s|(.))
't', , 0, # (?:t|(.))
'mas', , 0, # (?:mas|(.{3}))
);

But then your getting into heuristics.
Good luck!

sln

--------------------------------

use strict;
use warnings;

use Benchmark ':hireswallclock';

my ($t0,$t1,$tdif);

my $pattern = 'aacdxf';
my $misses = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';

my $rxpattern = qr/$pstr/i;
my $extent = @tmp+1;
my ($i,$cnt,$lastpos) = (0,0,0);

print @tmp,"\n",$rxpattern,"\n\n";

my $fname = 'c:\temp\5MEG_FILE.txt';
open my $fh, $fname or die "can't open $fname...";

$t0 = new Benchmark;

while (<$fh>)
{
chomp;
$lastpos = 0;

while ( /$rxpattern/g )
{
$cnt = 0;
for $i (2..$extent) {
last if (($cnt += defined( $$i)) > $misses);
}
if ($cnt <= $misses) {
print "$cnt bad chars, but found a close match: '$1'\n";
}
pos() = ++$lastpos;
}
}
$t1 = new Benchmark;

close $fh;

$tdif = timediff($t1, $t0);
print "the code took:",timestr($tdif),"\n";

__END__
 
P

Peng Yu

Peng Yu said:
Peng Yu wrote:
On Nov 19, 7:55 pm, (e-mail address removed) wrote:
...
I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is
its documentation?
You can find out about the @- array in:
perldoc perlvar
It seems that the perldoc is not easy to understand by a newbie. While
it explains @-, the perldoc keeps on referring to others things that I
don't understand. Can somebody help explain @- for me with more
examples?

I looked at the perldoc perlvar page to remind myself what it said. I am
unable to figure out how one could make it clearer:

<blockquote>
This array holds the offsets of the beginnings of the last
successful submatches in the currently active dynamic scope.
$-[0] is the offset into the string of the beginning of the
entire match. The *n*th element of this array holds the offset
of the *n*th submatch, so $-[1] is the offset where $1 begins,
$-[2] the offset where $2 begins, and so on.
</blockquote>

Consider the example below:

C:\DOCUME~1\asu1\LOCALS~1\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

my $s = '123abcABC';

if ( $s =~ /^([0-9]+([0-9]))[a-z]+([A-Z]([A-Z]+))$/ ) {
    print "@-\n";
    print "$_\n" for $1, $2, $3, $4;

}

__END__

In this case, a successful match would produce

$1 = 123 (offset = 0)
$2 = 3   (offset = 2)
$3 = ABC (offset = 6)
$4 = BC  (offset = 7)

The entire match starts at offset 0, so we have @- = (0, 0, 2, 6, 7).

Thank you the example. I see what it means. In general, I think the
document can be made easier to read with more examples.

I agree that adding more examples would make the manual to long to be
read by experts. But I think that this can be solved by having two
sets of document---one is with examples, the other is without
examples.

Thanks,
Peng
 
J

Jürgen Exner

It is a reference, concise and accurate (as much as possible), it is not
and never will be a tutorial.

Then you will have to look for a tutorial or dig down into those other
things. Very often it is not possible or not desirable to explain
something without making references to other concepts. And @- is
certainly not the right place to explain grouping in regular
expressions.
Thank you the example. I see what it means. In general, I think the
document can be made easier to read with more examples.


But there are examples in that section:
<quote>
After a match against some variable $var:

"$`" is the same as "substr($var, 0, $-[0])"
"$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
"$'" is the same as "substr($var, $+[0])"
"$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
"$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"
I agree that adding more examples would make the manual to long to be
read by experts. But I think that this can be solved by having two
sets of document---one is with examples, the other is without
examples.

Go ahead and write the fivehundredandnintysecond tutorial about Perl.
You will find that either you are repeating yourself all over the place
or you start using references all over the place because Perl's concept
are so intertwined and cannot be explained stand-alone.

Not to mention the ensuing maintenance nightmare when you have to fix
something all over the place.

jue
 
S

sln

It is a reference, concise and accurate (as much as possible), it is not
and never will be a tutorial.
What are you defending, a reference?? A reference is and has always been
another word for a POS! It is just a list of prototypes.
A reference has been and always will be nothing of significance at all.
Used by somebody who already knows it all to adjust for typo's and syntax.
I hope those people don't actually write the referece.
Then you will have to look for a tutorial or dig down into those other
things. Very often it is not possible or not desirable to explain
something without making references to other concepts. And @- is
certainly not the right place to explain grouping in regular
expressions.
Why the f*ck not? Can't explain something you have no idea where you
learned it from, then blowhard to someone else needing answer's? Get off man..
Thank you the example. I see what it means. In general, I think the
document can be made easier to read with more examples.


But there are examples in that section:
<quote>
After a match against some variable $var:

"$`" is the same as "substr($var, 0, $-[0])"
"$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
"$'" is the same as "substr($var, $+[0])"
"$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
"$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"
I agree that adding more examples would make the manual to long to be
read by experts. But I think that this can be solved by having two
sets of document---one is with examples, the other is without
examples.

Go ahead and write the fivehundredandnintysecond tutorial about Perl.
You will find that either you are repeating yourself all over the place
or you start using references all over the place because Perl's concept
are so intertwined and cannot be explained stand-alone.
You my friend should not be involved in Perl. You have cactus up your
ass, every move you make when reading posts on this group drive's that
POINT home brother!!!
Not to mention the ensuing maintenance nightmare when you have to fix
something all over the place.

jue

Go fly a friggin kite friend..


sln
 
P

Pilcrow

Pilcrow said:
Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";
}

Or more simpler as:

#!/usr/bin/perl
use strict;
use warnings;

my $a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my $b = 'aacdxf';

my $limit = 2;
my $lenb = length $b;

print "\nmatching '$a' against '$b'\n";

while ( $a =~ /(?=(.{$lenb}))/sog ) {
next if ( $1 ^ $b ) =~ tr/\0//c > $limit;
print "match '$1' at offset $-[0]\n";
}

__END__

Very elegant and fast, but you don't need all the modifiers /sog to the
regex, just /g will do.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,773
Messages
2,569,594
Members
45,122
Latest member
VinayKumarNevatia_
Top