How to find all the strings in a long that are at most of n differentcharacters of a test string?

Peng Yu · Nov 19, 2008

Hi,

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Thanks,
Peng

sln · Nov 20, 2008

Hi,

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Thanks,
Peng

Not easily, and probably fairly slow.
You need some heuristic algorithym.

sln

----------------------------

use strict;
use warnings;

my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;

print @tmp,"\n",$rxpattern,"\n\n";

while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}
}

__END__

abcdef
(?i-xsm

(?:a|(.))(?:b|(.))(?:c|(.))(?:d|(.))(?:e|(.))(?:f|(.))))

2 bad chars, but found a close match: 'aacdxf'
2 bad chars, but found a close match: 'abcrxf'

Peng Yu · Nov 20, 2008

Hi,

Click to expand...

Suppose I have a long string $a, and a test string $b.

Click to expand...

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

Click to expand...

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

Click to expand...

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Click to expand...

Thanks,
Peng

Click to expand...

Not easily, and probably fairly slow.
You need some heuristic algorithym.

sln

----------------------------

use strict;
use warnings;

my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";}

$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;

print @tmp,"\n",$rxpattern,"\n\n";

while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}

}

__END__

abcdef
(?i-xsm(?:a|(.))(?:b|(.))(?:c|(.))(?:d|(.))(?:e|(.))(?:f|(.))))

2 bad chars, but found a close match: 'aacdxf'
2 bad chars, but found a close match: 'abcrxf'

Hi,

I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is its
documentation?

Thanks,
Peng

Thanks,
Peng

Tad J McClellan · Nov 20, 2008

I can not find what '$-[$i]' means in your code

It is accessing the array named @-

on online perl
documentation. Would you please let me know its meaning? Where is its
documentation?

perldoc perlvar

John W. Krahn · Nov 20, 2008

Peng said:
use strict;
use warnings;

my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";}

$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;

print @tmp,"\n",$rxpattern,"\n\n";

while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}
}

__END__

Click to expand...

I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is its
documentation?

You can find out about the @- array in:

perldoc perlvar

John

Peng Yu · Nov 20, 2008

Peng said:
Peng said:

On Nov 19, 7:55 pm, (e-mail address removed) wrote:

use strict;
use warnings;
my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;
my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;
print @tmp,"\n",$rxpattern,"\n\n";
while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}
}
__END__

Click to expand...

Click to expand...

I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is its
documentation?

Click to expand...

You can find out about the @- array in:

perldoc perlvar

It seems that the perldoc is not easy to understand by a newbie. While
it explains @-, the perldoc keeps on referring to others things that I
don't understand. Can somebody help explain @- for me with more
examples?

BTW, I'd think that the perldoc be made more readable for newbies. I'm
not sure whether other newbies would have the same feeling.

Thanks,
Peng

Pilcrow · Nov 20, 2008

Hi,

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Thanks,
Peng

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";
}

__END__

matching
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx'
against 'aacdxf'
match 'abcdef' at offset 0
match 'bacdef' at offset 6
match 'aacdxf' at offset 21
match 'aacdef' at offset 28
match 'bacdef' at offset 38
match 'aacdxf' at offset 53
match 'aacdef' at offset 60
match 'aacdxf' at offset 66
match 'aacdfx' at offset 72

Peng Yu · Nov 20, 2008

Hi,

Click to expand...

Suppose I have a long string $a, and a test string $b.

Click to expand...

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

Click to expand...

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

Click to expand...

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Click to expand...

Thanks,
Peng

Click to expand...

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";

}

__END__

If $a could be very long, say, of the order of MB, will this approach
be much slower than the code given by sln?

Thanks,
Peng

Charlton Wilbur · Nov 20, 2008

PY> BTW, I'd think that the perldoc be made more readable for
PY> newbies. I'm not sure whether other newbies would have the same
PY> feeling.

There's a tradeoff between making the perldocs more readable for newbies
and making them more information-dense for experts. There are other
sources of documentation for newbies, but precious few for experts, and
so I think the perldocs are at a very sweet spot.

Further, you can learn more about Perl by following up on everything in
the documentation that you don't understand. Indeed, that's how many of
us learned as much as we did.

Charlton

Peng Yu · Nov 20, 2008

PY> BTW, I'd think that the perldoc be made more readable for
PY> newbies. I'm not sure whether other newbies would have the same
PY> feeling.

There's a tradeoff between making the perldocs more readable for newbies
and making them more information-dense for experts. There are other
sources of documentation for newbies, but precious few for experts, and
so I think the perldocs are at a very sweet spot.

Further, you can learn more about Perl by following up on everything in
the documentation that you don't understand. Indeed, that's how many of
us learned as much as we did.

I total understand what you mean.

At this moment, I only be interested in understanding how @- works. I
tried to read that perldoc page, but I feel it is hard to follow.
Since that section of perldocs is not very readable to newbies, I'm
wondering if there is any more introductory material that explains it
with more examples.

I tried to look the more introductory materials on @- online but I
don't find any.

I think as I understand perl more, I shall be able to read perldoc
more easily.

Thanks,
Peng

John W. Krahn · Nov 20, 2008

Pilcrow said:
Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Click to expand...

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";
}

Or more simpler as:

#!/usr/bin/perl
use strict;
use warnings;

my $a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my $b = 'aacdxf';

my $limit = 2;
my $lenb = length $b;

print "\nmatching '$a' against '$b'\n";

while ( $a =~ /(?=(.{$lenb}))/sog ) {
next if ( $1 ^ $b ) =~ tr/\0//c > $limit;
print "match '$1' at offset $-[0]\n";
}

__END__

John

A. Sinan Unur · Nov 20, 2008

....

I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is
its documentation?

Click to expand...

You can find out about the @- array in:

perldoc perlvar

Click to expand...

It seems that the perldoc is not easy to understand by a newbie. While
it explains @-, the perldoc keeps on referring to others things that I
don't understand. Can somebody help explain @- for me with more
examples?

I looked at the perldoc perlvar page to remind myself what it said. I am
unable to figure out how one could make it clearer:

<blockquote>
This array holds the offsets of the beginnings of the last
successful submatches in the currently active dynamic scope.
$-[0] is the offset into the string of the beginning of the
entire match. The *n*th element of this array holds the offset
of the *n*th submatch, so $-[1] is the offset where $1 begins,
$-[2] the offset where $2 begins, and so on.
</blockquote>

Consider the example below:

C:\DOCUME~1\asu1\LOCALS~1\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

my $s = '123abcABC';

if ( $s =~ /^([0-9]+([0-9]))[a-z]+([A-Z]([A-Z]+))$/ ) {
print "@-\n";
print "$_\n" for $1, $2, $3, $4;
}

__END__

In this case, a successful match would produce

$1 = 123 (offset = 0)
$2 = 3 (offset = 2)
$3 = ABC (offset = 6)
$4 = BC (offset = 7)

The entire match starts at offset 0, so we have @- = (0, 0, 2, 6, 7).

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

Dr.Ruud · Nov 20, 2008

Peng Yu schreef:

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

Use substr() and the "^" operator and y/\0//c.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

$ perl -wle'
$r = "abcdef" ^ "aacdxf";
print $r =~ y/\0//c
'
2

sln · Nov 20, 2008

Pilcrow said:
Pilcrow said:

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Click to expand...

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";
}

Click to expand...

Or more simpler as:

#!/usr/bin/perl
use strict;
use warnings;

my $a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my $b = 'aacdxf';

my $limit = 2;
my $lenb = length $b;

print "\nmatching '$a' against '$b'\n";

while ( $a =~ /(?=(.{$lenb}))/sog ) {
next if ( $1 ^ $b ) =~ tr/\0//c > $limit;
print "match '$1' at offset $-[0]\n";
}

__END__

John

Yes, this is by far the fastest way. It benches 1/3 of the time
of using arrays and substr. Although, traversing with substr's and
using the bitwise ^ strings with tr///c might be faster, if only
by a little, its worth investigating. I personally think using the
regex engine for this is faster.

This is a prime example where bit-wise operators rule over string
comparisons... pure math.

If the problem were to spill over to multiple character comparisons
this won't work but the OP didn't say that.

Fast indeed!

I looked over this pretty good. I got a question about that look-ahead
zero-width assertion.

I noticed the capture group is in the assertion and it only advances the
regexp ordinal position by one.

Example:

/(.{$lenb})/ advances 6
/(?=(.{$lenb}))/ advances 1

I looked in perlre,perlretut and there is only one mention of capture within
a "non-capture zero-width assertion" extended expression elements.
That was just a side from some example on (?>..), and that just mentioned,
the ordinal position is adjusted with a (?=(regex)) analogy.

Do you know why this is?

sln

Dr.Ruud · Nov 21, 2008

Peng Yu schreef:

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Yes, a regular expression is also possible. First write code to create
the regex string, for your example it can be done like this:

perl -wle'
my ($b, @b) = "aacdxf";
for my $i (0 .. 4) {
substr(my $bi = $b, $i, 1, ".");
for my $j ($i+1 .. 5) {
substr(my $bj = $bi, $j, 1, ".");
push @b, $bj;
}
}
my $re = join("\n | ", @b);
$re = qr/
( $re
)
/x;
print $re;
'
(?x-ism:
( ..cdxf
| .a.dxf
| .ac.xf
| .acd.f
| .acdx.
| a..dxf
| a.c.xf
| a.cd.f
| a.cdx.
| aa..xf
| aa.d.f
| aa.dx.
| aac..f
| aac.x.
| aacd..
)
)

sln · Nov 21, 2008

Hi,

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Thanks,
Peng

Click to expand...

Not easily, and probably fairly slow.
You need some heuristic algorithym.

sln

----------------------------

use strict;
use warnings;

my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;

print @tmp,"\n",$rxpattern,"\n\n";

while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}
}

__END__

abcdef
(?i-xsm(?:a|(.))(?:b|(.))(?:c|(.))(?:d|(.))(?:e|(.))(?:f|(.))))

2 bad chars, but found a close match: 'aacdxf'
2 bad chars, but found a close match: 'abcrxf'

Follow-up:

Below is a modified version that chops the time in half.

The fastest solution conforming to the original problem statement
is from John Krahn. Because you can't fight math when posing
1-dimensional problems like this.

However, I'm posting this modified version incase the problem set
expands to multi-character heuristics.

The modifications include staying away from the @- array altogether.
In my opinion, it is just a eval{} in disguise anyway, don't know
don't care. Position is set every time. The pos() function is free,
its virtually no overhead. Reading @-, the patter/sub-pattern matching
position array, on the otherhand, incurrs 5-10 times the overhead.
Its to be avoided at all costs on performance related regexp issues.

Staying away from the @- array but keeping the algorithym means doing
something thats not supposed to be done, but it works:

$i = 1; # is not a reference
$$i # $1

I didn't see any side affects except you have to manage the pos() each time.
See: FAQ 7.29: How can I use a variable as a variable name?

That being said Expanding notation to multi-character substrings
and to add some intelligence (possibly) to dynamically generated regexp's,
might be a help in the future.

For instance, you could expand the definition of matching items like so:

@terms = (
# string - required - generated rxterm
# ----------------------------------------
'Merry ' , 1, # (?:Merry )
'C', , 0, # (?:C|(.))
'h', , 0, # (?:h|(.))
'r', , 0, # (?:r|(.))
'i', , 0, # (?:i|(.))
's', , 0, # (?:s|(.))
't', , 0, # (?:t|(.))
'mas', , 0, # (?:mas|(.{3}))
);

But then your getting into heuristics.
Good luck!

sln

--------------------------------

use strict;
use warnings;

use Benchmark ':hireswallclock';

my ($t0,$t1,$tdif);

my $pattern = 'aacdxf';
my $misses = 2;

my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';

my $rxpattern = qr/$pstr/i;
my $extent = @tmp+1;
my ($i,$cnt,$lastpos) = (0,0,0);

print @tmp,"\n",$rxpattern,"\n\n";

my $fname = 'c:\temp\5MEG_FILE.txt';
open my $fh, $fname or die "can't open $fname...";

$t0 = new Benchmark;

while (<$fh>)
{
chomp;
$lastpos = 0;

while ( /$rxpattern/g )
{
$cnt = 0;
for $i (2..$extent) {
last if (($cnt += defined( $$i)) > $misses);
}
if ($cnt <= $misses) {
print "$cnt bad chars, but found a close match: '$1'\n";
}
pos() = ++$lastpos;
}
}
$t1 = new Benchmark;

close $fh;

$tdif = timediff($t1, $t0);
print "the code took:",timestr($tdif),"\n";

__END__

Peng Yu · Nov 21, 2008

Peng Yu said:
Peng Yu said:

Peng Yu wrote:
On Nov 19, 7:55 pm, (e-mail address removed) wrote:

Click to expand...

...

I can not find what '$-[$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is
its documentation?
You can find out about the @- array in:
perldoc perlvar

Click to expand...

Click to expand...

It seems that the perldoc is not easy to understand by a newbie. While
it explains @-, the perldoc keeps on referring to others things that I
don't understand. Can somebody help explain @- for me with more
examples?

Click to expand...

I looked at the perldoc perlvar page to remind myself what it said. I am
unable to figure out how one could make it clearer:

<blockquote>
This array holds the offsets of the beginnings of the last
successful submatches in the currently active dynamic scope.
$-[0] is the offset into the string of the beginning of the
entire match. The *n*th element of this array holds the offset
of the *n*th submatch, so $-[1] is the offset where $1 begins,
$-[2] the offset where $2 begins, and so on.
</blockquote>

Consider the example below:

C:\DOCUME~1\asu1\LOCALS~1\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

my $s = '123abcABC';

if ( $s =~ /^([0-9]+([0-9]))[a-z]+([A-Z]([A-Z]+))$/ ) {
print "@-\n";
print "$_\n" for $1, $2, $3, $4;

}

__END__

In this case, a successful match would produce

$1 = 123 (offset = 0)
$2 = 3 (offset = 2)
$3 = ABC (offset = 6)
$4 = BC (offset = 7)

The entire match starts at offset 0, so we have @- = (0, 0, 2, 6, 7).

Thank you the example. I see what it means. In general, I think the
document can be made easier to read with more examples.

I agree that adding more examples would make the manual to long to be
read by experts. But I think that this can be solved by having two
sets of document---one is with examples, the other is without
examples.

Thanks,
Peng

Jürgen Exner · Nov 21, 2008

It is a reference, concise and accurate (as much as possible), it is not
and never will be a tutorial.

Then you will have to look for a tutorial or dig down into those other
things. Very often it is not possible or not desirable to explain
something without making references to other concepts. And @- is
certainly not the right place to explain grouping in regular
expressions.

Thank you the example. I see what it means. In general, I think the
document can be made easier to read with more examples.

But there are examples in that section:
<quote>
After a match against some variable $var:

"$`" is the same as "substr($var, 0, $-[0])"
"$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
"$'" is the same as "substr($var, $+[0])"
"$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
"$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"

I agree that adding more examples would make the manual to long to be
read by experts. But I think that this can be solved by having two
sets of document---one is with examples, the other is without
examples.

Go ahead and write the fivehundredandnintysecond tutorial about Perl.
You will find that either you are repeating yourself all over the place
or you start using references all over the place because Perl's concept
are so intertwined and cannot be explained stand-alone.

Not to mention the ensuing maintenance nightmare when you have to fix
something all over the place.

jue

sln · Nov 21, 2008

It is a reference, concise and accurate (as much as possible), it is not
and never will be a tutorial.

What are you defending, a reference?? A reference is and has always been
another word for a POS! It is just a list of prototypes.
A reference has been and always will be nothing of significance at all.
Used by somebody who already knows it all to adjust for typo's and syntax.
I hope those people don't actually write the referece.

Then you will have to look for a tutorial or dig down into those other
things. Very often it is not possible or not desirable to explain
something without making references to other concepts. And @- is
certainly not the right place to explain grouping in regular
expressions.

Why the f*ck not? Can't explain something you have no idea where you
learned it from, then blowhard to someone else needing answer's? Get off man..

Thank you the example. I see what it means. In general, I think the
document can be made easier to read with more examples.

Click to expand...

But there are examples in that section:
<quote>
After a match against some variable $var:

"$`" is the same as "substr($var, 0, $-[0])"
"$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
"$'" is the same as "substr($var, $+[0])"
"$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
"$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"

I agree that adding more examples would make the manual to long to be
read by experts. But I think that this can be solved by having two
sets of document---one is with examples, the other is without
examples.

Click to expand...

Go ahead and write the fivehundredandnintysecond tutorial about Perl.
You will find that either you are repeating yourself all over the place
or you start using references all over the place because Perl's concept
are so intertwined and cannot be explained stand-alone.

You my friend should not be involved in Perl. You have cactus up your
ass, every move you make when reading posts on this group drive's that
POINT home brother!!!

Not to mention the ensuing maintenance nightmare when you have to fix
something all over the place.

jue

Go fly a friggin kite friend..

sln

Pilcrow · Nov 21, 2008

Pilcrow said:
Pilcrow said:

Suppose I have a long string $a, and a test string $b.

I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Click to expand...

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
$cnt = 0;
for (my $j = 0; $j <= $#b; $j++) {
$cnt++ unless $a[$i+$j] eq $b[$j];
next OUTER if $cnt > $limit;
}
my $sub = substr($a,$i,$lenb);
# push @substrings, $sub; # alternate output
print "match '$sub' at offset $i\n";
}

Click to expand...

Or more simpler as:

#!/usr/bin/perl
use strict;
use warnings;

my $a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my $b = 'aacdxf';

my $limit = 2;
my $lenb = length $b;

print "\nmatching '$a' against '$b'\n";

while ( $a =~ /(?=(.{$lenb}))/sog ) {
next if ( $1 ^ $b ) =~ tr/\0//c > $limit;
print "match '$1' at offset $-[0]\n";
}

__END__

Very elegant and fast, but you don't need all the modifiers /sog to the
regex, just /g will do.

How to try a range of hex values in C# code ?	0	Nov 19, 2022
How to check if the n-th part of an array of Strings exist?	13	Oct 25, 2011
Best way to search for a string which has N% in a character class?	5	Mar 2, 2012
FAQ 4.27 How can I access or change N characters of a string?	0	Feb 27, 2011
How to break a bash command into an array consisting of the argumentsin the command?	3	May 12, 2013
Decoding no of ways and printing each decode message	2	Jun 1, 2021
'Needless flexibilities' and structured records [very long]	10	Mar 15, 2013
FAQ 4.29 How can I count the number of occurrences of a substring within a string?	0	Jan 4, 2011

How to find all the strings in a long that are at most of n differentcharacters of a test string?

Peng Yu

sln

Peng Yu

Tad J McClellan

John W. Krahn

Peng Yu

Pilcrow

Peng Yu

Charlton Wilbur

Peng Yu

John W. Krahn

A. Sinan Unur

Dr.Ruud

sln

Dr.Ruud

sln

Peng Yu

Jürgen Exner

sln

Pilcrow

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads