Hi,
Suppose I have a long string $a, and a test string $b.
I want to fine all the substrings in $a, whose length is the same as
$b with at most n mismatches.
For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.
I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?
Thanks,
Peng
Not easily, and probably fairly slow.
You need some heuristic algorithym.
sln
----------------------------
use strict;
use warnings;
my $str = 'aacdxfo sdfbsabcrxfodfbdfb';
my $pattern = 'abcdef';
my $misses = 2;
my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;
print @tmp,"\n",$rxpattern,"\n\n";
while ($str =~ /$rxpattern/g )
{
my $cnt = 0;
for my $i (2..(@tmp+1)) {
last if (($cnt += defined( $-[$i])) > $misses);
}
if ($cnt > $misses) {
pos($str) = $-[0]+1;
} else {
print "$cnt bad chars, but found a close match: '$1'\n";
}
}
__END__
abcdef
(?i-xsm
(?:a|(.))(?:b|(.))(?:c|(.))(?:d|(.))(?:e|(.))(?:f|(.))))
2 bad chars, but found a close match: 'aacdxf'
2 bad chars, but found a close match: 'abcrxf'
Follow-up:
Below is a modified version that chops the time in half.
The fastest solution conforming to the original problem statement
is from John Krahn. Because you can't fight math when posing
1-dimensional problems like this.
However, I'm posting this modified version incase the problem set
expands to multi-character heuristics.
The modifications include staying away from the @- array altogether.
In my opinion, it is just a eval{} in disguise anyway, don't know
don't care. Position is set every time. The pos() function is free,
its virtually no overhead. Reading @-, the patter/sub-pattern matching
position array, on the otherhand, incurrs 5-10 times the overhead.
Its to be avoided at all costs on performance related regexp issues.
Staying away from the @- array but keeping the algorithym means doing
something thats not supposed to be done, but it works:
$i = 1; # is not a reference
$$i # $1
I didn't see any side affects except you have to manage the pos() each time.
See: FAQ 7.29: How can I use a variable as a variable name?
That being said Expanding notation to multi-character substrings
and to add some intelligence (possibly) to dynamically generated regexp's,
might be a help in the future.
For instance, you could expand the definition of matching items like so:
@terms = (
# string - required - generated rxterm
# ----------------------------------------
'Merry ' , 1, # (?:Merry )
'C', , 0, # (?:C|(.))
'h', , 0, # (?:h|(.))
'r', , 0, # (?:r|(.))
'i', , 0, # (?:i|(.))
's', , 0, # (?:s|(.))
't', , 0, # (?:t|(.))
'mas', , 0, # (?:mas|(.{3}))
);
But then your getting into heuristics.
Good luck!
sln
--------------------------------
use strict;
use warnings;
use Benchmark ':hireswallclock';
my ($t0,$t1,$tdif);
my $pattern = 'aacdxf';
my $misses = 2;
my @tmp = split '',$pattern;
my $pstr;
for (@tmp) {
$pstr .= "(?:$_|(.))";
}
$pstr = '('.$pstr.')';
my $rxpattern = qr/$pstr/i;
my $extent = @tmp+1;
my ($i,$cnt,$lastpos) = (0,0,0);
print @tmp,"\n",$rxpattern,"\n\n";
my $fname = 'c:\temp\5MEG_FILE.txt';
open my $fh, $fname or die "can't open $fname...";
$t0 = new Benchmark;
while (<$fh>)
{
chomp;
$lastpos = 0;
while ( /$rxpattern/g )
{
$cnt = 0;
for $i (2..$extent) {
last if (($cnt += defined( $$i)) > $misses);
}
if ($cnt <= $misses) {
print "$cnt bad chars, but found a close match: '$1'\n";
}
pos() = ++$lastpos;
}
}
$t1 = new Benchmark;
close $fh;
$tdif = timediff($t1, $t0);
print "the code took:",timestr($tdif),"\n";
__END__