How to find all the substrings of a long string that differ from a test string in at most n characters?

sln

Pilcrow said:
Suppose I have a long string $a, and a test string $b.

I want to find all the substrings in $a that have the same length as
$b and differ from it in at most n positions.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in Perl. Can I use a regular
expression to solve this problem?

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
    $cnt = 0;
    for (my $j = 0; $j <= $#b; $j++) {
        $cnt++ unless $a[$i+$j] eq $b[$j];
        next OUTER if $cnt > $limit;
    }
    my $sub = substr($a,$i,$lenb);
    # push @substrings, $sub; # alternate output
    print "match '$sub' at offset $i\n";
}

Or, more simply:

#!/usr/bin/perl
use strict;
use warnings;

my $a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my $b = 'aacdxf';

my $limit = 2;
my $lenb = length $b;

print "\nmatching '$a' against '$b'\n";

while ( $a =~ /(?=(.{$lenb}))/sog ) {
    next if ( $1 ^ $b ) =~ tr/\0//c > $limit;
    print "match '$1' at offset $-[0]\n";
}

__END__
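
For anyone puzzling over the `( $1 ^ $b ) =~ tr/\0//c` line, here is a minimal sketch of what it seems to be doing, assuming both operands are plain byte strings of equal length. The helper name `mismatches` is made up for illustration and does not appear in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the positions at which two
# equal-length byte strings differ.
sub mismatches {
    my ($x, $y) = @_;
    # String XOR works byte by byte: identical characters give "\0",
    # differing characters give some non-NUL byte.
    my $xor = $x ^ $y;
    # tr with /c and an empty replacement list only counts the
    # characters that are NOT "\0"; it does not modify $xor.
    return ($xor =~ tr/\0//c);
}

print mismatches('abcdef', 'aacdxf'), "\n";   # prints 2
print mismatches('aacdxf', 'aacdxf'), "\n";   # prints 0
```

One assumption worth stating: this relies on `^` acting as a string operator, which is what it does when both operands are strings and the `bitwise` feature is off. Under `use v5.28` or later that feature is on, and the string form is written `^.` instead.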

Very elegant and fast, but you don't need all the modifiers /sog on the
regex; just /g will do.

Actually, he needs 'so' as well. If this is used on multi-line input,
the '.' needs the 's' modifier wherever a '\n' might be encountered
and could be part of the match.
The 'o' tells perl not to recompile the pattern, which might otherwise
happen; I don't know, anything is possible. By including both: no harm, no foul.

sln
 
Peng Yu

It is a reference, concise and accurate (as much as possible), it is not
and never will be a tutorial.

Then you will have to look for a tutorial or dig down into those other
things. Very often it is not possible or not desirable to explain
something without making references to other concepts. And @- is
certainly not the right place to explain grouping in regular
expressions.
Thank you for the example. I see what it means. In general, I think the
documentation could be made easier to read with more examples.

But there are examples in that section:
<quote>
            After a match against some variable $var:

            "$`" is the same as "substr($var, 0, $-[0])"
            "$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
            "$'" is the same as "substr($var, $+[0])"
            "$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
            "$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"
            "$3" is the same as "substr($var, $-[3], $+[3] - $-[3])"
</quote>

I didn't understand the above examples because they refer to $-[0],
which was exactly what I wanted to understand.

Go ahead and write the five-hundred-and-ninety-second tutorial about Perl.
You will find that either you are repeating yourself all over the place
or you start using references all over the place, because Perl's concepts
are so intertwined that they cannot be explained stand-alone.

Not to mention the ensuing maintenance nightmare when you have to fix
something all over the place.

I agree with your points. But I think the perldoc should have as few
cyclic references as possible. If the example you mentioned in
the perldoc were not cyclic, it would be easier to read.

Thanks,
Peng
 
Sherm Pendley

Peng Yu said:
I didn't understand the above examples because they refer to $-[0],
which was exactly what I wanted to understand.

Arrays are referred to with @, but array *elements* with $. So, given
the array @-, its first element is $-[0].

sherm--
 
ilovelinux

Harm will be done using the 'o'nce option, when the code is
transferred to a subroutine and the regexp will be called with $b's of
different length, because the regexp will be compiled with the first
$lenb encountered and never change.

Better put the regexp into a variable and use this in the while ():

my $re = qr/(?=(.{$lenb}))/s;
while ( $a =~ /$re/g ) {
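
To make the point concrete, here is a rough sketch of the loop moved into a subroutine and called with test strings of different lengths. The name `fuzzy_find` and its argument order are invented for this example; it assumes plain byte strings, as in the script being discussed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper: return [offset, substring] pairs for every
# window of $text that differs from $probe in at most $limit places.
sub fuzzy_find {
    my ($text, $probe, $limit) = @_;
    my $len = length $probe;
    # qr// compiles the pattern on every call, so each call gets the
    # right window length. With /o the first length would stick.
    my $re = qr/(?=(.{$len}))/s;
    my @hits;
    while ( $text =~ /$re/g ) {
        next if ( $1 ^ $probe ) =~ tr/\0//c > $limit;
        push @hits, [ $-[0], $1 ];
    }
    return @hits;
}

# Two probes of different lengths against the same text.
for my $probe ('aacdxf', 'abd') {
    print "probe '$probe':\n";
    print "  match '$_->[1]' at offset $_->[0]\n"
        for fuzzy_find('abcdefaacdxf', $probe, 1);
}
```

Swapping the `qr//` line for an inline `/(?=(.{$len}))/sog` should reproduce the problem described above: the second probe would still be matched with the first probe's window length.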

HTH
 
A. Sinan Unur

I didn't understand the above examples because they refer to $-[0],
which was exactly what I wanted to understand.

Are you joking? How can one explain what @- is without explaining the
meaning of its elements?

You do know that $x[1] refers to the second element of @x, right?

I agree with your points. But I think the perldoc should have as few
cyclic references as possible. If the example you mentioned in
the perldoc were not cyclic, it would be easier to read.

I don't see such cyclic references in the documentation for @-.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 
Peter J. Holzer

Did you quote from a newer version than 5.10.0? It's slightly different
in my docs (only checked 5.8.8 and 5.10.0).

I didn't understand the above examples because they refer to $-[0],
which was exactly what I wanted to understand.

If it didn't contain $-[0], it wouldn't be an example for the use of
$-[0], would it?

I agree with your points. But I think the perldoc should have as few
cyclic references as possible. If the example you mentioned in
the perldoc were not cyclic, it would be easier to read.

I don't see any cyclic references in the section about @-. $-[0] is
explained in the very first sentence:

@- $-[0] is the offset of the start of the last successful match.

To understand the examples, it helps to know what $`, $&, etc. are.
These are explained elsewhere in the same document, and none of them
is explained by referring to @-. No cycles there.
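
For readers who want to see those offsets in action, a small sketch follows. The sample string and pattern are made up; only @-, @+ and substr come from the documentation quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $var = 'find the needle here';
$var =~ /n(ee)(dle)/ or die "no match";

# $-[0] and $+[0] are the start and end offsets of the whole match.
my ($start, $end) = ($-[0], $+[0]);
my $whole  = substr($var, $start, $end - $start);

# $-[1]/$+[1] and $-[2]/$+[2] do the same for the capture groups.
my $first  = substr($var, $-[1], $+[1] - $-[1]);   # same text as $1
my $second = substr($var, $-[2], $+[2] - $-[2]);   # same text as $2

print "whole match '$whole' spans offsets $start to $end\n";
print "group 1 '$first', group 2 '$second'\n";
```

Here the whole match is 'needle' at offsets 9 to 15, with 'ee' and 'dle' as the two groups.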

hp
 
A. Sinan Unur

Did you quote from a newer version than 5.10.0? It's slightly
different in my docs (only checked 5.8.8 and 5.10.0).

C:\Documents and Settings\asu1> perl -v

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Copyright 1987-2007, Larry Wall

Binary build 1004 [287188] provided by ActiveState
http://www.ActiveState.com
Built Sep 3 2008 13:16:37

Sinan

 
sln

I have never used the /o modifier. I tested it a long time ago and found
it squirrelly. I only use qr//.

Yes, you are right, the 'o' option lasts for the life of the script.
This defies all logic: where is this regexp stored? Is it tied to
variables, even lexical ones? It would have to be a runtime association
with the first 'named' $pattern variable, even for lexicals in nested scopes?
I guess it's possible.

Over the "life of the script" is all that's mentioned. I think this has
strictly limited use, basically a one-time usage.

Docs:
"If $pattern won't be changing over the lifetime of the script,
we can add the //o modifier, which directs perl to only perform
variable substitutions once"

Thanks!

sln

----------------

my $pattern = 'this';
print getmatch_o('this is a test',$pattern)."\n";
$pattern = 'is';
print getmatch_o('this is a test',$pattern)."\n";

sub getmatch_o
{
    my ($target,$pattern) = @_;
    print $pattern."\n";
    my $ttt = $pattern;
    my $ggg = $target;
    return $1 if ($ggg =~ /($ttt)/o);

    return '';
}
__END__

this
this
is
this
 
sln

[snip]
while ( $a =~ /(?=(.{$lenb}))/sog ) {
[snip]

I noticed the capture group is inside the lookahead assertion, and the
regexp position only advances by one on each match.

Example:

/(.{$lenb})/ advances 6
/(?=(.{$lenb}))/ advances 1

I looked in perlre and perlretut, and there is only one mention of
capturing inside a zero-width assertion.
That was just an aside in an example on (?>..), which merely mentioned
that the position is adjusted, by analogy with (?=(regex)).

Do you know why this is?


sln

Does anybody know?

sln
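
A possible explanation, offered as a sketch rather than the last word: a lookahead is zero-width, so the overall match has length zero and the search position stays where the match began, even though the capture inside it records the text that was inspected. With /g, perl will not return a second zero-length match at the same position, so the next attempt starts one character further on. perlre covers this under "Repeated Patterns Matching a Zero-length Substring". The sample string below is made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = 'abcdefgh';
my $n = 3;

# Consuming pattern: each match eats $n characters, so the next
# search resumes $n characters further on.
my @consuming;
while ( $s =~ /(.{$n})/g ) {
    push @consuming, $-[0] . ":$1";
}

# Lookahead pattern: the match itself is empty, so the next search
# resumes just one character after the previous start.
my @lookahead;
while ( $s =~ /(?=(.{$n}))/g ) {
    push @lookahead, $-[0] . ":$1";
}

print "consuming: @consuming\n";   # consuming: 0:abc 3:def
print "lookahead: @lookahead\n";   # lookahead: 0:abc 1:bcd 2:cde 3:def 4:efg 5:fgh
```

That one-character step is what lets the original script test every overlapping window instead of only every sixth offset.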
 
