How to find all substrings of a long string that differ from a test string in at most n characters?

sln · Nov 21, 2008

Pilcrow said:

Suppose I have a long string $a, and a test string $b.

I want to find all the substrings of $a that have the same length as
$b and at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my $a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my $b = "aacdxf";

my @a = split //,$a;
my @b = split //,$b;
my $limit = 2;
my $cnt;
my $lenb = length $b;
my @substrings = ();

print "\nmatching '$a' against '$b'\n";

OUTER:
for (my $i = 0; $i <= $#a-$lenb+1; $i++) {
    $cnt = 0;
    for (my $j = 0; $j <= $#b; $j++) {
        $cnt++ unless $a[$i+$j] eq $b[$j];
        next OUTER if $cnt > $limit;
    }
    my $sub = substr($a,$i,$lenb);
    # push @substrings, $sub; # alternate output
    print "match '$sub' at offset $i\n";
}


Or, more simply:

#!/usr/bin/perl
use strict;
use warnings;

my $a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my $b = 'aacdxf';

my $limit = 2;
my $lenb = length $b;

print "\nmatching '$a' against '$b'\n";

while ( $a =~ /(?=(.{$lenb}))/sog ) {
    next if ( $1 ^ $b ) =~ tr/\0//c > $limit;
    print "match '$1' at offset $-[0]\n";
}

__END__
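
For anyone puzzled by the ( $1 ^ $b ) =~ tr/\0//c line, here is a minimal sketch of just that step. The helper name mismatch_count is made up for illustration, and it assumes plain byte (ASCII) strings of equal length; string XOR on wide characters is a different story.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count the positions at which
# two equal-length byte strings differ, using the same XOR trick.
sub mismatch_count {
    my ($x, $y) = @_;
    die "strings must have equal length\n" unless length($x) == length($y);
    # String XOR yields "\0" at every position where the characters are equal.
    my $diff = $x ^ $y;
    # With an empty replacement list, tr/\0//c only counts the non-NUL
    # characters; it does not modify $diff.
    return ( $diff =~ tr/\0//c );
}

print mismatch_count('abcdef', 'aacdxf'), "\n";   # the pair from the question
print mismatch_count('aacdxf', 'aacdxf'), "\n";   # identical strings
```

Both operands have to be strings for ^ to work bytewise; if either one has been used as a number, Perl does a numeric XOR instead.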


Very elegant and fast, but you don't need all the modifiers /sog to the
regex, just /g will do.

Actually, he needs 's' and 'o' as well. If this is used on multi-line
input, the '.' needs the 's' modifier whenever a '\n' may be encountered
and could be part of the pattern.
The 'o' asks Perl not to recompile the pattern, which might otherwise
happen; I don't know, anything is possible. Including both: no harm, no foul.
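
The point about 's' can be checked with a small sketch like the one below. The input string and window length are made up for illustration; the snippet just counts how many overlapping windows the lookahead finds with and without /s.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "abc\ndef";   # made-up input containing a newline
my $len  = 3;

# In list context, /g returns every captured window.
# Without /s, '.' refuses to match "\n", so windows spanning it are skipped.
my @without_s = $text =~ /(?=(.{$len}))/g;
my @with_s    = $text =~ /(?=(.{$len}))/sg;

print scalar(@without_s), " windows without /s\n";
print scalar(@with_s),    " windows with /s\n";
```

So whether 's' matters depends entirely on whether the data can contain newlines.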

sln

Peng Yu · Nov 21, 2008

It is a reference, concise and accurate (as much as possible), it is not
and never will be a tutorial.

Then you will have to look for a tutorial or dig down into those other
things. Very often it is not possible or not desirable to explain
something without making references to other concepts. And @- is
certainly not the right place to explain grouping in regular
expressions.

Thank you for the example. I see what it means. In general, I think the
documentation could be made easier to read with more examples.


But there are examples in that section:
<quote>
After a match against some variable $var:

"$`" is the same as "substr($var, 0, $-[0])"
"$&" is the same as "substr($var, $-[0], $+[0] - $-[0])"
"$'" is the same as "substr($var, $+[0])"
"$1" is the same as "substr($var, $-[1], $+[1] - $-[1])"
"$2" is the same as "substr($var, $-[2], $+[2] - $-[2])"
"$3" is the same as "substr $var, $-[3], $+[3] - $-[3])"
</quote>

I didn't understand the above examples because they refer to $-[0],
which was exactly what I wanted to understand.

Go ahead and write the five-hundred-and-ninety-second tutorial about Perl.
You will find that either you are repeating yourself all over the place
or you start using references all over the place, because Perl's concepts
are so intertwined that they cannot be explained stand-alone.

Not to mention the ensuing maintenance nightmare when you have to fix
something all over the place.

I agree with your points. But I think that the perldoc should have as few
cyclic references as possible. If the example that you mentioned in
the perldoc were not circular, it would be easier to read.

Thanks,
Peng

Sherm Pendley · Nov 21, 2008

Peng Yu said:
I didn't understand the above examples because they refer to $-[0],
which was exactly what I wanted to understand.

Arrays are referred to with @, but array *elements* with $. So, given
the array @-, its first element is $-[0].
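
A minimal sketch of what those elements hold after a match, in case it helps. The sample string and pattern are made up; the substr() calls mirror the equivalences quoted from the docs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $var = 'xxabcdefyy';   # made-up sample string
my ($start, $end, $g1_start, $g1_end);

if ( $var =~ /b(cd)e/ ) {
    # $-[0] / $+[0]: start and end offsets of the whole match.
    # $-[1] / $+[1]: start and end offsets of the first capture group.
    ($start,    $end)    = ($-[0], $+[0]);
    ($g1_start, $g1_end) = ($-[1], $+[1]);

    print "whole match: offsets $start to $end, text '",
          substr($var, $start, $end - $start), "'\n";
    print "group 1:     offsets $g1_start to $g1_end, text '",
          substr($var, $g1_start, $g1_end - $g1_start), "'\n";
}
```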

sherm--

ilovelinux · Nov 21, 2008

Actually, he needs 's' and 'o' as well. If this is used on multi-line
input, the '.' needs the 's' modifier whenever a '\n' may be encountered
and could be part of the pattern.
The 'o' asks Perl not to recompile the pattern, which might otherwise
happen; I don't know, anything is possible. Including both: no harm, no foul.

Harm will be done using the 'o'nce option, when the code is
transferred to a subroutine and the regexp will be called with $b's of
different length, because the regexp will be compiled with the first
$lenb encountered and never change.

Better put the regexp into a variable and use this in the while ():

my $re = qr/(?=(.{$lenb}))/s;
while ( $a =~ /$re/g ) {
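
Fleshed out into a subroutine, the qr// approach might look roughly like this. It is only a sketch: the name near_matches, its argument order, and the sample strings are assumptions for illustration, and it assumes plain byte strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper (name and interface are not from the thread).
# Returns a list of [offset, substring] pairs.
sub near_matches {
    my ($text, $probe, $limit) = @_;
    my $len = length $probe;
    # Compiled on every call, so a probe of different length is honoured.
    my $re = qr/(?=(.{$len}))/s;
    my @hits;
    while ( $text =~ /$re/g ) {
        # Skip windows that differ from the probe in more than $limit places.
        next if ( ( $1 ^ $probe ) =~ tr/\0//c ) > $limit;
        push @hits, [ $-[0], $1 ];
    }
    return @hits;
}

# Two probes of different length against the same made-up text.
for my $probe ( 'aacdxf', 'abc' ) {
    for my $hit ( near_matches( 'abcdefaacdxf', $probe, 1 ) ) {
        print "probe '$probe': match '$hit->[1]' at offset $hit->[0]\n";
    }
}
```

With /o in place of qr//, the second probe would still be matched against windows of the first probe's length.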

HTH

A. Sinan Unur · Nov 21, 2008

I didn't understand the above examples because they refer to $-[0],
which was exactly what I wanted to understand.

Are you joking? How can one explain what @- is without explaining the
meaning of its elements?

You do know that $x[1] refers to the second element of @x, right?

I agree with your points. But I think that the perldoc should have as few
cyclic references as possible. If the example that you mentioned in
the perldoc were not circular, it would be easier to read.

I don't see such cyclic references in the documentation for @-.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

Peter J. Holzer · Nov 23, 2008

Did you quote from a newer version than 5.10.0? It's slightly different
in my docs (only checked 5.8.8 and 5.10.0).

I didn't understand the above examples because they refer to $-[0],
which was exactly what I wanted to understand.

If it didn't contain $-[0], it wouldn't be an example for the use of
$-[0], would it?

I agree with your points. But I think that the perldoc should have as few
cyclic references as possible. If the example that you mentioned in
the perldoc were not circular, it would be easier to read.

I don't see any cyclic references in the section about @-. $-[0] is
explained in the very first sentence:

@- $-[0] is the offset of the start of the last successful match.

To understand the examples, it helps to know what $`, $&, etc. are.
These are explained elsewhere in the same document, and none of them
is explained by referring to @-. No cycles there.

hp

A. Sinan Unur · Nov 23, 2008

Did you quote from a newer version than 5.10.0? It's slightly
different in my docs (only checked 5.8.8 and 5.10.0).

C:\Documents and Settings\asu1> perl -v

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Copyright 1987-2007, Larry Wall

Binary build 1004 [287188] provided by ActiveState
http://www.ActiveState.com
Built Sep 3 2008 13:16:37

Sinan


sln · Nov 23, 2008

Harm will be done using the 'o'nce option, when the code is
transferred to a subroutine and the regexp will be called with $b's of
different length, because the regexp will be compiled with the first
$lenb encountered and never change.

Better put the regexp into a variable and use this in the while ():

my $re = qr/(?=(.{$lenb}))/s;
while ( $a =~ /$re/g ) {

HTH

I never used the /o modifier. I tested it a long time ago and found it
squirrelly. I only use qr//.

Yes, you are right, the 'o' option lasts the life of the script.
This defies all logic: where is this regexp stored? Is it related to
variables, even lexical ones? This has to be a runtime association with
the first 'named' $pattern variable, even for lexicals in nested scopes?
I guess it's possible.

Over the "life of the script" is all that's mentioned. I think this has
strictly limited use, basically one-time usage.

Docs:
"If $pattern won't be changing over the lifetime of the script,
we can add the //o modifier, which directs perl to only perform
variable substitutions once"

Thanks!

sln

----------------

my $pattern = 'this';
print getmatch_o('this is a test',$pattern)."\n";
$pattern = 'is';
print getmatch_o('this is a test',$pattern)."\n";

sub getmatch_o
{
    my ($target,$pattern) = @_;
    print $pattern."\n";
    my $ttt = $pattern;
    my $ggg = $target;
    return $1 if ($ggg =~ /($ttt)/o);

    return '';
}
__END__

this
this
is
this

sln · Nov 23, 2008

[snip]

while ( $a =~ /(?=(.{$lenb}))/sog ) {

[snip]

I noticed the capture group is in the assertion and it only advances the
regexp ordinal position by one.

Example:

/(.{$lenb})/ advances 6
/(?=(.{$lenb}))/ advances 1

I looked in perlre and perlretut, and there is only one mention of a capture
within a zero-width assertion among the extended expression elements.
That was just an aside in some example on (?>..), which only mentioned
that the ordinal position is adjusted, by analogy with (?=(regex)).

Do you know why this is?

sln

Does anybody know?

sln
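
For what it's worth, the behaviour asked about above can be reproduced with a short sketch. The input string is made up. As far as I understand it, the lookahead itself matches zero characters, so pos() is left at the start of the window; perlre ("Repeated Patterns Matching a Zero-length Substring") describes how /g refuses a second zero-length match at the same position, which is why the next match turns up one character further along.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'abcdefgh';   # made-up input
my $len  = 3;

# Consuming match: each success moves pos() past the matched text.
my @consuming;
while ( $text =~ /(.{$len})/g ) {
    push @consuming, $-[0];
}

# Lookahead match: nothing is consumed, so successive matches start
# only one character apart and the windows overlap.
my @lookahead;
while ( $text =~ /(?=(.{$len}))/g ) {
    push @lookahead, $-[0];
}

print "consuming offsets: @consuming\n";
print "lookahead offsets: @lookahead\n";
```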
