# another help

Discussion in 'Perl Misc' started by giampiero, Sep 25, 2005.

1. ### giampiero (Guest)

i find three substring of length 2 (also repeated) followed after a
while to a reverse sequences (also repeated)

i use:
\$a=~s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/\$1 \$2 \$3/o;

how to be sure in regular expression that length \$1+\$2+\$3 must be more
l?
thanx a lot from deep of my soul

giampiero, Sep 25, 2005

2. ### Dr.Ruud (Guest)

giampiero schreef:

> i find three substring of length 2 (also repeated) followed after a
> while to a reverse sequences (also repeated)

google is no excuse not to do that.

> i use:
> \$a =~ s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/\$1 \$2 \$3/o;

The {2,} means two or more, is that what you want?
The {1,} means 1 or more, so is the same as '+'.
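
A small sketch of the difference between these quantifiers, for anyone who wants to see it rather than take it on faith. The helper name `first_capture` and the sample string are made up for illustration only.

```perl
use strict;
use warnings;

# Return the text captured by group 1, or undef if the pattern does not match.
sub first_capture {
    my ($text, $re) = @_;
    return $text =~ $re ? $1 : undef;
}

my $text = 'abcdef';
print first_capture($text, qr/^(.{2,})/), "\n";   # two or more, greedy: abcdef
print first_capture($text, qr/^(.{2})/),  "\n";   # exactly two: ab
print first_capture($text, qr/^(.{1,})/), "\n";   # {1,} ...
print first_capture($text, qr/^(.+)/),    "\n";   # ... captures the same as +
```

So with `{2,}` a single group is free to swallow far more than two characters, which is probably not what was intended if the basic unit is a pair.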

If you meant exactly 2:

\$a =~ s/(..)+(..)+(..)+.*(\3)+(\2)+(\1)+/\1 \2 \3/o;

(untested)

> how to be sure in regular expression that length \$1+\$2+\$3 must be
> more l?

That will always be 3 * 2 = 6.

--
Affijn, Ruud

"Gewoon is een tijger."

Dr.Ruud, Sep 25, 2005

3. ### Matt Garrish (Guest)

"Dr.Ruud" <> wrote in message
news:...
> giampiero schreef:
>
>> i find three substring of length 2 (also repeated) followed after a
>> while to a reverse sequences (also repeated)

>
> google is no excuse not to do that.
>
>
>> i use:
>> \$a =~ s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/\$1 \$2 \$3/o;

>
> The {2,} means two or more, is that what you want?
> The {1,} means 1 or more, so is the same as '+'.
>
> If you meant exactly 2:
>
> \$a =~ s/(..)+(..)+(..)+.*(\3)+(\2)+(\1)+/\1 \2 \3/o;
>
> (untested)
>

Capturing like that just isn't going to work. Something like the following
is probably what you wanted:

\$a = 'AAAABBBBCCCCsometexthereCCCCBBBBAAAA';
\$a =~ s/(..)\1*(..)\2*(..)\3*.*?\3+\2+\1+/\$1 \$2 \$3/;
print \$a;

Matt

Matt Garrish, Sep 25, 2005
4. ### Bob Walton (Guest)

giampiero wrote:

> i find three substring of length 2 (also repeated) followed after a
> while to a reverse sequences (also repeated)
>
>
> i use:
> \$a=~s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/\$1 \$2 \$3/o;

It seems doubtful that the above regex is actually what you want.
That's because the first (.{2,})+ will match any two or more
characters and assign them to \$1, then any next two or more
characters and assign *them* to \$1, etc. So portions of the
string which were matched (other than by the .*) will not be
present in \$1 \$2 or \$3. If you want what I think you said, you
need to place the parenthetical groupings so they pick up the
entire repeated group, like:

\$a=~s/((?:.{2,})+)
((?:.{2,})+)
((?:.{2,})+)
.*
\3{1,}\2{1,}\1{1,}
/\$1 \$2 \$3/xo;
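
To make the point about where the parentheses sit concrete, here is a minimal sketch on a made-up string. It only illustrates the capture behaviour, not the full regex above.

```perl
use strict;
use warnings;

my $text = 'aabbcc';

# Quantifier outside the capture group: each repetition overwrites $1,
# so only the last pair survives.
my ($last_only) = $text =~ /^(..)+/;

# Quantifier inside an outer capture group: $1 holds the whole repeated run.
my ($whole_run) = $text =~ /^((?:..)+)/;

print "last repetition only: $last_only\n";   # cc
print "whole repeated run:   $whole_run\n";   # aabbcc
```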

Note that this regex is particularly inefficient, with huge
amounts of backtracking, so give it a while to execute if the
string has any complication at all. This could be improved
immensely by removing the redundant repeats with no change to
what is matched except for the improvement in efficiency. Example:

use warnings;
use strict;
my \$a='qabczycdefxxxxxxxxxefcdabczynn';
my \$b=\$a;
if( #original regexp
\$a=~s/(.{2,})+(.{2,})+(.{2,})+.*\3{1,}\2{1,}\1{1,}/\$1 \$2 \$3/o
){print "\\$a matched.\n";
print "\\$1=\$1\n";
print "\\$2=\$2\n";
print "\\$3=\$3\n";
}
print "\\$a is now \$a\n";

if( #suggested regexp
\$b=~s/(.{2,})
(.{2,})
(.{2,})
.*
\3+\2+\1+
/\$1 \$2 \$3/xo
){print "\\$b matched.\n";
print "\\$1=\$1\n";
print "\\$2=\$2\n";
print "\\$3=\$3\n";
}
print "\\$b is now \$b\n";

When run:

D:\junk>perl junk544.pl
\$a matched.
\$1=ef
\$2=xx
\$3=xx
\$a is now ef xx xxcdabczynn
\$b matched.
\$1=abczy
\$2=cd
\$3=ef
\$b is now qabczy cd efnn

D:\junk>

>
> how to be sure in regular expression that length \$1+\$2+\$3 must be more
> l?

Well, length \$1+\$2+\$3 parses as length(\$1+\$2+\$3), so it will
always be 1 unless the strings are numeric. Assuming you actually mean
length(\$1)+length(\$2)+length(\$3), each of \$1 \$2 and \$3 must have
matched at least two characters, so if the match succeeded then
length(\$1)+length(\$2)+length(\$3)>=6. Perhaps you should check to
see if the match succeeded, as per the example above. Don't ever
use \$1 etc unless you know the match succeeded.
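
A short sketch of both pitfalls, with made-up strings. The sub name `total_capture_length` is hypothetical, and the pattern in it is a simplified stand-in rather than the full regex from the question.

```perl
use strict;
use warnings;

# A failed match leaves $1 from the last successful match untouched.
my $ok_first  = 'abc' =~ /(b)/;   # succeeds, $1 is 'b'
my $ok_second = 'xyz' =~ /(q)/;   # fails, $1 is still 'b'
my $stale = $1;
print "after the failed match, \$1 is still '$stale'\n";

# length is a named unary operator and binds looser than +, so this is
# length($x + $y + $z): the strings are added numerically first.
my ($x, $y, $z) = ('ab', 'cd', 'ef');
my $odd;
{
    no warnings 'numeric';        # 'ab' etc. are not numbers
    $odd = length $x+$y+$z;       # length(0), which is 1
}
print "length \$x+\$y+\$z gives $odd\n";

# Guarded version: only read the captures when the match succeeded.
sub total_capture_length {
    my ($text) = @_;
    if ($text =~ /(.{2,})(.{2,})(.{2,})/) {
        return length($1) + length($2) + length($3);
    }
    return;    # no match: return nothing rather than stale captures
}

print total_capture_length('abcdef'), "\n";   # 6
```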
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

Bob Walton, Sep 26, 2005
5. ### Guest

and if
\$a=~s/((?:.{0,})+)
((?:.{0,})+)
((?:.{0,})+)
.*
\3{1,}\2{1,}\1{1,}
/\$1 \$2 \$3/xo;

and the total of length of \$1+\$2+\$3>=12?

thanx again

, Sep 29, 2005
6. ### Dr.Ruud (Guest)

schreef:
> and if
> \$a=~s/((?:.{0,})+)
> ((?:.{0,})+)
> ((?:.{0,})+)
> .*
> \3{1,}\2{1,}\1{1,}
> /\$1 \$2 \$3/xo;
>
> and the total of length of \$1+\$2+\$3>=12?
>
> thanx again

{0,} is the same as *
{1,} is the same as +

Something like ((.*)+) hurts (the mind too). 1 or more of something that
can be empty, is not what was meant to be.

The usage of (?:, to cleanly use groups, looks OK.

I remember that your data had a basic grouplength of 2, like
'1212123456xxxxxxxx56343412'
Is that still true? If so, try:

\$a=~s/((?:..)+)
((?:..)+)
((?:..)+)
.*
\3+\2+\1+
/\$1 \$2 \$3/xo;

(untested)

--
Affijn, Ruud

"Gewoon is een tijger."

Dr.Ruud, Sep 29, 2005
7. ### Bob Walton (Guest)

wrote:
> and if
> \$a=~s/((?:.{0,})+)
> ((?:.{0,})+)
> ((?:.{0,})+)
> .*
> \3{1,}\2{1,}\1{1,}
> /\$1 \$2 \$3/xo;
>

Please note carefully that (?:.{0,})+ is exactly the same as .*,
with the exception that (?:.{0,})+ is grossly inefficient due to
the amount of backtracking it generates, particularly when
multiples of them appear in the same regexp. Also, note that
this regexp could match the null string. So you could
equivalently and much more efficiently write:

\$a=~s/(.*)(.*)(.*).*\3+\2+\1+/\$1 \$2 \$3/;

> and the total of length of \$1+\$2+\$3>=12?

I interpret this to mean that a successful match is intended to
occur only if the sum of the lengths of the three strings is
twelve or more characters total. If so:

use warnings;
use strict;
my \$a='qabczycfffdefxxxxxxxxxefcfffdabczynn';
if(
\$a=~s/(.*)
(.*)
(.*)
.*
\3+\2+\1+
#Note: '`' x 100 is intended to refer to a sequence
#of characters which will never occur in the matched
#string
(??{length(\$1)+length(\$2)+length(\$3)>=12?
'':'`' x 100})
/\$1 \$2 \$3/xo
){print "\\$a matched.\n";
print "\\$1=\$1\n";
print "\\$2=\$2\n";
print "\\$3=\$3\n";
}
print "\\$a is now>\$a<\n";

When run, this prints:

d:\junk>perl junk545.pl
\$a matched.
\$1=abczy
\$2=cfffd
\$3=ef
\$a is now>qabczy cfffd efnn<

d:\junk>

If the two sequences of fff in \$a are replaced with ff, the match
will fail because the sum of the string lengths is less than 12.

It can be instructive to add a print "\$1:\$2:\$3\n"; before the
conditional statement in the (??{}). That prints the progress of
the match as it proceeds.
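
If the (??{}) construct feels too exotic, a simpler alternative is to test the lengths after the match. This is only a sketch under some assumptions: the sub name `mirrored_groups` is hypothetical, each group is required to be at least two characters long, and, unlike the (??{}) version, it only accepts or rejects the first match the engine finds. It does not make the engine backtrack to look for another match that would satisfy the length requirement.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the three captures if the first match found
# has a combined capture length of at least $min_total, otherwise nothing.
sub mirrored_groups {
    my ($text, $min_total) = @_;
    return unless $text =~ /(.{2,})(.{2,})(.{2,}).*\3+\2+\1+/;
    my @groups = ($1, $2, $3);
    my $total = 0;
    $total += length for @groups;
    return $total >= $min_total ? @groups : ();
}

my @found = mirrored_groups('qabczycfffdefxxxxxxxxxefcfffdabczynn', 12);
print "@found\n" if @found;   # abczy cfffd ef
```

With ff in place of fff the first match found has a total length of 11, so the helper returns an empty list.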

....
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

Bob Walton, Sep 30, 2005
8. ### giampiero (Guest)

>Please note carefully that (?:.{0,})+ is exactly the same as .*,

???????????
(?:.{0,})+ equal (.*)+

giampiero, Oct 7, 2005
9. ### Matt Garrish (Guest)

"giampiero" <> wrote in message
news:...
> >Please note carefully that (?:.{0,})+ is exactly the same as .*,

>
> ???????????
> (?:.{0,})+ equal (.*)+
>

You seem to be misunderstanding the fundamental concept of a greedy operator.
On its own, /.*/ will match nothing and everything. Consequently, writing
/(.*)+/ is a useless redundancy as it will always and only ever match once,
so the additional modifier isn't doing anything (.*? and .*+ being
completely other beasts).

Moreover, /.*/ is equivalent to /.{0,}/ as the * modifier means 0 or more
occurrences. There is a difference between writing /(?:.{0,})/ and /(.*)/ and
that is that the first will not result in any value being assigned to \$1. If
you look closely at what was written above, it is only stated that the two
are the same without a grouping on .*.

Matt

Matt Garrish, Oct 7, 2005
10. ### Bob Walton (Guest)

giampiero wrote:

>>Please note carefully that (?:.{0,})+ is exactly the same as .*,

>
>
> ???????????

Yes, the above is correct. Both will match any string of
characters (with a caveat around a newline depending on whether
the //s switch is active at the time the regexp is encountered --
but that behavior will be the same between the two). As to why
(?:.{0,})+ is the same as .* : {0,} is a longhand way of writing
*, so .{0,} is the same as .* . (?:.{0,}) is then also the same
as .* . Now, (?:.{0,}) will match any character string (see
caveat above), hence (?:.{0,})+ will also, with the + interpreted
as "once". Depending on the character string, it might also
match, say, half of the string followed by the other half, or a
quarter followed by the other three-fourths, etc etc. Note that
there are a whole bunch of ways (?:.{0,})+ can match a character
string -- but also note that the resulting match does in fact
match the entire character string, just as .* would have.

> (?:.{0,})+ equal (.*)+

This is incorrect. (.*)+ contains grouping parentheses which
will cause the last string matched by .* to be returned in \$1 and
other side reactions to occur in the various other
regexp-grouping-related variables. (?:.{0,})+ does not contain
any grouping parentheses pairs. Hence these two, while they will
match the same strings (namely, all of them, subject to my caveat
above), are not the same because they do not cause the same
ultimate actions.
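
A rough sketch of that distinction: both patterns match the same text, but only the capturing one sets `$1`. The helper name `probe` and the sample string are made up, and the `no warnings 'regexp'` line is there on the assumption that your perl warns about a quantified group that can match the empty string.

```perl
use strict;
use warnings;
no warnings 'regexp';   # a quantified group that can be empty may draw a warning

# Run one match inside its own sub so captures from one pattern cannot
# leak into the report for the other.
sub probe {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return { matched => $&, has_group1 => defined $1 ? 1 : 0 };
}

my $capturing     = probe('hello world', qr/^(.*)+/);
my $non_capturing = probe('hello world', qr/^(?:.{0,})+/);

print "capturing:     matched '$capturing->{matched}', \$1 set: $capturing->{has_group1}\n";
print "non-capturing: matched '$non_capturing->{matched}', \$1 set: $non_capturing->{has_group1}\n";
```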

You seem to be totally missing the idea of why one *never* wants
to do something like (?:.*)+ . It is not just that it takes more
time to type and to think about; it is that such an expression
causes an extreme amount of backtracking when something
subsequent to it fails to match in a regexp. That translates
into computer time -- potentially *years* of it -- spent doing
absolutely nothing worthwhile. Here is an example program that
shows the backtracking I'm talking about as the execution of the
regexps proceeds:

use warnings;
use strict;
my \$s='aaaaaaaaaaaaaaaaaaaaaaaaa';
print "Matching re1:\n";
\$s=~/(.*)(??{print "\$1\n";''})\1/;
<>;
print "Matching re2:\n";
\$s=~/((?:.*)+)(??{print "\$1\n";''})\1/;

The result of running this should be most instructive as to why
one should avoid unneeded backtracking in regexps. Note that the
same result is achieved with both "re1" and "re2" above, but at
substantially higher computational cost in the case of "re2".
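
For those who would rather see a number than a screen full of output, here is a sketch that counts how often the engine reaches the point just before the backreference. The counter `$calls`, the sub `count_attempts` and the shorter 12-character string are my own choices, kept small so the nested version finishes quickly. Exact counts may differ between perl versions, so none are quoted here.

```perl
use strict;
use warnings;
no warnings 'regexp';   # the nested quantifier may draw a warning

our $calls;   # package variable so the embedded code block can update it

# Count how many times the (?{ }) block is reached while matching.
sub count_attempts {
    my ($text, $re) = @_;
    $calls = 0;
    my $matched = $text =~ $re;
    return $calls;
}

my $text   = 'a' x 12;
my $plain  = count_attempts($text, qr/(.*)(?{ $calls++ })\1/);
my $nested = count_attempts($text, qr/((?:.*)+)(?{ $calls++ })\1/);

print "plain  (.*):       $plain attempts\n";
print "nested ((?:.*)+):  $nested attempts\n";
```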

--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

Bob Walton, Oct 9, 2005
11. ### giampiero (Guest)

my intention was to match two substrings at the left and at right of .*
that can be repeated different times . example
...abcabc.....(.*)...abcabcabc.....

this can be done by (.*) and \1 ????
thanx again.

giampiero, Oct 14, 2005
12. ### Bob Walton (Guest)

giampiero wrote:

> my intention was to match two substrings at the left and at right of .*
> that can be repeated different times . example
> ..abcabc.....(.*)...abcabcabc.....
>
> this can be done by (.*) and \1 ????

....

\$1 matching . and .* matching all of the string except for the
leading and trailing .'s. Is that what you intend? If one
replaces the .'s with random non-repeating characters, as in:

xyabcabczjtwvu(.*)mqzabcabcabcsukp

then a match will occur with \$1 matching abcabc, and .* matching
zjtwvu(.*)mqzabc . That match still probably isn't what you
intend -- you would apparently like to see \$1 match abc . The
problem is that while that would match, it isn't the first match
encountered by the regexp engine. On the off chance that that is OK:

use warnings;
use strict;
#my \$string='..abcabc.....(.*)...abcabcabc.....';
my \$string= 'xyabcabczjtwv(.*)mqzabcabcabcsukpr';
if(\$string=~s/(.+)\1*.*\1+//){
print "Matched, \\$1=\$1, left: \$string\n";
}

Note that this is probably not what you really want, since
matches you probably aren't interested in will occur. In this
one, \$1 matches abcabc, the .* matches zjtwv(.*)mqzabc and \1+
matches abcabc. I think you want \$1 to match abc. Note that \$1
matching abcabc meets your stated criterion: a string that can
be repeated, followed by any characters, followed by one or more
repetitions of the first string. The abcabc match is the one the
regexp engine will encounter first (unless non-greediness is used).
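
On the non-greediness remark, a quick sketch on the same test string shows that making the first group lazy does not give abc either. It simply picks the shortest candidate that works.

```perl
use strict;
use warnings;

my $text = 'xyabcabczjtwv(.*)mqzabcabcabcsukpr';

my ($greedy) = $text =~ /(.+)\1*.*\1+/;    # longest candidate is tried first
my ($lazy)   = $text =~ /(.+?)\1*.*\1+/;   # shortest candidate is tried first

print "greedy: $greedy\n";   # abcabc
print "lazy:   $lazy\n";     # a
```

Neither extreme is abc, which is why the requirement has to be stated more tightly.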

For an example you most likely don't want: if the string contains
an additional x (or y) anywhere in the "random junk" near the end
of the string, like:

my \$string= 'xyabcabczjtwv(.*)mqzabcabcabcsxkpr';

then \$1 will match the first x (or y), the .* will match
everything up to the second x (or y), \1+ will match the second x
(or y), and the match will succeed. That match meets your stated
criterion (a substring that can be repeated occurring on both
sides of any string), but probably isn't what you want.

It may help a lot if you can make a clearer statement of what you
really want to match.

--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

Bob Walton, Oct 15, 2005
13. ### giampiero (Guest)

as you argue abcabc match abcabc(abc)

But what i need for others elaborations in abc as patter repeated two
and three times

giampiero, Oct 16, 2005
14. ### Bob Walton (Guest)

giampiero wrote:
> as you argue abcabc match abcabc(abc)
>
> But what i need for others elaborations in abc as patter repeated two
> and three times
>

Unquoted context from previous notes:

[[[[[
giampiero wrote:

> my intention was to match two substrings at the left and at right of .*
> that can be repeated different times . example
> ..abcabc.....(.*)...abcabcabc.....
>
> this can be done by (.*) and \1 ????

....

\$1 matching . and .* matching all of the string except for the
leading and trailing .'s. Is that what you intend? If one
replaces the .'s with random non-repeating characters, as in:

xyabcabczjtwvu(.*)mqzabcabcabcsukp

then a match will occur with \$1 matching abcabc, and .* matching
zjtwvu(.*)mqzabc . That match still probably isn't what you
intend -- you would apparently like to see \$1 match abc . The
problem is that while that would match, it isn't the first match
encountered by the regexp engine. On the off chance that that is OK:

use warnings;
use strict;
#my \$string='..abcabc.....(.*)...abcabcabc.....';
my \$string= 'xyabcabczjtwv(.*)mqzabcabcabcsukpr';
if(\$string=~s/(.+)\1*.*\1+//){
print "Matched, \\$1=\$1, left: \$string\n";
}

]]]]]

Well, there are a couple of ways of getting that match, all
involving further restrictions of your requirements. If you make
the original string match (the (.+) ) so it only matches strings
three characters long, that is, (.{3,3}), that works.
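
A sketch of the fixed-length variant, assuming the repeated unit really is always three characters. `{3}` is just a shorter spelling of `{3,3}`, and the variable `$captured` is only there to keep the capture around after the substitution.

```perl
use strict;
use warnings;

my $string = 'xyabcabczjtwv(.*)mqzabcabcabcsukpr';
my $captured;

# The first group is pinned to exactly three characters, so it cannot
# grow to abcabc the way (.+) does.
if ($string =~ s/(.{3})\1*.*\1+//) {
    $captured = $1;
    print "Matched, \$1=$captured, left: $string\n";
}
```

This prints `Matched, $1=abc, left: xysukpr`.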

Or if you make it so the part of the string before the .* is
required to repeat at least once and the part of the string after
the .* is required to also repeat at least once, that will also
result in \$1 matching abc . Example:

use warnings;
use strict;
my \$string='xyabcabczjtwv(.*)mqzabcabcabcsykpr';
if(\$string=~s/(.+)\1+.*\1{2,}//){
print "Matched, \\$1=\$1, left: \$string\n";
}

But with your original statement of the desired regexp (a first
string, possibly repeated, followed by any string, followed by
the first string possibly repeated), other matches such as abcabc
will be found first.
--
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl

Bob Walton, Oct 16, 2005