difficult substitution patterns

Peter · Feb 1, 2004

I am a relative newbie to Perl. I am reading Programming Perl to learn
Perl. In the chapter on pattern matching I came across the following
substitutions that I can't understand completely. It would be great if
someone could explain these.

Thanks in advance

a)
#put commas in the right place in an integer

1 while s/(\d) (\d\d\d) (?!\d)/$1,$2;
# what does this mean (?!\d) and what purpose does it serve

b)
#remove (nested (even deeply nested (like this))) remarks

1 while s/\([^()]*\)//g;
# why escape the first ( and second ), what about the ( or ) in
between

Tad McClellan · Feb 1, 2004

Peter said:
It would be great if
someone could explain these.

a)
#put commas in the right place in an integer

1 while s/(\d) (\d\d\d) (?!\d)/$1,$2;

That can't be the right code. It does not compile...

# what does this mean (?!\d)

Did you look it up in the std docs yet?

perldoc perlre

...
A zero-width negative look-ahead assertion.
...

and what purpose does it serve

To ensure that the 3 digit chars that are matched are the
last (rightmost) 3 chars in a run of digits.
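
To see what the look-ahead buys you, here is a small sketch that runs the loop with and without it. It assumes the intended code is the form with no spaces in the pattern and a closing slash; the two sub names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; neither name comes from the book.
sub commify_with_lookahead {
    my ($n) = @_;
    # Only match when the three digits are NOT followed by another digit.
    1 while $n =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;
    return $n;
}

sub commify_without_lookahead {
    my ($n) = @_;
    # Same loop, but nothing stops the match from starting at the far left.
    1 while $n =~ s/(\d)(\d\d\d)/$1,$2/;
    return $n;
}

print commify_with_lookahead('12345678'), "\n";     # 12,345,678
print commify_without_lookahead('12345678'), "\n";  # 1,2,3,4,5,678
```

Without the assertion the leftmost four digits always match first, so the commas creep in from the wrong end.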

b)
#remove (nested (even deeply nested (like this))) remarks

1 while s/\([^()]*\)//g;
# why escape the first ( and second ),

Because parentheses are regex metacharacters.

You must backslash them to match literal parenthesis characters.

what about the ( or ) in
between

Parentheses are not metacharacters in a character class,
so they need no escaping there.

There are only 4 metacharacters in character classes:

] # ends the class, unless it is first

^ # negates the class if it is first

- # forms a range, unless it is first or last

\ # for escaping the other metachars
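
A quick way to convince yourself of the rules above is to count matches. This is only a sketch; the test strings and variable names are invented for the demonstration.

```perl
use strict;
use warnings;

my $text = 'a(b)c';

# Inside [...] the parentheses are ordinary characters: no backslash needed.
my $in_class = () = $text =~ /[()]/g;

# Outside a class they must be escaped to match literally.
my $escaped = () = $text =~ /\(|\)/g;

# ']' placed first and '-' placed last are taken literally as well.
my $bracket_first = ']' =~ /[]]/ ? 1 : 0;
my $dash_last     = () = 'a-b' =~ /[ab-]/g;

print "in class: $in_class, escaped: $escaped\n";
print "']' first: $bracket_first, '-' last: $dash_last\n";
```

Both parenthesis counts come out as 2, and the last line reports 1 and 3.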

John W. Kennedy · Feb 1, 2004

Peter said:
I am a relative newbie to Perl. I am reading Programming Perl to learn
Perl. In the chapter on pattern matching I came across the following
substitutions that I can't understand completely. It would be great if
someone could explain these.

Thanks in advance

a)
#put commas in the right place in an integer

1 while s/(\d) (\d\d\d) (?!\d)/$1,$2;
# what does this mean (?!\d) and what purpose does it serve

The correct form of the line is:
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

The (?!\d) is what is known as a zero-width negative look-ahead
assertion. It means that after the (\d) and the (\d\d\d) there is _not_
another \d. "Zero-width" means that whatever the assertion looks at
doesn't count as part of the match; it's just checked.

Let's say that we are processing 12345678.

We try the match. The first thing that works is the '5' (which matches
'(\d)'), the '678' (which matches '(\d\d\d)') and the end, which is not
a \d.

That changes $_ to '12345,678'. Because the s/.../.../ worked, we
repeat the while. This time, the first thing that works is the '2'
(which matches '(\d)'), the '345' (which matches '(\d\d\d)'), and the
',', which is not a \d.

That changes $_ to '12,345,678'. The comma after the '5' is not changed
because '(?!\d)' is a zero-width assertion, and therefore doesn't count
as part of the match, and therefore is not part of what is replaced.
Because the s/.../.../ worked, we repeat the match a third time, but
there isn't another match, and so the while terminates.
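
The walk-through above can be reproduced with a short sketch that records the string after each successful pass. The sub name commify_steps is hypothetical, and the loop is spelled out as a regular while so the intermediate values can be captured.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the substitution one pass at a time and
# returns the string as it looked after every successful pass.
sub commify_steps {
    my ($n) = @_;
    my @steps;
    while ($n =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/) {
        push @steps, $n;
    }
    return @steps;
}

print "$_\n" for commify_steps('12345678');
# 12345,678
# 12,345,678
```

The third attempt fails to match, so nothing more is recorded and the loop ends.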

b)
#remove (nested (even deeply nested (like this))) remarks

1 while s/\([^()]*\)//g;
# why escape the first ( and second ), what about the ( or ) in
between

The escapes are there to indicate that they are literal parentheses to
be scanned for, not grouping operators in regular-expression language.

No escapes are needed within the [] because parentheses have no special
meaning within [], and are therefore automatically taken as literal.

To expand, the regular expression means this:

Match on a (, followed by zero or more characters that are not ( or ),
followed by a ).

The first time, we get "remove (nested (even deeply nested )) remarks".
The second time, we get "remove (nested ) remarks".
The third time, we get "remove remarks".
The fourth time, there is no match, and the while terminates.
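
The same passes can be watched with a small sketch. The sub name strip_remark_steps is made up, and the input is simply the comment line from the book.

```perl
use strict;
use warnings;

# Hypothetical helper: removes the innermost (...) groups one pass at a
# time and returns the text as it looked after each successful pass.
sub strip_remark_steps {
    my ($text) = @_;
    my @steps;
    while ($text =~ s/\([^()]*\)//g) {
        push @steps, $text;
    }
    return @steps;
}

my $input = 'remove (nested (even deeply nested (like this))) remarks';
print "[$_]\n" for strip_remark_steps($input);
```

Each pass can only match a pair of parentheses with no parentheses between them, so the remarks peel away from the inside out. Note that the final string keeps both of the spaces that surrounded the outermost remark.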

--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"

Joe Smith · Feb 2, 2004

John said:
The correct form of the line is:
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

Any advantage of using that form as opposed to
1 while s/(\d+)(\d\d\d)/$1,$2/;
?
-Joe

Tad McClellan · Feb 2, 2004

Joe Smith said:
Any advantage of using that form as opposed to
1 while s/(\d+)(\d\d\d)/$1,$2/;
?

Not that I can see, other than less backtracking for the first one.
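
For what it's worth, a quick check along these lines shows both forms giving the same result on a handful of plain integers. It is only a spot check, not a proof, and the names %forms and commify_with are invented for the sketch.

```perl
use strict;
use warnings;

# The two patterns under discussion, keyed by made-up labels.
my %forms = (
    lookahead => qr/(\d)(\d\d\d)(?!\d)/,
    greedy    => qr/(\d+)(\d\d\d)/,
);

# Hypothetical helper: run the "1 while s///" loop with a given pattern.
sub commify_with {
    my ($re, $n) = @_;
    1 while $n =~ s/$re/$1,$2/;
    return $n;
}

for my $n (qw(0 999 1000 1250000 12345678)) {
    my $a = commify_with($forms{lookahead}, $n);
    my $b = commify_with($forms{greedy},    $n);
    printf "%-10s %-12s %-12s %s\n", $n, $a, $b, ($a eq $b ? 'same' : 'DIFFERENT');
}
```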

Matt Garrish · Feb 2, 2004

Joe Smith said:
Any advantage of using that form as opposed to
1 while s/(\d+)(\d\d\d)/$1,$2/;
?

The larger the number to format, the faster your regex becomes compared
with the original:

Formatting the number 1250000 returned:

Method 1: 15 wallclock secs (14.50 usr + 0.00 sys = 14.50 CPU) @ 68965.52/s (1000000)
Method 2: 15 wallclock secs (14.70 usr + 0.00 sys = 14.70 CPU) @ 68013.33/s (1000000)

Formatting the number 1250000123456789000000 returned:

Method 1: 58 wallclock secs (57.36 usr + 0.00 sys = 57.36 CPU) @ 17433.75/s (1000000)
Method 2: 41 wallclock secs (41.73 usr + 0.03 sys = 41.76 CPU) @ 23943.49/s (1000000)
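
A comparison like this can be set up with the core Benchmark module. What follows is a sketch and not the script that produced the figures above: the labels 'lookahead' and 'greedy', the sub names, and the much smaller iteration count are all assumptions.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical wrappers around the two substitutions being compared.
sub commify_lookahead {
    my ($n) = @_;
    1 while $n =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;
    return $n;
}

sub commify_greedy {
    my ($n) = @_;
    1 while $n =~ s/(\d+)(\d\d\d)/$1,$2/;
    return $n;
}

# Numbers are kept as strings so the long one is not turned into a float.
for my $number ('1250000', '1250000123456789000000') {
    print "Formatting the number $number:\n";
    timethese(20_000, {
        lookahead => sub { commify_lookahead($number) },
        greedy    => sub { commify_greedy($number) },
    });
}
```

With so few iterations Benchmark may warn that the counts are too small to be reliable; raise the count to get meaningful timings on your own machine.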

Matt

John W. Kennedy · Feb 2, 2004

Joe said:
Any advantage of using that form as opposed to
1 while s/(\d+)(\d\d\d)/$1,$2/;
?

I didn't write it -- just corrected a syntax error.


Peter · Feb 3, 2004

Thanks John and Tad for answering my questions.

