difficult substitution patterns

Peter

I am a relative newbie to Perl. I am reading Programming Perl to learn
Perl. In the chapter on pattern matching I came across the following
substitutions that I can't understand completely. It would be great if
someone could explain these.

Thanks in advance

a)
#put commas in the right place in an integer

1 while s/(\d) (\d\d\d) (?!\d)/$1,$2;
# what does this mean (?!\d) and what purpose does it serve

b)
#remove (nested (even deeply nested (like this))) remarks

1 while s/\([^()]*\)//g;
# why escape the first ( and second ), what about the ( or ) in
between
 
Tad McClellan

Peter said:
It would be great if
someone could explain these.
a)
#put commas in the right place in an integer

1 while s/(\d) (\d\d\d) (?!\d)/$1,$2;


That can't be the right code. It does not compile...

# what does this mean (?!\d)


Did you look it up in the std docs yet?


perldoc perlre

...
A zero-width negative look-ahead assertion.
...

and what purpose does it serve


To ensure that the 3 digit chars that are matched are not followed by
another digit, so they are always the last (rightmost) run of digits
that has no comma in front of it yet.

b)
#remove (nested (even deeply nested (like this))) remarks

1 while s/\([^()]*\)//g;
# why escape the first ( and second ),


Because parentheses are regex metacharacters.

You must backslash them to match literal parenthesis characters.

what about the ( or ) in
between


Parentheses are not metacharacters in a character class,
so they need no escaping there.

There are only 4 metacharacters in character classes:

] # ends the class, unless it is first

^ # negates the class if it is first

- # forms a range, unless it is first or last

\ # for escaping the other metachars
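
A few quick checks of the rules in that list may help. This is only a sketch; the sample strings and variable names are made up for illustration and are not from the book.

```perl
use strict;
use warnings;

# Parentheses are literal inside a character class, so no backslash is needed.
my $parens = "a(b)c";
(my $stripped = $parens) =~ s/[()]//g;      # removes both ( and )
print "$stripped\n";                        # prints "abc"

# ']' placed first in the class is a literal ']'.
my $brackets  = "x]y";
my $has_close = $brackets =~ /[]]/ ? 1 : 0;

# '-' placed last is a literal hyphen, not a range operator.
my $hyphen = "a-z";
(my $no_hyphen = $hyphen) =~ s/[a-]//g;     # removes 'a' and '-', leaving "z"

# '^' only negates when it comes first; elsewhere it is a literal caret.
my $caret     = "2^8";
my $has_caret = $caret =~ /[x^]/ ? 1 : 0;   # matches the '^' in the string
```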
 
J

John W. Kennedy

Peter said:
a)
#put commas in the right place in an integer

1 while s/(\d) (\d\d\d) (?!\d)/$1,$2;
# what does this mean (?!\d) and what purpose does it serve

The correct form of the line is:
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

The (?!\d) is what is known as a zero-width assertion. It means that
after the (\d) and the (\d\d\d) there is _not_ another \d. That it is a
"zero-width assertion" means that the thing it matches doesn't count as
part of the match; it's just checked.
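
To see that the look-ahead is checked but not consumed, one can look at what the pattern actually matched. The helper name matched_text below is hypothetical, just for this sketch; it returns Perl's $& (the matched text) or undef when nothing matches.

```perl
use strict;
use warnings;

# Returns the text consumed by the pattern, or undef if there is no match.
# The character inspected by (?!\d) never becomes part of that text.
sub matched_text {
    my ($string) = @_;
    return $string =~ /(\d)(\d\d\d)(?!\d)/ ? $& : undef;
}

print matched_text("1234,567"), "\n";   # prints "1234"; the comma is only inspected
print matched_text("12345"), "\n";      # prints "2345"; "1234" is rejected because "5" follows
```

With only three digits, as in "123", there is no match at all, because the pattern needs four digits in a row.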

Let's say that we are processing 12345678.

We try the match. The first thing that works is the '5' (which matches
'(\d)'), the '678' (which matches '(\d\d\d)') and the end, which is not
a \d.

That changes $_ to '12345,678'. Because the s/.../.../ worked, we
repeat the while. This time, the first thing that works is the '2'
(which matches '(\d)'), the '345' (which matches '(\d\d\d)'), and the
',', which is not a \d.

That changes $_ to '12,345,678'. The comma after the '5' is not changed
because '(?!\d)' is a zero-width assertion, and therefore doesn't count
as part of the match, and therefore is not part of what is replaced.
Because the s/.../.../ worked, we repeat the match a third time, but
there isn't another match, and so the while terminates.
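
The walk-through above can be reproduced by recording the string after every successful pass. This is a small sketch; commify_trace is a name invented here, and it works on a copy rather than on $_.

```perl
use strict;
use warnings;

# Applies the substitution one pass at a time and records each intermediate
# string, so the right-to-left insertion of commas is visible.
sub commify_trace {
    my ($number) = @_;
    my @steps = ($number);
    while ($number =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/) {
        push @steps, $number;
    }
    return @steps;
}

print join(" -> ", commify_trace("12345678")), "\n";
# prints: 12345678 -> 12345,678 -> 12,345,678
```

The loop body plays the role of the "1" in "1 while ...": the real work happens in the condition, and the loop stops as soon as the substitution fails.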
b)
#remove (nested (even deeply nested (like this))) remarks

1 while s/\([^()]*\)//g;
# why escape the first ( and second ), what about the ( or ) in
between

The escapes are there to indicate that they are literal parentheses to
be scanned for, not grouping operators in regular-expression language.

The escapes are not within the [] because parentheses have no meaning
within [], and are therefore automatically taken as literal.

To expand, the regular expression means this:

Match on a (, followed by zero or more characters that are not ( or ),
followed by a ).

The first time, we get "remove (nested (even deeply nested )) remarks".
The second time, we get "remove (nested ) remarks".
The third time, we get "remove  remarks" (the two spaces that surrounded
the remark are both left behind).
The fourth time, there is no match, and the while terminates.
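
The same passes can be printed out with a short script. Again only a sketch: strip_remarks_trace is a made-up name, and the input is the text of the comment from the book's example without the leading #.

```perl
use strict;
use warnings;

# Removes innermost parenthesized remarks one pass at a time and records the
# string after each pass. With /g, every innermost pair goes in the same pass.
sub strip_remarks_trace {
    my ($text) = @_;
    my @steps;
    while ($text =~ s/\([^()]*\)//g) {
        push @steps, $text;
    }
    return @steps;
}

my @steps = strip_remarks_trace(
    "remove (nested (even deeply nested (like this))) remarks");
print "$_\n" for @steps;
```

Because [^()]* cannot cross another parenthesis, only a pair with no parentheses inside it can match, which is why the nesting is peeled off from the inside out.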

--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"
 
Joe Smith

John said:
The correct form of the line is:
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

Any advantage of using that form as opposed to
1 while s/(\d+)(\d\d\d)/$1,$2/;
?
-Joe
 
Tad McClellan

Joe Smith said:
Any advantage of using that form as opposed to
1 while s/(\d+)(\d\d\d)/$1,$2/;
?


Not that I can see, other than less backtracking for the first one.
 
Matt Garrish

Joe Smith said:
Any advantage of using that form as opposed to
1 while s/(\d+)(\d\d\d)/$1,$2/;
?

The larger the number to format, the faster your regex becomes relative to the original:

Formatting the number 1250000 returned:

Method 1: 15 wallclock secs (14.50 usr + 0.00 sys = 14.50 CPU) @ 68965.52/s (1000000)
Method 2: 15 wallclock secs (14.70 usr + 0.00 sys = 14.70 CPU) @ 68013.33/s (1000000)

Formatting the number 1250000123456789000000 returned:

Method 1: 58 wallclock secs (57.36 usr + 0.00 sys = 57.36 CPU) @ 17433.75/s (1000000)
Method 2: 41 wallclock secs (41.73 usr + 0.03 sys = 41.76 CPU) @ 23943.49/s (1000000)
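
For anyone who wants to repeat a comparison like this, here is a minimal sketch using the core Benchmark module. It is not the script that produced the figures above: the labels "lookahead" and "greedy", the sub names, and the iteration count are assumptions made for illustration, and the thread does not say which regex was Method 1 and which was Method 2.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# The form from the book, with the negative look-ahead.
sub commify_lookahead {
    my ($number) = @_;
    1 while $number =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;
    return $number;
}

# The alternative asked about in the thread, relying on greedy \d+.
sub commify_greedy {
    my ($number) = @_;
    1 while $number =~ s/(\d+)(\d\d\d)/$1,$2/;
    return $number;
}

# Assumed iteration count, kept small so the sketch finishes quickly;
# the run quoted above used 1000000.
my $iterations = 20_000;

# Numbers are passed as strings so the long one is not turned into a float.
for my $number ('1250000', '1250000123456789000000') {
    print "Formatting the number $number:\n";
    timethese($iterations, {
        'lookahead' => sub { commify_lookahead($number) },
        'greedy'    => sub { commify_greedy($number) },
    });
}
```

Both subs give the same result on plain integers, so any difference in the timings comes from how the regex engine searches, not from what it produces. Timings will vary with the Perl version and the machine.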


Matt
 
John W. Kennedy

Joe said:
Any advantage of using that form as opposed to
1 while s/(\d+)(\d\d\d)/$1,$2/;
?

I didn't write it -- just corrected a syntax error.

 
Peter

Thanks John and Tad for answering my questions.

 
