difficult substitution patterns

Discussion in 'Perl Misc' started by Peter, Feb 1, 2004.

  1. Peter

    Peter Guest

    I am a relative newbie to Perl. I am reading Programming Perl to learn
    Perl. In the chapter on pattern matching I came across the following
    substitutions that I can't understand completely. It would be great if
    someone could explain these.

    Thanks in advance

    a)
    #put commas in the right place in an integer

    1 while s/(\d) (\d\d\d) (?!\d)/$1,$2;
    # what does this mean (?!\d) and what purpose does it serve

    b)
    #remove (nested (even deeply nested (like this))) remarks

    1 while s/\([^()]*\)//g;
    # why escape the first ( and second ), what about the ( or ) in
    between
     
    Peter, Feb 1, 2004
    #1

  2. Peter wrote:

    > It would be great if
    > someone could explain these.


    > a)
    > #put commas in the right place in an integer
    >
    > 1 while s/(\d) (\d\d\d) (?!\d)/$1,$2;



    That can't be the right code. It does not compile...


    > # what does this mean (?!\d)



    Did you look it up in the std docs yet?


    perldoc perlre

    ...
    A zero-width negative look-ahead assertion.
    ...


    > and what purpose does it serve



    To ensure that the three digits matched by (\d\d\d) are the
    last (rightmost) three digits of a run of digits.
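One way to see the purpose is to run the corrected substitution with and without the look-ahead. This is only a minimal sketch; the variable names `$with` and `$without` are made up for illustration.

```perl
use strict;
use warnings;

# With the negative look-ahead: the three digits must not be followed
# by another digit, so the comma always lands three digits from the
# right end of a run of digits.
my $with = '12345678';
1 while $with =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;

# Without it: the leftmost run of four digits matches on every pass,
# so commas pile up from the left instead.
my $without = '12345678';
1 while $without =~ s/(\d)(\d\d\d)/$1,$2/;

print "$with\n";      # 12,345,678
print "$without\n";   # 1,2,3,4,5,678
```

Both loops terminate, but only the first puts the commas where you want them.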


    > b)
    > #remove (nested (even deeply nested (like this))) remarks
    >
    > 1 while s/\([^()]*\)//g;
    > # why escape the first ( and second ),



    Because parentheses are regex metacharacters.

    You must backslash them to match literal parenthesis characters.


    > what about the ( or ) in
    > between



    Parentheses are not metacharacters in a character class,
    so they need no escaping there.

    There are only 4 metacharacters in character classes:

    ] # ends the class, unless it is first

    ^ # negates the class if it is first

    - # forms a range, unless it is first or last

    \ # for escaping the other metachars
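The rules in the list above can be checked with a few small matches. This is a hedged sketch; the sample strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Parentheses are literal inside a character class, no backslash needed.
my @parens = 'f(x) + g(y)' =~ /[()]/g;       # ( ) ( )

# ']' is literal when it is the first character of the class.
my $bracket = 'a]b' =~ /[]]/ ? 1 : 0;        # 1

# '^' negates only when it is first; elsewhere it is a literal caret.
my ($not_a) = 'abc' =~ /([^a])/;             # b
my ($caret) = '2^8' =~ /([8^])/;             # ^

# '-' forms a range between two characters, but is literal when last.
my @range   = 'a-c b' =~ /[a-c]/g;           # a c b
my @literal = 'a-c b' =~ /[ac-]/g;           # a - c

print "@parens\n$bracket\n$not_a\n$caret\n@range\n@literal\n";
```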


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Feb 1, 2004
    #2

  3. Peter wrote:

    > I am a relative newbie to Perl. I am reading Programming Perl to learn
    > Perl. In the chapter on pattern matching I came across the following
    > substitutions that I can't understand completely. It would be great if
    > someone could explain these.
    >
    > Thanks in advance
    >
    > a)
    > #put commas in the right place in an integer
    >
    > 1 while s/(\d) (\d\d\d) (?!\d)/$1,$2;
    > # what does this mean (?!\d) and what purpose does it serve


    The correct form of the line is:
    1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

    The (?!\d) is what is known as a zero-width negative look-ahead
    assertion. It means that after the (\d) and the (\d\d\d) there is _not_
    another \d. "Zero-width" means that whatever follows is only checked;
    it doesn't count as part of the match.

    Let's say that we are processing 12345678.

    We try the match. The first thing that works is the '5' (which matches
    '(\d)'), the '678' (which matches '(\d\d\d)') and the end, which is not
    a \d.

    That changes $_ to '12345,678'. Because the s/.../.../ worked, we
    repeat the while. This time, the first thing that works is the '2'
    (which matches '(\d)'), the '345' (which matches '(\d\d\d)'), and the
    ',', which is not a \d.

    That changes $_ to '12,345,678'. The comma after the '5' is not changed
    because '(?!\d)' is a zero-width assertion, and therefore doesn't count
    as part of the match, and therefore is not part of what is replaced.
    Because the s/.../.../ worked, we repeat the match a third time, but
    there isn't another match, and so the while terminates.
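The walk-through above can be reproduced by recording what each pass captured. A minimal sketch, assuming the corrected form of the line; `$n` and `@trace` are names picked for this example.

```perl
use strict;
use warnings;

my $n = '12345678';
my @trace;

# Each successful pass inserts exactly one comma. Inside the loop body,
# $1 and $2 still hold the captures from the substitution that just ran.
while ($n =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/) {
    push @trace, "\$1=$1 \$2=$2 -> $n";
}

print "$_\n" for @trace;
# $1=5 $2=678 -> 12345,678
# $1=2 $2=345 -> 12,345,678
```

The third attempt finds no match, so the loop ends after two recorded passes.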

    > b)
    > #remove (nested (even deeply nested (like this))) remarks
    >
    > 1 while s/\([^()]*\)//g;
    > # why escape the first ( and second ), what about the ( or ) in
    > between


    The escapes are there to indicate that they are literal parentheses to
    be scanned for, not grouping operators in regular-expression language.

    There are no escapes within the [] because parentheses have no special
    meaning within [], and are therefore automatically taken as literal.

    To expand, the regular expression means this:

    Match on a (, followed by zero or more characters that are not ( or ),
    followed by a ).

    The first time, we get "remove (nested (even deeply nested )) remarks".
    The second time, we get "remove (nested ) remarks".
    The third time, we get "remove  remarks" (two spaces are left behind).
    The fourth time, there is no match, and the while terminates.
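The passes described above can be watched directly by saving the string after every successful substitution. This is a sketch only; `$text` and `@passes` are illustrative names.

```perl
use strict;
use warnings;

my $text = 'remove (nested (even deeply nested (like this))) remarks';
my @passes;

# Each pass strips every innermost pair of parentheses, meaning those
# with no other parenthesis between them. s///g returns the number of
# replacements made, which is false once nothing is left to remove.
while ($text =~ s/\([^()]*\)//g) {
    push @passes, $text;
}

print "[$_]\n" for @passes;
# [remove (nested (even deeply nested )) remarks]
# [remove (nested ) remarks]
# [remove  remarks]
```

Note that the spaces on either side of the outermost remark both survive, so the result has a double space.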

    --
    John W. Kennedy
    "But now is a new thing which is very old--
    that the rich make themselves richer and not poorer,
    which is the true Gospel, for the poor's sake."
    -- Charles Williams. "Judgement at Chelmsford"
     
    John W. Kennedy, Feb 1, 2004
    #3
  4. Joe Smith

    Joe Smith Guest

    Re: difficult substitution patterns (commafying)

    John W. Kennedy wrote:

    > The correct form of the line is:
    > 1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;


    Any advantage of using that form as opposed to
    1 while s/(\d+)(\d\d\d)/$1,$2/;
    ?
    -Joe
     
    Joe Smith, Feb 2, 2004
    #4
  5. Re: difficult substitution patterns (commafying)

    Joe Smith wrote:
    > John W. Kennedy wrote:
    >
    >> The correct form of the line is:
    >> 1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

    >
    > Any advantage of using that form as opposed to
    > 1 while s/(\d+)(\d\d\d)/$1,$2/;
    > ?



    Not that I can see, other than less backtracking for the first one.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Feb 2, 2004
    #5
  6. Matt Garrish

    Matt Garrish Guest

    Re: difficult substitution patterns (commafying)

    "Joe Smith" <> wrote in message
    news:4thTb.201492$I06.2218813@attbi_s01...
    > John W. Kennedy wrote:
    >
    > > The correct form of the line is:
    > > 1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

    >
    > Any advantage of using that form as opposed to
    > 1 while s/(\d+)(\d\d\d)/$1,$2/;
    > ?


    The larger the number to format, the faster your regex becomes:

    Formatting the number 1250000 returned:

    Method 1: 15 wallclock secs (14.50 usr + 0.00 sys = 14.50 CPU) @ 68965.52/s (1000000)
    Method 2: 15 wallclock secs (14.70 usr + 0.00 sys = 14.70 CPU) @ 68013.33/s (1000000)

    Formatting the number 1250000123456789000000 returned:

    Method 1: 58 wallclock secs (57.36 usr + 0.00 sys = 57.36 CPU) @ 17433.75/s (1000000)
    Method 2: 41 wallclock secs (41.73 usr + 0.03 sys = 41.76 CPU) @ 23943.49/s (1000000)
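The script behind these timings was not posted, and the post does not say which regex is Method 1 and which is Method 2. The following is only a guess at how such a comparison could be set up with the core Benchmark module; the sub names and the iteration count are assumptions, and timings will differ by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Look-ahead form, as corrected earlier in the thread.
sub commify_lookahead {
    my ($n) = @_;
    1 while $n =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;
    return $n;
}

# Greedy form proposed by Joe Smith.
sub commify_greedy {
    my ($n) = @_;
    1 while $n =~ s/(\d+)(\d\d\d)/$1,$2/;
    return $n;
}

# The numbers are kept as strings so the long one is not converted
# to a floating-point value.
for my $number ('1250000', '1250000123456789000000') {
    print "Formatting the number $number:\n";
    timethese(100_000, {
        'look-ahead' => sub { commify_lookahead($number) },
        'greedy'     => sub { commify_greedy($number) },
    });
}
```

Both subs work on a copy of their argument, so every iteration starts from the unformatted number.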


    Matt
     
    Matt Garrish, Feb 2, 2004
    #6
  7. Re: difficult substitution patterns (commafying)

    Joe Smith wrote:

    > John W. Kennedy wrote:
    >
    >> The correct form of the line is:
    >> 1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

    >
    >
    > Any advantage of using that form as opposed to
    > 1 while s/(\d+)(\d\d\d)/$1,$2/;
    > ?


    I didn't write it -- just corrected a syntax error.

    --
    John W. Kennedy
    "But now is a new thing which is very old--
    that the rich make themselves richer and not poorer,
    which is the true Gospel, for the poor's sake."
    -- Charles Williams. "Judgement at Chelmsford"
     
    John W. Kennedy, Feb 2, 2004
    #7
  8. Peter

    Peter Guest

    Thanks John and Tad for answering my questions.
     
    Peter, Feb 3, 2004
    #8
