Basic Regular Expressions question...

Discussion in 'Perl Misc' started by Will, Apr 6, 2005.

  1. Will

    Will Guest

    Hi,
    I have a longer program that finds and recursively replaces text in
    many html files that works beautifully for most cases, but I think I'm
    getting hung up on a s/// and regular expressions glitch. I wrote a
    very short program that gets to the heart of the matter...

    ################################################################################
    use strict;
    use warnings;


    my $find="https://sinaicentral.mssm.edu/intranet/intranet/ct_public/view?trial_id=MSM03204&searchNow=no";
    my $replace="http://www.excite.com";


    my $thisPage="https://sinaicentral.mssm.edu/intranet/intranet/ct_public/view?trial_id=MSM03204&searchNow=no";

    $thisPage =~ s#$find#$replace#g;

    print $thisPage;
    ################################################################################

    To my understanding, this program should take the long string in $find
    and then replace it with $replace and the output should be
    "http://www.excite.com". I think the "?" in the $find variable is
    being treated as a Regular Expression but I can't figure out a way to
    nullify that effect. I'm a librarian not a programmer! Somebody please
    help! I'm working for a worthy non-profit that is strapped for cash, so
    I have to figure this out! It will bring you good karma! Thanks a
    bunch!

    Will Jiang
    Will, Apr 6, 2005
    #1

  2. "Will" wrote:

    > To my understanding, this program should take the long string in $find
    > and then replace it with $replace and the output should be
    > "http://www.excite.com". I think the "?" in the $find variable is
    > being treated as a Regular Expression but I can't figure out a way to
    > nullify that effect.


    To put it correctly, ? is special in a regular expression.

    perldoc perlreref

    ? Matches the preceding element 0 or 1 times

    also from the same document

    \Q Disable pattern metacharacters until \E

    $thisPage =~ s{\Q$find\E}{$replace};

    should work.
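A minimal sketch of the difference, for anyone who wants to try it. The shortened example.org URL and the helper name `replace_literal` are made up for illustration and are not part of Will's program; the `/g` modifier is kept because the original substitution used it.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every literal occurrence of $find in $text.
sub replace_literal {
    my ($text, $find, $replace) = @_;
    # \Q...\E escapes regex metacharacters such as ? and . inside $find
    $text =~ s/\Q$find\E/$replace/g;
    return $text;
}

# Made-up stand-in for the long URL in the original post.
my $find    = 'https://example.org/view?trial_id=MSM03204&searchNow=no';
my $replace = 'http://www.excite.com';

# Unquoted: "w?" means "an optional w", so the literal "?" is never matched.
(my $unquoted = $find) =~ s#$find#$replace#g;

# Quoted: the whole of $find is matched character for character.
my $quoted = replace_literal($find, $find, $replace);

print "unquoted: $unquoted\n";   # the string is left unchanged
print "quoted:   $quoted\n";     # http://www.excite.com
```

The built-in quotemeta() function does the same escaping as \Q, which can be handy when the pattern is assembled in several steps.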

    > I'm a librarian not a programmer! Sombody
    > please help! I'm working for a worthy non-profit that is strapped for
    > cash, so I have to figure this out! It will bring you good karma!


    None of this increases your chances of getting help. Describing your
    problem accurately, as you did, is the crucial part.

    For further information on how to help others help you, please see the
    posting guidelines for this group if you haven't already done so.

    Sinan

    --
    A. Sinan Unur

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Apr 6, 2005
    #2

  3. Will

    Will Guest

    THANKS SO MUCH! I really appreciate the help! Have a wonderful day!

    Will Jiang
    Will, Apr 6, 2005
    #3
  4. A. Sinan Unur wrote:
    >
    > \Q Disable pattern metacharacters until \E
    >
    > $thispage =~ s{\Q$find\E}{$replace};


    Since escaping all the characters in PATTERN makes it a non-regex
    problem, I played with using index() and substr() instead:

    substr $thisPage, index($thisPage, $find), length $find, $replace;

    However, to take the /g modifier into consideration (which the OP
    originally used), you seem to need something like:

    my ($i, $length) = (0,0);
    while ( ( $i = index $thisPage, $find, $i+$length ) >= 0 ) {
        $length = length $find;
        substr $thisPage, $i, $length, $replace;
    }

    That's a lot of typing to 'emulate'

    $thisPage =~ s/\Q$find/$replace/g;

    Assuming that using index() and substr() is more efficient than
    using the s/// operator, is there any easier way to combine them to
    achieve the same result?

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Apr 6, 2005
    #4
  5. Will wrote:

    > I have a longer program that finds and recursively replaces text in



    There is no recursion in what you are doing.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Apr 7, 2005
    #5
  6. Anno Siegel

    Anno Siegel Guest

    Gunnar Hjalmarsson wrote in comp.lang.perl.misc:
    > A. Sinan Unur wrote:
    > >
    > > \Q Disable pattern metacharacters until \E
    > >
    > > $thispage =~ s{\Q$find\E}{$replace};

    >
    > Since escaping all the characters in PATTERN makes it a non-regex
    > problem, I played with using index() and substr() instead:
    >
    > substr $thisPage, index($thisPage, $find), length $find, $replace;
    >
    > However, to take the /g modifier into consideration (which the OP
    > originally used), you seem to need something like:
    >
    > my ($i, $length) = (0,0);
    > while ( ( $i = index $thisPage, $find, $i+$length ) >= 0 ) {
    > $length = length $find;
    > substr $thisPage, $i, $length, $replace;
    > }
    >
    > That's much typing to 'emulate'
    >
    > $thispage =~ s/\Q$find/$replace/g;


    A bit tighter:

    my $i = -1;
    substr $thisPage, $i, length $find, $replace while
    ( $i = index $thisPage, $find, $i + 1) >= 0;

    Anno
    Anno Siegel, Apr 7, 2005
    #6
  7. Anno Siegel wrote:
    > Gunnar Hjalmarsson wrote:
    >> However, to take the /g modifier into consideration (which the OP
    >> originally used), you seem to need something like:
    >>
    >> my ($i, $length) = (0,0);
    >> while ( ( $i = index $thisPage, $find, $i+$length ) >= 0 ) {
    >> $length = length $find;
    >> substr $thisPage, $i, $length, $replace;
    >> }
    >>
    >> That's much typing to 'emulate'
    >>
    >> $thispage =~ s/\Q$find/$replace/g;

    >
    > A bit tighter:
    >
    > my $i = -1;
    > substr $thisPage, $i, length $find, $replace while
    > ( $i = index $thisPage, $find, $i + 1) >= 0;


    Yeah, but I just realized that neither of the above index() + substr()
    solutions would work on e.g. this set of input:

    my $thisPage = "It's a ball. The ball is brown.";
    my $find = 'ball';
    my $replace = 'football';

    Isn't it something like this that's needed:

    my $repl_length = length $replace;
    my $i = -$repl_length;
    while ( ( $i = index $thisPage, $find, $i + $repl_length ) >= 0 ) {
        substr $thisPage, $i, length $find, $replace;
    }

    No wonder, perhaps, that the s/// operator is also frequently used for
    replacing non-regex patterns when efficiency is not a constraint.
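The forward-scanning idea can be wrapped in a small sub, sketched here under a few assumptions. The name `replace_all_index` and the empty-string guard are additions for illustration, not something posted in the thread. The key point is resuming the search after the inserted text, so the "ball" inside a freshly inserted "football" is never found again.

```perl
use strict;
use warnings;

# Hypothetical helper: literal replace-all built on index() and substr().
sub replace_all_index {
    my ($text, $find, $replace) = @_;
    return $text if $find eq '';    # an empty needle would never advance
    my $pos = 0;
    while ( ( my $i = index($text, $find, $pos) ) >= 0 ) {
        # 4-argument substr replaces length($find) characters at $i
        substr($text, $i, length($find), $replace);
        # resume after the inserted text, not inside it
        $pos = $i + length($replace);
    }
    return $text;
}

my $page = "It's a ball. The ball is brown.";
print replace_all_index($page, 'ball', 'football'), "\n";
# It's a football. The football is brown.
```

Because the sub works on a copy of its first argument, the caller's string is left untouched and the result has to be assigned back.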

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Apr 7, 2005
    #7
  8. Anno Siegel

    Anno Siegel Guest

    Gunnar Hjalmarsson wrote in comp.lang.perl.misc:
    > Anno Siegel wrote:
    > > Gunnar Hjalmarsson wrote:
    > >> However, to take the /g modifier into consideration (which the OP
    > >> originally used), you seem to need something like:
    > >>
    > >> my ($i, $length) = (0,0);
    > >> while ( ( $i = index $thisPage, $find, $i+$length ) >= 0 ) {
    > >> $length = length $find;
    > >> substr $thisPage, $i, $length, $replace;
    > >> }
    > >>
    > >> That's much typing to 'emulate'
    > >>
    > >> $thispage =~ s/\Q$find/$replace/g;

    > >
    > > A bit tighter:
    > >
    > > my $i = -1;
    > > substr $thisPage, $i, length $find, $replace while
    > > ( $i = index $thisPage, $find, $i + 1) >= 0;

    >
    > Yeah, but I just realized that neither of the above index() + substr()
    > solutions would work on e.g. this set of input:
    >
    > my $thisPage = "It's a ball. The ball is brown.";
    > my $find = 'ball';
    > my $replace = 'football';
    >
    > Isn't it something like this that's needed:
    >
    > my $repl_length = length $replace;
    > my $i = -$repl_length;
    > while ( ( $i = index $thisPage, $find, $i + $repl_length ) >= 0 ) {
    > substr $thisPage, $i, length $find, $replace;
    > }


    You're right, though I wouldn't bother with storing the length of
    anything. Working backwards runs smoother:

    my $i = length $thisPage;
    substr( $thisPage, $i, length $find) = $replace while
    ( $i = rindex $thisPage, $find, $i) >= 0;

    That way only the unchanged part of the string is ever searched.

    > Maybe no wonder that the s/// operator is frequently used also for
    > replacing non-regex patterns when efficiency is not a restriction.


    I don't think I've used index for anything but simple location or just
    presence/absence. Replacement is too much hassle with the substr() for
    my taste.

    Anno
    Anno Siegel, Apr 7, 2005
    #8
  9. Anno Siegel wrote:
    > Gunnar Hjalmarsson <> wrote in comp.lang.perl.misc:
    >
    >>Anno Siegel wrote:
    >>
    >>>Gunnar Hjalmarsson wrote:
    >>>
    >>>>However, to take the /g modifier into consideration (which the OP
    >>>>originally used), you seem to need something like:
    >>>>
    >>>> my ($i, $length) = (0,0);
    >>>> while ( ( $i = index $thisPage, $find, $i+$length ) >= 0 ) {
    >>>> $length = length $find;
    >>>> substr $thisPage, $i, $length, $replace;
    >>>> }
    >>>>
    >>>>That's much typing to 'emulate'
    >>>>
    >>>> $thispage =~ s/\Q$find/$replace/g;
    >>>
    >>>A bit tighter:
    >>>
    >>> my $i = -1;
    >>> substr $thisPage, $i, length $find, $replace while
    >>> ( $i = index $thisPage, $find, $i + 1) >= 0;

    >>
    >>Yeah, but I just realized that neither of the above index() + substr()
    >>solutions would work on e.g. this set of input:
    >>
    >> my $thisPage = "It's a ball. The ball is brown.";
    >> my $find = 'ball';
    >> my $replace = 'football';
    >>
    >>Isn't it something like this that's needed:
    >>
    >> my $repl_length = length $replace;
    >> my $i = -$repl_length;
    >> while ( ( $i = index $thisPage, $find, $i + $repl_length ) >= 0 ) {
    >> substr $thisPage, $i, length $find, $replace;
    >> }

    >
    >
    > You're right, though I wouldn't bother with storing the length of
    > anything. Working backwards runs smoother:
    >
    > my $i = length $thisPage;
    > substr( $thisPage, $i, length $find) = $replace while
    > ( $i = rindex $thisPage, $find, $i) >= 0;
    >
    > That way only the unchanged part of the string is ever searched.


    Also, using the four argument substr() should be faster.

    my $i = length $thisPage;
    substr $thisPage, $i, length $find, $replace
    while ( $i = rindex $thisPage, $find, $i ) >= 0;



    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Apr 8, 2005
    #9
  10. Anno Siegel

    Anno Siegel Guest

    John W. Krahn wrote in comp.lang.perl.misc:
    > Anno Siegel wrote:
    > > Gunnar Hjalmarsson <> wrote in comp.lang.perl.misc:


    [using index() instead of s///]

    > > my $i = length $thisPage;
    > > substr( $thisPage, $i, length $find) = $replace while
    > > ( $i = rindex $thisPage, $find, $i) >= 0;
    > >
    > > That way only the unchanged part of the string is ever searched.

    >
    > Also, using the four argument substr() should be faster.


    How so? I never heard of that.

    I use "=" with substr() assignments because it reads better. Four argument
    substr is for when I need the old value of the substring too.
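A short sketch of the two forms side by side, using made-up strings. It only illustrates the point about the return value: the 4-argument form hands back the text it replaced, while the lvalue form reads as a plain assignment.

```perl
use strict;
use warnings;

my $s = 'Hello, world';

# 4-argument form: replaces in place and returns the text that was there.
my $old = substr($s, 7, 5, 'Perl!');
print "$old -> $s\n";    # world -> Hello, Perl!

# lvalue form: the same replacement written as an assignment.
my $t = 'Hello, world';
substr($t, 7, 5) = 'Perl!';
print "$t\n";            # Hello, Perl!
```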

    > my $i = length $thisPage;
    > substr $thisPage, $i, length $find, $replace
    > while ( $i = rindex $thisPage, $find, $i ) >= 0;


    Anno
    Anno Siegel, Apr 8, 2005
    #10
  11. Also sprach Anno Siegel:

    > John W. Krahn <> wrote in comp.lang.perl.misc:
    >> Anno Siegel wrote:
    >> > Gunnar Hjalmarsson <> wrote in comp.lang.perl.misc:

    >
    > [using index() instead of s///]
    >
    >> > my $i = length $thisPage;
    >> > substr( $thisPage, $i, length $find) = $replace while
    >> > ( $i = rindex $thisPage, $find, $i) >= 0;
    >> >
    >> > That way only the unchanged part of the string is ever searched.

    >>
    >> Also, using the four argument substr() should be faster.

    >
    > How so? I never heard of that.


    John is right according to a benchmark:

    #!/usr/bin/perl -w

    use strict;
    use Benchmark qw/cmpthese/;

    my $string = "0" x 100;

    cmpthese(-2, {
        arg4 => sub {
            substr $string, rand length $string, 1, "0";
        },
        arg3 => sub {
            substr($string, rand length $string, 1) = "0";
        },
    });
    __END__
             Rate arg3 arg4
    arg3 512250/s   -- -43%
    arg4 903253/s  76%   --

    arg4 is faster because perl needs to attach magic to the return value
    of the 3-argument substr() so that it can be assigned to. The 4-argument
    form needs no such magic.

    > I use "=" with substr() assignments because it reads better. Four argument
    > substr is for when I need the old value of the substring too.


    In most cases though the argument of readability supersedes speed, so
    now you're right. :)

    Tassilo
    --
    use bigint;
    $n=71423350343770280161397026330337371139054411854220053437565440;
    $m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
    Tassilo v. Parseval, Apr 8, 2005
    #11
  12. Anno Siegel

    Anno Siegel Guest

    Tassilo v. Parseval wrote in comp.lang.perl.misc:
    > Also sprach Anno Siegel:
    > > John W. Krahn <> wrote in comp.lang.perl.misc:


    > >> Also, using the four argument substr() should be faster.

    > >
    > > How so? I never heard of that.

    >
    > John is right according to a benchmark:


    Yup. I ran one too. The difference is significant.

    > The reason for arg4 being faster is the fact that perl needs to attach
    > magic to the return value of the 3-argument substr(). In the 4-argument
    > case this is not the case.


    Ah... the return value from 4-arg substr is indeed not magic:

    my $x = 'aaaaabbbbbcccc';
    my $ref = \ substr( $x, 5, 5, 'ZZZZZ');

    $$ref = 'XXXXX'; # has no effect on $x
    print "$x\n"; # aaaaaZZZZZcccc

    After removing the fourth argument from substr(), $x changes to
    "aaaaaXXXXXcccc".

    Live and learn.

    Anno
    Anno Siegel, Apr 8, 2005
    #12
  15. Bart Lateur

    Bart Lateur Guest

    Anno Siegel wrote:

    >You're right, though I wouldn't bother with storing the length of
    >anything. Working backwards runs smoother:
    >
    > my $i = length $thisPage;
    > substr( $thisPage, $i, length $find) = $replace while
    > ( $i = rindex $thisPage, $find, $i) >= 0;
    >
    >That way only the unchanged part of the string is ever searched.


    I'm sure you're aware that the behaviour of this version is not the same
    as the original one for overlapping matches. Try replacing "papa" in
    "papapaya", for example.

    --
    Bart.
    Bart Lateur, Apr 11, 2005
    #15
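The overlapping-match case from post #15 can be checked with a short script. This is a sketch: the replacement string 'X' is an arbitrary choice for illustration, and the backward loop follows the rindex() version quoted in that post.

```perl
use strict;
use warnings;

my ($find, $replace) = ('papa', 'X');   # 'X' is an illustrative replacement

# Left to right, the way s///g scans.
(my $forward = 'papapaya') =~ s/\Q$find\E/$replace/g;

# Right to left, following the rindex() loop.
my $backward = 'papapaya';
my $i = length $backward;
substr($backward, $i, length($find), $replace)
    while ( $i = rindex($backward, $find, $i) ) >= 0;

print "forward:  $forward\n";    # Xpaya
print "backward: $backward\n";   # paXya
```

"papa" occurs at offsets 0 and 2 in "papapaya". The forward scan consumes the occurrence at 0, while the backward scan consumes the one at 2, so the two approaches disagree whenever matches overlap.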
  16. Anno Siegel

    Anno Siegel Guest

    Bart Lateur wrote in comp.lang.perl.misc:
    > Anno Siegel wrote:
    >
    > >You're right, though I wouldn't bother with storing the length of
    > >anything. Working backwards runs smoother:
    > >
    > > my $i = length $thisPage;
    > > substr( $thisPage, $i, length $find) = $replace while
    > > ( $i = rindex $thisPage, $find, $i) >= 0;
    > >
    > >That way only the unchanged part of the string is ever searched.

    >
    > I'm sure you're aware that the behaviour of this version is not the same
    > as the original one for overlapping matches. Try replacing "papa" in
    > "papapaya", for example.


    Frankly, no, I didn't notice. Thanks for pointing it out.

    Anno
    Anno Siegel, Apr 11, 2005
    #16