Hard regexp problem

Discussion in 'Perl Misc' started by Aaron Sherman, Oct 23, 2004.

  1. I was stumped for a while today, when I ran into a regular expression
    problem. Now, I'm a pretty good regexp hacker, but this problem was so
    constrained and seemed at first glance to be so wrong for a regexp
    that I didn't think it could be done. However, in specifying it to a
    friend, I realized that it could.

    I'm curious to see if there's a better solution.

    The problem is this: You have a function call in a library that takes
    a regexp, anchors it on both ends by adding "^" and "$" to the
    beginning end, and then applies that to an input string.

    sub foo {
    my $re = shift;
    my $in = <STDIN>;
    die "'$in' is bad" unless $in =~ /^$re$/;
    }

    It exits with an error if the string does not match the regexp. I
    wanted to accept strings that did NOT contain a particular substring.
    Normally, I would have written:

    !/xyz/

    But a) my regexp is going to have to match the whole line, which the
    above does not, and b) I can't tell the function to negate the regexp.

    In the end, this is what I came up with:

    qr{((?=[^x]|x[^y]|xy[^z]).)*}

    That was the final version. I had done this previously:

    qr{([^x]|x[^y]|xy[^z])*}

    but that fails on:

    xyxyz

    which it allows to slip through because it consumes the leading "xyx"
    on the first pass.
    Aaron Sherman, Oct 23, 2004
    #1
    1. Advertising

  2. Abigail wrote:
    > Aaron Sherman () wrote on MMMMLXX September
    > MCMXCIII in <URL:news:>:
    > () I was stumped for a while today, when I ran into a regular expression
    > () problem. Now, I'm a pretty good regexp hacker, but this problem was so
    > () constrained and seemed at first glance to be so wrong for a regexp
    > () that I didn't think it could be done. However, in specifying it to a
    > () friend, I realized that it could.
    > ()
    > () I'm curious to see if there's a better solution.
    > ()
    > () The problem is this: You have a function call in a library that takes
    > () a regexp, anchors it on both ends by adding "^" and "$" to the
    > () beginning end, and then applies that to an input string.
    > ()
    > () sub foo {
    > () my $re = shift;
    > () my $in = <STDIN>;
    > () die "'$in' is bad" unless $in =~ /^$re$/;
    > () }
    > ()
    > () It exits with an error if the string does not match the regexp. I
    > () wanted to accept strings that did NOT contain a particular substring.
    > () Normally, I would have written:
    > ()
    > () !/xyz/
    >
    > /^[^x]*(?:x(?!yx)[^x]*)*$/
    >
    > Or:
    >
    > /^(?:(?!xyz).)*$/
    >


    Er no,

    /^(?!.*xyz).*$/

    There have been numerous previous threads on this question.
    Brian McCauley, Oct 23, 2004
    #2
    1. Advertising

  3. On 22 Oct 2004, Aaron Sherman wrote:

    >I was stumped for a while today, when I ran into a regular expression
    >problem. Now, I'm a pretty good regexp hacker, but this problem was so
    >constrained and seemed at first glance to be so wrong for a regexp
    >that I didn't think it could be done. However, in specifying it to a
    >friend, I realized that it could.
    >
    >I'm curious to see if there's a better solution.
    >
    >The problem is this: You have a function call in a library that takes
    >a regexp, anchors it on both ends by adding "^" and "$" to the
    >beginning end, and then applies that to an input string.
    >
    > sub foo {
    > my $re = shift;
    > my $in = <STDIN>;
    > die "'$in' is bad" unless $in =~ /^$re$/;
    > }
    >
    >It exits with an error if the string does not match the regexp. I
    >wanted to accept strings that did NOT contain a particular substring.
    >Normally, I would have written:
    >
    > !/xyz/
    >
    >But a) my regexp is going to have to match the whole line, which the
    >above does not, and b) I can't tell the function to negate the regexp.


    So make your regex consist of a negative look-ahead:

    (?!.*xyz).*

    When that gets put into your program, it'll be:

    /^(?!.*xyz).*$/

    which says "at the beginning of the string, if we CAN'T match '.*xyz',
    then match .* and then the end of the string" which, as long as there
    aren't embedded newlines, will work just fine.

    --
    Jeff "japhy" Pinyan % How can we ever be the sold short or
    RPI Acacia Brother #734 % the cheated, we who for every service
    Senior Dean, Fall 2004 % have long ago been overpaid?
    RPI Corporation Secretary %
    http://japhy.perlmonk.org/ % -- Meister Eckhart
    Jeff 'japhy' Pinyan, Oct 23, 2004
    #3
  4. Aaron Sherman

    Lukas Mai Guest

    Aaron Sherman schrob:
    [...]
    > The problem is this: You have a function call in a library that takes
    > a regexp, anchors it on both ends by adding "^" and "$" to the
    > beginning end, and then applies that to an input string.


    > sub foo {
    > my $re = shift;
    > my $in = <STDIN>;
    > die "'$in' is bad" unless $in =~ /^$re$/;
    > }


    Note that this doesn't work with foo('this|that'); you need /^(?:$re)$/.

    HTH, Lukas
    --
    print${;eval{sub{die\"Just another $_[0] hacker, "}->(Perl::)}||$@}
    Lukas Mai, Oct 23, 2004
    #4
  5. Abigail wrote:

    > Brian McCauley () wrote on MMMMLXXI September MCMXCIII in
    > <URL:news:cld243$b09$>:
    > //
    > // Abigail wrote:
    > // > /^(?:(?!xyz).)*$/
    > //
    > // Er no,
    >
    > No? You mean, any of my regexen are wrong?


    For certain values of 'wrong' :)

    > Could you give an example of a
    > string were it gives the wrong result?


    No. But not giving the right result is not the only way to be wrong.
    Your solution is more complex than necessary both from a human and a
    computational point of view.

    use Benchmark;

    my $s ='dshidahsidahsdias' x 1000;

    timethese 1000 => {
    Abigail => sub { $s =~ /^(?:(?!xyz).)*$/ },
    Usual => sub { $s =~ /^(?!.*xyz).*$/ },
    };
    __END__
    Benchmark: timing 1000 iterations of Abigail, Usual...
    Abigail: 4 wallclock secs ( 3.85 usr + 0.00 sys = 3.85 CPU) @
    259.74/s (n=1000)
    Usual: 1 wallclock secs ( 0.60 usr + 0.00 sys = 0.60 CPU) @
    1666.67/s (n=1000)

    > // /^(?!.*xyz).*$/
    >
    > A trailing .*, always very useful in a regexp.


    Given that the problem specification required the regex start with ^ and
    end with $ the trailing .* seemed the best way way to fulfil this
    requirement.

    > The disadvantage of '/^(?!.*xyz).*$/' is that it isn't easy to use it as
    > part of a larger expression - for instance if only part of your string
    > should not contain "xyz". My expression OTOH can easily be taken and
    > put in a larger expression.


    This is completely true. And had the ability to be included in a larger
    expression been part of the requirement then your solution would
    probably have been the best one. Except of course if the ability to
    be included in a larger expression _had_ been part of the requirement
    then the problem specifiaction would have had to said what was to be
    done with xyz that crossed the endpoint of the matched substring.
    Depending on what was desired in that situation your solution may not
    have met the requirement.
    Brian McCauley, Oct 26, 2004
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Greg Hurrell
    Replies:
    4
    Views:
    159
    James Edward Gray II
    Feb 14, 2007
  2. Mikel Lindsaar
    Replies:
    0
    Views:
    482
    Mikel Lindsaar
    Mar 31, 2008
  3. Joao Silva
    Replies:
    16
    Views:
    359
    7stud --
    Aug 21, 2009
  4. Uldis  Bojars
    Replies:
    2
    Views:
    190
    Janwillem Borleffs
    Dec 17, 2006
  5. Matìj Cepl

    new RegExp().test() or just RegExp().test()

    Matìj Cepl, Nov 24, 2009, in forum: Javascript
    Replies:
    3
    Views:
    180
    Matěj Cepl
    Nov 24, 2009
Loading...

Share This Page