Why is this regex not working?

Discussion in 'Perl Misc' started by Looking, Sep 16, 2004.

  1. Looking

    Looking Guest

    $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
    $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
    #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
    print "$s\n";

    The second regex works. I wonder why the first regex is not working?
    I am trying to get whatever is between the first pair of "" or '' after
    content=. It is parsing the header file of HTML pages.

    The first regex gave me this:
    "this is what i want " asd " sdf " adfa

    But I need this:
    this is what i want
     
    Looking, Sep 16, 2004
    #1

  2. Looking wrote:
    > $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
    > $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
    > #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
    > print "$s\n";
    >
    > The second regex works. I wonder why the first regex is not working?
    > I am trying to get whatever is between the first pair of "" or '' after
    > content=. It is parsing the header file of HTML pages.
    >
    > The first regex gave me this:
    > "this is what i want " asd " sdf " adfa
    >
    > But I need this:
    > this is what i want

    You may want to check out HTTP::Headers rather than doing this yourself.

    With this regex

    (this won't work for readers using proportional fonts)

    $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
                                  ^

    The problem is that, for a non-greedy match, the question mark must be
    immediately adjacent to the *, i.e. you need to remove the parentheses
    or put the ? inside them. Also, you don't need the | (pipe symbol)
    inside [] character classes; there it is a literal character, not
    alternation.
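
    A minimal sketch of the character-class point, assuming nothing beyond
    core Perl. The variable names $with_pipe and $without_pipe are made up
    for illustration; the two classes are the ones from the posted regexes,
    with and without the pipe.

```perl
use strict;
use warnings;

# Inside [...] the pipe is a literal character, not an "or".
my $with_pipe    = qr/^["|']$/;   # accepts ", | or '
my $without_pipe = qr/^["']$/;    # accepts " or ' only

for my $char ('"', "'", '|') {
    printf "%s  with pipe: %-8s  without pipe: %s\n",
        $char,
        ($char =~ $with_pipe    ? 'match' : 'no match'),
        ($char =~ $without_pipe ? 'match' : 'no match');
}
```

    Only the last character tested, the pipe itself, is treated differently
    by the two classes.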

    regards,

    Mark
     
    Mark Clements, Sep 16, 2004
    #2

  3. Anno Siegel

    Anno Siegel Guest

    Looking <> wrote in comp.lang.perl.misc:
    > $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
    > $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
                             ^         ^
    Do you actually want to allow | besides " and ' for quotes? I think
    you have conflated character class notation and alternation.

    > #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
    > print "$s\n";
    >
    > The second regex works. I wonder why the first regex is not working?
    > I am trying to get whatever is between the first pair of "" or '' after
    > content=. It is parsing the header file of HTML pages.


    Better use a real HTML parser.

    > The first regex gave me this:
    > "this is what i want " asd " sdf " adfa
    >
    > But I need this:
    > this is what i want


    Simple. /.*/ is greedy, it matches the longest string it can while
    still having the rest of the pattern match. So it picks up everything
    until the last " or ' in the line. The question mark in /(.*)?/
    serves no purpose. You probably meant to put it inside the parentheses:
    /(.*?)/. In that position the match will be non-greedy.
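
    To make the difference concrete, here is a small sketch using the
    string from the original post. It uses a plain capturing match rather
    than s///, and the variable names $greedy and $lazy are only
    illustrative.

```perl
use strict;
use warnings;

my $s = 'sadf content= "this is what i want " asd " sdf " adfa " sdf';

# (.*)? : a greedy .* that is merely optional; it still runs on to the
# last quote that lets the rest of the pattern match.
my ($greedy) = $s =~ /content=.*?["'](.*)?["']/s;

# (.*?) : a non-greedy .*; it stops at the first closing quote.
my ($lazy) = $s =~ /content=.*?["'](.*?)["']/s;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

    Note that the non-greedy capture keeps the trailing space that sits
    before the closing quote in the sample string.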

    Anno
     
    Anno Siegel, Sep 16, 2004
    #3
  4. Looking wrote:
    > $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
    > $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
    > #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
    > print "$s\n";
    >
    > The second regex works. I wonder why the first regex is not working?


    That is because *, + and ? are greedy and will match as many characters as
    possible so (.*) will match everything to the end until the last ", | or '
    character. (Why are you trying to match the | character?) You probably want
    something like:

    $s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Sep 16, 2004
    #4
  5. Looking

    Looking Guest

    > Looking wrote:
    > > $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
    > > $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
    > > #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
    > > print "$s\n";
    > >
    > > The second regex works. I wonder why the first regex is not working?

    >
    > That is because *, + and ? are greedy and will match as many characters as
    > possible so (.*) will match everything to the end until the last ", | or '
    > character. (Why are you trying to match the | character?) You probably
    > want something like:
    >
    > $s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;


    May I ask what \1 is? I tried searching Google for \1, but the string
    is too short to search for.
    I need to get whatever is between the first pair of "" or '' after
    content=

    >
    >
    > John
    > --
    > use Perl;
    > program
    > fulfillment
     
    Looking, Sep 16, 2004
    #5
  6. Looking

    Looking Guest

    > Looking wrote:
    > > $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
    > > $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
    > > #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
    > > print "$s\n";
    > >
    > > The second regex works. I wonder why the first regex is not working?
    > > I am trying to get whatever is between the first pair of "" or '' after
    > > content=. It is parsing the header file of HTML pages.
    > >
    > > The first regex gave me this:
    > > "this is what i want " asd " sdf " adfa
    > >
    > > But I need this:
    > > this is what i want

    > You may want to check out HTTP::Headers rather than doing this yourself.
    >


    If you mean HTML::HeadParser, I tried it and it is not working.

    That is the sample it gave:
    $h = HTTP::Headers->new;
    $p = HTML::HeadParser->new($h);
    $p->parse(<<EOT);
    <title>Stupid example</title>
    <base href="http://www.linpro.no/lwp/">
    Normal text starts here.
    EOT
    undef $p;
    print $h->title; # should print "Stupid example"

    I tried $h->description, but it does not return anything. I am trying
    to get the keywords, description, etc., but got nothing.
    If you know where the bug is, let me know.
     
    Looking, Sep 16, 2004
    #6
  7. Looking

    Looking Guest


    > > $s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
    > > $s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
    > > #$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
    > > print "$s\n";
    > >
    > > The second regex works. I wonder why the first regex is not working?

    >
    > That is because *, + and ? are greedy and will match as many characters as
    > possible so (.*) will match everything to the end until the last ", | or '
    > character. (Why are you trying to match the | character?) You probably
    > want something like:
    >
    > $s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
    >


    By the way, I assume \1 is the same as $1 but on the left side. Your code
    is not working; it does not match anything, although I think your idea is
    right.

    $s=qq( "sadf content= "this is what i' want " asd " sdf " adfa " sdf' );
    #$s =~ s/.*content=.*?["'](.*?)["'].*/$1/si;
    $s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
    print "$s\n";

    I hope it can return
    this is what i' want
    but yours return
    "sadf content= "this is what i' want " asd " sdf " adfa " sdf'
    so, no match.
     
    Looking, Sep 16, 2004
    #7
  8. Looking wrote:

    >>
    >>$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;

    >
    >
    > May I ask what \1 is? I am trying to do a search of \1 on google but this
    > string is too short.
    > I need to get whatever is between the first 2 pairs of "" or '' after
    > content=

    You need to read up on regexps. Check out

    man perlre

    For the record, \1 is a backreference ie it refers to a previously
    matched and captured part of the regexp.

    so

    (["'])([^\1]*)[\1]

    matches " or ', followed by any character other than these zero or more
    times, followed by whichever of " and ' was matched the first time.

    \1, \2 etc are typically used within the regexp itself, and $1, $2 etc
    outside it (or in the second part of a s/// operation).

    Mark
     
    Mark Clements, Sep 16, 2004
    #8
  9. Looking wrote:

    > If you mean HTML::HeadParser
    > I tried it and it is not working!.

    Er - I misread your requirement as parsing HTTP headers rather than the
    <HEAD> section of an HTML document. Sorry for leading you down the wrong
    path. Try this


    use strict;
    use warnings;

    use HTML::HeadParser;

    my $p = HTML::HeadParser->new();
    $p->parse(<<EOT);
    <title>Stupid example</title>
    <base href="http://www.linpro.no/lwp/">
    Normal text starts here.
    EOT
    print $p->header("title");
     
    Mark Clements, Sep 16, 2004
    #9
  10. On Thu, 16 Sep 2004, Mark Clements wrote:

    >For the record, \1 is a backreference ie it refers to a previously
    >matched and captured part of the regexp.
    >
    >so
    >
    >(["'])([^\1]*)[\1]
    >
    >matches " or ', followed by any character other than these zero or more
    >times, followed by whichever of " and ' was matched the first time.


    No it doesn't. Character classes are created when the regex is compiled,
    but \1 is not known until the regex is EXECUTED. Using \1 inside a
    character class is the same as using \x01 or \001, it's the ASCII
    character whose ordinal value is 1.
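
    A short sketch of that behaviour, assuming core Perl only. The test
    string and the variable names ($in_class, $backref, $ctrl_a) are
    invented for illustration.

```perl
use strict;
use warnings;

my $s = q{say "hello" now};

# [\1] is a class holding the single character chr(1), so this cannot
# match: the character after "hello" is a quote, not chr(1).
my $in_class = $s =~ /(["'])hello[\1]/ ? 'match' : 'no match';

# Outside a class, \1 is a backreference to whatever group 1 captured.
my $backref = $s =~ /(["'])hello\1/ ? 'match' : 'no match';

# The class form does match a literal chr(1).
my $ctrl_a = "\x01" =~ /^[\1]$/ ? 'match' : 'no match';

print "inside class:  $in_class\n";
print "backreference: $backref\n";
print "chr(1):        $ctrl_a\n";
```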

    --
    Jeff "japhy" Pinyan % How can we ever be the sold short or
    RPI Acacia Brother #734 % the cheated, we who for every service
    Senior Dean, Fall 2004 % have long ago been overpaid?
    RPI Corporation Secretary %
    http://japhy.perlmonk.org/ % -- Meister Eckhart
     
    Jeff 'japhy' Pinyan, Sep 16, 2004
    #10
  11. Jeff 'japhy' Pinyan wrote:
    > On Thu, 16 Sep 2004, Mark Clements wrote:
    >>For the record, \1 is a backreference ie it refers to a previously
    >>matched and captured part of the regexp.
    >>
    >>so
    >>
    >>(["'])([^\1]*)[\1]
    >>
    >>matches " or ', followed by any character other than these zero or more
    >>times, followed by whichever of " and ' was matched the first time.

    >
    >
    > No it doesn't. Character classes are created when the regex is compiled,
    > but \1 is not known until the regex is EXECUTED. Using \1 inside a
    > character class is the same as using \x01 or \001, it's the ASCII
    > character whose ordinal value is 1.

    Thanks - serves me right for not having the good sense to run the code
    in order to check, and I didn't know that about character classes.

    Mark
     
    Mark Clements, Sep 16, 2004
    #11
  12. Jeff 'japhy' Pinyan wrote:
    > On Thu, 16 Sep 2004, Mark Clements wrote:
    >
    >>For the record, \1 is a backreference ie it refers to a previously
    >>matched and captured part of the regexp.
    >>
    >>so
    >>
    >>(["'])([^\1]*)[\1]
    >>
    >>matches " or ', followed by any character other than these zero or more
    >>times, followed by whichever of " and ' was matched the first time.

    >
    > No it doesn't. Character classes are created when the regex is compiled,
    > but \1 is not known until the regex is EXECUTED. Using \1 inside a
    > character class is the same as using \x01 or \001, it's the ASCII
    > character whose ordinal value is 1.


    Oops, don't you hate it when that happens. ;-)
    So how come you can put a variable in a character class and have it work at
    run-time?


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Sep 17, 2004
    #12
  13. On Fri, 17 Sep 2004, John W. Krahn wrote:

    >Jeff 'japhy' Pinyan wrote:
    >> On Thu, 16 Sep 2004, Mark Clements wrote:
    >>
    >>>For the record, \1 is a backreference ie it refers to a previously
    >>>matched and captured part of the regexp.
    >>>
    >>>so
    >>>
    >>>(["'])([^\1]*)[\1]
    >>>
    >>>matches " or ', followed by any character other than these zero or more
    >>>times, followed by whichever of " and ' was matched the first time.

    >>
    >> No it doesn't. Character classes are created when the regex is compiled,
    >> but \1 is not known until the regex is EXECUTED. Using \1 inside a
    >> character class is the same as using \x01 or \001, it's the ASCII
    >> character whose ordinal value is 1.

    >
    >Oops, don't you hate it when that happens. ;-)
    >So how come you can put a variable in a character class and have it work at
    >run-time?


    Because when a variable is in a regex, the regex can't be compiled until
    run-time[1]. That "law" just doesn't hold for backreferences.

    [1] thus the existence of the /o switch which quells more than one
    compilation of a regex with variables in it
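
    For comparison, a minimal sketch of a variable inside a character
    class. It assumes the quote character is already known and stored in a
    hypothetical $quote; the sample string is also invented.

```perl
use strict;
use warnings;

my $s = q{content= 'it is "quoted" here' trailing};

# Assumed for illustration: the quote character is known in advance.
my $quote = q{'};

# $quote is interpolated into the pattern text first, and only then is
# the pattern compiled, so [^$quote] becomes the class [^'].
my ($value) = $s =~ /content=\s*$quote([^$quote]*)$quote/;

print "$value\n";
```

    This only works because $quote has its value before the match starts;
    a capture made earlier in the same match is not available that way.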

    --
    Jeff "japhy" Pinyan % How can we ever be the sold short or
    RPI Acacia Brother #734 % the cheated, we who for every service
    Senior Dean, Fall 2004 % have long ago been overpaid?
    RPI Corporation Secretary %
    http://japhy.perlmonk.org/ % -- Meister Eckhart
     
    Jeff 'japhy' Pinyan, Sep 17, 2004
    #13
  14. Looking wrote:
    >>>$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
    >>>$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
    >>>#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
    >>>print "$s\n";
    >>>
    >>>The second regex works. I wonder why the first regex is not working?

    >>
    >>That is because *, + and ? are greedy and will match as many characters as
    >>possible so (.*) will match everything to the end until the last ", | or '
    >>character. (Why are you trying to match the | character?) You probably
    >>want something like:
    >>
    >>$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;

    >
    > By the way, I assume \1 is same as $1 but on the left side. Your code is not
    > working. It does not match anything. Although, I think your idea is right
    >
    > $s=qq( "sadf content= "this is what i' want " asd " sdf " adfa " sdf' );
    > #$s =~ s/.*content=.*?["'](.*?)["'].*/$1/si;
    > $s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
    > print "$s\n";
    >
    > I hope it can return
    > this is what i' want
    > but yours return
    > "sadf content= "this is what i' want " asd " sdf " adfa " sdf'
    > so, no match.


    Yes, as "Japhy" has pointed out, \1 won't work inside of a character class.
    This should work a lot better. :)

    $s =~ s/.*content=.*?(["'])(.*?)\1.*/$2/si;
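
    A quick check of that pattern against both sample strings from this
    thread, written as a sketch. The helper name content_value is made up,
    and it uses a capturing match instead of s/// so the original string is
    left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the first pair of
# matching quotes after "content=", or undef if there is none.
sub content_value {
    my ($text) = @_;
    return $text =~ /content=.*?(["'])(.*?)\1/si ? $2 : undef;
}

my $first  = content_value(q{sadf content= "this is what i want " asd " sdf " adfa " sdf});
my $second = content_value(q{ "sadf content= "this is what i' want " asd " sdf " adfa " sdf' });

print "[$first]\n";
print "[$second]\n";
```

    In the second string the apostrophe stays inside the captured value,
    because \1 only matches the same kind of quote that opened it. Both
    results keep the space that precedes the closing quote.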



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Sep 17, 2004
    #14
