Regular Expression Question

Discussion in 'Perl Misc' started by Börni, Jan 8, 2009.

  1. Börni

    Börni Guest

    Hi

    This is probably very easy, but I don't get it.

    Example:
    #!perl -w
    use strict;

    my $string = '<meta name="Keywords" content="" lang="fr">';

    my ($keywords) = $string =~ /.*?meta name="Keywords".*?content="(.*?)">/;

    print "[$keywords]\n";
    exit 0;


    In the example above I'd expect $keywords to be empty. Instead it is
    [" lang="fr].

    What is the correct expression to match everything
    <meta name="Keywords" content="-->IN HERE<--" lang="fr">
    even when it's empty?

    Regards Bernard
    Börni, Jan 8, 2009
    #1

  2. Börni

    Tim Greer Guest

    Börni wrote:

    > Hi
    >
    > This is probably very easy, but I don't get it.
    >
    > Example:
    > #!perl -w
    > use strict;
    >
    > my $string = '<meta name="Keywords" content="" lang="fr">';
    >
    > my ($keywords) = $string =~ /.*?meta name="Keywords".*?content="(.*?)">/;
    >
    > print "[$keywords]\n";
    > exit 0;
    >
    >
    > In the Example above I'd expect $keywords to be empty. Instead it is
    > [" lang="fr].
    >
    > What is the correct expression to match everything
    > <meta name="Keywords" content="-->IN HERE<--" lang="fr">
    > even when it's empty?
    >
    > Regards Bernard


    In your above code, it is doing exactly what it should. Using your
    current example, make the following change:

    my ($keywords) = $string =~ /^.*?meta name="Keywords".*?content="([^"]*)"/;

    The character class [^"]* matches zero or more characters that are not
    a double quote, so $keywords ends up holding everything between the
    opening and closing double quotes of content="...", or the empty
    string when there is nothing between them. You could probably just
    write that as:

    my ($keywords) = $string =~ /^.*?content="([^"]*)"/;

    if that's what you want to stick with. Notice I've added the start of
    the string with ^ in my examples. Because . does not match a newline
    by default, the leading ^.*? only works when the tag is on the first
    line of the string. If that's not going to be the case in real code,
    just adjust accordingly.
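
    As a rough side-by-side sketch of the difference (the variable names
    $lazy and $negated are only for illustration, and the leading ^.*? is
    left out for brevity), the two patterns behave like this on the sample
    string from the question:

```perl
#!perl -w
use strict;

my $string = '<meta name="Keywords" content="" lang="fr">';

# Original pattern: (.*?) must be followed by the literal ">, so it
# keeps expanding until it reaches the quote right before the closing >.
my ($lazy) = $string =~ /meta name="Keywords".*?content="(.*?)">/;

# Negated class: [^"]* cannot cross a double quote, so the capture
# ends at the attribute's own closing quote.
my ($negated) = $string =~ /meta name="Keywords".*?content="([^"]*)"/;

print "lazy:    [$lazy]\n";       # prints: lazy:    [" lang="fr]
print "negated: [$negated]\n";    # prints: negated: []
```

    The negated class does not depend on what follows the attribute, so it
    should keep working if the order of the attributes after content changes.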
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
    Tim Greer, Jan 8, 2009
    #2

  3. >>>>> "B" == Börni <> writes:

    B> Hi This is probably very easy, but I don't get it.

    That's because you're using regular expressions to parse HTML.

    You will save yourself considerable pain if you use a parser, such as
    HTML::Parser, to parse HTML.
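
    A minimal sketch of what that could look like, assuming the CPAN
    module HTML::Parser is installed and using its version 3 handler
    interface (the handler body and the $keywords variable are just one
    way to pull the attribute out, not the only one):

```perl
#!perl -w
use strict;
use HTML::Parser ();    # CPAN module, not part of core Perl

my $string = '<meta name="Keywords" content="" lang="fr">';
my $keywords;

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for every start tag; "attr" is a hash ref of its attributes,
    # with attribute names lower-cased by default.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'meta';
            my $name = $attr->{name};
            return unless defined $name && lc($name) eq 'keywords';
            $keywords = $attr->{content};
        },
        'tagname, attr',
    ],
);

$parser->parse($string);
$parser->eof;

print "[", (defined $keywords ? $keywords : ''), "]\n";
```

    The parser deals with quoting and attribute order itself, so the code
    does not need to care whether lang comes before or after content.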

    Charlton


    --
    Charlton Wilbur
    Charlton Wilbur, Jan 8, 2009
    #3
  4. Börni

    Börni Guest

    Thank you very much for your help everybody! (Of course my problem was the
    ">" character)

    "Börni" <> wrote in message
    news:gk5crn$k5$-plus.net...
    > Hi
    >
    > This is probably very easy, but I don't get it.
    >
    > Example:
    > #!perl -w
    > use strict;
    >
    > my $string = '<meta name="Keywords" content="" lang="fr">';
    >
    > my ($keywords) = $string =~ /.*?meta name="Keywords".*?content="(.*?)">/;
    >
    > print "[$keywords]\n";
    > exit 0;
    >
    >
    > In the Example above I'd expect $keywords to be empty. Instead it is ["
    > lang="fr].
    >
    > What is the correct expression to match everything
    > <meta name="Keywords" content="-->IN HERE<--" lang="fr">
    > even when it's empty?
    >
    > Regards Bernard
    >
    Börni, Jan 9, 2009
    #4
  5. Börni

    Tim Greer Guest

    Börni wrote:

    > Thank you very much for your help everybody! (Of course my problem was
    > the ">" character)


    (top posting fixed)

    Actually, the problem wasn't the ">" character. The problem was that
    the match went all the way to the last character, which happened to be
    the > character. The actual problem was that it was grabbing
    everything from the content's opening double quote content=" (.*?) all
    the way to ending ">, which happened to be " lang="fr.
    Tim Greer, Jan 9, 2009
    #5
  6. Börni

    Tim McDaniel Guest

    In article <bOM9l.2462$>,
    Tim Greer <> wrote:
    >Börni wrote:
    >
    >> Thank you very much for your help everybody! (Of course my problem was
    >> the ">" character)

    >
    >(top posting fixed)
    >
    >Actually, the problem wasn't the ">" character. The problem was that
    >the match went all the way to the last character, which happened to be
    >the > character. The actual problem was that it was grabbing
    >everything from the content's opening double quote content=" (.*?) all
    >the way to ending ">, which happened to be " lang="fr.


    No, he's right: the problem was that '>' was in the regexp.
    .*?
    is non-greedy matching. If the terminal '>' had not been in the
    regexp, it would have stopped at the second ".
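
    A small sketch to illustrate (the $filled string and the variable
    names are made up for the example, they are not from the original
    post):

```perl
#!perl -w
use strict;

my $empty  = '<meta name="Keywords" content="" lang="fr">';
my $filled = '<meta name="Keywords" content="perl, regex" lang="fr">';

# With the trailing > in the pattern, the lazy group has to keep
# growing until it finds a quote that is directly followed by >.
my ($with_gt) = $empty =~ /content="(.*?)">/;

# Without it, the lazy group stops at the first quote it can.
my ($without_gt) = $empty  =~ /content="(.*?)"/;
my ($filled_kw)  = $filled =~ /content="(.*?)"/;

print "[$with_gt]\n";       # prints: [" lang="fr]
print "[$without_gt]\n";    # prints: []
print "[$filled_kw]\n";     # prints: [perl, regex]
```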

    --
    Tim McDaniel,
    Tim McDaniel, Jan 9, 2009
    #6
  7. Börni

    Tim Greer Guest

    Tim McDaniel wrote:

    > In article <bOM9l.2462$>,
    > Tim Greer <> wrote:
    >>Börni wrote:
    >>
    >>> Thank you very much for your help everybody! (Of course my problem
    >>> was the ">" character)

    >>
    >>(top posting fixed)
    >>
    >>Actually, the problem wasn't the ">" character. The problem was that
    >>the match went all the way to the last character, which happened to be
    >>the > character. The actual problem was that it was grabbing
    >>everything from the content's opening double quote content=" (.*?) all
    >>the way to ending ">, which happened to be " lang="fr.

    >
    > No, he's right: the problem was that '>' was in the regexp.
    > .*?
    > is non-greedy matching. If the terminal '>' had not been in the
    > regexp, it would have stopped at the second ".
    >


    I suppose it's just a matter of wording it. I read it as the OP meaning
    it was the character, rather than the formatting of the regex and the
    location of it. I just think the preferable way would be to match with
    ([^"]*), but I suppose it's up to the individual.
    Tim Greer, Jan 9, 2009
    #7