Regular Expression Question

Discussion in 'Perl Misc' started by Börni, Jan 8, 2009.

  1. Börni

    Börni Guest


    This is probably very easy, but I don't get it.

    #!perl -w
    use strict;

    my $string = '<meta name="Keywords" content="" lang="fr">';

    my ($keywords) = $string =~ /.*?meta name="Keywords".*?content="(.*?)">/;

    print "[$keywords]\n";
    exit 0;

    In the example above I'd expect $keywords to be empty. Instead it is [" lang="fr]

    What is the correct expression to match everything
    <meta name="Keywords" content="-->IN HERE<--" lang="fr">
    even when it's empty?

    Regards Bernard
    Börni, Jan 8, 2009

  2. Börni

    Tim Greer Guest

    In your above code, it is doing exactly what it should. Using your
    current example, make the following change:

    my ($keywords) = $string =~ /^.*?meta

    That captures zero or more characters that are not a double quote, so
    $keywords ends up holding everything between the opening and closing
    double quotes of content="". You could probably just write that as:
    my ($keywords) = $string =~ /^.*?content="([^"]*)"/; if that's what
    you want to stick with.
    Notice I've added the start of the string with ^ in my examples. If
    it's not going to be the start of the string in real code, just adjust
    Tim Greer, Jan 8, 2009
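
The negated character class suggested above can be tried with a small script. This is only a minimal sketch: the first test string is the one from the original post, the second string and the helper name extract_keywords are made up for illustration, and the leading ^.*? is left out because an unanchored match finds the same text.

```perl
#!perl -w
use strict;

# Hypothetical helper: returns the content="..." value of a Keywords
# meta tag, or undef if the pattern does not match.
sub extract_keywords {
    my ($string) = @_;
    # [^"]* cannot cross a double quote, so the capture stops at the
    # closing quote of content="..." no matter what follows it.
    my ($keywords) = $string =~ /meta name="Keywords".*?content="([^"]*)"/;
    return $keywords;
}

# First string is from the original post; the second is an invented
# example with a non-empty content attribute.
for my $string (
    '<meta name="Keywords" content="" lang="fr">',
    '<meta name="Keywords" content="perl, regex" lang="fr">',
) {
    my $keywords = extract_keywords($string);
    print "[$keywords]\n";
}
```

With the empty attribute this prints [], and with the invented second string it prints [perl, regex].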

  3. B> Hi This is probably very easy, but I don't get it.

    That's because you're using regular expressions to parse HTML.

    You will save yourself considerable pain if you use a parser, such as
    HTML::Parser, to parse HTML.

    Charlton Wilbur, Jan 8, 2009
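
For comparison, here is a rough sketch of the parser approach, assuming the HTML::Parser module from CPAN is installed. The helper name keywords_from_html is hypothetical, and the code only looks at the first Keywords meta tag it sees.

```perl
#!perl -w
use strict;
use HTML::Parser ();

# Hypothetical helper: returns the content attribute of the first
# <meta name="Keywords"> tag, or undef if there is none.
sub keywords_from_html {
    my ($html) = @_;
    my $keywords;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # The handler runs for every start tag. With the default
        # settings, tag names and attribute names arrive lower-cased.
        start_h => [
            sub {
                my ($tagname, $attr) = @_;
                return unless $tagname eq 'meta';
                return unless defined $attr->{name}
                    && lc($attr->{name}) eq 'keywords';
                $keywords = $attr->{content} unless defined $keywords;
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;    # signal end of input so pending text is flushed
    return $keywords;
}

my $keywords = keywords_from_html('<meta name="Keywords" content="" lang="fr">');
print "[$keywords]\n";
```

The parser deals with quoting and attribute order itself, so the trailing lang="fr" never gets a chance to leak into the result.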
  4. Börni

    Börni Guest

    Thank you very much for your help everybody! (Of course my problem was the
    ">" character)
    Börni, Jan 9, 2009
  5. Börni

    Tim Greer Guest

    (top posting fixed)

    Actually, the problem wasn't the ">" character. The problem was that
    the match went all the way to the last character, which happened to be
    the > character. The actual problem was that it was grabbing
    everything from the content's opening double quote content=" (.*?) all
    the way to ending ">, which happened to be " lang="fr.
    Tim Greer, Jan 9, 2009
  6. Börni

    Tim McDaniel Guest

    No, he's right: the problem was that '>' was in the regexp. '.*?'
    is non-greedy matching. If the terminal '>' had not been in the
    regexp, it would have stopped at the second ".
    Tim McDaniel, Jan 9, 2009
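
The difference is easy to check side by side. A minimal sketch using the string and pattern from the original post; the variable names $with_gt and $without_gt are only for illustration.

```perl
#!perl -w
use strict;

my $string = '<meta name="Keywords" content="" lang="fr">';

# Original pattern: .*? is non-greedy, but the match still has to end
# at "> and the only "> in the string comes after lang="fr.
my ($with_gt) = $string =~ /.*?meta name="Keywords".*?content="(.*?)">/;

# Same pattern without the trailing >: the capture ends at the next ".
my ($without_gt) = $string =~ /.*?meta name="Keywords".*?content="(.*?)"/;

print "[$with_gt]\n";       # prints [" lang="fr]
print "[$without_gt]\n";    # prints []
```

Non-greedy means "as short as possible while the rest of the pattern still matches", which is why the first capture has to grow until it reaches the final ">.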
  7. Börni

    Tim Greer Guest

    I suppose it's just a matter of wording it. I read it as the OP meaning
    it was the character, rather than the formatting of the regex and the
    location of it. I just think the preferable way would be to match with
    ([^"]*), but I suppose it's up to the individual.
    Tim Greer, Jan 9, 2009