Regular Expression Question

Discussion in 'Perl Misc' started by Börni, Jan 8, 2009.

  1. Börni

    Börni Guest

    Hi

    This is probably very easy, but I don't get it.

    Example:
    #!perl -w
    use strict;

    my $string = '<meta name="Keywords" content="" lang="fr">';

    my ($keywords) = $string =~ /.*?meta name="Keywords".*?content="(.*?)">/;

    print "[$keywords]\n";
    exit 0;


    In the example above I'd expect $keywords to be empty. Instead it is
    [" lang="fr].

    What is the correct expression to match everything
    <meta name="Keywords" content="-->IN HERE<--" lang="fr">
    even when it's empty?

    Regards Bernard
     
    Börni, Jan 8, 2009
    #1

  2. Börni

    Tim Greer Guest

    In your above code, it is doing exactly what it should. Using your
    current example, make the following change:

    my ($keywords) = $string =~ /^.*?meta name="Keywords".*?content="([^"]*)"/;

    [^"]* matches zero or more characters that are not a double quote, so
    $keywords will be whatever sits between the opening and closing double
    quotes of content="", even when that is nothing. You could probably
    just write that as:

    my ($keywords) = $string =~ /^.*?content="([^"]*)"/;

    if that's what you want to stick with. Notice I've anchored my examples
    to the start of the string with ^. If the tag is not going to be at the
    start of the string in real code, just adjust accordingly.
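
    As a minimal sketch of the negated character class in use (the helper
    name keywords_of and the second test string are made up for
    illustration, and the ^ anchor is left out so the tag may appear
    anywhere in the string):

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: returns the content attribute
# of a Keywords meta tag, or undef if the pattern does not match.
sub keywords_of {
    my ($html) = @_;
    # [^"]* cannot cross a double quote, so the capture always ends at the
    # attribute's own closing quote.
    my ($keywords) = $html =~ /meta name="Keywords".*?content="([^"]*)"/;
    return $keywords;
}

my $empty  = keywords_of('<meta name="Keywords" content="" lang="fr">');
my $filled = keywords_of('<meta name="Keywords" content="perl, regex" lang="fr">');

print "[$empty]\n";     # prints []
print "[$filled]\n";    # prints [perl, regex]
```

    An empty attribute gives an empty string, not undef, so the two cases
    can still be told apart from a failed match.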
     
    Tim Greer, Jan 8, 2009
    #2

  3. Börni wrote:

    > Hi. This is probably very easy, but I don't get it.

    That's because you're using regular expressions to parse HTML.

    You will save yourself considerable pain if you use a parser, such as
    HTML::Parser, to parse HTML.
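
    A rough sketch of what that could look like with HTML::Parser's
    version 3 handler interface, assuming the module is installed from
    CPAN. The case-insensitive comparison on the name attribute is an
    assumption of this sketch, not something the original question asked
    for:

```perl
use strict;
use warnings;
use HTML::Parser ();

my $string = '<meta name="Keywords" content="" lang="fr">';
my $keywords;

my $parser = HTML::Parser->new(
    api_version => 3,
    # start_h is called for every start tag; the argspec string asks for
    # the tag name and a hash reference of the tag's attributes.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'meta';
            return unless defined($attr->{name})
                       && lc($attr->{name}) eq 'keywords';
            $keywords = $attr->{content};
        },
        'tagname, attr',
    ],
);

$parser->parse($string);
$parser->eof;

print "[$keywords]\n";
```

    The parser handles attribute order, quoting style and whitespace, so
    the code no longer depends on where content= sits inside the tag.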

    Charlton
     
    Charlton Wilbur, Jan 8, 2009
    #3
  4. Börni

    Börni Guest

    Thank you very much for your help everybody! (Of course my problem was the
    ">" character)
     
    Börni, Jan 9, 2009
    #4
  5. Börni

    Tim Greer Guest

    (top posting fixed)

    Actually, the problem wasn't the ">" character itself. The problem was
    that the match had to run all the way to the last character, which
    happened to be the > character. (.*?) grabbed everything after
    content=" up to the closing ">, so the capture came out as
    " lang="fr.
     
    Tim Greer, Jan 9, 2009
    #5
  6. Börni

    Tim McDaniel Guest

    No, he's right: the problem was that '>' was in the regexp.
    .*?
    is non-greedy matching. If the terminal '>' had not been in the
    regexp, it would have stopped at the second ".
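
    To make the difference concrete, here is a small sketch that runs the
    original pattern with and without the trailing '>' against the string
    from the first post (the variable names $with_gt and $without_gt are
    made up for this example):

```perl
use strict;
use warnings;

my $string = '<meta name="Keywords" content="" lang="fr">';

# Original pattern: the capture must be followed by "> and the only ">
# in the string is at the very end, so .*? keeps extending until there.
my ($with_gt) = $string =~ /meta name="Keywords".*?content="(.*?)">/;

# Same pattern without the trailing >: the first " after content=" is
# enough to finish the match, so the capture stays empty.
my ($without_gt) = $string =~ /meta name="Keywords".*?content="(.*?)"/;

print "[$with_gt]\n";       # prints [" lang="fr]
print "[$without_gt]\n";    # prints []
```

    Non-greedy only means "as short as possible while the rest of the
    pattern still matches", which is why the first capture runs long.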
     
    Tim McDaniel, Jan 9, 2009
    #6
  7. Börni

    Tim Greer Guest

    I suppose it's just a matter of wording it. I read it as the OP meaning
    it was the character, rather than the formatting of the regex and the
    location of it. I just think the preferable way would be to match with
    ([^"]*), but I suppose it's up to the individual.
     
    Tim Greer, Jan 9, 2009
    #7