Re: Capture only first match in regular expression

Discussion in 'Perl' started by Jürgen Exner, Apr 12, 2009.

  1. Zapanaz wrote:
    >The answer to this is probably staring me in the face ...
    >I am parsing/page scraping some HTML. I know the first anchor tag <a>
    >contains information I want.
    >So I do this:
    > if($content =~ /.*(<a.*<\/a>).*/i){
    > $anchorContent = $1;
    >This basically works the way I want, it matches an anchor tag and
    >captures the content of it.
    >But there are multiple anchor tags in the HTML. What I want is the
    >first one, but what I get is the last one.

    Drop the .* at the beginning of your RE. It doesn't do you any good:
    being greedy, it eats up as much as it can while still letting the rest
    of the RE match, and that is what pushes your capture to the last anchor.
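
To make the greediness concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not from the original post; both anchors are assumed to sit on one line, since . does not match a newline without the /s modifier.

```perl
use strict;
use warnings;

# Hypothetical sample input: two anchors on a single line.
my $content = '<p><a href="/first">First</a> and <a href="/second">Second</a></p>';

# Original pattern: the leading greedy .* consumes the whole line, then
# backtracks only as far as needed, so the capture starts at the LAST <a.
my ($with_leading) = $content =~ /.*(<a.*<\/a>).*/i;

# Leading and trailing .* dropped: the match now starts at the FIRST <a.
# The inner .* is still greedy, so it runs on to the last </a> on the line.
my ($without_leading) = $content =~ /(<a.*<\/a>)/i;

print "with leading .*:    $with_leading\n";
print "without leading .*: $without_leading\n";
```

With this input, dropping the leading .* moves the start of the capture to the first anchor, but the capture only ends at that anchor's own closing tag if no later </a> follows on the same line.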

    Having said that, unless your HTML is in some fixed format you really,
    really should be using an HTML parser to parse HTML. HTML is not a
    regular language and therefore cannot be parsed using pure regular
    expressions.
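
As a rough sketch of the parser route, the following assumes the HTML::TokeParser module (part of the HTML-Parser distribution on CPAN, not core Perl) is installed. The reply does not name a particular parser, so this choice, the sample string and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # assumed to be installed from CPAN

# Hypothetical sample input.
my $content = '<p><a href="/first">First</a> and <a href="/second">Second</a></p>';

# A reference to a scalar is treated as the document text itself.
my $parser = HTML::TokeParser->new(\$content)
    or die "Cannot create parser";

my ($href, $text);

# get_tag('a') skips ahead to the first <a> start tag and returns
# an array ref: [tag name, attribute hash, attribute order, original text].
if (my $tag = $parser->get_tag('a')) {
    $href = $tag->[1]{href};
    # Collect the text up to the matching </a>, with whitespace trimmed.
    $text = $parser->get_trimmed_text('/a');
}

print "href: $href\n" if defined $href;
print "text: $text\n" if defined $text;
```

Because the parser stops at the first <a> it encounters, there is no greediness to work around, and attributes and line breaks inside the tag are handled for you.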

    >I think I should be using one of these
    >* Match 0 or more times
    >+ Match 1 or more times
    >? Match 1 or 0 times
    >{n} Match exactly n times
    >{n,} Match at least n times
    >{n,m} Match at least n but not more than m times

    If anything, you could use ? to make the * non-greedy, as in .*?, but
    at the start of the RE that is pointless: an unanchored RE can already
    start matching anywhere, so a leading .*? adds nothing.
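
For completeness, a small sketch of where the non-greedy form does make a difference, namely inside the capture rather than in front of it. The sample string and variable names are again invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input: two anchors on a single line.
my $content = '<p><a href="/first">First</a> and <a href="/second">Second</a></p>';

# Non-greedy .*? inside the capture stops at the first </a> it reaches.
my ($first) = $content =~ /(<a.*?<\/a>)/i;

# A leading .*? is legal but redundant: it yields the same capture.
my ($same) = $content =~ /.*?(<a.*?<\/a>)/i;

print "$first\n";
print "$same\n";
```

Two caveats apply to this sketch: a bare <a also matches the start of tags such as <abbr>, so <a\b is a safer prefix, and the /s modifier is needed if an anchor spans more than one line.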

    Jürgen Exner, Apr 12, 2009