Re: Capture only first match in regular expression

Discussion in 'Perl' started by Mike Spencer, Apr 19, 2009.

  1. Mike Spencer

    Mike Spencer Guest

    Another poster suggested that regular expressions aren't sufficient
    for this. But you may be able to do it anyway if you can confidently
    predict features of the incoming HTML.

    That is, if you "know the first anchor tag <a> contains information"
    that you want, you may also know other things about the HTML you're
    trying to parse.

    Given an anchor of the general form:

    <a href=foo possible-other-attributes=bar> Anchor-text </a>

    If you know in advance that the "Anchor-text" is *not* an <IMG
    src=...> tag and that the "Anchor-text" does not itself contain any
    other tags (such as, say, "<i>Anchor-text</i>") then you could use:

    if ($content =~ /(<a\s[^>]+>[^<]*<\/a>)/i) {
        $anchorContent = $1;
    }

    Reading the pattern from left to right:

    Match the '<a' literally
    Require one whitespace character after the 'a'
    Match one or more characters other than '>', i.e. anything that
      can occur within an opening <A...> tag
    Match the closing '>' of the opening <a tag
    Match any text except the '<' that will signal the closing </a> tag
    Match the closing </a> tag
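
    For what it's worth, the same pattern can be written with the /x
    modifier so that each step sits next to the piece of regex it
    describes. This is only a sketch: the sample $content and the helper
    name first_simple_anchor are made up for illustration. Because there
    is no /g, the match stops at the first anchor that fits.

```perl
use strict;
use warnings;

# Hypothetical sample input; only the first anchor should be captured.
my $content = '<p>See <a href="one.html" class="x">First link</a> and '
            . '<a href="two.html">Second link</a>.</p>';

# Hypothetical helper: returns the first "simple" anchor, or undef.
sub first_simple_anchor {
    my ($html) = @_;
    if ($html =~ m{
            (            # capture the whole anchor
              <a \s      # literal "<a" plus one whitespace character
              [^>]+      # attributes: one or more characters other than ">"
              >          # end of the opening tag
              [^<]*      # anchor text that contains no tags
              </a>       # the closing tag
            )
        }xi) {
        return $1;
    }
    return;
}

my $anchorContent = first_simple_anchor($content);
print "$anchorContent\n" if defined $anchorContent;
# prints: <a href="one.html" class="x">First link</a>
```

    Using m{...} as the delimiter also means the slash in </a> no longer
    needs a backslash.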

    This won't work on arbitrary HTML, because you might have:

    <a href=foo><img src=bar></a> or
    <a href=foo> <i> Yow<b>!</b></i> </a>
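
    A quick way to see this is to run the pattern over the two anchors
    above. The sketch below does that and adds a third, made-up sample
    (the helper name match_anchor is also mine) to show a quieter
    problem: when a simple anchor follows a nested one, the regex does
    not fail, it just skips ahead and returns the later anchor instead
    of the first.

```perl
use strict;
use warnings;

# The same pattern as above, precompiled.
my $re = qr/(<a\s[^>]+>[^<]*<\/a>)/i;

my @samples = (
    '<a href=foo><img src=bar></a>',
    '<a href=foo> <i> Yow<b>!</b></i> </a>',
    # Hypothetical extra case: nested anchor first, simple anchor second.
    '<a href=foo><img src=bar></a> <a href=baz>Plain</a>',
);

# Hypothetical helper: returns the captured anchor, or undef on no match.
sub match_anchor {
    my ($html) = @_;
    return $html =~ $re ? $1 : undef;
}

for my $html (@samples) {
    my $got = match_anchor($html);
    print defined $got ? "matched: $got\n" : "no match: $html\n";
}
# prints:
# no match: <a href=foo><img src=bar></a>
# no match: <a href=foo> <i> Yow<b>!</b></i> </a>
# matched: <a href=baz>Plain</a>
```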

    I'm no expert but I suspect that to reliably match what you want from
    any arbitrary HTML, you'll have to write a more general parser.
    Mike Spencer, Apr 19, 2009