Re: Capture only first match in regular expression

Discussion in 'Perl' started by Mike Spencer, Apr 19, 2009.

  1. Mike Spencer

    Mike Spencer Guest

    Zapanaz <http://joecosby.com/code/mail.pl> wrote:

    > I am parsing/page scraping some HTML. I know the first anchor tag <a>
    > contains information I want.
    >
    > So I do this:
    >
    > if($content =~ /.*(<a.*<\/a>).*/i){
    > $anchorContent = $1;


    Another poster suggested that regular expressions aren't sufficient
    for this. But you may be able to do it anyway if you can confidently
    predict features of the incoming HTML.

    That is, if you "know the first anchor tag <a> contains
    information" you want, you may also know other things about the HTML
    you're trying to parse.

    Given an anchor of the general form:

    <a href=foo possible-other-attributes=bar> Anchor-text </a>

    If you know in advance that the "Anchor-text" is *not* an <IMG
    src=...> tag and that the "Anchor-text" does not itself contain any
    other tags (such as, say, "<i>Anchor-text</i>"), then you could use:

    if ($content =~ /(<a\s[^>]+>[^<]*<\/a>)/i)
    {
        $anchorContent = $1;
    }

    Reading the pattern from left to right:

    Match the <a literally (in either case, because of /i)
    Require one whitespace character after the 'a'
    Match one or more characters other than '>', i.e. the attributes
    of the opening <a ...> tag
    Match the closing '>' of the opening <a tag
    Match any text except the '<' that will signal the closing </a> tag
    Match the closing </a> tag
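
The same pattern can be spelled out with Perl's /x modifier, so that each step listed above sits next to the piece of the regex it describes. This is only a sketch: the sample $content string is made up for illustration, and m{...} is used as the delimiter so the slash in </a> needs no escaping.

```perl
use strict;
use warnings;

# Hypothetical sample input; real pages will differ.
my $content = '<p>Intro</p> <a href=foo class=bar> Anchor-text </a> <a href=baz>Second</a>';

my $anchorContent;
if ($content =~ m{
        (           # start of the captured group
          <a        # the opening of the tag, matched literally
          \s        # one whitespace character after the 'a'
          [^>]+     # attributes: one or more characters that are not '>'
          >         # the closing '>' of the opening tag
          [^<]*     # anchor text: anything except '<'
          </a>      # the closing tag
        )
    }xi)
{
    $anchorContent = $1;
}

print "$anchorContent\n" if defined $anchorContent;
```

Because the regex engine reports the leftmost match, the anchor captured here is the first one in the string that fits the pattern, which is the first anchor only if that anchor contains plain text.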

    This won't work if the incoming HTML is arbitrary, because you might have:

    <a href=foo><img src=bar></a> or
    <a href=foo> <i> Yow<b>!</b></i> </a>
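
To see what goes wrong with input like this, the sketch below runs the strict pattern against a made-up string whose first anchor wraps an image. It also tries a relaxed, non-greedy variant, which is not part of the suggestion above but a common alternative. The relaxed version assumes that anchors are never nested and that no attribute value contains a '>'; both are assumptions about the input, not guarantees.

```perl
use strict;
use warnings;

# Hypothetical input: the first anchor wraps an <img>, the second is plain text.
my $content = '<a href=foo><img src=bar></a> <a href=baz>Plain</a>';

# The strict pattern cannot match the first anchor, so it silently
# returns the second one instead of failing.
my ($strict) = $content =~ /(<a\s[^>]+>[^<]*<\/a>)/i;

# Non-greedy variant: ".*?" stops at the earliest "</a>", and /s lets "."
# match newlines as well. Assumes anchors are not nested.
my ($relaxed) = $content =~ /(<a\s[^>]*>.*?<\/a>)/is;

print "strict:  $strict\n";
print "relaxed: $relaxed\n";
```

The strict pattern here returns the second anchor without any warning, so a successful match alone does not prove that the first anchor was captured.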

    I'm no expert but I suspect that to reliably match what you want from
    any arbitrary HTML, you'll have to write a more general parser.

    --
    Mike Spencer Nova Scotia, Canada
     
    Mike Spencer, Apr 19, 2009