Regexp to match a URL in an HTML <a href=""></a> tag

Discussion in 'Perl' started by Charles Nadeau, Nov 15, 2003.

  1. Hello,

    I am trying to craft a regular expression to extract a URL from an <a
    href=""></a> tag, and the one I have doesn't seem right.
    I use the regular expression from this snippet of code:

    foreach my $message (@messages)
    {
        my @match=($message->decoded=~/\bhref="(http.*)">.*/gi);

        foreach my $match (@match)
        {
            print $match,"\n";
        }
    }

    but it doesn't lead to results that are exactly what I need. An excerpt of
    what I get as an output looks like:

    http://203.197.204.155/mout/
    http://www.superrxsalesman.info/aff1/?mulish
    http://www.superrxsalesman.info/aff1/?acme
    http://www.superrxsalesman.info/aff1/?blister
    http://www.superrxsalesman.info/aff1/?samba
    http://www.superrxsalesman.info/aff1/?depot"><font color="#0033CC
    http://www.superrxsalesman.info/aff1/?procter"><font color="#0033CC
    http://www.superrxsalesman.info/aff1/?use"><font color="#0033CC
    http://www.superrxsalesman.info/aff1/?butane"><font color="#0033CC
    http://www.superrxsalesman.info/aff1/?fiche"><font color="#0033CC

    The first 5 lines are exactly what I want, but I don't understand why the
    following lines include the closing " and the characters after it. Basically,
    I want to keep only what is between the "" of the <a href=""> tag.
    Could anybody tell me what is wrong with my regular expression?
    Thanks!

    Charles

    --
    Charles-E. Nadeau Ph.D
    http://radio.weblogs.com/0111823/
    Charles Nadeau, Nov 15, 2003
    #1

  2. Charles Nadeau wrote:
    > I am trying to craft a regular expression to extract a URL from an
    > <a href=""></a> tag, and the one I have doesn't seem right. I use
    > the regular expression from this snippet of code:
    >
    > foreach my $message (@messages)
    > {
    > my @match=($message->decoded=~/\bhref="(http.*)">.*/gi);
    >
    > foreach my $match(@match)
    > {
    > print $match,"\n";
    > }
    >
    > }
    >
    > but it doesn't lead to results that are exactly what I need.


    http://theoryx5.uwinnipeg.ca/CPAN/perl/pod/perlfaq9/How_do_I_extract_URLs.html

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Nov 15, 2003
    #2

  3. Andy R (Guest)

    "Charles Nadeau" <> wrote in message
    news:bp483h$1gv0$...
    > Hello,
    >
    > I am trying to craft a regular expression to extract a URL from an <a
    > href=""></a> tag, and the one I have doesn't seem right.
    > I use the regular expression from this snippet of code:
    >
    > foreach my $message (@messages)
    > {
    > my @match=($message->decoded=~/\bhref="(http.*)">.*/gi);
    >
    > foreach my $match(@match)
    > {
    > print $match,"\n";
    > }
    >
    > }
    >
    > but it doesn't lead to results that are exactly what I need. An excerpt of
    > what I get as an output looks like:
    >
    > http://203.197.204.155/mout/
    > http://www.superrxsalesman.info/aff1/?mulish
    > http://www.superrxsalesman.info/aff1/?acme
    > http://www.superrxsalesman.info/aff1/?blister
    > http://www.superrxsalesman.info/aff1/?samba
    > http://www.superrxsalesman.info/aff1/?depot"><font color="#0033CC
    > http://www.superrxsalesman.info/aff1/?procter"><font color="#0033CC
    > http://www.superrxsalesman.info/aff1/?use"><font color="#0033CC
    > http://www.superrxsalesman.info/aff1/?butane"><font color="#0033CC
    > http://www.superrxsalesman.info/aff1/?fiche"><font color="#0033CC
    >
    > The first 5 lines are exactly what I want, but I don't understand why the
    > following lines include the closing " and the characters after it.
    > Basically, I want to keep only what is between the "" of the <a href=""> tag.
    > Could anybody tell me what is wrong with my regular expression?
    > Thanks!
    >
    > Charles
    >
    > --
    > Charles-E. Nadeau Ph.D
    > http://radio.weblogs.com/0111823/


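    Your .* is greedy: it matches as much as it can, so the capture runs up to
    the last "> on the line rather than the first. Add a ? after it to make the
    match non-greedy, i.e.:

    my @match=($message->decoded=~/\bhref="(http.*?)">.*/gi);

    That should work, though I haven't tested it.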
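
    To make the difference concrete, here is a minimal sketch comparing the
    greedy pattern from the question, the non-greedy variant, and a negated
    character class ([^"]*), which is a common alternative that cannot run
    past the closing quote. The $html sample string is hypothetical, modelled
    on the output quoted above; it is not taken from the original messages.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample line, modelled on the output quoted in the thread.
    my $html = '<a href="http://www.superrxsalesman.info/aff1/?depot">'
             . '<font color="#0033CC">x</font></a>';

    # Greedy: .* extends to the LAST "> on the line, so the capture overshoots.
    my @greedy = $html =~ /\bhref="(http.*)">.*/gi;

    # Non-greedy: .*? stops at the FIRST "> that lets the match succeed.
    my @lazy = $html =~ /\bhref="(http.*?)">/gi;

    # Negated class: the capture can never contain a double quote at all.
    my @negated = $html =~ /\bhref="(http[^"]*)"/gi;

    print "greedy:  $_\n" for @greedy;
    print "lazy:    $_\n" for @lazy;
    print "negated: $_\n" for @negated;
    ```

    The negated-class form also matches when other attributes follow the
    href, since it does not require "> right after the URL. None of these
    handle single-quoted or unquoted attribute values; for arbitrary HTML an
    HTML parser is likely the more robust route.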

    Andy R
    Andy R, Nov 15, 2003
    #3