Link Matching

Discussion in 'Perl Misc' started by Taras_96, May 5, 2007.

  1. Taras_96

    Taras_96 Guest

    Hi everyone,

    I need to write a regex that parses some HTML text to output all links
    whose text (the text that appears on the screen) matches a given expression.

    eg: findLinks(html,'(.*)o(.*)') called on the html code

    <a>one</a>
    <a>three</a>
    <a>two</a>

    Should return two matches, <a>one</a> and <a>two</a>

    I'm a bit new to regexes. At the moment I have:

    '/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

    (I'm only interested in tags that have an href attribute)

    which greedily matches the entire input string.

    How do I make the </a> match non greedy? I've read that (.*?)<\/a>
    makes the match non greedy, but this doesn't account for the form of
    the link text.

    Thanks

    Taras
    Taras_96, May 5, 2007
    #1

  2. Taras_96 wrote:
    > Hi everyone,
    >
    > I need to write a regex that parses some HTML text


    Bad idea. See "perldoc -q HTML":

        How do I remove HTML from a string?

    and the gazillions of previous articles on this topic for why, and what
    to do instead.

    jue
    Jürgen Exner, May 5, 2007
    #2

  3. brian d foy

    brian d foy Guest

    In article <>,
    Taras_96 <> wrote:


    > I need to write a regex that parses some HTML text to output all links
    > whose text (the text that appears on the screen) a given expression.
    >
    > eg: findLinks(html,'(.*)o(.*)') called on the html code


    I think you want HTML::LinkExtractor

    http://search.cpan.org/dist/HTML-LinkExtractor/

    brian d foy, May 5, 2007
    #3
  4. Taras_96 wrote:
    >
    > '/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'
    >
    > (I'm only interested with tags that have a href attribute)
    >
    > which greedily matches the entire input string.
    >
    > How do I make the </a> match non greedy? I've read that (.*?)<\/a>
    > makes the match non greedy, but this doesn't account for the form of
    > the link text.


    Really? Even if a regex would be sufficient for the task you are trying
    to accomplish, I'm not convinced. Can you demonstrate your claim with
    some runnable example code?

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, May 5, 2007
    #4
  5. Petr Vileta <> wrote:

    > $page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
    > print "$1\n";



    You should never use the dollar-digit variables unless
    you have first ensured that the pattern match _succeeded_.
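
    A minimal sketch of that rule, using made-up sample strings that are not
    from this thread: test the match in a condition, and read $1 only in the
    branch where the match succeeded. After a failed match, $1 keeps whatever
    the last successful match left in it.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical input for illustration: the second string has no link.
my @pages = (
    '<a href="http://www.example.com">link</a>',
    '<html></html>',
);

my @found;
for my $page (@pages) {
    # Only read $1 inside the branch where the match succeeded.
    if ($page =~ m{<a\s[^>]*href=[^>]*>(.+?)</a>}si) {
        push @found, $1;
    }
    else {
        push @found, '(no link)';
    }
}

print "$_\n" for @found;
```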


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 5, 2007
    #5
  6. Petr Vileta <> wrote:

    > Assume that variable $page contain html code.



    OK.


    --------------------------
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $page = '<html></html>';

    while ($page =~ m/<a\s+.*?href=.+?>/sig)
    {
    $page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
    print "$1\n";
    }
    --------------------------


    That link-finding program does exactly what it is supposed to do.

    If you had some other data in mind, then you need to share
    that with us, we are not mind readers.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 5, 2007
    #6
  7. Petr Vileta <> wrote:

    > --------------------------
    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    >
    > my $page = "<html><body>\nClick to this <a
    > href=\"http://www.google.com\">link</a>\n</body></html>";
    >
    > while ($page =~ m/<a\s+.*?href=.+?>/sig)
    > {
    > $page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
    > print "$1\n";
    > }
    > --------------------------
    > then print "link" as Taras_96 need.
    > Where is the problem?



    -------------------------
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $page = '<html><body>\nClick to this
    <a href = "http://www.google.com">link2</a>
    <a href="http://www.google.com">link3</a >
    <a href=\"http://www.google.com\">link</a>
    <!--
    <a href="http://www.google.com">Not A Link!</a>
    -->
    <a href="http://www.google.com" name="<<cool link!>>">link4</a>
    <a name="href=stuff">No href here!</a>
    </body></html>
    ';

    while ($page =~ m/<a\s+.*?href=.+?>/sig)
    {
    $page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
    print "(($1))\n\n";
    }
    -------------------------

    output:

    ((link3</a >
    <a href=\"http://www.google.com\">link))

    ((Not A Link!))

    ((>">link4))

    ((No href here!))


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 6, 2007
    #7
  8. Xicheng Jia

    Xicheng Jia Guest

    On May 4, 11:08 pm, Taras_96 <> wrote:
    > Hi everyone,
    >
    > I need to write a regex that parses some HTML text to output all links
    > whose text (the text that appears on the screen) a given expression.
    >
    > eg: findLinks(html,'(.*)o(.*)') called on the html code
    >
    > <a>one</a>
    > <a>three</a>
    > <a>two</a>
    >
    > Should return two matches, <a>one</a> and <a>two</a>
    >
    > I'm a bit new with regexs. At the moment I have:
    >
    > '/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'
    >
    > (I'm only interested with tags that have a href attribute)
    >
    > which greedily matches the entire input string.
    >
    > How do I make the </a> match non greedy? I've read that (.*?)<\/a>
    > makes the match non greedy, but this doesn't account for the form of
    > the link text.


    Here is one regex way:

    sub findlinks
    {
    my ($html, $ptn) = @_;
    while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
    my $ret = $1;
    (my $content = $2) =~ s/<.*?>//g; #remove embedded tags
    print $ret if $content =~ /\Q$ptn/;
    # if $ptn is plain text, switch to index(), which returns -1 on no match
    # print $ret if index($content, $ptn) >= 0;
    }
    }

    $html = <<END_HTML;
    <a href="bbb">one</a> nnn
    sgfdh <a href="aa">three</a>
    dfgdg <a>two</a> 000
    dfgdg <a href="ttoo">two
    </a> ooo
    END_HTML

    findlinks($html, "o");

    __END__

    Regards,
    Xicheng
    Xicheng Jia, May 7, 2007
    #8
  9. Xicheng Jia

    Xicheng Jia Guest

    On May 7, 11:05 am, Xicheng Jia <> wrote:
    > On May 4, 11:08 pm, Taras_96 <> wrote:
    >
    >
    >
    > > Hi everyone,

    >
    > > I need to write a regex that parses some HTML text to output all links
    > > whose text (the text that appears on the screen) a given expression.

    >
    > > eg: findLinks(html,'(.*)o(.*)') called on the html code

    >
    > > <a>one</a>
    > > <a>three</a>
    > > <a>two</a>

    >
    > > Should return two matches, <a>one</a> and <a>two</a>

    >
    > > I'm a bit new with regexs. At the moment I have:

    >
    > > '/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

    >
    > > (I'm only interested with tags that have a href attribute)

    >
    > > which greedily matches the entire input string.

    >
    > > How do I make the </a> match non greedy? I've read that (.*?)<\/a>
    > > makes the match non greedy, but this doesn't account for the form of
    > > the link text.

    >
    > Here is one regex way:
    >
    > sub findlinks
    > {
    > my ($html, $ptn) = @_;

    change
    while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
    to:
    while($html =~ m{( <a (?=[^<>]*href) .*?> (.*?) </a> )}gsix) {

    or change:
    (my $content = $2) =~ s/<.*?>//g;
    to
    (my $content = $2) =~ s/^[^>]*>|<.*?>//g;

    I forgot to close the opening link tag.
    BTW, this may not work for some ill-formatted XHTML documents, which
    are widespread on the web, and it might also wrongly check
    the contents of commented-out elements.
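
    For reference, a sketch of the sub with the first change applied. Two
    things here are assumptions for illustration and not part of the original
    post: the sub collects the matches and returns them instead of printing,
    and the sample data lives in a lexical variable so the script runs under
    strict. The caveats above still apply.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Sketch: findlinks with the first fix applied (the opening tag is now
# closed by ".*?>" before the link text is captured).
# Assumption for illustration: matches are returned as a list, not printed.
sub findlinks {
    my ($html, $ptn) = @_;
    my @hits;
    while ($html =~ m{( <a (?=[^<>]*href) .*?> (.*?) </a> )}gsix) {
        my $ret = $1;
        (my $content = $2) =~ s/<.*?>//g;    # remove embedded tags
        push @hits, $ret if $content =~ /\Q$ptn/;
    }
    return @hits;
}

my $html = <<'END_HTML';
<a href="bbb">one</a> nnn
sgfdh <a href="aa">three</a>
dfgdg <a>two</a> 000
dfgdg <a href="ttoo">two
</a> ooo
END_HTML

my @links = findlinks($html, "o");
print "$_\n" for @links;
```

    With this data the href-less <a>two</a> is skipped, and the link whose
    text spans two lines is still found because of the /s modifier.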

    Regards,
    Xicheng

    > my $ret = $1;
    > (my $content = $2) =~ s/<.*?>//g; #remove embedded tags
    > print $ret if $content =~ /\Q$ptn/;
    > # if $ptn is plain text, switch to index()
    > # print $ret if index($content, $ptn) > 0;
    > }
    >
    > }
    >
    > $html = <<END_HTML;
    > <a href="bbb">one</a> nnn
    > sgfdh <a href="aa">three</a>
    > dfgdg <a>two</a> 000
    > dfgdg <a href="ttoo">two
    > </a> ooo
    > END_HTML
    >
    > findlinks($html, "o");
    >
    > __END__
    >
    > Regards,
    > Xicheng
    Xicheng Jia, May 7, 2007
    #9
  10. Xicheng Jia <> wrote:
    > On May 4, 11:08 pm, Taras_96 <> wrote:


    >> I need to write a regex that parses some HTML text



    > Here is one regex way:



    So let's rephrase that in more honest terms.

    Here is a way that appears to work often, but will sometimes match
    things that it shouldn't match, and at other times will not match
    things that it should have matched.

    (If you want one way that always gets it right, then you need
    a Real Parser.
    )

    > sub findlinks
    > {
    > my ($html, $ptn) = @_;
    > while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
    > my $ret = $1;
    > (my $content = $2) =~ s/<.*?>//g; #remove embedded tags
    > print $ret if $content =~ /\Q$ptn/;
    > # if $ptn is plain text, switch to index()
    > # print $ret if index($content, $ptn) > 0;
    > }
    > }



    Try it with this data:

    $html = <<END_HTML;
    <p>
    If b<a then href="bbb"
    </p> Don't report me!
    <a href="ttoo">two
    </a>
    <a href="homer>Report me. I am a link!</a
    >

    END_HTML


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 8, 2007
    #10
  11. Xicheng Jia

    Xicheng Jia Guest

    On May 7, 8:10 pm, Tad McClellan <> wrote:
    > Xicheng Jia <> wrote:
    > > On May 4, 11:08 pm, Taras_96 <> wrote:
    > >> I need to write a regex that parses some HTML text

    > > Here is one regex way:

    >
    > So let's rephrase that in more honest terms.
    >
    > Here is a way that appears to work often, but will sometimes match
    > things that it shouldn't match, and at other times will not match
    > things that it should have matched.
    >
    > (If you want one way that always gets it right, then you need
    > a Real Parser.
    > )
    >
    > > sub findlinks
    > > {
    > > my ($html, $ptn) = @_;
    > > while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
    > > my $ret = $1;
    > > (my $content = $2) =~ s/<.*?>//g; #remove embedded tags
    > > print $ret if $content =~ /\Q$ptn/;
    > > # if $ptn is plain text, switch to index()
    > > # print $ret if index($content, $ptn) > 0;
    > > }
    > > }

    >
    > Try it with this data:
    >
    > $html = <<END_HTML;
    > <p>
    > If b<a then href="bbb"


    That is ill-formatted HTML and won't pass the W3C XHTML validator. I
    mentioned this in my previous post, and I never said the code can do
    everything. But when one knows what the text contains and how it is
    laid out, CPAN modules are not the only tools that can solve the problem.

    BTW. you could actually come up with some better samples to invalidate
    my code, like:

    <a onmouseover="window.location.href = whatever" .....> ...... </a>

    but that's easy to fix.
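
    One possible tightening, offered as a sketch and not necessarily the fix
    the poster had in mind: inside the lookahead, require href to look like an
    attribute name (whitespace before it, = after it). The sample strings are
    made up, and an attribute value that itself contains " href=" would still
    fool this pattern.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Made-up sample: the first tag only mentions "href" inside a script value.
my $html = join "\n",
    '<a onmouseover="window.location.href = whatever">not a link</a>',
    '<a class="c" href="http://www.example.com">real link</a>';

# \s before and \s*= after: "href" must look like an attribute name.
my @texts = $html =~ m{<a (?=[^<>]*\shref\s*=) .*?> (.*?) </a>}gsix;

print "$_\n" for @texts;
```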

    Regards,
    Xicheng
    Xicheng Jia, May 8, 2007
    #11
  12. Xicheng Jia <> wrote:
    > On May 7, 8:10 pm, Tad McClellan <> wrote:
    >> Xicheng Jia <> wrote:
    >> > On May 4, 11:08 pm, Taras_96 <> wrote:


    >> >> I need to write a regex that parses some HTML text

    ^^^^^^^^^
    ^^^^^^^^^

    >> Here is a way that appears to work often, but will sometimes match
    >> things that it shouldn't match, and at other times will not match
    >> things that it should have matched.



    >> $html = <<END_HTML;
    >> <p>
    >> If b<a then href="bbb"

    >
    > That is ill-formatted HTML



    It is perfectly valid HTML.


    > and won't pass the W3C XHTML validator,



    That's because XHTML is not the same language as HTML.

    The OP was not asking about that language, he was asking about HTML.


    > I've
    > mentioned in my previous post



    I had not seen it yet, and the disclaimers should be in the
    same post where the disclaimed code is.

    Otherwise people might take the code seriously.


    > BTW. you could actually come up with some better samples to invalidate
    > my code,



    Yes I could.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 8, 2007
    #12
  13. Taras_96

    Taras_96 Guest

    I thought I posted this earlier, but it seems to have been lost?!

    OK everyone, forget that it's HTML we're parsing.

    How would I make a regex that would return from:

    <open>one</close><open>two</close>

    Two matches, 'one' and 'two', and not the single match
    'one</close><open>two'?

    Taras
    Taras_96, May 8, 2007
    #13
  14. Taras_96 <> wrote:

    > OK everyone, forget that it's HTML we're parsing.



    You can do that if you want to match one particular string.

    If you want code that will work on different data, we would
    need to understand what could be different...


    > How would I make a regex that would return from:
    >
    ><open>one</close><open>two</close>
    >
    > Two matches, 'one', and 'two',



    --------------------
    #!/usr/bin/perl
    use warnings;
    use strict;

    $_ = '<open>one</close><open>two</close>';

    my @x = /(one|two)/g;
    print join(',', @x), "\n";

    @x = />(...)</g;
    print join(',', @x), "\n";

    @x = /\b(\w{3})\b/g;
    print join(',', @x), "\n";

    @x = />([^<]+)/g;
    print join(',', @x), "\n";

    @x = m#<open>(.*?)</close>#g;
    print join(',', @x), "\n";
    --------------------


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 8, 2007
    #14
  16. On 2007-05-08 01:21, Tad McClellan <> wrote:
    > Xicheng Jia <> wrote:
    >> On May 7, 8:10 pm, Tad McClellan <> wrote:
    >>> $html = <<END_HTML;
    >>> <p>
    >>> If b<a then href="bbb"
    >>> </p>
    >>
    >> That is ill-formatted HTML
    >
    > It is perfectly valid HTML.


    I don't think so. "<a " looks like the start of an "a" tag, but the rest
    of it isn't well-formed ("then" is not an attribute of "a", and an
    unquoted "</" is syntactically wrong). I don't have the syntax rules for
    SGML at hand, but I doubt that they require backtracking to the "<" and
    reinterpreting it as a literal "<" instead of the start of a tag.

    Adding two spaces makes it valid:

    If b < a then href="bbb"

    hp


    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
    Peter J. Holzer, May 8, 2007
    #16
  17. Taras_96

    Taras_96 Guest

    >
    > You can do that if you want to match one particular string.
    >
    > If you want code that will work on different data, we would
    > need to understand what could be different...
    >


    Should have explained my question a bit more (I thought with the
    previous discussion about HTML it would be clear).

    Part 1)

    How do I construct a regex that matches any text that is in between
    <open> and </close> strings, but the *shortest* (non-greedy matching)
    such string?

    So in the above example, the strings 'one' and 'two' can be
    theoretically anything.

    Part 2)

    Once we have the non-greedy matching, how can I construct a regex that
    would return any text in between <open> and </close>, but the text in
    between the tags must itself match a regex?

    eg: a search for o(.)* would return 'one' using my previous example,
    but not 'two'.
    Taras_96, May 9, 2007
    #17
  18. Taras_96 <> wrote:


    [ Please provide a proper attribution when you quote someone,
    like everybody else does...
    ]


    >> You can do that if you want to match one particular string.
    >>
    >> If you want code that will work on different data, we would
    >> need to understand what could be different...
    >>

    >
    > Should have explained my question a bit more



    I was hoping you'd say that after seeing my post. :)


    > (I thought with the
    > previous discussion about HTML it would be clear).



    Errr, so when you said:

    forget that it's HTML we're parsing

    we weren't really supposed to do that?


    > Part 1)
    >
    > How do I construct a regex that matches any text that is in between
    ><open> and </close> strings, but the *shortest* (non-greedy matching)
    > such string?



    One of the ways I did it in the code that I gave you meets
    that spec. Did you read and understand that code?


    Or, since your Question is Asked Frequently:

    perldoc -q greedy

    What does it mean that regexes are greedy? How can I get around it?
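
    A short illustration of what that FAQ entry covers, using the string from
    the question. The variable names are only for this sketch.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $text = '<open>one</close><open>two</close>';

# Greedy: .* grabs as much as it can, so the match runs to the LAST </close>.
my @greedy = $text =~ m#<open>(.*)</close>#g;

# Non-greedy: .*? stops at the first </close> that lets the match succeed.
my @lazy = $text =~ m#<open>(.*?)</close>#g;

print "greedy: ", join(',', @greedy), "\n";
print "lazy:   ", join(',', @lazy),   "\n";
```

    The greedy version returns the single string one</close><open>two, while
    the non-greedy version returns one and two separately.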


    > So in the above example,



    There is no "above example".

    If you want to discuss a piece of code, then please quote the piece
    of code, like everybody else does...


    > Part 2)
    >
    > Once we have the non-greedy matching, how can I construct a regex that
    > would return any text in between <open> and </close>, but the text in
    > between the tags must itself match a regex?



    That is where managing the greediness will become difficult.

    I'd stick with finding the delimiters first, and then applying
    your regex to the list it returns:

    my $inner_pat = 'o'; # lower case oh
    my @x = grep /$inner_pat/, m#<open>(.*?)</close>#isg;
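
    A runnable sketch of that two-step approach. The sample string and the
    anchored inner pattern ^o are assumptions chosen for illustration, so
    that only 'one' survives the filter.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my $text = '<open>one</close><open>three</close><open>two</close>';
my $inner_pat = qr/^o/;    # hypothetical inner pattern: text starting with "o"

# Step 1: pull out every shortest <open>...</close> body.
my @bodies = $text =~ m#<open>(.*?)</close>#isg;

# Step 2: keep only the bodies that match the inner pattern.
my @wanted = grep { $_ =~ $inner_pat } @bodies;

print join(',', @wanted), "\n";
```

    Keeping the delimiter match and the content match separate means the
    inner pattern can be as greedy as it likes without running past </close>.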


    > eg: a search for o(.)* would return 'one' using my previous example,
    > but not 'two'.

    ^^^^^^^^^^^^^

    Why not?

    This program makes output when it matches that pattern:

    -------------------
    #!/usr/bin/perl
    use warnings;
    use strict;

    $_ = 'two';
    print "matched\n" if /o(.)*/;
    -------------------


    Your regex will match the same strings as /o/
    and it will fail to match the same strings.

    Did you perhaps mean /o(.)+/ instead?
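
    To see the difference, a small sketch (the word list is an assumption
    for illustration): /o/ and /o(.)*/ accept exactly the same words, while
    /o(.)+/ also needs at least one character after the "o".

```perl
#!/usr/bin/perl
use warnings;
use strict;

my @words = qw(one two three);

my @plain = grep { /o/ }     @words;    # any "o" anywhere
my @star  = grep { /o(.)*/ } @words;    # "o" then zero or more characters
my @plus  = grep { /o(.)+/ } @words;    # "o" then at least one character

print "o     : @plain\n";
print "o(.)* : @star\n";
print "o(.)+ : @plus\n";
```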


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 10, 2007
    #18
