Link Matching

Taras_96 · May 5, 2007

Hi everyone,

I need to write a regex that parses some HTML text to output all links
whose text (the text that appears on the screen) matches a given expression.

eg: findLinks(html,'(.*)o(.*)') called on the html code

<a>one</a>
<a>three</a>
<a>two</a>

Should return two matches, <a>one</a> and <a>two</a>

I'm fairly new to regexes. At the moment I have:

'/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

(I'm only interested in tags that have an href attribute)

which greedily matches the entire input string.

How do I make the </a> match non-greedy? I've read that (.*?)<\/a>
makes the match non-greedy, but this doesn't account for the form of
the link text.

Thanks

Taras

Jürgen Exner · May 5, 2007

Taras_96 said:
Hi everyone,

I need to write a regex that parses some HTML text

Bad idea. See "perldoc -q HTML"
How do I remove HTML from a string?
and the gazillions of previous articles on this topic, which explain why
and what to do instead.

jue

brian d foy · May 5, 2007

I need to write a regex that parses some HTML text to output all links
whose text (the text that appears on the screen) a given expression.

eg: findLinks(html,'(.*)o(.*)') called on the html code

I think you want HTML::LinkExtractor

http://search.cpan.org/dist/HTML-LinkExtractor/
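
That module's interface isn't shown in this thread. As a related sketch of the parser-based approach, here is the same idea with HTML::TokeParser, a different module from the HTML-Parser distribution on CPAN (assumed to be installed). The find_links() helper and the sample markup are made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TokeParser;    # from the HTML-Parser distribution (assumed installed)

# find_links() is a hypothetical helper, not part of any module.
# It returns the visible text of every <a href=...> whose text matches $re.
sub find_links {
    my ($html, $re) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot parse input";
    my @found;
    while (my $tag = $p->get_tag('a')) {
        my $attr = $tag->[1];                      # hashref of attributes
        my $text = $p->get_trimmed_text('/a');     # text up to the closing </a>
        next unless defined $attr->{href};         # skip anchors without href
        push @found, $text if $text =~ $re;
    }
    return @found;
}

my $html = '<a href="1">one</a> <a href="2">three</a> <a>two</a>';
print join(',', find_links($html, qr/o/)), "\n";
```

Here the parser finds the tags and the regex is applied only to the link text, so the greediness question never comes up.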

Gunnar Hjalmarsson · May 5, 2007

Taras_96 said:
'/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

(I'm only interested with tags that have a href attribute)

which greedily matches the entire input string.

How do I make the </a> match non greedy? I've read that (.*?)<\/a>
makes the match non greedy, but this doesn't account for the form of
the link text.

Really? Even if a regex would be sufficient for the task you are trying
to accomplish, I'm not convinced. Can you demonstrate your claim with
some runnable example code?

Tad McClellan · May 5, 2007

Petr Vileta said:
$page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
print "$1\n";

You should never use the dollar-digit variables unless
you have first ensured that the pattern match _succeeded_.
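
A minimal sketch of that rule, with made-up sample strings: the capture variable is read only inside a branch that runs when the match succeeded. The first_link_text() name is hypothetical.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# first_link_text() is a hypothetical helper name used for illustration.
sub first_link_text {
    my ($page) = @_;
    # $1 is only meaningful if the match succeeded, so test the match itself.
    if ($page =~ m{<a\s[^>]*href=[^>]*>(.*?)</a>}si) {
        return $1;
    }
    return;    # no link found: return nothing rather than a stale $1
}

for my $page ('<a href="x">link</a>', '<html></html>') {
    my $text = first_link_text($page);
    print defined $text ? "found: $text\n" : "no link\n";
}
```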

Tad McClellan · May 5, 2007

Petr Vileta said:
Assume that variable $page contain html code.

OK.

--------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $page = '<html></html>';

while ($page =~ m/<a\s+.*?href=.+?>/sig)
{
    $page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
    print "$1\n";
}
--------------------------

That link-finding program does exactly what it is supposed to do.

If you had some other data in mind, then you need to share
that with us, we are not mind readers.

Tad McClellan · May 6, 2007

Petr Vileta said:
--------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $page = "<html><body>\nClick to this <a
href=\"http://www.google.com\">link</a>\n</body></html>";

while ($page =~ m/<a\s+.*?href=.+?>/sig)
{
    $page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
    print "$1\n";
}

-------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $page = '<html><body>\nClick to this
<a href = "http://www.google.com">link2</a>
<a href="http://www.google.com">link3</a >
<a href=\"http://www.google.com\">link</a>

<a href="http://www.google.com" name="<<cool link!>>">link4</a>
<a name="href=stuff">No href here!</a>
</body></html>
';

while ($page =~ m/<a\s+.*?href=.+?>/sig)
{
    $page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
    print "(($1))\n\n";
}
-------------------------

output:

((link3</a >
<a href=\"http://www.google.com\">link))

((Not A Link!))

((>">link4))

((No href here!))

Xicheng Jia · May 7, 2007

Hi everyone,

I need to write a regex that parses some HTML text to output all links
whose text (the text that appears on the screen) a given expression.

eg: findLinks(html,'(.*)o(.*)') called on the html code

<a>one</a>
<a>three</a>
<a>two</a>

Should return two matches, <a>one</a> and <a>two</a>

I'm a bit new with regexs. At the moment I have:

'/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

(I'm only interested with tags that have a href attribute)

which greedily matches the entire input string.

How do I make the </a> match non greedy? I've read that (.*?)<\/a>
makes the match non greedy, but this doesn't account for the form of
the link text.

Here is one regex way:

sub findlinks
{
    my ($html, $ptn) = @_;
    while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
        my $ret = $1;
        (my $content = $2) =~ s/<.*?>//g; #remove embedded tags
        print $ret if $content =~ /\Q$ptn/;
        # if $ptn is plain text, switch to index(), which returns -1
        # when $ptn is absent and 0 for a match at the very start:
        # print $ret if index($content, $ptn) >= 0;
    }
}

$html = <<END_HTML;
<a href="bbb">one</a> nnn
sgfdh <a href="aa">three</a>
dfgdg <a>two</a> 000
dfgdg <a href="ttoo">two
</a> ooo
END_HTML

findlinks($html, "o");

__END__

Regards,
Xicheng

Xicheng Jia · May 7, 2007

Hi everyone,

I need to write a regex that parses some HTML text to output all links
whose text (the text that appears on the screen) a given expression.

eg: findLinks(html,'(.*)o(.*)') called on the html code

Should return two matches, <a>one</a> and <a>two</a>

I'm a bit new with regexs. At the moment I have:

'/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

(I'm only interested with tags that have a href attribute)

which greedily matches the entire input string.

How do I make the </a> match non greedy? I've read that (.*?)<\/a>
makes the match non greedy, but this doesn't account for the form of
the link text.

Here is one regex way:

sub findlinks
{
my ($html, $ptn) = @_;

change
while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
to:
while($html =~ m{( <a (?=[^<>]*href) .*?> (.*?) </a> )}gsix) {

or change:
(my $content = $2) =~ s/<.*?>//g;
to
(my $content = $2) =~ s/^[^>]*>|<.*?>//g;

I forgot to close the opening link tag.
BTW, this may not work for some ill-formatted XHTML documents, which
are widespread on the web, and it might also wrongly check the
contents of commented-out elements.

Regards,
Xicheng

Tad McClellan · May 8, 2007

Xicheng Jia said:
Here is one regex way:

So let's rephrase that in more honest terms.

Here is a way that appears to work often, but will sometimes match
things that it shouldn't match, and at other times will not match
things that it should have matched.

(If you want one way that always gets it right, then you need
a Real Parser.)

sub findlinks
{
    my ($html, $ptn) = @_;
    while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
        my $ret = $1;
        (my $content = $2) =~ s/<.*?>//g; #remove embedded tags
        print $ret if $content =~ /\Q$ptn/;
        # if $ptn is plain text, switch to index(), which returns -1
        # when $ptn is absent and 0 for a match at the very start:
        # print $ret if index($content, $ptn) >= 0;
    }
}

Try it with this data:

$html = <<END_HTML;
<p>
If b<a then href="bbb"
</p> Don't report me!
<a href="ttoo">two
END_HTML

-
Tad McClellan, SGML consulting, Perl programming
Fort Worth, Texas

Xicheng Jia · May 8, 2007

Tad McClellan said:
Xicheng Jia said:

Here is one regex way:

So let's rephrase that in more honest terms.

Here is a way that appears to work often, but will sometimes match
things that it shouldn't match, and at other times will not match
things that it should have matched.

(If you want one way that always gets it right, then you need
a Real Parser.
)

sub findlinks
{
    my ($html, $ptn) = @_;
    while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
        my $ret = $1;
        (my $content = $2) =~ s/<.*?>//g; #remove embedded tags
        print $ret if $content =~ /\Q$ptn/;
        # if $ptn is plain text, switch to index(), which returns -1
        # when $ptn is absent and 0 for a match at the very start:
        # print $ret if index($content, $ptn) >= 0;
    }
}

Try it with this data:

$html = <<END_HTML;
<p>
If b<a then href="bbb"

That is ill-formatted html and won't pass the W3C XHTML validator. I
mentioned that in my previous post, and I never said the code can do
everything. But when one knows what the text contains and how it is
presented, CPAN modules are not the only tools that can solve the problem.

BTW, you could actually come up with some better samples to invalidate
my code, like:

<a onmouseover="window.location.href = whatever" .....> ...... </a>

but that's easy to fix.
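
One possible tightening, as a sketch only: require a word boundary after <a, whitespace before href, and an = after it. The link_text() helper is hypothetical, and a quoted attribute value that itself contains " href=" would still fool it.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# link_text() is a hypothetical helper used only for this illustration.
# It returns the text of the first <a ... href=...> element, or undef.
sub link_text {
    my ($html) = @_;
    # The lookahead wants whitespace, then "href", then "=" inside the tag,
    # so "window.location.href" in another attribute does not count.
    if ($html =~ m{<a \b (?=[^<>]*\s href \s* =) [^<>]* > (.*?) </a>}six) {
        return $1;
    }
    return;
}

for my $html ('<a href="u">plain</a>',
              '<a onmouseover="window.location.href = whatever">nope</a>') {
    my $text = link_text($html);
    print defined $text ? "link: $text\n" : "no href link\n";
}
```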

Regards,
Xicheng

Tad McClellan · May 8, 2007

Xicheng Jia said:

That is ill-formatted html

It is perfectly valid HTML.

and won't pass the W3C XHTML validator,

That's because XHTML is not the same language as HTML.

The OP was not asking about that language, he was asking about HTML.

I've
mentioned in my previous post

I had not seen it yet, and the disclaimers should be in the
same post where the disclaimed code is.

Otherwise people might take the code seriously.

BTW. you could actually come up with some better samples to invalidate
my code,

Yes I could.

Taras_96 · May 8, 2007

I thought I posted this earlier, but it seems to have been lost?!

OK everyone, forget that it's HTML we're parsing.

How would I make a regex that would return from:

<open>one</close><open>two</close>

Two matches, 'one' and 'two', and not the single match
'one</close><open>two'?

Taras

Tad McClellan · May 8, 2007

Taras_96 said:
OK everyone, forget that it's HTML we're parsing.

You can do that if you want to match one particular string.

If you want code that will work on different data, we would
need to understand what could be different...

How would I make a regex that would return from:

<open>one</close><open>two</close>

Two matches, 'one', and 'two',

--------------------
#!/usr/bin/perl
use warnings;
use strict;

$_ = '<open>one</close><open>two</close>';

my @x = /(one|two)/g;
print join(',', @x), "\n";

@x = />(...)</g;
print join(',', @x), "\n";

@x = /\b(\w{3})\b/g;
print join(',', @x), "\n";

@x = />([^<]+)/g;
print join(',', @x), "\n";

@x = m#<open>(.*?)</close>#g;
print join(',', @x), "\n";
--------------------

Peter J. Holzer · May 8, 2007

It is perfectly valid HTML.

I don't think so. "<a " looks like the start of an "a" tag, but the rest
of it isn't well-formed ("then" is not an attribute of "a", and an
unquoted "</" is syntactically wrong). I don't have the syntax rules for
SGML at hand, but I doubt that they require backtracking to the "<" and
reinterpreting it as a literal "<" instead of the start of a tag.

Adding two spaces makes it valid:

If b < a then href="bbb"

hp

Taras_96 · May 9, 2007

You can do that if you want to match one particular string.

If you want code that will work on different data, we would
need to understand what could be different...

Should have explained my question a bit more (I thought with the
previous discussion about HTML it would be clear).

Part 1)

How do I construct a regex that matches any text that is in between
<open> and </close> strings, but the *shortest* (non-greedy matching)
such string?

So in the above example, the strings 'one' and 'two' can be
theoretically anything.

Part 2)

Once we have the non-greedy matching, how can I construct a regex that
would return any text in between <open> and </close>, but the text in
between the tags must itself match a regex?

eg: a search for o(.)* would return 'one' using my previous example,
but not 'two'.

Tad McClellan · May 10, 2007

[ Please provide a proper attribution when you quote someone,
like everybody else does... ]

Should have explained my question a bit more

I was hoping you'd say that after seeing my post.

(I thought with the
previous discussion about HTML it would be clear).

Errr, so when you said:

forget that it's HTML we're parsing

we weren't really supposed to do that?

Part 1)

How do I construct a regex that matches any text that is in between
<open> and </close> strings, but the *shortest* (non-greedy matching)
such string?

One of the ways I did it in the code that I gave you meets
that spec. Did you read and understand that code?

Or, since your Question is Asked Frequently:

perldoc -q greedy

What does it mean that regexes are greedy? How can I get around it?

So in the above example,

There is no "above example".

If you want to discuss a piece of code, then please quote the piece
of code, like everybody else does...

Part 2)

Once we have the non-greedy matching, how can I construct a regex that
would return any text in between <open> and </close>, but the text in
between the tags must itself match a regex?

That is where managing the greediness will become difficult.

I'd stick with finding the delimiters first, and then applying
your regex to the list it returns:

my $inner_pat = 'o'; # lower case oh
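
A minimal sketch of that two-step idea (not necessarily the code originally posted): pull out everything between the delimiters with a non-greedy match, then filter the resulting list with the inner pattern. The matching_chunks() name and the sample data are assumptions for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# matching_chunks() is a hypothetical helper for this illustration.
sub matching_chunks {
    my ($data, $inner_pat) = @_;

    # Step 1: non-greedy match between the delimiters, in list context,
    # returns every captured chunk.
    my @chunks = $data =~ m{<open>(.*?)</close>}gs;

    # Step 2: keep only the chunks that match the inner pattern.
    return grep { /$inner_pat/ } @chunks;
}

my $data = '<open>one</close><open>three</close><open>two</close>';
print join(',', matching_chunks($data, 'o')), "\n";
```

Keeping the two patterns separate means the inner pattern can never run past a </close>, whatever quantifiers it contains.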

eg: a search for o(.)* would return 'one' using my previous example,
but not 'two'.

^^^^^^^^^^^^^

Why not?

This program makes output when it matches that pattern:

-------------------
#!/usr/bin/perl
use warnings;
use strict;

$_ = 'two';
print "matched\n" if /o(.)*/;
-------------------

Your regex will match the same strings as /o/
and it will fail to match the same strings.

Did you perhaps mean /o(.)+/ instead?
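
A quick way to see the difference, as a sketch: verdicts() is a made-up helper that reports, for one string, whether /o/, /o(.)*/ and /o(.)+/ match.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# verdicts() is a hypothetical helper: it returns "1" or "0" for each of
# the three patterns, in the order /o/, /o(.)*/, /o(.)+/.
sub verdicts {
    my ($s) = @_;
    my @v;
    for my $re (qr/o/, qr/o(.)*/, qr/o(.)+/) {
        push @v, ($s =~ $re) ? 1 : 0;
    }
    return "@v";
}

print "$_: ", verdicts($_), "\n" for qw(one two xyz);
```

The first two columns always agree, because (.)* is allowed to match nothing. Only /o(.)+/ rejects 'two', where nothing follows the 'o'.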
