Link Matching

Taras_96

Hi everyone,

I need to write a regex that parses some HTML text to output all links
whose text (the text that appears on the screen) matches a given expression.

eg: findLinks(html,'(.*)o(.*)') called on the html code

<a>one</a>
<a>three</a>
<a>two</a>

Should return two matches, <a>one</a> and <a>two</a>

I'm a bit new to regexes. At the moment I have:

'/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

(I'm only interested in tags that have an href attribute)

which greedily matches the entire input string.

How do I make the </a> match non greedy? I've read that (.*?)<\/a>
makes the match non greedy, but this doesn't account for the form of
the link text.

Thanks

Taras
 
Jürgen Exner

Taras_96 said:
Hi everyone,

I need to write a regex that parses some HTML text

Bad idea. See "perldoc -q HTML"
How do I remove HTML from a string?
and the gazillions of previous articles on this topic that explain why, and
what to do instead.

jue
 
Gunnar Hjalmarsson

Taras_96 said:
'/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

(I'm only interested with tags that have a href attribute)

which greedily matches the entire input string.

How do I make the </a> match non greedy? I've read that (.*?)<\/a>
makes the match non greedy, but this doesn't account for the form of
the link text.

Really? Even if a regex would be sufficient for the task you are trying
to accomplish, I'm not convinced. Can you demonstrate your claim with
some runnable example code?
 
Tad McClellan

Petr Vileta said:
$page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
print "$1\n";


You should never use the dollar-digit variables unless
you have first ensured that the pattern match _succeeded_.
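A minimal sketch of that advice, assuming a simplified link pattern and a helper name (first_link_text) that are mine and not from the thread. The capture is read only inside the branch where the match succeeded:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the text of the first <a ... href=...>...</a> in $page,
# or undef when the pattern does not match.
# (first_link_text and the simplified pattern are illustrative assumptions.)
sub first_link_text {
    my ($page) = @_;
    if ($page =~ m{<a\s[^>]*href=[^>]*>(.+?)</a>}si) {
        return $1;    # safe: $1 is read only after a successful match
    }
    return;           # no match: $1 must not be trusted here
}

for my $page ('<html></html>', '<a href="http://www.google.com">link</a>') {
    my $text = first_link_text($page);
    print defined $text ? "found: $text\n" : "no link found\n";
}
```

This is still regex-based link finding, so it only illustrates the match check, not a reliable way to read HTML.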
 
Tad McClellan

Petr Vileta said:
Assume that variable $page contain html code.


OK.


--------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $page = '<html></html>';

while ($page =~ m/<a\s+.*?href=.+?>/sig)
{
$page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
print "$1\n";
}
--------------------------


That link-finding program does exactly what it is supposed to do.

If you had some other data in mind, then you need to share
that with us, we are not mind readers.
 
Tad McClellan

Petr Vileta said:
--------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $page = "<html><body>\nClick to this <a
href=\"http://www.google.com\">link</a>\n</body></html>";

while ($page =~ m/<a\s+.*?href=.+?>/sig)
{
$page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
print "$1\n";
}


-------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $page = '<html><body>\nClick to this
<a href = "http://www.google.com">link2</a>
<a href="http://www.google.com">link3</a >
<a href=\"http://www.google.com\">link</a>
<!--
<a href="http://www.google.com">Not A Link!</a>
-->
<a href="http://www.google.com" name="<<cool link!>>">link4</a>
<a name="href=stuff">No href here!</a>
</body></html>
';

while ($page =~ m/<a\s+.*?href=.+?>/sig)
{
$page =~ s/^.+?<a\s+.*?href=.+?>(.+?)<\/a>(.+)$/$2/si;
print "(($1))\n\n";
}
-------------------------

output:

((link3</a >
<a href=\"http://www.google.com\">link))

((Not A Link!))

((>">link4))

((No href here!))
 
Xicheng Jia

Taras_96 said:
Hi everyone,

I need to write a regex that parses some HTML text to output all links
whose text (the text that appears on the screen) a given expression.

eg: findLinks(html,'(.*)o(.*)') called on the html code

<a>one</a>
<a>three</a>
<a>two</a>

Should return two matches, <a>one</a> and <a>two</a>

I'm a bit new with regexs. At the moment I have:

'/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

(I'm only interested with tags that have a href attribute)

which greedily matches the entire input string.

How do I make the </a> match non greedy? I've read that (.*?)<\/a>
makes the match non greedy, but this doesn't account for the form of
the link text.

Here is one regex way:

sub findlinks
{
my ($html, $ptn) = @_;
while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
my $ret = $1;
(my $content = $2) =~ s/<.*?>//g; #remove embedded tags
print $ret if $content =~ /\Q$ptn/;
# if $ptn is plain text, switch to index()
# print $ret if index($content, $ptn) >= 0;
}
}

$html = <<END_HTML;
<a href="bbb">one</a> nnn
sgfdh <a href="aa">three</a>
dfgdg <a>two</a> 000
dfgdg <a href="ttoo">two
</a> ooo
END_HTML

findlinks($html, "o");

__END__

Regards,
Xicheng
 
Xicheng Jia

Taras_96 said:
Hi everyone,
I need to write a regex that parses some HTML text to output all links
whose text (the text that appears on the screen) a given expression.
eg: findLinks(html,'(.*)o(.*)') called on the html code

Should return two matches, <a>one</a> and <a>two</a>
I'm a bit new with regexs. At the moment I have:
'/<a[^><]*href\s*=\s*[^>]*>'.$regex.'<\/a>/'

(I'm only interested with tags that have a href attribute)
which greedily matches the entire input string.
How do I make the </a> match non greedy? I've read that (.*?)<\/a>
makes the match non greedy, but this doesn't account for the form of
the link text.

Here is one regex way:

sub findlinks
{
my ($html, $ptn) = @_;
change
while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
to:
while($html =~ m{( <a (?=[^<>]*href) .*?> (.*?) </a> )}gsix) {

or change:
(my $content = $2) =~ s/<.*?>//g;
to
(my $content = $2) =~ s/^[^>]*>|<.*?>//g;

I forgot to close the opening link tag.
BTW, this may not work for some ill-formed XHTML documents, even though
they are widespread on the web, and it may also wrongly check the
contents of commented-out elements.
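Putting the correction together, here is a sketch only, with the same limitations. The name find_links, returning a list instead of printing, the \b after "a", and treating $ptn as a regex (the posted code quoted it with \Q) are assumptions of this sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: return every <a ... href ...>...</a> element whose visible text
# matches $ptn. Assumptions: $ptn is a regex, and the opening tag is
# consumed with [^>]*> so that attributes never count as link text.
sub find_links {
    my ($html, $ptn) = @_;
    my @found;
    while ($html =~ m{( <a \b (?=[^<>]*href) [^>]* > (.*?) </a> )}gsix) {
        my ($element, $content) = ($1, $2);
        $content =~ s/<.*?>//gs;    # remove embedded tags
        push @found, $element if $content =~ /$ptn/;
    }
    return @found;
}

my $html = <<'END_HTML';
<a href="bbb">one</a> nnn
sgfdh <a href="aa">three</a>
dfgdg <a>two</a> 000
dfgdg <a href="ttoo">two
</a> ooo
END_HTML

# Prints the "one" link and the second "two" link; <a>two</a> has no href.
print "$_\n" for find_links($html, 'o');
```

It can still be fooled by comments, by "href" appearing inside another attribute, and by unusual markup.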

Regards,
Xicheng
 
Tad McClellan

Xicheng Jia said:
Here is one regex way:


So let's rephrase that in more honest terms.

Here is a way that appears to work often, but will sometimes match
things that it shouldn't match, and at other times will not match
things that it should have matched.

(If you want one way that always gets it right, then you need
a Real Parser.)
sub findlinks
{
my ($html, $ptn) = @_;
while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
my $ret = $1;
(my $content = $2) =~ s/<.*?>//g; #remove embedded tags
print $ret if $content =~ /\Q$ptn/;
# if $ptn is plain text, switch to index()
# print $ret if index($content, $ptn) >= 0;
}
}


Try it with this data:

$html = <<END_HTML;
<p>
If b<a then href="bbb"
</p> Don't report me!
<a href="ttoo">two
END_HTML
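A small sketch of what the quoted pattern does with this data. The helper name raw_matches and the variant with a closing </a> appended are my additions, not part of the posted test:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect whole-element matches using the pattern from the quoted findlinks().
# (raw_matches is a hypothetical helper for this demonstration.)
sub raw_matches {
    my ($html) = @_;
    my @matches;
    while ($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
        push @matches, $1;
    }
    return @matches;
}

my $html = <<'END_HTML';
<p>
If b<a then href="bbb"
</p> Don't report me!
<a href="ttoo">two
END_HTML

# As posted, the data contains no </a>, so the real link is never reported.
print scalar(raw_matches($html)), " match(es) as posted\n";

# With a closing tag appended, the match starts at the "<a" inside "b<a then"
# and swallows the paragraph text on the way to the real link.
print "[$_]\n" for raw_matches($html . "</a>\n");
```

So the same pattern misses the link in one case and over-matches in the other.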


--
Tad McClellan, SGML consulting, Perl programming
Fort Worth, Texas
 
Xicheng Jia

Xicheng Jia said:
Here is one regex way:

So let's rephrase that in more honest terms.

Here is a way that appears to work often, but will sometimes match
things that it shouldn't match, and at other times will not match
things that it should have matched.

(If you want one way that always gets it right, then you need
a Real Parser.)
sub findlinks
{
my ($html, $ptn) = @_;
while($html =~ m{( <a (?=[^<>]*href) (.*?) </a> )}gsix) {
my $ret = $1;
(my $content = $2) =~ s/<.*?>//g; #remove embedded tags
print $ret if $content =~ /\Q$ptn/;
# if $ptn is plain text, switch to index()
# print $ret if index($content, $ptn) >= 0;
}
}

Try it with this data:

$html = <<END_HTML;
<p>
If b<a then href="bbb"

That is ill-formed HTML and won't pass the W3C XHTML validator. I've
mentioned that in my previous post, and I never said the code can do
everything. But when one knows what the text contains and how it is
presented, CPAN modules are not the only tools that can solve the problem.

BTW. you could actually come up with some better samples to invalidate
my code, like:

<a onmouseover="window.location.href = whatever" .....> ...... </a>

but that's easy to fix.

Regards,
Xicheng
 
Tad McClellan

Xicheng Jia said:


That is ill-formed HTML


It is perfectly valid HTML.

and won't pass the W3C XHTML validator,


That's because XHTML is not the same language as HTML.

The OP was not asking about that language, he was asking about HTML.

I've
mentioned in my previous post


I had not seen it yet, and the disclaimers should be in the
same post where the disclaimed code is.

Otherwise people might take the code seriously.

BTW. you could actually come up with some better samples to invalidate
my code,


Yes I could.
 
Taras_96

I thought I posted this earlier, but it seems to have been lost?!

OK everyone, forget that it's HTML we're parsing.

How would I make a regex that would return from:

<open>one</close><open>two</close>

Two matches, 'one', and 'two', and not the one match
'one</close><open>two'?

Taras
 
Tad McClellan

Taras_96 said:
OK everyone, forget that it's HTML we're parsing.


You can do that if you want to match one particular string.

If you want code that will work on different data, we would
need to understand what could be different...

How would I make a regex that would return from:

<open>one</close><open>two</close>

Two matches, 'one', and 'two',


--------------------
#!/usr/bin/perl
use warnings;
use strict;

$_ = '<open>one</close><open>two</close>';

my @x = /(one|two)/g;
print join(',', @x), "\n";

@x = />(...)</g;
print join(',', @x), "\n";

@x = /\b(\w{3})\b/g;
print join(',', @x), "\n";

@x = />([^<]+)/g;
print join(',', @x), "\n";

@x = m#<open>(.*?)</close>#g;
print join(',', @x), "\n";
 
Peter J. Holzer

It is perfectly valid HTML.

I don't think so. "<a " looks like the start of an "a" tag, but the rest
of it isn't well-formed ("then" is not an attribute of "a", and an
unquoted "</" is syntactically wrong). I don't have the syntax rules for
SGML at hand, but I doubt that they require backtracking to the "<" and
reinterpreting it as a literal "<" instead of the start of a tag.

Adding two spaces makes it valid:

If b < a then href="bbb"

hp
 
Taras_96

You can do that if you want to match one particular string.

If you want code that will work on different data, we would
need to understand what could be different...

Should have explained my question a bit more (I thought with the
previous discussion about HTML it would be clear).

Part 1)

How do I construct a regex that matches any text that is in between
<open> and </close> strings, but the *shortest* (non-greedy matching)
such string?

So in the above example, the strings 'one' and 'two' can be
theoretically anything.

Part 2)

Once we have the non-greedy matching, how can I construct a regex that
would return any text in between <open> and </close>, but the text in
between the tags must itself match a regex?

eg: a search for o(.)* would return 'one' using my previous example,
but not 'two'.
 
Tad McClellan

[ Please provide a proper attribution when you quote someone,
like everybody else does... ]

Should have explained my question a bit more


I was hoping you'd say that after seeing my post. :)

(I thought with the
previous discussion about HTML it would be clear).


Errr, so when you said:

forget that it's HTML we're parsing

we weren't really supposed to do that?

Part 1)

How do I construct a regex that matches any text that is in between
<open> and </close> strings, but the *shortest* (non-greedy matching)
such string?


One of the ways I did it in the code that I gave you meets
that spec. Did you read and understand that code?


Or, since your Question is Asked Frequently:

perldoc -q greedy

What does it mean that regexes are greedy? How can I get around it?
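For readers without perldoc at hand, a minimal sketch of the difference on the example string from this thread. The helper names greedy_matches and lazy_matches are mine:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Greedy: .* runs on to the LAST </close> it can reach.
sub greedy_matches {
    my ($text) = @_;
    my @m = $text =~ m{<open>(.*)</close>}g;
    return @m;
}

# Non-greedy: .*? stops at the FIRST </close> it can reach.
sub lazy_matches {
    my ($text) = @_;
    my @m = $text =~ m{<open>(.*?)</close>}g;
    return @m;
}

my $text = '<open>one</close><open>two</close>';
print join(',', greedy_matches($text)), "\n";   # one</close><open>two
print join(',', lazy_matches($text)), "\n";     # one,two
```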

So in the above example,


There is no "above example".

If you want to discuss a piece of code, then please quote the piece
of code, like everybody else does...

Part 2)

Once we have the non-greedy matching, how can I construct a regex that
would return any text in between <open> and </close>, but the text in
between the tags must itself match a regex?


That is where managing the greediness will become difficult.

I'd stick with finding the delimiters first, and then applying
your regex to the list it returns:

my $inner_pat = 'o'; # lower case oh
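One way the two-step idea could look, as a sketch under my own assumptions: the sub name matching_inner is hypothetical, and the inner pattern is passed as a string and used as a regex:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Step 1: pull out everything between the delimiters, non-greedily.
# Step 2: keep only the pieces that match the inner pattern.
# (matching_inner is an illustrative name, not code from the thread.)
sub matching_inner {
    my ($text, $inner_pat) = @_;
    my @inner = $text =~ m{<open>(.*?)</close>}gs;
    return grep { /$inner_pat/ } @inner;
}

my $text = '<open>one</close><open>two</close><open>three</close>';
print join(',', matching_inner($text, 'o')), "\n";    # one,two
print join(',', matching_inner($text, '^o')), "\n";   # one
```

Keeping the two steps apart means the inner pattern can be as greedy as it likes without ever crossing a delimiter.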
eg: a search for o(.)* would return 'one' using my previous example,
but not 'two'.
^^^^^^^^^^^^^

Why not?

This program makes output when it matches that pattern:

-------------------
#!/usr/bin/perl
use warnings;
use strict;

$_ = 'two';
print "matched\n" if /o(.)*/;
-------------------


Your regex will match the same strings as /o/
and it will fail to match the same strings.

Did you perhaps mean /o(.)+/ instead?
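A quick sketch that compares the two patterns on both words. The helper is_match is an assumed name for this demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report whether $string matches the compiled pattern $re (1 or 0).
sub is_match {
    my ($string, $re) = @_;
    return $string =~ $re ? 1 : 0;
}

my $star = qr/o(.)*/;   # "o" followed by zero or more characters
my $plus = qr/o(.)+/;   # "o" followed by at least one character

for my $word (qw(one two)) {
    printf "%-3s star=%d plus=%d\n",
        $word, is_match($word, $star), is_match($word, $plus);
}
```

Here 'one' matches both patterns, while 'two' matches only the star version because nothing follows its "o".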
 
