Fritz Bayer
Hello,
I'm trying to extract URLs from a document.
The following code does not work correctly:
while ($content =~ m!$(<p class=(["']?)g\2>.*?>.*?<a.*?href=(["'])?(http://([^\3]+)))!ig)
{
print "1 $1\n";
print "2 $2\n";
print "3 $3\n";
print "4 $4\n";
print "5 $5\n";
}
The problem is that
([^\3]+)
is also matching the character " or ' from the third capturing group,
even though it should NOT.
It matches them not because the third capturing group is empty (neither " nor '),
but because somehow \3 can't be used inside a [...] block.
Why is that, and what's the workaround for this?
Fritz
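
For reference, a hedged sketch of one possible workaround. Inside a bracketed character class Perl does not treat \3 as a backreference; it reads it as an octal escape (the control character with code 3), so [^\3] matches any character except that one, quotes included. A common substitute is a negative lookahead that consumes one character at a time for as long as the captured quote does not start at the current position. The helper name extract_urls, the simplified pattern (it drops the <p class=g> prefix, so the quote is group 1 here rather than group 3) and the sample markup are assumptions made for illustration, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the http:// href values found in $html.
sub extract_urls {
    my ($html) = @_;
    my @urls;
    # (["'])?            group 1: the opening quote, if there is one
    # (?:(?!\1)[^\s>])+  take one character at a time, stopping at the
    #                    captured quote, at whitespace, or at '>'
    # If group 1 did not take part in the match, \1 cannot match, so the
    # lookahead always succeeds and [^\s>] alone ends an unquoted value.
    while ($html =~ m{<a\b[^>]*?\bhref=(["'])?(http://(?:(?!\1)[^\s>])+)}ig) {
        push @urls, $2;
    }
    return @urls;
}

# Sample input, made up for this sketch.
my $content = q{<p class="g"><a href="http://example.com/a">A</a> <a href='http://example.com/b'>B</a></p>};
print "$_\n" for extract_urls($content);
# prints:
# http://example.com/a
# http://example.com/b
```

The quote group is written as (["'])? rather than (["']?) on purpose: with the latter, an unquoted href leaves the group set to the empty string, \1 then matches everywhere, and the lookahead would reject every character. Excluding whitespace and '>' is a simplification for this sketch; for real-world documents an HTML parser is usually more robust than a regex.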