seven.reeds
Hi,
I'm running perl v5.8.7.
I have a series of files with HTML tags in them. I am NOT trying to
strip the tags; I am, however, trying to list the "link phrases"
associated with all of the "<a href=...>link phrase</a>" sequences in
each file. I have a script that does what I want. I just need it to
be improved a bit, and that's why I am here.
The code so far is:
use strict;
select(STDIN);
$|++;
my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;
my $tmp = "";
while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
{
    $text = $';
    if ($text =~ /<\s*\/A\s*>/is)
    {
        $tmp = $`;
        #$tmp =~ s/^\s+/ /sg;
        #$tmp =~ s/\s+$/ /sg;
        #$tmp =~ s/\s+/ /sg;
        print STDOUT ">>>$tmp<<<\n";
        $text = $';
    }
}
So the "while" looks to see if there is a starting "<A" tag. If there
is, then I reset the text to the portion following the
initial match ("$text = $';"). Next, I look for a closing "</a>" tag
and stash the pre-match portion in "$tmp".
Ignore the commented-out lines for a second... then I print out $tmp
and "increment" the file-string past the closing A tag.
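The scan described above can also be sketched as a single global match with a non-greedy capture, which avoids the $` and $' variables. This is only a sketch under assumptions: the helper name link_phrases and the sample string are made up for illustration, and it should behave roughly like the loop above, not exactly like it in every edge case.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the text
# between each <a href=...> tag and the next </a>, in document order.
sub link_phrases {
    my ($html) = @_;
    my @phrases;
    # (.*?) is non-greedy, so each match stops at the first closing tag.
    # /s lets . cross newlines, /i ignores case, /g walks the whole string.
    while ($html =~ m{<\s*a\s+href\s*=[^>]+>(.*?)<\s*/a\s*>}gis) {
        push @phrases, $1;
    }
    return @phrases;
}

# Made-up sample input, only to show the call.
my $sample = qq{<p><a href="one.html">First  link</a> and <A HREF='two.html'>\nsecond\n</A></p>};
print ">>>$_<<<\n" for link_phrases($sample);
```

Like the original loop, this is plain pattern matching and not real HTML parsing, so nested or malformed tags can still trip it up.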
Again, this works. It is spitting out the text I expect. But now we
come to the commented-out lines.
I am trying to pretty up the text I find by stripping off
leading/trailing whitespace and compressing internal whitespace.
Except that bit isn't working.
Any ideas?
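A minimal sketch of the cleanup described above, for reference. One thing that stands out in the commented-out lines is that the first two substitutions replace the leading and trailing whitespace with a single space instead of with nothing, so a space would remain at each end. Whether that is the whole problem is an assumption. The helper name tidy_phrase is made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the original script).
sub tidy_phrase {
    my ($s) = @_;
    $s =~ s/^\s+//;     # drop leading whitespace: replace it with nothing
    $s =~ s/\s+$//;     # drop trailing whitespace, including newlines
    $s =~ s/\s+/ /g;    # squeeze every remaining whitespace run to one space
    return $s;
}

print ">>>", tidy_phrase("\n  link \t phrase \n"), "<<<\n";   # prints >>>link phrase<<<
```

In the script above this would be called on $tmp just before the print. The /s modifier is left off because it only changes how "." matches and none of these patterns use ".".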