regex multi-line match/replace issue


seven.reeds

Hi,

I'm running perl v5.8.7

I have a series of files with html tags in them. I am NOT trying to
strip the tags. I am, however, trying to list the "link phrases"
associated with all of the "<a href=...>link phrase</a>" sequences in
each file. I have a script that does what I want. I just need it to
be improved a bit and that's why I am here.

the code so far is:

use strict;
select(STDIN);
$|++;

my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;
my $tmp = "";

while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
{
$text = $';
if ($text =~ /<\s*\/A\s*>/is)
{
$tmp = $`;
#$tmp =~ s/^\s+/ /sg;
#$tmp =~ s/\s+$/ /sg;
#$tmp =~ s/\s+/ /sg;
print STDOUT ">>>$tmp<<<\n";
$text = $';
}
}

So the "while" looks to see if there is a starting "<A" tag. If there
is, then I reset the text line to the portion of the text following the
initial match "$text = $';". Next, I look to find a closing "</a>" tag
and stash the pre-match portion in "$tmp".

Ignore the commented-out lines for a second... then I print out $tmp
and "increment" the file-string past the closing A tag.

Again, this works. It is spitting out the text I expect. But now we
come to the commented-out lines.

I am trying to pretty-up the text I find by stripping off
leading/trailing whitespace and compressing internal whitespace.
Except that bit isn't working.

any ideas?
 

A. Sinan Unur

I have a series of files with html tags in them. I am NOT trying to
strip the tags. I am, however, trying to list the "link phrases"
associated with all of the "<a href=...>link phrase</a>" sequences in
each file. I have a script that does what I want.

You should use an HTML parser to parse HTML.

use strict;

use warnings;

select(STDIN);
$|++;

$| = 1;

my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;

Aaargh!

my $text = do { local $/; <> };

Actually, I would just use File::Slurp;

my $tmp = "";

while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
{
$text = $';
if ($text =~ /<\s*\/A\s*>/is)
{
$tmp = $`;
#$tmp =~ s/^\s+/ /sg;
#$tmp =~ s/\s+$/ /sg;
#$tmp =~ s/\s+/ /sg;
print STDOUT ">>>$tmp<<<\n";
$text = $';
}
}
....

I am trying to pretty-up the text I find by stripping off
leading/trailing whitespace and compressing internal whitespace.
Except that bit isn't working.

As I said, use an HTML parser to parse HTML.

Anyway, no need to reinvent the wheel. You can adapt:

http://search.cpan.org/src/GAAS/HTML-Parser-3.51/eg/hanchors
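
As a rough illustration of the parser route (this is not the hanchors script itself), here is a minimal sketch using HTML::TokeParser. That module ships in the HTML-Parser distribution on CPAN and is assumed to be installed; it is not a core module. The sub name link_phrases and the sample input are made up for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;   # part of the HTML-Parser distribution (not core)

# Return the text of every <a href=...>...</a> in an HTML string.
# link_phrases is a hypothetical helper name, not part of any module.
sub link_phrases {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot set up parser";
    my @phrases;
    while (my $tag = $parser->get_tag('a')) {
        next unless defined $tag->[1]{href};   # skip <a name=...> anchors
        # get_trimmed_text trims both ends and collapses runs of whitespace
        push @phrases, $parser->get_trimmed_text('/a');
    }
    return @phrases;
}

my $sample = qq{<p><a name="perl" href="http://www.perl.org">\n  Perl\n   Mongers </a></p>\n};
print ">>>$_<<<\n" for link_phrases($sample);
```

Because the parser tokenizes the document, attribute order, quoted ">" characters and HTML comments should not confuse it the way they confuse a hand-written pattern.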

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 

Tad McClellan

seven.reeds said:
I have a series of files with html tags in them. I am NOT trying to
strip the tags


Nonetheless, the primary point in the "How do I remove HTML from a string?"
FAQ answer is: don't use regular expressions for this.

I am however trying to list the "link phrases"
associated with all of the "<a href=...>link phrase</a>" sequences in
each file.


I would recommend using a module that already does that for you, such as:

http://search.cpan.org/~bdfoy/HTML-SimpleLinkExtor-1.12/SimpleLinkExtor.pm

I have a script that does what I want.


I think it only "appears" to do what you want.

You just haven't tried it with a test case that trips it up yet.

I just need it to
be improved a bit


It is a dirty hack.

If proper operation is of importance, then it needs to be thrown
away and replaced with something more robust.

and that's why I am here.


OK. So let's patch it up anyway, just as a "learning exercise".

my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;


Let Perl do the save-and-restore for you. This does the same thing:

my $text;
{ local $/; # a naked block creates a scope
$text = <>;
}
# $/ has been restored to its previous value here


Or, probably even better:

my $tmp = "";

while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
                  ^^^
                  ^^^

Spaces are not allowed there, so you should not allow spaces there.

The m//s modifier changes the meaning of dot; it is useless when
your pattern contains no dot.

{
$text = $';
if ($text =~ /<\s*\/A\s*>/is)


No unallowed spaces, no "s" modifier, as above.

If you choose an alternate delimiter for your m//, then you
won't have to backslash slashes:
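
For instance, a minimal sketch with braces as the delimiter, also dropping the "s" modifier and the space after "<" as suggested above. The sub name before_closing_tag and the sample string are made up for this example.

```perl
use strict;
use warnings;

# Return the text that precedes the first closing </a> tag, or undef.
# before_closing_tag is a hypothetical helper name.
sub before_closing_tag {
    my ($text) = @_;
    # m{...} lets the slash in </A> appear without a backslash
    return $text =~ m{</A\s*>}i ? $` : undef;
}

print before_closing_tag('Perl Mongers</a> and more'), "\n";
```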

{
$tmp = $`;
#$tmp =~ s/^\s+/ /sg;
#$tmp =~ s/\s+$/ /sg;
#$tmp =~ s/\s+/ /sg;
print STDOUT ">>>$tmp<<<\n";
$text = $';
}
}


Try your code with these:

<a name="perl" href="http://www.perl.org">Perl Mongers</a>

<a href="http://www.perl.org" name=">>>perl<<<">Perl Mongers</a>

<!--
<a href="not_a_link.com">Don't report me as a link!</a>
-->
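
To see how the regex approach fares on these three inputs, here is a sketch that applies the same two patterns to them. It uses a single match with a capture instead of $` and $', which should behave the same way on these samples. The sub name regex_link_phrases is made up for this example.

```perl
use strict;
use warnings;

# The three test inputs from above, joined into one string.
my $html = <<'HTML';
<a name="perl" href="http://www.perl.org">Perl Mongers</a>

<a href="http://www.perl.org" name=">>>perl<<<">Perl Mongers</a>

<!--
<a href="not_a_link.com">Don't report me as a link!</a>
-->
HTML

# Opening-tag and closing-tag patterns as in the original loop; the
# capture in between plays the role of $` after the closing-tag match.
# The /s is needed here because this version does contain a dot.
sub regex_link_phrases {
    my ($text) = @_;
    my @found;
    while ($text =~ m{<\s*A\s+HREF\s*=[^>]+>(.*?)<\s*/A\s*>}gis) {
        push @found, $1;
    }
    return @found;
}

print ">>>$_<<<\n" for regex_link_phrases($html);
```

On these inputs the first link is missed because name= comes before href=, the second phrase picks up part of the name attribute, and the commented-out link is reported anyway.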

any ideas?


Start over (with a module).
 

Lukas Mai

seven.reeds said:
the code so far is:

use strict;
select(STDIN);

The other posters seem to have missed this.
select() changes the current _output_ filehandle. I have no idea what
you're trying to achieve by selecting STDIN.

$|++;

$| changes the behavior of print. This line has no effect as you don't
print to STDIN.

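
To make the point concrete, here is a small sketch of how $| relates to the selected handle. It uses only core Perl (IO::Handle is a core module); the variable name $previous is just for illustration.

```perl
use strict;
use warnings;
use IO::Handle;    # core module; provides the autoflush method

# $| applies to the currently selected output handle, STDOUT by default.
$| = 1;

# To unbuffer another handle, select it, set $|, then restore the old one.
my $previous = select(STDERR);   # select returns the previously selected handle
$| = 1;
select($previous);

# The method form does the same for STDOUT without touching select at all.
STDOUT->autoflush(1);

print "output is unbuffered\n";
```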
my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;

Eww, use File::Slurp or local $/ here.
my $tmp = "";

while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
                                         ^
This /s has no effect. Why did you put it there?
{
$text = $';
if ($text =~ /<\s*\/A\s*>/is)
                           ^
This /s has no effect. Why did you put it there?
[snip]

MJD's Good Advice #11924 comes to mind.

Lukas
 

Anno Siegel

Tad McClellan said:
seven.reeds <[email protected]> wrote:

[good advice snipped]

Apart from everything else, uncommenting the commented substitutions will
change what $' contains at the end of the block. "$text = $'" should
come before any additional matches. Also, the commented s/// do not
strip leading and trailing white space but reduce them to a single blank.
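
A sketch of the original loop with both points applied: $' is saved before any further match, and the leading/trailing substitutions replace with nothing rather than a blank. The sub name link_phrases and the sample input are made up for this example; the patterns are otherwise the original ones, so the earlier caveats about parsing HTML with regexes still apply.

```perl
use strict;
use warnings;

# link_phrases is a hypothetical helper name wrapping the original loop.
sub link_phrases {
    my ($text) = @_;
    my @phrases;
    while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/i) {
        $text = $';
        if ($text =~ m{<\s*/A\s*>}i) {
            my $tmp = $`;
            $text = $';           # save the post-match BEFORE any other match
            $tmp =~ s/^\s+//;     # strip leading whitespace
            $tmp =~ s/\s+$//;     # strip trailing whitespace
            $tmp =~ s/\s+/ /g;    # compress internal whitespace
            push @phrases, $tmp;
        }
    }
    return @phrases;
}

my $sample = qq{<a href="x">\n  Perl\n   Mongers \n</a> and <A HREF='y'>two</A>};
print ">>>$_<<<\n" for link_phrases($sample);
```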

Anno
 
