regex multi-line match/replace issue

seven.reeds · Apr 24, 2006

Hi,

I'm running perl v5.8.7

I have a series of files with HTML tags in them. I am NOT trying to
strip the tags; I am, however, trying to list the "link phrases"
associated with all of the "<a href=...>link phrase</a>" sequences in
each file. I have a script that does what I want. I just need it to
be improved a bit, and that's why I am here.

the code so far is:

use strict;
select(STDIN);
$|++;

my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;
my $tmp = "";

while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
{
    $text = $';
    if ($text =~ /<\s*\/A\s*>/is)
    {
        $tmp = $`;
        #$tmp =~ s/^\s+/ /sg;
        #$tmp =~ s/\s+$/ /sg;
        #$tmp =~ s/\s+/ /sg;
        print STDOUT ">>>$tmp<<<\n";
        $text = $';
    }
}

So the "while" looks to see if there is a starting "<A" tag. If there
is, then I reset the text to the portion following the initial match
with "$text = $';". Next, I look for a closing "</a>" tag and stash
the pre-match portion in "$tmp".

Ignore the commented-out lines for a second... then I print out $tmp
and "increment" the file-string past the closing A tag.

Again, this works. It is spitting out the text I expect. But now we
come to the commented-out lines.

I am trying to pretty-up the text I find by stripping off
leading/trailing whitespace and compressing internal whitespace.
Except that bit isn't working.

Any ideas?

A. Sinan Unur · Apr 24, 2006

I have a series of files with HTML tags in them. I am NOT trying to
strip the tags; I am, however, trying to list the "link phrases"
associated with all of the "<a href=...>link phrase</a>" sequences in
each file. I have a script that does what I want.

You should use an HTML parser to parse HTML.

use strict;

use warnings;

select(STDIN);
$|++;

$| = 1;

my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;

Aaargh!

my $text = do { local $/; <> };

Actually, I would just use File::Slurp;

my $tmp = "";

while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)
{
    $text = $';
    if ($text =~ /<\s*\/A\s*>/is)
    {
        $tmp = $`;
        #$tmp =~ s/^\s+/ /sg;
        #$tmp =~ s/\s+$/ /sg;
        #$tmp =~ s/\s+/ /sg;
        print STDOUT ">>>$tmp<<<\n";
        $text = $';
    }
}
....

I am trying to pretty-up the text I find by stripping off
leading/trailing whitespace and compressing internal whitespace.
Except that bit isn't working.

As I said, use an HTML parser to parse HTML.

Anyway, no need to reinvent the wheel. You can adapt:

http://search.cpan.org/src/GAAS/HTML-Parser-3.51/eg/hanchors
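
As a rough illustration of the parser approach (not the hanchors script itself), here is a minimal sketch using HTML::TokeParser, which ships with the HTML-Parser distribution. The function name link_phrases and the sample HTML are made up for this example, and it assumes the module is installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the text of every <a href=...>...</a> element.
sub link_phrases {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @phrases;
    while (my $tag = $parser->get_tag('a')) {
        # $tag->[1] is a hash of the start tag's attributes
        next unless defined $tag->[1]{href};   # skip <a name=...> anchors
        # text up to the closing </a>, trimmed and whitespace-collapsed
        push @phrases, $parser->get_trimmed_text('/a');
    }
    return @phrases;
}

# Sample input, invented for illustration
my $html = <<'HTML';
<a name="perl" href="http://www.perl.org">Perl
    Mongers</a>
<a href="http://www.perl.org" name=">>>perl<<<">Perl Mongers</a>
HTML

print ">>>$_<<<\n" for link_phrases($html);
```

get_trimmed_text also takes care of the leading/trailing and internal whitespace clean-up, so no extra substitutions are needed in this sketch.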

Sinan
--
A. Sinan Unur

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

seven.reeds · Apr 24, 2006

Thanks

The anchors script is largely what I am looking for.

all the best

Tad McClellan · Apr 24, 2006

seven.reeds said:
I have a series of files with HTML tags in them. I am NOT trying to
strip the tags

Nonetheless, the primary point in the "How do I remove HTML from a string?"
FAQ answer is: don't use regular expressions for this.

I am however trying to list the "link phrases"
associated with all of the "<a href=...>link phrase</a>" sequences in
each file.

I would recommend using a module that already does that for you, such as:

http://search.cpan.org/~bdfoy/HTML-SimpleLinkExtor-1.12/SimpleLinkExtor.pm

I have a script that does what I want.

I think it only "appears" to do what you want.

You just haven't tried it with a test case that trips it up yet.

I just need it to
be improved a bit

It is a dirty hack.

If proper operation is of importance, then it needs to be thrown
away and replaced with something more robust.

and that's why I am here.

OK. So let's patch it up anyway, just as a "learning exercise".

my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;

Let Perl do the save-and-restore for you. This does the same thing:

my $text;
{ local $/; # a naked block creates a scope
$text = <>;
}
# $/ has been restored to its previous value here

Or, probably even better:
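
One compact form that fits here is the do-block idiom shown earlier in the thread. This is a sketch offered as an assumption, not a quotation, and it uses an in-memory filehandle with made-up sample data in place of the real input so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sample data; a real script would read from <> or a file.
my $sample = "first line\nsecond line\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

# local $/ is undone when the do block ends;
# the value of the block is the entire remaining input.
my $text = do { local $/; <$fh> };
close $fh;

print length($text), " characters read\n";
```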

my $tmp = "";

while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)

Whitespace is not allowed between "<" and the tag name, so your pattern
should not allow it there either: drop the \s* that follows "<".

The m//s modifier changes the meaning of dot, it is useless when
your pattern contains no dot.
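
A small sketch of that point, using an invented two-line string: /s only matters when the pattern has a dot, while \s already matches newlines with or without it.

```perl
use strict;
use warnings;

my $text = "link\nphrase";   # sample string with an embedded newline

my $dot_without_s = ($text =~ /link.phrase/)   ? 1 : 0;  # dot does not match "\n"
my $dot_with_s    = ($text =~ /link.phrase/s)  ? 1 : 0;  # /s lets dot match "\n"
my $ws_without_s  = ($text =~ /link\s+phrase/) ? 1 : 0;  # \s matches "\n" anyway

print "dot without /s: $dot_without_s\n";   # 0
print "dot with /s:    $dot_with_s\n";      # 1
print "\\s without /s:  $ws_without_s\n";   # 1
```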

{
$text = $';
if ($text =~ /<\s*\/A\s*>/is)

No unallowed spaces, no "s" modifier, as above.

If you choose an alternate delimiter for your m//, then you
won't have to backslash slashes:
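
For instance, a minimal sketch with braces as the delimiter (the sample string is made up):

```perl
use strict;
use warnings;

my $text = "Perl Mongers</A> trailing text";

# Same pattern written twice: slash delimiters need the backslash,
# brace delimiters do not.
my $with_slashes = ($text =~ /<\/A>/i) ? 1 : 0;
my $with_braces  = ($text =~ m{</A>}i) ? 1 : 0;

print "slashes: $with_slashes, braces: $with_braces\n";
```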

{
$tmp = $`;
#$tmp =~ s/^\s+/ /sg;
#$tmp =~ s/\s+$/ /sg;
#$tmp =~ s/\s+/ /sg;
print STDOUT ">>>$tmp<<<\n";
$text = $';
}
}

Try your code with these:

<a name="perl" href="http://www.perl.org">Perl Mongers</a>

<a href="http://www.perl.org" name=">>>perl<<<">Perl Mongers</a>
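
To see what happens, here is a sketch that wraps the loop from the original post in a function (the name regex_link_phrases is invented for this example) and feeds it the two test cases above.

```perl
use strict;
use warnings;

# The original poster's matching logic, collected into a list
# instead of printed.
sub regex_link_phrases {
    my ($text) = @_;
    my @phrases;
    while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/i) {
        $text = $';
        if ($text =~ /<\s*\/A\s*>/i) {
            push @phrases, $`;
            $text = $';
        }
    }
    return @phrases;
}

my @first  = regex_link_phrases(
    '<a name="perl" href="http://www.perl.org">Perl Mongers</a>');
my @second = regex_link_phrases(
    '<a href="http://www.perl.org" name=">>>perl<<<">Perl Mongers</a>');

print "first case found ", scalar(@first), " link(s)\n";
print "second case: >>>$_<<<\n" for @second;
```

The first link is missed because HREF is not the first attribute, and in the second the [^>]+ stops at the first ">" inside the quoted name value, so part of the tag leaks into the "link phrase".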

any ideas?

Start over (with a module).

DJ Stunks · Apr 25, 2006

Tad said:
I would recommend using a module that already does that for you, such as:

http://search.cpan.org/~bdfoy/HTML-SimpleLinkExtor-1.12/SimpleLinkExtor.pm

I don't believe HTML::LinkExtor (upon which HTML::SimpleLinkExtor is
built) extracts the link text, only the link itself.

-jp

Lukas Mai · Apr 25, 2006

seven.reeds said:
the code so far is:

use strict;
select(STDIN);

The other posters seem to have missed this.
select() changes the current _output_ filehandle. I have no idea what
you're trying to achieve by selecting STDIN.

$|++;

$| changes the behavior of print. This line has no effect as you don't
print to STDIN.
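
A short sketch of how select() and $| interact, assuming the goal was simply unbuffered output. The autoflush method comes from the core IO::Handle module.

```perl
use strict;
use warnings;
use IO::Handle;   # core module; provides the autoflush method

# One-argument select() switches the default *output* handle
# and returns the previously selected one.
my $previous = select(STDERR);
$| = 1;                 # applies to STDERR, the currently selected handle
select($previous);      # make STDOUT the default for print again

STDOUT->autoflush(1);   # same effect as "$| = 1" while STDOUT is selected
print "unbuffered output\n";
```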

my $sep = $/;
undef $/;
my $text = <>;
$/ = $sep;

Eww, use File::Slurp or local $/ here.

my $tmp = "";

while ($text =~ /<\s*A\s+HREF\s*=[^>]+>/is)

This /s has no effect. Why did you put it there?

{
$text = $';
if ($text =~ /<\s*\/A\s*>/is)

This /s has no effect. Why did you put it there?

{

[snip]

MJD's Good Advice #11924 comes to mind.

Lukas

Anno Siegel · Apr 25, 2006

Tad McClellan said:
seven.reeds wrote:

[good advice snipped]

Apart from everything else, uncommenting the commented substitutions will
change what $' contains at the end of the block. "$text = $'" should
come before any additional matches. Also, the commented s/// do not
strip leading and trailing white space but reduce them to a single blank.
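
Both points can be sketched as follows. The helper name tidy_phrase and the sample text are made up for this example; the substitutions use an empty replacement at the ends and a single blank for internal runs.

```perl
use strict;
use warnings;

# Hypothetical helper: trim both ends, then collapse internal whitespace.
sub tidy_phrase {
    my ($phrase) = @_;
    $phrase =~ s/^\s+//;      # empty replacement strips leading whitespace
    $phrase =~ s/\s+$//;      # likewise for trailing whitespace
    $phrase =~ s/\s+/ /g;     # remaining runs become one blank
    return $phrase;
}

my $text = "  Perl\n   Mongers \n</a> rest of the page";
if ($text =~ m{</a>}i) {
    # Copy both right away; a later successful s/// inlined in this
    # block would overwrite $` and $'.
    my ($before, $after) = ($`, $');
    my $phrase = tidy_phrase($before);
    print ">>>$phrase<<<\n";
    $text = $after;
}
```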

Anno
