I want to be able to replace text inside html tags. I am using a regex to
extract the text, but after I modify the text how can I re-assemble
the html tag? Here is an example:
<font size=1> HI </font>
I need to replace HI with BYE and re-assemble the html tag like below:
<font size=1> BYE </font>
Others have suggested using a parser module - and they're right. That should
always be your first instinct when working with HTML. However, there are
some scenarios where a regex is good enough, and faster to write than a
parser-based solution. One example is a very simple search-and-replace
across a number of pages where you know a given pattern will match.
Another is fixing pages that are broken beyond a parser's ability to
cope with them.
With that in mind, have a look at "perldoc perlretut", paying special
attention to the section titled "Extracting matches". You can wrap parts
of your regex in parentheses and then use "backreferences" ($1, $2, ...)
to put those parts of the matched string into the replacement, like this:
#!/usr/bin/perl
use strict;
use warnings;
my $html = '<font size=1> HI </font><font size=1> HELLO </font>';
$html =~ s%(<font size=1>)(.*?)(</font>)%$1 BYE $3%g;
print $html, "\n";
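If the new text has to be computed from the old text rather than hard-coded, the /e modifier may help: it makes Perl evaluate the replacement as code. The following is only a sketch. The %new_for table and the new_text() helper are made-up names standing in for whatever modification you need, and the surrounding whitespace is folded into the first and third groups so that $2 holds just the text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lookup table: old text => new text.
my %new_for = ( HI => 'BYE' );

# Hypothetical helper: return the new text, or the old text unchanged
# if there is no entry for it in the table.
sub new_text {
    my ($old) = @_;
    return exists $new_for{$old} ? $new_for{$old} : $old;
}

my $html = '<font size=1> HI </font><font size=1> HELLO </font>';

# /e evaluates the replacement as Perl code, so $2 can be passed to a
# function and the tag re-assembled around whatever it returns.
$html =~ s%(<font size=1>\s*)(.*?)(\s*</font>)%$1 . new_text($2) . $3%ge;

print $html, "\n";
# prints: <font size=1> BYE </font><font size=1> HELLO </font>
```

Only the first tag changes here, because HELLO has no entry in the table.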
Aside from subexpressions and backreferences, another point of note is the
"non-greedy" quantifier "*?". Without it - i.e. written as "*" - the second
subexpression would be "greedy", meaning it would match the longest possible
string that still lets the rest of the regex match. In the example above, that
would mean replacing everything between the first '<font size=1>' and the
*second* '</font>'. (Try it!)
That's not what you want - you want the *shortest* string that matches the
expression, not the longest. That's what the "non-greedy" quantifier gives
you.
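To see the difference side by side, here is a small sketch that runs both forms against the same string. The variable names $lazy and $greedy are just for illustration, and the middle group is dropped because its contents are not reused.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<font size=1> HI </font><font size=1> HELLO </font>';

# Non-greedy: each <font>...</font> pair is matched on its own.
(my $lazy = $html) =~ s%(<font size=1>).*?(</font>)%$1 BYE $2%g;

# Greedy: .* runs from the first opening tag to the last closing tag,
# so the two elements collapse into one.
(my $greedy = $html) =~ s%(<font size=1>).*(</font>)%$1 BYE $2%g;

print "non-greedy: $lazy\n";
print "greedy:     $greedy\n";
# non-greedy: <font size=1> BYE </font><font size=1> BYE </font>
# greedy:     <font size=1> BYE </font>
```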
Just to restate it - regexes are generally *not* the best way to parse HTML,
particularly arbitrary HTML that's fetched from a web site that's beyond
your control. But using them *can* be useful if the task at hand is extremely
limited, or if the HTML is broken beyond a parser's ability to handle it.
References:
perldoc perlretut
perldoc perlre
sherm--