Replace text inside html tags?

squash

I want to be able to replace text inside HTML tags. I am using a regex to
extract the text, but after I modify the text, how can I reassemble
the HTML tag? Here is an example:

<font size=1> HI </font>

I need to replace HI with BYE and reassemble the HTML tag as shown below:

<font size=1> BYE </font>

I checked perldoc -q html but could not find the answer there.

Thx!
 
A. Sinan Unur

(e-mail address removed) wrote in @z14g2000cwz.googlegroups.com:
> I checked perldoc -q html but could not find the answer there.

The answer to your question can be found in the answer to the FAQ.

The most correct way (albeit not the fastest) is to use HTML::Parser
from CPAN.

....

Many folks attempt a simple-minded regular expression approach, like
"s/<.*?>//g", but that fails in many cases because the tags may
continue over line breaks, they may contain quoted angle-brackets,
or HTML comment may be present. Plus, folks forget to convert
entities--like "&lt;" for example.

That is, you need to use an HTML parser to parse HTML.

See CPAN for HTML parser modules.
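
To make the FAQ's warning concrete, here is a minimal sketch that runs the naive pattern over three small inputs, one for each failure mode it mentions. The sample strings and the hash names %input and %stripped are made up for illustration; they are not from the FAQ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample inputs, one per failure mode named in the FAQ.
my %input = (
    multiline => "<a\nhref='x'>link</a>",              # tag continues over a line break
    quoted    => q{<img alt="a > b" src="x.png">pic},  # quoted angle bracket in an attribute
    comment   => "<!-- <b>old</b> -->text",            # markup inside an HTML comment
);

my %stripped;
for my $name (sort keys %input) {
    # Apply the simple-minded tag stripper to a copy of each input.
    (my $copy = $input{$name}) =~ s/<.*?>//g;
    $stripped{$name} = $copy;
    print "$name: [$copy]\n";
}
```

In each case something is left behind that should not be: the opening tag survives in the first input because "." does not match a newline, attribute debris remains in the second, and the commented-out text plus a stray "-->" leak through in the third.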

I had never used HTML::TokeParser::Simple, so I gave that a shot:

#! /usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<HTML;
<font><!--
<font> HI
</font>
-->
HI
</font>
HTML

my $p = HTML::TokeParser::Simple->new(string => $html);

my $in_font_tag;

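while (my $token = $p->get_token) {
    # Opening <font>: print it unchanged and start replacing text.
    if ($token->is_start_tag('font')) {
        print $token->as_is;
        $in_font_tag = 1;
        next;
    }
    # Closing </font>: print it unchanged and stop replacing text.
    if ($token->is_end_tag('font')) {
        print $token->as_is;
        $in_font_tag = 0;
        next;
    }
    # Text inside a font element: substitute, then print.
    if ($in_font_tag and $token->is_text) {
        my $text = $token->as_is;
        $text =~ s/HI/BYE/g;
        print $text;
        next;
    }
    # Everything else (comments, other tags, outside text) passes through.
    print $token->as_is;
}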

__END__

C:\Dload> h
<font><!--
<font> HI
</font>
-->
BYE
</font>

Seems to work.

Sinan.
 
Gunnar Hjalmarsson

> I need to replace HI with BYE and re-assemble html tag like below:
>
> <font size=1> BYE </font>

Depending on the complexity of the document, the s/// operator may be
sufficient.

> I checked perldoc -q html but could not find the answer there.

Then you should have seen for instance

perldoc -q "remove HTML"

and other entries in perlfaq9 which warn against trying to parse HTML
documents with regular expressions, and recommend the use of a suitable
module for HTML parsing.
 
Sherm Pendley

Others have suggested using a parser module - and they're right. That should
always be your first instinct when working with HTML. However, there are
some scenarios where a regex is good enough, and faster to write than a
parser-based solution. For example, if the task at hand is a very simple
search-and-replace across a number of pages where you know a given pattern
will match. Or you're fixing pages that are broken beyond a parser's
ability to cope with them.

With that in mind, have a look at "perldoc perlretut", paying special
attention to the section titled "Extracting matches". You can use
"backreferences" in your regex to use parts of the matched string in the
replacement, like this:

#!/usr/bin/perl
use strict;
use warnings;

my $html = '<font size=1> HI </font><font size=1> HELLO </font>';

$html =~ s%(<font size=1>)(.*?)(</font>)%$1 BYE $3%g;

print $html, "\n";

Aside from subexpressions and backreferences, another point of note is the
"non-greedy" quantifier "*?". Without it - i.e. written as "*" - the second
expression would be "greedy", meaning it would return the longest possible
string that matches the expression it modifies. In the example above, that
would mean replacing everything between the first '<font size=1>' and the
*second* '</font>'. (Try it!)

That's not what you want - you want the *shortest* string that matches the
expression, not the longest. That's what the "non-greedy" quantifier gives
you.
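
For anyone who would rather see the difference than try it, here is a small sketch that runs both forms of the substitution on the same input. The variable names $lazy and $greedy are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = '<font size=1> HI </font><font size=1> HELLO </font>';

# Non-greedy: each <font>...</font> pair is matched on its own.
(my $lazy = $html) =~ s%(<font size=1>)(.*?)(</font>)%$1 BYE $3%g;

# Greedy: .* runs on to the last </font>, so both elements are
# swallowed by a single match.
(my $greedy = $html) =~ s%(<font size=1>)(.*)(</font>)%$1 BYE $3%g;

print "non-greedy: $lazy\n";
print "greedy:     $greedy\n";
```

The non-greedy version keeps two font elements, each containing BYE. The greedy version leaves only one, because the first closing tag and the second opening tag were part of what got replaced.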

Just to restate it - regexes are generally *not* the best way to parse HTML,
particularly arbitrary HTML that's fetched from a web site that's beyond
your control. But using them *can* be useful if the task at hand is extremely
limited, or if the HTML is broken beyond a parser's ability to handle it.
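
One thing to watch with the s/// in this post: it replaces whatever sits between the tags, so HELLO becomes BYE as well. If only HI should change, a possible variation is to use the /e modifier and edit the captured text before putting the pieces back together. This is a sketch under assumptions of my own: the looser <font\b[^>]*> pattern, the word-boundary match on HI, and the sample string are not from the original question, and the same caveats about regexes and HTML still apply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two font elements plus a stray HI outside any tag.
my $html = '<font size=1> HI </font><font size=2> HELLO </font> HI';

$html =~ s{(<font\b[^>]*>)(.*?)(</font>)}{
    # Copy the captures first; the inner s/// resets $1, $2 and $3.
    my ($open, $text, $close) = ($1, $2, $3);
    $text =~ s/\bHI\b/BYE/g;     # change only the word HI
    "$open$text$close";          # reassemble the element
}gse;

print $html, "\n";
```

Here the opening tag is captured with its attributes intact, HELLO is left alone, and the HI outside the font elements is not touched.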

References:

perldoc perlretut
perldoc perlre

sherm--
 
Bart Lateur

A. Sinan Unur said:
I had never used HTML::TokeParser::Simple, so I gave that a shot:

I like to use the range operator ".." (in scalar context, the flip-flop)
in code with this kind of functionality. This shows IMO an aspect where a
tokeparser approach is vastly superior to raw usage of HTML::Parser.

while (my $token = $p->get_token) {
    if ($token->is_start_tag('font') .. $token->is_end_tag('font')) {
        if ($token->is_text) {
            my $text = $token->as_is;
            $text =~ s/HI/BYE/g;
            print $text;
            next;
        }
    }
    print $token->as_is;
}
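
For readers who have not met ".." in scalar context before, here is a minimal sketch of how the flip-flop behaves, using plain strings as a stand-in for parser tokens so that it runs without any CPAN module. The @tokens list and the string comparisons are hypothetical; the loop above uses the real token methods instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a token stream: strings, not token objects.
my @tokens = ('HI', '<font>', 'HI', '</font>', 'HI');

my @out;
for my $tok (@tokens) {
    my $text = $tok;
    # The flip-flop turns true on the token that satisfies the left
    # test and stays true through the token that satisfies the right
    # test, inclusive. Outside that range it is false.
    if ( ($tok eq '<font>') .. ($tok eq '</font>') ) {
        $text =~ s/HI/BYE/g;
    }
    push @out, $text;
}

print join(' ', @out), "\n";
```

Only the HI between the two tags is changed. The operator remembers its own state between iterations, which is why no separate flag variable is needed.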
 
