Replace text inside html tags?

squash · Jan 30, 2005

I want to be able to replace text inside HTML tags. I am using a regex to
extract the text, but after I modify the text, how can I re-assemble
the HTML tag? Here is an example:

<font>HI</font>

I need to replace HI with BYE and re-assemble the HTML tag like below:

<font>BYE</font>

I checked perldoc -q html but could not find the answer there.

Thx!

A. Sinan Unur · Jan 30, 2005

squash said:
I want to be able to replace text inside HTML tags. I am using a regex to
extract the text, but after I modify the text, how can I re-assemble
the HTML tag? Here is an example:

<font>HI</font>

I need to replace HI with BYE and re-assemble the HTML tag like below:

<font>BYE</font>

I checked perldoc -q html but could not find the answer there.

The answer to your question can be found in the answer to the FAQ.

The most correct way (albeit not the fastest) is to use HTML::Parser
from CPAN.

....

Many folks attempt a simple-minded regular expression approach, like
"s/<.*?>//g", but that fails in many cases because the tags may
continue over line breaks, they may contain quoted angle-brackets,
or HTML comments may be present. Plus, folks forget to convert
entities--like "&lt;" for example.

That is, you need to use an HTML parser to parse HTML.

See CPAN for HTML parser modules.
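
The failure modes listed in the FAQ excerpt can be reproduced with a few lines of core Perl. This is only a sketch: the helper name naive_strip and the three sample inputs are made up for illustration and do not come from the FAQ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Naive tag stripper using the pattern quoted in the FAQ excerpt.
# (naive_strip is a hypothetical helper name.)
sub naive_strip {
    my ($html) = @_;
    (my $text = $html) =~ s/<.*?>//g;
    return $text;
}

# Hypothetical inputs, one per failure mode named in the excerpt.
my @cases = (
    [ 'tag split across lines' => qq{<a\nhref="x">link</a>} ],
    [ 'quoted angle bracket'   => q{<img alt="a > b" src="x.png">text} ],
    [ 'HTML comment'           => q{<!-- if a > b -->visible} ],
);

for my $case (@cases) {
    my ($label, $html) = @$case;
    # Brackets make leftover whitespace and tag debris visible.
    printf "%-24s => [%s]\n", $label, naive_strip($html);
}
```

In all three cases the result still contains tag debris: "." does not match a newline without /s, and the first ">" ends the match even when it sits inside a quoted attribute or a comment.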

I had never used HTML::TokeParser::Simple, so I gave that a shot:

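#! /usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

# Sample input: a bare <font> element is assumed (the loop below
# looks for 'font').
my $html = <<HTML;
<font>HI</font>
HTML

my $p = HTML::TokeParser::Simple->new(string => $html);

my $in_font_tag;

while (my $token = $p->get_token) {
    # Opening <font>: echo it and start rewriting text.
    if ($token->is_start_tag('font')) {
        print $token->as_is;
        $in_font_tag = 1;
        next;
    }
    # Closing </font>: echo it and stop rewriting text.
    if ($token->is_end_tag('font')) {
        print $token->as_is;
        $in_font_tag = 0;
        next;
    }
    # Text inside the font element: apply the substitution.
    if ($in_font_tag and $token->is_text) {
        my $text = $token->as_is;
        $text =~ s/HI/BYE/g;
        print $text;
        next;
    }
    # Everything else passes through unchanged.
    print $token->as_is;
}

__END__

C:\Dload> h

<font>BYE</font>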

Seems to work.

Sinan.

Gunnar Hjalmarsson · Jan 30, 2005

squash said:
I want to be able to replace text inside HTML tags. I am using a regex to
extract the text, but after I modify the text, how can I re-assemble
the HTML tag? Here is an example:

<font>HI</font>

I need to replace HI with BYE and re-assemble the HTML tag like below:

<font>BYE</font>

Depending on the complexity of the document, the s/// operator may be
sufficient.

squash said:
I checked perldoc -q html but could not find the answer there.

Then you should have seen for instance

perldoc -q "remove HTML"

and other entries in perlfaq9 which warn against trying to parse HTML
documents with regular expressions, and recommend the use of a suitable
module for HTML parsing.

Sherm Pendley · Jan 30, 2005

squash said:
I want to be able to replace text inside HTML tags. I am using a regex to
extract the text, but after I modify the text, how can I re-assemble
the HTML tag? Here is an example:

<font>HI</font>

I need to replace HI with BYE and re-assemble the HTML tag like below:

<font>BYE</font>

Others have suggested using a parser module - and they're right. That should
always be your first instinct when working with HTML. However, there are
some scenarios where a regex is good enough, and faster to write than a
parser-based solution. For example, if the task at hand is a very simple
search-and-replace across a number of pages where you know a given pattern
will match. Or you're fixing pages that are broken beyond a parser's
ability to cope with them.

With that in mind, have a look at "perldoc perlretut", paying special
attention to the section titled "Extracting matches". You can use
"backreferences" in your regex to use parts of the matched string in the
replacement, like this:

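#!/usr/bin/perl
use strict;
use warnings;

# Sample input: bare <font> elements are assumed.
my $html = '<font>HI</font> <font>HELLO</font>';

# $1 = opening tag, $2 = old text (discarded), $3 = closing tag
$html =~ s%(<font>)(.*?)(</font>)%$1 BYE $3%g;

print $html, "\n";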

Aside from subexpressions and backreferences, another point of note is the
"non-greedy" quantifier "*?". Without it - i.e. written as "*" - the second
expression would be "greedy", meaning it would return the longest possible
string that matches the expression it modifies. In the example above, that
would mean replacing everything between the first '<font>' and the
*second* '</font>'. (Try it!)

That's not what you want - you want the *shortest* string that matches the
expression, not the longest. That's what the "non-greedy" quantifier gives
you.
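
To see the difference side by side, here is a small sketch that runs both quantifiers over the same string. The bare <font> tags in the sample string are an assumption, and the variable names $lazy and $greedy are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample input: two separate font elements.
my $html = '<font>HI</font> <font>HELLO</font>';

# Non-greedy: (.*?) stops at the nearest </font>, so each element
# is rewritten on its own.
(my $lazy = $html) =~ s%(<font>)(.*?)(</font>)%$1BYE$3%g;

# Greedy: (.*) runs to the last </font>, so everything between the
# first opening tag and the final closing tag is replaced at once.
(my $greedy = $html) =~ s%(<font>)(.*)(</font>)%$1BYE$3%g;

print "non-greedy: $lazy\n";
print "greedy:     $greedy\n";
```

The non-greedy version keeps both elements, while the greedy version collapses them into a single one.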

Just to restate it - regexes are generally *not* the best way to parse HTML,
particularly arbitrary HTML that's fetched from a web site that's beyond
your control. But using them *can* be useful if the task at hand is extremely
limited, or if the HTML is broken beyond a parser's ability to handle it.

References:

perldoc perlretut
perldoc perlre

sherm--

Bart Lateur · Jan 31, 2005

A. Sinan Unur said:
I had never used HTML::TokeParser::Simple, so I gave that a shot:

my $p = HTML::TokeParser::Simple->new(string => $html);

my $in_font_tag;

while (my $token = $p->get_token) {
    if ($token->is_start_tag('font')) {
        print $token->as_is;
        $in_font_tag = 1;
        next;
    }
    if ($token->is_end_tag('font')) {
        print $token->as_is;
        $in_font_tag = 0;
        next;
    }
    if ($in_font_tag and $token->is_text) {
        my $text = $token->as_is;
        $text =~ s/HI/BYE/g;
        print $text;
        next;
    }
    print $token->as_is;
}

I like to use ".." in code with this kind of functionality. This shows
IMO an aspect where a tokeparser approach is vastly superior to raw
usage of HTML::Parser.

while (my $token = $p->get_token) {
    if ($token->is_start_tag('font') .. $token->is_end_tag('font')) {
        if ($token->is_text) {
            my $text = $token->as_is;
            $text =~ s/HI/BYE/g;
            print $text;
            next;
        }
    }
    print $token->as_is;
}
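
For anyone who has not met ".." in scalar context (the "flip-flop" form of the range operator), here is a sketch that needs no CPAN modules. It stands in plain strings for parser tokens; the @tokens list and the @out buffer are made up for illustration and are not what HTML::TokeParser::Simple actually returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical token stream: plain strings instead of parser tokens.
my @tokens = ('<p>', 'HI', '<font>', 'HI', '</font>', 'HI', '</p>');
my @out;

for my $token (@tokens) {
    # The flip-flop turns true when the left test succeeds and stays
    # true up to and including the token where the right test succeeds.
    if ( ($token eq '<font>') .. ($token eq '</font>') ) {
        # Inside the range, rewrite anything that is not a tag.
        if ($token !~ /^</) {
            push @out, 'BYE';
            next;
        }
    }
    push @out, $token;
}

print join('', @out), "\n";
```

Only the HI between the font tags is replaced; the ones before and after the range pass through untouched, with no explicit state variable to maintain.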
