matching HTML expression across multiple lines

H.S.

Hello,

I am trying to make changes to a website made by somebody else,
apparently in Dreamweaver. I do not have that application to work with
(never used it) and I am working in Linux only (Debian, Etch).

I have searched Google on how to do this and have written a small Perl
script in the process. I am quite new to Perl (not new to Linux, and I
have some basic knowledge of expression matching in sed and grep).

I want to replace the following code (which may or may not be on a
single line):
##############################################
<td height="30" valign="top"><a href="db/index.php"
onmouseout="MM_swapImgRestore()"
onmouseover="MM_swapImage('db','','img_after.gif',1)"><img
name="db" border="0" src="img_before.gif" width="100" height="30"></a></td>
###############################################

with a slightly modified expression. For simplicity, let us say I want
to replace it with "CHANGED" string. Note that the above code is just a
part of the HTML file which contains similar code before and after this
para (consequently, it will have <td>.*</td> before and after this para).

Here is my attempt at doing this:
############################################
#!/usr/bin/perl

local $/; # slurp mode ( to read the file into a var )

#define the name of the data file to modify
my $data_file="intro.php";
#open the data file
open DAT, $data_file or die "Could not open file: $!";

#variable to hold the data file (it will hold *all* of it!)
my $contents;

#read in the contents of the data file
$contents=<DAT>;

#print the file contents
#print $contents;

if ($contents =~ s/<td.*db_before.*td>/CHANGED/s){
print $contents;
print "Changed\n";
}
############################################

The problem is, I think, that it matches the very first <td and
continues on until a </td> after db_before. From the results, this is not
working at all. Suggestions? Or even a different way to perform this
editing? The editing is to be performed on this block of code in many
files. And as I said earlier, the block of code is either in the shape I
mentioned above or is on one single line. The latter case is not a problem.

thanks,
->HS
 
Lars Eighner

H.S. said:
The problem is, I think, that it matches the very first <td and
continues on until a </td> after db_before. From the results, this is not
working at all. Suggestions? Or even a different way to perform this
editing? The editing is to be performed on this block of code in many
files. And as I said earlier, the block of code is either in the shape I
mentioned above or is on one single line. The latter case is not a problem.

This is a frequently asked sort of question. The right answer is that you
should not attempt to parse HTML with regular expressions. There are many
ways to go wrong, and you have found one of them. You should install
the HTML::Parser module. However, how to use it may not be entirely
obvious, so you might search Google for some help or tutorials.

Your particular problem is that regular expressions are greedy and by
default will match the longest string they possibly can. There are a number
of ways to degreedify them, which you can discover in the perlre man page.
And truth to tell, you might get away with degreedifying your regular
expressions in this particular and limited case. But as I have said, there
are many ways to go wrong with HTML, and HTML::Parser, which can deal with
"real world" HTML (i.e. can parse like browsers do, making allowances for
not-strictly-valid markup), is the right way.
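
To make the greedy behaviour concrete, here is a minimal sketch. The sample string and variable names are made up for illustration and are not taken from the poster's file:

```perl
use strict;
use warnings;

# Hypothetical sample: two table cells on separate lines.
my $html = qq{<td>first</td>\n<td>second</td>};

# Greedy: .* grabs as much as it can, so the match runs to the LAST </td>.
my ($greedy) = $html =~ m{(<td>.*</td>)}s;

# Non-greedy: .*? grabs as little as it can, so the match ends at the
# FIRST </td> that lets the whole pattern succeed.
my ($lazy) = $html =~ m{(<td>.*?</td>)}s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The /s modifier lets the dot match newlines, which is what makes the multi-line case possible in the first place.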
 
Ala Qumsieh

Lars said:
This is a frequently asked sort of question. The right answer is that you
should not attempt to parse HTML with regular expressions. There are many
ways to go wrong, and you have found one of them. You should install
the HTML::Parser module.

For production code, I would whole-heartedly agree. But the OP seems to need
a quick fix for a very specific case that shouldn't take more than a few
lines of code. In this case, I wouldn't suggest using a module. It's
usually much faster and simpler to just whip up a quick regexp to suit your
needs. There's nothing wrong with that, as long as you throw away your
script once you're done.

--Ala
 
Tad McClellan

H.S. said:
I want to replace the following code (which may or may not be on a
single line):
##############################################
<td height="30" valign="top"><a href="db/index.php"
onmouseout="MM_swapImgRestore()"
onmouseover="MM_swapImage('db','','img_after.gif',1)"><img
name="db" border="0" src="img_before.gif" width="100" height="30"></a></td>
###############################################
[snip]

if ($contents =~ s/<td.*db_before.*td>/CHANGED/s){


The string "db_before" does not appear in your data, so
the match is *supposed* to fail.
 
H.S.

Tad said:
I want to replace the following code (which may or may not be on a
single line):
##############################################
<td height="30" valign="top"><a href="db/index.php"
onmouseout="MM_swapImgRestore()"
onmouseover="MM_swapImage('db','','img_after.gif',1)"><img
name="db" border="0" src="img_before.gif" width="100" height="30"></a></td>
###############################################

[snip]


if ($contents =~ s/<td.*db_before.*td>/CHANGED/s){



The string "db_before" does not appear in your data, so
the match is *supposed* to fail.

Sorry, db_before is from the actual data. I forgot to change that in the
example data. So for my example above, consider that the expression
supposed to occur between the two td tags is img_after.

->HS
 
A. Sinan Unur

H.S. said:
I want to replace the following code (which may or may not be on a
single line):
##############################################
<td height="30" valign="top"><a href="db/index.php"
onmouseout="MM_swapImgRestore()"
onmouseover="MM_swapImage('db','','img_after.gif',1)"><img
name="db" border="0" src="img_before.gif" width="100"
height="30"></a></td>
###############################################

[snip]


if ($contents =~ s/<td.*db_before.*td>/CHANGED/s){



The string "db_before" does not appear in your data, so
the match is *supposed* to fail.

Sorry, db_before is from the actual data. I forgot to change that in the
example data. So for my example above, consider that the expression
supposed to occur between the two td tags is img_after.

Instead, please consider posting a short but complete script with the
correct data in the __DATA__ section as described in the posting
guidelines.

Make it easy for others to help you.

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
Tad McClellan

H.S. said:
So for my example above, consider that the expression
supposed to occur between the two td tags is img_after.


For a Real Program, use a module that understands HTML for
processing HTML data.

As a dirty hack, find all of the <td> elements and change
just the one(s) you want:

$contents =~ s{ (<td      # start tag
                 .*?      # rest of start tag, content
                 </td>    # end tag
                )
              }
              { my $s = $1;
                $s =~ /img_after/ ? 'CHANGED' : $s;
              }gsex;
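
For anyone who wants to try the per-cell approach above end to end, here is a minimal self-contained sketch. The three-cell sample data, the `\b` after `<td`, and the `$hits` counter are my own additions for illustration; it assumes the cells are not nested inside one another:

```perl
use strict;
use warnings;

# Hypothetical sample: three cells, only the middle one mentions img_after.
my $contents = <<'END_HTML';
<tr>
<td height="30"><a href="a/index.php"><img
name="a" src="a_before.gif"></a></td>
<td height="30" valign="top"><a href="db/index.php"
onmouseout="MM_swapImgRestore()"
onmouseover="MM_swapImage('db','','img_after.gif',1)"><img
name="db" border="0" src="img_before.gif" width="100" height="30"></a></td>
<td height="30"><a href="c/index.php"><img
name="c" src="c_before.gif"></a></td>
</tr>
END_HTML

my $hits = 0;    # how many cells were actually rewritten

# Match every <td>...</td> on its own (non-greedy), then decide per cell
# whether to replace it or put it back unchanged.
$contents =~ s{ ( <td \b .*? </td> ) }{
    my $cell = $1;
    $cell =~ /img_after/ ? do { $hits++; 'CHANGED' } : $cell;
}gsex;

print $contents;
print "cells changed: $hits\n";
```

Because each cell is matched separately, the cells before and after the target survive untouched. To apply something like this to many files, Perl's -0777 (slurp), -p and -i.bak command-line switches can run a substitution in place while keeping backups; trying it on copies first seems prudent.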
 
Steve Kostecke

Lars said:
This is a frequently asked sort of question. The right answer is
that you should not attempt to parse HTML with regular expressions.
There are many ways to go wrong, and you have found one of them. You
should install the HTML::Parser module. However, how to use it may
not be entirely obvious, so you might cruise google for some
help/tutorials.

Your particular problem is that regular expressions are greedy and by
default will match the longest string they possibly can. There are a
number of ways to degreedify them, which you can discover in the
perlre man page. And the truth to tell, you might get away with
degreedifying your regular expressions in this particular and limited
case.

Is it really that difficult to say, ``add a question mark after .*
(quantified expressions) to degreedify them; eg: .*? ''
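
One caveat worth checking before relying on .*? alone: the regex engine still starts at the leftmost place a match can begin, so a non-greedy version of the original pattern can swallow the cells before the target. A minimal sketch, with made-up sample data and variable names; the second pattern is one possible workaround that I am assuming fits simple, non-nested cells, not a general HTML solution:

```perl
use strict;
use warnings;

# Hypothetical sample: the target cell is the second of three.
my $html = join "\n",
    '<td><img src="a.gif"></td>',
    '<td><img src="img_after.gif"></td>',
    '<td><img src="c.gif"></td>';

# Non-greedy, but the match still begins at the FIRST <td in the string,
# so the first cell is replaced along with the target.
(my $naive = $html) =~ s/<td.*?img_after.*?td>/CHANGED/s;

# Workaround: forbid the part before img_after from crossing a </td>,
# so the match has to start inside the cell that contains img_after.
(my $tempered = $html)
    =~ s{<td\b(?:(?!</td>).)*img_after.*?</td>}{CHANGED}s;

print "naive:\n$naive\n\n";
print "tempered:\n$tempered\n";
```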
 
David Squire

Steve said:
Lars Eighner wrote:
[snip]
Your particular problem is that regular expressions are greedy and by
default will match the longest string they possibly can. There are a
number of ways to degreedify them, which you can discover in the
perlre man page. And the truth to tell, you might get away with
degreedifying your regular expressions in this particular and limited
case.

Is it really that difficult to say, ``add a question mark after .*
(quantified expressions) to degreedify them; eg: .*? ''

It's the old "give a man a fish" vs "teach a man to fish" issue.

DS
 
