Removing empty tags

jwcarlton

I've just started changing my processing over to HTML::HTML5::Parser,
so please bear with me on this.

I've been using a regex to remove empty tags, but I see one that's not
working so I assume there's either a typo, or an error in the logic.

I'm trying to convert this:

<span class="Apple-style-span" style="font-family: Arial, Verdana,
Helvetica, sans-serif; "><br></span>

To:

<br>

It should also catch <span...></span> (with nothing inside), or
<span...> </span> (with a whitespace inside).

"class" and "style" can be anything (or non-existent), so I'm just
trying to remove <span, followed by anything (or nothing) up to the first
>, then the following </span>

Here's what I'm using:

$text =~ s/<span[^>]*>\s*<\/span>/ /gi;
$text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;

This doesn't appear to work, though. The string I posted above
actually came through verbatim, so the match must have failed.

Of course, I know that this would fail on nested <span></span> tags,
which is why I'm switching over to HTML::HTML5::Parser. But in the
meantime, why did this one not match?
 
jwcarlton

> It works for me.
> ------------------------
> #!/usr/bin/perl
> use warnings;
> use strict;
>
> $_ = '<span class="Apple-style-span" style="font-family: Arial, Verdana,
> Helvetica, sans-serif; "><br></span>';
>
> s/<span[^>]*>(<br>)*<\/span>/$1/gi;
>
> print "$_\n";
> ------------------------
>
> If you can post a short and complete program that we can run that
> duplicates the problem you are having, then we can surely help
> you fix it...


That's really pretty much all there is! I'll paste the whole function
below; the only thing I'm leaving out is the part at the top where it
declares a few variables, logs the user in (which doesn't affect the
$text variable), and then prints the data to MySQL.

The data comes from a contenteditable, and when people paste things it
needs to be manipulated a bit, which is mostly what this function
does. I don't have a sample of raw content (I don't save it before it
runs through the function), but here's a sample of a complete string
that was printed (I left the content because I thought you guys might
get a kick out of it):

<span class="Apple-style-span" style="font-family: Arial, Verdana,
Helvetica, sans-serif; "><b>"We ALL got problems....If you're gonna be
dumb, ya gotta be tough."</b></span><br><br><span
class="Apple-style-span" style="font-family: Arial, Verdana, Helvetica,
sans-serif; "><br></span>


And the function:

sub fixtext {
$text = $_[0];

$text =~ s/&nbsp;/ /gi;

# Convert <em> to <i> and <strong> to <b>, saves a few steps later
$text =~ s/<em>(.*?)<\/em>/<i>$1<\/i>/gsi;
$text =~ s/<strong>(.*?)<\/strong>/<b>$1<\/b>/gsi;

# Strip Javascript
$text =~ s/<script.*?>.*?<\/script>//gsi;
$text =~ s/onmouseover=".*?"//gsi;
$text =~ s/onclick=".*?"//gsi;

### Only Allow Specified Tags
my $lt=chr(1);
my $gt=chr(2);
$text =~ s/<br>/$lt br $gt/gi;

$text =~ s/<(\/{0,1})(div.*?)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(span.*?)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(table.*?)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(tr.*?)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(td.*?)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(b|p)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(u|i)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(font.*?)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(img.*?)>/$lt$1$2$gt/gsi;

# delete all other tags
$text =~ s/<.+?>//gs;

$text =~ s/$lt/</g;
$text =~ s/$gt/>/g;
$text =~ s/< br >/<br>/gi;
###

# Strip Word junk
$text =~ s/Normal 0 false.*?}//gsi;
$text =~ s/Normal 0 MicrosoftInternetExplorer4.*?}//gsi;
$text =~ s/\/\* Style Definitions \*\/.*?}//gsi;
$text =~ s/Normal\.dotm .*? false false//gsi;

$text =~ s/white-space: nowrap;*//gsi;
$text =~ s/style="(\s*)"//gsi;

# Strip empty tags
$text =~ s/<font[^>]*>\s*<\/font>/ /gi;
$text =~ s/<font[^>]*>(<br>)*<\/font>/<br><br>/gi;

$text =~ s/<span[^>]*>\s*<\/span>/ /gi;
$text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gsi;

$text =~ s/<i>(\s*)<\/i>/$1/gi;
$text =~ s/<b>(\s*)<\/b>/$1/gi;
$text =~ s/<u>(\s*)<\/u>/$1/gi;

$text =~ s/<div>\s*<\/div>/<br>/gi;
$text =~ s/<div>(.*?)<\/div>/<br><br>$1/gsi;

# Limit repeating characters
$text =~ s/(.)\1{4,}/$1$1$1$1/g;

# Strip opening, trailing, or repeating whitespace, <br>
$text =~ s/\s+/ /gs;
$text =~ s/^\s+|\s+$//g;

$text =~ s/(<br><br>)+/<br><br>/gi;
$text =~ s/^(<br>)+|(<br>)+$//gi;

return $text;
}
 
Wolf Behrenhoff

> That's really pretty much all there is! I'll paste the whole function
> below; the only thing I'm leaving out is the part at the top where it
> declares a few variables, logs the user in (which doesn't affect the
> $text variable), and then prints the data to MySQL.

We are not interested in whole long functions, only in the relevant
parts.

> The data comes from a contenteditable, and when people paste things it
> needs to be manipulated a bit, which is mostly what this function
> does. I don't have a sample of raw content (I don't save it before it
> runs through the function), but here's a sample of a complete string
> that was printed (I left the content because I thought you guys might
> get a kick out of it):

First: try the string you have posted. Your function will remove the
second span part!

And then: why don't you output the string before putting it in your
function? You need to look at the input!
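
A minimal sketch of that, using only core Perl: dump the raw string to
STDERR before fixtext touches it. The helper name show_raw and the sample
$input are made up for illustration and are not part of the posted code.
Data::Dumper's Useqq setting prints newlines and control characters as
escapes, so invisible differences from the string you posted become visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper (not in the posted code): returns a printable dump
# of the raw string with newlines and control characters escaped.
sub show_raw {
    my ($raw) = @_;
    local $Data::Dumper::Useqq = 1;   # show \n, \t and control chars as escapes
    local $Data::Dumper::Terse = 1;   # omit the "$VAR1 = " prefix
    return Dumper($raw);
}

# Made-up sample input; in the real script this would be the string that
# is about to be passed to fixtext().
my $input = "<span style=\"x\">\n<br></span>";
print STDERR show_raw($input);
```

Comparing that dump with the stored result should show which input the
substitutions actually saw.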

The explanation is probably simple: you are doing a lot of replacements,
and their order matters. Assume the input is "<span><br><b></b></span>".
Then you don't remove the span, because it still contains the <b></b>.
But later you remove the b. If you reversed the order, you would also
remove the span.
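
To make the order effect concrete, here is a small sketch that runs the
two relevant substitutions from the posted function on that example
string, once in the posted order and once reversed. The sub names
posted_order and reversed_order are invented for this illustration; the
regexes are copied from the function.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Span rule first, then the empty <b> rule (the order in the posted function).
sub posted_order {
    my ($text) = @_;
    $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;   # no match: <b></b> is still inside
    $text =~ s/<b>(\s*)<\/b>/$1/gi;                # empty <b> is removed too late
    return $text;
}

# Empty <b> rule first, then the span rule.
sub reversed_order {
    my ($text) = @_;
    $text =~ s/<b>(\s*)<\/b>/$1/gi;                # leaves <span><br></span>
    $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;   # now the span rule matches
    return $text;
}

my $input = '<span><br><b></b></span>';
print posted_order($input),   "\n";   # prints <span><br></span>
print reversed_order($input), "\n";   # prints <br>
```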

So you can try running the fixtext function more than once or try to
change the order of your 10000 replacements.
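
A rough sketch of the "run it more than once" idea, limited to the
empty-tag rules rather than the whole fixtext function: repeat the
substitutions until the string stops changing. The sub name
strip_empty_tags is hypothetical; the regexes are taken from the posted
function. Every replacement makes the string shorter, so the loop has to
terminate.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply the posted empty-tag rules repeatedly until
# a full pass leaves the string unchanged.
sub strip_empty_tags {
    my ($text) = @_;
    my $before;
    do {
        $before = $text;
        $text =~ s/<span[^>]*>\s*<\/span>/ /gi;
        $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;
        $text =~ s/<i>(\s*)<\/i>/$1/gi;
        $text =~ s/<b>(\s*)<\/b>/$1/gi;
        $text =~ s/<u>(\s*)<\/u>/$1/gi;
    } while ($text ne $before);
    return $text;
}

# Needs three passes: <b></b> goes first, then <i></i>, then the span.
print strip_empty_tags('<span><br><i><b></b></i></span>'), "\n";   # prints <br>
```

With a loop like this the order of the individual rules matters much
less, at the cost of a few extra passes over the string.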

- Wolf

Next time please try to post a short program that one can run without
changing/adding anything! Often writing such a short program will point
you to the problem so that you can solve it on your own.
 
