Removing empty tags

jwcarlton · Feb 24, 2011

I've just started changing my processing over to HTML::HTML5::Parser,
so please bear with me on this.

I've been using a regex to remove empty tags, but I see one that's not
working so I assume there's either a typo, or an error in the logic.

I'm trying to convert this:

 

To:

 

It should also catch <span...></span> (with nothing inside), or
<span...> </span> (with a whitespace inside).

"class" and "style" can be anything (or non-existent), so I'm just
trying to remove <span, followed by anything (or nothing) up to the first
>, then the following </span>.

Here's what I'm using:

$text =~ s/<span[^>]*>\s*<\/span>/ /gi;
$text =~ s/<span[^>]*>( )*<\/span>/$1/gi;

This doesn't appear to work, though. The string I posted above
actually came through verbatim, so the pattern must not have matched.

Of course, I know that this would fail on nested tags,
which is why I'm switching over to HTML::HTML5::Parser. But in the
meantime, why did this one not match?
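
For reference, the pattern described above can be exercised on a few hand-made inputs. This is only a minimal sketch: the sample strings are invented (the examples from the post did not survive), the helper name is_empty_span is hypothetical, and it assumes the non-whitespace filler worth catching is the &nbsp; entity.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains a span whose body
# is nothing but whitespace and/or &nbsp; entities (an assumption).
sub is_empty_span {
    my ($text) = @_;
    # <span, any attributes up to the first >, only filler, then </span>
    return $text =~ m{<span[^>]*>(?:\s|&nbsp;)*</span>}i ? 1 : 0;
}

# Invented sample inputs; none of these come from the original post.
my @samples = (
    '<span></span>',
    '<span class="x" style="color: red"> </span>',
    '<span>&nbsp;</span>',
    '<span><b></b></span>',    # another tag sits between open and close
);

for my $sample (@samples) {
    printf "%-45s %s\n", $sample,
        is_empty_span($sample) ? 'matches' : 'no match';
}
```

The last sample is the interesting one: the span looks empty to a reader, but the pattern sees a <b> element where it expects the closing tag.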

jwcarlton · Feb 24, 2011

It works for me.

------------------------
#!/usr/bin/perl
use warnings;
use strict;

$_ = ' ';

s/<span[^>]*>( )*<\/span>/$1/gi;

print "$_\n";
------------------------

If you can post a short and complete program that we can run that
duplicates the problem you are having, then we can surely help
you fix it...

That's really pretty much all there is! I'll paste the whole function
below; the only thing I'm leaving out is the part at the top where it
declares a few variables, logs the user in (which doesn't affect the
$text variable), and then prints the data to MySQL.

The data comes from a contenteditable, and when people paste things it
needs to be manipulated a bit, which is mostly what this function
does. I don't have a sample of raw content (I don't save it before it
runs through the function), but here's a sample of a complete string
that was printed (I left the content because I thought you guys might
get a kick out of it):

"We ALL got problems....If you're gonna be
dumb, ya gotta be tough." 

And the function:

sub fixtext {
$text = $_[0];

$text =~ s/ / /gi;

# Convert <em> to <i> and <strong> to <b>, saves a few steps later
$text =~ s/<em>(.*?)<\/em>/<i>$1<\/i>/gsi;
$text =~ s/<strong>(.*?)<\/strong>/<b>$1<\/b>/gsi;

# Strip Javascript
$text =~ s/<script.*?>.*?<\/script>//gsi;
$text =~ s/onmouseover=".*?"//gsi;
$text =~ s/onclick=".*?"//gsi;

### Only Allow Specified Tags
my $lt=chr(1);
my $gt=chr(2);
$text =~ s/ /$lt br $gt/gi;

$text =~ s/<(\/{0,1})(div.*?)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(span.*?)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(table.*?)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(tr.*?)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(td.*?)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(b|p)>/$lt$1$2$gt/gsi;
$text =~ s/<(\/{0,1})(u|i)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(font.*?)>/$lt$1$2$gt/gsi;

$text =~ s/<(\/{0,1})(img.*?)>/$lt$1$2$gt/gsi;

# delete all other tags
$text =~ s/<.+?>//gs;

$text =~ s/$lt/</g;
$text =~ s/$gt/>/g;
$text =~ s/ / /gi;
###

# Strip Word junk
$text =~ s/Normal 0 false.*?}//gsi;
$text =~ s/Normal 0 MicrosoftInternetExplorer4.*?}//gsi;
$text =~ s/\/\* Style Definitions \*\/.*?}//gsi;
$text =~ s/Normal\.dotm .*? false false//gsi;

$text =~ s/white-space: nowrap;*//gsi;
$text =~ s/style="(\s*)"//gsi;

# Strip empty tags
$text =~ s/<font[^>]*>\s*<\/font>/ /gi;
$text =~ s/<font[^>]*>( )*<\/font>/ /gi;

$text =~ s/<span[^>]*>\s*<\/span>/ /gi;
$text =~ s/<span[^>]*>( )*<\/span>/$1/gsi;

$text =~ s/<i>(\s*)<\/i>/$1/gi;
$text =~ s/<b>(\s*)<\/b>/$1/gi;
$text =~ s/<u>(\s*)<\/u>/$1/gi;

$text =~ s/<div>\s*<\/div>/ /gi;
$text =~ s/<div>(.*?)<\/div>/ $1/gsi;

# Limit repeating characters
$text =~ s/(.)\1{4,}/$1$1$1$1/g;

# Strip opening, trailing, or repeating whitespace, 
$text =~ s/\s+/ /gs;
$text =~ s/^\s+|\s+$//g;

$text =~ s/( )+/ /gi;
$text =~ s/^( )+|( )+$//gi;

return $text;
}

Wolf Behrenhoff · Feb 24, 2011

> That's really pretty much all there is! I'll paste the whole function
> below; the only thing I'm leaving out is the part at the top where it
> declares a few variables, logs the user in (which doesn't affect the
> $text variable), and then prints the data to MySQL.

We are not interested in whole long functions, only in the relevant
parts.

> The data comes from a contenteditable, and when people paste things it
> needs to be manipulated a bit, which is mostly what this function
> does. I don't have a sample of raw content (I don't save it before it
> runs through the function), but here's a sample of a complete string
> that was printed (I left the content because I thought you guys might
> get a kick out of it):

First: try the string you have posted. Your function will remove the
second span part!

And then: why don't you output the string before putting it in your
function? You need to look at the input!
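
One way to look at the input is to dump it with escapes before it reaches fixtext. The following is a sketch only: the helper name visible and the sample string are made up for illustration, while Data::Dumper and its Useqq and Terse settings are part of the core module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Show control characters and high bytes as escapes, so a tab, a
# newline, or a non-breaking space cannot hide inside the string.
$Data::Dumper::Useqq = 1;
# Drop the "$VAR1 = " prefix from the dump.
$Data::Dumper::Terse = 1;

# Hypothetical helper: returns a quoted, escaped copy of a string.
sub visible {
    my ($string) = @_;
    my $dump = Dumper($string);
    chomp $dump;
    return $dump;
}

# Invented input: \xA0 is a non-breaking space, which looks like an
# ordinary space when printed directly.
my $raw = "<span>\xA0</span>\n";
print STDERR 'before fixtext: ', visible($raw), "\n";
```

Logging the same string again after the function returns makes it easy to see which substitution did or did not fire.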

The solution is probably simple: you are doing a lot of replacements. Assume
the input is a span wrapping an empty b, e.g. "<span><b></b></span>". Then
you don't remove the span, but later you remove the b. If you reversed the
order, you would also remove the span.

So you can try running the fixtext function more than once or try to
change the order of your 10000 replacements.

- Wolf

Next time please try to post a short program that one can run without
changing/adding anything! Often writing such a short program will point
you to the problem so that you can solve it on your own.
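
The order dependence described above can also be sidestepped by repeating the empty-tag removal until nothing changes. This is a rough sketch under assumptions, not a drop-in for fixtext: the helper name strip_empty_tags is hypothetical, the tag list is guessed from the tags the function handles, and empty tags are deleted outright rather than replaced with a space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original fixtext: removes empty
# inline tags repeatedly, so that removing an inner tag can expose an
# outer one that has just become empty.
sub strip_empty_tags {
    my ($text) = @_;
    # The tag list is an assumption. \1 forces the closing tag to have
    # the same name as the opening tag; \b keeps <b from matching <br.
    1 while $text =~ s{<(span|font|b|i|u)\b[^>]*>\s*</\1>}{}gi;
    return $text;
}

# Invented input: the span only becomes empty once the b is gone.
print strip_empty_tags('a<span class="x"><b> </b></span>b'), "\n";   # prints "ab"
```

The loop stops as soon as a full pass makes no substitution, so the order in which the tags happen to be nested no longer matters. It is still a regex workaround; a real parser remains the more robust route.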
