Removing empty tags

Discussion in 'Perl Misc' started by jwcarlton, Feb 24, 2011.

  1. jwcarlton


    I've just started changing my processing over to HTML::HTML5::Parser,
    so please bear with me on this.

    I've been using a regex to remove empty tags, but I see one that's not
    working so I assume there's either a typo, or an error in the logic.

    I'm trying to convert this:

    <span class="Apple-style-span" style="font-family: Arial, Verdana,
    Helvetica, sans-serif; "><br></span>

    To:

    <br>

    It should also catch <span...></span> (with nothing inside), or
    <span...> </span> (with a whitespace inside).

    "class" and "style" can be anything (or non-existent), so I'm just
    trying to remove <span, followed by anything (or nothing) to the first
    >, then the following </span>


    Here's what I'm using:

    $text =~ s/<span[^>]*>\s*<\/span>/ /gi;
    $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;

    This doesn't appear to work, though. The string I posted above
    actually came through verbatim, so the match must have failed.

    Of course, I know that this would fail on nested <span></span> tags,
    which is why I'm switching over to HTML::HTML5::Parser. But in the
    meantime, why did this one not match?
     
    jwcarlton, Feb 24, 2011
    #1

  2. jwcarlton


    > It works for me.
    >
    > ------------------------
    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    >
    > $_ = '<span class="Apple-style-span" style="font-family: Arial, Verdana,
    > Helvetica, sans-serif; "><br></span>';
    >
    > s/<span[^>]*>(<br>)*<\/span>/$1/gi;
    >
    > print "$_\n";
    > ------------------------
    >
    > If you can post a short and complete program that we can run that
    > duplicates the problem you are having, then we can surely help
    > you fix it...



    That's really pretty much all there is! I'll paste the whole function
    below; the only thing I'm leaving out is the part at the top where it
    declares a few variables, logs the user in (which doesn't affect the
    $text variable), and then prints the data to MySQL.

    The data comes from a contenteditable, and when people paste things it
    needs to be manipulated a bit, which is mostly what this function
    does. I don't have a sample of raw content (I don't save it before it
    runs through the function), but here's a sample of a complete string
    that was printed (I left the content because I thought you guys might
    get a kick out of it):

    <span class="Apple-style-span" style="font-family: Arial, Verdana,
    Helvetica, sans-serif; "><b>"We ALL got problems....If you're gonna be
    dumb, ya gotta be tough."</b></span><br><br><span
    class="Apple-style-span" style="font-family: Arial, Verdana, Helvetica,
    sans-serif; "><br></span>


    And the function:

    sub fixtext {
    $text = $_[0];

    $text =~ s/&nbsp;/ /gi;

    # Convert <em> to <i> and <strong> to <b>, saves a few steps later
    $text =~ s/<em>(.*?)<\/em>/<i>$1<\/i>/gsi;
    $text =~ s/<strong>(.*?)<\/strong>/<b>$1<\/b>/gsi;

    # Strip Javascript
    $text =~ s/<script.*?>.*?<\/script>//gsi;
    $text =~ s/onmouseover=".*?"//gsi;
    $text =~ s/onclick=".*?"//gsi;

    ### Only Allow Specified Tags
    my $lt=chr(1);
    my $gt=chr(2);
    $text =~ s/<br>/$lt br $gt/gi;

    $text =~ s/<(\/{0,1})(div.*?)>/$lt$1$2$gt/gsi;
    $text =~ s/<(\/{0,1})(span.*?)>/$lt$1$2$gt/gsi;

    $text =~ s/<(\/{0,1})(table.*?)>/$lt$1$2$gt/gsi;
    $text =~ s/<(\/{0,1})(tr.*?)>/$lt$1$2$gt/gsi;
    $text =~ s/<(\/{0,1})(td.*?)>/$lt$1$2$gt/gsi;

    $text =~ s/<(\/{0,1})(b|p)>/$lt$1$2$gt/gsi;
    $text =~ s/<(\/{0,1})(u|i)>/$lt$1$2$gt/gsi;

    $text =~ s/<(\/{0,1})(font.*?)>/$lt$1$2$gt/gsi;

    $text =~ s/<(\/{0,1})(img.*?)>/$lt$1$2$gt/gsi;

    # delete all other tags
    $text =~ s/<.+?>//gs;

    $text =~ s/$lt/</g;
    $text =~ s/$gt/>/g;
    $text =~ s/< br >/<br>/gi;
    ###

    # Strip Word junk
    $text =~ s/Normal 0 false.*?}//gsi;
    $text =~ s/Normal 0 MicrosoftInternetExplorer4.*?}//gsi;
    $text =~ s/\/\* Style Definitions \*\/.*?}//gsi;
    $text =~ s/Normal\.dotm .*? false false//gsi;

    $text =~ s/white-space: nowrap;*//gsi;
    $text =~ s/style="(\s*)"//gsi;

    # Strip empty tags
    $text =~ s/<font[^>]*>\s*<\/font>/ /gi;
    $text =~ s/<font[^>]*>(<br>)*<\/font>/<br><br>/gi;

    $text =~ s/<span[^>]*>\s*<\/span>/ /gi;
    $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gsi;

    $text =~ s/<i>(\s*)<\/i>/$1/gi;
    $text =~ s/<b>(\s*)<\/b>/$1/gi;
    $text =~ s/<u>(\s*)<\/u>/$1/gi;

    $text =~ s/<div>\s*<\/div>/<br>/gi;
    $text =~ s/<div>(.*?)<\/div>/<br><br>$1/gsi;

    # Limit repeating characters
    $text =~ s/(.)\1{4,}/$1$1$1$1/g;

    # Strip opening, trailing, or repeating whitespace, <br>
    $text =~ s/\s+/ /gs;
    $text =~ s/^\s+|\s+$//g;

    $text =~ s/(<br><br>)+/<br><br>/gi;
    $text =~ s/^(<br>)+|(<br>)+$//gi;

    return $text;
    }
     
    jwcarlton, Feb 24, 2011
    #2

  3. On 24.02.2011 06:11, jwcarlton wrote:
    >> If you can post a short and complete program that we can run that
    >> duplicates the problem you are having, then we can surely help
    >> you fix it...

    >
    >
    > That's really pretty much all there is! I'll paste the whole function
    > below; the only thing I'm leaving out is the part at the top where it
    > declares a few variables, logs the user in (which doesn't affect the
    > $text variable), and then prints the data to MySQL.


    We are not interested in whole long functions, only in the relevant
    parts.

    > The data comes from a contenteditable, and when people paste things it
    > needs to be manipulated a bit, which is mostly what this function
    > does. I don't have a sample of raw content (I don't save it before it
    > runs through the function), but here's a sample of a complete string
    > that was printed (I left the content because I thought you guys might
    > get a kick out of it):


    First: try the string you have posted. Your function will remove the
    second span part!

    And then: why don't you output the string before putting it in your
    function? You need to look at the input!
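
    One way to look at the input is to dump it with invisible characters
    escaped before it enters fixtext. This is only a sketch: the helper
    name visible() and the sample string are made up for illustration,
    and the real input may look quite different.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper (not part of the original code): render a string
# with newlines, tabs and control characters escaped, so that invisible
# differences in the input become visible in a log.
sub visible {
    my ($text) = @_;
    local $Data::Dumper::Useqq  = 1;   # double-quoted output with escapes
    local $Data::Dumper::Terse  = 1;   # no "$VAR1 = " prefix
    local $Data::Dumper::Indent = 0;   # keep everything on one line
    return Dumper($text);
}

# Made-up sample: a newline sits between the opening tag and <br>.
my $raw = "<span class=\"Apple-style-span\">\n<br></span>";

# Log the raw value right before calling fixtext($raw).
print STDERR "fixtext input: ", visible($raw), "\n";
```

    Logging to STDERR keeps the dump out of the page output; where it ends
    up depends on how the script is run.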

    The solution is probably simple: you are doing a lot of replacements,
    and their order matters. Assume the input is "<span><br><b></b></span>".
    Then you don't remove the span, because the empty <b> is still inside
    it when the span rule runs. But later you remove the b. If you reversed
    the order, you would also remove the span.
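
    A small sketch of that order dependence, using only the span and <b>
    substitutions copied from fixtext. The helper names strip_empty_span
    and strip_empty_b are invented here to isolate the two steps.

```perl
use strict;
use warnings;

# Hypothetical helpers: each wraps substitutions taken from fixtext.
sub strip_empty_span {
    my ($text) = @_;
    $text =~ s/<span[^>]*>\s*<\/span>/ /gi;
    $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;
    return $text;
}

sub strip_empty_b {
    my ($text) = @_;
    $text =~ s/<b>(\s*)<\/b>/$1/gi;
    return $text;
}

my $input = '<span><br><b></b></span>';

# Order used in fixtext: the span rule sees "<br><b></b>" inside the
# span and does not match; the empty <b> is only removed afterwards.
my $span_first = strip_empty_b(strip_empty_span($input));

# Reversed order: the empty <b> is gone before the span rule runs.
my $b_first = strip_empty_span(strip_empty_b($input));

print "span first: $span_first\n";   # span first: <span><br></span>
print "b first:    $b_first\n";      # b first:    <br>
```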

    So you can try running the fixtext function more than once or try to
    change the order of your 10000 replacements.
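
    Running it more than once could look roughly like this: repeat the
    cleanup until the text stops changing. The wrapper clean_until_stable,
    the pass limit and the cut-down $mini_fixtext are assumptions made for
    the sketch; the real fixtext has many more rules, and it is worth
    checking that all of them are safe to apply repeatedly.

```perl
use strict;
use warnings;

# Hypothetical wrapper: apply a cleanup routine until its output no
# longer changes. $cleanup is a code ref that takes a string and returns
# the cleaned string; $max_passes is an assumed safety cap.
sub clean_until_stable {
    my ($cleanup, $text, $max_passes) = @_;
    $max_passes //= 10;
    for my $pass (1 .. $max_passes) {
        my $next = $cleanup->($text);
        return $text if $next eq $text;   # nothing changed: done
        $text = $next;
    }
    return $text;                         # gave up after $max_passes
}

# Stand-in for fixtext, reduced to the span and <b> rules in their
# original order.
my $mini_fixtext = sub {
    my ($text) = @_;
    $text =~ s/<span[^>]*>\s*<\/span>/ /gi;
    $text =~ s/<span[^>]*>(<br>)*<\/span>/$1/gi;
    $text =~ s/<b>(\s*)<\/b>/$1/gi;
    return $text;
};

my $result = clean_until_stable($mini_fixtext, '<span><br><b></b></span>');
print "$result\n";   # <br>
```

    The first pass only removes the empty <b>; the second pass then
    matches the span rule, and the third pass confirms nothing is left to
    change.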

    - Wolf

    Next time please try to post a short program that one can run without
    changing/adding anything! Often writing such a short program will point
    you to the problem so that you can solve it on your own.
     
    Wolf Behrenhoff, Feb 24, 2011
    #3
