Regex question; match <br> after opening tag

Discussion in 'Perl Misc' started by jwcarlton, Feb 16, 2011.

  1. jwcarlton

    jwcarlton Guest

    I'm working on an area where the visitor submits content via
    contenteditable, so the submission comes through in Word-style HTML
    (meaning, it's somewhat of a mess, and completely dependent on the
    user's browser).

    I'm trying to remove opening and closing <br> tags. The problem I'm
    having is when those tags come after a <font, <div, or <span, or
    before a closing </font>, </div>, or </span>; e.g.:

    <div class=whatever><span class=whatever><font
    class=whatever><br><br><br>Hello, World!<br><br></font></span></div>

    It's worth noting that <div>...</div> may or may not be there,
    <span>...</span> may or may not be there, <font>...</font> may or may
    not be there, they could be transposed (i.e., <font> before <span>), and
    there can be anywhere from 0 to 3 <br> tags.

    Here's where I am so far:

    $text =~ s/^(<div(.*?)>)(<br>)+/$1/gi;
    $text =~ s/^(<span(.*?)>)(<br>)+/$1/gi;
    $text =~ s/^(<font(.*?)>)(<br>)+/$1/gi;

    $text =~ s/(<br>)+(<\/div>)$/$2/gi;
    $text =~ s/(<br>)+(<\/span>)$/$2/gi;
    $text =~ s/(<br>)+(<\/font>)$/$2/gi;


    I have 3 questions on this:

    1. First off, does the code above look technically correct to you?
    Meaning, would it work if we assume that the tags are always div,
    followed by span, followed by font?

    2. Is there a way to get these on 1 line?

    3. How can I code it to work regardless of which tag comes first?

    TIA,

    Jason
     
    jwcarlton, Feb 16, 2011
    #1
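One way to check question 1 is to run the six substitutions on the sample. A minimal sketch, assuming the sample arrives as a single line with no embedded newline; the wrapper name `op_filter` is hypothetical and not part of the original post:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the six substitutions from the post above.
sub op_filter {
    my ($text) = @_;
    $text =~ s/^(<div(.*?)>)(<br>)+/$1/gi;
    $text =~ s/^(<span(.*?)>)(<br>)+/$1/gi;
    $text =~ s/^(<font(.*?)>)(<br>)+/$1/gi;
    $text =~ s/(<br>)+(<\/div>)$/$2/gi;
    $text =~ s/(<br>)+(<\/span>)$/$2/gi;
    $text =~ s/(<br>)+(<\/font>)$/$2/gi;
    return $text;
}

my $sample = '<div class=whatever><span class=whatever><font class=whatever>'
           . '<br><br><br>Hello, World!<br><br></font></span></div>';

print op_filter($sample), "\n";
```

On this input the leading run is removed, but only because `(.*?)` is free to run past `>` and swallow the span and font tags. The trailing run survives: each closing pattern needs `<br>` to sit directly before a tag that also ends the string, and in the sample only `</div>` ends the string while the `<br>` tags sit before `</font>`.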

  2. jwcarlton <> wrote:
    >I'm working on an area where the visitor submits content via
    >contenteditable, so the submission comes through in Word-style HTML
    >(meaning, it's somewhat of a mess, and completely dependent on the
    >user's browser).


    Then why are you trying to use REs to parse this mess?

    [typical ill-fated attempt of using the wrong tool for the job deleted]

    >I have 3 questions on this:
    >
    >1. First off, does the code above look technically correct to you?
    >Meaning, would it work if we assume that the tags are always div,
    >followed by span, followed by font?


    Who cares? Nobody in his right mind would use _REGULAR_ expressions to
    parse a context-free language.

    >2. Is there a way to get these on 1 line?


    Sure. Just remove the linebreaks.

    >3. How can I code it to work regardless of which tag comes first?


    By writing a proper HTML parser. Or much easier by using one of the
    readily available HTML parsers from CPAN.

    jue
     
    Jürgen Exner, Feb 16, 2011
    #2

  3. jwcarlton

    jwcarlton Guest

    > >I'm working on an area where the visitor submits content via
    > >contenteditable, so the submission comes through in Word-style HTML
    > >(meaning, it's somewhat of a mess, and completely dependent on the
    > >user's browser).

    >
    > Then why are you trying to use REs to parse this mess?
    >
    > [typical ill-fated attempt of using the wrong tool for the job deleted]


    I'm guessing that you've never worked with a contenteditable form?
    It's not as easy as all that.


    > >I have 3 questions on this:

    >
    > >1. First off, does the code above look technically correct to you?
    > >Meaning, would it work if we assume that the tags are always div,
    > >followed by span, followed by font?

    >
    > Who cares? Nobody in his right mind would use _REGULAR_ expressions to
    > parse a context-free language.


    I care, or I wouldn't have asked. I assume that you care, too, or you
    wouldn't have wasted your time on replying :)


    > >2. Is there a way to get these on 1 line?

    >
    > Sure. Just remove the linebreaks.


    Sigh.
     
    jwcarlton, Feb 16, 2011
    #3
  4. jwcarlton

    jwcarlton Guest

    On Feb 16, 12:11 am, Tad McClellan <> wrote:
    > jwcarlton <> wrote:
    > > I'm trying to remove opening and closing <br> tags.

    >
    > There is no such thing as a "closing" <br> tag...
    >
    >    http://www.w3.org/TR/REC-html32#br
    >
    >     ... This is an empty element so the end tag is forbidden
    >
    > ><div class=whatever><span class=whatever><font
    > > class=whatever><br><br><br>Hello, World!<br><br></font></span></div>

    >
    > ---------------------------
    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    >
    > my $text = '<div class=whatever><span class=whatever><font
    > class=whatever><br><br><br>Hello, World!<br><br></font></span></div>';
    >
    > $text =~ s/<br>//g;
    >
    > print "$text\n";
    > ---------------------------
    >
    > --
    > Tad McClellan
    > email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
    > The above message is a Usenet post.
    > I don't recall having given anyone permission to use it on a Web site.


    Seriously, why even bother replying?
     
    jwcarlton, Feb 16, 2011
    #4
  5. Dr.Ruud

    Dr.Ruud Guest

    On 2011-02-16 06:18, jwcarlton wrote:
    > On Feb 16, 12:11 am, Tad McClellan <> wrote:
    >> jwcarlton <> wrote:


    >>> I'm trying to remove opening and closing <br> tags.

    >>
    >> There is no such thing as a "closing" <br> tag...
    >> [...]

    >
    > Seriously, why even bother replying?


    I guess because all answers to your questions are in the FAQ.
    That you shouldn't quote sigs is in another one.

    --
    Ruud
     
    Dr.Ruud, Feb 16, 2011
    #5
  6. my $text = '<div class=whatever><span
    class=whatever><font class=whatever><br>help<o><br><br>Hello,
    World!<br><br></font></span>
    </div>';

    while ( $text =~/<br>(.+?)<br>/gm )
    {
    (my $a = $^N)=~s/<.+?>//g;
    print "*$a*\n";
    }
     
    George Mpouras, Feb 16, 2011
    #6
  7. Justin C

    Justin C Guest

    On 2011-02-16, jwcarlton <> wrote:
    >
    > Seriously, why even bother replying?


    Then show us a sample of the content that you are receiving so we can
    better understand the problem. Antagonising those who offer suggestions
    is never a good move.

    Justin.

    --
    Justin C, by the sea.
     
    Justin C, Feb 16, 2011
    #7
  8. jwcarlton

    jwcarlton Guest

    > > Seriously, why even bother replying?
    >
    > Then show us a sample of the content that you are receiving so we can
    > better understand the problem. Antagonising those who offer suggestions
    > is never a good move.


    Justin, please understand that Tad was giving a PITA answer, not a
    suggestion. I definitely wasn't antagonizing; if you look closely at
    his response, you'll see what I mean.

    He and I have a history, and in the years that I've been watching, I
    don't think he's ever given a REAL answer to anyone.

    Anyway, let's not let Tad ruin yet another thread.

    I gave a sample of what I get in the OP:

    <div class=whatever><span class=whatever><font
    class=whatever><br><br><br>Hello, World!<br><br></font></span></div>

    I'm trying to write a regex that will remove <br> from both the
    beginning and the end of the string, but that's also nested within
    other tags.

    I already use this, which obviously removes the <br> when it's not
    nested inside of other tags:

    $text =~ s/^(<br>)+|(<br>)+$//gi;

    I gave code samples in my OP, too, of what I think will work; the only
    problem is that it requires the tags to be in that order; DIV, then
    SPAN, then FONT. If the FONT comes before the SPAN, then it doesn't
    work, so I'm trying to create a more streamlined method.

    Thanks, Justin.
     
    jwcarlton, Feb 16, 2011
    #8
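The order-independent version asked about here can be sketched by extending the `s/^(<br>)+|(<br>)+$//gi` idea with an alternation over the wrapper tags. This is a "good enough" sketch under stated assumptions, not a substitute for a parser: it assumes the wrappers are limited to div, span, and font, that no whitespace or text sits between the wrappers and the `<br>` run, and that no attribute value contains `>`. The helper name `strip_edge_brs` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: strip runs of <br> that sit directly inside the
# leading or trailing div/span/font wrappers, in any nesting order.
sub strip_edge_brs {
    my ($text) = @_;
    my $wrap = qr/div|span|font/i;   # assumed set of wrapper tags

    # Leading: keep any run of opening wrappers, drop the <br> run after it.
    $text =~ s/\A((?:<(?:$wrap)\b[^>]*>)*)(?:<br\s*\/?>)+/$1/i;

    # Trailing: drop the <br> run, keep any run of closing wrappers.
    $text =~ s/(?:<br\s*\/?>)+((?:<\/(?:$wrap)>)*)\z/$1/i;

    return $text;
}

print strip_edge_brs(
    '<font class=whatever><span class=whatever><br><br>Hello, World!<br></span></font>'
), "\n";
```

Because the wrapper run is matched with `(?:...)*`, the same two substitutions also cover input with no wrappers at all, and a `<br>` in the middle of the text is left alone.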
  9. jwcarlton

    jwcarlton Guest

    On Feb 16, 4:03 am, "George Mpouras"
    <> wrote:
    > my $text = '<div class=whatever><span
    > class=whatever><font class=whatever><br>help<o><br><br>Hello,
    > World!<br><br></font></span>
    > </div>';
    >
    > while ( $text =~/<br>(.+?)<br>/gm )
    > {
    > (my $a = $^N)=~s/<.+?>//g;
    > print "*$a*\n";
    > }
    >
    >


    Awesome, George! I really appreciate that.
     
    jwcarlton, Feb 16, 2011
    #9
  10. jwcarlton <> wrote:
    >I gave a sample of what I get in the OP:
    >
    ><div class=whatever><span class=whatever><font
    >class=whatever><br><br><br>Hello, World!<br><br></font></span></div>
    >
    >I'm trying to write a regex that will remove <br> from both the
    >beginning and the end of the string, but that's also nested within
    >other tags.
    >
    >I already use this, which obviously removes the <br> when it's not
    >nested inside of other tags:
    >
    >$text =~ s/^(<br>)+|(<br>)+$//gi;
    >
    >I gave code samples in my OP, too, of what I think will work; the only
    >problem is that it requires the tags to be in that order; DIV, then
    >SPAN, then FONT. If the FONT comes before the SPAN, then it doesn't
    >work, so I'm trying to create a more streamlined method.


    And these conditions are exactly why using a simple-minded regular
    expression is an unsuitable approach, in particular if you have no
    control over the format of the incoming data.
    Use a parser that actually parses HTML fragments and creates a syntax
    tree, and then delete or keep exactly those elements that you want.

    Doing it on the textual level is not going to work reliably.

    jue
     
    Jürgen Exner, Feb 16, 2011
    #10
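The point about working on the textual level can be seen with a small nesting example. A minimal sketch; the helper name `first_div_body` and the input string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: grab the body of the first <div> with a lazy match.
sub first_div_body {
    my ($html) = @_;
    return $html =~ /<div>(.*?)<\/div>/s ? $1 : undef;
}

# The outer <div> really contains "<div>a</div>b", but the lazy match
# stops at the first </div> it meets, which belongs to the inner <div>.
print first_div_body('<div><div>a</div>b</div>'), "\n";   # prints "<div>a"
```

A greedy `.*` fails the other way on sibling elements, which is why pairing open and close tags needs something that tracks nesting depth.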
  11. >
    > Don't do that with a regex. A regular expression can only express a
    > regular grammar - hence the name. HTML is a context-free grammar, which
    > needs a more complex parser than a regex can provide.
    >


    sometimes a "good enough" workaround is just fine
     
    George Mpouras, Feb 16, 2011
    #11
  12. Keith Keller

    Keith Keller Guest

    On 2011-02-16, George Mpouras <> wrote:
    >>
    >> Don't do that with a regex. A regular expression can only express a
    >> regular grammar - hence the name. HTML is a context-free grammar, which
    >> needs a more complex parser than a regex can provide.

    >
    > sometimes a "good enough" workaround is just fine


    Perhaps. This isn't one of those times, especially since the HTML
    modules available with Perl are excellent and easy to use.

    --keith


    --
    -francisco.ca.us
    (try just my userid to email me)
    AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
    see X- headers for PGP signature information
     
    Keith Keller, Feb 16, 2011
    #12
  13. jwcarlton

    jwcarlton Guest

    > > He and I have a history
    >
    > Then maybe you should simply ignore his posts.


    I try; really, I do. I was mostly concerned that others would glance
    over the thread and think that he had legitimately solved the problem.


    > Don't do that with a regex. A regular expression can only express a
    > regular grammar - hence the name. HTML is a context-free grammar, which
    > needs a more complex parser than a regex can provide.
    >
    > Have a look at HTML::Parser:
    >
    >     <http://search.cpan.org/perldoc?HTML::Parser>


    For now, I have a filter that I wrote a few years ago, and it's
    working well enough so I'm just trying to correct what's really just
    one minor issue. I do intend to change it to work with a parser in the
    near future, though; which would probably have been smarter in the
    beginning, but when I asked for help, I only got responses like the
    first few in this thread, so I just gave up and did it a way that I
    knew.

    In fact, I just looked, and the responses I got then were that I
    should write my own. Funny when you consider that, now that I've
    written my own, all of the responses are that I should have used a
    module! LOL

    I've considered HTML::Parser, but if I understand correctly, don't you
    have to specifically define which tags you want to parse? That's all
    well and good, except that people often paste data from other sites,
    so it's difficult to think of every possibility.

    I'm looking at HTML::HTML5::Parser, but I'm messing up in a way that I
    don't get. Here's the code I'm entering, which is almost exactly
    what's on CPAN:

    #!/usr/bin/perl
    use CGI::Carp qw(fatalsToBrowser);
    use HTML::HTML5::Parser;

    $comment = "<!doctype html>\n<title>Foo</title>\n"
             . "<p><b><i>Foo</b> bar</i>.\n<p>Baz</br>Quux.";

    my $parser = HTML::HTML5::Parser->new;
    $comment = $parser->parse_string($comment);

    print "Content-type: text/html\n\n";
    print "$comment";
    exit;

    All this prints, though, is:

    XML::LibXML::Document=SCALAR(0x924f988)

    I double checked, and do have XML::LibXML installed. The
    HTML::HTML5::Parser is a fresh install from yesterday.

    Any suggestions on how to print the parsed string, if I'm doing it
    incorrectly?
     
    jwcarlton, Feb 17, 2011
    #13
  14. Keith Keller

    Keith Keller Guest

    On 2011-02-17, jwcarlton <> wrote:
    >
    > I've considered HTML::Parser, but if I understand correctly, don't you
    > have to specifically define which tags you want to parse? That's all
    > well and good, except that people often paste data from other sites,
    > so it's difficult to think of every possibility.


    HTML::Parser can do basically any HTML parsing. But this also means
    you have a fair amount of coding to tell it what to do. You can also
    look at HTML::TreeBuilder, which uses HTML::Parser to build a nice hash
    structure and provide powerful search functions on the structure.

    --keith


    --
    -francisco.ca.us
    (try just my userid to email me)
    AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
    see X- headers for PGP signature information
     
    Keith Keller, Feb 17, 2011
    #14
  15. jwcarlton

    jwcarlton Guest

    > > I try; really, I do.
    >
    > "Do, or do not - there is no try." - Yoda


    I thought that was Mr. Miyagi? LOL


    > That just goes to show, you should consider the source. One of our more
    > persistent trolls here used to give that same, very misguided, advice
    > whenever the topic came up. I'm sorry to hear you were misled by bad
    > advice.


    It happens. Honestly, for a while I was getting 0 help on here, just
    all trolls, so it left a rather bad taste in my mouth.

    I've been coding in Perl for almost 15 years, and I keep thinking of
    how helpful everyone here used to be when I was just starting. I don't
    know if it's that the type of people that post has changed, or if I'm
    just more sensitive, or if I'm just becoming an old man thinking about
    "the good ol' days". Probably a mix of the 3.

    I used to ALWAYS know better than to feed the trolls, too. Maybe it is
    just an old man thing? :-(


    > The HTML::HTML5::Parser docs say that parse_string() should give you an
    > instance of XML::LibXML::Document, and the message above indicates that
    > it did. That's good news, as it shows that nothing actually went wrong;
    > the problem is that you're trying to print the object as if it were just
    > a string. What you should do instead is check the docs for that module,
    > and find a method for that object that will give you a string. At first
    > glance, it looks to me like toString() would be appropriate:
    >
    >   print $comment->toString();


    Awesome! That worked perfectly, Sherm.

    I looked all through the docs, both last night and today, and didn't
    see anything like that. For the sake of my own learnin', where exactly
    did you find that?


    > Note that X::L::Document has some other interesting methods, that relate
    > to querying the document to get a collection of all the elements of a
    > given type, or an element with a particular id. These DOM methods are
    > the same (language differences aside) as those provided by JavaScript
    > on the document object.


    Cool, thanks again!
     
    jwcarlton, Feb 17, 2011
    #15
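The "printing the object as if it were just a string" behavior is general Perl, not specific to the parser. A minimal core-Perl sketch, using a hypothetical `My::Doc` class as a stand-in; it is not the real XML::LibXML::Document API, only an object that happens to expose a `toString` method the same way:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a document object; not the real XML::LibXML API.
package My::Doc;

sub new {
    my ($class, $markup) = @_;
    return bless \$markup, $class;    # blessed scalar reference
}

sub toString {
    my ($self) = @_;
    return $$self;
}

package main;

my $doc = My::Doc->new('<p>Baz</p>');

print "$doc\n";                 # something like My::Doc=SCALAR(0x...)
print $doc->toString, "\n";     # prints "<p>Baz</p>"
```

Interpolating an object without overloaded stringification yields its class name, underlying reference type, and address, which is the shape of the `XML::LibXML::Document=SCALAR(0x924f988)` output seen earlier in the thread.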
  16. George Mpouras <> writes:

    >>
    >> Don't do that with a regex. A regular expression can only express a
    >> regular grammar - hence the name. HTML is a context-free grammar, which
    >> needs a more complex parser than a regex can provide.
    >>

    >
    > sometimes a "good enough" workaround is just fine


    Yes, but that requires understanding that it *is* a "good enough"
    workaround.

    And the burden of proving that understanding is on the one asking a
    FAQ. And he's not doing too good a job of it right now.

    Mart

    --
    "We will need a longer wall when the revolution comes."
    --- AJS, quoting an uncertain source.
     
    Mart van de Wege, Feb 17, 2011
    #16
  17. jwcarlton <> writes:

    >> > He and I have a history

    >>
    >> Then maybe you should simply ignore his posts.

    >
    > I try; really, I do. I was mostly concerned that others would glance
    > over the thread and think that he had legitimately solved the problem.
    >
    >

    I did. His solution of using a parser was the correct one.

    >
    > I've considered HTML::Parser, but if I understand correctly, don't you
    > have to specifically define which tags you want to parse? That's all
    > well and good, except that people often paste data from other sites,
    > so it's difficult to think of every possibility.


    You are Not Getting It.

    If your users can give you data that you cannot handle, the best method
    is to reject or discard it (in case of HTML, don't process tags you have
    no handlers for).

    The way HTML::Parser goes about it according to your description *is*
    the right way to do it.

    When treating outside data, always do so on a white-list basis: only
    accept what you have explicitly defined. Trying to think of everything
    leads to security holes. If you write software this way, the question is
    *when* it will be exploited, not if.

    Mart

    --
    "We will need a longer wall when the revolution comes."
    --- AJS, quoting an uncertain source.
     
    Mart van de Wege, Feb 17, 2011
    #17
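The white-list idea can be sketched as a lookup that a parser's tag handlers consult, so anything not explicitly listed is dropped by default. A minimal sketch; the tag list and the names `%ALLOWED_TAG` and `tag_allowed` are assumptions for illustration, and the actual set of tags is a policy decision:

```perl
use strict;
use warnings;

# Assumed whitelist; the actual set of tags is a policy decision.
my %ALLOWED_TAG = map { $_ => 1 } qw(div span font br p b i);

# Hypothetical predicate a parser's start/end-tag handler could call.
sub tag_allowed {
    my ($tag) = @_;
    return $ALLOWED_TAG{ lc $tag } ? 1 : 0;
}

for my $tag (qw(span SCRIPT iframe br)) {
    printf "%-7s %s\n", $tag, tag_allowed($tag) ? 'keep' : 'drop';
}
```

With this shape, pasted markup containing tags nobody thought of needs no special handling: an unknown tag simply fails the lookup.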
  18. ccc31807

    ccc31807 Guest

    On Feb 15, 11:43 pm, jwcarlton <> wrote:
    > I'm trying to remove opening and closing <br> tags.


    What you want to do, assuming that you have the entire ASCII text in a
    variable, is this:

    $var =~ s/<br[^>]*>//ig;

    This looks for three literal characters, the '<', 'b', and 'r', then
    looks for any number of characters (including zero characters) which
    are not a literal '>', then a literal '>', and replaces them with
    nothing, looking globally in a case insensitive manner.

    CC.
     
    ccc31807, Feb 17, 2011
    #18
  19. On 2011-02-17 19:04, ccc31807 <> wrote:
    > On Feb 15, 11:43 pm, jwcarlton <> wrote:
    >> I'm trying to remove opening and closing <br> tags.

    >
    > What you want to do, assuming that you have the entire ASCII text in a
    > variable, is this:
    >
    > $var =~ s/<br[^>]*>//ig;
    >
    > This looks for three literal characters, the '<', 'b', and 'r', then
    > looks for any number of characters (including zero characters) which
    > are not a literal '>', then a literal '>', and replaces them with
    > nothing, looking globally in a case insensitive manner.


    <br title="a <br> element">

    SCNR,
    hp
     
    Peter J. Holzer, Feb 18, 2011
    #19
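Running the quoted substitution on that element shows what goes wrong. A minimal sketch; the wrapper name `strip_br` is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
sub strip_br {
    my ($var) = @_;
    $var =~ s/<br[^>]*>//ig;
    return $var;
}

# [^>]* stops at the first '>', which here sits inside the attribute
# value, so the tail of the tag is left behind in the output.
my $leftover = strip_br('<br title="a <br> element">');
print "[$leftover]\n";   # prints [ element">]
```

The pattern also has no word boundary after `br`, so it would match any tag whose name merely starts with those two letters.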
  20. ccc31807

    ccc31807 Guest

    On Feb 18, 1:42 pm, "Peter J. Holzer" <> wrote:
    > <br title="a <br> element">


    not valid HTML.

    <br title="a &lt;br&gt; element" />

    CC.
     
    ccc31807, Feb 18, 2011
    #20
