Regex question; match <br> after opening tag

jwcarlton

I'm working on an area where the visitor submits content via
contenteditable, so the submission comes through in Word-style HTML
(meaning, it's somewhat of a mess, and completely dependent on the
user's browser).

I'm trying to remove opening and closing <br> tags. The problem I'm
having is when those tags come after a <font, <div, or <span, or
before a closing </font>, </div>, or </span>; eg:

<div class=whatever><span class=whatever><font
class=whatever><br><br><br>Hello, World!<br><br></font></span></div>

It's worth noting that <div>...</div> may or may not be there,
<span>...</span> may or may not be there, <font>...</font> may or may
not be there, they could be transposed (i.e., <font> before <span>), and
there can be anywhere from 0 to 3 <br> tags.

Here's where I am so far:

$text =~ s/^(<div(.*?)>)(<br>)+/$1/gi;
$text =~ s/^(<span(.*?)>)(<br>)+/$1/gi;
$text =~ s/^(<font(.*?)>)(<br>)+/$1/gi;

$text =~ s/(<br>)+(<\/div>)$/$2/gi;
$text =~ s/(<br>)+(<\/span>)$/$2/gi;
$text =~ s/(<br>)+(<\/font>)$/$2/gi;


I have 3 questions on this:

1. First off, does the code above look technically correct to you?
Meaning, would it work if we assume that the tags are always div,
followed by span, followed by font?

2. Is there a way to get these on 1 line?

3. How can I code it to work regardless of which tag comes first?
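
One way to check question 1 is to run the six substitutions against the sample. The following is only a sketch, and it assumes the submission arrives as a single line with no embedded newline; the sub name original_filter is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the six substitutions from the post, unchanged.
sub original_filter {
    my ($text) = @_;

    $text =~ s/^(<div(.*?)>)(<br>)+/$1/gi;
    $text =~ s/^(<span(.*?)>)(<br>)+/$1/gi;
    $text =~ s/^(<font(.*?)>)(<br>)+/$1/gi;

    $text =~ s/(<br>)+(<\/div>)$/$2/gi;
    $text =~ s/(<br>)+(<\/span>)$/$2/gi;
    $text =~ s/(<br>)+(<\/font>)$/$2/gi;

    return $text;
}

# Assumption: the sample from the post, joined onto one line.
my $sample = '<div class=whatever><span class=whatever><font class=whatever>'
           . '<br><br><br>Hello, World!<br><br></font></span></div>';

print original_filter($sample), "\n";
# prints: <div class=whatever><span class=whatever><font class=whatever>Hello, World!<br><br></font></span></div>
```

On this input the leading <br> tags are removed, but only because the lazy (.*?) in the first pattern can stretch across the <span> and <font> tags until it reaches a > that is followed by <br>. The trailing <br> tags survive: each closing pattern needs the <br> run to sit directly before its own closing tag, and that closing tag to be the last thing in the string.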

TIA,

Jason
 
Jürgen Exner

jwcarlton said:
I'm working on an area where the visitor submits content via
contenteditable, so the submission comes through in Word-style HTML
(meaning, it's somewhat of a mess, and completely dependent on the
user's browser).

Then why are you trying to use REs to parse this mess?

[typical ill-fated attempt of using the wrong tool for the job deleted]

I have 3 questions on this:

1. First off, does the code above look technically correct to you?
Meaning, would it work if we assume that the tags are always div,
followed by span, followed by font?

Who cares? Nobody in his right mind would use _REGULAR_ expressions to
parse a context-free language.

2. Is there a way to get these on 1 line?

Sure. Just remove the linebreaks.

3. How can I code it to work regardless of which tag comes first?

By writing a proper HTML parser. Or much easier by using one of the
readily available HTML parsers from CPAN.

jue
 
jwcarlton

I'm working on an area where the visitor submits content via
contenteditable, so the submission comes through in Word-style HTML
(meaning, it's somewhat of a mess, and completely dependent on the
user's browser).

Then why are you trying to use REs to parse this mess?

[typical ill-fated attempt of using the wrong tool for the job deleted]

I'm guessing that you've never worked with a contenteditable form?
It's not as easy as all that.

Who cares? Nobody in his right mind would use _REGULAR_ expressions to
parse a context-free language.

I care, or I wouldn't have asked. I assume that you care, too, or you
wouldn't have wasted your time on replying :)

Sure. Just remove the linebreaks.

Sigh.
 
jwcarlton

There is no such thing as a "closing" <br> tag...

   http://www.w3.org/TR/REC-html32#br

    ... This is an empty element so the end tag is forbidden


---------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $text = '<div class=whatever><span class=whatever><font
class=whatever><br><br><br>Hello, World!<br><br></font></span></div>';

$text =~ s/<br>//g;

print "$text\n";
---------------------------

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

Seriously, why even bother replying?
 
George Mpouras

my $text = '<div class=whatever><span
class=whatever><font class=whatever><br>help<o><br><br>Hello,
World!<br><br></font></span>
</div>';

while ( $text =~/<br>(.+?)<br>/gm )
{
(my $a = $^N)=~s/<.+?>//g;
print "*$a*\n";
}
 
Justin C

Seriously, why even bother replying?

Then show us a sample of the content that you are receiving so we can
better understand the problem. Antagonising those who offer suggestions
is never a good move.

Justin.
 
jwcarlton

Seriously, why even bother replying?

Then show us a sample of the content that you are receiving so we can
better understand the problem. Antagonising those who offer suggestions
is never a good move.

Justin, please understand that Tad was giving a PITA answer, not a
suggestion. I definitely wasn't antagonizing; if you look closely at
his response, you'll see what I mean.

He and I have a history, and in the years that I've been watching, I
don't think he's ever given a REAL answer to anyone.

Anyway, let's not let Tad ruin yet another thread.

I gave a sample of what I get in the OP:

<div class=whatever><span class=whatever><font
class=whatever><br><br><br>Hello, World!<br><br></font></span></div>

I'm trying to write a regex that will remove <br> from both the
beginning and the end of the string, but that's also nested within
other tags.

I already use this, which obviously removes the <br> when it's not
nested inside of other tags:

$text =~ s/^(<br>)+|(<br>)+$//gi;

I gave code samples in my OP, too, of what I think will work; the only
problem is that it requires the tags to be in that order; DIV, then
SPAN, then FONT. If the FONT comes before the SPAN, then it doesn't
work, so I'm trying to create a more streamlined method.
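
For questions 2 and 3, one possible regex-only approach is to treat div, span and font as interchangeable wrappers by using an alternation instead of one pattern per tag. This is a sketch under stated assumptions, not a vetted filter: the sub name strip_edge_breaks is hypothetical, only those three wrapper tags are recognized, and an attribute value containing a literal > will confuse it.

```perl
use strict;
use warnings;

# Hypothetical helper: strip <br> runs that sit at the very start or very end
# of the string, allowing any mix of <div>/<span>/<font> wrappers around them.
sub strip_edge_breaks {
    my ($text) = @_;

    my $open  = qr/<(?:div|span|font)\b[^>]*>/i;   # an opening wrapper, any attributes
    my $close = qr/<\/(?:div|span|font)\s*>/i;     # a closing wrapper
    my $br    = qr/<br\s*\/?>/i;                   # <br> or <br/>

    # Leading side: keep the run of opening wrappers ($1), drop the <br> run
    # that follows it. Loop because <br> may be interleaved with wrappers.
    1 while $text =~ s/\A((?:\s*$open)*)\s*(?:$br\s*)+/$1/;

    # Trailing side: drop a <br> run that is followed only by closing wrappers.
    1 while $text =~ s/(?:\s*$br)+((?:\s*$close)*\s*)\z/$1/;

    return $text;
}

my $sample = '<div class=whatever><span class=whatever><font class=whatever>'
           . '<br><br><br>Hello, World!<br><br></font></span></div>';

print strip_edge_breaks($sample), "\n";
# prints: <div class=whatever><span class=whatever><font class=whatever>Hello, World!</font></span></div>
```

Because the wrappers are matched by alternation, <font> before <span> works the same as the reverse, and a <br> in the middle of the text is left alone. It still operates on text rather than on document structure.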

Thanks, Justin.
 
jwcarlton

my $text = '<div class=whatever><span
class=whatever><font class=whatever><br>help<o><br><br>Hello,
World!<br><br></font></span>
</div>';

while ( $text =~/<br>(.+?)<br>/gm )
{
(my $a = $^N)=~s/<.+?>//g;
print "*$a*\n";
}

Awesome, George! I really appreciate that.
 
Jürgen Exner

jwcarlton said:
I gave a sample of what I get in the OP:

<div class=whatever><span class=whatever><font
class=whatever><br><br><br>Hello, World!<br><br></font></span></div>

I'm trying to write a regex that will remove <br> from both the
beginning and the end of the string, but that's also nested within
other tags.

I already use this, which obviously removes the <br> when it's not
nested inside of other tags:

$text =~ s/^(<br>)+|(<br>)+$//gi;

I gave code samples in my OP, too, of what I think will work; the only
problem is that it requires the tags to be in that order; DIV, then
SPAN, then FONT. If the FONT comes before the SPAN, then it doesn't
work, so I'm trying to create a more streamlined method.

And these conditions are exactly why using a simple-minded regular
expression is an unsuitable approach, in particular if you have no
control over the format of the incoming data.
Use a parser that actually parses HTML fragments and creates a syntax
tree, and then delete or keep exactly those elements that you want.

Doing it on the textual level is not going to work reliably.

jue
 
George Mpouras

Don't do that with a regex. A regular expression can only express a
regular grammar - hence the name. HTML is a context-free grammar, which
needs a more complex parser than a regex can provide.

sometimes a "good enough" workaround is just fine
 
Keith Keller

sometimes a "good enough" workaround is just fine

Perhaps. This isn't one of those times, especially since the HTML
modules available with Perl are excellent and easy to use.

--keith
 
jwcarlton

He and I have a history

Then maybe you should simply ignore his posts.

I try; really, I do. I was mostly concerned that others would glance
over the thread and think that he had legitimately solved the problem.

Don't do that with a regex. A regular expression can only express a
regular grammar - hence the name. HTML is a context-free grammar, which
needs a more complex parser than a regex can provide.

Have a look at HTML::Parser:

    <http://search.cpan.org/perldoc?HTML::Parser>

For now, I have a filter that I wrote a few years ago, and it's
working well enough so I'm just trying to correct what's really just
one minor issue. I do intend to change it to work with a parser in the
near future, though; which would probably have been smarter in the
beginning, but when I asked for help, I only got responses like the
first few in this thread, so I just gave up and did it a way that I
knew.

In fact, I just looked, and the responses I got then were that I
should write my own. Funny when you consider that, now that I've
written my own, all of the responses are that I should have used a
module! LOL

I've considered HTML::Parser, but if I understand correctly, don't you
have to specifically define which tags you want to parse? That's all
well and good, except that people often paste data from other sites,
so it's difficult to think of every possibility.

I'm looking at HTML::HTML5::Parser, but I'm messing up in a way that I
don't get. Here's the code I'm entering, which is almost exactly
what's on CPAN:

#!/usr/bin/perl
use CGI::Carp qw(fatalsToBrowser);
use HTML::HTML5::Parser;

$comment = "<!doctype html>\n<title>Foo</title>\n<p><b><i>Foo</b> bar</i>.\n<p>Baz</br>Quux.";

my $parser = HTML::HTML5::Parser->new;
$comment = $parser->parse_string($comment);

print "Content-type: text/html\n\n";
print "$comment";
exit;

All this prints, though, is:

XML::LibXML::Document=SCALAR(0x924f988)

I double checked, and I do have XML::LibXML installed.
HTML::HTML5::Parser is a fresh install from yesterday.

Any suggestions on how to print the parsed string, if I'm doing it
incorrectly?
 
Keith Keller

I've considered HTML::Parser, but if I understand correctly, don't you
have to specifically define which tags you want to parse? That's all
well and good, except that people often paste data from other sites,
so it's difficult to think of every possibility.

HTML::Parser can do basically any HTML parsing. But this also means
you have a fair amount of coding to tell it what to do. You can also
look at HTML::TreeBuilder, which uses HTML::Parser to build a nice hash
structure and provide powerful search functions on the structure.

--keith
 
jwcarlton

I try; really, I do.

"Do, or do not - there is no try." - Yoda

I thought that was Mr. Miyagi? LOL

That just goes to show, you should consider the source. One of our more
persistent trolls here used to give that same, very misguided, advice
whenever the topic came up. I'm sorry to hear you were misled by bad
advice.

It happens. Honestly, for a while I was getting 0 help on here, just
all trolls, so it left a rather bad taste in my mouth.

I've been coding in Perl for almost 15 years, and I keep thinking of
how helpful everyone here used to be when I was just starting. I don't
know if it's that the type of people that post has changed, or if I'm
just more sensitive, or if I'm just becoming an old man thinking about
"the good ol' days". Probably a mix of the 3.

I used to ALWAYS know better than to feed the trolls, too. Maybe it is
just an old man thing? :-(

The HTML::HTML5::Parser docs say that parse_string() should give you an
instance of XML::LibXML::Document, and the message above indicates that
it did. That's good news, as it shows that nothing actually went wrong;
the problem is that you're trying to print the object as if it were just
a string. What you should do instead is check the docs for that module,
and find a method for that object that will give you a string. At first
glance, it looks to me like toString() would be appropriate:

  print $comment->toString();

Awesome! That worked perfectly, Sherm.

I looked all through the docs, both last night and today, and didn't
see anything like that. For the sake of my own learnin', where exactly
did you find that?

Note that X::L::Document has some other interesting methods, that relate
to querying the document to get a collection of all the elements of a
given type, or an element with a particular id. These DOM methods are
the same (language differences aside) as those provided by JavaScript
on the document object.

Cool, thanks again!
 
Mart van de Wege

George Mpouras said:
sometimes a "good enough" workaround is just fine

Yes, but that requires understanding that it *is* a "good enough"
workaround.

And the burden of proving that understanding is on the one asking a
FAQ. And he's not doing too good a job of it right now.

Mart
 
Mart van de Wege

jwcarlton said:
I try; really, I do. I was mostly concerned that others would glance
over the thread and think that he had legitimately solved the problem.

I did. His solution of using a parser was the correct one.

I've considered HTML::Parser, but if I understand correctly, don't you
have to specifically define which tags you want to parse? That's all
well and good, except that people often paste data from other sites,
so it's difficult to think of every possibility.

You are Not Getting It.

If your users can give you data that you cannot handle, the best method
is to reject or discard it (in case of HTML, don't process tags you have
no handlers for).

The way HTML::Parser goes about it according to your description *is*
the right way to do it.

When treating outside data, always do so on a white-list basis: only
accept what you have explicitly defined. Trying to think of everything
leads to security holes. If you write software this way, the question is
*when* it will be exploited, not if.
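
As a rough illustration of the white-list idea, the sketch below keeps a tag only if its name is on an explicit list and drops everything else, attributes included. The list contents and the sub name keep_whitelisted_tags are assumptions for the example. It uses a deliberately naive regex tokenizer just to stay dependency-free, so it is not a security filter; in real code the same decision would be made from inside a parser's tag handlers.

```perl
use strict;
use warnings;

# Hypothetical white-list: only these tag names are allowed through.
my %ALLOWED = map { $_ => 1 } qw(b i p br);

# Keep white-listed tags (stripped of all attributes), delete every other tag.
# Text between tags is left untouched.
sub keep_whitelisted_tags {
    my ($html) = @_;

    $html =~ s{<(/?)([a-z][a-z0-9]*)\b[^>]*>}{
        $ALLOWED{lc $2} ? '<' . $1 . lc($2) . '>' : ''
    }gie;

    return $html;
}

print keep_whitelisted_tags(
    '<div onclick="x()"><b>Hi</b><font color=red> there</font><br></div>'
), "\n";
# prints: <b>Hi</b> there<br>
```

The point is the direction of the test: nothing has to be anticipated about what users might paste, because anything not named in %ALLOWED is discarded by default.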

Mart
 
ccc31807

I'm trying to remove opening and closing <br> tags.

What you want to do, assuming that you have the entire ASCII text in a
variable, is this:

$var =~ s/<br[^>]*>//ig;

This looks for three literal characters, the '<', 'b', and 'r', then
looks for any number of characters (including zero characters) which
are not a literal '>', then a literal '>', and replaces them with
nothing, looking globally in a case insensitive manner.

CC.
 
Peter J. Holzer

I'm trying to remove opening and closing <br> tags.

What you want to do, assuming that you have the entire ASCII text in a
variable, is this:

$var =~ s/<br[^>]*>//ig;

This looks for three literal characters, the '<', 'b', and 'r', then
looks for any number of characters (including zero characters) which
are not a literal '>', then a literal '>', and replaces them with
nothing, looking globally in a case insensitive manner.

<br title="a <br> element">
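
To make the counterexample concrete, here is a small sketch that applies the quoted substitution to that one tag. The sub name strip_br_naive is made up for the demonstration; the pattern itself is the one from the post.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
sub strip_br_naive {
    my ($var) = @_;
    $var =~ s/<br[^>]*>//ig;
    return $var;
}

# A single <br> element whose attribute value happens to contain "<br>".
my $input = '<br title="a <br> element">';

print '[', strip_br_naive($input), "]\n";
# prints: [ element">]
```

The character class [^>]* stops at the first >, which here is inside the quoted attribute value, so the match ends early and the tail of the tag is left behind as stray text.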

SCNR,
hp
 
