Regex question; match <br> after opening tag

jwcarlton · Feb 16, 2011

I'm working on an area where the visitor submits content via
contenteditable, so the submission comes through in Word-style HTML
(meaning, it's somewhat of a mess, and completely dependent on the
user's browser).

I'm trying to remove opening and closing <br> tags. The problem I'm
having is when those tags come after a <font, <div, or <span, or
before a closing </font>, </div>, or </span>; eg:

<div class=whatever> Hello, World! </div>

It's worth noting that <div>...</div> may or may not be there,
<span>...</span> may or may not be there, <font>...</font> may or may
not be there, they could be transposed (ie, <font> before <span>), and
the <br> tags can be from 0 to 3.

Here's where I am so far:

$text =~ s/^(<div(.*?)>)(<br>)+/$1/gi;
$text =~ s/^(<span(.*?)>)(<br>)+/$1/gi;
$text =~ s/^(<font(.*?)>)(<br>)+/$1/gi;

$text =~ s/(<br>)+(<\/div>)$/$2/gi;
$text =~ s/(<br>)+(<\/span>)$/$2/gi;
$text =~ s/(<br>)+(<\/font>)$/$2/gi;

I have 3 questions on this:

1. First off, does the code above look technically correct to you?
Meaning, would it work if we assume that the tags are always div,
followed by span, followed by font?

2. Is there a way to get these on 1 line?

3. How can I code it to work regardless of which tag comes first?

TIA,

Jason

Jürgen Exner · Feb 16, 2011

jwcarlton said:
I'm working on an area where the visitor submits content via
contenteditable, so the submission comes through in Word-style HTML
(meaning, it's somewhat of a mess, and completely dependent on the
user's browser).

Then why are you trying to use REs to parse this mess?

[typical ill-fated attempt of using the wrong tool for the job deleted]

I have 3 questions on this:

1. First off, does the code above look technically correct to you?
Meaning, would it work if we assume that the tags are always div,
followed by span, followed by font?

Who cares? Nobody in his right mind would use _REGULAR_ expressions to
parse a context-free language.

2. Is there a way to get these on 1 line?

Sure. Just remove the linebreaks.

3. How can I code it to work regardless of which tag comes first?

By writing a proper HTML parser. Or much easier by using one of the
readily available HTML parsers from CPAN.

jue

jwcarlton · Feb 16, 2011

I'm working on an area where the visitor submits content via
contenteditable, so the submission comes through in Word-style HTML
(meaning, it's somewhat of a mess, and completely dependent on the
user's browser).

Then why are you trying to use REs to parse this mess?

[typical ill-fated attempt of using the wrong tool for the job deleted]

I'm guessing that you've never worked with a contenteditable form?
It's not as easy as all that.

Who cares? Nobody in his right mind would use _REGULAR_ expressions to
parse a context-free language.

I care, or I wouldn't have asked. I assume that you care, too, or you
wouldn't have wasted your time on replying

Sure. Just remove the linebreaks.

Sigh.

jwcarlton · Feb 16, 2011

There is no such thing as a "closing" <br> tag...

http://www.w3.org/TR/REC-html32#br

... This is an empty element so the end tag is forbidden

---------------------------
#!/usr/bin/perl
use warnings;
use strict;

my $text = '<div class=whatever> Hello, World! </div>';

$text =~ s/<br>//g;

print "$text\n";
---------------------------

--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

Seriously, why even bother replying?

Dr.Ruud · Feb 16, 2011

I'm trying to remove opening and closing <br> tags.

There is no such thing as a "closing" <br> tag...
[...]

Seriously, why even bother replying?

I guess because all answers to your questions are in the FAQ.
That you shouldn't quote sigs is in another one.

George Mpouras · Feb 16, 2011

my $text = '<div class=whatever> help<o> Hello,
World! 
</div>';

while ( $text =~/ (.+?) /gm )
{
(my $a = $^N)=~s/<.+?>//g;
print "*$a*\n";
}

Justin C · Feb 16, 2011

Seriously, why even bother replying?

Then show us a sample of the content that you are receiving so we can
better understand the problem. Antagonising those who offer suggestions
is never a good move.

Justin.

jwcarlton · Feb 16, 2011

Seriously, why even bother replying?

Then show us a sample of the content that you are receiving so we can
better understand the problem. Antagonising those who offer suggestions
is never a good move.

Justin, please understand that Tad was giving a PITA answer, not a
suggestion. I definitely wasn't antagonizing; if you look closely at
his response, you'll see what I mean.

He and I have a history, and in the years that I've been watching, I
don't think he's ever given a REAL answer to anyone.

Anyway, let's not let Tad ruin yet another thread.

I gave a sample of what I get in the OP:

<div class=whatever> Hello, World! </div>

I'm trying to write a regex that will remove <br> from both the
beginning and the end of the string, but that's also nested within
other tags.

I already use this, which obviously removes the <br> when it's not
nested inside of other tags:

$text =~ s/^(<br>)+|(<br>)+$//gi;

I gave code samples in my OP, too, of what I think will work; the only
problem is that it requires the tags to be in that order; DIV, then
SPAN, then FONT. If the FONT comes before the SPAN, then it doesn't
work, so I'm trying to create a more streamlined method.

Thanks, Justin.

jwcarlton · Feb 16, 2011

my $text = '<div class=whatever> help<o> Hello,
World! 
</div>';

while ( $text =~/ (.+?) /gm )
{
(my $a = $^N)=~s/<.+?>//g;
print "*$a*\n";
}

Awesome, George! I really appreciate that.

Jürgen Exner · Feb 16, 2011

jwcarlton said:
I gave a sample of what I get in the OP:

<div class=whatever> Hello, World! </div>

I'm trying to write a regex that will remove <br> from both the
beginning and the end of the string, but that's also nested within
other tags.

I already use this, which obviously removes the <br> when it's not
nested inside of other tags:

$text =~ s/^(<br>)+|(<br>)+$//gi;

I gave code samples in my OP, too, of what I think will work; the only
problem is that it requires the tags to be in that order; DIV, then
SPAN, then FONT. If the FONT comes before the SPAN, then it doesn't
work, so I'm trying to create a more streamlined method.

And these conditions are exactly why using a simple-minded regular
expression is an unsuitable approach, in particular if you have no
control over the format of the incoming data.
Use a parser that actually parses HTML fragments and creates a syntax
tree, and then delete or keep exactly those elements that you want.

Doing it on the textual level is not going to work reliably.

jue

George Mpouras · Feb 16, 2011

Don't do that with a regex. A regular expression can only express a
regular grammar - hence the name. HTML is a context-free grammar, which
needs a more complex parser than a regex can provide.

sometimes a "good enough" workaround is just fine
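
For the question as originally asked (drop the <br> runs that sit just inside the opening and closing wrappers, whatever order div, span and font come in), a workaround of that kind might look like the sketch below. The helper name trim_br, the sample strings, and the tolerance for whitespace between tags are assumptions made for illustration, not something posted in the thread. It is still a textual hack: an attribute value containing '>' or any wrapper other than the three listed will defeat it, which is the objection raised above.

```perl
use strict;
use warnings;

# Hypothetical helper: remove leading and trailing <br> runs that sit
# directly inside any mix of <div>, <span> and <font> wrappers.
sub trim_br {
    my ($text) = @_;

    # One <br> tag (also <BR>, <br />, <br class=x>) plus trailing whitespace.
    my $br = qr{<br\b[^>]*>\s*}i;

    # Leading: keep the run of opening wrapper tags, drop the <br> run after it.
    $text =~ s{\A((?:\s*<(?:div|span|font)\b[^>]*>)*)\s*(?:$br)+}{$1}i;

    # Trailing: drop the <br> run, keep the run of closing wrapper tags.
    $text =~ s{(?:$br)+((?:</(?:div|span|font)\s*>\s*)*)\z}{$1}i;

    return $text;
}

print trim_br('<div class=whatever><font><span><br><br>Hello,<br>World!<br></span></font></div>'), "\n";
# prints: <div class=whatever><font><span>Hello,<br>World!</span></font></div>
```

The alternation (?:div|span|font) inside a repeated group is what makes the tag order irrelevant, and the <br> between "Hello," and "World!" survives because both substitutions are anchored to the ends of the string.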

Keith Keller · Feb 16, 2011

sometimes a "good enough" workaround is just fine

Perhaps. This isn't one of those times, especially since the HTML
modules available with Perl are excellent and easy to use.

--keith

jwcarlton · Feb 17, 2011

He and I have a history

Then maybe you should simply ignore his posts.

I try; really, I do. I was mostly concerned that others would glance
over the thread and think that he had legitimately solved the problem.

Don't do that with a regex. A regular expression can only express a
regular grammar - hence the name. HTML is a context-free grammar, which
needs a more complex parser than a regex can provide.

Have a look at HTML::Parser:

<http://search.cpan.org/perldoc?HTML::Parser>

For now, I have a filter that I wrote a few years ago, and it's
working well enough so I'm just trying to correct what's really just
one minor issue. I do intend to change it to work with a parser in the
near future, though; which would probably have been smarter in the
beginning, but when I asked for help, I only got responses like the
first few in this thread, so I just gave up and did it a way that I
knew.

In fact, I just looked, and the responses I got then were that I
should write my own. Funny when you consider that, now that I've
written my own, all of the responses are that I should have used a
module! LOL

I've considered HTML::Parser, but if I understand correctly, don't you
have to specifically define which tags you want to parse? That's all
well and good, except that people often paste data from other sites,
so it's difficult to think of every possibility.

I'm looking at HTML::HTML5::Parser, but I'm messing up in a way that I
don't get. Here's the code I'm entering, which is almost exactly
what's on CPAN:

#!/usr/bin/perl
use CGI::Carp qw(fatalsToBrowser);
use HTML::HTML5::Parser;

$comment = "<!doctype html>\n<title>Foo</title>\nFoo bar.\nBazQuux.";

my $parser = HTML::HTML5::Parser->new;
$comment = $parser->parse_string($comment);

print "Content-type: text/html\n\n";
print "$comment";
exit;

All this prints, though, is:

XML::LibXML::Document=SCALAR(0x924f988)

I double checked, and do have XML::LibXML installed. The
HTML::HTML5::Parser is a fresh install from yesterday.

Any suggestions on how to print the parsed string, if I'm doing it
incorrectly?

Keith Keller · Feb 17, 2011

I've considered HTML::Parser, but if I understand correctly, don't you
have to specifically define which tags you want to parse? That's all
well and good, except that people often paste data from other sites,
so it's difficult to think of every possibility.

HTML::Parser can do basically any HTML parsing. But this also means
you have a fair amount of coding to tell it what to do. You can also
look at HTML::TreeBuilder, which uses HTML::Parser to build a nice hash
structure and provide powerful search functions on the structure.

--keith

jwcarlton · Feb 17, 2011

I try; really, I do.

"Do, or do not - there is no try." - Yoda

I thought that was Mr. Miyagi? LOL

That just goes to show, you should consider the source. One of our more
persistent trolls here used to give that same, very misguided, advice
whenever the topic came up. I'm sorry to hear you were misled by bad
advice.

It happens. Honestly, for awhile I was getting 0 help on here, just
all trolls, so it left a rather bad taste in my mouth.

I've been coding in Perl for almost 15 years, and I keep thinking of
how helpful everyone here used to be when I was just starting. I don't
know if it's that the type of people that post has changed, or if I'm
just more sensitive, or if I'm just becoming an old man thinking about
"the good ol' days". Probably a mix of the 3.

I used to ALWAYS know better than to feed the trolls, too. Maybe it is
just an old man thing? :-(

The HTML::HTML5::Parser docs say that parse_string() should give you an
instance of XML::LibXML::Document, and the message above indicates that
it did. That's good news, as it shows that nothing actually went wrong;
the problem is that you're trying to print the object as if it were just
a string. What you should do instead is check the docs for that module,
and find a method for that object that will give you a string. At first
glance, it looks to me like toString() would be appropriate:

print $comment->toString();

Awesome! That worked perfectly, Sherm.

I looked all through the docs, both last night and today, and didn't
see anything like that. For the sake of my own learnin', where exactly
did you find that?

Note that X::L::Document has some other interesting methods, that relate
to querying the document to get a collection of all the elements of a
given type, or an element with a particular id. These DOM methods are
the same (language differences aside) as those provided by JavaScript
on the document object.

Cool, thanks again!

Mart van de Wege · Feb 17, 2011

George Mpouras said:
sometimes a "good enough" workaround is just fine

Yes, but that requires understanding that it *is* a "good enough"
workaround.

And the burden of proving that understanding is on the one asking a
FAQ. And he's not doing too good a job of it right now.

Mart

Mart van de Wege · Feb 17, 2011

jwcarlton said:
I try; really, I do. I was mostly concerned that others would glance
over the thread and think that he had legitimately solved the problem.

I did. His solution of using a parser was the correct one.

I've considered HTML::Parser, but if I understand correctly, don't you
have to specifically define which tags you want to parse? That's all
well and good, except that people often paste data from other sites,
so it's difficult to think of every possibility.

You are Not Getting It.

If your users can give you data that you cannot handle, the best method
is to reject or discard it (in case of HTML, don't process tags you have
no handlers for).

The way HTML::Parser goes about it according to your description *is*
the right way to do it.

When treating outside data, always do so on a white-list basis: only
accept what you have explicitly defined. Trying to think of everything
leads to security holes. If you write software this way, the question is
*when* it will be exploited, not if.

Mart

ccc31807 · Feb 17, 2011

I'm trying to remove opening and closing <br> tags.

What you want to do, assuming that you have the entire ASCII text in a
variable, is this:

$var =~ s/<br[^>]*>//ig;

This looks for three literal characters, the '<', 'b', and 'r', then
looks for any number of characters (including zero characters) which
are not a literal '>', then a literal '>', and replaces them with
nothing, looking globally in a case insensitive manner.

CC.
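
To see the substitution above in action, here is a small sketch. The wrapper name strip_br and both sample strings are made up for illustration. The first call behaves as described; the second shows one way the pattern can go wrong, using a hypothetical attribute value that contains a '>'.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted above.
sub strip_br {
    my ($var) = @_;
    $var =~ s/<br[^>]*>//ig;   # '<br', then any non-'>' characters, then '>'
    return $var;
}

print strip_br("<div>one<br>two<BR />three<br class='x'>four</div>"), "\n";
# prints: <div>onetwothreefour</div>

# A '>' inside an attribute value ends the match early, leaving debris behind.
print "[", strip_br('<br title="a > b">text'), "]\n";
# prints: [ b">text]
```

Note also that the pattern removes every <br> in the string, not only the leading and trailing ones, and that without a \b after "br" it would match any tag whose name merely starts with those two letters.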

Peter J. Holzer · Feb 18, 2011

I'm trying to remove opening and closing <br> tags.

What you want to do, assuming that you have the entire ASCII text in a
variable, is this:

$var =~ s/<br[^>]*>//ig;

This looks for three literal characters, the '<', 'b', and 'r', then
looks for any number of characters (including zero characters) which
are not a literal '>', then a literal '>', and replaces them with
nothing, looking globally in a case insensitive manner.

element">

SCNR,
hp

ccc31807 · Feb 18, 2011

element">

not valid HTML.

 element" />

CC.
