Regular Expression for HTML Tags and Special Characters

Marc Bogaard

Hello together!

How can I allow some HTML tags like <BR>, <B>, <P> but
filter out < and > when they stand alone?

It must be something like "^[A-Za-z0-9\>\<]+$"
for the < and >, but where do I have to put in my tags?


thank you in advance
marc van den Bogaard
 
Josef Moellers

Marc said:
Hello together!

How can I allow some HTML tags like <BR>, <B>, <P> but
filter out < and > when they stand alone?

It must be something like "^[A-Za-z0-9\>\<]+$"
for the < and >, but where do I have to put in my tags?

how about "<[a-zA-Z0-9]{1,2}>"?
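
A pattern like that only finds candidate tags; something still has to decide what happens to everything else. Here is a minimal sketch of one way to do the whole job in a single substitution, assuming the allowed tags never carry attributes. The sub name filter_tags and the hard-coded tag list are made up for illustration, and lone or disallowed brackets are escaped rather than deleted:

```perl
use strict;
use warnings;

# Hypothetical whitelist; the tags are the ones named in the question.
my $allowed = qr{</?(?:br|b|p)>}i;
my %escape  = ('<' => '&lt;', '>' => '&gt;');

# Keep whitelisted tags as they are, escape every other < or >.
sub filter_tags {
    my ($text) = @_;
    $text =~ s{($allowed)|([<>])}{ defined $1 ? $1 : $escape{$2} }ge;
    return $text;
}

print filter_tags("<B>bold</B> and 1 < 2 but <i>not this</i><BR>"), "\n";
```

The alternation tries the whitelist first at every position, so a bracket is only escaped when no allowed tag starts there. As soon as attributes, comments or quoted ">" characters show up, this stops being adequate.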
 
Jon Ericson

Hello together!

Hallo!

(In English, the idiom is "Hello all!" or "Hello everyone!" or "Hello
folks!")

How can I allow some HTML tags like <BR>, <B>, <P> but
filter out < and > when they stand alone?

It must be something like "^[A-Za-z0-9\>\<]+$"
for the < and >, but where do I have to put in my tags?

Do you mean that all of the tags will be on their own line? Do you
want to remove the tags, everything within the tags or just < and >?
Maybe you could show us some sample input and the expected output?

I modified one of the examples from HTML::Parser to do what I *think*
you want:

#!perl -w
use strict;

use HTML::Parser;

# Tags that are passed through; all other tags are dropped.
my %allowed = map {$_ => 1} qw{br b p};

HTML::Parser->new(default_h => [sub { print shift }, 'text'],
                  start_h   => [sub { my $tag = shift;
                                      print "<$tag>" if $allowed{$tag} },
                                'tagname'],
                  end_h     => [sub { my $tag = shift;
                                      print "</$tag>" if $allowed{$tag} },
                                'tagname']
                 )->parse_file(shift || die) || die $!;


Given:

<BR>eakfast every morning <B>efore going to work or <no> lunch for you.
</P>aragraphs like this are </kooky>.

It produces:

<br>eakfast every morning <b>efore going to work or lunch for you.
</p>aragraphs like this are .

Jon
 
Bart Lateur

Marc said:
How can I allow some HTML tags like <BR>, <B>, <P> but
filter out < and > when they stand alone?

Typically, people don't like you using regexes for this kind of task,
because the pattern would have to be *really* complex before it works
satisfactorily. Instead, use something involving an HTML parser module.

I like HTML::TokeParser::Simple for that kind of task.

<http://search.cpan.org/search?module=HTML::TokeParser::Simple>

You loop through the input, processing one token (tag, comment, piece
of text) at a time, act differently depending on the type of token and
its actual contents, and can use $token->as_is to just pass it through
unchanged (the ordinary case). You can filter out disallowed tags,
disallowed attributes. You could probably even use it to balance the
left over, allowed tokens.

Here's a demo script (do at least remove the whitespace in front of the
line containing just "*END*"):

use HTML::TokeParser::Simple;

my $html = <<"*END*";
<P>Get up in the morning, slaving for bread, sir,
<BR>so that every mouth can be fed.
<P><B>Poor me</B>, the Israelite. <I>Aah.</I>
<!-- this is a comment. It'll be gone. -->
<P>There's a lone "<" in here, matched by a lone ">".
<script language="Javascript">alert("Hello, World!")</script>
<P>I don't like <a href="http://example.com">links</a> either,
but will allow for <a name="foo"></a>anchors.
*END*

my $p = HTML::TokeParser::Simple->new(\$html);
my %allow = map { $_ => 1 } qw(b i u br p);
my %wipe_content = map { $_ => 1 } qw(style script);
my %escape = ( '<' => '&lt;', '>' => '&gt;');

while(my $t = $p->get_token) {
    if($t->is_tag) {
        my $tag = $t->get_tag;
        if($tag eq 'a') {
            print $t->as_is, "</a>" if defined $t->get_attr('name');
        } elsif($allow{$tag}) {
            print $t->as_is;
        } elsif($wipe_content{$tag}) {
            while(my $t = $p->get_token) {
                # wipe
                last if $t->is_end_tag($tag);
            }
        }
    } elsif($t->is_comment) {
        # wipe
    } elsif($t->is_text) {
        my $text = $t->as_is;
        $text =~ s/([<>])/$escape{$1}/g;
        print $text;
    }
}


Result:
<P>Get up in the morning, slaving for bread, sir,
<BR>so that every mouth can be fed.
<P><B>Poor me</B>, the Israelite. <I>Aah.</I>

<P>There's a lone "&lt;" in here, matched by a lone "&gt;".

<P>I don't like links either,
but will allow for <a name="foo"></a>anchors.
 
Vijai Kalyan

How can I allow some HTML tags like <BR>, <B>, <P> but
filter out < and > when they stand alone?

It must be something like "^[A-Za-z0-9\>\<]+$"
for the < and >, but where do I have to put in my tags?

As others said below, you should be using a parser instead of regexp
for this, but I am just a beginner with perl and am trying to answer
questions to get practice.

If you really want to use a regexp, look up an example that's in the
first chapter of the Camel book.

It goes something like this: (I will let you do the homework :)

m/<(.*?)>.*?(\/\1)/

which means,

a. minimally match something within a < and a >

b. minimally match anything (. matches everything but newline, so you
might want to modify that - again, homework :)

c. make a back reference to what was found between the first < and >.

NOTE:

a. This probably won't work if you have attributes so a modification
might be:

m/<\s*(\w+)\s+.*?>.*?(\/\1)/

which (I think) means:

i. Match a < followed by any number of whitespace chars, followed by one
or more word chars, followed by at least one more whitespace char.

ii. Finally any number of chars is minimally matched till again a > is
met.

iii. Again the back reference is used to force the same pattern (here,
this will be the tag) to match at the end.

As someone said, it gets complicated.
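
For anyone who wants to poke at it, here is a small sketch that runs the first pattern exactly as posted against three strings. The test strings are made up for illustration; nothing here comes from the Camel book:

```perl
use strict;
use warnings;

my $re = qr/<(.*?)>.*?(\/\1)/;   # first pattern from the post, unchanged

for my $s ('<b>bold</b>', '<b>a/b testing', '<p class="x">text</p>') {
    if ($s =~ $re) {
        print "matched   '$s' (tag: $1)\n";
    } else {
        print "no match  '$s'\n";
    }
}
```

The second string matches even though it contains no end tag at all, because the pattern only asks for "/b" somewhere after the start tag. The third fails because the back reference holds the whole 'p class="x"' text, attributes included.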

hth,
 
Tad McClellan

Vijai Kalyan said:
How can I allow some HTML tags like <BR>, <B>, <P> but
filter out < and > when they stand alone?

It must be something like "^[A-Za-z0-9\>\<]+$"
for the < and >, but where do I have to put in my tags?

As others said below, you should be using a parser instead of regexp
for this, but I am just a beginner with perl and am trying to answer
questions to get practice.

If you really want to use a regexp, look up an example that's in the
first chapter of the Camel book.

It goes something like this: (I will let you do the homework :)

m/<(.*?)>.*?(\/\1)/
             ^^^^

Where are the <angle brackets> for the endtag?

m/<\s*(\w+)\s+.*?>.*?(\/\1)/
   ^^^

It is invalid HTML if it has whitespace there.

Will that work on the below (after taking out the \s*)?

<foo>bar</foo>

As someone said, it gets complicated.


here are some more complications to try and match correctly:

<!-- there are no <tags></tags> on this line at all! -->

<img src="cool.jpg" alt=">>cool pic!<<">
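
To see why these two are awkward, here is a sketch that runs a deliberately naive tag pattern over them. The pattern <[^>]*> is my own illustration of "anything between < and >", not one proposed in this thread:

```perl
use strict;
use warnings;

my @samples = (
    '<!-- there are no <tags></tags> on this line at all! -->',
    '<img src="cool.jpg" alt=">>cool pic!<<">',
);

for my $html (@samples) {
    my @found = $html =~ /(<[^>]*>)/g;   # naive "anything between < and >"
    print "input: $html\n";
    print "  naive match: $_\n" for @found;
}
```

On the comment line it reports two "tags" where there are none, and on the img element it stops at the first ">" inside the quoted alt text.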
 
Vijayaraghavan Kalyanapasupathy

m/<(.*?)>.*?(\/\1)/
             ^^^^

Where are the <angle brackets> for the endtag?

Oops, apologies are in order. I missed them!

m/<\s*(\w+)\s+.*?>.*?(\/\1)/
   ^^^

It is invalid HTML if it has whitespace there.

I didn't know that. In this case we would modify it to

m/<(\w+)\s+.*?>.*?(\/\1)/

Will that work on the below (after taking out the \s*)?

<foo>bar</foo>

(I am trying to answer without running the code. So if there's a
mistake, you know whom to blame: me.)

You are right, it wouldn't. I would have to do this instead:

m/<(\w+)*?\s+.*?>.*?<\/\1>/

This will catch the "foo" minimally.

here are some more complications to try and match correctly:

<!-- there are no <tags></tags> on this line at all! -->

But if we had a regular expression that matched a <!-- . -->, wouldn't
that gobble up the <tags></tags> in between?

Of course, I am thinking more along the lines of a Lex input
specification where you would typically do a:

<YYINITIAL> "<!--" { yybegin(COMMENT); }
<COMMENT> "-->" { yybegin(YYINITIAL); }
<COMMENT> \n { }
<COMMENT> \r { yybegin(DOSEOL); }
<COMMENT> . { }
<DOSEOL> \n { }
<DOSEOL> . { yybegin(COMMENT); }

This is actually JLex input, but you can understand what it does. (I
just slightly modified a JLex input for ASN.1 that I wrote. That's
why you see the redundant state transitions on \n and \r. In ASN.1 a
comment can be multiline but each multiline comment has to have the
comment starter; in this case --)

<img src="cool.jpg" alt=">>cool pic!<<">

Yes, I think the modified regexp above would get this.

Why? The minimal matcher should stop at the first ">" in the substring
">>". Also, because of the intervening \s+ and .*? the back reference
would yield only "img". But the input itself in this case is incorrect,
right?

If I am wrong, do correct.

thanx,

hth,
 
Vijayaraghavan Kalyanapasupathy

The waters do get murkier. No, you are correct, I made a mistake in the

<img src=".." alt=">>CoolPic<<">

example you gave.

The regexp would actually do the wrong thing because it's too simple.
As I said, it would definitely be easier with a lexer where you can
remember state!

It does get exceedingly complicated. But, then I am not sure if a
regular expression can really match all types of input. Isn't that what
the Chomsky hierarchy is about?

Correct me if I am wrong.
 
Tad McClellan

Vijayaraghavan Kalyanapasupathy said:

I didn't know that.


Yet another reason to use an HTML module.

The module authors know these things. :)

In this case we would modify it to

m/<(\w+)\s+.*?>.*?(\/\1)/


Let's put the endtag angle brackets in there too:

(I am trying to answer without running the code.


That's what I was hoping you would do...

So if there's a
mistake, you know whom to blame: me.)

You are right, it wouldn't. I would have to do this instead:

m/<(\w+)*?\s+.*?>.*?<\/\1>/

This will catch the "foo" minimally.


... but then you should check yourself by trying it in actual code. :)

It will fail to match at all.

Your pattern requires at least one whitespace and <foo> does
not contain a whitespace.
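
Checking takes only a few lines. The sketch below tries the posted pattern and, for comparison, a hypothetical variant of my own ($relaxed) that makes the whitespace-plus-attributes part optional. The variant is only an illustration and still trips over comments and quoted ">" characters:

```perl
use strict;
use warnings;

my $posted  = qr/<(\w+)*?\s+.*?>.*?<\/\1>/;        # pattern quoted above
my $relaxed = qr/<(\w+)(?:\s+[^>]*)?>.*?<\/\1>/;   # hypothetical variant

for my $s ('<foo>bar</foo>', '<foo class="x">bar</foo>') {
    printf "%-26s posted: %-3s relaxed: %s\n", $s,
        ($s =~ $posted  ? 'yes' : 'no'),
        ($s =~ $relaxed ? 'yes' : 'no');
}
```

The posted pattern only matches the second string, the one that has whitespace after the tag name.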

But if we had a regular expression that matched a <!-- . --> wouldn't
that gobble up the <tags></tags> inbetween?


Yes, but matching "comment declarations" is not that easy, yet another
reason to use an HTML module. Here is a valid comment declaration:

Of course, I am thinking more along the lines of a Lex input
specification where you would typically do a:


[snip lex patterns]

The grammar for SGML comment declarations is a good bit more complex
than that.
 
Tad McClellan

Vijayaraghavan Kalyanapasupathy said:
It does get exceedingly complicated. But, then I am not sure if a
regular expression can really match all types of input. Isn't that what
the Chomsky hierarchy is about?

Correct me if I am wrong.


No corrections required.

Regular Expressions do not have the power required to parse a
Context Free grammar, such as HTML.



Note:

Perl's regular expressions are no longer Regular; they are called
that for historical reasons rather than for mathematically correct
reasons.
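
One way to see the extra power, assuming Perl 5.10 or later (which added recursive patterns such as (?1)): the toy pattern below accepts arbitrarily deep, balanced nesting of one element, which a truly regular expression cannot do. The variable name $table and the test strings are made up, and this is in no way an HTML parser:

```perl
use strict;
use warnings;
use 5.010;   # recursive patterns such as (?1) need Perl 5.10 or later

# Toy pattern: a <table> element containing only text and further
# <table> elements.
my $table = qr{
    (                           # group 1: one balanced element
        <table>
        (?: [^<]++ | (?1) )*    # text, or a nested element via recursion
        </table>
    )
}x;

for my $s ('<table>a<table>b</table>c</table>', '<table><table></table>') {
    print(($s =~ /^$table$/ ? 'balanced' : 'unbalanced'), ": $s\n");
}
```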
 
Bart Lateur

Tad said:
Regular Expressions do not have the power required to parse a
Context Free grammar, such as HTML.

I doubt if HTML actually is a context free grammar. There are no
recursive rules, AFAIK.
 
Tad McClellan

Bart Lateur said:
I doubt if HTML actually is a context free grammar. There are no
recursive rules, AFAIK.


tables can nest arbitrarily deep:

<table>
  <tbody>
    <tr>
      <td>
        <table>
          start all over again...
 
Bart Lateur

Tad said:
tables can nest arbitrarily deep:

Ah yes, that way. But no-one who tries to use regexes to parse HTML,
tries to match particular tags. Instead, they try to recognize tags,
text, comments... that sort of stuff. The rules for those aren't
recursive.
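
A rough sketch of that split, with deliberately simplified token rules (a comment ends at the first "-->", a tag at the first ">", so quoted ">" characters in attributes are not handled). The sub name tokenize and the sample input are made up for illustration. The regex only recognizes tokens; nesting is tracked with an ordinary counter outside it:

```perl
use strict;
use warnings;

# Split input into comment, tag and text tokens.
sub tokenize {
    my ($html) = @_;
    my @tokens;
    while ($html =~ /\G(?:(<!--.*?-->)|(<\/?\w[^>]*>)|([^<]+|<))/gcs) {
        push @tokens, defined $1 ? [comment => $1]
                    : defined $2 ? [tag     => $2]
                    :              [text    => $3];
    }
    return @tokens;
}

# Count how deep <table> elements nest, using the token stream.
my ($depth, $max) = (0, 0);
my $input = '<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>';
for my $tok (tokenize($input)) {
    my ($type, $value) = @$tok;
    next unless $type eq 'tag';
    if    ($value =~ /^<table\b/i)   { $depth++; $max = $depth if $depth > $max }
    elsif ($value =~ /^<\/table\b/i) { $depth-- }
}
print "deepest table nesting: $max\n";
```

A "<" that does not start a tag or comment comes out as a one-character text token, so it can be escaped on output.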
 
