Regular Expression for HTML Tags and Special Characters

Discussion in 'Perl Misc' started by Marc Bogaard, Oct 19, 2004.

  1. Marc Bogaard

    Marc Bogaard Guest

    Hello together!

    How can I allow some HTML tags like <BR>, <B>, <P> but
    filter out < and > when they stand alone?

    Must be something like: "^[A-Za-z0-9\>\<]+$"
    for the < and >, but where do I have to put in my tags?

    thank you in advance
    marc van den Bogaard
    Marc Bogaard, Oct 19, 2004

  2. How about "<[a-zA-Z0-9]{1,2}>"?
    Josef Moellers, Oct 19, 2004

  3. Jon Ericson

    Jon Ericson Guest


    (In English, the idiom is "Hello all!" or "Hello everyone!")

    Do you mean that all of the tags will be on their own line? Do you
    want to remove the tags, everything within the tags or just < and >?
    Maybe you could show us some sample input and the expected output?

    I modified one of the examples from HTML::Parser to do what I *think*
    you want:

    #!perl -w
    use strict;

    use HTML::Parser;

    # Tags that may pass through; every other tag is dropped.
    my %allowed = map {$_ => 1} qw{br b p};

    # 'tagname' hands the lowercased tag name to the start/end handlers;
    # everything without its own handler (text etc.) is printed by default_h.
    HTML::Parser->new(default_h => [sub { print shift }, 'text'],
                      start_h => [sub { my $tag = shift;
                                        print "<$tag>" if $allowed{$tag} },
                                  'tagname'],
                      end_h => [sub { my $tag = shift;
                                      print "</$tag>" if $allowed{$tag} },
                                'tagname'],
                     )->parse_file(shift || die) || die $!;

    Given this input:

    <BR>eakfast every morning <B>efore going to work or <no> lunch for you.
    </P>aragraphs like this are </kooky>.

    It produces:

    <br>eakfast every morning <b>efore going to work or lunch for you.
    </p>aragraphs like this are .

    Jon Ericson, Oct 19, 2004
  4. Bart Lateur

    Bart Lateur Guest

    Typically, people don't like you using regexes for this kind of task,
    because the pattern would be *really* complex before working
    satisfactorily. Instead, use something involving an HTML parser module.

    I like HTML::TokeParser::Simple for that kind of task.


    You loop through the input, processing one token (tag, comment, piece
    of text) at a time, act differently depending on the type of token and
    its actual contents, and can use $token->as_is to just pass it through
    unchanged (the ordinary case). You can filter out disallowed tags,
    disallowed attributes. You could probably even use it to balance the
    left over, allowed tokens.

    Here's a demo script (do at least remove the whitespace in front of the
    line containing just "*END*"):

    use HTML::TokeParser::Simple;

    my $html = <<"*END*";
    <P>Get up in the morning, slaving for bread, sir,
    <BR>so that every mouth can be fed.
    <P><B>Poor me</B>, the Israelite. <I>Aah.</I>
    <!-- this is a comment. It'll be gone. -->
    <P>There's a lone "<" in here, matched by a lone ">".
    <script language="Javascript">alert("Hello, World!")</script>
    <P>I don't like <a href="">links</a> either,
    but will allow for <a name="foo"></a>anchors.
    *END*

    my $p = HTML::TokeParser::Simple->new(\$html);
    my %allow = map { $_ => 1 } qw(b i u br p);          # passed through as-is
    my %wipe_content = map { $_ => 1 } qw(style script); # dropped with content
    my %escape = ( '<' => '&lt;', '>' => '&gt;');

    while(my $t = $p->get_token) {
        if($t->is_tag) {
            my $tag = $t->get_tag;
            if($tag eq 'a') {
                # Assumption: keep only named anchors, which matches the
                # output shown below.
                print $t->as_is, "</a>"
                    if $t->is_start_tag && defined $t->get_attr('name');
            } elsif($allow{$tag}) {
                print $t->as_is;
            } elsif($wipe_content{$tag}) {
                while(my $t = $p->get_token) {
                    # wipe
                    last if $t->is_end_tag($tag);
                }
            }
        } elsif($t->is_comment) {
            # wipe
        } elsif($t->is_text) {
            my $text = $t->as_is;
            $text =~ s/([<>])/$escape{$1}/g;
            print $text;
        }
    }

    It prints:

    <P>Get up in the morning, slaving for bread, sir,
    <BR>so that every mouth can be fed.
    <P><B>Poor me</B>, the Israelite. <I>Aah.</I>

    <P>There's a lone "&lt;" in here, matched by a lone "&gt;".

    <P>I don't like links either,
    but will allow for <a name="foo"></a>anchors.
    Bart Lateur, Oct 20, 2004
  5. Vijai Kalyan

    Vijai Kalyan Guest

    As others said below, you should be using a parser instead of a regexp
    for this, but I am just a beginner with Perl and am trying to answer
    questions to get practice.

    If you really want to use a regexp, look up an example that's in the
    first chapter of the Camel book.

    It goes something like this: (I will let you do the homework :)


    which means,

    a. minimally match something within a < and a >

    b. minimally match anything (. matches everything but newline, so you
    might want to modify that - again, homework :)

    c. make a back reference to what was found between the first < and >.
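
A pattern that fits points a to c might look like the sketch below. It is pieced together from the description rather than quoted from the book, and the helper name `find_pair` and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Return (tag, content) for the first <x>...</x> pair found, or an empty list.
sub find_pair {
    my ($text) = @_;
    #   <(.*?)>   minimal match between < and >, captured as group 1
    #   (.*?)     minimal match of the content (. does not match a newline)
    #   </\1>     backreference: the closing tag must repeat group 1
    return $text =~ m{<(.*?)>(.*?)</\1>} ? ($1, $2) : ();
}

my ($tag, $content) = find_pair('some <B>bold</B> text');
print "$tag: $content\n";    # prints "B: bold"
```

A start tag with attributes defeats this sketch: `find_pair('<a href="x">link</a>')` returns an empty list, because the whole `a href="x"` would have to reappear in the closing tag.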


    a. This probably won't work if you have attributes so a modification
    might be:


    which (I think) means:

    i. Match a < followed by any number of whitespace chars, followed by one
    or more word chars, followed again by whitespace chars.

    ii. Finally any number of chars is minimally matched till again a > is
    found.

    iii. Again the back reference is used to force the same pattern (here,
    this will be the tag) to match at the end.
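
Points i to iii could be sketched as follows. The exact pattern is an assumption reconstructed from the description, and `find_element` is a made-up helper name.

```perl
use strict;
use warnings;

# Return (tag name, content) for the first element whose start tag has
# whitespace after the tag name, or an empty list if nothing matches.
sub find_element {
    my ($text) = @_;
    return $text =~ m{
        < \s* (\w+) \s+   # i.   <, optional whitespace, tag name, whitespace
        .*? >             # ii.  anything, minimally, up to the next >
        (.*?)             #      element content, minimally
        </ \1 >           # iii. closing tag must repeat the captured name
    }x ? ($1, $2) : ();
}

my ($tag, $content) = find_element('<a href="x.html">link</a>');
print "$tag: $content\n";    # prints "a: link"
```

Because this sketch demands at least one whitespace character after the tag name, a plain `<foo>bar</foo>` does not match at all.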

    As someone said, it gets complicated.

    Vijai Kalyan, Oct 20, 2004
  6. ^^^

    It is invalid HTML if it has whitespace there.

    Will that work on the below (after taking out the \s*)?

    Here are some more complications to try and match correctly:

    <!-- there are no <tags></tags> on this line at all! -->

    <img src="cool.jpg" alt=">>cool pic!<<">
    Tad McClellan, Oct 20, 2004
  7. Oops, apologies are in order. I missed them!
    I didn't know that. In this case we would modify it to

    (I am trying to answer without running the code. So if there's a
    mistake, you know whom to blame: me.)

    You are right it wouldn't. I would have to do this instead:


    This will catch the "foo" minimally.
    But if we had a regular expression that matched a <!-- . -->, wouldn't
    that gobble up the <tags></tags> in between?

    Of course, I am thinking more along the lines of a Lex input
    specification where you would typically do a:

    <YYINITIAL> "<!--" { yybegin(COMMENT); }
    <COMMENT> "-->" { yybegin(YYINITIAL); }
    <COMMENT> \n { }
    <COMMENT> \r { yybegin(DOSEOL); }
    <COMMENT> . { }
    <DOSEOL> \n { }
    <DOSEOL> . { yybegin(COMMENT); }

    This is actually JLex input, but you can understand what it does. (I
    just slightly modified it from a JLex input for ASN.1 that I wrote. That's
    why you see the redundant state transitions on \n and \r. In ASN.1 a
    comment can be multiline but each multiline comment has to have the
    comment starter; in this case --)
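
For readers who do not know JLex, roughly the same two-state idea can be sketched in Perl. This is only an illustration: `strip_comments` is a made-up name, and it treats a comment as everything from `<!--` to the next `-->`, which is simpler than the real SGML rules.

```perl
use strict;
use warnings;

# Remove <!-- ... --> spans with an explicit state variable,
# mirroring the YYINITIAL / COMMENT states of the lexer spec.
sub strip_comments {
    my ($text) = @_;
    my $state = 'INITIAL';
    my $out   = '';
    pos($text) = 0;
    while (pos($text) < length($text)) {
        if ($state eq 'INITIAL') {
            if    ($text =~ /\G<!--/gc) { $state = 'COMMENT' }  # enter comment
            elsif ($text =~ /\G(.)/gcs) { $out .= $1 }          # copy one char
            else                        { last }
        }
        else {
            if    ($text =~ /\G-->/gc)  { $state = 'INITIAL' }  # leave comment
            elsif ($text =~ /\G./gcs)   { }                     # discard one char
            else                        { last }
        }
    }
    return $out;
}

print strip_comments("a<!-- no <tags></tags> here -->b\n");   # prints "ab"
```

The `/gc` flags with `\G` keep the scan position even when an alternative fails, so each branch can be tried in turn at the same offset, much like the rules of one lexer state.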
    Yes, I think the modified regexp above would get this.

    Why? The minimal matcher should stop at the first ">" in the substring
    ">>". Also, because of the intervening \s+ and .*? the back reference
    would yield only "img". But the input itself in this case is incorrect.

    If I am wrong, do correct me.


    Vijayaraghavan Kalyanapasupathy, Oct 21, 2004
  8. The waters do get murkier. No, you are correct, I made a mistake in the

    <img src=".." alt=">>CoolPic<<">

    example you gave.

    The regexp would actually do the wrong thing because it's too simple.
    As I said, it would definitely be easier with a lexer where you can
    remember state!

    It does get exceedingly complicated. But, then I am not sure if a
    regular expression can really match all types of input. Isn't that what
    the Chomsky hierarchy is about?

    Correct me if I am wrong.
    Vijayaraghavan Kalyanapasupathy, Oct 21, 2004

  9. Yet another reason to use an HTML module.

    The module authors know these things. :)

    Let's put the endtag angle brackets in there too:

    That's what I was hoping you would do...

    .... but then you should check yourself by trying it in actual code. :)

    It will fail to match at all.

    Your pattern requires at least one whitespace and <foo> does
    not contain a whitespace.

    Yes, but matching "comment declarations" is not that easy, yet another
    reason to use an HTML module. Here is a valid comment declaration:

    [snip lex patterns]

    The grammar for SGML comment declarations is a good bit more complex
    than that.
    Tad McClellan, Oct 21, 2004

  10. No corrections required.

    Regular Expressions do not have the power required to parse a
    Context Free grammar, such as HTML.


    Perl's regular expressions are no longer Regular; they are called
    that for historical reasons rather than for mathematically-correct ones.
    Tad McClellan, Oct 21, 2004
  11. Bart Lateur

    Bart Lateur Guest

    I doubt if HTML actually is a context free grammar. There are no
    recursive rules, AFAIK.
    Bart Lateur, Oct 21, 2004

  12. tables can nest arbitrarily deep:

    start all over again...
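
A small sketch of why nesting matters: a classical regular expression has no counter, so tracking how deep the tables go needs some extra state. The helper name `max_table_depth` and the sample markup are made up for illustration, and the tag matching is deliberately naive.

```perl
use strict;
use warnings;

# Walk over <table> and </table> tags in order and track nesting depth.
sub max_table_depth {
    my ($html) = @_;
    my ($depth, $max) = (0, 0);
    while ($html =~ m{<(/?)table\b[^>]*>}gi) {
        if ($1) { $depth-- if $depth > 0 }            # closing tag
        else    { $max = $depth if ++$depth > $max }  # opening tag
    }
    return $max;
}

my $html = '<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>';
print max_table_depth($html), "\n";   # prints 2
```

The regex here only recognizes individual tags; the counter outside the pattern is what keeps track of the nesting.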
    Tad McClellan, Oct 21, 2004
  13. Bart Lateur

    Bart Lateur Guest

    Ah yes, that way. But no one who tries to use regexes to parse HTML
    tries to match particular tags. Instead, they try to recognize tags,
    text, comments... that sort of stuff. The rules for those aren't
    recursive.
    Bart Lateur, Oct 21, 2004