Regular Expression for HTML Tags and Special Characters

Discussion in 'Perl Misc' started by Marc Bogaard, Oct 19, 2004.

  1. Marc Bogaard

    Marc Bogaard Guest

    Hello together!

    How can I allow some HTML tags like <BR>, <B>, <P> but
    filter out < and > when they stand alone?

    Must be something like: "^[A-Za-z0-9\>\<]+$"
    for the < and >, but where do I have to put in my tags?


    thank you in advance
    marc van den Bogaard
     
    Marc Bogaard, Oct 19, 2004
    #1

  2. Marc Bogaard wrote:
    > Hello together!
    >
    > How can I allow some HTML tags like <BR>, <B>, <P> but
    > filter out < and > when they stand alone?
    >
    > Must be something like: "^[A-Za-z0-9\>\<]+$"
    > for the < and >, but where do I have to put in my tags?


    how about "<[a-zA-Z0-9]{1,2}>"?
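A quick way to see what that pattern accepts is to run it against a few sample tags. This is only a sketch: the sample list and the variable names are made up for illustration, and the pattern is anchored with \A and \z so that each sample is tested as a whole.

```perl
use strict;
use warnings;

# The suggested pattern: "<", one or two alphanumerics, ">".
my $tag = qr/<[a-zA-Z0-9]{1,2}>/;

# Hypothetical samples: the three tags from the question plus a few others.
my @samples = ('<BR>', '<B>', '<P>', '</B>', '<HR>', '<PRE>', '<');

# Record for each sample whether the whole string is matched by the pattern.
my %matched = map { $_ => ($_ =~ /\A$tag\z/ ? 1 : 0) } @samples;

printf "%-6s %s\n", $_, ($matched{$_} ? 'matches' : 'no match') for @samples;
```

Note that <HR> matches as well, so the pattern accepts any one- or two-character tag rather than only the three tags asked about, and it rejects closing tags such as </B>.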

    --
    Josef Möllers (Pinguinpfleger bei FSC)
    If failure had no penalty success would not be a prize
    -- T. Pratchett
     
    Josef Moellers, Oct 19, 2004
    #2

  3. Jon Ericson

    Jon Ericson Guest

    (Marc Bogaard) writes:

    > Hello together!


    Hallo!

    (In English, the idiom is "Hello all!" or "Hello everyone!" or "Hello
    folks!")

    > How can I allow some HTML tags like <BR>, <B>, <P> but
    > filter out < and > when they stand alone?
    >
    > Must be something like: "^[A-Za-z0-9\>\<]+$"
    > for the < and >, but where do I have to put in my tags?


    Do you mean that all of the tags will be on their own line? Do you
    want to remove the tags, everything within the tags or just < and >?
    Maybe you could show us some sample input and the expected output?

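    I modified one of the examples from HTML::Parser to do what I *think*
    you want:

    #!perl -w
    use strict;

    use HTML::Parser;

    # Tag names to keep. HTML::Parser reports tag names in lowercase.
    my %allowed = map { $_ => 1 } qw{br b p};

    HTML::Parser->new(
        # Text (and any event without its own handler) passes through unchanged.
        default_h => [sub { print shift }, 'text'],
        # Start and end tags are re-emitted only if the tag name is allowed.
        start_h   => [sub { my $tag = shift;
                            print "<$tag>" if $allowed{$tag} },
                      'tagname'],
        end_h     => [sub { my $tag = shift;
                            print "</$tag>" if $allowed{$tag} },
                      'tagname'],
    )->parse_file(shift || die) || die $!;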


    Given:

    <BR>eakfast every morning <B>efore going to work or <no> lunch for you.
    </P>aragraphs like this are </kooky>.

    It produces:

    <br>eakfast every morning <b>efore going to work or lunch for you.
    </p>aragraphs like this are .

    Jon
     
    Jon Ericson, Oct 19, 2004
    #3
  4. Bart Lateur

    Bart Lateur Guest

    Marc Bogaard wrote:

    >How can I allow some HTML tags like <BR>, <B>, <P> but
    >filter out < and > when they stand alone?


    Typically, people don't like you using regexes for this kind of task,
    because the pattern would have to be *really* complex before it works
    satisfactorily. Instead, use something involving an HTML parser module.

    I like HTML::TokeParser::Simple for that kind of task.

    <http://search.cpan.org/search?module=HTML::TokeParser::Simple>

    You loop through the input, processing one token (tag, comment, piece
    of text) at a time, act differently depending on the type of token and
    its actual contents, and can use $token->as_is to just pass it through
    unchanged (the ordinary case). You can filter out disallowed tags,
    disallowed attributes. You could probably even use it to balance the
    left over, allowed tokens.

    Here's a demo script (do at least remove the whitespace in front of the
    line containing just "*END*"):

    use HTML::TokeParser::Simple;

    my $html = <<"*END*";
    <P>Get up in the morning, slaving for bread, sir,
    <BR>so that every mouth can be fed.
    <P><B>Poor me</B>, the Israelite. <I>Aah.</I>
    <!-- this is a comment. It'll be gone. -->
    <P>There's a lone "<" in here, matched by a lone ">".
    <script language="Javascript">alert("Hello, World!")</script>
    <P>I don't like <a href="http://example.com">links</a> either,
    but will allow for <a name="foo"></a>anchors.
    *END*

    my $p = HTML::TokeParser::Simple->new(\$html);
    my %allow = map { $_ => 1 } qw(b i u br p);
    my %wipe_content = map { $_ => 1 } qw(style script);
    my %escape = ( '<' => '&lt;', '>' => '&gt;');

    while (my $t = $p->get_token) {
        if ($t->is_tag) {
            my $tag = $t->get_tag;
            if ($tag eq 'a') {
                # keep named anchors only, and close them right away
                print $t->as_is, "</a>" if defined $t->get_attr('name');
            } elsif ($allow{$tag}) {
                print $t->as_is;
            } elsif ($wipe_content{$tag}) {
                # wipe: skip everything up to the matching end tag
                while (my $t = $p->get_token) {
                    last if $t->is_end_tag($tag);
                }
            }
        } elsif ($t->is_comment) {
            # wipe
        } elsif ($t->is_text) {
            # escape lone angle brackets in text
            my $text = $t->as_is;
            $text =~ s/([<>])/$escape{$1}/g;
            print $text;
        }
    }


    Result:
    <P>Get up in the morning, slaving for bread, sir,
    <BR>so that every mouth can be fed.
    <P><B>Poor me</B>, the Israelite. <I>Aah.</I>

    <P>There's a lone "&lt;" in here, matched by a lone "&gt;".

    <P>I don't like links either,
    but will allow for <a name="foo"></a>anchors.

    --
    Bart.
     
    Bart Lateur, Oct 20, 2004
    #4
  5. Vijai Kalyan

    Vijai Kalyan Guest

    > How can I allow some HTML tags like <BR>, <B>, <P> but
    > filter out < and > when they stand alone?
    >
    > Must be something like: "^[A-Za-z0-9\>\<]+$"
    > for the < and >, but where do I have to put in my tags?


    As others said below, you should be using a parser instead of regexp
    for this, but I am just a beginner with perl and am trying to answer
    questions to get practice.

    If you really want to use a regexp, lookup an example that's in the
    first chapter of the Camel book.

    It goes something like this: (I will let u do the homework :)

    m/<(.*?)>.*?(\/\1)/

    which means,

    a. minimally match something within a < and a >

    b. minimally match anything (. matches everything but newline, so u
    might want to modify that - again, homework :)

    c. make a back reference to what was found between the first < and >.

    NOTE:

    a. This probably won't work if you have attributes so a modification
    might be:

    m/<\s*(\w+)\s+.*?>.*?(\/\1)/

    which (I think) means:

    i. Match a < followed by any number of ws chars, followed by one or
    more word chars, followed again by ws chars.

    ii. Finally any number of chars is minimally matched till again a > is
    met.

    iii. Again the back reference is used to force the same pattern (here,
    this will be the tag) to match at the end.

    As someone said, it gets complicated.
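For anyone who wants to poke at the back reference, here is a small sketch that applies the first pattern to two made-up strings. The strings and variable names are only for illustration.

```perl
use strict;
use warnings;

# The first pattern from above: capture the tag name, then require "/" plus
# the same text again via the back reference \1.
my $re = qr/<(.*?)>.*?(\/\1)/;

# A well-formed pair: the first capture is the tag name, the second is the
# slash followed by whatever the back reference matched.
my @pair = '<b>bold</b>' =~ $re;

# There are no angle brackets around "/b" here, yet the pattern still matches.
my @loose = '<b>bold /b' =~ $re;

print "pair:  @pair\n";
print "loose: @loose\n";
```

Both strings give the same two captures, "b" and "/b", which shows that the pattern as written never checks for the angle brackets of the end tag.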

    hth,
    ----
    vijai.
     
    Vijai Kalyan, Oct 20, 2004
    #5
  6. Vijai Kalyan wrote:
    >> How can I allow some HTML tags like <BR>, <B>, <P> but
    >> filter out < and > when they stand alone?
    >>
    >> Must be something like: "^[A-Za-z0-9\>\<]+$"
    >> for the < and >, but where do I have to put in my tags?

    >
    > As others said below, you should be using a parser instead of regexp
    > for this, but I am just a beginner with perl and am trying to answer
    > questions to get practice.
    >
    > If you really want to use a regexp, lookup an example that's in the
    > first chapter of the Camel book.
    >
    > It goes something like this: (I will let u do the homework :)
    >
    > m/<(.*?)>.*?(\/\1)/

    ^^^^

    Where are the <angle brackets> for the endtag?


    > m/<\s*(\w+)\s+.*?>.*?(\/\1)/

    ^^^


    It is invalid HTML if it has whitespace there.

    Will that work on the below (after taking out the \s*)?

    <foo>bar</foo>


    > As someone said, it gets complicated.



    here are some more complications to try and match correctly:

    <!-- there are no <tags></tags> on this line at all! -->

    <img src="cool.jpg" alt=">>cool pic!<<">
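To make the second complication concrete, here is a sketch comparing a naive "anything up to the next >" tag pattern with a quote-aware one on that same img tag. Both patterns are my own illustrations, not something a parser module uses, and the quote-aware one still ignores comments and other markup declarations.

```perl
use strict;
use warnings;

my $html = '<img src="cool.jpg" alt=">>cool pic!<<">';

# Naive: a tag is "<", anything that is not ">", then ">".
my ($naive) = $html =~ /(<[^>]*>)/;

# Quote-aware: inside the tag, a quoted string is consumed as one unit,
# so a ">" inside an attribute value does not end the tag.
my ($aware) = $html =~ /(<(?:[^>"']|"[^"]*"|'[^']*')*>)/;

print "naive: $naive\n";
print "aware: $aware\n";
```

The naive pattern stops at the first > inside the alt value, while the quote-aware one captures the whole tag.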


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Oct 20, 2004
    #6
  7. Tad McClellan wrote:

    > > m/<(.*?)>.*?(\/\1)/

    > ^^^^
    >
    > Where are the <angle brackets> for the endtag?


    Oops, apologies are in order. I missed them!

    >
    > > m/<\s*(\w+)\s+.*?>.*?(\/\1)/

    > ^^^
    >
    > It is invalid HTML if it has whitespace there.


    I didn't know that. In this case we would modify it to

    m/<(\w+)\s+.*?>.*?(\/\1)/

    >
    > Will that work on the below (after taking out the \s*)?
    >
    > <foo>bar</foo>


    (I am trying to answer without running the code. So if there's a
    mistake, you know whom to blame: me.)

    You are right it wouldn't. I would have to do this instead:

    m/<(\w+)*?\s+.*?>.*?<\/\1>/

    This will catch the "foo" minimally.

    > here are some more complications to try and match correctly:
    >
    > <!-- there are no <tags></tags> on this line at all! -->


    But if we had a regular expression that matched a <!-- . --> wouldn't
    that gobble up the <tags></tags> inbetween?

    Of course, I am thinking more along the lines of a Lex input
    specification where you would typically do a:

    <YYINITIAL> "<!--" { yybegin(COMMENT); }
    <COMMENT> "-->" { yybegin(YYINITIAL); }
    <COMMENT> \n { }
    <COMMENT> \r { yybegin(DOSEOL); }
    <COMMENT> . { }
    <DOSEOL> \n { }
    <DOSEOL> . { yybegin(COMMENT); }

    This is actually JLex input, but you can understand what it does. (I
    just slightly modified a JLex input for ASN.1 that I wrote. That's
    why you see the redundant state transitions on \n and \r. In ASN.1 a
    comment can span multiple lines, but each line of it has to have the
    comment starter, in this case --.)
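The same two-state idea can be sketched in plain Perl with \G and the /gc modifiers, which let a scanner keep its position between matches. This is a rough equivalent of the YYINITIAL and COMMENT states only; the DOSEOL state is left out, and the function name strip_comments is hypothetical.

```perl
use strict;
use warnings;

# Two-state scanner: copy characters in the INITIAL state, discard them in
# the COMMENT state, and switch states on "<!--" and "-->".
sub strip_comments {
    my ($input) = @_;
    my $state = 'INITIAL';
    my $out   = '';
    pos($input) = 0;
    while (pos($input) < length $input) {
        if ($state eq 'INITIAL') {
            if    ($input =~ /\G<!--/gc) { $state = 'COMMENT' }
            elsif ($input =~ /\G(.)/gcs) { $out .= $1 }
        }
        else {
            if    ($input =~ /\G-->/gc)  { $state = 'INITIAL' }
            elsif ($input =~ /\G./gcs)   { }   # swallow comment content
        }
    }
    return $out;
}

print strip_comments('a<!-- there are no <tags></tags> here -->c'), "\n";
```

Because the scanner remembers which state it is in, the <tags></tags> inside the comment are never looked at as tags.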

    >
    > <img src="cool.jpg" alt=">>cool pic!<<">


    Yes, I think the modified regexp above would get this.

    Why? The minimal matcher should stop at the first ">" in the substring
    ">>". Also, because of the intervening \s+ and .*?, the back reference
    would yield only "img". But the input itself in this case is incorrect,
    right?

    If I am wrong, do correct.

    thanx,

    hth,
    -----
    -vijai.
     
    Vijayaraghavan Kalyanapasupathy, Oct 21, 2004
    #7
  8. Vijayaraghavan Kalyanapasupathy wrote:


    The waters do get murkier. No, you are correct, I made a mistake in the

    <img src=".." alt=">>CoolPic<<">

    example you gave.

    The regexp would actually do the wrong thing because it's too simple.
    As I said, it would definitely be easier with a lexer where you can
    remember state!

    It does get exceedingly complicated. But, then I am not sure if a
    regular expression can really match all types of input. Isn't that what
    the Chomsky hierarchy is about?

    Correct me if I am wrong.

    ------
    -vijai.
     
    Vijayaraghavan Kalyanapasupathy, Oct 21, 2004
    #8
  9. Vijayaraghavan Kalyanapasupathy wrote:



    >> > m/<\s*(\w+)\s+.*?>.*?(\/\1)/

    >> ^^^
    >>
    >> It is invalid HTML if it has whitespace there.

    >
    > I didn't know that.



    Yet another reason to use an HTML module.

    The module authors know these things. :)


    > In this case we would modify it to
    >
    > m/<(\w+)\s+.*?>.*?(\/\1)/



    Let's put the endtag angle brackets in there too:

    m/<(\w+)\s+.*?>.*?<(\/\1)>/


    >> Will that work on the below (after taking out the \s*)?
    >>
    >> <foo>bar</foo>

    >
    > (I am trying to answer without running the code.



    That's what I was hoping you would do...


    > So if there's a
    > mistake, you know whom to blame: me.)
    >
    > You are right it wouldn't. I would have to do this instead:
    >
    > m/<(\w+)*?\s+.*?>.*?<\/\1>/
    >
    > This will catch the "foo" minimally.



    ... but then you should check yourself by trying it in actual code. :)

    It will fail to match at all.

    Your pattern requires at least one whitespace and <foo> does
    not contain a whitespace.
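Trying it in actual code could look something like the sketch below. The second pattern, with the whitespace-and-attributes part made optional, is only a hypothetical variant for comparison and has all the other problems discussed in this thread.

```perl
use strict;
use warnings;

# The pattern as posted: \s+ makes at least one whitespace character mandatory.
my $posted = qr/<(\w+)*?\s+.*?>.*?<\/\1>/;

# Hypothetical variant: whitespace plus attributes is optional.
my $optional = qr/<(\w+)(?:\s+[^>]*)?>.*?<\/\1>/;

my $plain = '<foo>bar</foo>';
my $attrs = '<foo class="x">bar</foo>';

# 1 means the pattern matched the string, 0 means it did not.
my %result = (
    posted_plain   => ($plain =~ $posted   ? 1 : 0),
    posted_attrs   => ($attrs =~ $posted   ? 1 : 0),
    optional_plain => ($plain =~ $optional ? 1 : 0),
    optional_attrs => ($attrs =~ $optional ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```

Only posted_plain comes out as 0: the posted pattern matches the tag with an attribute but not the plain <foo>bar</foo>.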


    >> here are some more complications to try and match correctly:
    >>
    >> <!-- there are no <tags></tags> on this line at all! -->

    >
    > But if we had a regular expression that matched a <!-- . --> wouldn't
    > that gobble up the <tags></tags> inbetween?



    Yes, but matching "comment declarations" is not that easy, yet another
    reason to use an HTML module. Here is a valid comment declaration:

    <!-- comment -- -- some more comment -- >
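As an illustration, the obvious comment pattern does not match that declaration at all, because it never contains the literal sequence "-->". The second pattern below is my own rough sketch of the shape of an SGML comment declaration ("<!", then zero or more "--...--" comments each optionally followed by whitespace, then ">"). It is an assumption for demonstration purposes, not a complete grammar.

```perl
use strict;
use warnings;

my $decl = '<!-- comment -- -- some more comment -- >';

# The obvious pattern: "<!--", anything, "-->".
my $naive = qr/<!--.*?-->/s;

# Sketch of the SGML shape: "<!", zero or more "--...--" comments with
# optional trailing whitespace, then ">".
my $sgml = qr/<!(?:--(?:[^-]|-[^-])*--\s*)*>/;

my $naive_ok = ($decl =~ $naive)       ? 1 : 0;
my $sgml_ok  = ($decl =~ /\A$sgml\z/)  ? 1 : 0;

print "naive: $naive_ok, sgml-style: $sgml_ok\n";
```

The naive pattern reports 0 and the SGML-style sketch reports 1 for this input.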


    > Of course, I am thinking more along the lines of a Lex input
    > specification where you would typically do a:



    [snip lex patterns]

    The grammar for SGML comment declarations is a good bit more complex
    than that.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Oct 21, 2004
    #9
  10. Vijayaraghavan Kalyanapasupathy wrote:

    > It does get exceedingly complicated. But, then I am not sure if a
    > regular expression can really match all types of input. Isn't that what
    > the Chomsky hierarchy is about?
    >
    > Correct me if I am wrong.



    No corrections required.

    Regular Expressions do not have the power required to parse a
    Context Free grammar, such as HTML.



    Note:

    Perl's regular expressions are no longer Regular, they are called
    that for historical reasons rather than for mathematically-correct
    reasons.
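One way to see that Perl patterns go beyond regular languages is a recursive pattern. The sketch below assumes Perl 5.10 or later, where (?1) re-enters the first capture group; that feature is newer than this thread. The table strings are made up, and the pattern only checks that bare <table> tags are balanced, nothing more.

```perl
use strict;
use warnings;
use 5.010;   # (?1) recursion needs Perl 5.10 or later

# Group 1 is a <table>...</table> pair whose content is either non-"<"
# characters or another complete group 1, nested to any depth.
my $nested = qr/\A(<table>(?:[^<]|(?1))*<\/table>)\z/;

my $balanced   = '<table>a<table>b</table>c</table>';
my $unbalanced = '<table><table></table>';

my $balanced_ok   = ($balanced   =~ $nested) ? 1 : 0;
my $unbalanced_ok = ($unbalanced =~ $nested) ? 1 : 0;

print "balanced: $balanced_ok, unbalanced: $unbalanced_ok\n";
```

A truly Regular expression cannot count nesting depth like this. That it is possible here does not make it a good idea for real HTML.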


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Oct 21, 2004
    #10
  11. Bart Lateur

    Bart Lateur Guest

    Tad McClellan wrote:

    >Regular Expressions do not have the power required to parse a
    >Context Free grammar, such as HTML.


    I doubt if HTML actually is a context free grammar. There are no
    recursive rules, AFAIK.

    --
    Bart.
     
    Bart Lateur, Oct 21, 2004
    #11
  12. Bart Lateur wrote:
    > Tad McClellan wrote:
    >
    >>Regular Expressions do not have the power required to parse a
    >>Context Free grammar, such as HTML.

    >
    > I doubt if HTML actually is a context free grammar. There are no
    > recursive rules, AFAIK.



    tables can nest arbitrarily deep:

    <table>
    <tbody>
    <tr>
    <td>
    <table>
    start all over again...


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Oct 21, 2004
    #12
  13. Bart Lateur

    Bart Lateur Guest

    Tad McClellan wrote:

    >> I doubt if HTML actually is a context free grammar. There are no
    >> recursive rules, AFAIK.

    >
    >
    >tables can nest arbitrarily deep:


    Ah yes, that way. But no-one who tries to use regexes to parse HTML,
    tries to match particular tags. Instead, they try to recognize tags,
    text, comments... that sort of stuff. The rules for those aren't
    recursive.
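A minimal sketch of that token-level approach, using only core Perl, might look like this. The function name filter_html and the whitelist are assumptions based on the tags named in the original question. The tag pattern is deliberately simple: it drops all attributes, does not wipe the content of script or style elements, and is still fooled by a ">" inside a quoted attribute value, so a parser module remains the safer choice.

```perl
use strict;
use warnings;

# Assumed whitelist, taken from the tags named in the original question.
my %allowed = map { $_ => 1 } qw(br b p);

sub filter_html {
    my ($html) = @_;
    my $out = '';
    # One token per iteration: comment, tag, run of text, or a stray "<".
    while ($html =~ /\G(?:(<!--.*?-->)|<(\/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>|([^<]+)|(<))/gcs) {
        my ($comment, $slash, $name, $text) = ($1, $2, $3, $4);
        if (defined $comment) {
            next;                                  # drop comments
        }
        elsif (defined $name) {
            # re-emit allowed tags in bare lowercase form, drop the rest
            $out .= "<$slash" . lc($name) . ">" if $allowed{lc $name};
        }
        elsif (defined $text) {
            $text =~ s/>/&gt;/g;                   # a ">" outside any tag
            $out .= $text;
        }
        else {
            $out .= '&lt;';                        # a "<" that starts no tag
        }
    }
    return $out;
}

print filter_html('<P>a < b and c > d <B>bold</B><I>x</I><!-- gone -->'), "\n";
```

Each alternative recognizes one kind of token and none of them needs recursion, which is the point being made above.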

    --
    Bart.
     
    Bart Lateur, Oct 21, 2004
    #13