HTML regex challenge

Discussion in 'Perl Misc' started by Max Metral, Jul 24, 2004.

  1. Max Metral

    Max Metral Guest

    I'm matching some ASP.NET code with some Perl regexes to do localization.
    I'm having some trouble with ASP's embedded use of <% %> and telling its
    closing %> apart from the end of the HTML tag. The thing I'm matching is like:

    <tag a=b c="d">stuff</tag>

    My reg ex is:

    <tag([^>]*?)>(.*?)</tag>

    Which works fine for the first example. But it doesn't for this:

    <tag a=b c="<%foo%>">stuff</tag>

    As expected, it stops after %>. Question is, how can I modify the
    expression to still get the whole "attribute section" in that single
    match... I've tried various back reference constructs, but they don't seem
    to do it. The expression fragment I want is "match everything except right
    bracket, unless there was a % before the right bracket"...

    Hrmph,
    --Max
    Max Metral, Jul 24, 2004
    #1
  2. Bob Walton

    Bob Walton Guest

    Max Metral wrote:

    > I'm matching some ASP.NET code with some Perl regexes to do localization.
    > I'm having some trouble with ASP's embedded use of <% %> and telling its
    > closing %> apart from the end of the HTML tag. The thing I'm matching is like:
    >
    > <tag a=b c="d">stuff</tag>
    >
    > My reg ex is:
    >
    > <tag([^>]*?)>(.*?)</tag>
    >
    > Which works fine for the first example. But it doesn't for this:
    >
    > <tag a=b c="<%foo%>">stuff</tag>
    >
    > As expected, it stops after %>. Question is, how can I modify the
    > expression to still get the whole "attribute section" in that single
    > match... I've tried various back reference constructs, but they don't seem
    > to do it. The expression fragment I want is "match everything except right
    > bracket, unless there was a % before the right bracket"...

    ....


    > --Max


    Well, there's really only one way to do it right: Parse the HTML.
    There are *bunches* of other cases that can bite you besides the one you
    found, and, in general, it is most difficult to handle them all,
    particularly in a single regexp. Actually, it is probably difficult to
    even know about them all. See:

    perldoc HTML::Parser
    perldoc -q HTML

    The latter document has a few of the possible trip-ups listed.
    --
    Bob Walton
    Email: http://bwalton.com/cgi-bin/emailbob.pl
    Bob Walton, Jul 24, 2004
    #2
  3. Tad McClellan

    Tad McClellan Guest

    Max Metral <> wrote:

    > Subject: HTML regex challenge



    Parsing arbitrary HTML with a regex is nearly impossible.

    You need a Real Parser that knows the HTML grammar.


    > The expression fragment I want is "match everything except right
    > bracket, unless there was a % before the right bracket"...



    Your problem description will not do the Right Thing for this HTML:

    <img src="cool.jpg" alt=">>Cool pic!<<">

    After you fix the regex for that case, post it here and we
    will show some other HTML that breaks it.

    Then after you fix the regex for _that_ case, post the regex
    and we'll do it again.

    Lather, rinse, repeat.

    We can keep that up longer than you can. :)
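
To make the failure concrete, here is a minimal sketch. The pattern is an assumed literal translation of "everything except a right bracket, unless a % precedes it"; nobody in the thread posted it. Run against the img example above, its capture ends at the first > inside the alt value:

```perl
use strict;
use warnings;

# Assumed pattern: any non-'>' character, or a '>' that directly follows '%'.
my $attr_re = qr/<img((?:[^>]|(?<=%)>)*)>/;

my $html = q[<img src="cool.jpg" alt=">>Cool pic!<<">];
my ($attrs) = $html =~ $attr_re;

# The attribute capture is cut off inside the quoted alt value.
print "captured: [$attrs]\n";    # captured: [ src="cool.jpg" alt="]
```

The regex has no notion of quoted attribute values, so the first > it sees is treated as the end of the tag.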


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jul 24, 2004
    #3
  4. Max Metral

    Max Metral Guest

    Understood. To argue my case only slightly more: I'm not parsing arbitrary
    HTML. I'm looking for a single tag called "localize", whose contents I then
    replace with the contents of an XML entry from a resource file. So
    there's never a case where > appears in an attribute of that tag, UNLESS
    it's inside an ASP block (<% %>). The attributes of the localize tag are
    very restricted, true/false type things, except for the fact that somebody
    may need to "bind" one of these true/false values to a function call.

    So my latest is:
    <localize((?:[^>]*%>[^>])*[^>]*)>(.*?)</localize>

    which fixes my original problem, but it's true that that won't handle

    <localize visible="<%# x > 5%>">foo</localize>

    but that seems fixable and "final", in that that's the only case that could
    occur given the allowable values of the tag...
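
One way the "fixable" case might be handled is to consume each <% ... %> block as a single unit, so that a > inside it cannot end the attribute section. This is a sketch under the constraints described above (one localize tag, > only ever inside an ASP block); the helper name parse_localize is made up for illustration:

```perl
use strict;
use warnings;

# Sketch: take each <% ... %> block whole; outside such blocks, stop at '>'.
my $localize_re = qr{<localize((?:<%.*?%>|[^>])*)>(.*?)</localize>}s;

# Hypothetical helper: returns (attributes, body), or an empty list on no match.
sub parse_localize {
    my ($src) = @_;
    return $src =~ $localize_re;
}

for my $src (
    q[<localize visible="true">foo</localize>],
    q[<localize visible="<%# x > 5%>">foo</localize>],
) {
    my ($attrs, $body) = parse_localize($src);
    print "attrs=[$attrs] body=[$body]\n";
}
```

This still breaks if a literal %> appears inside the ASP expression itself (for example within a string), and it does not handle nested localize tags.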

    The problem with most HTML parsers is that (shocker) they don't handle
    ASP.Net (which isn't HTML)... So rather than modding something big I was
    hoping to keep it simple, even if that means constraining the user of the
    tag somewhat.

    "Tad McClellan" <> wrote in message
    news:...
    > Max Metral <> wrote:
    >
    > > Subject: HTML regex challenge

    >
    >
    > Parsing arbitrary HTML with a regex is nearly impossible.
    >
    > You need a Real Parser that knows the HTML grammar.
    >
    >
    > > The expression fragment I want is "match everything except right
    > > bracket, unless there was a % before the right bracket"...

    >
    >
    > Your problem description will not do the Right Thing for this HTML:
    >
    > <img src="cool.jpg" alt=">>Cool pic!<<">
    >
    > After you fix the regex for that case, post it here and we
    > will show some other HTML that breaks it.
    >
    > Then after you fix the regex for _that_ case, post the regex
    > and we'll do it again.
    >
    > Lather, rinse, repeat.
    >
    > We can keep that up longer than you can. :)
    >
    >
    > --
    > Tad McClellan SGML consulting
    > Perl programming
    > Fort Worth, Texas
    Max Metral, Jul 25, 2004
    #4
  5. ko

    ko Guest

    Max Metral wrote:
    > Understood. To argue my case only slightly more: I'm not parsing arbitrary
    > HTML. I'm looking for a single tag called "localize", whose contents I then
    > replace with the contents of an XML entry from a resource file. So
    > there's never a case where > appears in an attribute of that tag, UNLESS
    > it's inside an ASP block (<% %>). The attributes of the localize tag are
    > very restricted, true/false type things, except for the fact that somebody
    > may need to "bind" one of these true/false values to a function call.
    >
    > So my latest is:
    > <localize((?:[^>]*%>[^>])*[^>]*)>(.*?)</localize>
    >
    > which fixes my original problem, but it's true that that won't handle
    >
    > <localize visible="<%# x > 5%>">foo</localize>
    >
    > but that seems fixable and "final", in that that's the only case that could
    > occur given the allowable values of the tag...


    Generally, you can't argue against the advice to use an HTML module to
    parse HTML.

    If you *really* want to use a regex (now that you have explained that
    you're looking for a single, *specific* instance), there are modules on
    CPAN that make this job easier, one being Regexp::Common:

    use strict;
    use warnings;
    use Regexp::Common qw(balanced);

    my $text = q[<localize visible="<%# x > 5%>">foo</localize>];

    # Copy $text, then replace the first balanced <% ... %> block in the copy.
    (my $changed = $text) =~
        s/$RE{balanced}{-begin => '<%'}{-end => '%>'}{-keep}
         /changed text/x;
    print $changed . "\n";

    Also, when replying to someone please keep the content you quote at the
    top and your reply on the bottom. Your reply to Tad is an example of
    top-posting, which is covered in the group's posting guidelines
    available here:

    http://mail.augustmail.com/~tadmc/clpmisc.shtml

    HTH - keith
    ko, Jul 25, 2004
    #5
  6. Tore Aursand

    Tore Aursand Guest

    On Sat, 24 Jul 2004 13:21:11 -0400, Max Metral wrote:
    > My reg ex is:
    >
    > <tag([^>]*?)>(.*?)</tag>
    >
    > Which works fine for the first example. But it doesn't for this:
    >
    > <tag a=b c="<%foo%>">stuff</tag>


    Hint: Think right-left, not left-right.
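
The hint is not spelled out in the thread. One possible reading, offered only as an assumption, is to let a greedy match run to the end of the input and backtrack from the right, so that the tag is closed by the last > that is followed by tag-free content and the closing tag:

```perl
use strict;
use warnings;

# Greedy (.*) first grabs everything, then gives characters back from the
# right until '>' is followed by text without '<' and then '</tag>'.
my $tag_re = qr{<tag(.*)>([^<]*)</tag>};

my $src = q[<tag a=b c="<%foo%>">stuff</tag>];
my ($attrs, $body) = $src =~ $tag_re;

print "attrs=[$attrs] body=[$body]\n";    # attrs=[ a=b c="<%foo%>"] body=[stuff]
```

This sketch assumes one tag per line and a body with no nested markup; with several tags on one line the greedy capture would span all but the last.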


    --
    Tore Aursand <>
    "Life is pleasant. Death is peaceful. It's the transition that's
    troublesome." (Isaac Asimov)
    Tore Aursand, Jul 25, 2004
    #6