any idea how to optimize this regex?

Discussion in 'Perl Misc' started by drejcicaREMOVE@volja.net, Dec 4, 2003.

  1. Guest

    Hello. I've discovered that this regex is a bottleneck:

    /(?:<!\-.*?>.*?){5}/sig

    It tries to locate as many html comments in chunks of five which can
    make for quite some possibilities in longer files. Is there a way to
    optimize this or do you consider it to be simply poor practice?

    Thanks,

    andrej

    --
    echo ${girl_name} > /etc/dumpdates
     
    , Dec 4, 2003
    #1

  2. <> wrote:
    > Hello. I've discovered that this regex is a bottleneck:
    >
    > /(?:<!\-.*?>.*?){5}/sig

                           ^
                           ^

    That "i" doesn't do anything. So why is it there?


    > It tries to locate as many html comments



    What will it do when it comes across a comment like this:

    <!-- if A > B then -->

    ??
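
    A minimal sketch of the answer, assuming only the comment part of the
    pattern is tried on its own (the variable names are illustrative): the
    lazy .*? stops at the first > it meets, so the match ends in the middle
    of the comment.

```perl
use strict;
use warnings;

my $html = '<!-- if A > B then -->';

# Comment part of the original pattern: "<!-", a lazy run, then the first ">".
my ($matched) = $html =~ /(<!\-.*?>)/s;

print "matched: $matched\n";   # prints: matched: <!-- if A >
```

    The rest of the comment, " B then -->", is left over and would be
    treated as ordinary text by the following .*? in the full pattern.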


    > in chunks of five which can
    > make for quite some possibilities in longer files. Is there a way to
    > optimize this or do you consider it to be simply poor practice?



    Attempting to use regexes to parse HTML is the poor practice.

    Use a module that understands HTML data for processing HTML data.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Dec 4, 2003
    #2

  3. wrote:
    : Hello. I've discovered that this regex is a bottleneck:

    : /(?:<!\-.*?>.*?){5}/sig

    : It tries to locate as many html comments in chunks of five which can
    : make for quite some possibilities in longer files. Is there a way to
    : optimize this or do you consider it to be simply poor practice?


    First, there are HTML parsers that may help you do whatever you want to
    do, but ignoring that for the moment...


    First off, a comment does not end with >, it ends with --> (and starts
    with <!-- so why not test for that correctly also)?

    <!--.*?-->

    If you know the comments can't have > in them, then a character class
    would be quicker than .*?

    <!--[^>]*>

    Next, I wonder why would you need to find comments in blocks of 5?

    Even if you really wish to look for blocks of 5 comments at a time, the /g
    says to do this globally, so the match is retried through the entire file
    for every possible run of 5 comments, and I suspect that is the biggest
    bottleneck.

    I suspect you don't really want /g at all.
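
    One possible alternative, sketched here as an assumption rather than as
    what the original poster needs: match single comments in one /g pass and
    do the grouping into fives in plain Perl, so the regex engine never has
    to juggle a {5} repetition. The sample data and variable names are made
    up for the illustration, and the comment pattern is the simple
    <!--.*?--> form discussed in this post.

```perl
use strict;
use warnings;

# Twelve made-up comments separated by filler text.
my $html = join ' text ', map("<!-- c$_ -->", 1 .. 12);

# One pass over the input collects every comment.
my @comments = $html =~ /<!--.*?-->/sg;

# Group them into chunks of five; the last chunk may be shorter.
my @chunks;
push @chunks, [ splice @comments, 0, 5 ] while @comments;

printf "%d chunks, last one has %d comments\n",
    scalar @chunks, scalar @{ $chunks[-1] };
```

    Unlike the {5} pattern, this keeps a trailing group of fewer than five
    comments, which may or may not be what is wanted.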

    Also, the .*? is a potential bug, because it does not _prevent_ the re
    from matching two (or more) comments at the place you intend to match a
    single comment, it simply says "match no more than is necessary to get a
    match", so the regex engine could be trying combinations of multiple
    comments in an attempt to get a {5} /g match to work.

    I'm not sure if the above _is_ a bug, but I can't say it isn't. The
    character class I mentioned is not prone to this issue as it simply can't
    match past the > , but that assumes (as I mentioned) that the comments
    never use > .
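
    A small sketch of the difference, with a made-up input string and an
    extra literal END after the comment so that the engine is forced to
    look further: the lazy version stretches across two comments, the
    character class cannot.

```perl
use strict;
use warnings;

my $html = '<!-- one --> text <!-- two -->END';

# Lazy dot: free to run past the first "-->" if that is what it takes.
my ($lazy) = $html =~ /(<!--.*?-->)END/s;

# Character class: can never cross a ">", so it stays inside one comment.
my ($class) = $html =~ /(<!--[^>]*>)END/;

print "lazy:  $lazy\n";    # prints: lazy:  <!-- one --> text <!-- two -->
print "class: $class\n";   # prints: class: <!-- two -->
```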

    Finally, /i is to ignore case, but nothing in your pattern contains a
    letter, so why specify it (though I doubt that makes a difference here)?
     
    Malcolm Dew-Jones, Dec 5, 2003
    #3
  4. Ben Morrow Guest

    (Malcolm Dew-Jones) wrote:
    > First off, a comment does not end with >, it ends with --> (and starts
    > with <!-- so why not test for that correctly also)?
    >
    > <!--.*?-->
    >
    > If you know the comments can't have > in them, then a character class
    > would be quicker than .*?
    >
    > <!--[^>]*>
    >

    <snip>
    >
    > Also, the .*? is a potential bug, because it does not _prevent_ the re
    > from matching two (or more) comments at the place you intend to match a
    > single comment, it simply says "match no more than is necessary to get a
    > match", so the regex engine could be trying combinations of multiple
    > comments in an attempt to get a {5} /g match to work.


    I'm somewhat thinking aloud here, but would

    / (?: <!-- (?: [^-] (?!->) )* --> ){5} /x

    perform the correct match here? The generalisation of [^>]: ie. 'match
    anything up to this multi-character string' is something one quite
    often wants.
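
    A quick check of a single instance of that pattern (without the {5}),
    as a sketch with made-up test strings: since [^-] never matches a
    hyphen, any comment with a hyphen in its body is rejected outright.

```perl
use strict;
use warnings;

# One instance of the pattern from this post, without the {5} wrapper.
my $comment = qr/ <!-- (?: [^-] (?!->) )* --> /x;

my $plain  = '<!-- plain text -->' =~ $comment ? 1 : 0;
my $hyphen = '<!-- a-b -->'        =~ $comment ? 1 : 0;

print "plain: $plain, hyphen: $hyphen\n";   # prints: plain: 1, hyphen: 0
```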

    Ben

    --
    Every twenty-four hours about 34k children die from the effects of poverty.
    Meanwhile, the latest estimate is that 2800 people died on 9/11, so it's like
    that image, that ghastly, grey-billowing, double-barrelled fall, repeated
    twelve times every day. Full of children. [Iain Banks]
     
    Ben Morrow, Dec 5, 2003
    #4
  5. Matt Garrish Guest

    "Malcolm Dew-Jones" <> wrote in message
    news:...
    >
    > First off, a comment does not end with >, it ends with --> (and starts
    > with <!-- so why not test for that correctly also)?
    >
    > <!--.*?-->
    >


    Html comments allow whitespace between the -- and > when you close a
    comment, so you'd have to write that as:

    <!--.*?--\s*>
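
    For illustration, a minimal sketch with a made-up comment that has
    spaces before the closing >; only the second pattern accepts it.

```perl
use strict;
use warnings;

my $html = '<!-- note --  >';

# Requires "-->" exactly, so the spaces before ">" defeat it.
my $strict = $html =~ /<!--.*?-->/s ? 1 : 0;

# Allows optional whitespace between "--" and ">".
my $loose = $html =~ /<!--.*?--\s*>/s ? 1 : 0;

print "strict: $strict, loose: $loose\n";   # prints: strict: 0, loose: 1
```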

    Matt
     
    Matt Garrish, Dec 5, 2003
    #5
  6. Ben Morrow Guest

    "Matt Garrish" <> wrote:
    >
    > "Malcolm Dew-Jones" <> wrote in message
    > news:...
    > >
    > > First off, a comment does not end with >, it ends with --> (and starts
    > > with <!-- so why not test for that correctly also)?
    > >
    > > <!--.*?-->
    > >

    >
    > Html comments allow whitespace between the -- and > when you close a
    > comment, so you'd have to write that as:
    >
    > <!--.*?--\s*>


    HTML (SGML) comments also allow whitespace after the '!', and anything
    matching /--\s*--/ to appear within the body of the comment. What
    browsers will accept is another matter... ;)

    Ben

    --
    "If a book is worth reading when you are six, *
    it is worth reading when you are sixty." - C.S.Lewis
     
    Ben Morrow, Dec 5, 2003
    #6
  7. On Thu, 04 Dec 2003 22:53:02 GMT
    wrote:

    > Hello. I've discovered that this regex is a bottleneck:
    >
    > /(?:<!\-.*?>.*?){5}/sig
    >
    > It tries to locate as many html comments in chunks of five which can
    > make for quite some possibilities in longer files. Is there a way to
    > optimize this or do you consider it to be simply poor practice?


    Poor practice :)

    Use one of the *many* HTML parsing modules that are available.
    http://search.cpan.org/

    HTH

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    Never hit a man with glasses. Hit him with a baseball bat.
     
    James Willmore, Dec 5, 2003
    #7
  8. Matt Garrish () wrote:

    : "Malcolm Dew-Jones" <> wrote in message
    : news:...
    : >
    : > First off, a comment does not end with >, it ends with --> (and starts
    : > with <!-- so why not test for that correctly also)?
    : >
    : > <!--.*?-->
    : >

    : Html comments allow whitespace between the -- and > when you close a
    : comment, so you'd have to write that as:

    : <!--.*?--\s*>

    Ah yes, and that is exactly why one should use the HTML parsing modules
    if at all possible.

    (I was looking at my xml book. Xml comments have a more rigid comment
    format, if I understand it correctly.)
     
    Malcolm Dew-Jones, Dec 5, 2003
    #8
  9. Matt Garrish Guest

    "Ben Morrow" <> wrote in message
    news:bqoq0c$gms$...
    >
    > "Matt Garrish" <> wrote:
    > >
    > > "Malcolm Dew-Jones" <> wrote in message
    > > news:...
    > > >
    > > > First off, a comment does not end with >, it ends with --> (and starts
    > > > with <!-- so why not test for that correctly also)?
    > > >
    > > > <!--.*?-->
    > > >

    > >
    > > Html comments allow whitespace between the -- and > when you close a
    > > comment, so you'd have to write that as:
    > >
    > > <!--.*?--\s*>

    >
    > HTML (SGML) comments also allow whitespace after the '!', and anything
    > matching /--\s*--/ to appear within the body of the comment. What
    > browsers will accept is another matter... ;)
    >


    I thought no whitespace at the start of a comment was one of the few things
    that html did enforce? It obviously would never fly in sgml if that was the
    only way to comment out text (or would make for interesting dtds). Then
    again, what standards do any browsers adhere to? : )

    Matt
     
    Matt Garrish, Dec 5, 2003
    #9
  10. SGML/HTML syntax trivia (was Re: any idea how to optimize this regex?)

    Matt Garrish <> wrote:
    > "Ben Morrow" <> wrote in message
    > news:bqoq0c$gms$...
    >> "Matt Garrish" <> wrote:



    >> > Html comments allow whitespace between the -- and > when you close a
    >> > comment, so you'd have to write that as:
    >> >
    >> > <!--.*?--\s*>

    >>
    >> HTML (SGML) comments also allow whitespace after the '!', and anything



    I believe that you are mistaken with that part.


    >> matching /--\s*--/ to appear within the body of the comment. What
    >> browsers will accept is another matter... ;)



    But that part is true enough.


    > I thought no whitespace at the start of a comment was one of the few things
    > that html did enforce?



    You thought correctly. The grammar[1], reformatted, is:


    comment declaration =
        MDO,
        ( comment,
          ( s |
            comment
          )*
        )?
        MDC

    comment =
        COM
        SGML character*
        COM

    Where:

        MDO  (<!)  Markup Declaration Open
        MDC  (>)   Markup Declaration Close
        COM  (--)  Comment Delimiter
        s          Separator ( roughly /\s/ )


    So, if you have any "comment"s in the "comment declaration",
    then there must be no spaces before that first one.

    Note also that <!> is a "comment declaration" as well.


    This is but one of the "strange corners" of SGML syntax. There
    are several dozen of these. Your choices are:

    1. Research the bazillion syntax oddities and code for
    *all of them* in your program.

    or

    2. Use a module.




    [1] "The SGML Handbook" Charles Goldfarb, p391

    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Dec 5, 2003
    #10
  11. Ben Morrow Guest

    Re: SGML/HTML syntax trivia (was Re: any idea how to optimize this regex?)

    wrote:
    > > I thought no whitespace at the start of a comment was one of the few things
    > > that html did enforce?

    >
    > You thought correctly. The grammar[1], reformatted, is:
    >

    <snip>
    >
    > So, if you have any "comment"s in the "comment declaration",
    > then there must be no spaces before that first one.
    >
    > Note also that <!> is a "comment declaration" as well.


    Bleech. SGML syntax is too obscure for words :).

    > 2. Use a module.


    I couldn't agree more...

    Ben

    --
    Joy and Woe are woven fine,
    A Clothing for the Soul divine William Blake
    Under every grief and pine 'Auguries of Innocence'
    Runs a joy with silken twine.
     
    Ben Morrow, Dec 5, 2003
    #11
  12. Re: SGML/HTML syntax trivia (was Re: any idea how to optimize this regex?)

    On Fri, 5 Dec 2003, Tad McClellan wrote:

    > Note also that <!> is a "comment declaration" as well.


    And, currently, a sure-fire indicator of spam in HTML-formatted emails
    - they evidently intend it to disrupt content scanners. It won't
    last, of course - as soon as they realise that we're rating it for
    rejection, rather than letting ourselves be fooled by the obfuscation.

    > 2. Use a module.


    And hope the module author has read the book too ;-)
     
    Alan J. Flavell, Dec 5, 2003
    #12
  13. On Fri, 5 Dec 2003, Matt Garrish wrote:

    > I thought no whitespace at the start of a comment was one of the few things
    > that html did enforce?


    This is wildly off-topic, but one of the things I had to learn about
    HTML is that even the W3C HTML specs contain self-contradictions.

    While claiming in one place to be an application of SGML, there are
    other places where particular SGML constructs are prohibited which the
    SGML specification does not allow to be prohibited.

    > Then again, what standards do any browsers adhere to? : )


    That's another matter entirely. And just don't start me on Appendix C
    to XHTML/1.0
     
    Alan J. Flavell, Dec 5, 2003
    #13
  14. Re: SGML/HTML syntax trivia (was Re: any idea how to optimize this regex?)

    Alan J. Flavell <> wrote:

    > On Fri, 5 Dec 2003, Tad McClellan wrote:
    >
    >> Note also that <!> is a "comment declaration" as well.

    >
    > And, currently, a sure-fire indicator of spam in HTML-formatted
    > emails - they evidently intend it to disrupt content scanners. It
    > won't last, of course - as soon as they realise that we're rating
    > it for rejection, rather than letting ourselves be fooled by the
    > obfuscation.


    With that in mind, here's an infinite loop:

    while ($spammer < $pond_scum) {
        $spammer++;
    }
     
    David K. Wall, Dec 5, 2003
    #14
  15. Matt Garrish Guest

    "Alan J. Flavell" <> wrote in message
    news:p...
    > On Fri, 5 Dec 2003, Matt Garrish wrote:
    >
    > > Then again, what standards do any browsers adhere to? : )

    >
    > That's another matter entirely. And just don't start me on Appendix C
    > to XHTML/1.0
    >


    Oh come on, at least they recognize that xhtml is never going to fly! The
    whole popularity of the web lies in the ability of Joe Blow web designer
    wannabe to toss whatever tags and styles he wants on a page and see a nice
    pretty page pop up in his favourite M$ browser (okay a bit of an
    exaggeration). Much as I'd prefer to see sgml-like tagging requirements
    enforced in html, xhtml is a pipe-dream. XML will obviously survive, but
    getting the riff-raff to adhere to coding standards I just can't see
    happening (unless the wysiwyg editors get onboard). Which is sad in the end,
    because it means there will always be new people asking how to parse html
    with a regular expression...

    Matt
     
    Matt Garrish, Dec 6, 2003
    #15
  16. Ben Morrow Guest

    wrote:
    > Ben Morrow () wrote on MMMDCCXLVIII September MCMXCIII
    > in <URL:news:bqol7f$bmo$>:
    > **
    > ** I'm somewhat thinking aloud here, but would
    > **
    > ** / (?: <!-- (?: [^-] (?!->) )* --> ){5} /x
    > **
    > ** perform the correct match here? The generalisation of [^>]: ie. 'match
    > ** anything up to this multi-character string' is something one quite
    > ** often wants.
    >
    > That fails on:
    >
    > <!--> Comment one <!----> Comment two <!-->


    Ouch! I had to think *quite* hard to convince myself that '<!-->
    Comment one <!---->' isn't a valid comment... it's obvious once you
    see the symmetry of it, of course.

    OK, making a small modification and applying the grammar fragment Tad
    gave:

    my $com = qr/-- (?: (?!--) . )* --/x;
    /<! (?: $com (?: \s* | $com )* )? >/x;

    or in Perl6, just to show how much nicer it is when you get rid of all
    those bleedin' question-marks:

    my $com = rx/-- [ <!before --> . ]* --/;
    m{<'<!'> [ <$com> [ \s* | <$com> ]* ]? <'>'>};

    :).

    This problem has been solved before, of course, so this is purely an
    exercise in (not-so) Tricky Regexes on my part. Apologies if I've
    bored anyone.
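
    For anyone who wants to poke at the Perl 5 version, here is a small
    sketch that runs it over the strings mentioned in this thread. The two
    patterns are copied from above; the helper name is_comment_decl and the
    choice to anchor the match to the whole string are assumptions made for
    the test, and since there is no /s flag a comment containing a newline
    would not match as written.

```perl
use strict;
use warnings;

# The two patterns from this post.
my $com  = qr/-- (?: (?!--) . )* --/x;
my $decl = qr/<! (?: $com (?: \s* | $com )* )? >/x;

# Hypothetical helper: true if the whole string is one comment declaration.
sub is_comment_decl {
    my ($text) = @_;
    return $text =~ /\A$decl\z/ ? 1 : 0;
}

my @samples = (
    '<!-- if A > B then -->',                        # ">" inside the comment
    '<!>',                                           # empty declaration
    '<!-- a -- -- b -- >',                           # two comments, spaces
    '<! -- a -->',                                   # space after "<!"
    '<!--> Comment one <!----> Comment two <!-->',   # the earlier counter-example
);

printf "%-46s %s\n", $_, is_comment_decl($_) ? 'match' : 'no match'
    for @samples;
```

    Every sample except the one with a space after <! is accepted as a
    single declaration, which agrees with the grammar Tad quoted.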

    Ben

    --
    Heracles: Vulture! Here's a titbit for you / A few dried molecules of the gall
    From the liver of a friend of yours. / Excuse the arrow but I have no spoon.
    (Ted Hughes, [ Heracles shoots Vulture with arrow. Vulture bursts into ]
    /Alcestis/) [ flame, and falls out of sight. ]
     
    Ben Morrow, Dec 6, 2003
    #16