any idea how to optimize this regex?

drejcicaREMOVE · Dec 4, 2003

Hello. I've discovered that this regex is a bottleneck:

/(?:<!\-.*?>.*?){5}/sig

It tries to locate as many html comments in chunks of five which can
make for quite some possibilities in longer files. Is there a way to
optimize this or do you consider it to be simply poor practice?

Thanks,

andrej

Tad McClellan · Dec 4, 2003

Hello. I've discovered that this regex is a bottleneck:

/(?:<!\-.*?>.*?){5}/sig

                     ^
                     ^

That "i" doesn't do anything. So why is it there?

It tries to locate as many html comments

What will it do when it comes across a comment like this:



??

in chunks of five which can
make for quite some possibilities in longer files. Is there a way to
optimize this or do you consider it to be simply poor practice?

Attempting to use regexes to parse HTML is the poor practice.

Use a module that understands HTML data for processing HTML data.

Malcolm Dew-Jones · Dec 5, 2003

(e-mail address removed) wrote:
: Hello. I've discovered that this regex is a bottleneck:

: /(?:<!\-.*?>.*?){5}/sig

: It tries to locate as many html comments in chunks of five which can
: make for quite some possibilities in longer files. Is there a way to
: optimize this or do you consider it to be simply poor practice?

First, there are HTML parsers that may help do whatever you want to do,
but ignoring that for the moment...

First off, a comment does not end with >, it ends with --> (and starts
with <!--).

If you know the comments can't have > in them, then a character class
would be quicker than .*?

<!--[^>]*>

Next, I wonder why you would need to find comments in blocks of 5?

Even if you really wish to look for blocks of 5 comments at a time, the /g
says to do this globally, so it looks through the entire file for all
possible combinations of 5 blocks (I didn't say that correctly) and I
suspect that is the biggest bottleneck.

I suspect you don't really want /g at all.
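One way to sidestep the {5} repetition altogether is to collect single comments in one pass and do the grouping in plain Perl. The following is only a sketch under assumptions: the input string is invented, the simple <!--.*?--> form is assumed to be good enough for the data, and unlike the original regex it also returns a final chunk with fewer than five comments.

```perl
use strict;
use warnings;

# Hypothetical input: seven simple comments separated by filler text.
my $html = join ' text ', map("<!-- c$_ -->", 1 .. 7);

# One pass over the string collects each comment individually.
my @comments = $html =~ /(<!--.*?-->)/gs;

# Group the comments in fives outside the regex.
my @chunks;
push @chunks, [ splice(@comments, 0, 5) ] while @comments;

printf "%d chunk(s), sizes: %s\n",
    scalar @chunks, join(',', map { scalar @$_ } @chunks);
```

Here the regex engine never has to retry combinations of five comments; the chunking is ordinary list handling.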

Also, the .*? is a potential bug, because it does not _prevent_ the re
from matching two (or more) comments at the place you intend to match a
single comment, it simply says "match no more than is necessary to get a
match", so the regex engine could be trying combinations of multiple
comments in an attempt to get a {5} /g match to work.

I'm not sure if the above _is_ a bug, but I can't say it isn't. The
character class I mentioned is not prone to this issue as it simply can't
match past the > , but that assumes (as I mentioned) that the comments
never use > .
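The difference between .*? and the character class can be seen on a small input. This is a minimal sketch with an invented test string; the trailing X is only there to force the engine to look past the first > it finds.

```perl
use strict;
use warnings;

# Two comments; only the second one is followed by "X".
my $html = '<!-- a --> foo <!-- b -->X';

# Lazy dot: starts at the first comment and stretches across both
# until ">X" is finally found.
my ($lazy) = $html =~ /(<!--.*?>X)/s;

# Character class: cannot cross a ">", so the attempt at the first
# comment fails and only the second comment matches.
my ($class) = $html =~ /(<!--[^>]*>X)/;

print "lazy:  $lazy\n";
print "class: $class\n";
```

The lazy version returns the whole string, while the character class version returns only the second comment plus the X.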

Finally, /i is to ignore case, but nothing you look for uses case, so why
specify it (though I doubt that makes a difference here).

Ben Morrow · Dec 5, 2003

First off, a comment does not end with >, it ends with --> (and starts
with <!--).

If you know the comments can't have > in them, then a character class
would be quicker than .*?

<!--[^>]*>

Also, the .*? is a potential bug, because it does not _prevent_ the re
from matching two (or more) comments at the place you intend to match a
single comment, it simply says "match no more than is necessary to get a
match", so the regex engine could be trying combinations of multiple
comments in an attempt to get a {5} /g match to work.

I'm somewhat thinking aloud here, but would

/ (?:  ){5} /x

perform the correct match here? The generalisation of [^>]: ie. 'match
anything up to this multi-character string' is something one quite
often wants.
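The usual way to write "anything up to this multi-character string" in Perl 5 is a negative lookahead in front of each character. The sketch below is an illustration under assumptions, not necessarily the exact pattern intended above: the variable name and the test string are made up, and the SGML corner cases are ignored.

```perl
use strict;
use warnings;

# "Any run of characters that does not contain the string -->".
# Before consuming each character, the lookahead checks that the
# terminator does not start at this position.
my $up_to_close = qr/(?: (?!-->) . )*/xs;

# Invented input: the first comment contains a bare ">".
my $html = '<!-- one > still one --> text <!-- two -->';

my @comments = $html =~ /(<!-- $up_to_close -->)/gx;

print "$_\n" for @comments;
```

Unlike [^>]*, this accepts a > inside the comment, and unlike .*? it can never run past the first -->.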

Ben

Matt Garrish · Dec 5, 2003

Malcolm Dew-Jones said:
First off, a comment does not end with >, it ends with --> (and starts
with <!--).

Html comments allow whitespace between the -- and > when you close a
comment, so you'd have to write that as:

<!--.*?--\s*>
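A quick check of the two closing forms, as a minimal sketch with an invented input whose comment closes with a newline and a space between the -- and the >:

```perl
use strict;
use warnings;

# Invented input: whitespace sits between "--" and ">".
my $html = "<p>a</p><!-- note --\n ><p>b</p>";

# Count matches with the "countof" idiom: list assignment in scalar context.
my $strict = () = $html =~ /<!--.*?-->/gs;
my $loose  = () = $html =~ /<!--.*?--\s*>/gs;

print "strict close finds $strict comment(s)\n";
print "loose close finds $loose comment(s)\n";
```

The pattern that insists on a literal --> finds nothing here, while the version with \s* finds the comment.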

Matt

Ben Morrow · Dec 5, 2003

Matt Garrish said:
Html comments allow whitespace between the -- and > when you close a
comment, so you'd have to write that as:

<!--.*?--\s*>

HTML (SGML) comments also allow whitespace after the '!', and anything
matching /--\s*--/ to appear within the body of the comment. What
browsers will accept is another matter...

Ben

James Willmore · Dec 5, 2003

On Thu, 04 Dec 2003 22:53:02 GMT

Hello. I've discovered that this regex is a bottleneck:

/(?:<!\-.*?>.*?){5}/sig

It tries to locate as many html comments in chunks of five which can
make for quite some possibilities in longer files. Is there a way to
optimize this or do you consider it to be simply poor practice?

Poor practice

Use one of the *many* HTML parsing modules that are available.
http://search.cpan.org/

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.


Malcolm Dew-Jones · Dec 5, 2003

Matt Garrish wrote:

: > First off, a comment does not end with >, it ends with --> (and starts
: > with <!--).

: Html comments allow whitespace between the -- and > when you close a
: comment, so you'd have to write that as:

: <!--.*?--\s*>

Ah yes, and that is exactly why one should use the HTML parsing modules if
at all possible.

(I was looking at my xml book. Xml comments have a more rigid comment
format, if I understand it correctly.)

Matt Garrish · Dec 5, 2003

Ben Morrow said:
HTML (SGML) comments also allow whitespace after the '!', and anything
matching /--\s*--/ to appear within the body of the comment. What
browsers will accept is another matter...

I thought no whitespace at the start of a comment was one of the few things
that html did enforce? It obviously would never fly in sgml if that was the
only way to comment out text (or would make for interesting dtds). Then
again, what standards do any browsers adhere to? : )

Matt

Tad McClellan · Dec 5, 2003

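HTML (SGML) comments also allow whitespace after the '!',

I believe that you are mistaken with that part.

and anything matching /--\s*--/ to appear within the body of the comment.

But that part is true enough.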

I thought no whitespace at the start of a comment was one of the few things
that html did enforce?

You thought correctly. The grammar[1], reformatted, is:

comment declaration =
    MDO,
    ( comment,
      ( s |
        comment
      )*
    )?
    MDC

comment =
    COM
    SGML character*
    COM

Where:

    MDO (<!)   Markup Declaration Open
    MDC (>)    Markup Declaration Close
    COM (--)   Comment Delimiter
    s          Separator ( roughly /\s/ )

So, if you have any "comment"s in the "comment declaration",
then there must be no spaces before that first one.

Note also that <!> is a "comment declaration" as well.

This is but one of the "strange corners" of SGML syntax. There
are several dozen of these. Your choices are:

1. Research the bazillion syntax oddities and code for
*all of them* in your program.

or

2. Use a module.

[1] "The SGML Handbook" Charles Goldfarb, p391

Ben Morrow · Dec 5, 2003

I thought no whitespace at the start of a comment was one of the few things
that html did enforce?


You thought correctly. The grammar[1], reformatted, is:

So, if you have any "comment"s in the "comment declaration",
then there must be no spaces before that first one.

Note also that <!> is a "comment declaration" as well.

Bleech. SGML syntax is too obscure for words.

2. Use a module.

I couldn't agree more...

Ben

Alan J. Flavell · Dec 5, 2003

Note also that <!> is a "comment declaration" as well.

And, currently, a sure-fire indicator of spam in HTML-formatted emails
- they evidently intend it to disrupt content scanners. It won't
last, of course - as soon as they realise that we're rating it for
rejection, rather than letting ourselves be fooled by the obfuscation.

2. Use a module.

And hope the module author has read the book too ;-)

Alan J. Flavell · Dec 5, 2003

I thought no whitespace at the start of a comment was one of the few things
that html did enforce?

This is wildly off-topic, but one of the things I had to learn about
HTML is that even the W3C HTML specs contain self-contradictions.

While claiming in one place to be an application of SGML, there are
other places where particular SGML constructs are prohibited which the
SGML specification does not allow to be prohibited.

Then again, what standards do any browsers adhere to? : )

That's another matter entirely. And just don't start me on Appendix C
to XHTML/1.0

David K. Wall · Dec 5, 2003

Alan J. Flavell said:
And, currently, a sure-fire indicator of spam in HTML-formatted
emails - they evidently intend it to disrupt content scanners. It
won't last, of course - as soon as they realise that we're rating
it for rejection, rather than letting ourselves be fooled by the
obfuscation.

With that in mind, here's an infinite loop:

while ($spammer < $pond_scum) {
    $spammer++;
}

Matt Garrish · Dec 6, 2003

Alan J. Flavell said:
That's another matter entirely. And just don't start me on Appendix C
to XHTML/1.0

Oh come on, at least they recognize that xhtml is never going to fly! The
whole popularity of the web lies in the ability of Joe Blow web designer
wannabe to toss whatever tags and styles he wants on a page and see a nice
pretty page pop up in his favourite M$ browser (okay a bit of an
exaggeration). Much as I'd prefer to see sgml-like tagging requirements
enforced in html, xhtml is a pipe-dream. XML will obviously survive, but
getting the riff-raff to adhere to coding standards I just can't see
happening (unless the wysiwyg editors get onboard). Which is sad in the end,
because it means there will always be new people asking how to parse html
with a regular expression...

Matt

Ben Morrow · Dec 6, 2003

Ben Morrow wrote on MMMDCCXLVIII September MCMXCIII
in <URL:**
** I'm somewhat thinking aloud here, but would
**
** / (?:  ){5} /x
**
** perform the correct match here? The generalisation of [^>]: ie. 'match
** anything up to this multi-character string' is something one quite
** often wants.

That fails on:

 Comment two <!-->

Ouch! I had to think *quite* hard to convince myself that '' isn't a valid comment... it's obvious once you
see the symmetry of it, of course.

OK, making a small modification and applying the grammar fragment Tad
gave:

my $com = qr/-- (?: (?!--) . )* --/x;
/<! (?: $com (?: \s* | $com )* )? >/x;

or in Perl6, just to show how much nicer it is when you get rid of all
those bleedin' question-marks:

my $com = rx/-- [ <!before --> . ]* --/;
m{<'<!'> [ <$com> [ \s* | <$com> ]* ]? <'>'>};
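The Perl 5 pattern can be tried against a handful of declarations. What follows is a rough sketch under assumptions: the /s modifier is added so that . also matches newlines, the match is anchored to the whole string, and the test strings are invented for illustration.

```perl
use strict;
use warnings;

# The Perl 5 pattern, with /s added (an assumption) so that "." also
# matches newlines inside a comment.
my $com  = qr/-- (?: (?!--) . )* --/xs;
my $decl = qr/<! (?: $com (?: \s* | $com )* )? >/x;

# Illustrative inputs, not taken from the thread.
my @cases = (
    '<!>',                          # empty comment declaration
    '<!-- a -->',                   # ordinary comment
    '<!-- a -- >',                  # whitespace before the closing >
    '<!-- a -- -- b -->',           # two comments in one declaration
    "<!-- line one\nline two -->",  # comment spanning two lines
    '<! -- a -->',                  # whitespace after <! is not allowed
    '<!-->',                        # comment opened but never closed
);

my %accepted;
for my $text (@cases) {
    # Anchor at both ends so the whole string must be one declaration.
    $accepted{$text} = $text =~ /\A $decl \z/x ? 1 : 0;
    (my $shown = $text) =~ s/\n/\\n/g;
    printf "%-30s %s\n", $shown, $accepted{$text} ? 'match' : 'no match';
}
```

The first five strings are accepted and the last two are rejected, which agrees with the grammar fragment: no separator may come before the first comment, and every comment needs both of its -- delimiters.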


This problem has been solved before, of course, so this is purely an
exercise in (not-so) Tricky Regexes on my part. Apologies if I've
bored anyone.

Ben
