Any idea how to optimize this regex?

drejcicaREMOVE

Hello. I've discovered that this regex is a bottleneck:

/(?:<!\-.*?>.*?){5}/sig

It tries to locate as many html comments in chunks of five which can
make for quite some possibilities in longer files. Is there a way to
optimize this or do you consider it to be simply poor practice?

Thanks,

andrej
 
Tad McClellan

> Hello. I've discovered that this regex is a bottleneck:
>
> /(?:<!\-.*?>.*?){5}/sig
                       ^
                       ^

That "i" doesn't do anything. So why is it there?

> It tries to locate as many html comments


What will it do when it comes across a comment like this:

<!-- if A > B then -->

??
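
A quick sketch of what happens with that comment. The sample string is the one above plus some trailing text, and the variable names are only for illustration. The comment-matching part of the original regex stops at the first >, whereas a pattern that insists on the full --> terminator runs to the real end of the comment:

```perl
use strict;
use warnings;

my $text = '<!-- if A > B then --> trailing text';

# The comment-matching part of the original regex: stops at the first '>'.
my ($loose) = $text =~ /(<!\-.*?>)/s;

# Requiring the full '-->' terminator instead.
my ($strict) = $text =~ /(<!--.*?-->)/s;

print "original fragment matched: $loose\n";
print "with --> terminator:       $strict\n";
```

Neither pattern handles the stranger corners of comment syntax, so this only illustrates the problem; it is not a fix.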

> in chunks of five which can
> make for quite some possibilities in longer files. Is there a way to
> optimize this or do you consider it to be simply poor practice?


Attempting to use regexes to parse HTML is the poor practice.

Use a module that understands HTML data for processing HTML data.
 
Malcolm Dew-Jones

(e-mail address removed) wrote:
: Hello. I've discovered that this regex is a bottleneck:

: /(?:<!\-.*?>.*?){5}/sig

: It tries to locate as many html comments in chunks of five which can
: make for quite some possibilities in longer files. Is there a way to
: optimize this or do you consider it to be simply poor practice?


First, there are HTML parsers that may help you do whatever you want to do,
but ignoring that for the moment...


First off, a comment does not end with >, it ends with --> (and starts
with <!-- so why not test for that correctly also)?

<!--.*?-->

If you know the comments can't have > in them, then a character class
would be quicker than .*?

<!--[^>]*>

Next, I wonder why would you need to find comments in blocks of 5?

Even if you really wish to look for blocks of 5 comments at a time, the /g
says to do this globally, so it keeps scanning through the entire file for
one block of 5 after another (I didn't say that quite correctly), and I
suspect that is the biggest bottleneck.

I suspect you don't really want /g at all.

Also, the .*? is a potential bug, because it does not _prevent_ the re
from matching two (or more) comments at the place you intend to match a
single comment, it simply says "match no more than is necessary to get a
match", so the regex engine could be trying combinations of multiple
comments in an attempt to get a {5} /g match to work.

I'm not sure if the above _is_ a bug, but I can't say it isn't. The
character class I mentioned is not prone to this issue as it simply can't
match past the > , but that assumes (as I mentioned) that the comments
never use > .

Finally, /i is to ignore case, but nothing you look for uses case, so why
specify it (though I doubt that makes a difference here).
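
A minimal sketch of one way to act on the points above, assuming all that is needed is the comments in groups of five: match one comment at a time and do the grouping in ordinary Perl, so the engine never has to backtrack across a {5} repetition. The sample data and variable names are made up, and the pattern still treats --> as the only terminator.

```perl
use strict;
use warnings;

# Hypothetical sample input; a real program would read the file instead.
my $html = join "\n", map { "<p>para $_</p><!-- note $_ -->" } 1 .. 12;

# Match one comment at a time; /s lets . cross newlines inside a comment.
my @comments = $html =~ /<!--.*?-->/sg;

# Group the matches into chunks of (at most) five in plain Perl.
my @chunks;
push @chunks, [ splice @comments, 0, 5 ] while @comments;

printf "%d chunks, last one holds %d comments\n",
    scalar @chunks, scalar @{ $chunks[-1] };
```

With twelve comments this gives two full chunks and one short one; whether a short trailing chunk should be kept or dropped depends on what the original code needed.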
 
Ben Morrow

> First off, a comment does not end with >, it ends with --> (and starts
> with <!-- so why not test for that correctly also)?
>
> <!--.*?-->
>
> If you know the comments can't have > in them, then a character class
> would be quicker than .*?
>
> <!--[^>]*>
>
> Also, the .*? is a potential bug, because it does not _prevent_ the re
> from matching two (or more) comments at the place you intend to match a
> single comment, it simply says "match no more than is necessary to get a
> match", so the regex engine could be trying combinations of multiple
> comments in an attempt to get a {5} /g match to work.

I'm somewhat thinking aloud here, but would

/ (?: <!-- (?: [^-] (?!->) )* --> ){5} /x

perform the correct match here? The generalisation of [^>]: ie. 'match
anything up to this multi-character string' is something one quite
often wants.

Ben
 
Matt Garrish

Malcolm Dew-Jones said:
> First off, a comment does not end with >, it ends with --> (and starts
> with <!-- so why not test for that correctly also)?
>
> <!--.*?-->

Html comments allow whitespace between the -- and > when you close a
comment, so you'd have to write that as:

<!--.*?--\s*>
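
A small sketch comparing the two patterns on made-up samples (the strings and variable names are illustrative only). Each pattern is anchored so that it has to cover the whole sample:

```perl
use strict;
use warnings;

# Three hypothetical comments: no gap, spaces, and a newline before '>'.
my @samples = ( '<!-- a -->', '<!-- b --  >', "<!-- c --\n>" );

my ( @strict, @relaxed );
for my $s (@samples) {
    # Terminator must be exactly '-->'.
    push @strict, $s =~ /\A<!--.*?-->\z/s ? 1 : 0;

    # Terminator may have whitespace between '--' and '>'.
    push @relaxed, $s =~ /\A<!--.*?--\s*>\z/s ? 1 : 0;
}

print "strict:  @strict\n";
print "relaxed: @relaxed\n";
```

The stricter pattern accepts only the first sample, while the relaxed one accepts all three.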

Matt
 
Ben Morrow

Matt Garrish said:
> Html comments allow whitespace between the -- and > when you close a
> comment, so you'd have to write that as:
>
> <!--.*?--\s*>

HTML (SGML) comments also allow whitespace after the '!', and anything
matching /--\s*--/ to appear within the body of the comment. What
browsers will accept is another matter... ;)

Ben
 
James Willmore

On Thu, 04 Dec 2003 22:53:02 GMT
> Hello. I've discovered that this regex is a bottleneck:
>
> /(?:<!\-.*?>.*?){5}/sig
>
> It tries to locate as many html comments in chunks of five which can
> make for quite some possibilities in longer files. Is there a way to
> optimize this or do you consider it to be simply poor practice?

Poor practice :)

Use one of the *many* HTML parsing modules that are available.
http://search.cpan.org/

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
Never hit a man with glasses. Hit him with a baseball bat.
 
Malcolm Dew-Jones

Matt Garrish (e-mail address removed) wrote:

: : >
: > First off, a comment does not end with >, it ends with --> (and starts
: > with <!-- so why not test for that correctly also)?
: >
: > <!--.*?-->
: >

: Html comments allow whitespace between the -- and > when you close a
: comment, so you'd have to write that as:

: <!--.*?--\s*>

Ah yes, and exactly why one should use the html parsing modules if
at all possible,

(I was looking at my xml book. Xml comments have a more rigid comment
format, if I understand it correctly.)
 
Matt Garrish

Ben Morrow said:
> HTML (SGML) comments also allow whitespace after the '!', and anything
> matching /--\s*--/ to appear within the body of the comment. What
> browsers will accept is another matter... ;)

I thought no whitespace at the start of a comment was one of the few things
that html did enforce? It obviously would never fly in sgml if that was the
only way to comment out text (or would make for interesting dtds). Then
again, what standards do any browsers adhere to? : )

Matt
 
Tad McClellan

I believe that you are mistaken with that part.



But that part is true enough.

> I thought no whitespace at the start of a comment was one of the few things
> that html did enforce?


You thought correctly. The grammar[1], reformatted, is:


comment declaration =
    MDO,
    ( comment,
      ( s |
        comment
      )*
    )?
    MDC

comment =
    COM
    SGML character*
    COM

Where:

    MDO  (<!)  Markup Declaration Open
    MDC  (>)   Markup Declaration Close
    COM  (--)  Comment Delimiter
    s          Separator ( roughly /\s/ )


So, if you have any "comment"s in the "comment declaration",
then there must be no spaces before that first one.

Note also that <!> is a "comment declaration" as well.


This is but one of the "strange corners" of SGML syntax. There
are several dozen of these. Your choices are:

1. Research the bazillion syntax oddities and code for
*all of them* in your program.

or

2. Use a module.




[1] "The SGML Handbook" Charles Goldfarb, p391
 
Ben Morrow

> > I thought no whitespace at the start of a comment was one of the few things
> > that html did enforce?
>
> You thought correctly. The grammar[1], reformatted, is:
> So, if you have any "comment"s in the "comment declaration",
> then there must be no spaces before that first one.
>
> Note also that <!> is a "comment declaration" as well.

Bleech. SGML syntax is too obscure for words :).

> 2. Use a module.

I couldn't agree more...

Ben
 
Alan J. Flavell

> Note also that <!> is a "comment declaration" as well.

And, currently, a sure-fire indicator of spam in HTML-formatted emails
- they evidently intend it to disrupt content scanners. It won't
last, of course - as soon as they realise that we're rating it for
rejection, rather than letting ourselves be fooled by the obfuscation.

> 2. Use a module.

And hope the module author has read the book too ;-)
 
Alan J. Flavell

> I thought no whitespace at the start of a comment was one of the few things
> that html did enforce?

This is wildly off-topic, but one of the things I had to learn about
HTML is that even the W3C HTML specs contain self-contradictions.

While claiming in one place to be an application of SGML, there are
other places where particular SGML constructs are prohibited which the
SGML specification does not allow to be prohibited.

> Then again, what standards do any browsers adhere to? : )

That's another matter entirely. And just don't start me on Appendix C
to XHTML/1.0
 
David K. Wall

Alan J. Flavell said:
> And, currently, a sure-fire indicator of spam in HTML-formatted
> emails - they evidently intend it to disrupt content scanners. It
> won't last, of course - as soon as they realise that we're rating
> it for rejection, rather than letting ourselves be fooled by the
> obfuscation.

With that in mind, here's an infinite loop:

while ($spammer < $pond_scum) {
    $spammer++;
}
 
Matt Garrish

Alan J. Flavell said:
> That's another matter entirely. And just don't start me on Appendix C
> to XHTML/1.0

Oh come on, at least they recognize that xhtml is never going to fly! The
whole popularity of the web lies in the ability of Joe Blow web designer
wannabe to toss whatever tags and styles he wants on a page and see a nice
pretty page pop up in his favourite M$ browser (okay a bit of an
exaggeration). Much as I'd prefer to see sgml-like tagging requirements
enforced in html, xhtml is a pipe-dream. XML will obviously survive, but
getting the riff-raff to adhere to coding standards I just can't see
happening (unless the wysiwyg editors get onboard). Which is sad in the end,
because it means there will always be new people asking how to parse html
with a regular expression...

Matt
 
Ben Morrow

Ben Morrow (e-mail address removed) wrote on MMMDCCXLVIII September MCMXCIII
in <URL:**
** I'm somewhat thinking aloud here, but would
**
** / (?: <!-- (?: [^-] (?!->) )* --> ){5} /x
**
** perform the correct match here? The generalisation of [^>]: ie. 'match
** anything up to this multi-character string' is something one quite
** often wants.

That fails on:

<!--> Comment one <!----> Comment two <!-->

Ouch! I had to think *quite* hard to convince myself that '<!-->
Comment one <!---->' isn't a valid comment... it's obvious once you
see the symmetry of it, of course.

OK, making a small modification and applying the grammar fragment Tad
gave:

my $com = qr/-- (?: (?!--) . )* --/x;
/<! (?: $com (?: \s* | $com )* )? >/x;

or in Perl6, just to show how much nicer it is when you get rid of all
those bleedin' question-marks:

my $com = rx/-- [ <!before --> . ]* --/;
m{<'<!'> [ <$com> [ \s* | <$com> ]* ]? <'>'>};

:).
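
A hedged sketch for trying the Perl 5 version above against a few strings. The \A and \z anchors, the /s flag on $com (so a comment may span lines), the $decl name and the sample strings are all additions for illustration and are not part of the original regex. Anchoring makes each test ask "is this whole string one comment declaration?" rather than "does it contain one somewhere?".

```perl
use strict;
use warnings;

# One comment: '--', any characters as long as none begins a '--', then '--'.
my $com = qr/-- (?: (?!--) . )* --/xs;

# A whole comment declaration, anchored to cover the entire string.
my $decl = qr/\A <! (?: $com (?: \s* | $com )* )? > \z/x;

# [ sample string, expected result under the grammar fragment ]
my @cases = (
    [ '<!-- plain -->',                              1 ],
    [ '<!>',                                         1 ],
    [ '<!-- one -- -- two -->',                      1 ],
    [ '<! -- space first -->',                       0 ],
    [ '<!--> Comment one <!---->',                   0 ],
    [ '<!--> Comment one <!----> Comment two <!-->', 1 ],
);

my @results;
for my $case (@cases) {
    my ( $text, $expected ) = @$case;
    my $got = $text =~ $decl ? 1 : 0;
    push @results, $got;
    printf "%-45s %-9s %s\n", $text,
        $got ? 'matches' : 'no match',
        $got == $expected ? '(as expected)' : '(UNEXPECTED)';
}
```

The last two cases are the ones from the example quoted above: the shorter string is not a complete declaration on its own, while the full string parses as a single declaration containing two comments.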

This problem has been solved before, of course, so this is purely an
exercise in (not-so) Tricky Regexes on my part. Apologies if I've
bored anyone.

Ben
 
