Xicheng Jia
Hi folks:
I have recently been reading Jeffrey Friedl's book "Mastering Regular
Expressions" (O'Reilly, 2nd edition), and found that something in
perldoc might be out of date and not fully in step with Perl's
development.
perldoc -q comment
gives me a C comment stripper (created by Jeffrey Friedl and later
modified by Fred Curtis):
s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined
$2 ? $2 : ""#gse;
I think several parts of it are not optimized, or can be simplified
using features of Perl's regex flavor:
1) /\*[^*]*\*+([^/*][^*]*\*+)*/
This pattern removes a normal C comment of the form /* ..... */, and
was developed before lazy quantifiers were available. As Jeffrey
mentions in his book, a much simpler pattern is:
/\*.*?\*/
(with the /s modifier, so that . also matches newlines), which is
obviously much easier to understand.
2) "(\\.|[^"\\])*"
This pattern matches a complete C string (double-quoted stuff). The
unrolled version of this pattern,
"[^"\\]*(?:\\.[^"\\]*)*"
developed by Jeffrey, can be much more efficient (as he mentions in
his book). The same approach works for the single-quoted stuff.
3) Several capturing parentheses that are used only for grouping could
be changed to the (?: ) form, which saves the engine the work of
recording captures that are never used (the text to keep then moves
from $2 to $1).
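As a quick sanity check of points 1) and 2), here is a minimal sketch that compares the old and new patterns on a small input. The sample strings and variable names are made up for illustration, and the grouping parentheses of the old patterns are written as (?: ) so that each pattern can be wrapped in a single capture. It only checks that the patterns match the same text; it says nothing about speed.

```perl
use strict;
use warnings;

# Made-up sample input: two comments, one spanning a line break.
my $src = qq{int a; /* one\n * two **/ int b; /* three */};

# perldoc-style comment pattern (inner group made non-capturing).
my $old_comment = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/};
# Lazy version; /s lets "." cross the line break inside a comment.
my $new_comment = qr{/\*.*?\*/}s;

my @old_c = $src =~ /($old_comment)/g;
my @new_c = $src =~ /($new_comment)/g;

# Made-up sample input: one string with escaped quotes, one without.
my $code = q{x = "a \"quoted\" word"; y = "plain";};

# Alternation form from perldoc and the unrolled form from the book.
my $old_string = qr{"(?:\\.|[^"\\])*"};
my $new_string = qr{"[^"\\]*(?:\\.[^"\\]*)*"};

my @old_s = $code =~ /($old_string)/g;
my @new_s = $code =~ /($new_string)/g;

print "comments agree\n" if "@old_c" eq "@new_c";
print "strings agree\n"  if "@old_s" eq "@new_s";
```

On this input both pairs should agree. A real comparison would need a much larger corpus and a benchmark (for example with the core Benchmark module) to say anything about efficiency.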
Based on the above, some modifications can be made, and the s///
expression can be rewritten as, e.g.:
s#/\*.*?\*/|//[^\n]*|("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|[^'"/]+)#
defined $1 ? $1 : "" #gse;
or in another form:
s{
    /\*.*?\*/                      ## strip normal C comments
  |                                ## or
    //[^\n]*                       ## strip C++ comments
  |                                ## or
    (                              ## capture $1
        "[^"\\]*(?:\\.[^"\\]*)*"   ## double-quoted stuff
      |                            ## or
        '[^'\\]*(?:\\.[^'\\]*)*'   ## single-quoted stuff
      |                            ## or
        [^"'/]+                    ## text that cannot start a comment
    )                              ## end of capturing $1
}{ defined $1 ? $1 : "" }gsxe;
which I think might be better than the one in 'perldoc -q comment'. I
haven't experimented much with these s/// expressions, though. Just
my $0.02. Thanks for any comments,
Xicheng
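For anyone who wants to try the /x form, here is a minimal sketch that runs it over a small, made-up C snippet (the snippet and the variable names are assumptions for this demonstration only). The replacement tests defined $1 so that a kept chunk consisting of just "0" is not dropped. A lone / that does not start a comment matches none of the alternatives, so s///g simply steps over it and leaves it in place.

```perl
use strict;
use warnings;

# Hypothetical C snippet used only for this demonstration.
my $c = <<'END_C';
int x = 10 / 2;   /* halve it */
char *s = "not /* a comment */";  // trailing note
char q = '"';     /* multi
                     line */
END_C

# Copy the input, then strip comments from the copy.
(my $stripped = $c) =~ s{
    /\*.*?\*/                        # C comment, removed
  | //[^\n]*                         # C++ comment, removed
  | (                                # group 1: text to keep
        "[^"\\]*(?:\\.[^"\\]*)*"     # double-quoted string
      | '[^'\\]*(?:\\.[^'\\]*)*'     # single-quoted string or char
      | [^"'/]+                      # text that cannot start a comment
    )
}{ defined $1 ? $1 : "" }gsxe;

print $stripped;
```

The comments go away while the division, the string containing /* ... */ and the '"' character literal stay intact. Trailing whitespace before each removed comment is left behind, and this has not been tested against trigraphs, line continuations or other corner cases of C.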
=====
USENET is a classroom, for me.