Something about stripping C/C++ comments in perldoc

Discussion in 'Perl Misc' started by Xicheng Jia, Apr 18, 2006.

  1. Xicheng Jia

    Xicheng Jia Guest

    Hi folks:

    I have recently been reading Jeffrey Friedl's book "Mastering Regular
    Expressions" (O'Reilly, 2nd edition), and found that something in
    perldoc might be out of date and not fully caught up with Perl's
    development.

    perldoc -q comment

    This gives me a C comment stripper (created by Jeffrey Friedl and later
    modified by Fred Curtis):

    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined
    $2 ? $2 : ""#gse;

    I think several parts are not optimized, or could be
    simplified using features of Perl's regex flavor:

    1) /\*[^*]*\*+([^/*][^*]*\*+)*/
    This pattern removes a normal C comment of the form /* ..... */. It
    was developed when there were no lazy quantifiers. As Jeffrey
    mentioned in his book, a much simpler pattern is
    /\*.*?\*/ and this one is obviously much easier to understand.

    2) "(\\.|[^"\\])*"
    This pattern captures the whole contents of a C string (double-quoted
    stuff). The unrolled version of this pattern,
    "[^"\\]*(?:\\.[^"\\]*)*", developed by Jeffrey, can be much more
    efficient (as he mentioned in his book). A similar approach works
    for the single-quoted stuff.

    3) Several parentheses that do not need to capture could be changed to the (?: ) form,
    which can somewhat improve the performance of the regex.

    Based on the above, some modifications can be made, and the s///
    expression can be written as, e.g.:

    s#/\*.*?\*/|//.*?\n|("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|[^'"/]+)#
    $1 or "" #gse

    or in another form:

    s{
    /\*.*?\*/ ## strip normal C comments
    | ## or
    //[^\n]* ## strip C++ comments
    | ## or
    ( ## capture $1
    "[^"\\]*(?:\\.[^"\\]*)*" ## double-quoted stuff
    | ## or
    '[^'\\]*(?:\\.[^'\\]*)*' ## single-quoted stuff
    | ## or
    [^"'/]+ ## strings that guarantee a non-comment
    ) ## end of capturing $1
    }{ $1 or "" }gsxe

    which I think might be better than the one in 'perldoc -q comment'. I
    didn't experiment much with these s/// expressions, though. Just
    my $0.02. Thanks for any comments,

    Xicheng
    =====
    USENET is a classroom, for me.:)
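
For readers who want to try the proposal, here is a minimal runnable sketch of the commented form above. The sub name strip_comments and the sample input are assumptions made for illustration; they are not from the post. One deliberate difference: the replacement uses `defined $1 ? $1 : ""` as in the perldoc version, because `$1 or ""` would also discard a kept run that happens to be the string "0".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): applies the proposed
# pattern to a whole source string and returns the stripped text.
sub strip_comments {
    my ($src) = @_;
    $src =~ s{
        /\*.*?\*/                     # block comment, lazily up to the first star-slash
      | //[^\n]*                      # line comment, the newline itself is kept
      | (                             # group 1: text to keep
            "[^"\\]*(?:\\.[^"\\]*)*"  # double-quoted string (unrolled loop)
          | '[^'\\]*(?:\\.[^'\\]*)*'  # single-quoted character constant
          | [^"'/]+                   # run that cannot start a comment or a quote
        )
    }{ defined $1 ? $1 : "" }gsxe;
    return $src;
}

# Made-up sample: a division, a block comment, a string that looks like
# a comment, and a trailing line comment.
my $code = qq{int x = 10/2; /* half */\nchar *s = "/* kept */"; // tail\n};
print strip_comments($code);
```

A lone "/" (the division above) matches none of the alternatives, so s///g simply steps over it and leaves it in place. Line continuations and trigraphs are not handled by this sketch.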
     
    Xicheng Jia, Apr 18, 2006
    #1

  2. Xicheng Jia

    Xicheng Jia Guest

    Xicheng Jia wrote:
    > Hi folks:
    >
    > I am recently reading Jeffery Friedl's book "Mastering Regular
    > Expressions"(O'Reilly, 2nd edition), and found that something in
    > perldoc might be out of date and not fully updated with Perl's
    > development.
    >
    > perldoc -q comment
    >
    > this gives me a C comments stripper(created by Jeffrey Friedl and later
    > modified by Fred Curtis.):
    >
    > s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined
    > $2 ? $2 : ""#gse;
    >
    > I think there are several parts which are not optimized or can be
    > simplified from Perl regex's flavor:
    >
    > 1) /\*[^*]*\*+([^/*][^*]*\*+)*/
    > this pattern is to remove a normal C comment in form of /* ..... */,
    > which is developed when there is no lazy quantifiers. As Jeffery
    > metioned in his book, a much simpler pattern can be:
    > /\*.*?\*/ and this one is obviously much easier to be understood..
    >
    > 2) "(\\.|[^"\\])*"
    > this pattern is to capture all contents in a C string(double-quoted
    > stuff), and the unrolling version of this pattern
    > "[^"\\]*(?:\\.[^"\\]*)*" developed by Jeffery can be much more
    > efficient(as he mentioned in his book). A similar approach can be done
    > with the single-quoted stuff..
    >
    > 3) several non-capturing parentheses could be modified to(?: ) form
    > which can somehow optimize the performace of the regex.
    >
    > According to the above, some modification can be made, and the s///
    > expression can be written to, i.e.:


    =>
    s#/\*.*?\*/|//.*?\n|("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|[^'"/]+)#
    => $1 or "" #gse

    //.*?\n should be //[^\n]*

    A one-liner for testing under Linux can be written roughly as follows
    (note: the single-quoted alternative is removed):

    perl -0777pe '
    s#/\*.*?\*/|//[^\n]*|("[^"\\]*(?:\\.[^"\\]*)*"|[^"/]+)# $1 or ""
    #gse
    ' myfile.cpp

    Xicheng

    > or in another form:
    >
    > s{
    > /\*.*?\*/ ## strip normal C comments
    > | ## or
    > //[^\n]* ## strip C++ comments
    > | ## or
    > ( ## capture $1
    > "[^"\\]*(?:\\.[^"\\]*)*" ## double-quoted stuff
    > | ## or
    > '[^'\\]*(?:\\.[^'\\]*)*' ## single-quoted stuff
    > | ## or
    > [^"'/]+ ## strings that guarantee a non-comment
    > ) ## end of capturing $1
    > }{ $1 or "" }gsxe
    >
    > which I think might be better than the one in 'perldoc -q comment'.. I
    > didnt do very much experiment on this s/// expressions though. Just
    > some of my $0.02.. Thanks for any comments,
    >
    > Xicheng
    > =====
    > USENET is a classroom, for me.:)
     
    Xicheng Jia, Apr 18, 2006
    #2

  3. Lukas Mai

    Lukas Mai Guest

    Xicheng Jia <> schrob:
    [stripping C comments]
    >
    > s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined
    > $2 ? $2 : ""#gse;
    >
    > I think there are several parts which are not optimized or can be
    > simplified from Perl regex's flavor:
    >
    > 1) /\*[^*]*\*+([^/*][^*]*\*+)*/
    > this pattern is to remove a normal C comment in form of /* ..... */,
    > which is developed when there is no lazy quantifiers. As Jeffery
    > metioned in his book, a much simpler pattern can be:
    > /\*.*?\*/ and this one is obviously much easier to be understood..


    There's rumors on the internets that non-greedy quantifiers are slower
    than their normal counterparts. I don't know if that's true, but .*?
    still feels "unclean" to me.

    [other improvements]
    >
    > s#/\*.*?\*/|//.*?\n|("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|[^'"/]+)#
    > $1 or "" #gse
    >
    > or in another form:
    >
    > s{
    > /\*.*?\*/ ## strip normal C comments
    > | ## or
    > //[^\n]* ## strip C++ comments
    > | ## or
    > ( ## capture $1
    > "[^"\\]*(?:\\.[^"\\]*)*" ## double-quoted stuff
    > | ## or
    > '[^'\\]*(?:\\.[^'\\]*)*' ## single-quoted stuff
    > | ## or
    > [^"'/]+ ## strings that guarantee a non-comment
    > ) ## end of capturing $1
    > }{ $1 or "" }gsxe
    >
    > which I think might be better than the one in 'perldoc -q comment'.. I
    > didnt do very much experiment on this s/// expressions though. Just
    > some of my $0.02.. Thanks for any comments,


    Of course, this regex is still incomplete because it completely ignores
    trigraphs and continuation lines:

    /??/
    * this is a comment */

    ??/ is a trigraph for \, \<newline> is removed, then /* ... */ is parsed
    as a comment.
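
One way to picture the point about trigraphs and continuation lines is a pre-pass that mimics the early translation phases before any comment pattern runs. The sketch below is an illustration only, not the approach taken in this post: the sub name splice_lines is made up, only the ??/ trigraph is handled, and the spliced lines are not restored afterwards.

```perl
use strict;
use warnings;

# Hypothetical pre-pass (name not from the thread): rewrite the ??/
# trigraph as a backslash, then delete every backslash-newline pair.
sub splice_lines {
    my ($src) = @_;
    $src =~ s{\?\?/}{\\}g;   # trigraph ??/ becomes a single backslash
    $src =~ s{\\\n}{}g;      # backslash followed by newline is removed
    return $src;
}

my $tricky = "/??/\n* this is a comment */ int x;\n";
print splice_lines($tricky);   # prints: /* this is a comment */ int x;
```

After this pass the comment opener is an ordinary "/*" again, so a simple comment pattern can see it. The cost is that the output no longer has the same line structure as the input.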

    Another problem is that comments are semantically equivalent to
    whitespace, so something like "int/**/main" should turn into "int main",
    not "intmain".

    Here's my own version:

    #!/usr/local/bin/perl -p0777

    s!
        /
        (?: (?: \\ | \?\?/ ) \n )*
        (?:
            /
            (?:
                (?: \\ | \?\?/ ) \n
              |
                [^\n]
            )*
          |
            \*
            [^*]* \*+
            (?: (?: \\ | \?\?/ ) \n )*
            (?:
                [^/*]
                [^*]* \*+
                (?: (?: \\ | \?\?/ ) \n )*
            )*
            (/)
        )
      |
        (
            "
            (?:
                (?: \\ | \?\?/ ) .
              |
                [^"]
            )*
            "
          |
            '
            (?:
                (?: \\ | \?\?/ ) .
              |
                [^']
            )*
            '
          |
            . [^'"/]*
        )
    !(defined $1 ? ' ' : '') . $2!gsex
    __END__
     
    Lukas Mai, Apr 18, 2006
    #3
  4. robic0

    robic0 Guest

    On 18 Apr 2006 11:22:10 -0700, "Xicheng Jia" <> wrote:

    >Hi folks:
    >
    >I am recently reading Jeffery Friedl's book "Mastering Regular
    >Expressions"(O'Reilly, 2nd edition), and found that something in
    >perldoc might be out of date and not fully updated with Perl's
    >development.
    >
    > perldoc -q comment
    >
    >this gives me a C comments stripper(created by Jeffrey Friedl and later
    >modified by Fred Curtis.):
    >
    >s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined
    >$2 ? $2 : ""#gse;
    >
    >I think there are several parts which are not optimized or can be
    >simplified from Perl regex's flavor:
    >
    >1) /\*[^*]*\*+([^/*][^*]*\*+)*/
    >this pattern is to remove a normal C comment in form of /* ..... */,
    >which is developed when there is no lazy quantifiers. As Jeffery
    >metioned in his book, a much simpler pattern can be:
    > /\*.*?\*/ and this one is obviously much easier to be understood..
    >
    >2) "(\\.|[^"\\])*"
    >this pattern is to capture all contents in a C string(double-quoted
    >stuff), and the unrolling version of this pattern
    >"[^"\\]*(?:\\.[^"\\]*)*" developed by Jeffery can be much more
    >efficient(as he mentioned in his book). A similar approach can be done
    >with the single-quoted stuff..
    >
    >3) several non-capturing parentheses could be modified to(?: ) form
    >which can somehow optimize the performace of the regex.
    >
    >According to the above, some modification can be made, and the s///
    >expression can be written to, i.e.:
    >
    >s#/\*.*?\*/|//.*?\n|("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|[^'"/]+)#
    >$1 or "" #gse
    >
    >or in another form:
    >
    >s{
    > /\*.*?\*/ ## strip normal C comments
    > | ## or
    > //[^\n]* ## strip C++ comments


    what if you have something like this:
    // comments: ... /*<newline>
    some code example <newline>
    more code // embedded comment, code example <newline>
    /* more code and comments <newline>
    */ <newline>
    // comments <newline>
    */ <newline>
    It does matter if it won't compile but then you have to invoke
    the compiler and parse its output.

    The opposite construction as well:
    /* .... /* .... */ this is left in */

    > | ## or
    > ( ## capture $1
    > "[^"\\]*(?:\\.[^"\\]*)*" ## double-quoted stuff
    > | ## or
    > '[^'\\]*(?:\\.[^'\\]*)*' ## single-quoted stuff
    > | ## or
    > [^"'/]+ ## strings that guarantee a non-comment
    > ) ## end of capturing $1
    >}{ $1 or "" }gsxe
    >
    >which I think might be better than the one in 'perldoc -q comment'.. I
    >didnt do very much experiment on this s/// expressions though. Just
    >some of my $0.02.. Thanks for any comments,
    >
    >Xicheng
    >=====
    >USENET is a classroom, for me.:)


    I don't know if the above won't compile on today's compilers; it didn't (if the
    code was bad) on VC6 and below.
    For '//' the end delimiter might be the EOL or, if continuations are allowed,
    the EOL on the next line. A 'rolling' regexp parse (global) will have problems
    with nesting.

    There doesn't seem to be a defined standard on comments. There may be, dunno.
    XML runs into the same problem with COMMENT/CDATA statements.
    The difference might be that the XML standard has a clearly defined path
    of precedence. It's chiseled in stone. There's no ambiguity.
    Your code may work for perfectly constructed comments (the ones that compile within
    C/C++ code) as the idea of such exists in your mind, but don't fool yourself as to
    the flaws in this regexp.

    It's not really flawed for what it does, in your mind;
    it's that the idea is a conceptual *error*.

    The glaring flaw is that s///g is not compatible with this, *if* nesting is to
    be taken into account and allowed. I don't think there's a standards committee for
    C/C++ comments; compilers give you what you get.

    If you want to pursue an *all cases* approach, check the just-posted RXParse XML parser to see how it
    effectively deals with COMMENT/CDATA.
     
    robic0, Apr 18, 2006
    #4
  5. robic0

    robic0 Guest

    On Tue, 18 Apr 2006 21:37:00 +0200, "Lukas Mai" <> wrote:

    >Xicheng Jia <> schrob:
    >[stripping C comments]
    >>
    >> s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined
    >> $2 ? $2 : ""#gse;
    >>
    >> I think there are several parts which are not optimized or can be
    >> simplified from Perl regex's flavor:
    >>
    >> 1) /\*[^*]*\*+([^/*][^*]*\*+)*/
    >> this pattern is to remove a normal C comment in form of /* ..... */,
    >> which is developed when there is no lazy quantifiers. As Jeffery
    >> metioned in his book, a much simpler pattern can be:
    >> /\*.*?\*/ and this one is obviously much easier to be understood..

    >
    >There's rumors on the internets that non-greedy quantifiers are slower
    >than their normal counterparts. I don't know if that's true, but .*?
    >still feels "unclean" to me.


    Hogwash!!!

    >
    >[other improvements]
    >>
    >> s#/\*.*?\*/|//.*?\n|("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|[^'"/]+)#
    >> $1 or "" #gse
    >>
    >> or in another form:
    >>
    >> s{
    >> /\*.*?\*/ ## strip normal C comments
    >> | ## or
    >> //[^\n]* ## strip C++ comments
    >> | ## or
    >> ( ## capture $1
    >> "[^"\\]*(?:\\.[^"\\]*)*" ## double-quoted stuff
    >> | ## or
    >> '[^'\\]*(?:\\.[^'\\]*)*' ## single-quoted stuff
    >> | ## or
    >> [^"'/]+ ## strings that guarantee a non-comment
    >> ) ## end of capturing $1
    >> }{ $1 or "" }gsxe
    >>
    >> which I think might be better than the one in 'perldoc -q comment'.. I
    >> didnt do very much experiment on this s/// expressions though. Just
    >> some of my $0.02.. Thanks for any comments,

    >
    >Of course, this regex is still incomplete because it completely ignores
    >trigraphs and continuation lines:
    >

    huh, trigraphs?

    >/??/
    >* this is a comment */
    >
    >??/ is a trigraph for \, \<newline> is removed, then /* ... */ is parsed
    >as a comment.


    huh?
    >
    >Another problem is that comments are semantically equivalent to
    >whitespace, so something like "int/**/main" should turn into "int main",
    >not "intmain".


    int/**/main doesn't compile on my machine

    >
    >Here's my own version:
    >
    >#!/usr/local/bin/perl -p0777
    >
    >s!
    > /
    > (?: (?: \\ | \?\?/ ) \n )*
    > (?:
    > /
    > (?:
    > (?: \\ | \?\?/ ) \n
    > |
    > [^\n]
    > )*
    > |
    > \*
    > [^*]* \*+
    > (?: (?: \\ | \?\?/ ) \n )*
    > (?:
    > [^/*]
    > [^*]* \*+
    > (?: (?: \\ | \?\?/ ) \n )*
    > )*
    > (/)
    > )
    >|
    > (
    > "
    > (?:
    > (?: \\ | \?\?/ ) .
    > |
    > [^"]
    > )*
    > "
    > |
    > '
    > (?:
    > (?: \\ | \?\?/ ) .
    > |
    > [^']
    > )*
    > '
    > |
    > . [^'"/]*
    > )
    >!(defined $1 ? ' ' : '') . $2!gsex
    >__END__


    /* .... /* .... */ what's this? */
     
    robic0, Apr 18, 2006
    #5
  6. Xicheng Jia

    Xicheng Jia Guest

    Lukas Mai wrote:
    > Xicheng Jia <> schrob:
    > [stripping C comments]
    > >
    > > s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined
    > > $2 ? $2 : ""#gse;
    > >
    > > I think there are several parts which are not optimized or can be
    > > simplified from Perl regex's flavor:
    > >
    > > 1) /\*[^*]*\*+([^/*][^*]*\*+)*/
    > > this pattern is to remove a normal C comment in form of /* ..... */,
    > > which is developed when there is no lazy quantifiers. As Jeffery
    > > metioned in his book, a much simpler pattern can be:
    > > /\*.*?\*/ and this one is obviously much easier to be understood..


    => There's rumors on the internets that non-greedy quantifiers are slower
    => than their normal counterparts. I don't know if that's true, but .*?
    => still feels "unclean" to me.

    From Jeffrey's book "Mastering Regular Expressions" (2nd edition,
    O'Reilly):

    "Lazy versus Greedy": Page 256
    "It's not always obvious which is best......... If the data is random,
    and you have no idea which will be more likely, use a greedy
    quantifier, as they are generally optimized a bit better than
    non-greedy quantifier, especially when what follows in the regex
    disallows the character following lazy quantifier
    optimization(page-249)."


    "Specific versus Lazy" page 257
    "Generally, using a negated class is much more efficient than a lazy
    quantifier. One exception is Perl, because it has that character
    following lazy quantifier optimization"

    From the above, because Perl supports the "character following lazy
    quantifier" optimization, I feel that non-greedy quantifiers are not as
    bad as the rumors you heard. :)

    Moreover, in Jeffrey's book, page 272, 4th paragraph:
    ".......So, with modern versions of Perl, I'd just use /\*.*?\*/ to
    match C comments and be done with it."
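
Rather than settle the question by rumor or by quotation, it can be measured on whatever Perl is at hand. The sketch below uses the core Benchmark module to compare the lazy pattern with the unrolled one; the sample text is made up, and no timings are claimed here because they depend on the Perl version and on the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $lazy     = qr{/\*.*?\*/}s;
my $unrolled = qr{/\*[^*]*\*+(?:[^/*][^*]*\*+)*/};

# Hypothetical sample: one long comment with stars sprinkled inside.
my $sample = "/*" . ("lorem * ipsum ** dolor\n" x 2000) . "*/ int x;";

# Sanity check first: both patterns should stop at the first star-slash.
my ($m_lazy)     = $sample =~ /($lazy)/;
my ($m_unrolled) = $sample =~ /($unrolled)/;
print "same span matched\n" if $m_lazy eq $m_unrolled;

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    lazy     => sub { $sample =~ $lazy },
    unrolled => sub { $sample =~ $unrolled },
});
```

Results on short comments and on comments dense with stars may differ, so it is worth varying the sample before drawing conclusions.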

    => [other improvements]
    > >
    > > s#/\*.*?\*/|//.*?\n|("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|[^'"/]+)#
    > > $1 or "" #gse
    > >
    > > or in another form:
    > >
    > > s{
    > > /\*.*?\*/ ## strip normal C comments
    > > | ## or
    > > //[^\n]* ## strip C++ comments
    > > | ## or
    > > ( ## capture $1
    > > "[^"\\]*(?:\\.[^"\\]*)*" ## double-quoted stuff
    > > | ## or
    > > '[^'\\]*(?:\\.[^'\\]*)*' ## single-quoted stuff
    > > | ## or
    > > [^"'/]+ ## strings that guarantee a non-comment
    > > ) ## end of capturing $1
    > > }{ $1 or "" }gsxe
    > >
    > > which I think might be better than the one in 'perldoc -q comment'.. I
    > > didnt do very much experiment on this s/// expressions though. Just
    > > some of my $0.02.. Thanks for any comments,


    => Of course, this regex is still incomplete because it completely ignores
    => trigraphs and continuation lines:
    => /??/
    => * this is a comment */
    =>
    => ??/ is a trigraph for \, \<newline> is removed, then /* ... */ is parsed
    => as a comment.
    =>
    => Another problem is that comments are semantically equivalent to
    => whitespace, so something like "int/**/main" should turn into "int main",
    => not "intmain".

    I guess the regex we have discussed so far was written for traditional C
    rather than ANSI C. As far as I know, in traditional C (old K&R),
    int/**/main is parsed into "intmain" instead of "int main", and it also
    does not support trigraphs. :)

    => Here's my own version:
    > #!/usr/local/bin/perl -p0777
    >

    => s!
    => /
    => (?: (?: \\ | \?\?/ ) \n )*
    => (?:
    => /
    => (?:
    => (?: \\ | \?\?/ ) \n
    => |
    => [^\n]
    => )*
    => |
    => \*
    => [^*]* \*+
    => (?: (?: \\ | \?\?/ ) \n )*
    => (?:
    => [^/*]
    => [^*]* \*+
    => (?: (?: \\ | \?\?/ ) \n )*
    => )*
    => (/)
    => )
    => |
    => (
    => "
    => (?:
    => (?: \\ | \?\?/ ) .
    => |
    => [^"]
    => )*
    => "
    => |
    => '
    => (?:
    => (?: \\ | \?\?/ ) .
    => |
    => [^']
    => )*
    => '
    => |
    => . [^'"/]*
    => )
    => !(defined $1 ? ' ' : '') . $2!gsex

    Could you please group your patterns and add some comments to them,
    so that we can see what you did and how? Many thanks.

    Xicheng
    =====
    USENET is a classroom, for me:)
     
    Xicheng Jia, Apr 18, 2006
    #6
  7. Xicheng Jia

    Xicheng Jia Guest

    robic0 wrote:
    > On 18 Apr 2006 11:22:10 -0700, "Xicheng Jia" <> wrote:
    >
    > >Hi folks:
    > >
    > >I am recently reading Jeffery Friedl's book "Mastering Regular
    > >Expressions"(O'Reilly, 2nd edition), and found that something in
    > >perldoc might be out of date and not fully updated with Perl's
    > >development.
    > >
    > > perldoc -q comment
    > >
    > >this gives me a C comments stripper(created by Jeffrey Friedl and later
    > >modified by Fred Curtis.):
    > >
    > >s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined
    > >$2 ? $2 : ""#gse;
    > >
    > >I think there are several parts which are not optimized or can be
    > >simplified from Perl regex's flavor:
    > >
    > >1) /\*[^*]*\*+([^/*][^*]*\*+)*/
    > >this pattern is to remove a normal C comment in form of /* ..... */,
    > >which is developed when there is no lazy quantifiers. As Jeffery
    > >metioned in his book, a much simpler pattern can be:
    > > /\*.*?\*/ and this one is obviously much easier to be understood..
    > >
    > >2) "(\\.|[^"\\])*"
    > >this pattern is to capture all contents in a C string(double-quoted
    > >stuff), and the unrolling version of this pattern
    > >"[^"\\]*(?:\\.[^"\\]*)*" developed by Jeffery can be much more
    > >efficient(as he mentioned in his book). A similar approach can be done
    > >with the single-quoted stuff..
    > >
    > >3) several non-capturing parentheses could be modified to(?: ) form
    > >which can somehow optimize the performace of the regex.
    > >
    > >According to the above, some modification can be made, and the s///
    > >expression can be written to, i.e.:
    > >
    > >s#/\*.*?\*/|//.*?\n|("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|[^'"/]+)#
    > >$1 or "" #gse
    > >
    > >or in another form:
    > >
    > >s{
    > > /\*.*?\*/ ## strip normal C comments
    > > | ## or
    > > //[^\n]* ## strip C++ comments

    >
    > what if you have something like this:
    > // comments: ... /*<newline>
    > some code example <newline>
    > more code // embedded comment, code example <newline>
    > /* more code and comments <newline>
    > */ <newline>
    > // comments <newline>
    > */ <newline>
    > It does matter if it won't compile but then you have to invoke
    > the compiler and parse its output.
    >
    > The opposite construction as well:
    > /* .... /* .... */ this is left in */
    >
    > > | ## or
    > > ( ## capture $1
    > > "[^"\\]*(?:\\.[^"\\]*)*" ## double-quoted stuff
    > > | ## or
    > > '[^'\\]*(?:\\.[^'\\]*)*' ## single-quoted stuff
    > > | ## or
    > > [^"'/]+ ## strings that guarantee a non-comment
    > > ) ## end of capturing $1
    > >}{ $1 or "" }gsxe
    > >
    > >which I think might be better than the one in 'perldoc -q comment'.. I
    > >didnt do very much experiment on this s/// expressions though. Just
    > >some of my $0.02.. Thanks for any comments,
    > >
    > >Xicheng
    > >=====
    > >USENET is a classroom, for me.:)

    >
    > I don't know if the above won't compile on todays compilers, it didn't (if the
    > code was bad) on Vc6 and below.


    => For '//' the end delimeter might be the eol or if continuations
    => are allowed, the eol on the next line.

    What do you mean by "eol"? Isn't that "\n"? Or do you mean there is a
    backslash at the end of the line, so that the next line is a
    continuation line? Huh, I guess that would be a problem :)

    => A 'rolling' regexp parse (global) will have problems with nesting.

    I am trying Jeffery's unrolling version of the regex..

    > Doesen't seem to be a defined standard on comments. There may be, dunno.
    > XML runs into the same problem with COMMENT/CDATA statement.
    > The difference might be that the XML standard has clearly defined path
    > of precedence. Its chiseled in stone. There's no ambiguity.
    > Your code may work for perfectly constructed comments (the ones that compile within
    > C/C++ code) as the idea of such exists in your mind, but don't fool yourself as to
    > the flaws in this regexp.


    I knew the regex can NOT handle everything, and I just want to learn
    something from trying it..:) thanks anyway for your suggestions..:)

    > Its not really flawed for what it does, in your mind,
    > its that the idea is a conceptual *error*.
    >
    > The glaring flaw is that s///g is not compatable with this, *if* nesting will
    > be taken into account and allowed. I don't think there's a standards commitee for
    > C/C++ comments, compilers give you what you get.
    >
    > If you want to persue an *all cases* approach check the just posted RXParse xml parser on how it
    > effectively deals with COMMENT/CDATA.
     
    Xicheng Jia, Apr 18, 2006
    #7
  8. Xicheng Jia

    Xicheng Jia Guest

    Lukas Mai wrote:
    > Xicheng Jia <> schrob:
    > [stripping C comments]
    > >

    > [other improvements]
    > >
    > > s#/\*.*?\*/|//.*?\n|("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|[^'"/]+)#
    > > $1 or "" #gse
    > >
    > > or in another form:
    > >
    > > s{
    > > /\*.*?\*/ ## strip normal C comments
    > > | ## or
    > > //[^\n]* ## strip C++ comments
    > > | ## or
    > > ( ## capture $1
    > > "[^"\\]*(?:\\.[^"\\]*)*" ## double-quoted stuff
    > > | ## or
    > > '[^'\\]*(?:\\.[^'\\]*)*' ## single-quoted stuff
    > > | ## or
    > > [^"'/]+ ## strings that guarantee a non-comment
    > > ) ## end of capturing $1
    > > }{ $1 or "" }gsxe
    > >
    > > which I think might be better than the one in 'perldoc -q comment'.. I
    > > didnt do very much experiment on this s/// expressions though. Just
    > > some of my $0.02.. Thanks for any comments,

    >
    > Of course, this regex is still incomplete because it completely ignores
    > trigraphs and continuation lines:
    >

    => /??/
    => * this is a comment */
    =>
    => ??/ is a trigraph for \, \<newline> is removed, then /* ... */ is parsed
    => as a comment.

    I just checked the trigraph ??/, which is exactly a backslash, so it
    comes down to a situation similar to the one robic0 proposed, where there is a
    trailing backslash on the same line as a "//" comment. However,
    writing C code in the following ways is quite unusual, isn't it? :)

    /\
    * this is a comment *\
    /

    // this is \
    a comment
    __________________________
    > Another problem is that comments are semantically equivalent to
    > whitespace, so something like "int/**/main" should turn into "int main",
    > not "intmain".


    For ANSI C, this kind of comment is converted into a single space at the
    preprocessing stage, so fixing it might be as easy as changing the
    replacement part of the regex from:

    { $1 or "" }gsxe to { $1 or " " }gsxe
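
A caveat worth checking before adopting that replacement: in Perl the string "0" is false, so `$1 or " "` also replaces a kept run that is exactly "0". The sketch below compares it with the `defined` test used in the perldoc version. The sample input is made up for illustration.

```perl
use strict;
use warnings;

# The pattern proposed earlier in the thread, compiled once.
my $pattern = qr{
    /\*.*?\*/
  | //[^\n]*
  | ( "[^"\\]*(?:\\.[^"\\]*)*"
    | '[^'\\]*(?:\\.[^'\\]*)*'
    | [^"'/]+
    )
}sx;

# Made-up input: a "0" isolated between two divisions, then int/**/main.
my $src = "a/0/b int/**/main";

(my $with_or      = $src) =~ s{$pattern}{ $1 or " " }ge;
(my $with_defined = $src) =~ s{$pattern}{ defined $1 ? $1 : " " }ge;

print "$with_or\n";        # prints: a/ /b int main
print "$with_defined\n";   # prints: a/0/b int main
```

Both variants turn int/**/main into "int main", but only the `defined` form keeps the "0" operand.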

    Xicheng. :)
     
    Xicheng Jia, Apr 18, 2006
    #8
  9. Lukas Mai

    Lukas Mai Guest

    robic0 schrob:
    >
    > what if you have something like this:
    > // comments: ... /*<newline>


    That's a single comment.

    > some code example <newline>
    > more code // embedded comment, code example <newline>


    Code followed by a comment.

    > /* more code and comments <newline>
    > */ <newline>


    A comment.

    > // comments <newline>


    A comment.

    > */ <newline>


    Syntax error, expecting value before /.

    > It does matter if it won't compile but then you have to invoke
    > the compiler and parse its output.
    >
    > The opposite construction as well:
    > /* .... /* .... */ this is left in */

    |--------------------------|

    Comment, code, syntax error.
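
The same reading can be reproduced with the lazy pattern discussed earlier in the thread, since it also stops at the first closing delimiter. This is a small sketch on the quoted input, with the comment replaced by one space:

```perl
use strict;
use warnings;

my $src = "/* .... /* .... */ this is left in */";

# Replace each block comment with one space. The match ends at the first
# star-slash, so the inner opener is just part of the comment text.
(my $out = $src) =~ s{/\*.*?\*/}{ }gs;

print "[$out]\n";   # prints: [  this is left in */]
```

The leftover text, including the stray closing delimiter, stays in the output, which matches how a C compiler treats it: as code.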

    > I don't know if the above won't compile on todays compilers, it didn't
    > (if the code was bad) on Vc6 and below.
    > For '//' the end delimeter might be the eol or if continuations are
    > allowed, the eol on the next line. A 'rolling' regexp parse (global)
    > will have problems with nesting.


    How about actually reading the relevant docs instead of babbling?

    > Doesen't seem to be a defined standard on comments. There may be,
    > dunno. XML runs into the same problem with COMMENT/CDATA statement.
    > The difference might be that the XML standard has clearly defined path
    > of precedence. Its chiseled in stone. There's no ambiguity.
    > Your code may work for perfectly constructed comments (the ones that
    > compile within C/C++ code) as the idea of such exists in your mind,
    > but don't fool yourself as to the flaws in this regexp.


    More babbling.

    > Its not really flawed for what it does, in your mind,
    > its that the idea is a conceptual *error*.
    >
    > The glaring flaw is that s///g is not compatable with this, *if*
    > nesting will be taken into account and allowed. I don't think there's
    > a standards commitee for C/C++ comments, compilers give you what you
    > get.


    Uh. I recommend you take a look at ISO 9899:1999. There is no separate
    standard for C comments because they're part of C.

    > If you want to persue an *all cases* approach check the just posted
    > RXParse xml parser on how it effectively deals with COMMENT/CDATA.


    How does an XML parser help with tokenizing C?
     
    Lukas Mai, Apr 19, 2006
    #9
  10. robic0

    robic0 Guest

    On Wed, 19 Apr 2006 02:44:53 +0200, "Lukas Mai" <> wrote:

    >robic0 schrob:
    >>
    >> what if you have something like this:
    >> // comments: ... /*<newline>

    >
    >That's a single comment.
    >
    >> some code example <newline>
    >> more code // embedded comment, code example <newline>

    >
    >Code followed by a comment.
    >
    >> /* more code and comments <newline>
    >> */ <newline>

    >
    >A comment.
    >
    >> // comments <newline>

    >
    >A comment.
    >
    >> */ <newline>

    >
    >Syntax error, expecting value before /.
    >
    >> It does matter if it won't compile but then you have to invoke
    >> the compiler and parse its output.
    >>
    >> The opposite construction as well:
    >> /* .... /* .... */ this is left in */

    > |--------------------------|
    >
    >Comment, code, syntax error.
    >
    >> I don't know if the above won't compile on todays compilers, it didn't
    >> (if the code was bad) on Vc6 and below.
    >> For '//' the end delimeter might be the eol or if continuations are
    >> allowed, the eol on the next line. A 'rolling' regexp parse (global)
    >> will have problems with nesting.

    >
    >How about actually reading the relevant docs instead of babbling?
    >
    >> Doesen't seem to be a defined standard on comments. There may be,
    >> dunno. XML runs into the same problem with COMMENT/CDATA statement.
    >> The difference might be that the XML standard has clearly defined path
    >> of precedence. Its chiseled in stone. There's no ambiguity.
    >> Your code may work for perfectly constructed comments (the ones that
    >> compile within C/C++ code) as the idea of such exists in your mind,
    >> but don't fool yourself as to the flaws in this regexp.

    >
    >More babbling.
    >
    >> Its not really flawed for what it does, in your mind,
    >> its that the idea is a conceptual *error*.
    >>
    >> The glaring flaw is that s///g is not compatable with this, *if*
    >> nesting will be taken into account and allowed. I don't think there's
    >> a standards commitee for C/C++ comments, compilers give you what you
    >> get.

    >
    >Uh. I recommend you take a look at ISO 9899:1999. There is no separate
    >standard for C comments because they're part of C.
    >
    >> If you want to persue an *all cases* approach check the just posted
    >> RXParse xml parser on how it effectively deals with COMMENT/CDATA.

    >
    >How does an XML parser help with tokenizing C?


    You must be from another planet, you sound like you have written all these
    exceptions and caveats..... I mean other than the written English you write
    here
     
    robic0, Apr 19, 2006
    #10