Stripping C-style comments using a Python regexp

lorinh

Hi Folks,

I'm trying to strip C/C++ style comments (/* ... */ or // ) from
source code using Python regexps.

If I don't have to worry about comments embedded in strings, it seems
pretty straightforward (this is what I'm using now):

import re

cpp_pat = re.compile(r"""
    /\* .*? \*/   |   # C comments
    // [^\n\r]*       # C++ comments
""", re.S | re.X)
with open('myprog.cpp') as f:
    s = f.read()
s = cpp_pat.sub(' ', s)

However, the sticking point is dealing with tokens like /* embedded
within a string:

const char *mystr = "This is /*trouble*/";
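To make the failure concrete, here is a small sketch that runs the naive pattern over that one line. The names `naive_pat`, `line` and `stripped` are only illustrative, and the pattern is assumed to be the same one shown above.

```python
import re

# The naive pattern from above: it knows nothing about string literals.
naive_pat = re.compile(r"""
    /\* .*? \*/   |   # C comments
    // [^\n\r]*       # C++ comments
""", re.S | re.X)

line = 'const char *mystr = "This is /*trouble*/";'

# The /*trouble*/ inside the string literal is treated as a comment
# and replaced by a single space, which changes the program's data.
stripped = naive_pat.sub(' ', line)
print(stripped)
```

The string contents are silently altered, which is why the pattern needs to recognize quoted strings before it looks for comments.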

I've inherited a working Perl script, which I'd like to reimplement in
Python so that I don't have to spawn a new Perl process in my Python
program each time I want to strip comments from a file. The Perl script
looks like this:

#!/usr/bin/perl -w

$/ = undef; # no line delimiter
$_ = <>; # read entire file

s! ((['"]) (?: \\. | .)*? \2) | # skip quoted strings
/\* .*? \*/ | # delete C comments
// [^\n\r]* # delete C++ comments
! $1 || ' ' # change comments to a single space
!xseg; # ignore white space, treat as single line
# evaluate result, repeat globally
print;

The Perl regexp above uses some sort of conditional to deal with this,
by replacing a quoted string with itself if the initial match is a
quoted string. Is there some equivalent feature in Python regexps?

Lorin
 
Lonnie Princehouse

> Is there some equivalent feature in Python regexps?

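import re

# Group 1 matches a C comment, group 2 a double-quoted string.
cpp_pat = re.compile(r'(/\*.*?\*/)|(".*?")', re.S)

def subfunc(match):
    # Keep strings unchanged; replace comments with nothing.
    if match.group(2):
        return match.group(2)
    else:
        return ''

stripped_c_code = cpp_pat.sub(subfunc, c_code)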


...I suppose this is what the Perl code might do, but I'm not sure,
since trying to read it hurts my brain...
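For comparison, a closer port of the Perl substitution might look like the sketch below. It assumes the Perl pattern can be carried over essentially unchanged into Python's verbose syntax; the names `perl_like_pat`, `keep_strings` and `strip_comments` are made up for this example. Unlike the shorter pattern above, it also covers // comments, single-quoted character constants and backslash-escaped quotes.

```python
import re

# Port of the Perl pattern: group 1 captures a whole quoted string,
# group 2 its opening quote (matched again at the end via \2).
perl_like_pat = re.compile(r"""
    ( (['"]) (?: \\. | . )*? \2 )   |   # quoted string: keep as-is
    /\* .*? \*/                     |   # C comment
    // [^\n\r]*                         # C++ comment
""", re.S | re.X)

def keep_strings(match):
    # Counterpart of Perl's `$1 || ' '`: return the string literal if one
    # matched, otherwise replace the comment with a single space.
    return match.group(1) or ' '

def strip_comments(source):
    return perl_like_pat.sub(keep_strings, source)

if __name__ == "__main__":
    print(strip_comments('const char *mystr = "This is /*trouble*/"; /* gone */'))
```

The replacement function plays the role of the `e` flag in the Perl version: the replacement is computed per match instead of being a fixed string.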
 
Jeff Epler

#------------------------------------------------------------------------
import re, sys

def q(c):
    """Returns a regular expression that matches a region delimited by c,
    inside which c may be escaped with a backslash"""

    return r"%s(\\.|[^%s])*%s" % (c, c, c)

double_quoted_string = q('"')
single_quoted_string = q("'")
c_comment = r"/\*.*?\*/"
cxx_comment = r"//[^\n]*[\n]"

rx = re.compile("|".join([double_quoted_string, single_quoted_string,
                          c_comment, cxx_comment]), re.DOTALL)

def replace(x):
    x = x.group(0)
    if x.startswith("/"): return ' '
    return x

result = rx.sub(replace, sys.stdin.read())
sys.stdout.write(result)
#------------------------------------------------------------------------

The regular expression matches ""-strings, ''-character-constants,
c-comments, and c++-comments. The replace function returns ' ' (space)
when the matched thing was a comment, or the original thing otherwise.
Depending on your use for this code, replace() should return as many
'\n's as are in the matched thing, or ' ' otherwise, so that line
numbers remain unchanged.
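A line-preserving variant of replace() could be sketched as follows. This is not the code from the listing: the helper names (`quoted`, `token_rx`, `replace_keeping_lines`, `strip_comments_keep_lines`) are invented here, and two details are assumptions of this sketch. The C++ comment pattern stops before the newline instead of consuming it, and the string pattern excludes a bare backslash from the ordinary-character class.

```python
import re

def quoted(c):
    # A region delimited by c, in which c may be escaped with a backslash.
    return r"%s(?:\\.|[^%s\\])*%s" % (c, c, c)

# Strings and character constants first, then both comment styles.
token_rx = re.compile("|".join([
    quoted('"'), quoted("'"), r"/\*.*?\*/", r"//[^\n]*",
]), re.DOTALL)

def replace_keeping_lines(match):
    text = match.group(0)
    if not text.startswith("/"):
        return text  # string or character constant: leave untouched
    # A comment: emit one newline per newline it contained, or a single
    # space if it sat entirely on one line.
    return "\n" * text.count("\n") or " "

def strip_comments_keep_lines(source):
    return token_rx.sub(replace_keeping_lines, source)

if __name__ == "__main__":
    import sys
    sys.stdout.write(strip_comments_keep_lines(sys.stdin.read()))
```

With this variant, a compiler error reported for the stripped output still points at the right line of the original file.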

Basically, the regular expression is a tokenizer, and replace() chooses
what to do with each recognized token. Things not recognized as tokens
by the regular expression are left unchanged.
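The tokenizer view can be made explicit with named groups, so that each match reports which kind of token it is. The sketch below is one possible arrangement, not part of the original script; the group names and the `tokens` helper are assumptions made for illustration.

```python
import re

# One named group per token class; the inner groups are non-capturing
# so that match.lastgroup is always the name of the token class.
token_rx = re.compile(r"""
    (?P<string>      " (?: \\. | [^"\\] )* " )   |
    (?P<char>        ' (?: \\. | [^'\\] )* ' )   |
    (?P<c_comment>   /\* .*? \*/ )               |
    (?P<cxx_comment> // [^\n]* )
""", re.DOTALL | re.VERBOSE)

def tokens(source):
    """Yield (kind, text) for each recognized token, in source order."""
    for match in token_rx.finditer(source):
        yield match.lastgroup, match.group(0)

if __name__ == "__main__":
    sample = 'x = "a//b"; // c\n/* d */ \'/\''
    for kind, text in tokens(sample):
        print(kind, repr(text))
```

A replacement function can then branch on `match.lastgroup` instead of inspecting the first character of the matched text.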

Jeff
PS this is the test file I used:
/* ... */ xyzzy;
456 // 123
const char *mystr = "This is /*trouble*/";
/* * */
/* /* */
// /* /* */
/* // /* */
/*
* */

 
lorinh

Neat! I didn't realize that re.sub could take a function as an
argument. Thanks.

Lorin
 
