Removing comments... tokenize error


qwweeeit

While analysing a very big application (pysol), made of almost
100 source files, I needed to remove the comments.

Removing comments that take up a whole line is straightforward...

For the embedded (end-of-line) comments I used the tokenize module instead.

To my surprise the analysed output is different from the input
(the last element of each token tuple should exactly replicate the input line).
The error shows up at a triple-quoted string.
I don't know if this has already been corrected (I use Python 2.3)
or perhaps it is a mistake on my part...

Here is the script I use to reproduce the strange behaviour:

import tokenize

Input = "pippo1"
Output = "pippo2"

f = open(Input)
fOut=open(Output,"w")

nLastLine=0
for i in tokenize.generate_tokens(f.readline):
    if nLastLine != (i[2])[0]:      # the 3rd element of the tuple is
        nLastLine = (i[2])[0]       # (startingRow, startingCol)
        fOut.write(i[4])

f.close()
fOut.close()

The input file (pippo1) contains this extract:

class SelectDialogTreeData:
    img = None
    def __init__(self):
        self.tree_xview = (0.0, 1.0)
        self.tree_yview = (0.0, 1.0)
        if self.img is None:
            SelectDialogTreeData.img = (makeImage(dither=0, data="""
R0lGODlhEAAOAPIFAAAAAICAgMDAwP//AP///4AAAAAAAAAAACH5BAEAAAUALAAAAAAQAA4AAAOL
WLrcGxA6FoYYYoRZwhCDMAhDFCkBoa6sGgBFQAzCIAzCIAzCEACFAEEwEAwEA8FAMBAEAIUAYSAY
CAaCgWAgGAQAhQBBMBAMBAPBQDAQBACFAGEgGAgGgoFgIBgEAAUBBAIDAgMCAwIDAgMCAQAFAQQD
AgMCAwIDAgMCAwEABSaiogAKAKeoqakFCQA7"""), makeImage(dither=0, data="""
R0lGODlhEAAOAPIFAAAAAICAgMDAwP//AP///4AAAAAAAAAAACH5BAEAAAUALAAAAAAQAA4AAAN3
WLrcHBA6Foi1YZZAxBCDQESREhCDMAiDcFkBUASEMAiDMAiDMAgBAGlIGgQAgZeSEAAIAoAAQTAQ
DAQDwUAwAEAAhQBBMBAMBAPBQBAABACFAGEgGAgGgoFgIAAEAAoBBAMCAwIDAgMCAwEAAApERI4L
jpWWlgkAOw=="""), makeImage(dither=0, data="""
R0lGODdhEAAOAPIAAAAAAAAAgICAgMDAwP///wAAAAAAAAAAACwAAAAAEAAOAAADTii63DowyiiA
GCHrnQUQAxcQAAEQgAAIg+MCwkDMdD0LgDDUQG8LAMGg1gPYBADBgFbs1QQAwYDWBNQEAMHABrAR
BADBwOsVAFzoqlqdAAA7"""), makeImage(dither=0, data="""
R0lGODdhEAAOAPIAAAAAAAAAgICAgMDAwP8AAP///wAAAAAAACwAAAAAEAAOAAADVCi63DowyiiA
GCHrnQUQAxcUQAEUgAAIg+MCwlDMdD0LgDDQBE3UAoBgUCMUCDYBQDCwEWwFAUAwqBEKBJsAIBjQ
CDRCTQAQDKBQAcDFBrjf8Lg7AQA7"""))

The output produced via tokenize (pippo2) is instead:

class SelectDialogTreeData:
    img = None
    def __init__(self):
        self.tree_xview = (0.0, 1.0)
        self.tree_yview = (0.0, 1.0)
        if self.img is None:
            SelectDialogTreeData.img = (makeImage(dither=0, data="""
AgMCAwIDAgMCAwEABSaiogAKAKeoqakFCQA7"""), makeImage(dither=0, data="""
jpWWlgkAOw=="""), makeImage(dither=0, data="""
BADBwOsVAFzoqlqdAAA7"""), makeImage(dither=0, data="""
CDRCTQAQDKBQAcDFBrjf8Lg7AQA7"""))

.... with a big difference! Why?
 

Fredrik Lundh

qwweeeit said:
I don't know if this has already been corrected (I use Python 2.3)
or perhaps is a mistake on my part...

it's a mistake on your part.  adding a print statement to the for
loop might help you figure it out:

nLastLine = 0
for i in tokenize.generate_tokens(f.readline):
    print i
    if nLastLine != (i[2])[0]:      # the 3rd element of the tuple is
        nLastLine = (i[2])[0]       # (startingRow, startingCol)
        fOut.write(i[4])

(hints: what happens if a token spans multiple lines? and how does
the tokenize module deal with comments?)

</F>
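A quick way to see what both hints mean is a minimal sketch along these lines
(assuming Python 2.x and the same pippo1 file as above); it prints one line per
token.  A triple-quoted string comes back as a single STRING token whose start
and end rows are several lines apart, so a loop that only writes i[4] when the
start row changes silently skips every line in between, and a comment arrives
as its own COMMENT token:

import tokenize

f = open("pippo1")
for tok in tokenize.generate_tokens(f.readline):
    toktype, toktext, (srow, scol), (erow, ecol), line = tok
    # e.g. the STRING tokens holding the icon data each span several rows
    print "%-8s rows %d-%d %r" % (tokenize.tok_name[toktype],
                                  srow, erow, toktext[:30])
f.close()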
 

qwweeeit

Thanks! If you answer my posts one more time I may consider you
my tutor...

It did seem strange to have found a bug...! In any case I will not go deeper
into the matter, because your explanation is enough for me.
I corrected the problem by hand, removing the tokens spanning multiple lines
(there were only 8 cases...).

I haven't understood your hint about comments, though...
I succeeded in writing a Python script which removes comments.

Here it is (in all its cumbersome and cryptic appearance!...):

# removeCommentsTok.py
import tokenize
Input = "pippo1"
Output = "pippo2"
f = open(Input)
fOut=open(Output,"w")

nLastLine=0
for i in tokenize.generate_tokens(f.readline):
    if i[0]==52 and nLastLine != (i[2])[0]:
        fOut.write((i[4].replace(i[1],'')).rstrip()+'\n')
        nLastLine=(i[2])[0]
    elif i[0]==4 and nLastLine != (i[2])[0]:
        fOut.write((i[4]))
        nLastLine=(i[2])[0]
f.close()
fOut.close()

Some explanations for the guys like me...:
- 52 and 4 are the numeric token codes for COMMENT and NEWLINE respectively
  (in Python 2.3)
- the comment removal is done by clearing the comment text (i[1]) out of the
  input line (i[4])
- I also right-trimmed the line to get rid of the remaining blanks.
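These numeric codes can differ between Python versions, so here is a slightly
safer variant of the same script (just a sketch with the same behaviour,
using the named constants tokenize.COMMENT and token.NEWLINE instead of 52 and 4):

import token, tokenize

f = open("pippo1")
fOut = open("pippo2", "w")

nLastLine = 0
for toktype, toktext, (srow, scol), (erow, ecol), line in \
        tokenize.generate_tokens(f.readline):
    if toktype == tokenize.COMMENT and nLastLine != srow:
        # drop the comment text from the source line
        fOut.write(line.replace(toktext, '').rstrip() + '\n')
        nLastLine = srow
    elif toktype == token.NEWLINE and nLastLine != srow:
        fOut.write(line)
        nLastLine = srow

f.close()
fOut.close()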
 

M.E.Farmer

qwweeeit said:
I haven't understood your hint about comments, though...
I succeeded in writing a Python script which removes comments.
[...]
The tokenizer sends a multiline string or a comment as a single token.

######################################################################
# python comment and whitespace stripper :)
######################################################################

import keyword, os, sys, traceback
import StringIO
import token, tokenize
__credits__ = 'just another tool that I needed'
__version__ = '.7'
__author__ = 'M.E.Farmer'
__date__ = 'Jan 15 2005, Oct 24 2004'

######################################################################

class Stripper:
    """python comment and whitespace stripper :)
    """
    def __init__(self, raw):
        self.raw = raw

    def format(self, out=sys.stdout, comments=0, spaces=1,
               untabify=1, eol='unix'):
        ''' strip comments, strip extra whitespace,
            convert EOL's from Python code.
        '''
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        # Strips the first blank line if 1
        self.lasttoken = 1
        self.temp = StringIO.StringIO()
        self.spaces = spaces
        self.comments = comments

        if untabify:
            self.raw = self.raw.expandtabs()
        self.raw = self.raw.rstrip()+' '
        self.out = out

        self.raw = self.raw.replace('\r\n', '\n')
        self.raw = self.raw.replace('\r', '\n')
        self.lineend = '\n'

        # Gather lines
        while 1:
            pos = self.raw.find(self.lineend, pos) + 1
            if not pos: break
            self.lines.append(pos)

        self.lines.append(len(self.raw))
        # Wrap text in a filelike object
        self.pos = 0

        text = StringIO.StringIO(self.raw)

        # Parse the source.
        ## Tokenize calls the __call__
        ## function for each token till done.
        try:
            tokenize.tokenize(text.readline, self)
        except tokenize.TokenError, ex:
            traceback.print_exc()

        # Ok now we write it to a file
        # but we also need to clean the whitespace
        # between the lines and at the ends.
        self.temp.seek(0)

        # Mac CR
        if eol == 'mac':
            self.lineend = '\r'
        # Windows CR LF
        elif eol == 'win':
            self.lineend = '\r\n'
        # Unix LF
        else:
            self.lineend = '\n'

        for line in self.temp.readlines():
            if spaces == -1:
                self.out.write(line.rstrip()+self.lineend)
            else:
                if not line.isspace():
                    self.lasttoken=0
                    self.out.write(line.rstrip()+self.lineend)
                else:
                    self.lasttoken+=1
                    if self.lasttoken<=self.spaces and self.spaces:
                        self.out.write(self.lineend)


    def __call__(self, toktype, toktext,
                 (srow,scol), (erow,ecol), line):
        ''' Token handler.
        '''
        # calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)

        #kill the comments
        if not self.comments:
            # Kill the comments ?
            if toktype == tokenize.COMMENT:
                return

        # handle newlines
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.temp.write(self.lineend)
            return

        # send the original whitespace, if needed
        if newpos > oldpos:
            self.temp.write(self.raw[oldpos:newpos])

        # skip indenting tokens
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return

        # send text to the temp file
        self.temp.write(toktext)
        return
######################################################################

def Main():
    import sys
    if sys.argv[1]:
        filein = open(sys.argv[1]).read()
        Stripper(filein).format(out=sys.stdout, comments=1, untabify=1,
                                eol='win')

######################################################################

if __name__ == '__main__':
    Main()

M.E.Farmer
 

qwweeeit

My code, besides being cumbersome and cryptic, has another quality:
it is buggy!
I apologize for that; obviously I discovered it after posting (in the
best tradition of Murphy's law!).
When I find the solution I will let you know, even if the problem is
made harder by the fact that the for loop is indexed by
a 5-element tuple, which is not very easy (at least for me!...).
 

qwweeeit

Hi,
I no longer need to correct my code's bugs and send a working
application to the c.l.p group (I don't think there was an eager
expectation...).
Your code works perfectly (as you would expect from a guru...).
Thank you and bye.
 

qwweeeit

Hi,

At last I succeeded in implementing a cross-reference tool!
(with your help and that of other gurus...).
Now I can face the problem (for me...) of understanding your
code (I have not yet grasped classes and objects...).

I give you a brief example of the xref output (taken from your code,
even if the line numbers don't match, because I modified your code,
not being interested in EOLs other than Linux).

and          076 if self.lasttoken<=self.spaces and self.spaces:
append       046 self.lines.append(pos)
append       048 self.lines.append(len(self.raw))
argv         116 if sys.argv[1]:
argv         117 filein = open(sys.argv[1]).read()
__author__   010 __author__ = s_
break        045 if not pos: break
__call__     080 def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
class        015 class Stripper:
COMMENT      092 if toktype == tokenize.COMMENT:
comments     021 def format(self, out=sys.stdout, comments=0, spaces=1, untabify=1):
comments     033 self.comments = comments
comments     090 if not self.comments:
comments     118 Stripper(filein).format(out=sys.stdout, comments=0, untabify=1)
__credits__  008 __credits__ = s_
__date__     011 __date__ = s_
DEDENT       105 if toktype in [token.INDENT, token.DEDENT]:
def          018 def __init__(self, raw):
def          021 def format(self, out=sys.stdout, comments=0, spaces=1, untabify=1):
def          080 def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
def          114 def Main():
ecol         080 def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
erow         080 def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
ex           059 except tokenize.TokenError, ex:
except       059 except tokenize.TokenError, ex:
expandtabs   036 self.raw = self.raw.expandtabs()
filein       117 filein = open(sys.argv[1]).read()
filein       118 Stripper(filein).format(out=sys.stdout, comments=0, untabify=1)
find         044 pos = self.raw.find(self.lineend, pos) + 1
format       021 def format(self, out=sys.stdout, comments=0, spaces=1, untabify=1):
format       118 Stripper(filein).format(out=sys.stdout, comments=0, untabify=1)
import       005 import keyword, os, sys, traceback
import       006 import StringIO
import       007 import token, tokenize
import       115 import sys
INDENT       105 if toktype in [token.INDENT, token.DEDENT]:
__init__     018 def __init__(self, raw):
isspace      071 if not line.isspace():
keyword      005 import keyword, os, sys, traceback
lasttoken    030 self.lasttoken = 1
lasttoken    072 self.lasttoken=0
lasttoken    075 self.lasttoken+=1
lasttoken    076 if self.lasttoken<=self.spaces and self.spaces:
....

To obtain this output, you must remove comments and empty lines, and move
the strings into a db file, leaving s_ as a placeholder for normal strings
and m_ for triple-quoted strings.
See an example:

m_ """python comment and whitespace stripper :)""" #016
m_ ''' strip comments, strip extra whitespace, convert EOL's from
Python
code.'''#023
m_ ''' Token handler.''' #082

s_ 'just another tool that I needed' |008 __credits__ = 'just another
tool
that I needed'
s_ '.7' |009 __version__ = '.7'
s_ 'M.E.Farmer' |010 __author__ = 'M.E.Farmer'
s_ 'Jan 15 2005, Oct 24 2004' |011 __date__ = 'Jan 15 2005, Oct 24
2004'
s_ ' ' |037 self.raw = self.raw.rstrip()+'
'
s_ '\n' |040 self.lineend = '\n'
s_ '__main__' |122 if __name__ == '__main__':

I think that this tool is very useful.

Bye
 

M.E.Farmer

Glad you are making progress ;)

qwweeeit said:
I give you a brief example of the xref output (taken from your code,
even if the line numbers don't match, because I modified your code,
not being interested in EOLs other than Linux).

What happens when you try to analyze a script from a different OS? It
usually looks like a skewed mess; that is why I have added EOL
conversion, so it is painless for you to convert to your EOL of choice.
The code I posted consists of a class and a Main function.
The class has three methods.
__init__ is called by Python when you create an instance of the class
Stripper. All __init__ does here is set the instance attribute self.raw.
format is called explicitly with a few arguments to start the
tokenizer.
__call__ is special; it is not easy to grasp how this even works... at
first.
In Python, when you treat an instance like a function, Python invokes
the __call__ method of that instance, if it is present and callable().
Example:

try:
    tokenize.tokenize(text.readline, self)
except tokenize.TokenError, ex:
    traceback.print_exc()

The snippet above is from the Stripper class.
Notice that tokenize.tokenize is being fed a reference to self (if
this code is running, self is an instance of Stripper).
tokenize.tokenize is really a hidden loop.
Each token generated is sent to self as five parts: toktype, toktext,
(startrow, startcol), (endrow, endcol), and line.  self is callable and
has a __call__ method, so tokenize really sends the five-part info
to __call__ for every token.
If this was obvious then ignore it ;)

M.E.Farmer
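The callable-instance trick can also be seen in isolation with a minimal sketch
(Python 2.x assumed; the TokenPrinter class is hypothetical, just for
illustration): an instance that defines __call__ is handed to tokenize.tokenize
as the token handler and gets called once per token with the five parts
described above.

import StringIO
import tokenize

class TokenPrinter:
    def __call__(self, toktype, toktext, start, end, line):
        # this runs once for every token tokenize finds
        print tokenize.tok_name[toktype], start, end, repr(toktext)

source = StringIO.StringIO("x = 1  # a comment\n")
tokenize.tokenize(source.readline, TokenPrinter())   # instance used like a function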
 

MrJean1

Great tool, indeed! But doc strings stay in the source text.

If you do need to remove doc strings as well, add the following to
the __call__ method.

        # kill doc strings
        if not self.docstrings:
            if toktype == tokenize.STRING and len(toktext) >= 6:
                t = toktext.lstrip('rRuU')
                if ((t.startswith("'''") and t.endswith("'''")) or
                    (t.startswith('"""') and t.endswith('"""'))):
                    return

as shown in the original post below.  Also, set self.docstrings in the
format method, similar to self.comments, as shown below in the lines
starting with '....'.


/Jean Brouwers



M.E.Farmer said:
qwweeeit said:
[...]
Tokenizer sends multiline strings and comments as a single token.

######################################################################
# python comment and whitespace stripper :)
######################################################################
[...]

class Stripper:
    """python comment and whitespace stripper :)
    """
    def __init__(self, raw):
        self.raw = raw

....    def format(self, out=sys.stdout, comments=0, docstrings=0,
....               spaces=1, untabify=1, eol='unix'):
        ''' strip comments, strip extra whitespace,
            convert EOL's from Python code.
        '''
        [...]
        self.spaces = spaces
        self.comments = comments
....    self.docstrings = docstrings
        [...]

    def __call__(self, toktype, toktext,
                 (srow,scol), (erow,ecol), line):
        ''' Token handler.
        '''
        [...]
        #kill the comments
        if not self.comments:
            # Kill the comments ?
            if toktype == tokenize.COMMENT:
                return
....    # kill doc strings
....    if not self.docstrings:
....        if toktype == tokenize.STRING and len(toktext) >= 6:
....            t = toktext.lstrip('rRuU')
....            if ((t.startswith("'''") and t.endswith("'''")) or
....                (t.startswith('"""') and t.endswith('"""'))):
....                return
        # handle newlines
        [...]

M.E.Farmer
 

qwweeeit

Hi,

Importing a text file from another OS is not a problem: I convert
it immediately using the powerful shell functions of Linux (and Unix).

I thank you for the explanation about classes, but I am rather dumb,
and so far I have resolved all my problems without them...
Speaking of problems..., I still have an error when parsing literal
strings, when there is more than one literal string per source line.

Perhaps it's time to use classes...
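For the "more than one literal string per source line" problem, a minimal
sketch (Python 2.x, illustrative only) is to let tokenize hand you every
STRING token with its start and end columns, instead of searching the line
yourself:

import StringIO
import tokenize

src = 'a = "one" + "two"   # two literals on one line\n'
strings = {}
for toktype, toktext, (srow, scol), (erow, ecol), line in \
        tokenize.generate_tokens(StringIO.StringIO(src).readline):
    if toktype == tokenize.STRING:
        # remember where each string literal starts and ends on its row
        strings.setdefault(srow, []).append((scol, ecol, toktext))

print strings   # {1: [(4, 9, '"one"'), (12, 17, '"two"')]}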
 

M.E.Farmer

Thanks Jean,
I have thought about adding docstring stripping several times, but I was stumped
at how to tell a docstring from a regular triple-quoted string ;)
I have been thinking hard about the problem and I think I have an idea:
if the line has nothing before the start of the string, it must be a
docstring.
Sounds simple enough, but in Python there are 12 or so 'types' of
strings.
Here is my crack at it, feel free to improve it ;)
I reversed the logic on the comments and docstrings so I could add a
special mode to docstring stripping... pep8 mode.
Pep8 mode only strips triple-double-quoted docstrings from your source code,
leaving the offending triple-single-quoted ones behind. Probably just stupid,
but someone might find it useful.
######################################################################
# Python source stripper
######################################################################

import os
import sys
import token
import keyword
import StringIO
import tokenize
import traceback
__credits__ = '''
Jürgen Hermann
M.E.Farmer
Jean Brouwers
'''
__version__ = '.8'
__author__ = 'M.E.Farmer'
__date__ = 'Apr 16, 2005,' \
'Jan 15 2005,' \
'Oct 24 2004' \


######################################################################

class Stripper:
    """Python source stripper
    """
    def __init__(self, raw):
        self.raw = raw

    def format(self, out=sys.stdout, comments=0, docstrings=0,
               spaces=1, untabify=1, eol='unix'):
        """ strip comments,
            strip docstrings,
            strip extra whitespace and lines,
            convert tabs to spaces,
            convert EOL's in Python code.
        """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        # Strips the first blank line if 1
        self.lasttoken = 1
        self.temp = StringIO.StringIO()
        self.spaces = spaces
        self.comments = comments
        self.docstrings = docstrings

        if untabify:
            self.raw = self.raw.expandtabs()
        self.raw = self.raw.rstrip()+' '
        self.out = out

        # Have you ever had a multiple line ending script?
        # They can be nasty so lets get them all the same.
        self.raw = self.raw.replace('\r\n', '\n')
        self.raw = self.raw.replace('\r', '\n')
        self.lineend = '\n'

        # Gather lines
        while 1:
            pos = self.raw.find(self.lineend, pos) + 1
            if not pos: break
            self.lines.append(pos)

        self.lines.append(len(self.raw))
        self.pos = 0

        # Wrap text in a filelike object
        text = StringIO.StringIO(self.raw)

        # Parse the source.
        ## Tokenize calls the __call__
        ## method for each token till done.
        try:
            tokenize.tokenize(text.readline, self)
        except tokenize.TokenError, ex:
            traceback.print_exc()

        # Ok now we write it to a file
        # but we also need to clean the whitespace
        # between the lines and at the ends.
        self.temp.seek(0)

        # All this should be written into the
        # __call__ method just haven't yet...

        # Mac CR
        if eol == 'mac':
            self.lineend = '\r'
        # Windows CR LF
        elif eol == 'win':
            self.lineend = '\r\n'
        # Unix LF
        else:
            self.lineend = '\n'

        for line in self.temp.readlines():
            if spaces == -1:
                self.out.write(line.rstrip()+self.lineend)
            else:
                if not line.isspace():
                    self.lasttoken=0
                    self.out.write(line.rstrip()+self.lineend)
                else:
                    self.lasttoken+=1
                    if self.lasttoken<=self.spaces and self.spaces:
                        self.out.write(self.lineend)

    def __call__(self, toktype, toktext,
                 (srow,scol), (erow,ecol), line):
        """ Token handler.
        """
        # calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)

        # kill comments
        if self.comments:
            if toktype == tokenize.COMMENT:
                return

        # kill doc strings
        if self.docstrings:
            # Assume if there is nothing on the
            # left side it must be a docstring
            if toktype == tokenize.STRING and \
               line.lstrip(' rRuU')[0] in ["'",'"']:
                t = toktext.lstrip('rRuU')
                if (t.startswith('"""') and
                    (self.docstrings == 'pep8' or
                     self.docstrings == '8')):
                    return
                elif t.startswith('"""') or t.startswith("'''"):
                    return

        # handle newlines
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.temp.write(self.lineend)
            return

        # send the original whitespace
        if newpos > oldpos:
            self.temp.write(self.raw[oldpos:newpos])

        # skip indenting tokens
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return

        # send text to the temp file
        self.temp.write(toktext)
        return
######################################################################

def Main():
    import sys
    if sys.argv[1]:
        filein = open(sys.argv[1]).read()
        Stripper(filein).format(out=sys.stdout,
                                comments=0, docstrings=1, untabify=1, eol='win')
######################################################################

if __name__ == '__main__':
    Main()
 

MrJean1

There is an issue with both my and your code: it only works if doc
strings are triple quoted and if there are no other triple quoted
strings in the Python code.

A triple quoted string used in an assignment will be removed, for
example in this case:

s = '''this string should not be removed'''


It is still unclear how to distinguish doc strings from other strings.
Also, I have not checked the precise Python syntax, but doc strings do
not need to be enclosed in triple quotes.  Single-quoted strings may be
allowed too.

Maybe this rule will work: a doc string is any string preceded by a
COLON token followed by zero, one or more INDENT or NEWLINE tokens.
Untested!
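A rough, untested sketch of that rule (Python 2.x; just to make it concrete,
not a finished implementation): treat a STRING as a docstring candidate only
while the last significant token was a ':' (or nothing at all yet, for the
module docstring), and let NEWLINE, NL, INDENT, DEDENT and COMMENT tokens
pass through without changing that expectation.

import StringIO
import token
import tokenize

def docstring_positions(source):
    expect_doc = True                   # a module docstring may come first
    positions = []
    for toktype, toktext, start, end, line in \
            tokenize.generate_tokens(StringIO.StringIO(source).readline):
        if toktype == tokenize.STRING:
            if expect_doc:
                positions.append(start)
            expect_doc = False
        elif toktype == tokenize.OP and toktext == ':':
            expect_doc = True           # e.g. right after "def f():"
        elif toktype in (token.NEWLINE, tokenize.NL, token.INDENT,
                         token.DEDENT, tokenize.COMMENT):
            pass                        # these do not end the expectation
        else:
            expect_doc = False          # any other token does
    return positions

print docstring_positions("def f():\n    'doc'\n    x = 'not doc'\n")
# prints [(2, 4)] -- only the 'doc' string is flagged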

/Jean Brouwers



M.E.Farmer said:
Here is my crack at it, feel free to improve it ;)
[...]
 

M.E.Farmer

MrJean1 said:
There is an issue with both my and your code: it only works if doc
strings are triple quoted and if there are no other triple quoted
strings in the Python code.
I had not considered single quoted strings ;)
A triple quoted string used in an assignment will be removed, for
example this case

s = '''this string should not be removed'''


It is still unclear how to distinguish doc strings from other strings.
Also, I have not checked the precise Python syntax, but doc strings do
not need to be enclosed by triple quotes. A single quote may be
allowed too.

Maybe this rule will work: a doc string is any string preceded by a
COLON token followed by zero, one or more INDENT or NEWLINE tokens.
Untested!
Not needed; if you reread my post, I explain that I had solved that
issue.
If you use the line argument that the tokenizer supplies, we can strip
whitespace and 'rRuU' from the start of the line and look for a single
quote or a double quote.
I have tested it and it works.
I reworked the 'pep8' thing and fixed the bug you mentioned; here are the
changes.
######################################################################
# Python source stripper
######################################################################
import os
import sys
import token
import keyword
import StringIO
import tokenize
import traceback
__credits__ = '''
Jürgen Hermann
M.E.Farmer
Jean Brouwers
'''
__version__ = '.8'
__author__ = 'M.E.Farmer'
__date__ = 'Apr 16, 2005,' \
'Jan 15 2005,' \
'Oct 24 2004' \
######################################################################
class Stripper:
    """Python source stripper
    """
    def __init__(self, raw):
        self.raw = raw

    def format(self, out=sys.stdout, comments=0, docstrings=0,
               spaces=1, untabify=1, eol='unix'):
        """ strip comments,
            strip docstrings,
            strip extra whitespace and lines,
            convert tabs to spaces,
            convert EOL's in Python code.
        """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        # Strips the first blank line if 1
        self.lasttoken = 1
        self.temp = StringIO.StringIO()
        self.spaces = spaces
        self.comments = comments
        self.docstrings = docstrings

        if untabify:
            self.raw = self.raw.expandtabs()
        self.raw = self.raw.rstrip()+' '
        self.out = out

        # Have you ever had a multiple line ending script?
        # They can be nasty so lets get them all the same.
        self.raw = self.raw.replace('\r\n', '\n')
        self.raw = self.raw.replace('\r', '\n')
        self.lineend = '\n'

        # Gather lines
        while 1:
            pos = self.raw.find(self.lineend, pos) + 1
            if not pos: break
            self.lines.append(pos)

        self.lines.append(len(self.raw))
        self.pos = 0

        # Wrap text in a filelike object
        text = StringIO.StringIO(self.raw)

        # Parse the source.
        ## Tokenize calls the __call__
        ## method for each token till done.
        try:
            tokenize.tokenize(text.readline, self)
        except tokenize.TokenError, ex:
            traceback.print_exc()

        # Ok now we write it to a file
        # but we also need to clean the whitespace
        # between the lines and at the ends.
        self.temp.seek(0)

        # All this should be written into the
        # __call__ method just haven't yet...

        # Mac CR
        if eol == 'mac':
            self.lineend = '\r'
        # Windows CR LF
        elif eol == 'win':
            self.lineend = '\r\n'
        # Unix LF
        else:
            self.lineend = '\n'

        for line in self.temp.readlines():
            if spaces == -1:
                self.out.write(line.rstrip()+self.lineend)
            else:
                if not line.isspace():
                    self.lasttoken=0
                    self.out.write(line.rstrip()+self.lineend)
                else:
                    self.lasttoken+=1
                    if self.lasttoken<=self.spaces and self.spaces:
                        self.out.write(self.lineend)

    def __call__(self, toktype, toktext,
                 (srow,scol), (erow,ecol), line):
        """ Token handler.
        """
        # calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)

        # kill comments
        if self.comments:
            if toktype == tokenize.COMMENT:
                return
        # kill doc strings
        if self.docstrings:
            # Assume if there is nothing on the
            # left side it must be a docstring
            if toktype == tokenize.STRING and \
               line.lstrip(' rRuU')[0] in ["'",'"']:
                t = toktext.lstrip('rRuU')
                # pep8 frowns on triple single quotes
                if ( self.docstrings == 'pep8' or
                     self.docstrings == 8):
                    # pep8 frowns on single triples
                    if not t.startswith('"""'):
                        return
                else:
                    return

        # handle newlines
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.temp.write(self.lineend)
            return

        # send the original whitespace
        if newpos > oldpos:
            self.temp.write(self.raw[oldpos:newpos])

        # skip indenting tokens
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return

        # send text to the temp file
        self.temp.write(toktext)
        return
######################################################################

def Main():
    import sys
    if sys.argv[1]:
        filein = open(sys.argv[1]).read()
        Stripper(filein).format(out=sys.stdout,
                                comments=0, docstrings='pep8', untabify=1, eol='win')
######################################################################

if __name__ == '__main__':
    Main()


That should work like a charm for all types of docstrings without
disturbing other strings.

M.E.Farmer
 

M.E.Farmer

I found the bug and hope I have squashed it.
Single- and double-quoted strings that were assignments and spanned
multiple lines using \ were chopped after the first line.
Example:

__date__ = 'Apr 16, 2005,' \
           'Jan 15 2005,' \
           'Oct 24 2004'

became:

__date__ = 'Apr 16, 2005,' \

Not good :(

The tokenizer sends this as:
name
operator
string
string
string
newline

I added a test for string assignments that end in \.
A flag is set, and then all strings up to a NEWLINE are ignored
(i.e. left alone rather than treated as docstrings).
I also rearranged the script a little.
Maybe that will do it ...
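The updated script itself is not quoted here, but the idea can be sketched
roughly like this (illustrative only, Python 2.x): when a string token's
physical line ends with a backslash, set a flag so the following STRING
tokens (the continued pieces) are not treated as docstrings, and clear the
flag again at the NEWLINE token.

import StringIO
import token
import tokenize

def classify_strings(source):
    continued = False
    for toktype, toktext, start, end, line in \
            tokenize.generate_tokens(StringIO.StringIO(source).readline):
        if toktype == token.NEWLINE:
            continued = False                  # the logical line is over
        elif toktype == tokenize.STRING:
            if continued or line.lstrip()[0] not in '\'"rRuU':
                print 'keep  ', toktext        # assignment or continued piece
            else:
                print 'strip ', toktext        # looks like a docstring
            if line.rstrip().endswith('\\'):
                continued = True               # more string pieces follow

src = "__date__ = 'Apr 16, 2005,' \\\n    'Jan 15 2005,' \\\n    'Oct 24 2004'\n"
classify_strings(src)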
Updates available at
 

MrJean1

Attached is another version of the stripper.py file. It contains my
changes, which seem to handle docstrings correctly (at least when run on itself).


/Jean Brouwers


######################################################################
# Python source stripper / cleaner ;)
######################################################################

import os
import sys
import token
import keyword
import StringIO
import tokenize
import traceback
__credits__ = \
'''
Jürgen Hermann
M.E.Farmer
Jean Brouwers
'''
__version__ = '.8'
__author__ = 'M.E.Farmer'
__date__ = 'Apr 16, 2005,' \
'Jan 15 2005,' \
'Oct 24 2004' \

'''this docstring should be removed
'''

######################################################################

class Stripper:
    """Python source stripper / cleaner
    """
    def __init__(self, raw):
        self.raw = raw

    def format(self, out=sys.stdout, comments=0, docstrings=0,
               spaces=1, untabify=1, eol='unix'):
        """ strip comments,
            strip docstrings,
            strip extra whitespace and lines,
            convert tabs to spaces,
            convert EOL's in Python code.
        """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        self.temp = StringIO.StringIO()
        # Strips the first blank line if 1
        self.lasttoken = 1
        self.spaces = spaces
        # 0 = no change, 1 = strip 'em
        self.comments = comments  # yep even these
        # 0 = no change, 1 = strip 'em, 8 or 'pep8' = strip all but """'s
        self.docstrings = docstrings

        if untabify:
            self.raw = self.raw.expandtabs()
        self.raw = self.raw.rstrip()+' '
        self.out = out

        # Have you ever had a multiple line ending script?
        # They can be nasty so lets get them all the same.
        self.raw = self.raw.replace('\r\n', '\n')
        self.raw = self.raw.replace('\r', '\n')
        self.lineend = '\n'

        # Gather lines
        while 1:
            pos = self.raw.find(self.lineend, pos) + 1
            if not pos: break
            self.lines.append(pos)

        self.lines.append(len(self.raw))
        self.pos = 0
        self.lastOP = ''

        # Wrap text in a filelike object
        text = StringIO.StringIO(self.raw)

        # Parse the source.
        ## Tokenize calls the __call__
        ## method for each token till done.
        try:
            tokenize.tokenize(text.readline, self)
        except tokenize.TokenError, ex:
            traceback.print_exc()

        # Ok now we write it to a file
        # but we also need to clean the whitespace
        # between the lines and at the ends.
        self.temp.seek(0)

        # All this should be written into the
        # __call__ method just haven't yet...

        # Mac CR
        if eol == 'mac':
            self.lineend = '\r'
        # Windows CR LF
        elif eol == 'win':
            self.lineend = '\r\n'
        # Unix LF
        else:
            self.lineend = '\n'

        for line in self.temp.readlines():
            if spaces == -1:
                self.out.write(line.rstrip()+self.lineend)
            else:
                if not line.isspace():
                    self.lasttoken=0
                    self.out.write(line.rstrip()+self.lineend)
                else:
                    self.lasttoken+=1
                    if self.lasttoken<=self.spaces and self.spaces:
                        self.out.write(self.lineend)

    def __call__(self, toktype, toktext, (srow,scol), (erow,ecol),
                 line):
        """ Token handler.
        """
        # calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)

        ##print "*token: %s text: %r line: %r" % \
        ##      (token.tok_name[toktype], toktext, line)

        # kill comments
        if self.comments:
            if toktype == tokenize.COMMENT:
                return

        # kill doc strings
        if self.docstrings:
            # a STRING must be a docstring
            # if the most recent OP was ':'
            if toktype == tokenize.STRING and self.lastOP == ':':
                # pep8 frowns on triple single quotes
                if (self.docstrings == 'pep8' or
                    self.docstrings == 8):
                    if not toktext.endswith('"""'):
                        return
                else:
                    return
            elif toktype == token.OP:
                # remember most recent OP
                self.lastOP = toktext
            elif self.lastOP == ':':
                # newline and indent are OK inside docstring
                if toktype not in [token.NEWLINE, token.INDENT]:
                    # otherwise the docstring ends
                    self.lastOP = ''
            elif toktype == token.NEWLINE:
                # consider any string starting
                # on a new line as a docstring
                self.lastOP = ':'

        # handle newlines
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.temp.write(self.lineend)
            return

        # send the original whitespace
        if newpos > oldpos:
            self.temp.write(self.raw[oldpos:newpos])

        # skip indenting tokens
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return

        # send text to the temp file
        self.temp.write(toktext)
        return
######################################################################

def Main():
    import sys
    if sys.argv[1]:
        filein = open(sys.argv[1]).read()
        Stripper(filein).format(out=sys.stdout,
                                comments=1, docstrings=1, untabify=1, eol='win')
######################################################################

if __name__ == '__main__':
    Main()

 

M.E.Farmer

Hello Jean,
Glad to see you're still playing along.
I have tested your script and it is broken too :(
Good idea about checking for the ':', it just doesn't cover every
case.
This is the very reason I had not included docstring support before!
The problem is more difficult than it first appears,
as I am sure you have noticed ;)
Python is fairly flexible in its layout and very dynamic in its
execution.
This can lead to some hard-to-spot and hard-to-remove docstrings.

After staring at the problem for a day or so (for the second time),
*I am still stumped*.

#####################################################################
# this is a test I have put together for docstrings
#####################################################################
"""This is a module doc it should be removed""" \
"This is really nasty but legal" \
'''Dang this is even worse''' + \
'this should be removed'#This is legal too
#####################################################################
assignment = \
"""
this should stay
so should this
"""
more_assignment = 'keep me,' \
'keep me too,' \
'keep me.'

#####################################################################
def func():
    'This should be removed' \
    """This should be removed"""
    pass
######################################################################
def funq(d = {'MyPass':
              """This belongs to a dict and should stay"""
              ,'MyOtherPass':
              'Kepp this string %s'\
              "Keep this too" % 42 + """dfgffdgfdgdfg"""}):
    """This docstring is ignored""" + ''' by Python introspection
why?'''
    pass
######################################################################
def Usage():
    """This should be removed but how, removal will break the function.
    This should be removed %s """  # what do we do here
    return Usage.__doc__ % '42'
######################################################################
class Klass:
    u"This should be removed" \
    ''' this too '''
    def __init__(self, num):
        """ This is should be removed but how ? %d """ % num
        return None
    'People do this sometime for a block comment type of thing' \
    "This type of string should be removed also"
    def func2(self):
        r'erase/this\line\sdfdsf\sdf\dfsdf'
        def inner():
            u'''should be removed'''
            return 42
        return inner
######################################################################
u'People do this sometime for a block comment type of thing' \
r"This type of string should be removed also" \
""" and this one too! """
# did I forget anything obvious ?
#####################################################################

When a docstring is removed, the blank line that is left behind should
also be consumed.
Got to go to work; I'll think about it over the weekend.
If anyone else wants to play, you are welcome to join.
Can pyparsing do this easily? (Paul probably has a 'six-line' solution
tucked away somewhere ;)
M.E.Farmer
 
