How to strip comments out of code

silviocortes · Oct 30, 2007

Howdy...

I need to write a class that will take a Java file as input, strip all
the comments out, and save the result in a different file....

Can I use StreamTokenizer to do that? I can't really understand
how it works... Help, anyone?

I guess I could also write the code myself, but how would I handle code
like this:

1: // this is a one line comment
2: System.out.println ("//");

In the example above, the first line should be removed, but a naive search
would also cut into the second... What's the best way to know when "//" is
not part of a comment? For that matter, the same with "/*".

Any help is welcome. Tks.

Daniel Pitts · Oct 30, 2007

It's actually not that easy of a problem, but there is hope! You can
probably find a Java source parser out there somewhere by using a little
website I call Google. Check it out at Google.com.

Gordon Beaton · Oct 30, 2007

I need to write a class that will take a Java file as input, strip
all the comments out, and save the result in a different file....

Run the code through a C preprocessor.

/gordon

Esmond Pitt · Oct 30, 2007

I need to write a class that will take a Java file as input, strip all
the comments out, and save the result in a different file....

Why?

Lew · Oct 30, 2007

I need to write a class that will take a Java file as input, strip all
the comments out, and save the result in a different file....

Have your class call javac.

Roedy Green · Oct 30, 2007

Can I use StreamTokenizer to do that? I can't really understand
how it works... Help, anyone?

You would do it with a little finite state machine, or a parser.

See http://mindprod.com/jgloss/finitestate.html
http://mindprod.com/jgloss/parser.html

You can see an example of such a parser as part of
http://mindprod.com/products1.html#JDISPLAY
see com.mindprod.jprep.JavaTokenizer

You can strip out all the code except that which deals with comments,
by collapsing other states (each implemented with an enum constant)
into one.

You could simply search for all // with indexOf and rip out till nl or
all /* and rip out till */

However make sure you handle // embedded in /* ... */
and /* embedded in //.

You have to scan for next /* or // whichever comes first, then process
that.

Then there is the simplest solution of all: Google for "strip Java
comments" and see what comes up.
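
To make the finite state machine idea concrete, here is a minimal sketch of one, not the JavaTokenizer mentioned above. The class name CommentFsm, the method strip, and the state names are all made up for illustration. It assumes the whole source is already in a String, and it deliberately ignores Unicode escapes such as \u0022 standing in for a quote, so it is only a starting point.

```java
// Hypothetical sketch: class, method, and state names are illustrative only.
class CommentFsm {
    private enum State { CODE, SLASH, LINE, BLOCK, BLOCK_STAR, STR, STR_ESC, CHR, CHR_ESC }

    /** Returns src without // and block comments; literals are copied unchanged. */
    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        State s = State.CODE;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            boolean eol = c == '\n' || c == '\r';
            switch (s) {
                case CODE:
                    if (c == '/') { s = State.SLASH; break; } // maybe a comment: hold the slash back
                    if (c == '"') s = State.STR;
                    else if (c == '\'') s = State.CHR;
                    out.append(c);
                    break;
                case SLASH:
                    if (c == '/') s = State.LINE;
                    else if (c == '*') s = State.BLOCK;
                    else { out.append('/'); s = State.CODE; i--; } // plain slash: re-scan c as code
                    break;
                case LINE:
                    if (eol) { out.append(c); s = State.CODE; }
                    break;
                case BLOCK:
                    if (c == '*') s = State.BLOCK_STAR;
                    else if (eol) out.append(c); // keep line breaks so line numbers survive
                    break;
                case BLOCK_STAR:
                    if (c == '/') { out.append(' '); s = State.CODE; } // a space keeps tokens apart
                    else if (c != '*') { if (eol) out.append(c); s = State.BLOCK; }
                    break;
                case STR:
                    out.append(c);
                    if (c == '\\') s = State.STR_ESC;
                    else if (c == '"') s = State.CODE;
                    break;
                case CHR:
                    out.append(c);
                    if (c == '\\') s = State.CHR_ESC;
                    else if (c == '\'') s = State.CODE;
                    break;
                case STR_ESC:
                    out.append(c); s = State.STR; break; // escaped char never ends the literal
                case CHR_ESC:
                    out.append(c); s = State.CHR; break;
            }
        }
        if (s == State.SLASH) out.append('/'); // input ended on a lone slash
        return out.toString();
    }
}
```

Calling CommentFsm.strip on the text of a file leaves "//" inside string and char literals alone, because the STR and CHR states never look for comment openers. Each block comment is replaced by a single space so that something like a/*x*/b does not fuse into one token.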

Tris Orendorff · Oct 30, 2007

All you need is a lexer (lex) to pick up tokens--no parsing (yacc or bison) required. I have a version in C and
lex for MS-DOS which was slapped together in 1990. You can find it at
http://sourceforge.net/projects/cshroud . Comments are naturally disposed of since that is half the job of
shrouding.

--
Tris Orendorff
[ Anyone naming their child should spend a few minutes checking rhyming slang and dodgy sounding
names. Brad and Angelina failed to do this when naming their kid Shiloh Pitt. At some point, someone at
school is going to spoonerise her name.
Craig Stark]

Mark Rafn · Oct 31, 2007

I need to write a class that will take a Java file as input, strip all
the comments out, and save the result in a different file....

This is harder than you think. Use a real parser.

1: // this is a one line comment
2: System.out.println ("//");

Here are some more test cases for you:
public class Comment {
public static void main(String[] args) {
String note = "// 1 "; // this is a comment
System.out.println(note);

/* // comment */ note = "2";
System.out.println(note);

char ch = '"'; // code = "3 if broken"
System.out.println(note);

note=\u0022 // 4";
System.out.println(note);
}
}

The output should be
// 1
2
2
// 4
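
To see why the first of these cases defeats a plain text search, here is a deliberately naive sketch. NaiveStripper and stripLine are hypothetical names, and this is the approach to avoid, not a recommendation.

```java
// Hypothetical sketch of the naive approach, shown only to illustrate how it fails.
class NaiveStripper {
    /** Cuts the line at the first "//", with no idea whether it sits inside a literal. */
    static String stripLine(String line) {
        int i = line.indexOf("//");
        return i < 0 ? line : line.substring(0, i);
    }
}
```

Applied to the line String note = "// 1 "; // this is a comment, the first "//" found is the one inside the string literal, so the result is the unterminated fragment String note = " and the code no longer compiles.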

Piotr Kobzda · Oct 31, 2007

I need to write a class that will take a Java file as input, strip all
the comments out, and save the result in a different file....

Assuming the use of correct Java sources as an input, the code below
should do the trick. (Warning: not tested intensively!)

Note that it tries to preserve as much of the original code as possible.
That is, the line numbers, positions, and escape sequences of the code
in output should be the same as in input (that may help in debugging).

piotr

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;

public class CommentStripper {

public static void main(String[] args) throws Exception {
InputStream in = new BufferedInputStream(
new FileInputStream("CommentStripper.java"));
Reader source = new InputStreamReader(in);
PrintWriter out = new PrintWriter(System.out, true);
stripComments(source, out);
}

public static void stripComments(
Reader source, PrintWriter out) throws IOException {
SourceReader reader = new SourceReader(source);

StringBuilder outbf = new StringBuilder();
boolean inComment = false;
for (Char next; (next = reader.next()) != Char.EOF; ) {

int commentCharsInLine = 0;
for (Char sc; !(sc = next).isEOL(); ) {
next = reader.next();

if (inComment) {
if (sc.codePoint == '*' && next.codePoint == '/') {
// end of comment

// read next
next = reader.next();

if (!next.isEOL()) {
// write out spaces
int ix = outbf.length();
outbf.setLength(ix + commentCharsInLine + 2);
for(final int len = outbf.length(); ix < len; ++ix) {
outbf.setCharAt(ix, ' ');
}
}

commentCharsInLine = 0;
inComment = false;
} else {
commentCharsInLine++;
}

} else if (sc.codePoint == '/' && next.codePoint == '*') {
// start of multiline comment
inComment = true;
commentCharsInLine = 2;

// read next
next = reader.next();

} else if (sc.codePoint == '/' && next.codePoint == '/') {
// single line comment

// skip to the end of line
while(!next.isEOL()) {
next = reader.next();
}

} else if (sc.codePoint == '"' || sc.codePoint == '\'' ) {
// text literal...

sc.appendSource(outbf);

// lookup end of literal (should be in the same line)
boolean literalEndFound = false;
for(; !next.isEOL(); next = reader.next()) {
next.appendSource(outbf);
if (next.codePoint == '\\') {
// read & write next
next = reader.next();
if (!next.isEOL()) {
next.appendSource(outbf);
}
continue;
}
if (literalEndFound = next.codePoint == sc.codePoint) {
// read next
next = reader.next();
break;
}
}
if (!literalEndFound) {
// syntax error in input...
throw new IOException("End of text literal not found");
}

} else {
// write out source "as is"
sc.appendSource(outbf);
}
}

// flush buffered line
String outLine = outbf.toString();
if (outLine.trim().length() == 0) {
out.println();
} else {
out.println(outLine);
}

outbf.setLength(0);
}
}

private static abstract class Char {
final int codePoint;

Char(int codePoint) {
this.codePoint = codePoint;
}

boolean isEOL() {
return codePoint == '\n';
}

abstract void appendSource(StringBuilder sb);

static final Char EOF = new Char(-1) {

@Override
public void appendSource(StringBuilder sb) {
// write nothing
}

@Override
boolean isEOL() {
return true;
}
};

static Char newInstance(final InputChar c) {
return new Char(c.value) {

@Override
void appendSource(StringBuilder sb) {
c.appendSource(sb);
}
};
}

static Char newInstance(int codePoint, final InputChar c) {
return new Char(codePoint) {

@Override
void appendSource(StringBuilder sb) {
c.appendSource(sb);
}
};
}

static Char newInstance(int codePoint, final InputChar... chars) {
return new Char(codePoint) {

@Override
void appendSource(StringBuilder sb) {
for(InputChar c : chars) {
c.appendSource(sb);
}
}
};
}

@Override
public String toString() {
StringBuilder sb = new StringBuilder();
appendSource(sb);
return "[" + codePoint + "]=" + sb.toString();
}

}

private static abstract class InputChar {
final int value;

static final InputChar EOF = new InputChar(-1) {

@Override
void appendSource(StringBuilder sb) {
// write nothing
};
};

InputChar(int value) {
this.value = value;
}

abstract void appendSource(StringBuilder sb);

static InputChar newCharInstance(int value) {
return new InputChar(value) {

@Override
void appendSource(StringBuilder sb) {
sb.append((char)value);
}
};
}

static InputChar newEscapeSequenceInstance(int value, final
CharSequence seq) {
return new InputChar(value) {

@Override
void appendSource(StringBuilder sb) {
sb.append(seq);
}
};
}

}

private static class SourceReader {
private Reader in;

SourceReader(Reader in) {
this.in = in;
}

private Deque<InputChar> inputChars = new ArrayDeque<InputChar>();

Char next() throws IOException {
InputChar nc = nextInputChar();
if (nc == InputChar.EOF) {
return Char.EOF;
}

InputChar fc = nextInputChar();

if (nc.value == '\r' && fc.value == '\n') {
return Char.newInstance('\n', nc, fc);
}
if (nc.value == '\r' || nc.value == '\n') {
unread(fc);
return Char.newInstance('\n', nc);
}

if (Character.isSurrogatePair((char)nc.value, (char)fc.value)) {
return Char.newInstance(
Character.toCodePoint((char)nc.value, (char)fc.value), nc, fc);
}

unread(fc);
return Char.newInstance(nc);
}

private void unread(InputChar c) {
if (inputChars == null) {
if (c != InputChar.EOF) {
inputChars = new ArrayDeque<InputChar>();
} else {
return;
}
}
inputChars.addFirst(c);
}

private InputChar nextInputChar() throws IOException {
if (inputChars == null) {
return InputChar.EOF;
}
if (!inputChars.isEmpty()) {
return inputChars.removeFirst();
}

int r0 = in.read();
if (r0 == -1) {
inputChars = null;
return InputChar.EOF;
}
if (r0 == '\\') {
int r1 = in.read();
if (r1 == '\\') {
// double backslash, read each separately
inputChars.add(InputChar.newCharInstance(r0));
return inputChars.peek();
}
if (r1 == 'u') {
// escape sequence
StringBuilder seqbf = new StringBuilder();
// collect all 'u's
seqbf.append((char)r0);
do {
seqbf.append((char)r1);
r1 = in.read();
} while(r1 == 'u');
// parse escape sequence value
parseSeq: if (r1 != -1) {
seqbf.append((char)r1);
for(int i = 3; i > 0; --i) {
r1 = in.read();
if (r1 == -1) break parseSeq;
seqbf.append((char)r1);
}
if (r1 != -1) {
int val = Integer.parseInt(
seqbf.substring(seqbf.length() - 4), 16);
return InputChar.newEscapeSequenceInstance(val, seqbf);
}
}
// incorrect escape sequence...
throw new IOException("Incorrect escape sequence: '" + seqbf
+ "'");
}
// unknown...
inputChars.add(InputChar.newCharInstance(r1));
}
return InputChar.newCharInstance(r0);
}

void close() throws IOException {
if (in != null) {
in.close();
}
in = null;
inputChars = null;
}
}

}

Lew · Oct 31, 2007

Why not just decompile the bytecode?

Esmond Pitt · Oct 31, 2007

Mark said:
This is harder than you think. Use a real parser.

You don't need a real parser. You need a real lexer. Javac removes
comments in the lexer, as does every compiler I've ever written. So can you.
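
As a small illustration of the lexer point, the StreamTokenizer the OP asked about is itself a simple lexer that can be told to skip both comment styles. The sketch below is only an assumption about how one might configure it, and TokenizerDemo and tokens are made-up names. Note what it costs: whitespace and layout are discarded, escape sequences inside literals are already decoded in sval, and numbers come back as doubles, so it cannot reproduce the original file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: TokenizerDemo and tokens() are illustrative names.
class TokenizerDemo {
    /** Returns the tokens of src with // and block comments skipped. */
    static List<String> tokens(String src) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(src));
        st.ordinaryChar('/');        // by default a lone '/' starts a comment; make it a plain token
        st.slashSlashComments(true); // recognize // comments
        st.slashStarComments(true);  // recognize /* */ comments
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add(st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add(String.valueOf(st.nval)); // nval is always a double
                        break;
                    case '"':
                    case '\'':
                        // sval holds the literal's body; put the quote characters back
                        out.add((char) st.ttype + st.sval + (char) st.ttype);
                        break;
                    default:
                        out.add(String.valueOf((char) st.ttype)); // single-character token
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

This is enough to show that comment removal is a lexing job, but for output that preserves the original layout a hand-written scanner or a generated Java lexer is the better fit.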

Piotr Kobzda · Oct 31, 2007

Lew said:
Why not just decompile the bytecode?

Because it's not always possible to recover even equivalent source
code from the bytecode? (Keywords: type erasure, compile-time constant
expression resolution, obfuscation, etc...)

piotr

Piotr Kobzda · Oct 31, 2007

Esmond said:
You don't need a real parser. You need a real lexer. Javac removes
comments in the lexer, as does every compiler I've ever written. So can
you.

Javac's lexer does not remove comments (not all of them, at least).
Important comments, i.e. /** ... */, must be preserved for the parser
because they may contain information needed for code generation (e.g.
@deprecated Javadoc tags).

In fact, there is no clear distinction between the javac lexer and
parser, I think...

BTW, the OP may also use the Java Compiler API (JSR-199) and its
Tree API (the latter is still under com.sun.*, but AFAIK is "almost"
stable now...). A starting-point example is below (requires
tools.jar!). It needs more detailed scanning of the source tree (extend
TreeScanner) because the current Tree.toString() implementations give
a not-so-exact preview of the original source code (e.g. annotations'
attribute default values are omitted from the output, etc...). For the
OP's particular problem I prefer the simplified "stripper" (the one I
posted earlier in this thread), because everything is under "my control"
there. However, the uses of the 199 API are much wider than that, so its
importance goes well beyond my simple approach.

piotr

import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

import com.sun.source.tree.AnnotationTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.ImportTree;
import com.sun.source.tree.Tree;
import com.sun.source.tree.TreeVisitor;
import com.sun.source.util.TreeScanner;

public class JavaCBasedCommentStripper {

public static void main(String[] args) throws Exception {
final JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
final StandardJavaFileManager fileManager = compiler
.getStandardFileManager(null, null, null);
Iterable<? extends JavaFileObject> compilationUnits = fileManager
.getJavaFileObjects("JavaCBasedCommentStripper.java");
com.sun.source.util.JavacTask jt = (com.sun.source.util.JavacTask)
compiler
.getTask(null, fileManager, null, null, null, compilationUnits);
Iterable<? extends CompilationUnitTree> ts = jt.parse();

for (CompilationUnitTree cu : ts) {
// System.out.println(cu); // preserves /** comments */

for(AnnotationTree at : cu.getPackageAnnotations()) {
System.out.println(at);
}
String pkg = cu.getPackageName().toString();
if (!pkg.equals("")) {
System.out.println("package " + pkg + ";\n");
}
for(ImportTree it : cu.getImports()) {
System.out.print(it);
}

for(Tree td : cu.getTypeDecls()) {
System.out.println(td); // not all details in output!

// extend the following instead...
// TreeVisitor<Void, Void> tv = new TreeScanner<Void, Void>() {
//
// @Override
// public Void visit...
//
// };
// td.accept(tv, null);

}
}
}
}

Martin Gregorie · Oct 31, 2007

Tris said:

All you need is a lexer (lex) to pick up tokens--no parsing (yacc or bison) required. I have a version in C and
lex for MS-DOS which was slapped together in 1990. You can find it at
http://sourceforge.net/projects/cshroud . Comments are naturally disposed of since that is half the job of
shrouding.

As you want to process Java and can read it, you're better off using
Coco/R. Unlike lex+yacc, it has a Java port which is written in Java and
generates Java. It's fractionally easier to get your head round as well.

http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/

Lew · Oct 31, 2007

Piotr said:
Because it's not always possible to recover even equivalent source
code from the bytecode? (Keywords: type erasure, compile-time constant
expression resolution, obfuscation, etc...)

You make good points, except for the obfuscation part.

Piotr Kobzda · Oct 31, 2007

Lew said:
You make good points, except for the obfuscation part.

Well, obfuscation is mentioned here to indicate the possibility of a
one-way-only transformation of source code into bytecode.
Compilers are free to optimize, or -- just like obfuscators -- to
"mangle" the code in a way that prevents reverse engineering (even
incompletely generated debug info, for example the LVT not being present
in a class file, is a kind of obfuscation in the sense I mean here).

piotr

Régis Décamps · Oct 31, 2007

Why not just decompile the bytecode?

Because decompilers change the syntax. Sometimes making it hard to
understand, sometimes easier, but changed anyway.
For instance
public String toString() {
String myname = this.getName();
return("#<"
+ (myname!=null ? (" " + myname) : "" )
+ ">");
}
becomes (with jad)
public String toString()
{
String myname = getName();
return (new StringBuilder()).append("#<").append(myname == null ? ""
: (new StringBuilder()).append(" ").append(myname).toString())
.append(">").toString();
}

Also, decompilers have problems, in particular with inline functions
or static declarations (see for instance http://www.kpdus.com/jad.html#bugs,
and JAD is one of the best AFAIK).

So, it was a nice idea, but does not provide a good answer to the
need. I like the idea of using a C/C++ preprocessor (even though there
might be side effects, too).

Esmond Pitt · Oct 31, 2007

Piotr said:
Javac's lexer does not remove comments (not all of them, at least).

In other words it could. So in other words it can be done by a lexer.

Tris Orendorff · Nov 1, 2007

As you want to process Java and can read it, you're better off using
Coco/R. Unlike lex+yacc, it has a Java port which is written in Java and
generates Java. It's fractionally easier to get your head round as well.

http://www.ssw.uni-linz.ac.at/Research/Projects/Coco/

Agreed! Coco looks like a well thought out tool.


Esmond Pitt · Nov 2, 2007

Tris said:
Agreed! Coco looks like a well thought out tool.

All you really need is JavaCC 4.0 which comes with the Java 5.0 grammar.
Then just use the tokenizer.
