String/source code analysis tools

Moosebumps

I have a whole bunch of script files in a custom scripting "language" that
were basically copied and pasted all over the place -- a huge mess,
basically.

I want to clean this up using Python -- and I'm wondering if there is any
sort of algorithm for detecting copied and pasted code with slight
modifications.

Here is a simple example.

If I have two pieces of code like this:

func1( a, b, c, 13, d, e, f )
func2( x, y, z, z )

and

func1( a, b, c, 55, d, e, f )
func2( x, y, z, x )

I would like to be able to detect the redundancies. This is obviously a
simple example, the real code is worlds messier -- say a 3 line script, each
line has 800 characters, copied 10 times over with slight modifications
among the 800 characters. I'm not exaggerating. So I'm wondering if there
is any code out there that will assist me in refactoring this code.

My feeling is that a general solution to this is intractable, but I thought
I'd ask. I'd probably have to roll my own based on the specifics of the
situation.

It is actually sort of like a diff algorithm maybe, but it wouldn't go line
by line. How would I do a diff, token by token? I don't know anything
about what algorithms diffs use.

thanks,
MB
 
Moosebumps

Man, I love Python! After writing this, with about 10 minutes of googling,
I found difflib, which can do diffs token by token. I can probably do what I
want with about 10 lines of code. Wow.

I think the diff is pretty much the best solution -- but if anyone has any
other pointers I would appreciate it. I would have to diff all pairs of
files and I can get a score of how similar they are to each other. So if I
have 10 files I would have to run it 45 times to get all pairs of diffs.
That should be OK since they are small files in general.
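
For anyone following along, here is a minimal sketch of the all-pairs scoring described above. It assumes whitespace-separated tokens are good enough for a first pass, and it uses in-memory sample strings instead of real files. The names `similarity` and `rank_pairs` and the sample data are made up for illustration; `difflib.SequenceMatcher`, its `ratio()` method, and `itertools.combinations` come from the standard library.

```python
import difflib
from itertools import combinations

def similarity(text_a, text_b):
    """Return a 0..1 similarity score over whitespace-separated tokens."""
    a, b = text_a.split(), text_b.split()
    # autojunk=False turns off the "popular element" heuristic, which could
    # otherwise ignore very frequent tokens in long inputs.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

def rank_pairs(scripts):
    """Score every unordered pair of scripts, most similar first."""
    scores = [
        (similarity(scripts[x], scripts[y]), x, y)
        for x, y in combinations(sorted(scripts), 2)
    ]
    return sorted(scores, reverse=True)

# Made-up sample data standing in for the real script files.
scripts = {
    "one.txt": "func1( a, b, c, 13, d, e, f )\nfunc2( x, y, z, z )",
    "two.txt": "func1( a, b, c, 55, d, e, f )\nfunc2( x, y, z, x )",
    "three.txt": "something( else, entirely )",
}
for score, name_a, name_b in rank_pairs(scripts):
    print(f"{score:.2f}  {name_a}  {name_b}")
```

With n files this makes n*(n-1)/2 comparisons, which is the 45 runs mentioned above for 10 files. That is fine for a handful of small scripts but grows quickly for large collections.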

MB
 
Heiko Wundram

On Thursday 22 April 2004 08:56, Moosebumps wrote:
It is actually sort of like a diff algorithm maybe, but it wouldn't go line
by line. How would I do a diff, token by token? I don't know anything
about what algorithms diffs use.

What about difflib (part of the standard library)? You'd have to write your
own tokenization function, but that shouldn't be hard...
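
As a rough sketch of that idea, the following splits the text into tokens with a regular expression and feeds the token lists to `difflib.SequenceMatcher`. The token pattern is only a guess at what the custom scripting language looks like (identifiers, integers, and single punctuation characters), and `tokenize` and `token_diff` are hypothetical helper names, not part of difflib.

```python
import difflib
import re

# Assumed token shapes: identifiers, integers, or any single non-space
# character. A real scripting language may need a different pattern.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(text):
    """Split text into a flat list of tokens, ignoring whitespace."""
    return TOKEN_RE.findall(text)

def token_diff(old, new):
    """Return (tag, old_tokens, new_tokens) for each non-equal region."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, a[i1:i2], b[j1:j2]))
    return changes

# The two snippets from the original question.
old = "func1( a, b, c, 13, d, e, f )\nfunc2( x, y, z, z )"
new = "func1( a, b, c, 55, d, e, f )\nfunc2( x, y, z, x )"
for change in token_diff(old, new):
    print(change)
# ('replace', ['13'], ['55'])
# ('replace', ['z'], ['x'])
```

Because the comparison runs over tokens and not lines, an 800-character line with one changed argument shows up as a single small replacement, which points directly at the parameter a refactored function would need.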

Heiko.
 
Ira Baxter

Moosebumps said:
I have a whole bunch of script files in a custom scripting "language" that
were basically copied and pasted all over the place -- a huge mess,
basically.

I want to clean this up using Python -- and I'm wondering if there is any
sort of algorithm for detecting copied and pasted code with slight
modifications.

Not in Python, but it could be used to do this.
We offer a clone detection tool that works on very large source code bases
and detects cloned code with "slight modifications".
You'd have to provide a grammar for your 'scripting language'.
See http://www.semanticdesigns.com/Products/Clone/index.html.
 
François Pinard

[Ira Baxter]
Not in Python, but it could be used to do this. We offer a clone
detection tool that works on very large source code bases and detects
cloned code with "slight modifications". You'd have to provide a
grammar for your 'scripting language'. See
http://www.semanticdesigns.com/Products/Clone/index.html.

Thanks for the reference, I'm saving it for later perusal or study.

Many years ago, because I had a cleanup problem which I presume was similar
to yours, I wrote and then used a tool for this, all in C. I called
it `mdiff' (for "multi-diff"), and it can likely be found within some old
pretest of `Free wdiff' -- I have not really touched `wdiff' in years, even
though I am pondering republishing it this summer, provided I find some free time.

`mdiff' looks for identical sequences of lines within one or more files
(I used it on many dozens of files at once). One difficulty was
designing a way to display the output usably, and this was an
interesting problem at least. `mdiff' did the job for me, but I do not
really remember the state of this project nor how `mdiff' would behave
if recompiled today. But, as usual with me, if you feel like toying,
just ask for the sources, or hunt for them on my home web page! :)
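
To give a feel for the "identical sequences of lines across files" idea in Python, here is a small sketch. It is not how `mdiff' works internally (its sources are not shown in this thread); it simply records every run of `window` consecutive lines and keeps the runs that occur more than once. The function name `repeated_runs`, the `window` parameter, and the sample files are all invented for the example.

```python
from collections import defaultdict

def repeated_runs(files, window=2):
    """Map each run of `window` consecutive lines to the places it occurs.

    `files` maps a name to its text; only runs seen more than once are kept.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        lines = [line.strip() for line in text.splitlines()]
        for i in range(len(lines) - window + 1):
            run = tuple(lines[i:i + window])
            if all(run):  # skip runs that contain a blank line
                seen[run].append((name, i + 1))  # 1-based line number
    return {run: places for run, places in seen.items() if len(places) > 1}

# Made-up sample files.
files = {
    "a.txt": "setup()\nload(x)\nrun(x)\n",
    "b.txt": "init()\nload(x)\nrun(x)\ncleanup()\n",
}
for run, places in repeated_runs(files).items():
    print(run, places)
# ('load(x)', 'run(x)') [('a.txt', 2), ('b.txt', 2)]
```

This only catches exact copies after stripping leading and trailing whitespace; copies with slight modifications would still need a token-level comparison.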
 
