Henry Townsend
Is there a standard algorithm or module which finds the N longest common
substrings in a set of text files?
Here's the use case: I'm trying to clean up a very old, very large, and
very ugly build system which has thousands of unparameterized
compile/link commands in hundreds of Makefiles. I want to search them
for frequently-occurring long substrings. Hopefully this will turn up
phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
"-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic
meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES),
and do a global replace. Then repeat until satisfied.
I've done some searching but haven't found anything close to the above.
Thanks,
HT
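
For illustration only, here is a minimal Python sketch of the kind of search described above. It is not a true longest-common-substring algorithm (which would normally use a suffix tree or suffix array). Instead it assumes that whitespace-separated tokens are the right unit for compiler flags, counts every run of consecutive tokens, and ranks the repeated runs longest first. The function name `frequent_phrases` and all of its parameters are hypothetical, and Makefile backslash line continuations are not handled.

```python
from collections import Counter


def frequent_phrases(lines, min_tokens=2, max_tokens=8, min_count=2, top_n=10):
    """Return up to top_n (phrase, count) pairs for token runs seen repeatedly.

    Hypothetical helper: a phrase is a run of min_tokens..max_tokens
    consecutive whitespace-separated tokens taken from a single line.
    """
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(min_tokens, max_tokens + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1

    repeated = [(p, c) for p, c in counts.items() if c >= min_count]
    # Longest phrases first, then most frequent, then alphabetical for stability.
    repeated.sort(key=lambda pc: (-len(pc[0].split()), -pc[1], pc[0]))
    return repeated[:top_n]


if __name__ == "__main__":
    sample = [
        "cc -DUNIX -DANSI -DUSE_SOCKETS -c a.c",
        "cc -DUNIX -DANSI -DUSE_SOCKETS -c b.c",
    ]
    for phrase, count in frequent_phrases(sample, top_n=3):
        print(count, phrase)
```

In practice the lines would come from reading each Makefile, and shorter phrases contained in an already chosen longer one would probably need to be filtered out before deciding on variable names.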