Trying to somehow measure similarity of text files

lbrtchx

It is easy for files that are binary-equal or can somehow be matched
line by line with some diff-like code, but other than that I couldn't
really find anything in Java or otherwise.

There is the dump3.sourceforge.net project, but I find it a little
too much and way too demanding to be run on a server. It appears to
have been conceived with pictures as the data format in mind.

Do you know of any C, C++ or Java project, or a good, well-written
white paper on this problem?

Thanks
lbrtchx
 
Roedy Green

It is easy for files that are binary-equal or can somehow be matched
line by line with some diff-like code, but other than that I couldn't
really find anything in Java or otherwise.

If you had a way of chunking the file, e.g. sentences in a text file
or lines in a CSV file, then you could compute a hashCode for each
"sentence".

You could then process your two files and create a list of hash codes
for each. Then sort each list and compare them, counting matches. That
gives you a rough idea of how many sentences they have in common and
how many are unique to each. Compute the ratio of common sentences to
total unique sentences.

Ignore collisions (two sentences, either the same or different, in the
same file producing the same hash code).
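
Here is a rough sketch of the idea above, not a finished tool. It assumes
each non-blank line is one "sentence" and uses String.hashCode() as the
hash. Instead of sorting two lists and walking them, it puts the hash codes
into hash sets, which gives the same common/total-unique ratio and quietly
ignores repeats and collisions. The class and method names (ChunkSimilarity,
chunkHashes, similarity) are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class ChunkSimilarity {

    /** Splits text on line breaks and hashes each distinct, trimmed, non-blank line. */
    static Set<Integer> chunkHashes(String text) {
        Set<Integer> hashes = new HashSet<>();
        for (String line : text.split("\\R")) {
            String chunk = line.trim();
            if (!chunk.isEmpty()) {
                hashes.add(chunk.hashCode());
            }
        }
        return hashes;
    }

    /** Returns (hashes in both files) / (hashes in either file), from 0.0 to 1.0. */
    static double similarity(String a, String b) {
        Set<Integer> hashesA = chunkHashes(a);
        Set<Integer> hashesB = chunkHashes(b);

        Set<Integer> union = new HashSet<>(hashesA);
        union.addAll(hashesB);
        if (union.isEmpty()) {
            return 1.0; // assumption: two empty files count as identical
        }

        Set<Integer> common = new HashSet<>(hashesA);
        common.retainAll(hashesB);
        return (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        String first = "alpha\nbeta\ngamma\n";
        String second = "gamma\nalpha\ndelta\n";
        // Two lines in common out of four distinct lines overall.
        System.out.println(similarity(first, second)); // prints 0.5
    }
}
```

Because only set membership matters, reordering the lines of a file does not
change the score, which is the point of hashing chunks rather than diffing.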

It is rude to ask questions in one group with followup to another.

I was thinking of some logic like this for creating delta files, that
could efficiently transmit changes to text files that have mainly been
reordered.
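
To make the delta idea a bit more concrete, here is one possible shape for
it, purely as a sketch under my own assumptions: each line of the new file
is sent either as an index into the old file (if that line already exists
there) or as literal text. The names ReorderDelta, Op, encode and apply are
hypothetical. This version keys the lookup on the line text itself rather
than on its hash code, so a collision cannot corrupt the rebuilt file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReorderDelta {

    /** One delta entry: a reference to an old line, or a literal new line. */
    static final class Op {
        final int oldIndex;   // index into the old file, or -1 for a literal
        final String literal; // new text, or null when oldIndex is used

        Op(int oldIndex, String literal) {
            this.oldIndex = oldIndex;
            this.literal = literal;
        }
    }

    /** Describes newLines in terms of oldLines wherever a line is reused. */
    static List<Op> encode(List<String> oldLines, List<String> newLines) {
        Map<String, Integer> firstIndex = new HashMap<>();
        for (int i = 0; i < oldLines.size(); i++) {
            firstIndex.putIfAbsent(oldLines.get(i), i);
        }
        List<Op> delta = new ArrayList<>();
        for (String line : newLines) {
            Integer at = firstIndex.get(line);
            delta.add(at != null ? new Op(at, null) : new Op(-1, line));
        }
        return delta;
    }

    /** Rebuilds the new file from the old file plus the delta. */
    static List<String> apply(List<String> oldLines, List<Op> delta) {
        List<String> out = new ArrayList<>();
        for (Op op : delta) {
            out.add(op.oldIndex >= 0 ? oldLines.get(op.oldIndex) : op.literal);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> oldLines = Arrays.asList("a", "b", "c");
        List<String> newLines = Arrays.asList("c", "a", "x", "b");
        List<Op> delta = encode(oldLines, newLines);
        System.out.println(apply(oldLines, delta).equals(newLines)); // prints true
    }
}
```

For a file that has mainly been reordered, almost every entry is a small
integer instead of a full line. A real implementation would probably merge
runs of consecutive indexes into ranges to shrink the delta further.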
 
