Trying to somehow measure similarity of text files

lbrtchx · Jan 7, 2008

~
It is easy for files which are binary equal or can somehow be matched
line by line with some diff-like code, but other than that I couldn't
really find anything in Java or otherwise
~
There is the dump3.sourceforge.net project, but I find it a little
too much and way too demanding to run on a server. It appears to
have been conceived with pictures as the data format in mind
~
Do you know of any C, C++ or Java project, or a good, well-thought-out
white paper on this problem?
~
Thanks
lbrtchx

Roedy Green · Jan 7, 2008

It is easy for files which are binary equal or can somehow be matched
line by line with some diff-like code, but other than that I couldn't
really find anything in Java or otherwise

If you had a way of chunking the file, e.g. sentences in a text file
or lines in a CSV file, then you could compute a hashCode for each
"sentence".

You could then process your two files and create a list of hash codes
for each. Then sort each list. Then compare the lists, counting
matches. That gives you a rough idea of how many sentences they have
in common and how many are unique to each. Compute the ratio of common
sentences to total unique sentences.

Ignore collisions (two sentences, either the same or different, in the
same file producing the same hash code).
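
A rough sketch of the steps above, in Java. The class name, the method
names and the naive rule for splitting into "sentences" (break on '.',
'!', '?' or a newline) are assumptions made for illustration only; a
real chunker would depend on the file format. Duplicate hash codes
within one file are simply collapsed, which is one way of ignoring
collisions.

```java
import java.util.Arrays;

public class HashSimilarity {

    // Chunk the text into "sentences", hash each one, and return the
    // distinct hash codes in sorted order. The split rule is a naive
    // assumption, not a proper sentence detector.
    public static int[] sortedUniqueHashes(String text) {
        return Arrays.stream(text.split("[.!?\\n]+"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .mapToInt(String::hashCode)
                .distinct()   // collapse repeats and in-file collisions
                .sorted()
                .toArray();
    }

    // Ratio of common hash codes to total unique hash codes across
    // both texts: 0.0 means nothing shared, 1.0 means same sentence set.
    public static double similarity(String a, String b) {
        int[] ha = sortedUniqueHashes(a);
        int[] hb = sortedUniqueHashes(b);
        int i = 0, j = 0, common = 0;
        // Walk both sorted lists in step, counting matching entries.
        while (i < ha.length && j < hb.length) {
            if (ha[i] == hb[j]) {
                common++;
                i++;
                j++;
            } else if (ha[i] < hb[j]) {
                i++;
            } else {
                j++;
            }
        }
        int totalUnique = ha.length + hb.length - common;
        // Two empty texts are treated as identical (a design choice).
        return totalUnique == 0 ? 1.0 : (double) common / totalUnique;
    }

    public static void main(String[] args) {
        String a = "The cat sat. The dog ran. It rained.";
        String b = "It rained. The cat sat. Birds sang.";
        System.out.println(similarity(a, b));
    }
}
```

Because only the sets of hash codes are compared, the order of the
sentences does not matter, so a file whose sentences have merely been
shuffled still scores as fully similar.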

It is rude to ask questions in one group with followup to another.

I was thinking of some logic like this for creating delta files, that
could efficiently transmit changes to text files that have mainly been
reordered.
