Middle matching - any Python library functions (besides re)?

EP

Hi,

I'm a bit green in this area and wonder to what extent there may be
some existing Python tools for this (or if I have to scratch my head real
hard for an appropriate algorithm...). I'd hate to build an inferior
solution to one that someone has painstakingly built before me.

I have some files which may have had the same origin, but some may have
had some cruft added to the front, and some may have had some cruft
added to the back; thus they may be of slightly different lengths, but
if they had the same origin there will be a matching pattern of bytes
in the middle, though it may be offset from one file to the other. I want
to find which files share this common pattern of origin with which other
files. The cruft portions should be a small percentage of the overall
file lengths.

Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.

TIA for bearing with my ignorance of the clear solution I'm surely
blind to...

EP
 
Paul Rubin

EP said:
Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.

If these are text files, typically you'd use the Unix 'diff' utility
to locate the differences.
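
For what it's worth, a minimal way to drive diff from Python (sketch
only; it assumes a Unix-like system with diff on the PATH and Python
3.5+ for subprocess.run):

import subprocess

def diff_files(path_a, path_b):
    # diff exits 0 when the files match, 1 when they differ, and >1 on
    # error, so read stdout directly rather than using check_output().
    result = subprocess.run(["diff", path_a, path_b],
                            stdout=subprocess.PIPE, universal_newlines=True)
    if result.returncode > 1:
        raise RuntimeError("diff failed on %s / %s" % (path_a, path_b))
    return result.stdout  # empty string means the files are identical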
 
bearophileHUGS

If you want to avoid an O(n^2) algorithm, you may need to compute a
signature for each file. Then you can use such signatures as keys in a
dict to bucket the files (the dict values may be the file names), and
later weed out the few wrong collisions produced by the possibly
approximate signature. If, for a given file, you can determine where its
heading garbage stops, then you can compute the signature by taking the
Python hash of some of the bytes that follow (30 or 200 bytes may
suffice).
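
A rough sketch of that bucketing idea (sketch only: the cruft_end()
helper below is hypothetical, since how to find where the heading
garbage stops depends entirely on the files, and SIG_LEN is just a
guess from the 30-200 byte range above):

from collections import defaultdict

SIG_LEN = 200  # number of bytes hashed for the signature

def cruft_end(filename):
    # Hypothetical helper: return the offset where the leading garbage
    # stops.  How to compute this depends on the actual files; 0 is only
    # a placeholder.
    return 0

def signature(filename):
    with open(filename, 'rb') as f:
        f.seek(cruft_end(filename))
        return hash(f.read(SIG_LEN))  # built-in Python hash, as suggested

def candidate_groups(filenames):
    # Files with equal signatures land in the same bucket; only buckets
    # holding two or more names need the later check that weeds out the
    # occasional false collision.
    buckets = defaultdict(list)
    for name in filenames:
        buckets[signature(name)].append(name)
    return [group for group in buckets.values() if len(group) > 1]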

Bye,
bearophile
 
Paddy

EP said:
Hi,

I'm a bit green in this area and wonder to what extent there may be
some existing Python tools for this (or if I have to scratch my head real
hard for an appropriate algorithm...). I'd hate to build an inferior
solution to one that someone has painstakingly built before me.

I have some files which may have had the same origin, but some may have
had some cruft added to the front, and some may have had some cruft
added to the back; thus they may be of slightly different lengths, but
if they had the same origin there will be a matching pattern of bytes
in the middle, though it may be offset from one file to the other. I want
to find which files share this common pattern of origin with which other
files. The cruft portions should be a small percentage of the overall
file lengths.

Are they source files?
Is the cruft comments?

Have you done an exhaustive search for info on the files and their
histories? If they are systematically generated then there may be a way
to systematically uncover the matching info from their histories, such
as source code management histories, file pathnames, or file dates.

What more can you find out about cruft? Is there a way to strip the
cruft by program? If so, file sizes and checksums could be compared on
the stripped files.
Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.

Can you change the flow that creates these files so that the
information you want is generated alongside the files?

Can you use data later in the flow to more easily extract the info you
want?
TIA for bearing with my ignorance of the clear solution I'm surely
blind to...

EP

Just some questions that I would ask myself if given the task.

- Paddy.
 
Simon Forman

Paul said:
If these are text files, typically you'd use the Unix 'diff' utility
to locate the differences.

If you can, you definitely want to use diff. Otherwise, the difflib
standard library module may be of use to you. Also, since you're
talking about comparing many files to each other, you could pull out a
substring of one file and use the 'in' "operator" to check if that
substring is in another file. Something like this:

f = open(filename)  # or if binary open(filename, 'rb')
f.seek(somewhere_in_the_file)
substr = f.read(some_amount_of_data)
f.close()

try_diffing_us = []
for fn in list_of_filenames:
    data = open(fn).read()  # or again open(fn, 'rb')...
    if substr in data:
        try_diffing_us.append(fn)

# then diff just those filenames...

That's a naive implementation but it should illustrate how to cut down
on the number of actual diffs you'll need to perform. Of course, if
your files are large it may not be feasible to do this with all of
them. But they'd have to be really large, or there'd have to be lots
and lots of them... :)
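
For the "offset somewhere in the middle" part, difflib's SequenceMatcher
can also report the longest block two files have in common directly.
A rough sketch (it reads whole files into memory, needs a Python recent
enough to have the autojunk argument, and can be slow on big files, so
it's for pairwise checks rather than all-against-all):

import difflib

def longest_common_block(path_a, path_b):
    # Returns (offset_in_a, offset_in_b, size) of the longest run of
    # bytes shared by the two files -- i.e. the common "middle".
    a = open(path_a, 'rb').read()
    b = open(path_b, 'rb').read()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return m.a, m.b, m.size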

More information on your actual use case would be helpful in narrowing
down the best options.

Peace,
~Simon
 
Andrew Robert

Simon said:
If you can, you definitely want to use diff. Otherwise, the difflib
standard library module may be of use to you. Also, since you're
talking about comparing many files to each other, you could pull out a
substring of one file and use the 'in' "operator" to check if that
substring is in another file. [...]

Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?


Utilizing the functions below may be of some help.


#!/usr/bin/python
#
# Function: generate and compare checksums on a file


import md5


def getsum(filename):
    """
    Generate the checksum based on received chunks of the file
    """
    md5sum = md5.new()
    f = open(filename, 'rb')  # binary mode so line endings don't alter the sum
    for block in getblocks(f):
        md5sum.update(block)
    f.close()
    return md5sum.hexdigest()


def getblocks(f, blocksize=1024):
    """
    Read the file in small chunks to avoid loading large files into memory
    """
    while True:
        s = f.read(blocksize)
        if not s:
            break
        yield s


def checksum_compare(caller, cs='', check='', filename=''):
    """
    Compare the generated and received checksum values
    """
    if cs != check:
        return 1  # compare failed
    else:
        return 0  # compare successful




--
Adversity: That which does not kill me only postpones the inevitable.
 
Simon Forman

Andrew said:
Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?

The thing about a checksum algorithm is that if you call it with some
data, then change even one byte of the data and call it again, the
resulting checksums will be (should be) very different.

Checksumming the entire file contents won't help, but as bearophile
said: "If, for a given file, you can determine where its heading garbage
stops, then you can compute the signature by taking the Python hash of
some of the bytes that follow (30 or 200 bytes may suffice)."

If the OP can do that then yes, it would likely be more efficient
(although computing and comparing checksums on just 200 bytes or less
might not be a significant gain over simply comparing the strings
themselves). But if he can't, then the next best thing would be to take
a subsection of the file somewhere after the heading cruft and search
(using string 'in' string form) for that subsection in other files.
(Actually there may be a better option than that, but I'm just not
bright enough to have thought of it...)
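
Concretely, that approach might look something like this (sketch only:
PROBE_OFFSET and PROBE_LEN are made-up numbers, everything is read into
memory, and it is still O(n^2) comparisons, so it's only reasonable for
the smaller batches of files):

PROBE_OFFSET = 4096   # assumed to be safely past any leading cruft
PROBE_LEN = 64        # length of the chunk searched for in the other files

def probe(data):
    return data[PROBE_OFFSET:PROBE_OFFSET + PROBE_LEN]

def candidate_pairs(filenames):
    # Yield (a, b) pairs where file b contains file a's probe chunk;
    # those are the pairs worth diffing properly afterwards.
    contents = dict((fn, open(fn, 'rb').read()) for fn in filenames)
    for a in filenames:
        chunk = probe(contents[a])
        if not chunk:
            continue
        for b in filenames:
            if b != a and chunk in contents[b]:
                yield a, b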
Utilizing the functions below may be of some help.


def checksum_compare(caller, cs='', check='', filename=''):
    """
    Compare the generated and received checksum values
    """
    if cs != check:
        return 1  # compare failed
    else:
        return 0  # compare successful

I'm curious why you included this elaborate function with its unused
args (caller and filename), unnecessary defaults, and its odd
inversion of Boolean values...

How is "if not checksum_compare(something, sum0, sum1): #do
something..." any better than "if sum0 == sum1: #do something..." ?

Peace,
~Simon
 
Andrew Robert

Simon said:
I'm curious why you included this elaborate function with its unused
args (caller and filename), unnecessary defaults, and its odd
inversion of Boolean values...

How is "if not checksum_compare(something, sum0, sum1): #do
something..." any better than "if sum0 == sum1: #do something..." ?

Peace,
~Simon
Because I was lazy...

The checksum_compare came from something else I wrote that had special
logging and e-mailer calls in it.

Should have ripped the references to caller and filename out.
 
Simon Forman

Andrew said:
Because I was lazy...

The checksum_compare came from something else I wrote that had special
logging and e-mailer calls in it.

Should have ripped the references to caller and filename out.

Aaaahh the subtle joys of cut-and-paste programming... :-D

(I've done it myself... ;-)
 
