Middle matching - any Python library functions (besides re)?

EP

Hi,

I'm a bit green in this area and wonder to what extent there may be
some existing Python tools for this (or if I have to scratch my head real
hard for an appropriate algorithm...). I'd hate to build an inferior
solution to one that someone has painstakingly built before me.

I have some files which may have had the same origin, but some may have
had some cruft added to the front, and some may have had some cruft
added to the back; thus they may be of slightly different lengths, but
if they had the same origin there will be a matching pattern of bytes
in the middle, though it may be offset from one file to the other. I want
to find which files share this common pattern of origin with which other
files. The cruft portions should be a small percentage of the overall
file lengths.

Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.

TIA for bearing with my ignorance of the clear solution I'm surely
blind to...

EP
 
Paul Rubin

EP said:
Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.

If these are text files, typically you'd use the Unix 'diff' utility
to locate the differences.
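
For what it's worth, a minimal way to drive diff from Python (sketch
only; it assumes a Unix-like system with diff on the PATH and Python
3.5+ for subprocess.run):

import subprocess

def diff_files(path_a, path_b):
    # diff exits 0 when the files match, 1 when they differ, and >1 on
    # error, so read stdout directly rather than using check_output().
    result = subprocess.run(["diff", path_a, path_b],
                            stdout=subprocess.PIPE, universal_newlines=True)
    if result.returncode > 1:
        raise RuntimeError("diff failed on %s / %s" % (path_a, path_b))
    return result.stdout  # empty string means the files are identical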
 
bearophileHUGS

If you want to avoid an O(n^2) algorithm, you may need to compute a
signature for each file. Then you can use such signatures as keys in a
dict to bucket the files (the dict values may be the file names), and
later weed out the few wrong collisions produced by the possibly
approximate signature. If, for a given file, you can determine where its
heading garbage stops, then you can compute the signature by taking the
Python hash of some of the bytes that follow (30 or 200 bytes may
suffice).
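
A rough sketch of that bucketing idea (sketch only: the cruft_end()
helper below is hypothetical, since how to find where the heading
garbage stops depends entirely on the files, and SIG_LEN is just a
guess from the 30-200 byte range above):

from collections import defaultdict

SIG_LEN = 200  # number of bytes hashed for the signature

def cruft_end(filename):
    # Hypothetical helper: return the offset where the leading garbage
    # stops.  How to compute this depends on the actual files; 0 is only
    # a placeholder.
    return 0

def signature(filename):
    with open(filename, 'rb') as f:
        f.seek(cruft_end(filename))
        return hash(f.read(SIG_LEN))  # built-in Python hash, as suggested

def candidate_groups(filenames):
    # Files with equal signatures land in the same bucket; only buckets
    # holding two or more names need the later check that weeds out the
    # occasional false collision.
    buckets = defaultdict(list)
    for name in filenames:
        buckets[signature(name)].append(name)
    return [group for group in buckets.values() if len(group) > 1]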

Bye,
bearophile
 
Paddy

EP said:
Hi,

I'm a bit green in this area and wonder to what extent there may be
some existing Python tools for this (or if I have to scratch my head real
hard for an appropriate algorithm...). I'd hate to build an inferior
solution to one that someone has painstakingly built before me.

I have some files which may have had the same origin, but some may have
had some cruft added to the front, and some may have had some cruft
added to the back; thus they may be of slightly different lengths, but
if they had the same origin there will be a matching pattern of bytes
in the middle, though it may be offset from one file to the other. I want
to find which files share this common pattern of origin with which other
files. The cruft portions should be a small percentage of the overall
file lengths.

Are they source files?
Is the cruft comments?

Have you done an exhaustive search for info on the files and their
histories? If they are systematically generated then there may be a way
to systematically uncover the matching info from their histories, such
as source code management histories, file pathnames, or file dates.

What more can you find out about cruft? Is there a way to strip the
cruft by program? If so, file sizes and checksums could be compared on
the stripped files.
Given that I am looking for matches of all files against all other
files (of similar length) is there a better bet than using re.search?
The initial application concerns files in the 1,000's, and I could use
a good solution for a number of files in the 100,000's.

Can you change the flow that creates these files so that the
information you want is generated alongside the files?

Can you use data later in the flow to more easily extract the info you
want?
TIA for bearing with my ignorance of the clear solution I'm surely
blind to...

EP

Just some questions that I would ask myself if given the task.

- Paddy.
 
Simon Forman

Paul said:
If these are text files, typically you'd use the Unix 'diff' utility
to locate the differences.

If you can, you definitely want to use diff. Otherwise, the difflib
standard library module may be of use to you. Also, since you're
talking about comparing many files to each other, you could pull out a
substring of one file and use the 'in' "operator" to check if that
substring is in another file. Something like this:

f = open(filename)  # or if binary open(filename, 'rb')
f.seek(somewhere_in_the_file)
substr = f.read(some_amount_of_data)
f.close()

try_diffing_us = []
for fn in list_of_filenames:
    data = open(fn).read()  # or again open(fn, 'rb')...
    if substr in data:
        try_diffing_us.append(fn)

# then diff just those filenames...

That's a naive implementation but it should illustrate how to cut down
on the number of actual diffs you'll need to perform. Of course, if
your files are large it may not be feasible to do this with all of
them. But they'd have to be really large, or there'd have to be lots
and lots of them... :)
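
For the "offset somewhere in the middle" part, difflib's SequenceMatcher
can also report the longest block two files have in common directly.
A rough sketch (it reads whole files into memory, needs a Python recent
enough to have the autojunk argument, and can be slow on big files, so
it's for pairwise checks rather than all-against-all):

import difflib

def longest_common_block(path_a, path_b):
    # Returns (offset_in_a, offset_in_b, size) of the longest run of
    # bytes shared by the two files -- i.e. the common "middle".
    a = open(path_a, 'rb').read()
    b = open(path_b, 'rb').read()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return m.a, m.b, m.size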

More information on your actual use case would be helpful in narrowing
down the best options.

Peace,
~Simon
 
Andrew Robert

Simon said:
If you can, you definitely want to use diff. Otherwise, the difflib
standard library module may be of use to you. Also, since you're
talking about comparing many files to each other, you could pull out a
substring of one file and use the 'in' "operator" to check if that
substring is in another file. [...]

Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?


Utilizing the functions below may be of some help.


#!/usr/bin/python
#
# Function: generate and compare checksums on a file


import md5


def getsum(filename):
    """
    Generate the checksum based on received chunks of the file
    """
    md5sum = md5.new()
    f = open(filename, 'rb')  # binary mode so line endings don't alter the sum
    for block in getblocks(f):
        md5sum.update(block)
    f.close()
    return md5sum.hexdigest()


def getblocks(f, blocksize=1024):
    """
    Read the file in small chunks to avoid loading large files into memory
    """
    while True:
        s = f.read(blocksize)
        if not s:
            break
        yield s


def checksum_compare(caller, cs='', check='', filename=''):
    """
    Compare the generated and received checksum values
    """
    if cs != check:
        return 1  # compare failed
    else:
        return 0  # compare successful




--
Adversity: That which does not kill me only postpones the inevitable.
 
Simon Forman

Andrew said:
Would it be more efficient to checksum the files and then only diff the ones that fail a checksum compare?

The thing about a checksum algorithm is that if you call it with some
data, then change even one byte of the data and call it again, the
resulting checksums will be (should be) very different.

Checksumming the entire file contents won't help, but as bearophile
said: "If, for a given file, you can determine where its heading garbage
stops, then you can compute the signature by taking the Python hash of
some of the bytes that follow (30 or 200 bytes may suffice)."

If the OP can do that then yes, it would likely be more efficient
(although computing and comparing checksums on just 200 bytes or less
might not be a significant gain over simply comparing the strings
themselves). But if he can't, then the next best thing would be to take
a subsection of the file somewhere after the heading cruft and search
(using string 'in' string form) for that subsection in other files.
(Actually there may be a better option than that, but I'm just not
bright enough to have thought of it...)
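
Concretely, that approach might look something like this (sketch only:
PROBE_OFFSET and PROBE_LEN are made-up numbers, everything is read into
memory, and it is still O(n^2) comparisons, so it's only reasonable for
the smaller batches of files):

PROBE_OFFSET = 4096   # assumed to be safely past any leading cruft
PROBE_LEN = 64        # length of the chunk searched for in the other files

def probe(data):
    return data[PROBE_OFFSET:PROBE_OFFSET + PROBE_LEN]

def candidate_pairs(filenames):
    # Yield (a, b) pairs where file b contains file a's probe chunk;
    # those are the pairs worth diffing properly afterwards.
    contents = dict((fn, open(fn, 'rb').read()) for fn in filenames)
    for a in filenames:
        chunk = probe(contents[a])
        if not chunk:
            continue
        for b in filenames:
            if b != a and chunk in contents[b]:
                yield a, b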
Utilizing the functions below may be of some help.


def checksum_compare(caller, cs='', check='', filename=''):
    """
    Compare the generated and received checksum values
    """
    if cs != check:
        return 1  # compare failed
    else:
        return 0  # compare successful

I'm curious why you included this elaborate function with its unused
args (caller and filename), unnecessary defaults, and its odd
inversion of Boolean values...

How is "if not checksum_compare(something, sum0, sum1): #do
something..." any better than "if sum0 == sum1: #do something..." ?

Peace,
~Simon
 
Andrew Robert

Simon said:
I'm curious why you included this elaborate function with its unused
args (caller and filename), unnecessary defaults, and its odd
inversion of Boolean values...

How is "if not checksum_compare(something, sum0, sum1): #do
something..." any better than "if sum0 == sum1: #do something..." ?

Peace,
~Simon
Because I was lazy...

The checksum_compare came from something else I wrote that had special
logging and e-mailer calls in it.

Should have ripped the references to caller and filename out.
 
Simon Forman

Andrew said:
Because I was lazy...

The checksum_compare came from something else I wrote that had special
logging and e-mailer calls in it.

Should have ripped the references to caller and filename out.

Aaaahh the subtle joys of cut-and-paste programming... :-D

(I've done it myself... ;-)
 
