difflib and intelligent file differences

hayes.tyler · Mar 26, 2009

Hello All:

I am starting to work on a file comparison script where I have to
compare the contents of two large files. Originally I thought to just
sort on a numeric key, and use UNIX's comm to do a line by line
comparison. However, this would fail, hence my thinking that I really
should've just used Python from the start. Let me outline the problem.

Imagine two text files, f1 and f2,

f1 is
1
2
3
4
5

and f2 is

12
2
3
4
5

where each line can be thought of as a record, not a running sentence.
Okay, this one is easy, in fact, this is just a line by line
comparison using comm -3 f1 f2. BUT...
(and this is why I'm thinking of using Python's difflib to work on it)

Now say f1 is

1
2
3
4
5

and f2 is

2
3
4
5

The only difference in the *contents* is the record 1, but a line by
line comparison would report every line, because removing the first
line shifts all the others. So what I'm really looking for is not
a line by line comparison, but a comparison of the file contents.
Ideally, all I want to generate is a file containing the lines that
differ.

My first thought is to do a sweep: take one line from f1 and scan f2
for it. If it is found, delete it from a tmp version of f2; if not,
write it to a difference file. Then on to the second line, and so on.
At the end, any lines still left in the tmp version of f2 were never
matched, so those get appended to the difference file as well. The
result is a nice summary of the lines (i.e., records) which are found
in only one of the two files.

Any suggestions where to start?
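
The sweep described above can be sketched with collections.Counter, which treats each file as a multiset of lines, so a record that occurs twice in one file and once in the other is still reported. This is only a minimal sketch: the name content_diff is made up for illustration, and it assumes both files fit in memory.

```python
from collections import Counter

def content_diff(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as sorted lists, ignoring line order.

    content_diff is a hypothetical helper; it assumes both inputs fit in memory.
    """
    count_a = Counter(line.rstrip("\n") for line in lines_a)
    count_b = Counter(line.rstrip("\n") for line in lines_b)
    # Counter subtraction keeps only positive counts, i.e. the unmatched records.
    only_in_a = sorted((count_a - count_b).elements())
    only_in_b = sorted((count_b - count_a).elements())
    return only_in_a, only_in_b

if __name__ == "__main__":
    f1 = ["1\n", "2\n", "3\n", "4\n", "5\n"]
    f2 = ["2\n", "3\n", "4\n", "5\n"]
    print(content_diff(f1, f2))
```

For the second f1/f2 pair in the question this returns (['1'], []). Open file objects can be passed in place of the lists, since both are iterables of lines.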

Marco Mariani · Mar 26, 2009

hayes.tyler said:
My first thought is to do a sweep: take one line from f1 and scan f2
for it. If it is found, delete it from a tmp version of f2; if not,
write it to a difference file. Then on to the second line, and so on.
At the end, any lines still left in the tmp version of f2 were never
matched, so those get appended to the difference file as well. The
result is a nice summary of the lines (i.e., records) which are found
in only one of the two files.

Any suggestions where to start?

You can adapt and use this, provided the files are already sorted.
Memory usage scales linearly with the size of the file difference, and
time scales linearly with file sizes.

#!/usr/bin/env python

import sys

def run(fname_a, fname_b):
    filea = open(fname_a)
    fileb = open(fname_b)
    a_lines = set()   # lines seen so far only in file a
    b_lines = set()   # lines seen so far only in file b

    while True:
        a = filea.readline()
        b = fileb.readline()
        if not (a or b):
            break   # both files are exhausted

        if a == b:
            continue

        # a cancels an unmatched line from b, or becomes unmatched itself
        if a in b_lines:
            b_lines.remove(a)
        elif a:
            a_lines.add(a)

        # and the same for b against the unmatched lines from a
        if b in a_lines:
            a_lines.remove(b)
        elif b:
            b_lines.add(b)

    for line in a_lines:
        print(line)

    if a_lines or b_lines:
        print('')
        print('***************')
        print('')

    for line in b_lines:
        print(line)

if __name__ == '__main__':
    run(sys.argv[1], sys.argv[2])

Marco Mariani · Mar 26, 2009

BTW, watch out for this break. It might not be what you want :-/

hayes.tyler · Mar 26, 2009

Marco said:
BTW, watch out for this break. It might not be what you want :-/

HA! Just found it

Thanks,

t.

Dave Angel · Mar 26, 2009

First comment, have you looked at the standard module difflib? There's
a sample program diff.py located in tools\scripts that may do what
you need already. It finds the differences in context, and displays
them in a way that's frequently intuitive, showing you what's been
changed, and what's been added or removed. For example, if just one
line has been added, it would display a few lines in front of that one,
and the one line (with a leading +), and then a few lines after it. And
there are switches you can use to get different formatting of the results.

But back to your question, presumably doing it by hand. First question
I have is whether the file's lines are completely independent? For
example, each line is a record in a database, with order irrelevant. If
so, use something like Marco's code. If the files are not fully sorted,
you'll need to do a final pruning at the end, where you delete all
members in common between the two sets.

If the lines are not independent, then you might want to start with
something like difflib.Differ
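
For the order-dependent case, a minimal sketch using the standard difflib module might look like the following. difflib.unified_diff takes two lists of lines and yields diff lines; the wrapper name changed_lines is made up for illustration, and it keeps only the removed ('-') and added ('+') records, dropping the headers and hunk markers.

```python
import difflib
from itertools import islice

def changed_lines(lines_a, lines_b):
    """Return only the '-' and '+' records of a unified diff (hypothetical helper).

    Assumes the input lines carry no trailing newline.
    """
    # n=0 asks for no context lines; lineterm="" avoids appended newlines.
    diff = difflib.unified_diff(lines_a, lines_b, lineterm="", n=0)
    # The first two lines are the '---' / '+++' file headers; skip them,
    # then drop the '@@ ... @@' hunk markers.
    return [line for line in islice(diff, 2, None)
            if not line.startswith("@@")]

if __name__ == "__main__":
    f1 = ["1", "2", "3", "4", "5"]
    f2 = ["2", "3", "4", "5"]
    print(changed_lines(f1, f2))
```

Note that difflib holds both inputs in memory and compares sequences, so it respects line order; it is a fit when the position of a record matters, not when the files are unordered bags of records.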

Dave Angel · Mar 26, 2009

If the lines are really sorted, all you really need is a merge, where
you read one line from each source, and if equal, read another from
each. If one line is less than the other, output the lesser line with
an appropriate tag, and refresh that one from its source. Stop when
either source has run out, and then flush the rest of the other source
to the output, with an appropriate tag.

Time is linear, and memory use negligible.
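
The merge described above could be sketched as a generator like the one below. It is an illustration under assumptions, not a drop-in tool: sorted_diff is a made-up name, both inputs must already be sorted in the order that Python's string comparison uses (so numeric keys would need zero-padding or a key conversion), and the '-' and '+' tags are just one possible choice.

```python
def sorted_diff(lines_a, lines_b):
    """Yield ('-', line) for records only in a, ('+', line) for records only in b.

    Hypothetical helper; assumes both inputs are sorted by plain string order.
    """
    iter_a, iter_b = iter(lines_a), iter(lines_b)
    a = next(iter_a, None)
    b = next(iter_b, None)
    while a is not None and b is not None:
        if a == b:
            # Same record on both sides: advance both sources.
            a = next(iter_a, None)
            b = next(iter_b, None)
        elif a < b:
            yield ("-", a)
            a = next(iter_a, None)
        else:
            yield ("+", b)
            b = next(iter_b, None)
    # One side is exhausted: flush what is left on the other side,
    # including the line that was already read but not yet reported.
    while a is not None:
        yield ("-", a)
        a = next(iter_a, None)
    while b is not None:
        yield ("+", b)
        b = next(iter_b, None)

if __name__ == "__main__":
    for tag, line in sorted_diff(["a", "c"], ["b", "c", "d"]):
        print(tag + line)
```

Only one line from each source is held at a time, which is where the negligible memory use comes from.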

Marco Mariani · Mar 26, 2009

Dave said:
If the lines are really sorted, all you really need is a merge,

D'oh. Right. The posted code works on unsorted files. The sorted case is
even simpler as you pointed out.

Steven D'Aprano · Mar 26, 2009

hayes.tyler said:
Hello All:

I am starting to work on a file comparison script where I have to
compare the contents of two large files. ....
(and this is why I'm thinking of using
Python's difflib to work on it) ....
Any suggestions where to start?

Python's difflib.

Marco Mariani · Mar 26, 2009

For the archives, and for huge files where /usr/bin/diff or difflib are
not appropriate, here it is.

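#!/usr/bin/env python

import sys

def run(filea, fileb):
    # p is a bitmask of which file(s) to advance: 1 = a, 2 = b, 3 = both.
    # Both files must be sorted in plain string order.
    p = 3
    a = b = ''
    while True:
        if p & 1: a = filea.readline()
        if p & 2: b = fileb.readline()
        if not a or not b:
            break
        elif a == b:
            p = 3
        elif a < b:
            sys.stdout.write('-%s' % a)
            p = 1
        else:
            sys.stdout.write('+%s' % b)
            p = 2

    # One file is exhausted. The line still held from the other file has
    # not been reported yet, so emit it before flushing the rest.
    if a:
        sys.stdout.write('-%s' % a)
    for line in filea:
        sys.stdout.write('-%s' % line)

    if b:
        sys.stdout.write('+%s' % b)
    for line in fileb:
        sys.stdout.write('+%s' % line)

if __name__ == '__main__':
    run(open(sys.argv[1]), open(sys.argv[2]))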

hayes.tyler · Mar 26, 2009

Thanks for all of your suggestions. Turns out Marco's first version
was really the one I needed.

Thanks again,

t.
