Compare 2 files and discard common lines



I have a requirement to compare 2 text files and write to a 3rd file
only those lines that appear in the 2nd file but not in the 1st file.

Rather than re-invent the wheel I am wondering if anyone has written
anything already?


You can use the cmp(x, y) function (Python 2 only) to compare two
strings.

cmp('spam', 'eggs') returns 1: 'spam' is greater than 'eggs' because
strings are compared character by character and 's' sorts after 'e'.
Swapping the two gives -1,
and comparing 'eggs' with 'eggs' gives 0.

Is that what you were looking for?

Stefan Behnel

loial said:
I have a requirement to compare 2 text files and write to a 3rd file
only those lines that appear in the 2nd file but not in the 1st file.

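import sys

# Collect the lines of file1, then print each line of file2 that file1 lacks.
lines_in_file1 = set(open("file1"))
for line in open("file2"):
    if line not in lines_in_file1:
        sys.stdout.write(line)  # line already ends with a newline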



How large are the files? You could load the smaller file into
memory, then, while iterating over the other one, just do 'if line in
other_files_lines:' and do your processing from there. From your
description it doesn't sound like you want to iterate over both files
simultaneously and compare them line by line, because then a single
extra newline somewhere would throw every later comparison off.
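
A minimal sketch of that idea, assuming the first file fits in memory (it is the one loaded here, since the goal is to find lines missing from it). The function name and its parameters are made up for illustration:

```python
def write_lines_only_in_second(first_path, second_path, out_path):
    """Copy to out_path each line of second_path that is absent from first_path."""
    with open(first_path) as first:
        seen = set(first)  # one pass over the file; a set gives fast 'in' tests
    written = 0
    with open(second_path) as second, open(out_path, "w") as out:
        for line in second:
            if line not in seen:
                out.write(line)
                written += 1
    return written  # number of lines copied
```

Lines are compared exactly, including the trailing newline, so a final line without a newline will not match the same text followed by one.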


Another way of doing this might be to use the difflib module to
calculate the differences. It is built around a SequenceMatcher class,
which has the method get_matching_blocks().

difflib is included with Python.
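
As a rough sketch of how get_matching_blocks() could be used here: everything in the second sequence that falls outside a matching block is treated as "only in the second file". The helper name and the sample data are hypothetical.

```python
import difflib


def lines_only_in_second(first_lines, second_lines):
    """Return the lines of second_lines that lie outside every matching block."""
    matcher = difflib.SequenceMatcher(None, first_lines, second_lines,
                                      autojunk=False)
    result = []
    pos = 0  # next index in second_lines not yet examined
    for block in matcher.get_matching_blocks():
        # block.b is where this matched run starts in second_lines;
        # the final block is a zero-length sentinel at the end of both inputs.
        result.extend(second_lines[pos:block.b])
        pos = block.b + block.size
    return result


first = ["a\n", "b\n", "c\n"]
second = ["a\n", "x\n", "c\n", "y\n"]
print(lines_only_in_second(first, second))
```

Note that difflib matches by position, not by membership: a line that exists in the first file but has moved may still be reported, so this answers a slightly different question than the set-based approaches.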


set(file_2_recs).difference(set(file_1_recs)) will give the recs in
file_2 that are not in file_1, provided you can hold both files in
memory. Sets are hash-based, so membership tests are much faster than
with lists.
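
A small illustration with made-up records (the two lists stand in for file_1_recs and file_2_recs):

```python
first_recs = ["apple\n", "banana\n"]
second_recs = ["cherry\n", "apple\n", "cherry\n", "date\n"]

# difference() accepts any iterable, so first_recs need not be a set already
only_in_second = set(second_recs).difference(first_recs)

# sorted() is only for a stable display; the set itself has no order
print(sorted(only_in_second))  # ['cherry\n', 'date\n']
```

The result keeps neither the original line order nor duplicates ("cherry" appears once), which may or may not matter for the output file.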

BJörn Lindqvist

open('3rd', 'w').writelines(set(open('2nd').readlines())-set(open('1st')))

Gabriel Genellina

2008/5/29 said:

Is the asymmetry 1st/2nd intentional? I think one could omit .readlines()
for the 2nd file too.
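
A quick check of that point, using io.StringIO as a stand-in for a real file (the sample text is made up): iterating a file object yields the same lines that readlines() returns.

```python
import io

text = "one\ntwo\nthree\n"

via_readlines = set(io.StringIO(text).readlines())
via_iteration = set(io.StringIO(text))  # set() iterates the file line by line

print(via_readlines == via_iteration)  # True
```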

Paul McGuire

Take the time to learn difflib - it is a standard module, and good for
general comparison of files, sequences, etc.

-- Paul


Of course you can do this at any Linux or Unix command line (as long as
both files are sorted, which comm requires) simply by:

comm -13 file1 file2 >file3
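
For anyone without comm at hand, here is a rough Python sketch of the same merge-style walk. It assumes both inputs are already sorted by plain string comparison (comm itself follows the locale's collation order), and the function name is invented:

```python
def comm_13(first_sorted, second_sorted):
    """Yield lines unique to second_sorted; both inputs must be sorted."""
    first_iter = iter(first_sorted)
    current = next(first_iter, None)  # None means the first input is exhausted
    for line in second_sorted:
        # skip lines of the first input that sort before this one
        while current is not None and current < line:
            current = next(first_iter, None)
        if current is not None and current == line:
            # common line: consume its partner so duplicates pair up one to one
            current = next(first_iter, None)
        else:
            yield line


print(list(comm_13(["a", "c", "e"], ["a", "b", "c", "d"])))  # ['b', 'd']
```

Because only one line from each input is held at a time, this works on files too large to load into a set.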


It's so easy to do that it won't count as reinventing the wheel:

a = open('a.txt', 'r').read().split('\n')
b = open('b.txt', 'r').read().split('\n')
c = open('c.txt', 'w')
# keep each line of b that does not occur anywhere in a
c.write('\n'.join([line for line in b if line not in a]))
c.close()

It's not the fastest way to do it, since every 'in' test scans the whole
list, but it works.
