comparing huge files

s99999999s2003

hi
I wrote some code to compare two files. One is the base file; the other
is a file I got from somewhere else. I need to compare this file against
the base.

e.g. base file
abc
def
ghi

e.g. another file
abc
def
ghi
jkl

After the compare, the base file will be overwritten with "jkl". Also,
both files tend to grow to more than 20 MB...

Here is my code...using difflib.

import difflib
import os
import re

pat = re.compile(r'^\+')  ## i want to get rid of the '+' from the difflib output...

def difference(filename, basename):
    base = open(basename)
    a = base.readlines()
    input = open(filename)
    b = input.readlines()
    d = difflib.Differ()
    diff = list(d.compare(a, b))
    if len(diff) > 0:
        os.remove(basename)
        o = open(basename, "w")  ## "aU" is not a valid write mode; plain "w" will do
        for i in diff:
            if pat.search(i):
                o.write(i[2:])  ## drop the leading "+ " and write a new base file...
        o.close()
    g = open(basename)
    return g.readlines()

Whenever the two files get very large, I find that the comparison is
very slow... any good advice to speed things up? I thought of removing
the readlines() method and comparing line by line instead. Is that a
better way?
thanks
 

James Stroud


It seems like you want a new base that contains only those lines
contained in 'filename' that are not contained in 'basename', where
'basename' is an ordered subset of 'filename'. In other words, the
'filename' file has all of the lines of 'basename' in order somewhere,
but 'filename' has some additional lines. Is that correct? difflib looks
to be overkill for this. Here is a suggestion:


basefile = open(basename)
newfile = open(filename)
baseiter = basefile.xreadlines()
newiter = newfile.xreadlines()

newbase = open('tmp.txt', 'w')

for baseline in baseiter:
    for newline in newiter:
        if baseline != newline:
            newbase.write(newline)
        else:
            break

# anything left in the new file after the last base line has matched
# is also new, so write it out as well
for newline in newiter:
    newbase.write(newline)

for afile in (basefile, newfile, newbase): afile.close()


If 'basename' is not an ordered subset of 'filename', then difflib seems
to be your best bet because you have a computationally intensive problem.


James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 

s99999999s2003

Thanks for the reply.
I have used another method to solve my problem (sketched below), i.e.:
1) get the total line count of the first file
2) write this total count out as the base count, e.g. basecnt
3) get the other file and get its total line count, e.g. filecnt
4) if filecnt > basecnt, read in the lines from file[basecnt:filecnt]
5) if filecnt < basecnt, overwrite the original basecnt and start over
again.

Basically, the problem domain is that I want to get the most recent
records from a log file to review every 3 hours, so this log file will
keep growing or accumulating.
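
A minimal sketch of that count-based approach, assuming the log file is
only ever appended to between runs (the file names 'current.log' and
'basecnt.txt' are made up for illustration):

import os

def new_records(logname='current.log', cntname='basecnt.txt'):
    # read the previously stored line count, defaulting to 0 on the first run
    if os.path.exists(cntname):
        basecnt = int(open(cntname).read().strip() or 0)
    else:
        basecnt = 0

    lines = open(logname).readlines()
    filecnt = len(lines)

    if filecnt < basecnt:
        basecnt = 0   # the log was rotated or truncated: start over

    fresh = lines[basecnt:filecnt]   # only the records added since last time

    # remember the new count for the next run
    open(cntname, 'w').write(str(filecnt))
    return fresh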
 

Frithiof Andreas Jensen


I did this:

import os

fp = os.popen('/usr/sbin/logtail /var/log/syslog')
loglines = fp.readlines()

.... pyparsing ... stuff .... from loglines
;-)

Python is maybe overkill too - have "cron" call "logtail" and pipe the
output wherever?

PS:

"logtail" is very simple, it works simply by maintaining a "bookmark" from
the last read that is updated after each time the file is read (i.e. on each
call). It is probably a very easy thing to implement in Python. On
Linux/UNIX syslog+logutils can do a lot of work just by configuration (but
you did not say you are on unix)
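
A bookmark-style reader along those lines might look roughly like the
following in Python (the bookmark file name 'logtail.offset' is made up
for illustration, and log rotation is handled only crudely):

import os

def tail_since_last_read(logname, markname='logtail.offset'):
    # read the previous byte offset ("bookmark"), defaulting to 0 on the first run
    try:
        offset = int(open(markname).read().strip())
    except (IOError, ValueError):
        offset = 0

    f = open(logname)
    f.seek(0, 2)               # jump to the end to find the current size
    size = f.tell()
    if size < offset:
        offset = 0             # the log was rotated or truncated: start over
    f.seek(offset)
    new_lines = f.readlines()  # everything appended since the last call
    offset = f.tell()          # the new bookmark is where we stopped reading
    f.close()

    # remember the bookmark for the next call
    open(markname, 'w').write(str(offset))
    return new_lines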
 
