How to remove subset from a file efficiently?

fynali · Jan 14, 2006

$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python

#import psyco
#psyco.full()

postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco', 'w')

barred = {}

for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
    barred[number] = None  # just add it as a key

for number in postpaid_file:
    if number not in barred:
        outfile.write(number)  # each line is a single string

postpaid_file.close()
outfile.close()

--
$ time ./cleanup_use_psyco_and_list_compr.py

real 0m22.587s
user 0m21.653s
sys 0m0.440s

Not using psyco is faster!

Raymond Hettinger · Jan 14, 2006

import itertools

b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))

file('PSP-CBR.dat,ray', 'w').writelines(
    itertools.ifilterfalse(b.__contains__,
                           file('/home/sajid/python/wip/stc/2/PSP0000333')))

--
$ time ./cleanup_ray.py

real 0m5.451s
user 0m4.496s
sys 0m0.428s

(-: Damn! That saves a bit more time! Bravo!

Have you tried the explicit loop variant with psyco? In my experience,
psyco is pretty good at optimizing for loops, which usually results
in faster code than even the built-in map/filter variants.

Though it would only be a 1 or 2 second difference (given what you already
have), so it may not be important, but it could be fun.

The code is pretty tight and is now most likely I/O bound. If so,
further speed-ups will be hard to come by (even with psyco). The four
principal steps of reading, membership testing, filtering, and writing
are all C coded methods which are directly linked together with no
interpreter loop overhead or method lookups. Hard to beat.
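
For readers on Python 3, where `file()` and `itertools.ifilterfalse` no longer exist, the same four steps can be sketched roughly as below. This is only an illustration under assumptions: the function name `remove_subset`, the temporary file names, and the sample numbers are made up for the demo and are not from the thread.

```python
import itertools
import os
import tempfile


def remove_subset(source_path, barred_path, out_path):
    """Write every line of source_path that does not appear in barred_path."""
    with open(barred_path) as barred_file:
        barred = set(barred_file)  # one pass over the file builds the lookup set
    with open(source_path) as source, open(out_path, "w") as out:
        # filterfalse keeps the lines for which barred.__contains__ is False
        out.writelines(itertools.filterfalse(barred.__contains__, source))


# Tiny demonstration with made-up numbers in a temporary directory.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "postpaid.txt")
barred_path = os.path.join(workdir, "barred.txt")
out_path = os.path.join(workdir, "result.txt")

with open(source_path, "w") as f:
    f.write("100\n200\n300\n400\n")
with open(barred_path, "w") as f:
    f.write("200\n400\n")

remove_subset(source_path, barred_path, out_path)

with open(out_path) as f:
    result = f.read()
print(result)
```

Note that lines are compared including their trailing newline, so a final line without a newline in one file will not match the same number followed by a newline in the other.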

Bengt Richter · Jan 14, 2006

Have you tried the explicit loop variant with psyco? In my experience,
psyco is pretty good at optimizing for loops, which usually results
in faster code than even the built-in map/filter variants.

Though it would only be a 1 or 2 second difference (given what you already
have), so it may not be important, but it could be fun.

OTOH, when you are dealing with large files and near-optimal simple processing, you are
likely to be comparing I/O-bound processes, meaning the differences observed
will be symptoms of OS and file system performance more than of the algorithms.

An exception is when a slight variation in the algorithm causes a large change
in I/O performance, for example by producing physical seek and read patterns on disk
that the OS, file system, and disk interface hardware can't entirely optimize out
with smart buffering etc. There are also possible interactions with everything else
the OS may be doing "simultaneously" while it switches between tasks, all of which shows up in the real/user/sys figures.

Regards,
Bengt Richter
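
One rough way to see whether a run is I/O-bound, in the spirit of the real/user/sys figures above, is to compare wall-clock time with CPU time from inside the script. The sketch below is an assumption-laden illustration in Python 3: the helper names `timed` and `count_lines` and the generated test file are hypothetical, and on a small file served from the OS cache the two numbers will be close.

```python
import os
import tempfile
import time


def timed(func, *args):
    """Return (result, wall_seconds, cpu_seconds) for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu = time.process_time() - cpu_start    # user + sys CPU time of this process
    wall = time.perf_counter() - wall_start  # elapsed ("real") time
    return result, wall, cpu


def count_lines(path):
    """Read the file line by line and count the lines."""
    with open(path) as f:
        return sum(1 for _ in f)


# Build a throwaway file of made-up numbers to read back.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.writelines("%d\n" % n for n in range(100000))

lines, wall, cpu = timed(count_lines, path)
os.remove(path)
print("lines=%d wall=%.4fs cpu=%.4fs" % (lines, wall, cpu))
```

If wall time is much larger than CPU time, the process spent most of its time waiting (typically on the disk), and changes to the algorithm are unlikely to help much.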
