How to remove a subset from a file efficiently?

fynali

$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python

#import psyco
#psyco.full()

postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco', 'w')

# Collect the barred numbers as dict keys for constant-time lookups.
barred = {}

for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
    barred[number] = None  # just add it as a key

# Copy every postpaid line that is not barred.
for number in postpaid_file:
    if number not in barred:
        outfile.write(number)

postpaid_file.close()
outfile.close()

--
$ time ./cleanup_use_psyco_and_list_compr.py

real 0m22.587s
user 0m21.653s
sys 0m0.440s

Not using psyco is faster!
 
Raymond Hettinger

import itertools

b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))
file('PSP-CBR.dat,ray', 'w').writelines(
    itertools.ifilterfalse(b.__contains__,
                           file('/home/sajid/python/wip/stc/2/PSP0000333')))

--
$ time ./cleanup_ray.py

real 0m5.451s
user 0m4.496s
sys 0m0.428s

(-: Damn! That saves a bit more time! Bravo!
[[email protected]] wrote:
Have you tried the explicit loop variant with psyco? In my experience
psyco is pretty good at optimizing for loops, which usually results
in faster code than even the built-in map/filter variants.

It would only be a 1 or 2 second difference (given what you already
have), so it may not be important, but it could be fun.

The code is pretty tight and is now most likely I/O bound. If so,
further speed-ups will be hard to come by (even with psyco). The four
principal steps of reading, membership testing, filtering, and writing
are all C coded methods which are directly linked together with no
interpreter loop overhead or method lookups. Hard to beat.
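
For readers on Python 3, where `file()` and `itertools.ifilterfalse` no longer exist, the same four-step pipeline can be sketched as follows. This is a minimal sketch, not the code that was timed above: the function name `remove_subset` and its parameters are hypothetical, and it assumes both files hold one entry per line with identical line endings (including a trailing newline on the last line), since lines are compared verbatim.

```python
import itertools


def remove_subset(source_path, barred_path, out_path):
    """Copy the lines of source_path that do not appear in barred_path.

    Hypothetical helper; lines are compared verbatim, newline included.
    """
    with open(barred_path) as barred_file:
        barred = set(barred_file)  # one set entry per line
    with open(source_path) as source, open(out_path, "w") as out:
        # filterfalse keeps the lines for which barred.__contains__ is False
        out.writelines(itertools.filterfalse(barred.__contains__, source))
```

Reading, membership testing, filtering, and writing are still chained without an explicit Python-level loop, which is the property described above.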
 
Bengt Richter

Have you tried the explicit loop variant with psyco? In my experience
psyco is pretty good at optimizing for loops, which usually results
in faster code than even the built-in map/filter variants.

It would only be a 1 or 2 second difference (given what you already
have), so it may not be important, but it could be fun.

OTOH, when you are dealing with large files and near-optimal simple
processing, you are likely to be comparing I/O-bound processes, meaning
that the differences observed will be symptoms of OS and file system
performance more than of the algorithms.

An exception is when a slight variation in the algorithm causes a large
change in I/O performance, for example by producing physical seek and read
patterns of disk access that the OS/file system and disk interface hardware
can't entirely optimize out with smart buffering etc. Not to mention possible
interactions with all the other things an OS may be doing "simultaneously",
switching between things that it accounts for as real/user/sys.

Regards,
Bengt Richter
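
One rough way to check whether a run is I/O bound, in the spirit of the real/user/sys figures from `time` quoted earlier in the thread, is to compare wall-clock time with CPU time from inside Python. The following is a minimal sketch for Python 3; `timed` is a hypothetical helper, not something used in this thread. A wall-clock figure well above the CPU figure suggests the process spent much of its time waiting, for instance on disk.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + sys CPU time of this process
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# time.sleep waits without using the CPU, so wall time far exceeds CPU time.
_, wall, cpu = timed(time.sleep, 0.2)
print("wall %.3fs, cpu %.3fs" % (wall, cpu))
```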
 
