How to remove subset from a file efficiently?

F

fynali

Hi all,

I have two files:

- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred bumbers)

# head PSP0000320.dat CBR0000319.dat
==> PSP0000320.dat <==
96653696338
96653766996
96654609431
96654722608
96654738074
96655697044
96655824738
96656190117
96656256762
96656263751

==> CBR0000319.dat <==
96651131135
96651131135
96651420412
96651730095
96652399117
96652399142
96652399142
96652399142
96652399160
96652399271

Objective: to remove the numbers present in barred-list from the
PSPfile.

$ ls -lh PSP0000320.dat CBR0000319.dat
... 56M Dec 28 19:41 PSP0000320.dat
... 8.6M Dec 28 19:40 CBR0000319.dat

$ wc -l PSP0000320.dat CBR0000319.dat
4,462,603 PSP0000320.dat
693,585 CBR0000319.dat

I wrote the following in python to do it:

#: c01:rmcommon.py
barredlist = open(r'/home/sjd/python/wip/CBR0000319.dat', 'r')
postlist = open(r'/home/sjd/python/wip/PSP0000320.dat', 'r')
outfile = open(r'/home/sjd/python/wip/PSP-CBR.dat', 'w')

# reading it all in one go, so as to avoid frequent disk accesses
(assume machine has plenty memory)
barredlist.read()
postlist.read()

#
for number in postlist:
if number in barrlist:
pass
else:
outfile.write(number)

barredlist.close(); postlist.close(); outfile.close()
#:~

The above code simply takes too long to complete. If I were to do a
diff -y PSP0000320.dat CBR0000319.dat, catch the '<' & clean it up with
sed -e 's/\([0-9]*\) *</\1/' > PSP-CBR.dat it takes <4 minutes to
complete.

I wrote the following in bash to do the same:

#!/bin/bash

ARGS=2

if [ $# -ne $ARGS ] # takes two arguments
then
echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
echo; echo " eg.: `basename $0` PSP0000320.dat
CBR0000319.dat"; echo;
echo "NOTE: first argument: PSP file, second: CBR file";
echo " this script _does_ no_ input validation!"
exit 1
fi;

# fix prefix; cost: 12.587 secs
cat $1 | sed -e 's/^0*/966/' > $1.good
cat $2 | sed -e 's/^0*/966/' > $2.good

# sort/save files; for the 4,462,603 lines, cost: 36.589 secs
sort $1.good > $1.sorted
sort $2.good > $2.sorted

# diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
diff -y $1.sorted $2.sorted | grep "<" > $1.filtered

# remove trailing junk [spaces & <]; cost: 1 min 3 secs
cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned

# remove intermediate files, good, sorted, filtered
rm -f *.good *.sorted *.filtered
#:~

....but strangely though, there's a discrepancy, the reason for which I
can't figure out!

Needless to say, I'm utterly new to python and my programming skills &
know-how are rudimentary.

Any help will be genuinely appreciated.
 
R

Raymond Hettinger

[fynali]
I have two files:

- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred bumbers)

# print all non-barred mobile phone numbers
barred = set(open('CBR0000319.dat'))
for num in open('PSP0000320.dat'):
if num not in barred:
print num,

Raymond
 
A

AJL

Hi all,

I have two files:

- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred bumbers)
....

Objective: to remove the numbers present in barred-list from the
PSPfile.

How fast does this run?

a = set(file('PSP0000320.dat'))
b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat', 'w').writelines(a.difference(b))

AJL
 
M

Mike Meyer

fynali said:
Hi all,

I have two files:

Others have pointed out the Python solution - use a set instead of a
list for membership testing. I want to point out a better Unix
solution ('cause I probably wouldn't have written a Python program to
do this):
Objective: to remove the numbers present in barred-list from the
PSPfile.

$ ls -lh PSP0000320.dat CBR0000319.dat
... 56M Dec 28 19:41 PSP0000320.dat
... 8.6M Dec 28 19:40 CBR0000319.dat

$ wc -l PSP0000320.dat CBR0000319.dat
4,462,603 PSP0000320.dat
693,585 CBR0000319.dat

I wrote the following in bash to do the same:

#!/bin/bash

ARGS=2

if [ $# -ne $ARGS ] # takes two arguments
then
echo; echo "Usage: `basename $0` {PSPfile} {CBRfile}"
echo; echo " eg.: `basename $0` PSP0000320.dat
CBR0000319.dat"; echo;
echo "NOTE: first argument: PSP file, second: CBR file";
echo " this script _does_ no_ input validation!"
exit 1
fi;

# fix prefix; cost: 12.587 secs
cat $1 | sed -e 's/^0*/966/' > $1.good
cat $2 | sed -e 's/^0*/966/' > $2.good

# sort/save files; for the 4,462,603 lines, cost: 36.589 secs
sort $1.good > $1.sorted
sort $2.good > $2.sorted

# diff -y {PSP} {CBR}, grab the ones in PSPfile; cost: 31.817 secs
diff -y $1.sorted $2.sorted | grep "<" > $1.filtered

# remove trailing junk [spaces & <]; cost: 1 min 3 secs
cat $1.filtered | sed -e 's/\([0-9]*\) *</\1/' > $1.cleaned

# remove intermediate files, good, sorted, filtered
rm -f *.good *.sorted *.filtered
#:~

...but strangely though, there's a discrepancy, the reason for which I
can't figure out!

The above script can be shortened quite a bit:

#!/bin/sh

comm -23 <(sed 's/^0*/966/' $1 | sort) <(sed 's/^0*/966/ $2 | sort)

Will output only lines that occur in $1. It also runs the seds and
sorts in parallel, which can make a significant difference in the
clock time it takes to get the job done.

The Python version is probably faster, since it doesn't sort the
data.

<mike
 
C

Christopher Weimann

- PSP0000320.dat (quite a large list of mobile numbers),
- CBR0000319.dat (a subset of the above, a list of barred bumbers)

fgrep -x -v -f CBR0000319.dat PSP0000320.dat > PSP-CBR.dat
 
R

Raymond Hettinger

AJL said:
How fast does this run?

a = set(file('PSP0000320.dat'))
b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat', 'w').writelines(a.difference(b))

Turning PSP into a set takes extra time, consumes unnecessary memory,
eliminates duplicates (possibly a bad thing), and loses the original
input ordering (probably a bad thing).

To jam the action into a couple lines, try this:

b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat','w').writelines(itertools.ifilterfalse(b.__contains__,file('PSP0000320.dat')))

Raymond
 
A

AJL

Turning PSP into a set takes extra time, consumes unnecessary memory,
eliminates duplicates (possibly a bad thing), and loses the original
input ordering (probably a bad thing).

To jam the action into a couple lines, try this:

b = set(file('CBR0000319.dat'))
file('PSP-CBR.dat','w').writelines(itertools.ifilterfalse(b.__contains__,file('PSP0000320.dat')))

Raymond

The OP said "assume machine has plenty memory". ;)

I saw some solutions that used sets and was wondering why they stopped
at using a set for the first file and not the second when the problem is
really a set problem but I can see the reasoning behind it now.

AJL
 
F

fynali

$ time fgrep -x -v -f CBR0000333 PSP0000333 > PSP-CBR.dat.fgrep

real 0m31.551s
user 0m16.841s
sys 0m0.912s

--
$ time ./cleanup.py

real 0m6.080s
user 0m4.836s
sys 0m0.408s

--
$ wc -l PSP-CBR.dat.fgrep PSP-CBR.dat.python
3872421 PSP-CBR.dat.fgrep
3872421 PSP-CBR.dat.python

Fantastic, at any rate the time is down from my initial ~4 min.!

Thank you Chris. The fgrep approach is clean and to the point; and one
more reason to love the *nix approach to handling everyday problems.

Fredrik's set|dict approach in Python above gives me one more reason to
love Python. And it is indeed fast, 5x!

Thank you all for all your help.
 
F

fynali

$ cat cleanup_ray.py
#!/usr/bin/python
import itertools

b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))

file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfalse(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))

--
$ time ./cleanup_ray.py

real 0m5.451s
user 0m4.496s
sys 0m0.428s

(-: Damn! That saves a bit more time! Bravo!

Thanks to you Raymond.
 
B

bonono

fynali said:
$ cat cleanup_ray.py
#!/usr/bin/python
import itertools

b = set(file('/home/sajid/python/wip/stc/2/CBR0000333'))

file('PSP-CBR.dat,ray','w').writelines(itertools.ifilterfalse(b.__contains__,file('/home/sajid/python/wip/stc/2/PSP0000333')))

--
$ time ./cleanup_ray.py

real 0m5.451s
user 0m4.496s
sys 0m0.428s

(-: Damn! That saves a bit more time! Bravo!

Thanks to you Raymond.
Have you tried the explicit loop variant with psyco ? My experience is
that psyco is pretty good at optimizing for loop which usually results
in faster code than even built-in map/filter variant.

Though it would just be 1 or 2 sec difference(given what you already
have) so may not be important but could be fun.
 
F

fynali

--
$ ./cleanup.py
Traceback (most recent call last):
File "./cleanup.py", line 3, in ?
import itertools
ImportError: No module named itertools

--
$ time ./cleanup.py
File "./cleanup.py", line 8
outfile.writelines(number for number in postpaid_file if number
not in barred)
^
SyntaxError: invalid syntax

The earlier results I posted were run on my workstation which has
Python 2.4.1,

$ uname -a && python -V
Linux sajid 2.6.13-15.7-smp #1 SMP
Tue Nov 29 14:32:29 UTC 2005 i686 i686 i386 GNU/Linux
Python 2.4.1

but the server on which the actual processing will be done has an older
version )-:

$ uname -a && python -V
Linux cactus 2.4.21-20.ELsmp #1 SMP
Wed Aug 18 20:46:40 EDT 2004 i686 i686 i386 GNU/Linux
Python 2.2.3

Is a rewrite possible of Raymond's or Fredrik's suggestions above which
will still give me the time saving made?
 
F

fynali

[bonono]
Have you tried the explicit loop variant with psyco ?

Sure I wouldn't mind trying; can you suggest some code snippets along
the lines of which I should try...?

[fynali]
> Needless to say, I'm utterly new to python and my programming
> skills & know-how are rudimentary.

(-:
 
F

Fredrik Lundh

fynali said:
Is a rewrite possible of Raymond's or Fredrik's suggestions above which
will still give me the time saving made?

Python 2.2 don't have a readymade set type (new in 2.3), and it doesn't
support generator expressions (the thing that caused the syntax error).

however, using a dictionary instead of the set

barred = {}
for number in open(open('/home/sjd/python/wip/CBR0000319.dat')):
barred[number] = None # just add it as a key

and a list comprehension instead of the generator expression

outfile.writelines([number for number in infile if number not in barred])

(note the extra brackets)

should give you decent performance under 2.2.

</F>
 
B

bonono

fynali said:
[bonono]
Have you tried the explicit loop variant with psyco ?

Sure I wouldn't mind trying; can you suggest some code snippets along
the lines of which I should try...?

[fynali]
Needless to say, I'm utterly new to python and my programming
skills & know-how are rudimentary.

(-:

just add :

import psyco
psyco.full()

to the beginning of the code from Fredrik for 2.2(the one using dict
and list comprehension). You system may not have the psyco module
though.
 
F

fynali

$ cat cleanup.py
#!/usr/bin/python

postpaid_file = open('/home/oracle/stc/test/PSP0000333')
outfile = open('/home/oracle/stc/test/PSP-CBR.dat', 'w')

barred = {}

for number in open('/home/oracle/stc/test/CBR0000333'):
barred[number] = None # just add it as a key

outfile.writelines([number for number in postpaid_file if number
not in barred])

postpaid_file.close(); outfile.close()

--
$ time ./cleanup.py

real 0m31.007s
user 0m24.660s
sys 0m3.550s

Can we say that using generators & newer Python _is_ faster?
 
F

fynali

$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python

import psyco
psyco.full()

postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')

barred = {}

for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key

outfile.writelines([number for number in postpaid_file if number
not in barred])

postpaid_file.close(); outfile.close()

--
$ time ./cleanup_use_psyco_and_list_compr.py

real 0m39.638s
user 0m5.532s
sys 0m0.868s

This was run on my machine (w/ Python 2.4.1), can't install psyco on
the actual server at the moment.

I guess using generators & newer Python is indeed faster|better.
 
B

bonono

fynali said:
$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python

import psyco
psyco.full()

postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')

barred = {}

for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key

outfile.writelines([number for number in postpaid_file if number
not in barred])

postpaid_file.close(); outfile.close()

--
$ time ./cleanup_use_psyco_and_list_compr.py

real 0m39.638s
user 0m5.532s
sys 0m0.868s

This was run on my machine (w/ Python 2.4.1), can't install psyco on
the actual server at the moment.

I guess using generators & newer Python is indeed faster|better.
um, strange, so psyco is slower than not using it ?

you may try to expand the list comprehension to :

for number in postpaid_file:
if number not in barred: outfile.writelines(number)
 
F

fynali

$ cat cleanup_use_psyco_and_list_compr.py
#!/usr/bin/python

import psyco
psyco.full()

postpaid_file = open('/home/sajid/python/wip/stc/2/PSP0000333')
outfile = open('/home/sajid/python/wip/stc/2/PSP-CBR.dat.psyco',
'w')

barred = {}

for number in open('/home/sajid/python/wip/stc/2/CBR0000333'):
barred[number] = None # just add it as a key

for number in postpaid_file:
if number not in barred: outfile.writelines(number)

postpaid_file.close(); outfile.close()

--
$ time ./cleanup_use_psyco_and_list_compr.py

real 0m24.293s
user 0m22.633s
sys 0m0.524s

Saves ~6 secs.
 
B

bonono

fynali said:
Sorry, pls read that ~15 secs.
That is more or less about it. As set() is faster than dict(), about 2x
on my machine and I assume a portion of your time is in set/dict
creation as it is pretty large data set.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,011
Latest member
AjaUqq1950

Latest Threads

Top