[snip some damn lie aka "benchmark"]
[me]
Also you only have 1000 entries in B!
Try it again with all entries in B also ;-)
Remember the original poster had 100K entries!
Well, that's the closest I can do:
$ py
Python 2.4c1 (#3, Nov 26 2004, 23:39:44)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> alist=[line.strip() for line in open('/usr/share/dict/words')]
>>> words=set()
>>> for word in alist:
...     words.add(word + '\n')
...     words.add(word[::-1] + '\n')
...
>>> len(words)
90525
>>> words=list(words)
>>> open('/tmp/A', 'w').writelines(words)
>>> import random; random.shuffle(words)
>>> open('/tmp/B', 'w').writelines(words[:90000])
>>>
$ time sort A B B | uniq -u >/dev/null
real 0m2.408s
user 0m2.437s
sys 0m0.037s
$ time grep -Fvf B A >/dev/null
real 0m1.208s
user 0m1.161s
sys 0m0.035s
What now?-)
Mind you, I only replied in the first place because you wrote (my
emphasis) "...here is *the* unix way..." and it's the bad days of the
month (not mine, actually, but I suffer along...)
Note the order is trivial to restore with a
"decorate-sort-undecorate" idiom.
Using python or unix tools (eg 'paste -d', 'sort -k', 'cut -d')?
Because the python way has already been discussed by Fredrik, John and
Tim, and the unix way gets overly complicated (aka non-trivial) if DSU
is involved.
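For reference, here is roughly what "decorate-sort-undecorate" would look like on the python side. This is only a sketch to illustrate the idiom, not anything posted in the thread: the function name filter_keep_order and its arguments are made up, and the sort on line content is there purely to mirror the sort-based pipeline (plain python would not need it).

```python
def filter_keep_order(a_lines, b_lines):
    """Drop lines of a_lines that occur in b_lines, keeping A's order.

    Hypothetical helper illustrating decorate-sort-undecorate (DSU).
    """
    # Decorate: pair every line with its original position in A.
    decorated = [(line, pos) for pos, line in enumerate(a_lines)]

    # Sort by content, as a sort-based pipeline would.
    # (Not required for the filtering itself; shown to mirror `sort`.)
    decorated.sort()

    # Filter out everything that also appears in B (whole-line match).
    banned = set(b_lines)
    kept = [(pos, line) for line, pos in decorated if line not in banned]

    # Sort again on the decoration to restore A's original order...
    kept.sort()

    # ...and undecorate.
    return [line for pos, line in kept]
```

Duplicates in A survive, since each copy carries its own position.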
BTW, the following occurred to me:
tzot@tril/tmp
$ cat >A
aa
ss
dd
ff
gg
hh
jj
kk
ll
aa
tzot@tril/tmp
$ cat >B
ss
ff
hh
kk
tzot@tril/tmp
$ sort A B B | uniq -u
dd
gg
jj
ll
tzot@tril/tmp
$ grep -Fvf B A
aa
dd
gg
jj
ll
aa
Note that 'aa' is contained twice in the A file (to be filtered by B).
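The difference between the two commands can be mimicked in a few lines of python. This is a rough emulation under assumptions, not a replacement for either tool: the function names are invented for the illustration, python's sorted() uses code point order rather than the locale order of sort(1), and grep -F without -x matches the patterns anywhere in a line, which is what the substring test below imitates.

```python
from collections import Counter

# The sample files A and B from the session above, as lists of lines.
A = ["aa", "ss", "dd", "ff", "gg", "hh", "jj", "kk", "ll", "aa"]
B = ["ss", "ff", "hh", "kk"]

def sort_uniq_u(a, b):
    """Emulate `sort A B B | uniq -u`: keep lines that occur exactly once."""
    counts = Counter(a + b + b)
    return sorted(line for line, n in counts.items() if n == 1)

def grep_fv(a, b):
    """Approximate `grep -Fvf B A`: drop lines of a containing any string of b."""
    return [line for line in a if not any(pat in line for pat in b)]

print(sort_uniq_u(A, B))  # 'aa' occurs twice in A, so uniq -u drops it
print(grep_fv(A, B))      # both copies of 'aa' are kept, in A's order
```

The first function loses 'aa' because it counts occurrences across all inputs; the second only asks whether a line of A matches something in B.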
So said:
Essentially, want to do efficient grep, i.e. from A remove those lines which
are also present in file B.
grep is the unix way to go for both speed and correctness.
I would call this issue a dead horse.