Efficient grep using Python?

Discussion in 'Python' started by sf, Dec 15, 2004.

  1. sf

    Just started thinking about learning python.

    Is there any place where I can get some free examples, especially for
    the following kind of problem (it must be trivial for those using python)?

    I have files A and B, each containing say 100,000 lines (each line = one
    string without any spaces).

    I want to do

    " A - (A intersection B) "

    Essentially, I want to do an efficient grep, i.e. from A remove those lines
    which are also present in file B.
     
    sf, Dec 15, 2004
    #1

  2. John Hunter

    >>>>> "sf" == sf <> writes:

    sf> Just started thinking about learning python. Is there any
    sf> place where I can get some free examples, especially for
    sf> following kind of problem ( it must be trivial for those using
    sf> python)

    sf> I have files A, and B each containing say 100,000 lines (each
    sf> line=one string without any space)

    sf> I want to do

    sf> " A - (A intersection B) "

    sf> Essentially, want to do efficient grep, i.e. from A remove
    sf> those lines which are also present in file B.

    If you're only talking about 100K lines or so, and you have a
    reasonably modern computer, you can do this all in memory. If order
    doesn't matter (it probably does), you can use a set to get all the
    lines in file B that are not in A:

    from sets import Set   # pre-2.4 sets module; Python 2.4+ has the built-in set
    A = Set(file('test1.dat').readlines())
    B = Set(file('test2.dat').readlines())
    print B-A
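
    Note the direction, though: the OP's "A - (A intersection B)" is just the
    set difference A - B, so for the question as asked the operands swap.
    A minimal sketch (on Python 2.4+ the built-in set drops the import):

    A = set(file('test1.dat'))
    B = set(file('test2.dat'))
    print A - B   # lines of A not present in B; order not preserved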

    To preserve order, you should use a dictionary that maps lines to line
    numbers; you can later use these numbers to sort:

    A = dict([(line, num) for num, line in enumerate(file('test1.dat'))])
    B = dict([(line, num) for num, line in enumerate(file('test2.dat'))])

    keep = [(num, line) for line, num in B.items() if not A.has_key(line)]
    keep.sort()
    for num, line in keep:
        print line,

    Now someone else will come along and tell you all this functionality
    is already in the standard library. But it's always fun to hack this
    out yourself once because python makes such things so damned easy.

    JDH
     
    John Hunter, Dec 15, 2004
    #2

  3. "sf" <> wrote:

    > I have files A, and B each containing say 100,000 lines (each line=one
    > string without any space)
    >
    > I want to do
    >
    > " A - (A intersection B) "
    >
    > Essentially, want to do efficient grep, i.e. from A remove those lines which
    > are also present in file B.


    that's an unusual definition of "grep", but the following seems to
    do what you want:

    afile = "a.txt"
    bfile = "b.txt"

    bdict = dict.fromkeys(open(bfile).readlines())

    for line in open(afile):
        if line not in bdict:
            print line,

    </F>
     
    Fredrik Lundh, Dec 15, 2004
    #3
  4. Pádraig Brady

    sf wrote:
    > Just started thinking about learning python.
    >
    > Is there any place where I can get some free examples, especially for
    > following kind of problem ( it must be trivial for those using python)
    >
    > I have files A, and B each containing say 100,000 lines (each line=one
    > string without any space)
    >
    > I want to do
    >
    > " A - (A intersection B) "
    >
    > Essentially, want to do efficient grep, i.e. from A remove those lines which
    > are also present in file B.


    You could implement this elegantly using the new sets feature.
    For reference, here is the unix way to do it:

    sort a b b | uniq -u

    (b is listed twice so that every line present in b occurs more than
    once in the merged output; uniq -u then drops those, leaving only the
    lines unique to a.)

    --
    Pádraig Brady - http://www.pixelbeat.org
    --
     
    Pádraig Brady, Dec 15, 2004
    #4
  5. Tim Peters

    ["sf" <>]
    >> I have files A, and B each containing say 100,000 lines (each
    >> line=one string without any space)
    >>
    >> I want to do
    >>
    >> " A - (A intersection B) "
    >>
    >> Essentially, want to do efficient grep, i.e. from A remove those
    >> lines which are also present in file B.


    [Fredrik Lundh]
    > that's an unusual definition of "grep", but the following seems to
    > do what you want:
    >
    > afile = "a.txt"
    > bfile = "b.txt"
    >
    > bdict = dict.fromkeys(open(bfile).readlines())
    >
    > for line in open(afile):
    >     if line not in bdict:
    >         print line,
    >
    > </F>


    Note that an open file is an iterable object, yielding the lines in
    the file. The "for" loop exploited that above, but fromkeys() can
    also exploit it. That is,

    bdict = dict.fromkeys(open(bfile))

    is good enough (there's no need for the .readlines()).
     
    Tim Peters, Dec 15, 2004
    #5
  6. Fredrik Lundh

    Tim Peters wrote:

    >> bdict = dict.fromkeys(open(bfile).readlines())
    >>
    >> for line in open(afile):
    >>     if line not in bdict:
    >>         print line,
    >>
    >> </F>

    >
    > Note that an open file is an iterable object, yielding the lines in
    > the file. The "for" loop exploited that above, but fromkeys() can
    > also exploit it. That is,
    >
    > bdict = dict.fromkeys(open(bfile))
    >
    > is good enough (there's no need for the .readlines()).


    (sigh. my brain knows that, but my fingers keep forgetting)

    and yes, for this purpose, "dict.fromkeys" can be replaced
    with "set".

    bdict = set(open(bfile))

    (and then you can save a few more bytes by renaming the
    variable...)
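
    Putting the pieces together, the whole filter collapses to a few lines;
    a sketch using the same file names, with A's line order preserved:

    # set of B's lines gives O(1) membership tests; stream A and keep
    # only the lines that do not appear in B (original order retained)
    blines = set(open("b.txt"))
    for line in open("a.txt"):
        if line not in blines:
            print line,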

    </F>
     
    Fredrik Lundh, Dec 15, 2004
    #6
  7. Tim Peters

    [Fredrik Lundh]
    >>> bdict = dict.fromkeys(open(bfile).readlines())
    >>>
    >>> for line in open(afile):
    >>>     if line not in bdict:
    >>>         print line,
    >>>
    >>> </F>


    [Tim Peters]
    >> Note that an open file is an iterable object, yielding the lines in
    >> the file. The "for" loop exploited that above, but fromkeys() can
    >> also exploit it. That is,
    >>
    >> bdict = dict.fromkeys(open(bfile))
    >>
    >> is good enough (there's no need for the .readlines()).


    [/F]
    > (sigh. my brain knows that, but my fingers keep forgetting)
    >
    > and yes, for this purpose, "dict.fromkeys" can be replaced
    > with "set".
    >
    > bdict = set(open(bfile))
    >
    > (and then you can save a few more bytes by renaming the
    > variable...)


    Except the latter two are just shallow spelling changes. Switching
    from fromkeys(open(f).readlines()) to fromkeys(open(f)) is much more
    interesting, since it can allow major reduction in memory use. Even
    if all the lines in the file are pairwise distinct, not materializing
    them into a giant list can be a significant win. I wouldn't have
    bothered replying if the only point were that you can save a couple
    bytes of typing <wink>.
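
    The contrast is easy to see side by side; a sketch, with bfile as above:

    bdict = dict.fromkeys(open(bfile).readlines())  # materializes every line in a throwaway list first
    bdict = dict.fromkeys(open(bfile))              # streams the file; only the dict is built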
     
    Tim Peters, Dec 15, 2004
    #7
  8. On Wed, 15 Dec 2004 17:07:37 +0100, rumours say that "Fredrik Lundh"
    <> might have written:

    >> Essentially, want to do efficient grep, i.e. from A remove those lines which
    >> are also present in file B.

    >
    >that's an unusual definition of "grep"


    that would be

    grep -vf B A

    and it is a rare use of grep, indeed.
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
     
    Christos TZOTZIOY Georgiou, Dec 16, 2004
    #8
  9. On Wed, 15 Dec 2004 16:10:08 +0000, rumours say that Pádraig Brady
    might have written:

    >> Essentially, want to do efficient grep, i.e. from A remove those lines which
    >> are also present in file B.

    >
    >You could implement elegantly using the new sets feature
    >For reference here is the unix way to do it:
    >
    >sort a b b | uniq -u


    No, as I just wrote in another post, he wants

    $ grep -vf B A

    I think that

    $ sort A B B | uniq -u

    can be abbreviated to

    $ sort -u A B B

    which is the union rather than the intersection of the files, wastes
    some time by considering B twice, and finally destroys original line
    order (should it be important).
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
     
    Christos TZOTZIOY Georgiou, Dec 16, 2004
    #9
  10. Pádraig Brady

    Christos TZOTZIOY Georgiou wrote:
    > On Wed, 15 Dec 2004 16:10:08 +0000, rumours say that Pádraig Brady
    > might have written:
    >
    >
    >>>Essentially, want to do efficient grep, i.e. from A remove those lines which
    >>>are also present in file B.

    >>
    >>You could implement elegantly using the new sets feature
    >>For reference here is the unix way to do it:
    >>
    >>sort a b b | uniq -u

    >
    >
    > No, like I just wrote in another post, he wants
    >
    > $ grep -vf B A
    >
    > I think that
    >
    > $ sort A B B | uniq -u
    >
    > can be abbreviated to
    >
    > $ sort -u A B B
    >
    > which is the union rather than the intersection of the files


    wrong. Notice the -u option to uniq.
    http://marc.free.net.ph/message/20021101.043138.1bc24964.html

    > wastes some time by considering B twice


    I challenge you to a benchmark :)

    > and finally destroys original line
    > order (should it be important).


    true

    --
    Pádraig Brady - http://www.pixelbeat.org
    --
     
    Pádraig Brady, Dec 16, 2004
    #10
  11. On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that Pádraig Brady
    might have written:

    [sf]
    >>>>Essentially, want to do efficient grep, i.e. from A remove those lines which
    >>>>are also present in file B.


    [p@draig]
    >>>You could implement elegantly using the new sets feature
    >>>For reference here is the unix way to do it:
    >>>
    >>>sort a b b | uniq -u


    [Christos]
    >> No, like I just wrote in another post, he wants
    >>
    >> $ grep -vf B A
    >>
    >> I think that
    >>
    >> $ sort A B B | uniq -u
    >>
    >> can be abbreviated to
    >>
    >> $ sort -u A B B
    >>
    >> which is the union rather than the intersection of the files


    [P@draig]
    >wrong. Notice the -u option to uniq.
    >http://marc.free.net.ph/message/20021101.043138.1bc24964.html


    I see your point. That's a use of uniq that's new to me; I started
    using Unices long before the GNU versions of the tools, but then, I
    might have missed the -u option.

    $ cat A
    aa
    ss
    dd
    ff
    gg
    hh
    jj
    kk
    $ cat B
    aa
    dd
    gg
    jj
    pp
    $ time sort A B B | uniq -u
    ff
    hh
    kk
    ss

    real 0m0.004s
    user 0m0.004s
    sys 0m0.004s
    $ time grep -vf B A
    ss
    ff
    hh
    kk

    real 0m0.003s
    user 0m0.000s
    sys 0m0.003s

    So I stand corrected: your solution does *not* give the union.

    >> wastes some time by considering B twice

    >
    >I challenge you to a benchmark :)


    Well, the numbers I provided above are almost meaningless with such a
    small set (and they could easily be reversed; I just kept the
    convenient-to-me first run :). Do you really believe that sorting three
    files and then scanning their merged output counting duplicates is
    faster than scanning two files (and doing lookups during the second
    scan)?

    $ python
    Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
    [GCC 3.3.3 (SuSE Linux)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> x = open('/usr/share/dict/words').readlines()
    >>> len(x)
    45378
    >>> import random
    >>> random.shuffle(x)
    >>> open("/tmp/A", "w").writelines(x)
    >>> random.shuffle(x)
    >>> open("/tmp/B", "w").writelines(x[:1000])
    >>>

    $ time sort A B B | uniq -u >/dev/null

    real 0m0.311s
    user 0m0.315s
    sys 0m0.008s
    $ time grep -Fvf B A >/dev/null

    real 0m0.067s
    user 0m0.064s
    sys 0m0.003s

    (Yes, I cheated by adding the -F (fixed strings, no regular expressions) flag :)

    >> and finally destroys original line
    >> order (should it be important).

    >
    >true


    That's our final agreement :)
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
     
    Christos TZOTZIOY Georgiou, Dec 16, 2004
    #11
  12. Pádraig Brady

    Christos TZOTZIOY Georgiou wrote:
    > On Thu, 16 Dec 2004 14:28:21 +0000, rumours say that Pádraig Brady
    > might have written:
    >
    >>I challenge you to a benchmark :)

    >
    >
    > Well, the numbers I provided above are almost meaningless with such a
    > small set (and they easily could be reverse, I just kept the
    > convenient-to-me first run :). Do you really believe that sorting three
    > files and then scanning their merged output counting duplicates is
    > faster than scanning two files (and doing lookups during the second
    > scan)?
    >
    > $ python
    > Python 2.3.3 (#1, Aug 31 2004, 13:51:39)
    > [GCC 3.3.3 (SuSE Linux)] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    >
    > >>> x = open('/usr/share/dict/words').readlines()
    > >>> len(x)
    > 45378
    > >>> import random
    > >>> random.shuffle(x)
    > >>> open("/tmp/A", "w").writelines(x)
    > >>> random.shuffle(x)
    > >>> open("/tmp/B", "w").writelines(x[:1000])
    > >>>

    >
    > $ time sort A B B | uniq -u >/dev/null
    >
    > real 0m0.311s
    > user 0m0.315s
    > sys 0m0.008s
    > $ time grep -Fvf B A >/dev/null
    >
    > real 0m0.067s
    > user 0m0.064s
    > sys 0m0.003s
    >
    > (Yes, I cheated by adding the F (for no regular expressions) flag :)


    Also you only have 1000 entries in B!
    Try it again with all entries in B also ;-)
    Remember the original poster had 100K entries!

    >>>and finally destroys original line
    >>>order (should it be important).

    >>
    >>true

    >
    > That's our final agreement :)


    Note the order is trivial to restore with a
    "decorate-sort-undecorate" idiom.

    --
    Pádraig Brady - http://www.pixelbeat.org
    --
     
    Pádraig Brady, Dec 17, 2004
    #12
  13. sf

    The point is that when you have 100,000s of records, this grep becomes
    really slow.

    Any comments?

    That's why I looked to python :)


    > that would be
    >
    > grep -vf B A
    >
    > and it is a rare use of grep, indeed.
    > --
    > TZOTZIOY, I speak England very best.
    > "Be strict when sending and tolerant when receiving." (from RFC1958)
    > I really should keep that in mind when talking with people, actually...
     
    sf, Dec 17, 2004
    #13
  14. Re: Efficient grep using Python? [OT]

    On Fri, 17 Dec 2004 12:21:08 +0000, rumours say that Pádraig Brady
    might have written:

    [snip some damn lie aka "benchmark"]

    [me]
    >> (Yes, I cheated by adding the F (for no regular expressions) flag :)

    >
    >Also you only have 1000 entries in B!
    >Try it again with all entries in B also ;-)
    >Remember the original poster had 100K entries!


    Well, that's the closest I can do:

    $ py
    Python 2.4c1 (#3, Nov 26 2004, 23:39:44)
    [GCC 3.3.3 (SuSE Linux)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys; sys.ps1='.>>'

    .>> alist = [line.strip() for line in open('/usr/share/dict/words')]
    .>> words = set()
    .>> for word in alist:
    ...     words.add(word + '\n')
    ...     words.add(word[::-1] + '\n')
    ...
    .>> len(words)
    90525
    .>> words = list(words)
    .>> open('/tmp/A', 'w').writelines(words)
    .>> import random; random.shuffle(words)
    .>> open('/tmp/B', 'w').writelines(words[:90000])
    .>>
    $ time sort A B B | uniq -u >/dev/null

    real 0m2.408s
    user 0m2.437s
    sys 0m0.037s
    $ time grep -Fvf B A >/dev/null

    real 0m1.208s
    user 0m1.161s
    sys 0m0.035s

    What now?-)

    Mind you, I only replied in the first place because you wrote (my
    emphasis) "...here is *the* unix way..." and it's the bad days of the
    month (not mine, actually, but I suffer along...)

    >>>>and finally destroys original line
    >>>>order (should it be important).
    >>>
    >>>true

    >>
    >> That's our final agreement :)

    >
    >Note the order is trivial to restore with a
    >"decorate-sort-undecorate" idiom.


    Using python or unix tools (eg 'paste -d', 'sort -k', 'cut -d')?
    Because the python way has already been discussed by Fredrik, John and
    Tim, and the unix way gets overly complicated (aka non-trivial) if DSU
    is involved.

    BTW, the following occurred to me:

    tzot@tril/tmp
    $ cat >A
    aa
    ss
    dd
    ff
    gg
    hh
    jj
    kk
    ll
    aa
    tzot@tril/tmp
    $ cat >B
    ss
    ff
    hh
    kk
    tzot@tril/tmp
    $ sort A B B | uniq -u
    dd
    gg
    jj
    ll
    tzot@tril/tmp
    $ grep -Fvf B A
    aa
    dd
    gg
    jj
    ll
    aa

    Note that 'aa' is contained twice in the A file (the file to be filtered
    by B), so our methods do not produce the same output. Going by what the
    OP wrote:

    >Essentially, want to do efficient grep, i.e. from A remove those lines which
    >are also present in file B.


    grep is the unix way to go for both speed and correctness.

    I would call this issue a dead horse.
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
     
    Christos TZOTZIOY Georgiou, Dec 17, 2004
    #14
  15. Pádraig Brady

    sf wrote:
    > The point is that when you have 100,000s of records, this grep becomes
    > really slow?


    There are performance bugs in current versions of grep when handling
    multibyte characters, and they are only now being addressed.
    To work around them, do `export LANG=C` first.

    In my experience grep is not scalable here, since it's O(n^2).
    See below (note that A and B are randomized versions of
    /usr/share/dict/words, and therefore a worst case for the
    sort method).

    $ wc -l A B
    45427 A
    45427 B

    $ export LANG=C

    $ time grep -Fvf B A
    real 0m0.437s

    $ time sort A B B | uniq -u
    real 0m0.262s

    $ rpm -q grep coreutils
    grep-2.5.1-16.1
    coreutils-4.5.3-19

    --
    Pádraig Brady - http://www.pixelbeat.org
    --
     
    Pádraig Brady, Dec 17, 2004
    #15
  16. On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that Pádraig Brady
    might have written:


    >sf wrote:
    >> The point is that when you have 100,000s of records, this grep becomes
    >> really slow?

    >
    >There are performance bugs with current versions of grep
    >and multibyte characters that are only getting addressed now.
    >To work around these do `export LANG=C` first.


    You should also use the -F flag that Pádraig suggests, since you don't
    have regular expressions in the B file.

    >In my experience grep is not scalable since it's O(n^2).
    >See below (note A and B are randomized versions of
    >/usr/share/dict/words (and therefore worst case for the
    >sort method)).
    >
    >$ wc -l A B
    > 45427 A
    > 45427 B
    >
    >$ export LANG=C
    >
    >$ time grep -Fvf B A
    >real 0m0.437s
    >
    >$ time sort A B B | uniq -u
    >real 0m0.262s
    >
    >$ rpm -q grep coreutils
    >grep-2.5.1-16.1
    >coreutils-4.5.3-19


    sf, you'd better do your own benchmarks (there is quick sample code in
    other posts of mine and Pádraig's) on your machine, since on my test
    machine the numbers are the reverse of Pádraig's (grep takes
    half the time).

    package versions (on SuSE 9.1 64-bit):

    $ rpm -q grep coreutils
    grep-2.5.1-427
    coreutils-5.2.1-21

    language:
    $ echo $LANG
    en_US.UTF-8

    Caution: the two solutions are interchangeable only as long as you don't
    have duplicate lines in the A file. If you do, use the grep version.
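
    Worth noting: the streaming Python loop from earlier in the thread sides
    with grep on this point, since each line of A is tested independently;
    a sketch, assuming the small A/B files from the 'aa' example above:

    # duplicate lines in A survive the filter, as with grep -Fvf B A
    blines = set(open('B'))
    for line in open('A'):
        if line not in blines:
            print line,   # prints 'aa' twice for those example files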
    --
    TZOTZIOY, I speak England very best.
    "Be strict when sending and tolerant when receiving." (from RFC1958)
    I really should keep that in mind when talking with people, actually...
     
    Christos TZOTZIOY Georgiou, Dec 17, 2004
    #16