python vs. grep

Discussion in 'Python' started by Anton Slesarev, May 6, 2008.

  1. I've read a great paper about generators:
    http://www.dabeaz.com/generators/index.html

    The author says it's easy to write analogs of common Linux tools such
    as awk, grep, etc. He says that performance could be even better.

    But I have some problems writing a performant grep analog.


    It's my script:

    import re
    pat = re.compile("sometext")

    f = open("bigfile",'r')

    flines = (line for line in f if pat.search(line))
    c=0
    for x in flines:
        c+=1
    print c

    and bash:
    grep "sometext" bigfile | wc -l

    The Python code is 3-4 times slower on Windows, and as I remember the
    situation is the same on Linux...

    Adding a buffer size to open() even increases the time.

    Is it possible to increase file reading performance?
    Anton Slesarev, May 6, 2008
    #1

  2. Ian Kelly

    On Tue, May 6, 2008 at 1:42 PM, Anton Slesarev <> wrote:
    > Is it possible to increase file reading performance?


    Dunno about that, but this part:

    > flines = (line for line in f if pat.search(line))
    > c=0
    > for x in flines:
    >     c+=1
    > print c


    could be rewritten as just:

    print sum(1 for line in f if pat.search(line))
    Ian Kelly, May 6, 2008
    #2

  3. Anton Slesarev <> writes:

    > f = open("bigfile",'r')
    >
    > flines = (line for line in f if pat.search(line))
    > c=0
    > for x in flines:
    >     c+=1
    > print c


    It would be simpler (and probably faster) not to use a generator expression:

    search = re.compile('sometext').search

    c = 0
    for line in open('bigfile'):
        if search(line):
            c += 1

    Perhaps faster (because the number of name lookups is reduced), using
    itertools.ifilter:

    from itertools import ifilter

    c = 0
    for line in ifilter(search, open('bigfile')):
        c += 1


    If 'sometext' is just text (no regexp wildcards) then even simpler:

    ....
    for line in ...:
        if 'sometext' in line:
            c += 1

    I don't believe you'll easily beat grep + wc using Python though.

    Perhaps faster?

    sum(bool(search(line)) for line in open('bigfile'))
    sum(1 for line in ifilter(search, open('bigfile')))

    ....etc...

    All this is untested!
    --
    Arnaud
    Arnaud Delobelle, May 6, 2008
    #3
  4. 2008/5/6, Anton Slesarev <>:
    > But I have some problems writing a performant grep analog.

    [...]
    > The Python code is 3-4 times slower on Windows, and as I remember the
    > situation is the same on Linux...
    >
    > Adding a buffer size to open() even increases the time.
    >
    > Is it possible to increase file reading performance?


    The best advice would be not to try to beat grep, but if you really
    want to, this is the right place ;)

    Here is my code:
    $ cat grep.py
    import sys

    if len(sys.argv) != 3:
        print 'grep.py <pattern> <file>'
        sys.exit(1)

    f = open(sys.argv[2],'r')

    print ''.join((line for line in f if sys.argv[1] in line)),

    $ ls -lh debug.0
    -rw-r----- 1 gminick root 4,1M 2008-05-07 00:49 debug.0

    ---
    $ time grep nusia debug.0 |wc -l
    26009

    real 0m0.042s
    user 0m0.020s
    sys 0m0.004s
    ---

    ---
    $ time python grep.py nusia debug.0 |wc -l
    26009

    real 0m0.077s
    user 0m0.044s
    sys 0m0.016s
    ---

    ---
    $ time grep nusia debug.0

    real 0m3.163s
    user 0m0.016s
    sys 0m0.064s
    ---

    ---
    $ time python grep.py nusia debug.0
    [26009 lines here...]
    real 0m2.628s
    user 0m0.032s
    sys 0m0.064s
    ---

    So, printing the results takes 2.6 seconds for Python and 3.1 seconds
    for the original grep. Surprised? The only reason for this is that we
    have reduced the number of write calls in the Python example:

    $ strace -ooriggrep.log grep nusia debug.0
    $ grep write origgrep.log |wc -l
    26009


    $ strace -opygrep.log python grep.py nusia debug.0
    $ grep write pygrep.log |wc -l
    12
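
    To illustrate the difference, an untested sketch of the two output
    strategies (the pattern and file are the ones from the timings above):

    import sys

    pattern, filename = 'nusia', 'debug.0'

    matches = [line for line in open(filename) if pattern in line]

    # One write() per matching line -- roughly what grep does by default:
    # for line in matches:
    #     sys.stdout.write(line)

    # A single joined write() -- what grep.py above effectively does, which
    # is why strace shows only a handful of write calls:
    sys.stdout.write(''.join(matches))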


    Wish you luck saving your CPU cycles :)

    --
    Regards,
    Wojtek Walczak
    http://www.stud.umk.pl/~wojtekwa/
    Wojciech Walczak, May 6, 2008
    #4
  5. I'm trying to save my time, not CPU cycles :)

    I've got a file which I really need to parse:
    -rw-rw-r-- 1 xxx xxx 3381564736 May 7 09:29 bigfile

    These are my results:

    $ time grep "python" bigfile | wc -l
    2470

    real 0m4.744s
    user 0m2.441s
    sys 0m2.307s

    And the Python scripts:

    import sys

    if len(sys.argv) != 3:
        print 'grep.py <pattern> <file>'
        sys.exit(1)

    f = open(sys.argv[2],'r')

    print ''.join((line for line in f if sys.argv[1] in line)),

    $ time python grep.py "python" bigfile | wc -l
    2470

    real 0m37.225s
    user 0m34.215s
    sys 0m3.009s

    Second script:

    import sys

    if len(sys.argv) != 3:
        print 'grepwc.py <pattern> <file>'
        sys.exit(1)

    f = open(sys.argv[2],'r',100000000)

    print sum((1 for line in f if sys.argv[1] in line)),


    $ time python grepwc.py "python" bigfile
    2470

    real 0m39.357s
    user 0m34.410s
    sys 0m4.491s

    40 seconds versus 5. This is really sad...

    That was on FreeBSD.



    On Windows with Cygwin:

    The size of bigfile is ~50 MB.

    $ time grep "python" bigfile | wc -l
    51

    real 0m0.196s
    user 0m0.169s
    sys 0m0.046s

    $ time python grepwc.py "python" bigfile
    51

    real 0m25.485s
    user 0m2.733s
    sys 0m0.375s
    Anton Slesarev, May 7, 2008
    #5
  6. Ville Vainio

    On May 6, 10:42 pm, Anton Slesarev <> wrote:

    > flines = (line for line in f if pat.search(line))


    What about re.findall() / re.finditer() for the whole file contents?
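
    Something like this, untested, and assuming the file fits comfortably
    in memory:

    import re

    pat = re.compile("sometext")

    # Read the whole file at once so the regex engine scans it in C,
    # instead of paying per-line Python overhead.
    data = open("bigfile", 'r').read()

    # finditer avoids building a list of all matches; this just counts them.
    print sum(1 for _ in pat.finditer(data))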
    Ville Vainio, May 7, 2008
    #6
  7. Pop User

    Anton Slesarev wrote:
    >
    > But I have some problems writing a performant grep analog.
    >


    I don't think you can ever catch grep. Searching is its only purpose in
    life and it's very good at it. You may be able to come closer; this
    thread relates:

    http://groups.google.com/group/comp...read/thread/2f564523f476840a/d9476da5d7a9e466

    This relates to the speed of re. If you don't need regexes, don't use re.
    If you do need re, an alternate re library might be useful, but you
    aren't going to catch grep.
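
    A quick, untested way to see the difference on fixed strings (the data
    here is made up and the timings will vary):

    import re
    import time

    pat = re.compile('sometext')
    lines = ['a fairly typical log line without the pattern\n'] * 1000000

    t0 = time.time()
    n_re = sum(1 for line in lines if pat.search(line))
    t1 = time.time()
    n_in = sum(1 for line in lines if 'sometext' in line)
    t2 = time.time()

    # Plain substring search ('in') is usually noticeably faster than re
    # when the pattern is a fixed string.
    print 're.search: %.2fs   plain in: %.2fs' % (t1 - t0, t2 - t1)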
    Pop User, May 7, 2008
    #7
  8. On May 7, 7:22 pm, Pop User <12us.com> wrote:
    > Anton Slesarev wrote:
    >
    > > But I have some problems writing a performant grep analog.

    >
    > I don't think you can ever catch grep. Searching is its only purpose in
    > life and it's very good at it. You may be able to come closer; this
    > thread relates:
    >
    > http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
    >
    > This relates to the speed of re. If you don't need regexes, don't use re.
    > If you do need re, an alternate re library might be useful, but you
    > aren't going to catch grep.


    In my last test I don't use re. As I understand it, the main problem is
    in reading the file.
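
    One untested idea for the reading side, assuming the pattern is plain
    text (the pattern and file names here are just placeholders): read in
    large chunks instead of line by line and count matching lines, taking
    care with lines split across chunk boundaries.

    pattern = "python"
    filename = "bigfile"
    CHUNK = 16 * 1024 * 1024

    count = 0
    leftover = ''
    f = open(filename, 'r')
    while True:
        block = f.read(CHUNK)
        if not block:
            break
        lines = (leftover + block).split('\n')
        # Keep the trailing partial line so a line split across two chunks
        # is still examined exactly once.
        leftover = lines.pop()
        count += sum(1 for line in lines if pattern in line)
    f.close()
    if leftover and pattern in leftover:
        count += 1
    print count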
    Anton Slesarev, May 7, 2008
    #8
  9. Anton Slesarev wrote:
    > I'm trying to save my time, not CPU cycles :)
    >
    > I've got a file which I really need to parse:
    > -rw-rw-r-- 1 xxx xxx 3381564736 May 7 09:29 bigfile
    >
    > These are my results:
    >
    > $ time grep "python" bigfile | wc -l
    > 2470
    >
    > real 0m4.744s
    > user 0m2.441s
    > sys 0m2.307s
    >
    > And the Python scripts:
    >
    > import sys
    >
    > if len(sys.argv) != 3:
    >     print 'grep.py <pattern> <file>'
    >     sys.exit(1)
    >
    > f = open(sys.argv[2],'r')
    >
    > print ''.join((line for line in f if sys.argv[1] in line)),
    >
    > $ time python grep.py "python" bigfile | wc -l
    > 2470
    >
    > real 0m37.225s
    > user 0m34.215s
    > sys 0m3.009s
    >
    > Second script:
    >
    > import sys
    >
    > if len(sys.argv) != 3:
    >     print 'grepwc.py <pattern> <file>'
    >     sys.exit(1)
    >
    > f = open(sys.argv[2],'r',100000000)
    >
    > print sum((1 for line in f if sys.argv[1] in line)),
    >
    >
    > $ time python grepwc.py "python" bigfile
    > 2470
    >
    > real 0m39.357s
    > user 0m34.410s
    > sys 0m4.491s
    >
    > 40 seconds versus 5. This is really sad...
    >
    > That was on FreeBSD.
    >
    >
    >
    > On Windows with Cygwin:
    >
    > The size of bigfile is ~50 MB.
    >
    > $ time grep "python" bigfile | wc -l
    > 51
    >
    > real 0m0.196s
    > user 0m0.169s
    > sys 0m0.046s
    >
    > $ time python grepwc.py "python" bigfile
    > 51
    >
    > real 0m25.485s
    > user 0m2.733s
    > sys 0m0.375s



    All these examples assume your regular expression will not span multiple
    lines, but this can easily be the case. How would you process the file
    with regular expressions that span multiple lines?
    Ricardo Aráoz, May 8, 2008
    #9
  10. Alan Isaac

    https://svn.enthought.com/svn/sandbox/grin/trunk/
    Alan Isaac, May 8, 2008
    #10
  11. Robert Kern

    Alan Isaac wrote:
    > Anton Slesarev wrote:
    >> I've read a great paper about generators:
    >> http://www.dabeaz.com/generators/index.html The author says it's easy
    >> to write analogs of common Linux tools such as awk, grep, etc. He says
    >> that performance could be even better. But I have some problems
    >> writing a performant grep analog.

    >
    > https://svn.enthought.com/svn/sandbox/grin/trunk/


    As the author of grin I can definitively state that it is not at all competitive
    with grep in terms of speed. grep reads files really fast. awk is probably
    beatable, though.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, May 8, 2008
    #11
  12. Ville Vainio

    On May 8, 8:11 pm, Ricardo Aráoz <> wrote:

    > All these examples assume your regular expression will not span multiple
    > lines, but this can easily be the case. How would you process the file
    > with regular expressions that span multiple lines?


    re.findall/ finditer, as I said earlier.
    Ville Vainio, May 9, 2008
    #12
  13. Ville Vainio wrote:
    > On May 8, 8:11 pm, Ricardo Aráoz <> wrote:
    >
    >> All these examples assume your regular expression will not span multiple
    >> lines, but this can easily be the case. How would you process the file
    >> with regular expressions that span multiple lines?

    >
    > re.findall/ finditer, as I said earlier.
    >


    Hi, sorry it took so long to answer. Too much work.

    findall/finditer do not address the issue, they merely find ALL the
    matches in a STRING. But if you keep reading the files a line at a time
    (as most examples given in this thread do) then you are STILL in trouble
    when a regular expression spans multiple lines.
    The easy/simple (too easy/simple?) way I see out of it is to read THE
    WHOLE file into memory and don't worry. But what if the file is too
    heavy? So I was wondering if there is any other way out of it. Does grep
    read the whole file into memory? Does it ONLY process a line at a time?
    Ricardo Aráoz, May 12, 2008
    #13
  14. Kam-Hung Soh

    On Tue, 13 May 2008 00:03:08 +1000, Ricardo Aráoz <>
    wrote:

    > Ville Vainio wrote:
    >> On May 8, 8:11 pm, Ricardo Aráoz <> wrote:
    >>
    >>> All these examples assume your regular expression will not span
    >>> multiple
    >>> lines, but this can easily be the case. How would you process the file
    >>> with regular expressions that span multiple lines?

    >> re.findall/ finditer, as I said earlier.
    >>

    >
    > Hi, sorry took so long to answer. Too much work.
    >
    > findall/finditer do not address the issue, they merely find ALL the
    > matches in a STRING. But if you keep reading the files a line at a time
    > (as most examples given in this thread do) then you are STILL in trouble
    > when a regular expression spans multiple lines.
    > The easy/simple (too easy/simple?) way I see out of it is to read THE
    > WHOLE file into memory and don't worry. But what if the file is too
    > heavy? So I was wondering if there is any other way out of it. Does grep
    > read the whole file into memory? Does it ONLY process a line at a time?


    Standard grep can only match a line at a time. Are you thinking about
    "sed", which has a sliding window?

    See http://www.gnu.org/software/sed/manual/sed.html, Section 4.13

    --
    Kam-Hung Soh, Software Salariman (http://kamhungsoh.com/blog)
    Kam-Hung Soh, May 13, 2008
    #14
  15. Ricardo Aráoz <> writes:

    > The easy/simple (too easy/simple?) way I see out of it is to read THE
    > WHOLE file into memory and don't worry. But what if the file is too


    The easiest and simplest approach is often the best with
    Python. Reading in the whole file is rarely too heavy, and you omit
    the python "object overhead" entirely - all the code executes in the
    fast C extensions.

    If the file is too big, you might want to look up mmap:

    http://effbot.org/librarybook/mmap.htm
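
    A rough, untested sketch of the mmap idea (the multi-line pattern here
    is made up):

    import mmap
    import re

    pat = re.compile(r"ERROR.*\n.*Traceback")

    f = open('bigfile', 'rb')
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # re accepts the mmap object directly (it supports the buffer
    # interface), so the OS pages the file in on demand instead of the
    # whole thing being read into one Python string.
    print len(pat.findall(m))

    m.close()
    f.close()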
    Ville M. Vainio, May 13, 2008
    #15
  16. Ville M. Vainio wrote:
    > Ricardo Aráoz <> writes:
    >
    >> The easy/simple (too easy/simple?) way I see out of it is to read THE
    >> WHOLE file into memory and don't worry. But what if the file is too

    >
    > The easiest and simplest approach is often the best with
    > Python.


    Keep forgetting that!

    >
    > If the file is too big, you might want to look up mmap:
    >
    > http://effbot.org/librarybook/mmap.htm


    Thanks!
    Ricardo Aráoz, May 14, 2008
    #16
