python vs. grep

Anton Slesarev · May 6, 2008

I've read a great paper about generators:
http://www.dabeaz.com/generators/index.html

The author says it's easy to write analogs of common Linux tools such
as awk, grep, etc. He says that performance could be even better.

But I'm having trouble writing a grep analog that performs well.

Here is my script:

import re
pat = re.compile("sometext")

f = open("bigfile",'r')

flines = (line for line in f if pat.search(line))
c=0
for x in flines:
    c+=1
print c

and bash:
grep "sometext" bigfile | wc -l

The Python code is 3-4 times slower on Windows, and as I remember the
situation is the same on Linux.

Enabling buffering in open() even increases the time.

Is it possible to increase file reading performance?

Ian Kelly · May 6, 2008

Is it possible to increase file reading performance?

Dunno about that, but this part:

flines = (line for line in f if pat.search(line))
c=0
for x in flines:
    c+=1
print c

could be rewritten as just:

print sum(1 for line in f if pat.search(line))
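For readers on Python 3, a minimal sketch of the same count follows. The function name count_matching_lines and the in-memory sample are illustrative assumptions, not part of the thread; in practice the lines would come from an open file object.

```python
import re

def count_matching_lines(lines, pattern):
    """Count the lines in which the compiled regex `pattern` matches."""
    search = pattern.search  # bind once to avoid a repeated attribute lookup
    return sum(1 for line in lines if search(line))

# Hypothetical sample standing in for `open("bigfile")`.
sample = ["no match\n", "some sometext here\n", "sometext again\n"]
print(count_matching_lines(sample, re.compile("sometext")))  # prints 2
```

With a real file, passing the file object itself (for example inside a `with open("bigfile") as f:` block) keeps the iteration lazy, just as in the generator version above.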

Arnaud Delobelle · May 6, 2008

Anton Slesarev said:
f = open("bigfile",'r')

flines = (line for line in f if pat.search(line))
c=0
for x in flines:
    c+=1
print c

It would be simpler (and probably faster) not to use a generator expression:

search = re.compile('sometext').search

c = 0
for line in open('bigfile'):
    if search(line):
        c += 1

Perhaps faster (because the number of name lookups is reduced), using
itertools.ifilter:

from itertools import ifilter

c = 0
for line in ifilter(search, open('bigfile')):
    c += 1

If 'sometext' is just text (no regexp wildcards) then even simpler:

....
for line in ...:
    if 'sometext' in line:
        c += 1

I don't believe you'll easily beat grep + wc using Python though.

Perhaps faster?

sum(bool(search(line)) for line in open('bigfile'))
sum(1 for line in ifilter(search, open('bigfile')))

....etc...

All this is untested!
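Since the variants above are untested, one way to check them is to confirm that they agree on the count and then time each with timeit. The sketch below assumes Python 3, where the built-in filter is lazy and replaces itertools.ifilter; the helper names and the synthetic data are made up for illustration, and the timings will vary by machine.

```python
import re
import timeit

def count_loop(lines, search):
    # Plain for loop with an explicit counter.
    c = 0
    for line in lines:
        if search(line):
            c += 1
    return c

def count_filter(lines, search):
    # Built-in filter() is the Python 3 counterpart of itertools.ifilter.
    return sum(1 for _ in filter(search, lines))

def count_bool(lines, search):
    # bool(match) is True (1) for a match object and False (0) for None.
    return sum(bool(search(line)) for line in lines)

def count_substring(lines, text):
    # No regex at all: plain substring test.
    return sum(1 for line in lines if text in line)

# Synthetic input (an assumption): every tenth line contains the text.
LINES = [("line %d sometext\n" if i % 10 == 0 else "line %d\n") % i
         for i in range(20000)]
SEARCH = re.compile("sometext").search

VARIANTS = {
    "loop": lambda: count_loop(LINES, SEARCH),
    "filter": lambda: count_filter(LINES, SEARCH),
    "bool": lambda: count_bool(LINES, SEARCH),
    "substring": lambda: count_substring(LINES, "sometext"),
}

def run_benchmark(repeat=5):
    # Print the count and the total time of `repeat` runs for each variant.
    for name, func in VARIANTS.items():
        seconds = timeit.timeit(func, number=repeat)
        print("%-10s count=%d  %.4fs" % (name, func(), seconds))

run_benchmark()
```

Timing an in-memory list isolates the matching cost from file reading, which is a separate question in this thread.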

Wojciech Walczak · May 6, 2008

2008/5/6 said:
But I'm having trouble writing a grep analog that performs well. [...]
The Python code is 3-4 times slower on Windows, and as I remember the
situation is the same on Linux.

Enabling buffering in open() even increases the time.

Is it possible to increase file reading performance?

The best advice would be not to try to beat grep, but if you really
want to, this is the right place

Here is my code:
$ cat grep.py
import sys

if len(sys.argv) != 3:
    print 'grep.py <pattern> <file>'
    sys.exit(1)

f = open(sys.argv[2],'r')

print ''.join((line for line in f if sys.argv[1] in line)),

$ ls -lh debug.0
-rw-r----- 1 gminick root 4,1M 2008-05-07 00:49 debug.0

---
$ time grep nusia debug.0 |wc -l
26009

real 0m0.042s
user 0m0.020s
sys 0m0.004s
---

---
$ time python grep.py nusia debug.0 |wc -l
26009

real 0m0.077s
user 0m0.044s
sys 0m0.016s
---

---
$ time grep nusia debug.0

real 0m3.163s
user 0m0.016s
sys 0m0.064s
---

---
$ time python grep.py nusia debug.0
[26009 lines here...]
real 0m2.628s
user 0m0.032s
sys 0m0.064s
---

So printing the results takes 2.6 s for Python and 3.1 s for the original grep.
Surprised? The only reason for this is that we have reduced the number
of write calls in the Python example:

$ strace -ooriggrep.log grep nusia debug.0
$ grep write origgrep.log |wc -l
26009

$ strace -opygrep.log python grep.py nusia debug.0
$ grep write pygrep.log |wc -l
12

Wish you luck saving your CPU cycles
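The join trick above builds the entire output in memory before writing it once. A possible middle ground, sketched here for Python 3, is to write matches in batches so that the number of write calls stays low while memory stays bounded. The function name grep_batched and the batch_size default are assumptions for illustration; how many system calls actually result also depends on how the output stream is buffered.

```python
import sys

def grep_batched(lines, text, out=sys.stdout, batch_size=1000):
    """Write lines containing `text` to `out`, one write() per batch."""
    batch = []
    for line in lines:
        if text in line:
            batch.append(line)
            if len(batch) >= batch_size:
                out.write("".join(batch))  # one call for the whole batch
                batch = []
    if batch:
        out.write("".join(batch))  # flush the final partial batch

# Hypothetical usage with an in-memory sample instead of debug.0.
grep_batched(["foo nusia\n", "bar\n", "nusia baz\n"], "nusia")
```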

Anton Slesarev · May 7, 2008

I'm trying to save my time, not CPU cycles)

I've got a file that I really need to parse:
-rw-rw-r-- 1 xxx xxx 3381564736 May 7 09:29 bigfile

Here are my results:

$ time grep "python" bigfile | wc -l
2470

real 0m4.744s
user 0m2.441s
sys 0m2.307s

And the Python scripts:

import sys

if len(sys.argv) != 3:
    print 'grep.py <pattern> <file>'
    sys.exit(1)

f = open(sys.argv[2],'r')

print ''.join((line for line in f if sys.argv[1] in line)),

$ time python grep.py "python" bigfile | wc -l
2470

real 0m37.225s
user 0m34.215s
sys 0m3.009s

Second script:

import sys

if len(sys.argv) != 3:
    print 'grepwc.py <pattern> <file>'
    sys.exit(1)

f = open(sys.argv[2],'r',100000000)

print sum((1 for line in f if sys.argv[1] in line)),

$ time python grepwc.py "python" bigfile
2470

real 0m39.357s
user 0m34.410s
sys 0m4.491s

40 seconds versus 5. This is really sad...

That was on FreeBSD.

On Windows (Cygwin), with a bigfile of about 50 MB:

$ time grep "python" bigfile | wc -l
51

real 0m0.196s
user 0m0.169s
sys 0m0.046s

$ time python grepwc.py "python" bigfile
51

real 0m25.485s
user 0m2.733s
sys 0m0.375s

Ville Vainio · May 7, 2008

flines = (line for line in f if pat.search(line))

What about re.findall() / re.finditer() for the whole file contents?
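A sketch of what searching the whole file contents could look like, assuming Python 3 and a file small enough to hold in memory. To keep grep's per-line counting, the pattern is wrapped so that each match covers one full line; the function name and the wrapping are assumptions of this sketch, not something spelled out in the thread, and it presumes the pattern itself never matches a newline.

```python
import re

def count_lines_whole_text(text, pattern_source):
    """Count lines of `text` containing a match, in a single regex pass."""
    # With re.MULTILINE, ^ and $ anchor at line boundaries, and "." does
    # not cross newlines, so every match is exactly one matching line.
    line_pat = re.compile(r"^.*(?:%s).*$" % pattern_source, re.MULTILINE)
    return sum(1 for _ in line_pat.finditer(text))

# Hypothetical sample; with a real file: text = open("bigfile").read()
text = "first\nsometext here\nx sometext sometext\nlast\n"
print(count_lines_whole_text(text, "sometext"))  # prints 2
```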

Pop User · May 7, 2008

Anton said:
But I'm having trouble writing a grep analog that performs well.

I don't think you can ever catch grep. Searching is its only purpose in
life and it's very good at it. You may be able to come closer, this
thread relates.

http://groups.google.com/group/comp...read/thread/2f564523f476840a/d9476da5d7a9e466

This relates to the speed of re. If you don't need regex don't use re.
If you do need re an alternate re library might be useful but you
aren't going to catch grep.

Anton Slesarev · May 7, 2008

I don't think you can ever catch grep. Searching is its only purpose in
life and it's very good at it. You may be able to come closer, this
thread relates.

http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

This relates to the speed of re. If you don't need regex don't use re.
If you do need re an alternate re library might be useful but you
aren't going to catch grep.

In my last test I don't use re. As I understand it, the main problem is
in reading the file.
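If the cost is in per-line iteration rather than in the disk itself, one thing to try is reading large binary blocks and only splitting a block into lines when the substring occurs somewhere in it. This is a different technique from the scripts above, sketched for Python 3; the function name and the 1 MiB block size are arbitrary assumptions, the needle is assumed to contain no newline, and whether it helps would need measuring on the real file.

```python
def count_lines_chunked(path, needle, chunk_size=1 << 20):
    """Count lines containing the bytes `needle`, reading in large blocks."""
    count = 0
    tail = b""  # incomplete last line carried over from the previous block
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = tail + chunk
            cut = data.rfind(b"\n") + 1  # end of the last complete line
            block, tail = data[:cut], data[cut:]
            # Cheap check on the whole block first; split into lines
            # only when there is at least one hit to attribute.
            if needle in block:
                count += sum(1 for line in block.split(b"\n") if needle in line)
    # A final line without a trailing newline is still a line.
    if needle in tail:
        count += 1
    return count
```

A call such as count_lines_chunked("bigfile", b"python") would correspond to the grepwc.py run above. For sparse matches, most blocks are rejected by a single substring test, which runs in C.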

Ricardo Aráoz · May 8, 2008

All these examples assume your regular expression will not span multiple
lines, but this can easily be the case. How would you process the file
with regular expressions that span multiple lines?

Alan Isaac · May 8, 2008

Anton said:
I've read great paper about generators:
http://www.dabeaz.com/generators/index.html
The author says it's easy to write analogs of common Linux tools such
as awk, grep, etc. He says that performance could be even better.
But I'm having trouble writing a grep analog that performs well.

https://svn.enthought.com/svn/sandbox/grin/trunk/

hth,
Alan Isaac

Robert Kern · May 8, 2008

Alan said:
https://svn.enthought.com/svn/sandbox/grin/trunk/

As the author of grin I can definitively state that it is not at all competitive
with grep in terms of speed. grep reads files really fast. awk is probably
beatable, though.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Ville Vainio · May 9, 2008

All these examples assume your regular expression will not span multiple
lines, but this can easily be the case. How would you process the file
with regular expressions that span multiple lines?

re.findall/ finditer, as I said earlier.

Ricardo Aráoz · May 12, 2008

Ville said:
re.findall/ finditer, as I said earlier.

Hi, sorry took so long to answer. Too much work.

findall/finditer do not address the issue, they merely find ALL the
matches in a STRING. But if you keep reading the files a line at a time
(as most examples given in this thread do) then you are STILL in trouble
when a regular expression spans multiple lines.
The easy/simple (too easy/simple?) way I see out of it is to read THE
WHOLE file into memory and don't worry. But what if the file is too
heavy? So I was wondering if there is any other way out of it. Does grep
read the whole file into memory? Does it ONLY process a line at a time?

Kam-Hung Soh · May 12, 2008

Hi, sorry took so long to answer. Too much work.

findall/finditer do not address the issue, they merely find ALL the
matches in a STRING. But if you keep reading the files a line at a time
(as most examples given in this thread do) then you are STILL in trouble
when a regular expression spans multiple lines.
The easy/simple (too easy/simple?) way I see out of it is to read THE
WHOLE file into memory and don't worry. But what if the file is too
heavy? So I was wondering if there is any other way out of it. Does grep
read the whole file into memory? Does it ONLY process a line at a time?

Standard grep can only match a line at a time. Are you thinking about
"sed", which has a sliding window?

See http://www.gnu.org/software/sed/manual/sed.html, Section 4.13
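The sliding-window idea can be approximated in Python without loading the whole file: keep the last few lines in a buffer and run the regex against their concatenation. A rough sketch for Python 3, under the assumption that no match spans more than width lines; the function name and the width parameter are hypothetical, and this is not how sed itself is implemented.

```python
import re
from collections import deque

def multiline_matches(lines, pattern, width=2):
    """Yield matches of `pattern` that span at most `width` lines."""
    window = deque()

    def match_in_first_line():
        # search() returns the leftmost match; accept it only if it starts
        # in the oldest line, so each match is reported exactly once.
        m = pattern.search("".join(window))
        if m and m.start() < len(window[0]):
            return m.group(0)
        return None

    for line in lines:
        window.append(line)
        if len(window) == width:
            hit = match_in_first_line()
            if hit is not None:
                yield hit
            window.popleft()  # slide the window forward by one line
    while window:  # drain the shorter windows left at end of input
        hit = match_in_first_line()
        if hit is not None:
            yield hit
        window.popleft()

# Hypothetical sample; a file object could be passed instead of a list.
sample = ["foo start\n", "end bar\n", "nothing\n", "start\n", "end\n"]
print(list(multiline_matches(sample, re.compile(r"start\nend"))))
```

At most one match is reported per starting line, which mirrors grep's line-oriented counting more than findall's match counting.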

Ville M. Vainio · May 13, 2008

Ricardo Aráoz said:
The easy/simple (too easy/simple?) way I see out of it is to read THE
WHOLE file into memory and don't worry. But what if the file is too

The easiest and simplest approach is often the best with
Python. Reading in the whole file is rarely too heavy, and you omit
the python "object overhead" entirely - all the code executes in the
fast C extensions.

If the file is too big, you might want to look up mmap:

http://effbot.org/librarybook/mmap.htm
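A minimal sketch of the mmap approach, assuming Python 3, where the pattern must be a bytes pattern because the mapping exposes bytes. The function name is illustrative. Note that mmap raises ValueError for an empty file, which this sketch does not handle.

```python
import mmap
import re

def count_matches_mmap(path, pattern_bytes, flags=0):
    """Count regex matches over a memory-mapped file, newlines included."""
    pat = re.compile(pattern_bytes, flags)
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand,
        # so the file is never copied into a Python string.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(1 for _ in pat.finditer(mm))
```

A call such as count_matches_mmap("bigfile", rb"start\nend") would then count matches that cross a line boundary, which line-at-a-time reading cannot see.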

Ricardo Aráoz · May 13, 2008

Ville said:
The easiest and simplest approach is often the best with
Python.

Keep forgetting that!

If the file is too big, you might want to look up mmap:

http://effbot.org/librarybook/mmap.htm

Thanks!
