groupby() seems slow

7stud · Oct 16, 2007

I'm applying groupby() in a very simplistic way to split up some data,
but when I timeit against another method, it takes twice as long. The
following groupby() code groups the data between the "</tr>" strings:

data = [
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
]

import itertools

def key(s):
    if s[0] == "<":
        return 'a'
    else:
        return 'b'

def test3():
    master_list = []
    for group_key, group in itertools.groupby(data, key):
        if group_key == "b":
            master_list.append(list(group))

def test1():
    master_list = []
    row = []

    for elmt in data:
        if elmt[0] != "<":
            row.append(elmt)
        else:
            if row:
                master_list.append(" ".join(row))
                row = []

import timeit

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()

--output:---
42.791079998
19.0128788948

I thought groupby() would be faster. Am I doing something wrong?

George Sakkis · Oct 16, 2007

Yes and no. Yes, the groupby version can be improved a little by
calling a builtin method instead of a Python function. No, test1 still
beats it hands down (and with Psyco even further); it is almost as good
as it gets in pure Python.

FWIW, here's a faster and more compact version with groupby:

def test3b(data):
    join = ' '.join
    return [join(group) for key,group in
            itertools.groupby(data, "</tr>".__eq__)
            if not key]
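
For anyone who wants to check what this approach produces, here is a minimal sketch on one repetition of the sample data. It is written for Python 3 (the thread itself uses Python 2), and the names split_rows and is_sep are made up for illustration, not taken from the thread.

```python
import itertools

# One repetition of the sample data from the original post.
data = ["1.5", "</tr>", "2.5", "3.5", "4.5", "</tr>", "</tr>", "5.5", "6.5", "</tr>"]

def split_rows(items, sep="</tr>"):
    # sep.__eq__ is a bound builtin method, so groupby() computes each key
    # without calling back into a Python-level function.
    join = " ".join
    return [join(group)
            for is_sep, group in itertools.groupby(items, sep.__eq__)
            if not is_sep]  # keep only the runs that are not separators

rows = split_rows(data)
print(rows)  # ['1.5', '2.5 3.5 4.5', '5.5 6.5']
```

Consecutive "</tr>" items fall into a single separator group, which is why no empty row appears between "4.5" and "5.5".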

George

Massimo Di Pierro · Oct 16, 2007

Shouldn't this
b
b

output

b\nb

instead?

Massimo

Massimo Di Pierro · Oct 16, 2007

Even stranger
b
b

Massimo

Chris Mellon · Oct 16, 2007

Even stranger

b
b

You called print, so instead of getting an escaped string literal, the
string is being printed to your terminal, which is printing the
newline.

DiPierro, Massimo · Oct 16, 2007

That is not the problem. The problem is that

re.sub('a','\\n','bab')

cannot be the same as

re.sub('a','\n','bab')

This is evaluating the string to be substituted before the substitution.
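
A short sketch that may help in checking the behaviour under discussion. It is written for Python 3, and the variable names are illustrative only. It compares the two calls and shows the difference between the representation of the result and what print writes.

```python
import re

# Replacement written as two characters: a backslash followed by "n".
escaped = re.sub('a', '\\n', 'bab')
# Replacement written as one character: a real newline.
newline = re.sub('a', '\n', 'bab')

# re.sub() processes backslash escapes in the replacement string,
# so both calls return the same three-character result.
print(escaped == newline)  # True
print(repr(escaped))       # 'b\nb'
print(escaped)             # writes b, a line break, then b
```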

Massimo

DiPierro, Massimo · Oct 16, 2007

It is the first line that is wrong; the second follows from the first, I agree.

Tim Chase · Oct 16, 2007

Even stranger

b
b

That's to be expected. When not using a print statement, the raw
evaluation prints the representation of the object. In this
case, the representation is 'b\nb'. When you use the print
statement, it actually prints the characters rather than their
representations. No need to mess with re.sub() to get the behavior:
>>> print 'a\nb'
a
b
>>> 'a\nb'
'a\nb'

-tkc

DiPierro, Massimo · Oct 16, 2007

Let me show you a very bad consequence of this...

a=open('file1.txt','rb').read()
b=re.sub('x',a,'x')
open('file2.txt','wb').write(b)

Now if file1.txt contains a \n or \" then file2.txt is not the same as file1.txt while it should be.
Massimo

DiPierro, Massimo · Oct 16, 2007

Thank you, this answers my question. I wanted to make sure it was actually designed this way.

Massimo

Tim Chase · Oct 16, 2007

Let me show you a very bad consequence of this...

a=open('file1.txt','rb').read()
b=re.sub('x',a,'x')
open('file2.txt','wb').write(b)

Now if file1.txt contains a \n or \" then file2.txt is not the
same as file1.txt while it should be.

That's functioning as designed. If you want to treat file1.txt
as a literal pattern for replacement, use re.escape() on it to
escape things you don't want.

http://docs.python.org/lib/node46.html#l2h-407

Or, you can specially treat newlines:

b=re.sub('x', a.replace('\n', '\\n'), 'x')

or just escape the backslashes on the incoming pattern:

b=re.sub('x', a.replace('\\', '\\\\'), 'x')

In the help for the RE module's syntax, this is explicitly noted:

http://docs.python.org/lib/re-syntax.html
"""
If you're not using a raw string to express the pattern, remember
that Python also uses the backslash as an escape sequence in
string literals; if the escape sequence isn't recognized by
Python's parser, the backslash and subsequent character are
included in the resulting string. However, if Python would
recognize the resulting sequence, the backslash should be
repeated twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.
"""

The short upshot: "it's highly recommended that you use raw
strings for all but the simplest expressions."

Thus, the string that you pass as your regexp should be a regexp,
not a "Python interpretation of a regexp before the regex engine
gets to touch it".
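
As a rough sketch of the backslash-escaping fix mentioned above, here is a Python 3 version. It uses an in-memory string as a hypothetical stand-in for the contents of file1.txt. It also shows a second option, passing a function as the replacement, whose return value re.sub() inserts without processing escapes. Note that the current Python documentation describes re.escape() as a helper for patterns rather than for the replacement argument, so this sketch does not rely on it.

```python
import re

# Hypothetical stand-in for file1.txt: the four characters a, backslash, n, b.
contents = 'a\\nb'

# Passed directly, the backslash-n pair is processed into a real newline.
naive = re.sub('x', contents, 'x')

# Doubling every backslash first makes the replacement come through unchanged.
escaped = re.sub('x', contents.replace('\\', '\\\\'), 'x')

# A callable replacement is used as-is; its return value is not processed.
literal = re.sub('x', lambda match: contents, 'x')

print(repr(naive))    # 'a\nb'
print(repr(escaped))  # 'a\\nb'
print(repr(literal))  # 'a\\nb'
```

The callable form avoids having to remember which characters need escaping, at the cost of one Python-level call per match.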

-tkc

Raymond Hettinger · Oct 16, 2007

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()

--output:---
42.791079998
19.0128788948

I thought groupby() would be faster. Am I doing something wrong?

The groupby() function is not where you are losing speed. In test1,
you've in-lined the code for computing the key. In test3, groupby()
makes expensive, repeated calls to a pure Python key function. For
an apples-to-apples comparison, try something like this:

def test4():
    master_list = []
    row = []
    for elem in data:
        if key(elem) == 'b':    # 'b' marks data items, matching test1
            row.append(elem)
        elif row:
            master_list.append(' '.join(row))
            del row[:]
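
One way such a comparison might be set up is sketched below for Python 3. The function names inline_loop, keyed_loop and keyed_groupby are invented for this sketch, and the repeat count is an arbitrary small value. All three variants join each row into a string, so they differ only in how the key is computed and whether groupby() is used. Timings depend on the machine and interpreter, so none are shown.

```python
import itertools
import timeit

data = [
    "1.5", "</tr>", "2.5", "3.5", "4.5", "</tr>", "</tr>", "5.5", "6.5", "</tr>",
] * 3

def key(s):
    # Same classification as the thread's key(): 'a' for tags, 'b' for data.
    return 'a' if s[0] == "<" else 'b'

def inline_loop():
    # Key test written inline, as in test1.
    rows, row = [], []
    for elem in data:
        if elem[0] != "<":
            row.append(elem)
        elif row:
            rows.append(" ".join(row))
            row = []
    return rows

def keyed_loop():
    # Same loop, but paying for a Python-level key() call per element.
    rows, row = [], []
    for elem in data:
        if key(elem) == 'b':
            row.append(elem)
        elif row:
            rows.append(" ".join(row))
            row = []
    return rows

def keyed_groupby():
    # groupby() with the same Python-level key(), joining each group.
    return [" ".join(group)
            for group_key, group in itertools.groupby(data, key)
            if group_key == 'b']

# The variants should agree on their output before timings are compared.
assert inline_loop() == keyed_loop() == keyed_groupby()

for func in (inline_loop, keyed_loop, keyed_groupby):
    # number is kept small so the sketch finishes quickly.
    print(func.__name__, timeit.timeit(func, number=20000))
```

Comparing keyed_loop with keyed_groupby isolates the cost of groupby() itself, while comparing inline_loop with keyed_loop isolates the cost of the key function calls.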

Raymond
