groupby() seems slow

7

7stud

I'm applying groupby() in a very simplistic way to split up some data,
but when I timeit against another method, it takes twice as long. The
following groupby() code groups the data between the "</tr>" strings:

data = [
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
]

import itertools

def key(s):
if s[0] == "<":
return 'a'
else:
return 'b'

def test3():

master_list = []
for group_key, group in itertools.groupby(data, key):
if group_key == "b":
master_list.append(list(group) )


def test1():
master_list = []
row = []

for elmt in data:
if elmt[0] != "<":
row.append(elmt)
else:
if row:
master_list.append(" ".join(row) )
row = []


import timeit

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()

--output:---
42.791079998
19.0128788948

I thought groupby() would be faster. Am I doing something wrong?
 
G

George Sakkis

I'm applying groupby() in a very simplistic way to split up some data,
but when I timeit against another method, it takes twice as long. The
following groupby() code groups the data between the "</tr>" strings:

data = [
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
]

import itertools

def key(s):
if s[0] == "<":
return 'a'
else:
return 'b'

def test3():

master_list = []
for group_key, group in itertools.groupby(data, key):
if group_key == "b":
master_list.append(list(group) )

def test1():
master_list = []
row = []

for elmt in data:
if elmt[0] != "<":
row.append(elmt)
else:
if row:
master_list.append(" ".join(row) )
row = []

import timeit

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()

--output:---
42.791079998
19.0128788948

I thought groupby() would be faster. Am I doing something wrong?

Yes and no. Yes, the groupby version can be improved a little by
calling a builtin method instead of a Python function. No, test1 still
beats it hands down (and with Psyco even further); it is almost good
as it gets in pure Python.

FWIW, here's a faster and more compact version with groupby:

def test3b(data):
join = ' '.join
return [join(group) for key,group in
itertools.groupby(data, "</tr>".__eq__)
if not key]


George
 
M

Massimo Di Pierro

Shouldn't this
b
b

output

b\nb

instead?

Massimo

I'm applying groupby() in a very simplistic way to split up some
data,
but when I timeit against another method, it takes twice as long.
The
following groupby() code groups the data between the "</tr>" strings:

data = [
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
]

import itertools

def key(s):
if s[0] == "<":
return 'a'
else:
return 'b'

def test3():

master_list = []
for group_key, group in itertools.groupby(data, key):
if group_key == "b":
master_list.append(list(group) )

def test1():
master_list = []
row = []

for elmt in data:
if elmt[0] != "<":
row.append(elmt)
else:
if row:
master_list.append(" ".join(row) )
row = []

import timeit

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()

--output:---
42.791079998
19.0128788948

I thought groupby() would be faster. Am I doing something wrong?

Yes and no. Yes, the groupby version can be improved a little by
calling a builtin method instead of a Python function. No, test1 still
beats it hands down (and with Psyco even further); it is almost good
as it gets in pure Python.

FWIW, here's a faster and more compact version with groupby:

def test3b(data):
join = ' '.join
return [join(group) for key,group in
itertools.groupby(data, "</tr>".__eq__)
if not key]


George
 
M

Massimo Di Pierro

Even stranger
b
b

Massimo


Shouldn't this
b
b

output

b\nb

instead?

Massimo

I'm applying groupby() in a very simplistic way to split up some
data,
but when I timeit against another method, it takes twice as long.
The
following groupby() code groups the data between the "</tr>"
strings:

data = [
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
"1.5","</tr>","2.5","3.5","4.5","</tr>","</tr>","5.5","6.5","</tr>",
]

import itertools

def key(s):
if s[0] == "<":
return 'a'
else:
return 'b'

def test3():

master_list = []
for group_key, group in itertools.groupby(data, key):
if group_key == "b":
master_list.append(list(group) )

def test1():
master_list = []
row = []

for elmt in data:
if elmt[0] != "<":
row.append(elmt)
else:
if row:
master_list.append(" ".join(row) )
row = []

import timeit

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()

--output:---
42.791079998
19.0128788948

I thought groupby() would be faster. Am I doing something wrong?

Yes and no. Yes, the groupby version can be improved a little by
calling a builtin method instead of a Python function. No, test1
still
beats it hands down (and with Psyco even further); it is almost good
as it gets in pure Python.

FWIW, here's a faster and more compact version with groupby:

def test3b(data):
join = ' '.join
return [join(group) for key,group in
itertools.groupby(data, "</tr>".__eq__)
if not key]


George
 
C

Chris Mellon

Even stranger

b
b

You called print, so instead of getting an escaped string literal, the
string is being printed to your terminal, which is printing the
newline.
 
D

DiPierro, Massimo

That is not the problem. The problem is that

re.sub('a','\\n','bab')

cannot be the same as

re.sub('a','\n','bab')

This is evaluating the string to be substituted before the substitution.

Massimo

________________________________________
From: [email protected] [[email protected]] On Behalf Of Chris Mellon [[email protected]]
Sent: Tuesday, October 16, 2007 1:12 PM
To: (e-mail address removed)
Subject: Re: re.sub

Even stranger

b
b

You called print, so instead of getting an escaped string literal, the
string is being printed to your terminal, which is printing the
newline.
 
D

DiPierro, Massimo

It is the fisrt line that is wrong, the second follows from the first, I agree.

________________________________________
From: Tim Chase [[email protected]]
Sent: Tuesday, October 16, 2007 1:20 PM
To: DiPierro, Massimo
Cc: (e-mail address removed); Berthiaume, Andre
Subject: Re: re.sub
Even stranger

b
b

That's to be expected. When not using a print statement, the raw
evaluation prints the representation of the object. In this
case, the representation is 'b\nb'. When you use the print
statement, it actually prints the characters rather than their
representations. No need to mess with re.sub() to get the behavior:
a
b 'a\nb'

-tkc
 
T

Tim Chase

Even stranger

That's to be expected. When not using a print statement, the raw
evaluation prints the representation of the object. In this
case, the representation is 'b\nb'. When you use the print
statement, it actually prints the characters rather than their
representations. No need to mess with re.sub() to get the behavior:
a
b 'a\nb'

-tkc
 
D

DiPierro, Massimo

Let me show you a very bad consequence of this...

a=open('file1.txt','rb').read()
b=re.sub('x',a,'x')
open('file2.txt','wb').write(b)

Now if file1.txt contains a \n or \" then file2.txt is not the same as file1.txt while it should be.
Massimo

________________________________________
From: Tim Chase [[email protected]]
Sent: Tuesday, October 16, 2007 1:20 PM
To: DiPierro, Massimo
Cc: (e-mail address removed); Berthiaume, Andre
Subject: Re: re.sub
Even stranger

b
b

That's to be expected. When not using a print statement, the raw
evaluation prints the representation of the object. In this
case, the representation is 'b\nb'. When you use the print
statement, it actually prints the characters rather than their
representations. No need to mess with re.sub() to get the behavior:
a
b 'a\nb'

-tkc
 
D

DiPierro, Massimo

Thank you this answers my question. I wanted to make sure it was actually designed this way.

Massimo

________________________________________
From: Tim Chase [[email protected]]
Sent: Tuesday, October 16, 2007 1:38 PM
To: DiPierro, Massimo
Cc: (e-mail address removed); Berthiaume, Andre
Subject: Re: re.sub
Let me show you a very bad consequence of this...

a=open('file1.txt','rb').read()
b=re.sub('x',a,'x')
open('file2.txt','wb').write(b)

Now if file1.txt contains a \n or \" then file2.txt is not the
same as file1.txt while it should be.

That's functioning as designed. If you want to treat file1.txt
as a literal pattern for replacement, use re.escape() on it to
escape things you don't want.

http://docs.python.org/lib/node46.html#l2h-407

Or, you can specially treat newlines:

b=re.sub('x', a.replace('\n', '\\n'), 'x')

or just escape the backslashes on the incoming pattern:

b=re.sub('x', a.replace('\\', '\\\\'), 'x')

In the help for the RE module's syntax, this is explicitly noted:

http://docs.python.org/lib/re-syntax.html
"""
If you're not using a raw string to express the pattern, remember
that Python also uses the backslash as an escape sequence in
string literals; if the escape sequence isn't recognized by
Python's parser, the backslash and subsequent character are
included in the resulting string. However, if Python would
recognize the resulting sequence, the backslash should be
repeated twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.
"""

The short upshot: "it's highly recommended that you use raw
strings for all but the simplest expressions."

Thus, the string that you pass as your regexp should be a regexp.
Not a "python interpretation a regexp before the regex engine
gets to touch it".

-tkc
 
T

Tim Chase

Let me show you a very bad consequence of this...
a=open('file1.txt','rb').read()
b=re.sub('x',a,'x')
open('file2.txt','wb').write(b)

Now if file1.txt contains a \n or \" then file2.txt is not the
same as file1.txt while it should be.

That's functioning as designed. If you want to treat file1.txt
as a literal pattern for replacement, use re.escape() on it to
escape things you don't want.

http://docs.python.org/lib/node46.html#l2h-407

Or, you can specially treat newlines:

b=re.sub('x', a.replace('\n', '\\n'), 'x')

or just escape the backslashes on the incoming pattern:

b=re.sub('x', a.replace('\\', '\\\\'), 'x')

In the help for the RE module's syntax, this is explicitly noted:

http://docs.python.org/lib/re-syntax.html
"""
If you're not using a raw string to express the pattern, remember
that Python also uses the backslash as an escape sequence in
string literals; if the escape sequence isn't recognized by
Python's parser, the backslash and subsequent character are
included in the resulting string. However, if Python would
recognize the resulting sequence, the backslash should be
repeated twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.
"""

The short upshot: "it's highly recommended that you use raw
strings for all but the simplest expressions."

Thus, the string that you pass as your regexp should be a regexp.
Not a "python interpretation a regexp before the regex engine
gets to touch it".

-tkc
 
R

Raymond Hettinger

t = timeit.Timer("test3()", "from __main__ import test3, key, data")
print t.timeit()
t = timeit.Timer("test1()", "from __main__ import test1, data")
print t.timeit()

--output:---
42.791079998
19.0128788948

I thought groupby() would be faster. Am I doing something wrong?

The groupby() function is not where you are losing speed. In test1,
you've in-lined the code for computing the key. In test3, groupby()
makes expensive, repeated calls to a pure python key function. For
an apples-to-apples comparison, try something like this:

def test4():
master_list = []
row = []
for elem in data:
if key(elem) == 'a':
row.append(elem)
elif row:
master_list.append(' '.join(row))
del row[:]


Raymond
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,020
Latest member
GenesisGai

Latest Threads

Top