dict.get and str.xsplit

B

bearophileHUGS

When I use dictionaries I find the dict.get() method very handy, and
I'd like to use it often, but I think it's quite slow. This is a small
benchmark:

from random import randrange
from timeit import default_timer as clock

def main(N, M):
keys = list(set(randrange(2*N) for n in xrange(N)))
n2 = len(keys) / 2
keys1, keys2 = keys[:n2], keys[n2:]
adict = dict.fromkeys(keys1)

t = clock()
for _ in xrange(M):
for k in keys1:
r = adict.get(k, None)
for k in keys2:
r = adict.get(k, None)
print round(clock() - t, 2), "s"

t = clock()
for _ in xrange(M):
for k in keys1:
if k in adict:
r = adict[k]
else:
r = None
for k in keys2:
if k in adict:
r = adict[k]
else:
r = None
print round(clock() - t, 2), "s"

main(40000, 30)


On my old PC the output is:
2.36 s
1.59 s
(Your timings may be 4-5 times smaller, so you can tweak N and M to
have significant results).
And note that the if/then version is faster even if it calls the hash
function two times, while get() is supposed to do it once (in this
case the hashing function is very simple and fast. If the keys are
long strings the results of a similar benchmark may be different,
because the computing of the hashing function may become the slowest
part of that code).

This is a real difference, that has real impact on the programs I
write, so I often use the if/else approach, despite the dict.get()
method being semantically fitter and shorter.
So can the dict.get() method be speed up? And if not, why?

------------------

Another method I'd like to see, is the str.xsplit() I was talking
about time ago too. It's like str.split() but it's lazy. It avoids to
build a potentially huge list. In real D programs I have seen such
string function may speed up heavy-string-processing code (the kind of
code you may want to use Python/Perl/Ruby for) up to 2-3 times. Seeing
how such kind of processing is common in Python I think this isn't an
optimization to ignore.
(I am now getting better in writing C code, so in few months I may
even become able to submit a xsplit patch myself. I think it's not too
much complex to write, it can borrow lot of code from the str.split
method).

Bye, and thank you for your kind attention,
bearophile
 
M

Marc 'BlackJack' Rintsch

This is a real difference, that has real impact on the programs I
write, so I often use the if/else approach, despite the dict.get()
method being semantically fitter and shorter.
So can the dict.get() method be speed up? And if not, why?

I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

adict_get = adict.get
for _ in xrange(M):
for k in keys1:
r = adict_get(k, None)
for k in keys2:
r = adict_get(k, None)

Ciao,
Marc 'BlackJack' Rintsch
 
C

castironpi

I guess it's the method lookup that's the slow part.  Factor it out of the
loop and measure again::

    adict_get = adict.get
    for _ in xrange(M):
        for k in keys1:
            r = adict_get(k, None)
        for k in keys2:
            r = adict_get(k, None)

Ciao,
        Marc 'BlackJack' Rintsch

Can't be. The string 'get' is only hashed once, since it's hard-coded
into the script, and looking it up can't be any slower than looking up
__getitem__.
 
S

Steve Holden

Marc said:
I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

adict_get = adict.get
for _ in xrange(M):
for k in keys1:
r = adict_get(k, None)
for k in keys2:
r = adict_get(k, None)

And think about using the timeit module, since it's been included among
the batteries for your convenience, and it removes some common pitfalls
with cross-platform timings.

regards
Steve
 
M

Marc 'BlackJack' Rintsch

Can't be. The string 'get' is only hashed once, since it's hard-coded
into the script, and looking it up can't be any slower than looking up
__getitem__.

Within functions it is faster. In the original code the `get` attribute is
looked up on the `adict` object twice in each loop iteration via hashing.
In my code it is looked up once before the loop and within the loop the
local name `adict_get` isn't looked up but hardcoded as index into the
internal locals array of the function.

Ciao,
Marc 'BlackJack' Rintsch
 
B

bearophileHUGS

Marc 'BlackJack' Rintsch:
I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

I did know of that optimization, but sometimes I "forget" about it...
The new timings:

Output timings, best of 3, unloaded CPU:
2.32 s with adict.get
1.56 s with "in" + if/then/else
1.78 s with adict_get

It's slower still, despite doing the lookup two times half of the
times (half keys are present in the test set). The "in" is an
operator, it's faster than a method call, but I don't understand the
other details. Now the difference between 1.78 and 1.56 is small
enough, so probably now it's not much noticeable in practice in real
programs, so my original argument is mostly dead now :) In the code
points where speed matters instead of a single line with a get() I can
use two lines to create a local adict_get.

Bye and thank you,
bearophile
 
C

castironpi

Marc 'BlackJack' Rintsch:


I did know of that optimization, but sometimes I "forget" about it...
The new timings:

Output timings, best of 3, unloaded CPU:
2.32 s with adict.get
1.56 s with "in" + if/then/else
1.78 s with adict_get

It's slower still, despite doing the lookup two times half of the
times (half keys are present in the test set). The "in" is an
operator, it's faster than a method call, but I don't understand the
other details. Now the difference between 1.78 and 1.56 is small
enough, so probably now it's not much noticeable in practice in real
programs, so my original argument is mostly dead now :) In the code
points where speed matters instead of a single line with a get() I can
use two lines to create a local adict_get.

Bye and thank you,
bearophile

Where does adict_getitem fit in? And try: except: KeyError?
 
M

marek.rocki

(e-mail address removed) napisa³(a):
It's slower still, despite doing the lookup two times half of the
times (half keys are present in the test set). The "in" is an
operator, it's faster than a method call, but I don't understand the
other details. Now the difference between 1.78 and 1.56 is small
enough, so probably now it's not much noticeable in practice in real
programs, so my original argument is mostly dead now :) In the code
points where speed matters instead of a single line with a get() I can
use two lines to create a local adict_get.

AFAIK, method/function call is generally slow in Python (similarly to
the dot attribute lookup). As for the original prooblem, why not use
defaultdict? I think it's the most idiomatic way to achieve what we
want. And also the fastest one, according to my quick-and-dirty tests:

from collections import defaultdict

def main(N, M):

# <snip>

addict = defaultdict(lambda: None, adict)
t = clock()
for _ in xrange(M):
for k in keys1:
r = addict[k]
for k in keys2:
r = addict[k]
print round(clock() - t, 2), "s"

adict.get: 3.12 s
adict_get: 2.24 s
in: 1.62 s
defaultdict: 1.05 s
 
B

bearophileHUGS

(e-mail address removed):
As for the original prooblem, why not use
defaultdict? I think it's the most idiomatic way to achieve what we
want. And also the fastest one, according to my quick-and-dirty tests:

It adds the new keys, I can't accept that:
from collections import defaultdict as dd
adict = {1:2, 3:4}
addict = dd(lambda: None, adict)
r = addict[10]
addict
defaultdict(<function <lambda> at 0x008E6FB0>, {1: 2, 10: None, 3: 4})

Bye,
bearophile
 
M

marek.rocki

(e-mail address removed) napisa³(a):
(e-mail address removed):

It adds the new keys, I can't accept that:

Right, my fault not noticing that. I experimented further with
__missing__ method and with "ask forgiveness" idiom (try... except
KeyError) and they were both noticeably slower. So checking with "in"
may be the most efficient way we have.

Regards,
Marek
 
G

Gabriel Genellina

Can't be. The string 'get' is only hashed once, since it's hard-coded
into the script, and looking it up can't be any slower than looking up
__getitem__.

In the original code, the 'get' method is searched a lot of times (every
time it's called), not just once; Python can't assume that it remains the
same at every invocation.

And __getitem__ isn't searched by name; there is a predefined slot in the
type structure for it. In case the method were overriden in Python code,
the slot is replaced with a wrapper function that calls the Python code.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,581
Members
45,057
Latest member
KetoBeezACVGummies

Latest Threads

Top