dict.get and str.xsplit

bearophileHUGS · Feb 26, 2008

When I use dictionaries I find the dict.get() method very handy, and
I'd like to use it often, but I think it's quite slow. This is a small
benchmark:

from random import randrange
from timeit import default_timer as clock

def main(N, M):
keys = list(set(randrange(2*N) for n in xrange(N)))
n2 = len(keys) / 2
keys1, keys2 = keys[:n2], keys[n2:]
adict = dict.fromkeys(keys1)

t = clock()
for _ in xrange(M):
for k in keys1:
r = adict.get(k, None)
for k in keys2:
r = adict.get(k, None)
print round(clock() - t, 2), "s"

t = clock()
for _ in xrange(M):
for k in keys1:
if k in adict:
r = adict[k]
else:
r = None
for k in keys2:
if k in adict:
r = adict[k]
else:
r = None
print round(clock() - t, 2), "s"

main(40000, 30)

On my old PC the output is:
2.36 s
1.59 s
(Your timings may be 4-5 times smaller, so you can tweak N and M to
have significant results).
And note that the if/then version is faster even if it calls the hash
function two times, while get() is supposed to do it once (in this
case the hashing function is very simple and fast. If the keys are
long strings the results of a similar benchmark may be different,
because the computing of the hashing function may become the slowest
part of that code).

This is a real difference, that has real impact on the programs I
write, so I often use the if/else approach, despite the dict.get()
method being semantically fitter and shorter.
So can the dict.get() method be speed up? And if not, why?

------------------

Another method I'd like to see, is the str.xsplit() I was talking
about time ago too. It's like str.split() but it's lazy. It avoids to
build a potentially huge list. In real D programs I have seen such
string function may speed up heavy-string-processing code (the kind of
code you may want to use Python/Perl/Ruby for) up to 2-3 times. Seeing
how such kind of processing is common in Python I think this isn't an
optimization to ignore.
(I am now getting better in writing C code, so in few months I may
even become able to submit a xsplit patch myself. I think it's not too
much complex to write, it can borrow lot of code from the str.split
method).

Bye, and thank you for your kind attention,
bearophile

Marc 'BlackJack' Rintsch · Feb 26, 2008

This is a real difference, that has real impact on the programs I
write, so I often use the if/else approach, despite the dict.get()
method being semantically fitter and shorter.
So can the dict.get() method be speed up? And if not, why?

I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

adict_get = adict.get
for _ in xrange(M):
for k in keys1:
r = adict_get(k, None)
for k in keys2:
r = adict_get(k, None)

Ciao,
Marc 'BlackJack' Rintsch

castironpi · Feb 26, 2008

I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

adict_get = adict.get
for _ in xrange(M):
for k in keys1:
r = adict_get(k, None)
for k in keys2:
r = adict_get(k, None)

Ciao,
Marc 'BlackJack' Rintsch

Can't be. The string 'get' is only hashed once, since it's hard-coded
into the script, and looking it up can't be any slower than looking up
__getitem__.

Steve Holden · Feb 26, 2008

Marc said:
I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

adict_get = adict.get
for _ in xrange(M):
for k in keys1:
r = adict_get(k, None)
for k in keys2:
r = adict_get(k, None)

And think about using the timeit module, since it's been included among
the batteries for your convenience, and it removes some common pitfalls
with cross-platform timings.

regards
Steve

Marc 'BlackJack' Rintsch · Feb 26, 2008

Can't be. The string 'get' is only hashed once, since it's hard-coded
into the script, and looking it up can't be any slower than looking up
__getitem__.

Within functions it is faster. In the original code the `get` attribute is
looked up on the `adict` object twice in each loop iteration via hashing.
In my code it is looked up once before the loop and within the loop the
local name `adict_get` isn't looked up but hardcoded as index into the
internal locals array of the function.

Ciao,
Marc 'BlackJack' Rintsch

bearophileHUGS · Feb 26, 2008

Marc 'BlackJack' Rintsch:

I guess it's the method lookup that's the slow part. Factor it out of the
loop and measure again::

I did know of that optimization, but sometimes I "forget" about it...
The new timings:

Output timings, best of 3, unloaded CPU:
2.32 s with adict.get
1.56 s with "in" + if/then/else
1.78 s with adict_get

It's slower still, despite doing the lookup two times half of the
times (half keys are present in the test set). The "in" is an
operator, it's faster than a method call, but I don't understand the
other details. Now the difference between 1.78 and 1.56 is small
enough, so probably now it's not much noticeable in practice in real
programs, so my original argument is mostly dead now

In the code
points where speed matters instead of a single line with a get() I can
use two lines to create a local adict_get.

Bye and thank you,
bearophile

castironpi · Feb 26, 2008

Marc 'BlackJack' Rintsch:

I did know of that optimization, but sometimes I "forget" about it...
The new timings:

Output timings, best of 3, unloaded CPU:
2.32 s with adict.get
1.56 s with "in" + if/then/else
1.78 s with adict_get

It's slower still, despite doing the lookup two times half of the
times (half keys are present in the test set). The "in" is an
operator, it's faster than a method call, but I don't understand the
other details. Now the difference between 1.78 and 1.56 is small
enough, so probably now it's not much noticeable in practice in real
programs, so my original argument is mostly dead now In the code
points where speed matters instead of a single line with a get() I can
use two lines to create a local adict_get.

Bye and thank you,
bearophile

Where does adict_getitem fit in? And try: except: KeyError?

marek.rocki · Feb 26, 2008

(e-mail address removed) napisa³(a):

It's slower still, despite doing the lookup two times half of the
times (half keys are present in the test set). The "in" is an
operator, it's faster than a method call, but I don't understand the
other details. Now the difference between 1.78 and 1.56 is small
enough, so probably now it's not much noticeable in practice in real
programs, so my original argument is mostly dead now In the code
points where speed matters instead of a single line with a get() I can
use two lines to create a local adict_get.

AFAIK, method/function call is generally slow in Python (similarly to
the dot attribute lookup). As for the original prooblem, why not use
defaultdict? I think it's the most idiomatic way to achieve what we
want. And also the fastest one, according to my quick-and-dirty tests:

from collections import defaultdict

def main(N, M):

# <snip>

addict = defaultdict(lambda: None, adict)
t = clock()
for _ in xrange(M):
for k in keys1:
r = addict[k]
for k in keys2:
r = addict[k]
print round(clock() - t, 2), "s"

adict.get: 3.12 s
adict_get: 2.24 s
in: 1.62 s
defaultdict: 1.05 s

bearophileHUGS · Feb 26, 2008

(e-mail address removed):

As for the original prooblem, why not use
defaultdict? I think it's the most idiomatic way to achieve what we
want. And also the fastest one, according to my quick-and-dirty tests:

It adds the new keys, I can't accept that:

from collections import defaultdict as dd
adict = {1:2, 3:4}
addict = dd(lambda: None, adict)
r = addict[10]
addict

Click to expand...

Click to expand...

defaultdict(<function <lambda> at 0x008E6FB0>, {1: 2, 10: None, 3: 4})

Bye,
bearophile

marek.rocki · Feb 26, 2008

(e-mail address removed) napisa³(a):

(e-mail address removed):

It adds the new keys, I can't accept that:

Right, my fault not noticing that. I experimented further with
__missing__ method and with "ask forgiveness" idiom (try... except
KeyError) and they were both noticeably slower. So checking with "in"
may be the most efficient way we have.

Regards,
Marek

Gabriel Genellina · Feb 27, 2008

En Tue said:
Can't be. The string 'get' is only hashed once, since it's hard-coded
into the script, and looking it up can't be any slower than looking up
__getitem__.

In the original code, the 'get' method is searched a lot of times (every
time it's called), not just once; Python can't assume that it remains the
same at every invocation.

And __getitem__ isn't searched by name; there is a predefined slot in the
type structure for it. In case the method were overriden in Python code,
the slot is replaced with a wrapper function that calls the Python code.

Fibonacci	0	May 13, 2023
Calculate rang and derang of ordering of subsets	0	Feb 6, 2024
Code with random module faster on the vm than the vm host...	4	Nov 8, 2013
Range / empty list issues??	1	Dec 11, 2023
Blue J Ciphertext Program	2	Nov 22, 2023
My Status, Ciphertext	2	Nov 28, 2023
ChatGPT will make us Job(Home)less	3	Jan 22, 2023
Calculate rang and derang of ordering of subsets	1	Feb 6, 2024

dict.get and str.xsplit

bearophileHUGS

Marc 'BlackJack' Rintsch

castironpi

Steve Holden

Marc 'BlackJack' Rintsch

bearophileHUGS

castironpi

marek.rocki

bearophileHUGS

marek.rocki

Gabriel Genellina

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads