Optimize function similiar to dict.update() but adds common values

  • Thread starter =?ISO-8859-1?Q?Gregory_Pi=F1ero?=
  • Start date
?

=?ISO-8859-1?Q?Gregory_Pi=F1ero?=

Hey guys,

I thought I'd throw this out there since everyone loves optimization
questions (it's true, check out the number of replies to those type of
questions!)

Right now I have a function shown below. I want to combine two
dictionaries and add the values together where-ever there is overlap.
I figured this function would be reasonably fast, but I have around 1
million keys in each dictionary (~80% overlap) and it just runs all
night.

So maybe my function is not O(n)? I really figured it was but maybe
dictionary lookups are O(nlogn)?

Anyway here is the function. Any ideas on doing this task a lot
faster would be appriciated.

Thanks!

<code apology=sorry if gmail kills my tabs>

def add_freqs(freq1,freq2):
"""Add two word freq dicts"""
newfreq={}
for key,value in freq1.items():
newfreq[key]=value+freq2.get(key,0)
for key,value in freq2.items():
newfreq[key]=value+freq1.get(key,0)
return newfreq

freq1={
'dog':1,
'cat':2,
'human':3
}
freq2={
'perro':1,
'gato':1,
'human':2
}
print add_freqs(freq1,freq2)

answer_I_want={
'dog':1,
'cat':2,
'perro':1,
'gato':1,
'human':5
}
</code>
 
P

Peter Otten

Gregory said:
def add_freqs(freq1,freq2):
    """Add two word freq dicts"""
    newfreq={}
    for key,value in freq1.items():
        newfreq[key]=value+freq2.get(key,0)
    for key,value in freq2.items():
        newfreq[key]=value+freq1.get(key,0)
    return newfreq
Any ideas on doing this task a lot faster would be appriciated.

With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Peter
 
?

=?ISO-8859-1?Q?Gregory_Pi=F1ero?=

Thanks Peter, those are some really good ideas. I can't wait to try
them out tonight.

Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?
That's why I have the two loops in my original function.

-Greg


Gregory said:
def add_freqs(freq1,freq2):
"""Addtwowordfreqdicts"""
newfreq={}
forkey,valueinfreq1.items():
newfreq[key]=value+freq2.get(key,0)
forkey,valueinfreq2.items():
newfreq[key]=value+freq1.get(key,0)
returnnewfreq
Anyideasondoingthistaskalot faster would be appriciated.

With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.

Peter
 
S

Steve Holden

Gregory Piñero wrote:
[top-posting rearranged]
Gregory Piñero wrote:

def add_freqs(freq1,freq2):
"""Addtwowordfreqdicts"""
newfreq={}
forkey,valueinfreq1.items():
newfreq[key]=value+freq2.get(key,0)
forkey,valueinfreq2.items():
newfreq[key]=value+freq1.get(key,0)
returnnewfreq
Anyideasondoingthistaskalot faster would be appriciated.

With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.
> Thanks Peter, those are some really good ideas. I can't wait to try
> them out tonight.
>
> Here's a question about your functions. if I only look at the keys in
> freq2 then won't I miss any keys that are in freq1 and not in freq2?
> That's why I have the two loops in my original function.
>
No, because the statement

total = dict(freq1) creates total as a shallow copy of freq1. Thus
all that remains to be done is to add the items from freq2.

regards
Steve
 
S

Steve Holden

Gregory Piñero wrote:
[top-posting rearranged]
Gregory Piñero wrote:

def add_freqs(freq1,freq2):
"""Addtwowordfreqdicts"""
newfreq={}
forkey,valueinfreq1.items():
newfreq[key]=value+freq2.get(key,0)
forkey,valueinfreq2.items():
newfreq[key]=value+freq1.get(key,0)
returnnewfreq
Anyideasondoingthistaskalot faster would be appriciated.

With items() you copy the whole dictionary into a list of tuples;
iteritems() just walks over the existing dictionary and creates one tuple
at a time.

With "80% overlap", you are looking up and setting four out of five values
twice in your for-loops.

Dump the symmetry and try one of these:

def add_freqs2(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
if key in freq1:
total[key] += value
else:
total[key] = value
return total

def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

My guess is that add_freqs3() will perform best.
> Thanks Peter, those are some really good ideas. I can't wait to try
> them out tonight.
>
> Here's a question about your functions. if I only look at the keys in
> freq2 then won't I miss any keys that are in freq1 and not in freq2?
> That's why I have the two loops in my original function.
>
No, because the statement

total = dict(freq1) creates total as a shallow copy of freq1. Thus
all that remains to be done is to add the items from freq2.

regards
Steve
 
P

Peter Otten

Gregory said:
Here's a question about your functions. if I only look at the keys in
freq2 then won't I miss any keys that are in freq1 and not in freq2?

No. As I start with a copy of freq1, all keys of freq1 are already there.
There is probably a loop involved, but it in Python's underlying C
implementation, which is a bit faster.

What is left to do is to (1) add key and value for the 20% of freq2 that are
not in freq1, and to (2) increase the value for the 80% where the key
occurs in both freq1 and freq2. This is done by the for-loop.

Peter
 
?

=?ISO-8859-1?Q?Gregory_Pi=F1ero?=

Yes, that makes sense. I can't wait to run it tonight. Sorry I can't
give you the running time of my original function as it never finished
:-(

I'll report back the running time of the new function though, assuming
it finishes ;-)

Thanks again,

-Greg
 
?

=?ISO-8859-1?Q?Gregory_Pi=F1ero?=

OK, I ran Peter's add_freq3 and it ran four times on really large
dictionaries in about 3000 seconds. So I'd say that at a minimum
that's ten times faster than my original function since it ran all
last night and didn't finish.

Much obliged, Peter!

-Greg
 
C

Christopher Subich

Peter said:
def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:

total[key] = total.get(key,0) + value
 
B

bonono

Christopher said:
Peter said:
def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:

total[key] = total.get(key,0) + value
Would that have a performance impact as it seems that key needs to be
hashed twice or may be += also need to do it twice ? Just curious.
 
P

Peter Otten

Christopher said:
Peter said:
def add_freqs3(freq1, freq2):
total = dict(freq1)
for key, value in freq2.iteritems():
try:
total[key] += value
except KeyError:
total[key] = value
return total

Untested, but replacing the try/except pair with the following should be
semantically equivalent and a bit faster:

total[key] = total.get(key,0) + value

That depends on the data. For an 80/20 matches to misses ratio it ranges
between the two approaches I gave:

$ python -m timeit -r1 -n1 -s"from bm import fin as f" "f()"
1 loops, best of 1: 70.5 msec per loop
$ python -m timeit -r1 -n1 -s"from bm import fget as f" "f()"
1 loops, best of 1: 91.1 msec per loop
$ python -m timeit -r1 -n1 -s"from bm import fex as f" "f()"
1 loops, best of 1: 142 msec per loop

The break even for fget/fex is at about 92% hit rate:

$ python -m timeit -r1 -n1 -s"from bm import fget as f; d =
dict.fromkeys(range(92000), 0)" "f(d)"
1 loops, best of 1: 88 msec per loop

$ python -m timeit -r1 -n1 -s"from bm import fex as f; d =
dict.fromkeys(range(92000), 0)" "f(d)"
1 loops, best of 1: 87.4 msec per loop

Here's the bm.py module so you can verify my reasoning:

freqs = dict.fromkeys(range(8*10**4), 0)
keys = range(10**5)

def fin(d=freqs, keys=keys):
for key in keys:
if key in d:
d[key] += 42
else:
d[key] = 42
return d

def fget(d=freqs, keys=keys):
for key in keys:
d[key] = d.get(key, 0) + 42
return d

def fex(d=freqs, keys=keys):
for key in keys:
try:
d[key] += 42
except KeyError:
d[key] = 42
return d

if __name__ == "__main__":
assert fin(dict(freqs)) == fex(dict(freqs)) == fget(dict(freqs))

Peter
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top