Summing a 2D list

M

Maric Michaud

Le Friday 13 June 2008 14:12:40 Karsten Heymann, vous avez écrit :
Hi Mark,

Mark said:
I have a scenario where I have a list like this:

User            Score
1                 0
1                 1
1                 5
2                 3
2                 1
3                 2
4                 3
4                 3
4                 2

And I need to add up the score for each user to get something like
this:

User            Score
1                 6
2                 4
3                 2
4                 8

Is this possible? If so, how can I do it? I've tried looping through
the arrays and not had much luck so far.

Although your problem has already been solved, I'd like to present a
different approach which can be quite a bit faster. The most common
approach seems to be using a dictionary:

summed_up={}
for user,vote in pairs:
  if summed_up.has_key(user):
    summed_up[user]+=vote
  else:
    summed_up[user]=vote

But if the list of users is compact and the maximum value is known
before, the using a list and coding the user into the list position is
much more elegant:

So, writing C in python, which has dictionnary as builtin type, should be
considered "more elegant" ?
summed_up=list( (0,) * max_user )
for user,vote in pairs:
  summed_up[user] += vote

I've run a quick and dirty test on these approaches and found that the
latter takes only half the time than the first. More precisely, with
about 2 million pairs, i got:
* dict approach: 2s
        (4s with "try: ... except KeyError:" instead of the "if")
* list approach: 0.9s

You are comparing apples with lemons, there is no such a difference between
list index access and dictionnary key access in Python.
BTW this was inspired by the book "Programming Pearls" I read some
years ago where a similar approach saved some magnitudes of time
(using a bit field instead of a list to store reserved/free phone
numbers IIRC).

If you know in advance the number and names of users, what prevent you to
initialize completelly the target dictionnary ?

The following code compare the same algorithm, once with list and the second
time with dict :

#!/usr/bin/env python

def do(f, u, v) :
from time import time
n=time()
f(u, v)
return time() -n

def c_dict(users, votes) :
d = dict(((e, 0) for e in users))
for u, v in votes : d += v
return d.values()

def c_list(users, votes) :
d = [ 0 for i in users ]
for u, v in votes : d += v
return d

u = range(3000)

import random

v = list((u[r%3000], random.randint(0,10000)) for r in range(5*10**6))

print "with list", do(c_list, u, v)
print "with dict", do(c_dict, u, v)

The result is pretty close now :

maric@redflag1 17:04:36:~$ ./test.py
with list 1.40726399422
with dict 1.63094091415

So why use list where the obvious and natural data structure is a
dictionnary ?
 
K

Karsten Heymann

Paddy said:
How does your solution fare against the defaultdict solution of:

d = collections.defaultdict(int)
for u,s in zip(users,score): d += s


list: 0.931s
dict + "in": 1.495s
defaultdict : 1.991s
dict + "if": ~2s
dict + "try": ~4s

I've posted the (very rough) code to dpaste:

http://dpaste.com/hold/56468/

Yours
Karsten
 
K

Karsten Heymann

Hi Maric,

Maric Michaud said:
So, writing C in python, which has dictionnary as builtin type,
should be considered "more elegant" ?

IMO that's a bit harsh.
You are comparing apples with lemons, there is no such a difference
between list index access and dictionnary key access in Python. [...]
If you know in advance the number and names of users, what prevent
you to initialize completelly the target dictionnary ?

The following code compare the same algorithm, once with list and
the second time with dict : [...]
The result is pretty close now :

maric@redflag1 17:04:36:~$ ./test.py
with list 1.40726399422
with dict 1.63094091415

So why use list where the obvious and natural data structure is a
dictionnary ?

I'd never argue that using a dictionary is the obvious and natural
data structure for this case. But is it the best? Honestly, as your
very nice example shows, we have two solutions that are equally fast,
equally complex to code and equally robust, but one needs
approximately the double amount of memory compared to the other. So,
as much as i like dictionaries, what's the gain you get from using it
in this corner case?

Yours,
Karsten
 
M

Maric Michaud

Hello,

Le Friday 13 June 2008 17:55:44 Karsten Heymann, vous avez écrit :
IMO that's a bit harsh.

harsh ? Sorry, I'm not sure to understand.
You are comparing apples with lemons, there is no such a difference
between list index access and dictionnary key access in Python.
[...]

If you know in advance the number and names of users, what prevent
you to initialize completelly the target dictionnary ?

The following code compare the same algorithm, once with list and
the second time with dict :
[...]

The result is pretty close now :

maric@redflag1 17:04:36:~$ ./test.py
with list 1.40726399422
with dict 1.63094091415

So why use list where the obvious and natural data structure is a
dictionnary ?

I'd never argue that using a dictionary is the obvious and natural
data structure for this case. But is it the best? Honestly, as your
very nice example shows, we have two solutions that are equally fast,
equally complex to code and equally robust, but one needs

Yes, but my example take ordered integer for keys (users' names) which they
should not be in a real case, so retrieving the result is by way easier (and
faster) with a dictionnary.
approximately the double amount of memory compared to the other.

I don't see how you came to this conclusion. Are you sure the extra list take
twice more memory than the extra dictionary ?
So, as much as i like dictionaries, what's the gain you get from using it
in this corner case?

It's the very purpose of it's usage, store and retrieve data by key.

Cheers,
 
M

Maric Michaud

Le Friday 13 June 2008 18:55:24 Maric Michaud, vous avez écrit :
I don't see how you came to this conclusion. Are you sure the extra list
take twice more memory than the extra dictionary ?

twice less, I meant, of course...
 
K

Karsten Heymann

Maric Michaud said:
Le Friday 13 June 2008 17:55:44 Karsten Heymann, vous avez écrit :
harsh ? Sorry, I'm not sure to understand.

Never mind, I got carried away.
Yes, but my example take ordered integer for keys (users' names)
which they should not be in a real case, so retrieving the result is
by way easier (and faster) with a dictionnary.

Of course. As I wrote in my first post, my proposal dependeds upon the
users being a numerical and compact list with a known maximum, as in
the OP's example. Otherwise my approach makes no sense at all :)
I don't see how you came to this conclusion. Are you sure the extra
list take twice more memory than the extra dictionary ?

I'm referring to the resulting data structure. A list with n items
should take approx. half the space of a dictionary with n keys and n
entries. Or, the other way round, if i have a dictionary with keys
1..n, why shouldn't I use a list instead.

Yours,
Karsten
 
S

sturlamolden

Is this possible?

def foobar(user,score):
sums = {}
for u,s in zip(user,score):
try:
sums += s
except KeyError:
sums = s
return [(u, sums) for u in sums].sort()


usersum = foobar(user,score)
for u,s in usersum:
print "%d %d" % (u,s)
 
M

MRAB

Is this possible?

def foobar(user,score):
sums = {}
for u,s in zip(user,score):
try:
sums += s
except KeyError:
sums = s
return [(u, sums) for u in sums].sort()


sort() sorts the list in-place and returns None. Try this instead:

return sorted([(u, sums) for u in sums])

or, better yet:

return sorted(sums.items())
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,051
Latest member
CarleyMcCr

Latest Threads

Top