itertools.groupby usage to get structured data

S

Slafs

Hi there!

I'm having trouble to wrap my brain around this kind of problem:

What I have :
1) list of dicts
2) list of keys that i would like to be my grouping arguments of
elements from 1)
3) list of keys that i would like do "aggregation" on the elements
of 1) with some function e.g. sum

For instance i got:
1) [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
{ 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
{'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}, ... ]
2) ['g1', 'g2']
3) ['s_v1', 's_v2']

To be precise 1) is a result of a values_list method from a QuerySet
in Django; 2) is the arguments for that method; 3) those are the
annotation keys. so 1) is a result of:
qs.values_list('g1', 'g2').annotate(s_v1=Sum('v1'), s_v2=Sum('v2'))

What i want to have is:
a "big" nested dictionary with 'g1' values as 1st level keys and a
dictionary of aggregates and "subgroups" in it.

In my example it would be something like this:
{
1 : {
's_v1' : 7.0,
's_v2' : 6.5,
'g2' :{
8 : {
's_v1' : 5.0,
's_v2' : 3.5 },
9 : {
's_v1' : 2.0,
's_v2' : 3.0 }
}
},
2 : {
's_v1' : 6.0,
's_v2' : 8.0,
'g2' : {
8 : {
's_v1' : 6.0,
's_v2' : 8.0}
}
},
....
}

# notice the summed values of s_v1 and s_v2 when g1 == 1

I was looking for a solution that would let me do that kind of
grouping with variable lists of 2) and 3) i.e. having also 'g3' as
grouping element so the 'g2' dicts could also have their own
"subgroup" and be even more nested then.
I was trying something with itertools.groupby and updating nested
dicts, but as i was writing the code it started to feel too verbose to
me :/

Do You have any hints maybe? because i'm kind of stucked :/

Regards

SÅ‚awek
 
S

Steven D'Aprano

Hi there!

I'm having trouble to wrap my brain around this kind of problem:

Perhaps you should consider backing up and staring from somewhere else
with different input data, or changing the requirements. Just a thought.

What I have :
1) list of dicts
2) list of keys that i would like to be my grouping arguments of
elements from 1)
3) list of keys that i would like do "aggregation" on the elements
of 1) with some function e.g. sum


You start with data:

dicts = [ {'g1': 1, 'g2': 8, 's_v1': 5.0, 's_v2': 3.5},
{'g1': 1, 'g2': 9, 's_v1': 2.0, 's_v2': 3.0},
{'g1': 2, 'g2': 8, 's_v1': 6.0, 's_v2': 8.0} ]


It sometimes helps me to think about data structures by drawing them out.
In this case, you have what is effectively a two-dimensional table:

g1 g2 s_v1 s_v2
=== === ===== ====
1 8 5.0 3.5
1 9 2.0 3.0
2 8 6.0 8.0


Nice and simple. But the result you want is a bit more complex -- it's a
dict of dicts of dicts:

{1: {'s_v1': 7.0, 's_v2': 6.5,
'g2': {8: {'s_v1': 5.0, 's_v2': 3.5},
9: {'s_v1': 2.0, 's_v2': 3.0}
}},
2: {'s_v1': 6.0, 's_v2': 8.0,
'g2': {8: {'s_v1' : 6.0, 's_v2': 8.0}
}}}


(I quote from the Zen of Python: "Flat is better than nested." Hmmm.)

which is equivalent to a *four* dimensional table, which is a bit hard to
write out :)

Here's a two-dimensional projection of a single slice with key = 1:

s_v1 s_v2 g2
===== ===== =====
7.0 6.5 | s_v1 s_v2
---------------
8 | 5.0 3.5
9 | 2.0 3.0


Does this help you to either (1) redesign your data structures, or (2)
work out how to go from there?

[...]
I was looking for a solution that would let me do that kind of grouping
with variable lists of 2) and 3) i.e. having also 'g3' as grouping
element so the 'g2' dicts could also have their own "subgroup" and be
even more nested then. I was trying something with itertools.groupby and
updating nested dicts, but as i was writing the code it started to feel
too verbose to me :/

I don't think groupby is the tool you want. It groups *consecutive* items
in sequences:

from itertools import groupby
for key, it in groupby([1,1,1,2,3,4,3,3,3,5,1]):
.... print(key, list(it))
....
1 [1, 1, 1]
2 [2]
3 [3]
4 [4]
3 [3, 3, 3]
5 [5]
1 [1]


Except for the name, I don't see any connection between this and what you
want to do.

The approach I would take is a top-down approach:

dicts = [ ... ] # list of dicts, as above.
result = {}
for d in dicts:
# process each dict in isolation
temp = process(d)
merge(result, temp)


merge() hopefully should be straight forward, and process only needs to
look at one dict at a time.
 
P

Paul Rubin

Slafs said:
What i want to have is:
a "big" nested dictionary with 'g1' values as 1st level keys and a
dictionary of aggregates and "subgroups" in it....

I was looking for a solution that would let me do that kind of
grouping with variable lists of 2) and 3) i.e. having also 'g3' as
grouping element so the 'g2' dicts could also have their own
"subgroup" and be even more nested then.
I was trying something with itertools.groupby and updating nested
dicts, but as i was writing the code it started to feel too verbose to
me :/

Do You have any hints maybe? because i'm kind of stucked :/

I'm not sure I understood the problem and it would help if you gave
sample data with the deeper nesting that you describe. But the
following messy code matches the sample that you did give:

from pprint import pprint
from itertools import groupby

x1 = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
{ 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
{ 'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}
]
x2 = ['g1', 'g2']
x3 = ['s_v1', 's_v2']

def agg(xdata, group_keys, agg_keys):
if not group_keys:
return {}
k0, ks = group_keys[0], group_keys[1:]
r = {}
def gk(d): return d[k0]
for k, g in groupby(sorted(xdata, key=gk), gk):
gs = list(g)
aggs = dict((ak,sum(d[ak] for d in gs)) for ak in agg_keys)
r[k] = aggs
if ks:
r[k][ks[0]] = agg(gs,group_keys[1:], agg_keys)
return r

pprint (agg(x1, x2, x3))
 
S

Slafs

Slafs said:
What i want to have is:
a "big" nested dictionary with 'g1' values as 1st level keys and a
dictionary of aggregates and "subgroups" in it....
I was looking for a solution that would let me do that kind of
grouping with variable lists of 2) and 3) i.e. having also 'g3' as
grouping element so the 'g2' dicts could also have their own
"subgroup" and be even more nested then.
I was trying something with itertools.groupby and updating nested
dicts, but as i was writing the code it started to feel too verbose to
me :/
Do You have any hints maybe? because i'm kind of stucked :/

I'm not sure I understood the problem and it would help if you gave
sample data with the deeper nesting that you describe.  But the
following messy code matches the sample that you did give:

    from pprint import pprint
    from itertools import groupby

    x1 = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
              { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
              { 'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}
              ]
    x2 = ['g1', 'g2']
    x3 = ['s_v1', 's_v2']

    def agg(xdata, group_keys, agg_keys):
        if not group_keys:
            return {}
        k0, ks = group_keys[0], group_keys[1:]
        r = {}
        def gk(d): return d[k0]
        for k, g in groupby(sorted(xdata, key=gk), gk):
            gs = list(g)
            aggs = dict((ak,sum(d[ak] for d in gs)) for ak in agg_keys)
            r[k] = aggs
            if ks:
                r[k][ks[0]] = agg(gs,group_keys[1:], agg_keys)
        return r

    pprint (agg(x1, x2, x3))

Thank you both Steven and Paul for your replies.

@Steven:
Perhaps you should consider backing up and staring from somewhere else
with different input data, or changing the requirements. Just a thought.

I think it's not the issue. The data as you noticed i well structured
(as a table for instance) and I don't think I can go better than that.
I don't think groupby is the tool you want. It groups *consecutive* items
in sequences:

I was using groupby just like in Paul's code.

@Paul:
OMG. I think this is it! (getting my jaw from the floor...)
The funny part is that I was kind of close to this solution ;). I was
considering the use of recursion for this.

Thank You so much!
 
P

Peter Otten

Slafs said:
Hi there!

I'm having trouble to wrap my brain around this kind of problem:

What I have :
1) list of dicts
2) list of keys that i would like to be my grouping arguments of
elements from 1)
3) list of keys that i would like do "aggregation" on the elements
of 1) with some function e.g. sum

For instance i got:
1) [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
{ 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
{'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}, ... ]
2) ['g1', 'g2']
3) ['s_v1', 's_v2']

To be precise 1) is a result of a values_list method from a QuerySet
in Django; 2) is the arguments for that method; 3) those are the
annotation keys. so 1) is a result of:
qs.values_list('g1', 'g2').annotate(s_v1=Sum('v1'), s_v2=Sum('v2'))

What i want to have is:
a "big" nested dictionary with 'g1' values as 1st level keys and a
dictionary of aggregates and "subgroups" in it.

In my example it would be something like this:
{
1 : {
's_v1' : 7.0,
's_v2' : 6.5,
'g2' :{
8 : {
's_v1' : 5.0,
's_v2' : 3.5 },
9 : {
's_v1' : 2.0,
's_v2' : 3.0 }
}
},
2 : {
's_v1' : 6.0,
's_v2' : 8.0,
'g2' : {
8 : {
's_v1' : 6.0,
's_v2' : 8.0}
}
},
...
}

# notice the summed values of s_v1 and s_v2 when g1 == 1

I was looking for a solution that would let me do that kind of
grouping with variable lists of 2) and 3) i.e. having also 'g3' as
grouping element so the 'g2' dicts could also have their own
"subgroup" and be even more nested then.
I was trying something with itertools.groupby and updating nested
dicts, but as i was writing the code it started to feel too verbose to
me :/

Do You have any hints maybe? because i'm kind of stucked :/

Regards

SÅ‚awek

Not super-efficient, but simple:

$ cat python sumover.py
cat: python: No such file or directory
data = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
{ 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
{'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}]
sum_over = ["s_v1", "s_v2"]
group_by = ["g1", "g2"]

wanted = {
1 : {
's_v1' : 7.0,
's_v2' : 6.5,
'g2' :{
8 : {
's_v1' : 5.0,
's_v2' : 3.5 },
9 : {
's_v1' : 2.0,
's_v2' : 3.0 }
}
},
2 : {
's_v1' : 6.0,
's_v2' : 8.0,
'g2' : {
8 : {
's_v1' : 6.0,
's_v2' : 8.0}
}
},
}

def calc(data, group_by, sum_over):
tree = {}
group_by = group_by + [None]
for item in data:
d = tree
for g in group_by:
for so in sum_over:
d[so] = d.get(so, 0.0) + item[so]
if g:
d = d.setdefault(g, {}).setdefault(item[g], {})
return tree

got = calc(data, group_by, sum_over)[group_by[0]]
assert got == wanted
$ python sumover.py
$

Untested.
 
N

nn

Slafs said:
Hi there!
I'm having trouble to wrap my brain around this kind of problem:
What I have :
  1) list of dicts
  2) list of keys that i would like to be my grouping arguments of
elements from 1)
  3) list of keys that i would like do "aggregation" on the elements
of 1) with some function e.g. sum
For instance i got:
1) [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
      { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
      {'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}, .... ]
2) ['g1', 'g2']
3) ['s_v1', 's_v2']
To be precise 1) is a result of a values_list method from a QuerySet
in Django; 2) is the arguments for that method; 3) those are the
annotation keys. so 1) is a result of:
   qs.values_list('g1', 'g2').annotate(s_v1=Sum('v1'), s_v2=Sum('v2'))
What i want to have is:
a "big" nested dictionary with 'g1' values as 1st level keys and a
dictionary of aggregates and "subgroups" in it.
In my example it would be something like this:
{
  1 : {
          's_v1' : 7.0,
          's_v2' : 6.5,
          'g2' :{
                   8 : {
                          's_v1' : 5.0,
                          's_v2' : 3.5 },
                   9 :  {
                          's_v1' : 2.0,
                          's_v2' : 3.0 }
                }
       },
  2 : {
           's_v1' : 6.0,
           's_v2' : 8.0,
           'g2' : {
                    8 : {
                          's_v1' : 6.0,
                          's_v2' : 8.0}
           }
       },
...
}
# notice the summed values of s_v1 and s_v2 when g1 == 1
I was looking for a solution that would let me do that kind of
grouping with variable lists of 2) and 3) i.e. having also 'g3' as
grouping element so the 'g2' dicts could also have their own
"subgroup" and be even more nested then.
I was trying something with itertools.groupby and updating nested
dicts, but as i was writing the code it started to feel too verbose to
me :/
Do You have any hints maybe? because i'm kind of stucked :/

SÅ‚awek

Not super-efficient, but simple:

$ cat python sumover.py
cat: python: No such file or directory
data = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
         { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
         {'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}]
sum_over = ["s_v1", "s_v2"]
group_by = ["g1", "g2"]

wanted = {
  1 : {
          's_v1' : 7.0,
          's_v2' : 6.5,
          'g2' :{
                   8 : {
                          's_v1' : 5.0,
                          's_v2' : 3.5 },
                   9 :  {
                          's_v1' : 2.0,
                          's_v2' : 3.0 }
                }
       },
  2 : {
           's_v1' : 6.0,
           's_v2' : 8.0,
           'g2' : {
                    8 : {
                          's_v1' : 6.0,
                          's_v2' : 8.0}
           }
       },

}

def calc(data, group_by, sum_over):
    tree = {}
    group_by = group_by + [None]
    for item in data:
        d = tree
        for g in group_by:
            for so in sum_over:
                d[so] = d.get(so, 0.0) + item[so]
            if g:
                d = d.setdefault(g, {}).setdefault(item[g], {})
    return tree

got = calc(data, group_by, sum_over)[group_by[0]]
assert got == wanted
$ python sumover.py
$

Untested.

Very clever. I didn't understand how it worked until I rewrote it like
this:

def calc(data, group_by, sum_over):
tree = {}
group_by = [None] + group_by
for item in data:
d = tree
for g in group_by:
if g:
d = d.setdefault(g, {}).setdefault(item[g], {})
for so in sum_over:
d[so] = d.get(so, 0.0) + item[so]
return tree


Processing "None" in the last round of the loop was throwing me off.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,901
Latest member
Noble71S45

Latest Threads

Top