itertools.groupby usage to get structured data

Discussion in 'Python' started by Slafs, Feb 4, 2011.

  1. Slafs

    Slafs Guest

    Hi there!

    I'm having trouble to wrap my brain around this kind of problem:

    What I have :
    1) list of dicts
    2) list of keys that i would like to be my grouping arguments of
    elements from 1)
    3) list of keys that i would like do "aggregation" on the elements
    of 1) with some function e.g. sum

    For instance i got:
    1) [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
    { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
    {'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}, ... ]
    2) ['g1', 'g2']
    3) ['s_v1', 's_v2']

    To be precise 1) is a result of a values_list method from a QuerySet
    in Django; 2) is the arguments for that method; 3) those are the
    annotation keys. so 1) is a result of:
    qs.values_list('g1', 'g2').annotate(s_v1=Sum('v1'), s_v2=Sum('v2'))

    What i want to have is:
    a "big" nested dictionary with 'g1' values as 1st level keys and a
    dictionary of aggregates and "subgroups" in it.

    In my example it would be something like this:
    {
    1 : {
    's_v1' : 7.0,
    's_v2' : 6.5,
    'g2' :{
    8 : {
    's_v1' : 5.0,
    's_v2' : 3.5 },
    9 : {
    's_v1' : 2.0,
    's_v2' : 3.0 }
    }
    },
    2 : {
    's_v1' : 6.0,
    's_v2' : 8.0,
    'g2' : {
    8 : {
    's_v1' : 6.0,
    's_v2' : 8.0}
    }
    },
    ....
    }

    # notice the summed values of s_v1 and s_v2 when g1 == 1

    I was looking for a solution that would let me do that kind of
    grouping with variable lists of 2) and 3) i.e. having also 'g3' as
    grouping element so the 'g2' dicts could also have their own
    "subgroup" and be even more nested then.
    I was trying something with itertools.groupby and updating nested
    dicts, but as i was writing the code it started to feel too verbose to
    me :/

    Do You have any hints maybe? because i'm kind of stucked :/

    Regards

    SÅ‚awek
    Slafs, Feb 4, 2011
    #1
    1. Advertising

  2. On Fri, 04 Feb 2011 15:14:24 -0800, Slafs wrote:

    > Hi there!
    >
    > I'm having trouble to wrap my brain around this kind of problem:


    Perhaps you should consider backing up and staring from somewhere else
    with different input data, or changing the requirements. Just a thought.


    > What I have :
    > 1) list of dicts
    > 2) list of keys that i would like to be my grouping arguments of
    > elements from 1)
    > 3) list of keys that i would like do "aggregation" on the elements
    > of 1) with some function e.g. sum



    You start with data:

    dicts = [ {'g1': 1, 'g2': 8, 's_v1': 5.0, 's_v2': 3.5},
    {'g1': 1, 'g2': 9, 's_v1': 2.0, 's_v2': 3.0},
    {'g1': 2, 'g2': 8, 's_v1': 6.0, 's_v2': 8.0} ]


    It sometimes helps me to think about data structures by drawing them out.
    In this case, you have what is effectively a two-dimensional table:

    g1 g2 s_v1 s_v2
    === === ===== ====
    1 8 5.0 3.5
    1 9 2.0 3.0
    2 8 6.0 8.0


    Nice and simple. But the result you want is a bit more complex -- it's a
    dict of dicts of dicts:

    {1: {'s_v1': 7.0, 's_v2': 6.5,
    'g2': {8: {'s_v1': 5.0, 's_v2': 3.5},
    9: {'s_v1': 2.0, 's_v2': 3.0}
    }},
    2: {'s_v1': 6.0, 's_v2': 8.0,
    'g2': {8: {'s_v1' : 6.0, 's_v2': 8.0}
    }}}


    (I quote from the Zen of Python: "Flat is better than nested." Hmmm.)

    which is equivalent to a *four* dimensional table, which is a bit hard to
    write out :)

    Here's a two-dimensional projection of a single slice with key = 1:

    s_v1 s_v2 g2
    ===== ===== =====
    7.0 6.5 | s_v1 s_v2
    ---------------
    8 | 5.0 3.5
    9 | 2.0 3.0


    Does this help you to either (1) redesign your data structures, or (2)
    work out how to go from there?

    [...]
    > I was looking for a solution that would let me do that kind of grouping
    > with variable lists of 2) and 3) i.e. having also 'g3' as grouping
    > element so the 'g2' dicts could also have their own "subgroup" and be
    > even more nested then. I was trying something with itertools.groupby and
    > updating nested dicts, but as i was writing the code it started to feel
    > too verbose to me :/


    I don't think groupby is the tool you want. It groups *consecutive* items
    in sequences:


    >>> from itertools import groupby
    >>> for key, it in groupby([1,1,1,2,3,4,3,3,3,5,1]):

    .... print(key, list(it))
    ....
    1 [1, 1, 1]
    2 [2]
    3 [3]
    4 [4]
    3 [3, 3, 3]
    5 [5]
    1 [1]


    Except for the name, I don't see any connection between this and what you
    want to do.

    The approach I would take is a top-down approach:

    dicts = [ ... ] # list of dicts, as above.
    result = {}
    for d in dicts:
    # process each dict in isolation
    temp = process(d)
    merge(result, temp)


    merge() hopefully should be straight forward, and process only needs to
    look at one dict at a time.



    --
    Steven
    Steven D'Aprano, Feb 5, 2011
    #2
    1. Advertising

  3. Slafs

    Paul Rubin Guest

    Slafs <> writes:
    > What i want to have is:
    > a "big" nested dictionary with 'g1' values as 1st level keys and a
    > dictionary of aggregates and "subgroups" in it....
    >
    > I was looking for a solution that would let me do that kind of
    > grouping with variable lists of 2) and 3) i.e. having also 'g3' as
    > grouping element so the 'g2' dicts could also have their own
    > "subgroup" and be even more nested then.
    > I was trying something with itertools.groupby and updating nested
    > dicts, but as i was writing the code it started to feel too verbose to
    > me :/
    >
    > Do You have any hints maybe? because i'm kind of stucked :/


    I'm not sure I understood the problem and it would help if you gave
    sample data with the deeper nesting that you describe. But the
    following messy code matches the sample that you did give:

    from pprint import pprint
    from itertools import groupby

    x1 = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
    { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
    { 'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}
    ]
    x2 = ['g1', 'g2']
    x3 = ['s_v1', 's_v2']

    def agg(xdata, group_keys, agg_keys):
    if not group_keys:
    return {}
    k0, ks = group_keys[0], group_keys[1:]
    r = {}
    def gk(d): return d[k0]
    for k, g in groupby(sorted(xdata, key=gk), gk):
    gs = list(g)
    aggs = dict((ak,sum(d[ak] for d in gs)) for ak in agg_keys)
    r[k] = aggs
    if ks:
    r[k][ks[0]] = agg(gs,group_keys[1:], agg_keys)
    return r

    pprint (agg(x1, x2, x3))
    Paul Rubin, Feb 5, 2011
    #3
  4. Slafs

    Slafs Guest

    On 5 Lut, 05:58, Paul Rubin <> wrote:
    > Slafs <> writes:
    > > What i want to have is:
    > > a "big" nested dictionary with 'g1' values as 1st level keys and a
    > > dictionary of aggregates and "subgroups" in it....

    >
    > > I was looking for a solution that would let me do that kind of
    > > grouping with variable lists of 2) and 3) i.e. having also 'g3' as
    > > grouping element so the 'g2' dicts could also have their own
    > > "subgroup" and be even more nested then.
    > > I was trying something with itertools.groupby and updating nested
    > > dicts, but as i was writing the code it started to feel too verbose to
    > > me :/

    >
    > > Do You have any hints maybe? because i'm kind of stucked :/

    >
    > I'm not sure I understood the problem and it would help if you gave
    > sample data with the deeper nesting that you describe.  But the
    > following messy code matches the sample that you did give:
    >
    >     from pprint import pprint
    >     from itertools import groupby
    >
    >     x1 = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
    >               { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
    >               { 'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}
    >               ]
    >     x2 = ['g1', 'g2']
    >     x3 = ['s_v1', 's_v2']
    >
    >     def agg(xdata, group_keys, agg_keys):
    >         if not group_keys:
    >             return {}
    >         k0, ks = group_keys[0], group_keys[1:]
    >         r = {}
    >         def gk(d): return d[k0]
    >         for k, g in groupby(sorted(xdata, key=gk), gk):
    >             gs = list(g)
    >             aggs = dict((ak,sum(d[ak] for d in gs)) for ak in agg_keys)
    >             r[k] = aggs
    >             if ks:
    >                 r[k][ks[0]] = agg(gs,group_keys[1:], agg_keys)
    >         return r
    >
    >     pprint (agg(x1, x2, x3))


    Thank you both Steven and Paul for your replies.

    @Steven:
    > Perhaps you should consider backing up and staring from somewhere else
    > with different input data, or changing the requirements. Just a thought.


    I think it's not the issue. The data as you noticed i well structured
    (as a table for instance) and I don't think I can go better than that.

    > I don't think groupby is the tool you want. It groups *consecutive* items
    > in sequences:


    I was using groupby just like in Paul's code.

    @Paul:
    OMG. I think this is it! (getting my jaw from the floor...)
    The funny part is that I was kind of close to this solution ;). I was
    considering the use of recursion for this.

    Thank You so much!
    Slafs, Feb 5, 2011
    #4
  5. Slafs

    Peter Otten Guest

    Slafs wrote:

    > Hi there!
    >
    > I'm having trouble to wrap my brain around this kind of problem:
    >
    > What I have :
    > 1) list of dicts
    > 2) list of keys that i would like to be my grouping arguments of
    > elements from 1)
    > 3) list of keys that i would like do "aggregation" on the elements
    > of 1) with some function e.g. sum
    >
    > For instance i got:
    > 1) [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
    > { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
    > {'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}, ... ]
    > 2) ['g1', 'g2']
    > 3) ['s_v1', 's_v2']
    >
    > To be precise 1) is a result of a values_list method from a QuerySet
    > in Django; 2) is the arguments for that method; 3) those are the
    > annotation keys. so 1) is a result of:
    > qs.values_list('g1', 'g2').annotate(s_v1=Sum('v1'), s_v2=Sum('v2'))
    >
    > What i want to have is:
    > a "big" nested dictionary with 'g1' values as 1st level keys and a
    > dictionary of aggregates and "subgroups" in it.
    >
    > In my example it would be something like this:
    > {
    > 1 : {
    > 's_v1' : 7.0,
    > 's_v2' : 6.5,
    > 'g2' :{
    > 8 : {
    > 's_v1' : 5.0,
    > 's_v2' : 3.5 },
    > 9 : {
    > 's_v1' : 2.0,
    > 's_v2' : 3.0 }
    > }
    > },
    > 2 : {
    > 's_v1' : 6.0,
    > 's_v2' : 8.0,
    > 'g2' : {
    > 8 : {
    > 's_v1' : 6.0,
    > 's_v2' : 8.0}
    > }
    > },
    > ...
    > }
    >
    > # notice the summed values of s_v1 and s_v2 when g1 == 1
    >
    > I was looking for a solution that would let me do that kind of
    > grouping with variable lists of 2) and 3) i.e. having also 'g3' as
    > grouping element so the 'g2' dicts could also have their own
    > "subgroup" and be even more nested then.
    > I was trying something with itertools.groupby and updating nested
    > dicts, but as i was writing the code it started to feel too verbose to
    > me :/
    >
    > Do You have any hints maybe? because i'm kind of stucked :/
    >
    > Regards
    >
    > SÅ‚awek


    Not super-efficient, but simple:

    $ cat python sumover.py
    cat: python: No such file or directory
    data = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
    { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
    {'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}]
    sum_over = ["s_v1", "s_v2"]
    group_by = ["g1", "g2"]

    wanted = {
    1 : {
    's_v1' : 7.0,
    's_v2' : 6.5,
    'g2' :{
    8 : {
    's_v1' : 5.0,
    's_v2' : 3.5 },
    9 : {
    's_v1' : 2.0,
    's_v2' : 3.0 }
    }
    },
    2 : {
    's_v1' : 6.0,
    's_v2' : 8.0,
    'g2' : {
    8 : {
    's_v1' : 6.0,
    's_v2' : 8.0}
    }
    },
    }

    def calc(data, group_by, sum_over):
    tree = {}
    group_by = group_by + [None]
    for item in data:
    d = tree
    for g in group_by:
    for so in sum_over:
    d[so] = d.get(so, 0.0) + item[so]
    if g:
    d = d.setdefault(g, {}).setdefault(item[g], {})
    return tree

    got = calc(data, group_by, sum_over)[group_by[0]]
    assert got == wanted
    $ python sumover.py
    $

    Untested.
    Peter Otten, Feb 5, 2011
    #5
  6. Slafs

    nn Guest

    On Feb 5, 7:12 am, Peter Otten <> wrote:
    > Slafs wrote:
    > > Hi there!

    >
    > > I'm having trouble to wrap my brain around this kind of problem:

    >
    > > What I have :
    > >   1) list of dicts
    > >   2) list of keys that i would like to be my grouping arguments of
    > > elements from 1)
    > >   3) list of keys that i would like do "aggregation" on the elements
    > > of 1) with some function e.g. sum

    >
    > > For instance i got:
    > > 1) [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
    > >       { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
    > >       {'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}, .... ]
    > > 2) ['g1', 'g2']
    > > 3) ['s_v1', 's_v2']

    >
    > > To be precise 1) is a result of a values_list method from a QuerySet
    > > in Django; 2) is the arguments for that method; 3) those are the
    > > annotation keys. so 1) is a result of:
    > >    qs.values_list('g1', 'g2').annotate(s_v1=Sum('v1'), s_v2=Sum('v2'))

    >
    > > What i want to have is:
    > > a "big" nested dictionary with 'g1' values as 1st level keys and a
    > > dictionary of aggregates and "subgroups" in it.

    >
    > > In my example it would be something like this:
    > > {
    > >   1 : {
    > >           's_v1' : 7.0,
    > >           's_v2' : 6.5,
    > >           'g2' :{
    > >                    8 : {
    > >                           's_v1' : 5.0,
    > >                           's_v2' : 3.5 },
    > >                    9 :  {
    > >                           's_v1' : 2.0,
    > >                           's_v2' : 3.0 }
    > >                 }
    > >        },
    > >   2 : {
    > >            's_v1' : 6.0,
    > >            's_v2' : 8.0,
    > >            'g2' : {
    > >                     8 : {
    > >                           's_v1' : 6.0,
    > >                           's_v2' : 8.0}
    > >            }
    > >        },
    > > ...
    > > }

    >
    > > # notice the summed values of s_v1 and s_v2 when g1 == 1

    >
    > > I was looking for a solution that would let me do that kind of
    > > grouping with variable lists of 2) and 3) i.e. having also 'g3' as
    > > grouping element so the 'g2' dicts could also have their own
    > > "subgroup" and be even more nested then.
    > > I was trying something with itertools.groupby and updating nested
    > > dicts, but as i was writing the code it started to feel too verbose to
    > > me :/

    >
    > > Do You have any hints maybe? because i'm kind of stucked :/

    >
    > > Regards

    >
    > > SÅ‚awek

    >
    > Not super-efficient, but simple:
    >
    > $ cat python sumover.py
    > cat: python: No such file or directory
    > data = [ { 'g1' : 1, 'g2' : 8, 's_v1' : 5.0, 's_v2' : 3.5 },
    >          { 'g1' : 1, 'g2' : 9, 's_v1' : 2.0, 's_v2' : 3.0 },
    >          {'g1' : 2, 'g2' : 8, 's_v1' : 6.0, 's_v2' : 8.0}]
    > sum_over = ["s_v1", "s_v2"]
    > group_by = ["g1", "g2"]
    >
    > wanted = {
    >   1 : {
    >           's_v1' : 7.0,
    >           's_v2' : 6.5,
    >           'g2' :{
    >                    8 : {
    >                           's_v1' : 5.0,
    >                           's_v2' : 3.5 },
    >                    9 :  {
    >                           's_v1' : 2.0,
    >                           's_v2' : 3.0 }
    >                 }
    >        },
    >   2 : {
    >            's_v1' : 6.0,
    >            's_v2' : 8.0,
    >            'g2' : {
    >                     8 : {
    >                           's_v1' : 6.0,
    >                           's_v2' : 8.0}
    >            }
    >        },
    >
    > }
    >
    > def calc(data, group_by, sum_over):
    >     tree = {}
    >     group_by = group_by + [None]
    >     for item in data:
    >         d = tree
    >         for g in group_by:
    >             for so in sum_over:
    >                 d[so] = d.get(so, 0.0) + item[so]
    >             if g:
    >                 d = d.setdefault(g, {}).setdefault(item[g], {})
    >     return tree
    >
    > got = calc(data, group_by, sum_over)[group_by[0]]
    > assert got == wanted
    > $ python sumover.py
    > $
    >
    > Untested.


    Very clever. I didn't understand how it worked until I rewrote it like
    this:

    def calc(data, group_by, sum_over):
    tree = {}
    group_by = [None] + group_by
    for item in data:
    d = tree
    for g in group_by:
    if g:
    d = d.setdefault(g, {}).setdefault(item[g], {})
    for so in sum_over:
    d[so] = d.get(so, 0.0) + item[so]
    return tree


    Processing "None" in the last round of the loop was throwing me off.
    nn, Feb 7, 2011
    #6
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. G?nter Jantzen

    whatsnew 2.4 about itertools.groupby:

    G?nter Jantzen, Jun 9, 2004, in forum: Python
    Replies:
    0
    Views:
    278
    G?nter Jantzen
    Jun 9, 2004
  2. Replies:
    3
    Views:
    325
    Fredrik Lundh
    May 25, 2006
  3. 7stud

    itertools.groupby

    7stud, May 27, 2007, in forum: Python
    Replies:
    13
    Views:
    591
    =?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=
    Jun 5, 2007
  4. Steve Howell

    Re: itertools.groupby

    Steve Howell, May 27, 2007, in forum: Python
    Replies:
    13
    Views:
    536
  5. Tobiah

    itertools.groupby

    Tobiah, Jan 15, 2008, in forum: Python
    Replies:
    2
    Views:
    299
    Tobiah
    Jan 16, 2008
Loading...

Share This Page