aggregation for a nested dict

chris

Hi,

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

result gives me:
{'a': ['0', '84']},
{'a': ['0', '84']},
{'b': ['1000', '83']},
{'b': ['0', '84']},

I would like to aggregate them per line (or maybe per file), so that after
the complete parsing procedure I can count the number of ids having > 0
entries in '83'.

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

My current solution with MySQL is really slow.

Many thanks in advance.
Christian
 
MRAB

Hi,

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

result gives me:
{'a': ['0', '84']},
{'a': ['0', '84']},
{'b': ['1000', '83']},
{'b': ['0', '84']},

I would like to aggregate them per line (or maybe per file), so that after
the complete parsing procedure I can count the number of ids having > 0
entries in '83'.

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

My current solution with MySQL is really slow.

result = [
    {'a': ['0', '84']},
    {'a': ['0', '84']},
    {'b': ['1000', '83']},
    {'b': ['0', '84']},
]

from collections import defaultdict

aggregates = defaultdict(lambda: defaultdict(int))
for entry in result:
    for key, values in entry.items():
        for v in values:
            aggregates[key][v] += 1

print(aggregates)
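
To get the second half of what you asked for (counting the ids that have a
> 0 count for '83'), a follow-up query over aggregates could look like this
(untested; .get avoids inserting new keys into the inner defaultdicts):

print(sum(1 for counts in aggregates.values()
          if counts.get('83', 0) > 0))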
 
Chris Rebert

Hi,

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

result gives me:
{'a': ['0', '84']},
{'a': ['0', '84']},
{'b': ['1000', '83']},
{'b': ['0', '84']},

I would like to aggregate them per line (or maybe per file), so that after
the complete parsing procedure I can count the number of ids having > 0
entries in '83'.

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

Er, what happened to the '0' for 'b'?
My current solution with MySQL is really slow.

Untested:

# requires Python 2.7+ due to Counter
from collections import defaultdict, Counter

FIELDS = ["field1", "field2"]

id2counter = defaultdict(Counter)
for line in FILE:
    identifier = extract_field("id", line)
    counter = id2counter[identifier]
    for field_name in FIELDS:
        field_val = int(extract_field(field_name, line))
        counter[field_val] += 1

print(id2counter)
print(sum(1 for counter in id2counter.values() if counter[83]))
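
A side note on that last line: counter[83] is safe as a test because a
Counter returns 0 for a missing key instead of raising KeyError, and the
lookup doesn't insert the key. A quick check:

from collections import Counter

c = Counter()
print(c[83])    # 0 -- a missing key counts as zero
print(83 in c)  # False -- the lookup above didn't insert it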

Cheers,
Chris
 
Peter Otten

chris said:
Hi,

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

result gives me:
{'a': ['0', '84']},
{'a': ['0', '84']},
{'b': ['1000', '83']},
{'b': ['0', '84']},

I would like to aggregate them per line (or maybe per file), so that after
the complete parsing procedure I can count the number of ids having > 0
entries in '83'.

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

My current solution with MySQL is really slow.

>>> def rows(lines):
...     for line in lines:
...         yield extract_field("id", line), [extract_field(name, line)
...                                           for name in ("field1", "field2")]
...
>>> for row in rows(lines):
...     print(row)
...
('a', ['0', '84'])
('b', ['1000', '83'])
('a', ['0', '84'])
('b', ['0', '84'])
>>> from collections import defaultdict
>>> class Dict(defaultdict):
...     def __repr__(self): return repr(dict(self))
...
>>> outer = Dict(lambda: Dict(int))
>>> for key, values in rows(lines):
...     inner = outer[key]
...     for v in values:
...         inner[v] += 1
...
>>> outer
{'a': {'0': 2, '84': 2}, 'b': {'83': 1, '1000': 1, '84': 1, '0': 1}}
 
Tim Chase

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

I would like to aggregate them per line (or maybe per file), and get after
the complete parsing procedure:

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

I'm not sure what happened to b['0'] based on your initial data,
but assuming that was an oversight...

from collections import defaultdict

aggregates = defaultdict(lambda: defaultdict(int))
for entry in result:
    for key, values in entry.items():
        for v in values:
            aggregates[key][v] += 1

Or, if you don't need the intermediate result, you can tweak
MRAB's solution and just iterate over the file(s):

aggregates = defaultdict(lambda: defaultdict(int))
for line in FILE:
    key = extract_field("id", line)
    aggregates[key][extract_field("field1", line)] += 1
    aggregates[key][extract_field("field2", line)] += 1

or, if you're using an older version (<2.5) that doesn't provide
defaultdict, you could do something like

aggregates = {}
for line in FILE:
    key = extract_field("id", line)
    d = aggregates.setdefault(key, {})
    for fieldname in ('field1', 'field2'):
        value = extract_field(fieldname, line)
        d[value] = d.get(value, 0) + 1
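
And since you mention many thousands of files, you can chain them all
into a single loop with the standard library's fileinput module instead
of building an intermediate result list per file. A sketch, assuming
paths holds your list of file names:

import fileinput

aggregates = {}
for line in fileinput.input(paths):  # lines of every file, in sequence
    key = extract_field("id", line)
    d = aggregates.setdefault(key, {})
    for fieldname in ('field1', 'field2'):
        value = extract_field(fieldname, line)
        d[value] = d.get(value, 0) + 1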


-tkc
 
chris

I really appreciate all the responses.
It's incredible how fast it is!

Cheers
Christian
 
