aggregation for a nested dict

chris

Hi,

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

result gives me:
{'a': ['0', '84']},
{'a': ['0', '84']},
{'b': ['1000', '83']},
{'b': ['0', '84']},

I would like to aggregate them per line (or maybe per file), so that after
the complete parsing procedure I can count the number of ids having > 0
entries in '83'.

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

My current solution with MySQL is really slow.

Many thanks in advance.
Christian
 
MRAB

Hi,

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

result gives me:
{'a': ['0', '84']},
{'a': ['0', '84']},
{'b': ['1000', '83']},
{'b': ['0', '84']},

I would like to aggregate them per line (or maybe per file), so that after
the complete parsing procedure I can count the number of ids having > 0
entries in '83'.

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

My current solution with MySQL is really slow.

result = [
    {'a': ['0', '84']},
    {'a': ['0', '84']},
    {'b': ['1000', '83']},
    {'b': ['0', '84']},
]

from collections import defaultdict

aggregates = defaultdict(lambda: defaultdict(int))
for entry in result:
    for key, values in entry.items():
        for v in values:
            aggregates[key][v] += 1

print(aggregates)
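
To get the second half of what you asked for (counting the ids that have a
> 0 count for '83'), a follow-up query over aggregates could look like this
(untested; .get avoids inserting new keys into the inner defaultdicts):

print(sum(1 for counts in aggregates.values()
          if counts.get('83', 0) > 0))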
 
Chris Rebert

Hi,

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

result gives me:
{'a': ['0', '84']},
{'a': ['0', '84']},
{'b': ['1000', '83']},
{'b': ['0', '84']},

I would like to aggregate them per line (or maybe per file), so that after
the complete parsing procedure I can count the number of ids having > 0
entries in '83'.

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

Er, what happened to the '0' for 'b'?
My current solution with MySQL is really slow.

Untested:

# requires Python 2.7+ due to Counter
from collections import defaultdict, Counter

FIELDS = ["field1", "field2"]

id2counter = defaultdict(Counter)
for line in FILE:
    identifier = extract_field("id", line)
    counter = id2counter[identifier]
    for field_name in FIELDS:
        field_val = int(extract_field(field_name, line))
        counter[field_val] += 1

print(id2counter)
print(sum(1 for counter in id2counter.values() if counter[83]))
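
A side note on that last line: counter[83] is safe as a test because a
Counter returns 0 for a missing key instead of raising KeyError, and the
lookup doesn't insert the key. A quick check:

from collections import Counter

c = Counter()
print(c[83])    # 0 -- a missing key counts as zero
print(83 in c)  # False -- the lookup above didn't insert it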

Cheers,
Chris
 
Peter Otten

chris said:
Hi,

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

result gives me:
{'a': ['0', '84']},
{'a': ['0', '84']},
{'b': ['1000', '83']},
{'b': ['0', '84']},

I would like to aggregate them per line (or maybe per file), so that after
the complete parsing procedure I can count the number of ids having > 0
entries in '83'.

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

My current solution with MySQL is really slow.

>>> def rows(lines):
...     for line in lines:
...         yield extract_field("id", line), [extract_field(name, line)
...                                           for name in ("field1", "field2")]
...
>>> for row in rows(lines):
...     print(row)
...
('a', ['0', '84'])
('b', ['1000', '83'])
('a', ['0', '84'])
('b', ['0', '84'])
>>> from collections import defaultdict
>>> class Dict(defaultdict):
...     def __repr__(self): return repr(dict(self))
...
>>> outer = Dict(lambda: Dict(int))
>>> for key, values in rows(lines):
...     inner = outer[key]
...     for v in values:
...         inner[v] += 1
...
>>> outer
{'a': {'0': 2, '84': 2}, 'b': {'83': 1, '1000': 1, '84': 1, '0': 1}}
 
Tim Chase

I would like to parse many thousands of files and aggregate the counts for
the field entries related to every id.

extract_field greps the value for a given field from a line with a regex.

result = [{extract_field("id", line): [extract_field("field1", line),
                                       extract_field("field2", line)]}
          for line in FILE]

I would like to aggregate them per line (or maybe per file), and get after
the complete parsing procedure:

{'a': {'0': 2, '84': 2}}
{'b': {'1000': 1, '83': 1, '84': 1}}

I'm not sure what happened to b['0'] based on your initial data,
but assuming that was an oversight...

from collections import defaultdict

aggregates = defaultdict(lambda: defaultdict(int))
for entry in result:
    for key, values in entry.items():
        for v in values:
            aggregates[key][v] += 1

Or, if you don't need the intermediate result, you can tweak
MRAB's solution and just iterate over the file(s):

aggregates = defaultdict(lambda: defaultdict(int))
for line in FILE:
    key = extract_field("id", line)
    aggregates[key][extract_field("field1", line)] += 1
    aggregates[key][extract_field("field2", line)] += 1

or, if you're using an older version (<2.5) that doesn't provide
defaultdict, you could do something like

aggregates = {}
for line in FILE:
    key = extract_field("id", line)
    d = aggregates.setdefault(key, {})
    for fieldname in ('field1', 'field2'):
        value = extract_field(fieldname, line)
        d[value] = d.get(value, 0) + 1
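
And since you mention many thousands of files, you can chain them all
into a single loop with the standard library's fileinput module instead
of building an intermediate result list per file. A sketch, assuming
paths holds your list of file names:

import fileinput

aggregates = {}
for line in fileinput.input(paths):  # lines of every file, in sequence
    key = extract_field("id", line)
    d = aggregates.setdefault(key, {})
    for fieldname in ('field1', 'field2'):
        value = extract_field(fieldname, line)
        d[value] = d.get(value, 0) + 1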


-tkc
 
chris

I really appreciate all the responses.
It's incredible how fast it is!

Cheers
Christian
 
