Data aggregation


V

vedranp

Hi,

I have a case where I should aggregate data from the CSV file, which
contains data in this way:

DATE TIME COUNTRY ZIP CITY VALUE1 VALUE2 VALUE3
21.2.2008 00:00 A 1000 CITY1 1 2 3
21.2.2008 00:00 A 1000 CITY2 4 5 6
21.2.2008 00:00 A 1000 CITY3 7 8 9
21.2.2008 00:00 A 1000 CITY4 1 2 3
21.2.2008 00:15 A 1000 CITY1 4 5 6
21.2.2008 00:15 A 1000 CITY2 7 8 9
21.2.2008 00:15 A 1000 CITY3 1 2 3
21.2.2008 00:15 A 1000 CITY4 4 5 6
21.2.2008 00:00 A 2000 CITY10 7 8 9
21.2.2008 00:00 A 2000 CITY20 1 2 3
21.2.2008 00:00 A 2000 CITY30 4 5 6
21.2.2008 00:00 A 2000 CITY40 1 2 3
21.2.2008 00:15 A 2000 CITY10 7 8 9
21.2.2008 00:15 A 2000 CITY20 1 2 3
21.2.2008 00:15 A 2000 CITY30 4 5 6
21.2.2008 00:15 A 2000 CITY40 1 2 3

I need to aggregate data from file1, so the result would be a CSV file
(file2) in this format:

DATE COUNTRY ZIP CITY SumOfVALUE1 SumOfVALUE2 SumOfVALUE3 formula1
21.2.2008 A 1000 CITY1 5 7 9 12
21.2.2008 A 1000 CITY2 11 13 15 24
21.2.2008 A 1000 CITY3 8 10 12 18
21.2.2008 A 1000 CITY4 5 7 9 12
21.2.2008 A 2000 CITY10 14 16 18 30
21.2.2008 A 2000 CITY20 2 4 6 6
21.2.2008 A 2000 CITY30 8 10 12 18
21.2.2008 A 2000 CITY40 2 4 6 6

So, group by DATE, COUNTRY, ZIP and CITY and sum (or do some
calculation) the values and do some calculation from summed fields
(e.g.: formula1 = SumOfVALUE1+SumOfVALUE2). I am able to do this by
first loading file1 in SQL, perform a query there, which returns the
file2 results and then load it back in the SQL in the different table.

I would like to avoid the step of taking data out from database in
order to process it. I would like to process the file1 in Python and
load the result (file2) in SQL.

From some little experience with Perl, I think this is managable with
double hash tables (1: basic hash with key/value = CITY/pointer-to-
other-hash, 2: hash table with values for CITY1), so I assume that
there would be also a way in Python, maybe with dictionaries? Any
ideas?

Regards,
Vedran.
 
Ad

Advertisements

J

jay graves

So, group by DATE, COUNTRY, ZIP and CITY and sum (or do some

You are soooo close. Look up itertools.groupby
Don't forget to sort your data first.

http://aspn.activestate.com/ASPN/search?query=groupby&x=0&y=0&section=PYTHONCKBK&type=Subsection
http://mail.python.org/pipermail/python-list/2006-June/388004.html

From some little experience with Perl, I think this is managable with
double hash tables (1: basic hash with key/value = CITY/pointer-to-
other-hash, 2: hash table with values for CITY1), so I assume that
there would be also a way in Python, maybe with dictionaries? Any
ideas?

Sometimes it makes sense to do this with dictionaries. For example,
if you need to do counts on various combinations of columns.

count of unique values in column 'A'
count of unique values in column 'C'
count of unique combinations of columns 'A' and 'B'
count of unique combinations of columns 'A' and 'C'
count of unique combinations of columns 'B' and 'C'
in all cases, sum(D) and avg(E)

Since I need 'C' by itself, and 'A' and 'C' together, I can't just
sort and break on 'A','B','C'.

HTH
....
jay graves
 
J

John Nagle

vedranp said:
I would like to avoid the step of taking data out from database in
order to process it.

You can probably do this entirely within SQL. Most SQL databases,
including MySQL, will let you put the result of a SELECT into a new
table.

John Nagle
 
Ad

Advertisements


Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top