Best/better way? (histogram)

Bernard Rankin · Jan 28, 2009

Hello,

I've got several versions of code here to generate a histogram-esque structure from rows in a CSV file.

The basic approach is to use a dict as a bucket collection to count instances of data items.

Other than the try/except(KeyError) idiom for dealing with new bucket names, which I don't like because it describes the initial state of a key's value _after_ you've just described what to do with the existing value, I've come up with a few other methods.
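
For reference, a minimal sketch of the try/except(KeyError) idiom in question, assuming a plain dict of integer counters. The name count_with_try is hypothetical and does not appear in the script below.

```python
def count_with_try(values):
    """Count occurrences of each value using try/except KeyError."""
    buckets = {}
    for value in values:
        try:
            # The update of an existing bucket is described first...
            buckets[value] += 1
        except KeyError:
            # ...and the initial state of a new bucket only afterwards.
            buckets[value] = 1
    return buckets


counts = count_with_try(["a", "b", "a"])
print(counts)
```

The ordering of the two branches is what the complaint above is about: the "first time" case reads as an afterthought.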

What seems like the most reasonable approach?
Do you have any other ideas?
Is the try/except(KeyError) idiom really the best?

In the code below you will see several 4-line groups of code. Within each group, the n-th line belongs to the n-th solution, so the n-th lines taken together make up one solution to the problem. (Cases 1 & 2 do differ from cases 3 & 4 in the final outcome.)

Thank you

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

from collections import defaultdict
from csv import DictReader
from pprint import pprint

dataFile = open("sampledata.csv")
dataRows = DictReader(dataFile)

catagoryStats = defaultdict(lambda : {'leaf' : '', 'count' : 0})
#catagoryStats = {}
#catagoryStats = defaultdict(int)
#catagoryStats = {}

for row in dataRows:
    catagoryRaw = row['CATEGORIES']
    catagoryLeaf = catagoryRaw.split('|').pop()

    ## csb => Catagory Stats Bucket
    ## multi-statement lines are used for ease of method switching.

    csb = catagoryStats[catagoryRaw]; csb['count'] += 1; csb['leaf'] = catagoryLeaf
    #csb = catagoryStats.setdefault(catagoryRaw, {'leaf' : '', 'count' : 0}); csb['count'] += 1; csb['leaf'] = catagoryLeaf
    #catagoryStats[catagoryRaw] += 1
    #catagoryStats[catagoryRaw] = catagoryStats.get(catagoryRaw, 0) + 1

catagoryStatsSorted = catagoryStats.items()

catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1]['count'], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1]['count'], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1], reverse=1)

pprint(catagoryStatsSorted, indent=4, width=60)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sampledata.csv
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CATEGORIES,SKU
"computers|laptops|accessories",12345
"computers|laptops|accessories",12345
"computers|laptops|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"toys|really|super_fun",12345
"toys|really|super_fun",12345
"toys|really|super_fun",12345
"toys|really|not_at_all_fun",12345

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
output: (in case #1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In [1]: %run catstat.py
[   (   'computers|servers|accessories',
        {'count': 5, 'leaf': 'accessories'}),
    (   'toys|really|super_fun',
        {'count': 3, 'leaf': 'super_fun'}),
    (   'computers|laptops|accessories',
        {'count': 3, 'leaf': 'accessories'}),
    (   'toys|really|not_at_all_fun',
        {'count': 1, 'leaf': 'not_at_all_fun'})]

Peter Otten · Jan 28, 2009

Bernard said:
I've got several versions of code here to generate a histogram-esque
structure from rows in a CSV file.

The basic approach is to use a dict as a bucket collection to count
instances of data items.

Other than the try/except(KeyError) idiom for dealing with new bucket
names, which I don't like because it describes the initial state of a
key's value _after_ you've just described what to do with the existing
value, I've come up with a few other methods.

What seems like the most reasonable approach?

The simplest. That would be #3, cleaned up a bit:

from collections import defaultdict
from csv import DictReader
from pprint import pprint
from operator import itemgetter

def rows(filename):
    infile = open(filename, "rb")
    for row in DictReader(infile):
        yield row["CATEGORIES"]

def stats(values):
    histo = defaultdict(int)
    for v in values:
        histo[v] += 1
    return sorted(histo.iteritems(), key=itemgetter(1), reverse=True)

Should you need the inner dict (which doesn't seem to offer any additional
information) you can always add another step:

def format(items):
    result = []
    for raw, count in items:
        leaf = raw.rpartition("|")[2]
        result.append((raw, dict(count=count, leaf=leaf)))
    return result

pprint(format(stats(rows("sampledata.csv"))), indent=4, width=60)
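
The code above is written for Python 2 (iteritems(), binary mode for the csv module). A rough Python 3 sketch of the same three steps follows, under a few assumptions that are not in the original: rows() takes an open text file instead of a filename, the data comes from an in-memory sample rather than sampledata.csv, and the formatting step is renamed to the hypothetical add_leaf() so it does not shadow the built-in format().

```python
import csv
import io
from collections import defaultdict
from operator import itemgetter

# Shortened, in-memory stand-in for sampledata.csv (assumption).
SAMPLE = '''CATEGORIES,SKU
"computers|laptops|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"toys|really|super_fun",12345
'''


def rows(infile):
    # Yield the raw category string of every CSV row.
    for row in csv.DictReader(infile):
        yield row["CATEGORIES"]


def stats(values):
    # Count each value, then sort the (value, count) pairs by count.
    histo = defaultdict(int)
    for v in values:
        histo[v] += 1
    return sorted(histo.items(), key=itemgetter(1), reverse=True)


def add_leaf(items):
    # Optional step: wrap each count in a dict that also holds the leaf name.
    return [(raw, {"count": count, "leaf": raw.rpartition("|")[2]})
            for raw, count in items]


result = add_leaf(stats(rows(io.StringIO(SAMPLE))))
for entry in result:
    print(entry)
```

When reading from a real file in Python 3, the csv documentation recommends opening it with newline="" in text mode rather than "rb".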

By the way, if you had broken the problem in steps like above you could have
offered four different stats() functions, which would have been a bit
easier to read...
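
As an illustration of that suggestion, here is a sketch of four interchangeable counting functions. It assumes Python 3 and plain integer counts (not the inner-dict form of cases 1 and 2); all function names are hypothetical, and the try/except variant is included only for comparison.

```python
from collections import defaultdict
from operator import itemgetter


def sort_by_count(histo):
    # Shared final step: (value, count) pairs, most frequent first.
    return sorted(histo.items(), key=itemgetter(1), reverse=True)


def stats_try(values):
    histo = {}
    for v in values:
        try:
            histo[v] += 1
        except KeyError:
            histo[v] = 1
    return sort_by_count(histo)


def stats_setdefault(values):
    histo = {}
    for v in values:
        histo[v] = histo.setdefault(v, 0) + 1
    return sort_by_count(histo)


def stats_defaultdict(values):
    histo = defaultdict(int)
    for v in values:
        histo[v] += 1
    return sort_by_count(histo)


def stats_get(values):
    histo = {}
    for v in values:
        histo[v] = histo.get(v, 0) + 1
    return sort_by_count(histo)


sample = ["a", "b", "a", "c", "a", "b"]
variants = (stats_try, stats_setdefault, stats_defaultdict, stats_get)
results = [variant(sample) for variant in variants]
print(results[0])
```

With each approach in its own function, switching methods means changing one name at the call site instead of commenting lines in and out.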

Peter

Bernard Rankin · Jan 28, 2009

Thank you. The code reorganization does make it easier to read.

I'll have to look up the docs on itemgetter().
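
For anyone else looking it up, a small sketch of what operator.itemgetter does when used as a sort key. The sample pairs are made up for illustration.

```python
from operator import itemgetter

# Hypothetical (name, count) pairs, as produced by a counting step.
pairs = [("laptops", 3), ("servers", 5), ("toys", 1)]

# itemgetter(1) builds a callable that returns element 1 of its argument,
# so it behaves like: lambda item: item[1]
get_count = itemgetter(1)
first_count = get_count(pairs[0])

# Used as a sort key, it orders the pairs by their count.
by_count = sorted(pairs, key=itemgetter(1), reverse=True)
by_lambda = sorted(pairs, key=lambda item: item[1], reverse=True)

print(first_count)
print(by_count)
```

Both sorts give the same order; itemgetter mainly saves writing the lambda.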
