data management with python from perl

B

ben moretti

hi

i'm learning python, and one area i'd use it for is data management in
scientific computing. in the case i've tried i want to reformat a data
file from a normalised list to a matrix with some sorted columns. to
do this at the moment i am using perl, which is very easy to do, and i
want to see if python is as easy.

so, the data i am using is some epiphyte population abundance data for
particular sites, and it looks like this:

1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 2" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 3" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 4" 1.00

(i have changed the data to protect the innocent) the first four
columns relate to the location, the fifth to the substrate, the sixth
is the epiphyte species and the seventh the abundance. i need to turn
this into a substrate x species matrix with columns 1 to 4 retained as
sorting columns and the intersection of speces and substrate being the
abundance. the species name needs to be the column headers. this is
going to go into a multivariate analysis of variance programme that
only takes its data in that format. here is an example of the output

region location site stand substrate genus species 1 genus species
2 genus species 3 genus species 4 genus species 5 genus species
6 genus species 7

<..etc..>

1 1 1 MO 906 0 0 0 0 0 0 0 0 0 0 0 0 0 0

<..etc...>

so, to do this in perl - and i won't bore you with the whole script -
i read the file, split it into tokens and then populate a hash of
hashes, the syntax of which is

$HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]}
= $tokens[6]

with the various location and species values are the keys of the hash,
and the abundance is the $tokens[6] value. this now gives me a
multidimensional data structure that i can use to loop over the keys
and sort them by each as i go, then to write out the data into a
matrix as above. the syntax for this is generally like

# level 1 - region
foreach $region (sort {$a <=> $b} keys %HoH) {

# level 2 - location
foreach $location (sort {$a <=> $b} keys %{ $HoH{$region} }) {

# level 3 - site
foreach $site (sort {$a <=> $b} keys %{ $HoH{$region}{$location} })

<... etc ...>

there is a bit more perl obviously, but that is the general gist of
it. multidimensional hash and then looping and sorting to get the data
out.

ok. so how do i do this in python? i've tried the "perlish" way but
didn't get very far, however i know it must be able to be done!

if you want to respond to this, try benmoretti at yahoo dot com dot au
as i get too much spam otherwise

cheers

ben
 
P

Peter Otten

ben said:
hi

i'm learning python, and one area i'd use it for is data management in
scientific computing. in the case i've tried i want to reformat a data
file from a normalised list to a matrix with some sorted columns. to
do this at the moment i am using perl, which is very easy to do, and i
want to see if python is as easy.

so, the data i am using is some epiphyte population abundance data for
particular sites, and it looks like this:

1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 2" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 3" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 4" 1.00

(i have changed the data to protect the innocent) the first four
columns relate to the location, the fifth to the substrate, the sixth
is the epiphyte species and the seventh the abundance. i need to turn
this into a substrate x species matrix with columns 1 to 4 retained as
sorting columns and the intersection of speces and substrate being the
abundance. the species name needs to be the column headers. this is
going to go into a multivariate analysis of variance programme that
only takes its data in that format. here is an example of the output

region location site stand substrate genus species 1 genus species
2 genus species 3 genus species 4 genus species 5 genus species
6 genus species 7

<..etc..>

1 1 1 MO 906 0 0 0 0 0 0 0 0 0 0 0 0 0 0

<..etc...>

so, to do this in perl - and i won't bore you with the whole script -
i read the file, split it into tokens and then populate a hash of
hashes, the syntax of which is

$HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]}
= $tokens[6]

with the various location and species values are the keys of the hash,
and the abundance is the $tokens[6] value. this now gives me a
multidimensional data structure that i can use to loop over the keys
and sort them by each as i go, then to write out the data into a
matrix as above. the syntax for this is generally like

# level 1 - region
foreach $region (sort {$a <=> $b} keys %HoH) {

# level 2 - location
foreach $location (sort {$a <=> $b} keys %{ $HoH{$region} }) {

# level 3 - site
foreach $site (sort {$a <=> $b} keys %{ $HoH{$region}{$location} })

<... etc ...>

there is a bit more perl obviously, but that is the general gist of
it. multidimensional hash and then looping and sorting to get the data
out.

ok. so how do i do this in python? i've tried the "perlish" way but
didn't get very far, however i know it must be able to be done!

The best solution would probably to be to rely on a database that supports
pivot tables.
However, I've put together a simple class to generate a pivot table to get
you started. It's only 2D, i. e. f(row,col) -> value, but if I have
understood you correctly that should be sufficient (I am not good at
reading perl).
To read your data from a (text) file, have a look at Python's csv module.

Peter

<code>
import sets

class Adder(object):
""" Adds all values entered via set()
"""
def __init__(self, value=0):
self.value = value
def set(self, value):
self.value += value
def get(self):
return self.value

_none = object()
class First(object):
""" Accepts any value the first time set() is called,
requires the same value on subsequent calls of set().
"""
def __init__(self):
self.value = _none
def set(self, value):
if self.value is _none:
self.value = value
else:
if value != self.value:
raise ValueError, "%s expected but got %s" % (self.value,
value)
def get(self):
return self.value

class Pivot(object):
""" A simple Pivot table generator class
"""
def __init__(self, valueAccumulator, rowHeaders):
self.rows = sets.Set()
self.columns = sets.Set()
self.values = {}
self.valueAccumulator = valueAccumulator
self.rowHeaders = rowHeaders
def extend(self, table, extractRow, extractColumn, extractValue):
for record in table:
r = extractRow(record)
c = extractColumn(record)
self.rows.add(r)
self.columns.add(c)
try:
fxy = self.values[r, c]
except KeyError:
fxy = self.valueAccumulator()
self.values[r, c] = fxy
fxy.set(extractValue(record))

def toTable(self, defaultValue=None, columnCompare=None,
rowCompare=None):
""" returns a list of lists.
"""
table = []
rows = list(self.rows)
rows.sort(rowCompare)
columns = list(self.columns)
columns.sort(columnCompare)
headers = self.rowHeaders + [c for c in columns]
table.append(headers)
for row in rows:
record = list(row)
for column in columns:
v = self.values.get((row, column), None)
if v is not None:
v = v.get()
record.append(v)
table.append(record)
return table


def printTable(p):
for row in p.toTable():
print row

if __name__ == "__main__":
table = [
"Jack Welsh Beer 1",
"Richard Maier Beer 1",
"Bill Bush Wine 2",
"Bill Bush Wine 2",
]
table = [row.split() for row in table]
print table
print "-" * 10
p = Pivot(Adder, ["Christian", "Surname"])
def extractRow(record):
return record[0], record[1]
def extractValue(record):
return int(record[3])
def extractColumn(record):
return record[2]
p.extend(table, extractRow, extractColumn, extractValue)

printTable(p)

columns = "region location site stand substrate species
abundance".split()

table = [
[1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
[1.0, 1.0, 1.0, "MO", 906, "species 2", 1],
[1.0, 1.0, 1.0, "MO", 906, "species 3", 1],
[1.0, 1.0, 1.0, "MO", 906, "species 1", 1],
[1.0, 1.0, 1.0, "GO", 706, "species 4", 1],
# [1.0, 1.0, 1.0, "GO", 706, "species 4", 2],# uncomment me
[1.0, 1.0, 1.0, "GO", 806, "species 1", 1],
[1.0, 1.0, 1.0, "GO", 906, "species 1", 1],
[1.0, 1.0, 1.0, "GO", 106, "species 1", 1],
]
p = Pivot(First, columns[:5])
p.extend(table, lambda r: tuple(r[:5]),
lambda r: r[5],
lambda r: r[6])
printTable(p)
</code>
 
C

Christopher Koppler

i'm learning python, and one area i'd use it for is data management in
scientific computing. in the case i've tried i want to reformat a data
file from a normalised list to a matrix with some sorted columns. to
do this at the moment i am using perl, which is very easy to do, and i
want to see if python is as easy.

Not being too familiar with Perl (or scientific computing), I'm not
sure if I understood everything correctly...
1.00 1.00 1.00 "MO" 906.00 "genus species 1" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 2" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 3" 1.00
1.00 1.00 1.00 "MO" 906.00 "genus species 4" 1.00

I _think_ you want your data as a nested dictionary like so:
{1: {1: {1: {"MO": {906: {"genus species 1": 1,
"genus species 2": 1,
"genus species 3": 1,
"genus species 4": 1} }}}}}

so, to do this in perl - and i won't bore you with the whole script -
i read the file, split it into tokens

I hope I will NOT bore you with a whole script, but I've expanded your
data a bit to have a somewhat more complicated/structured data file to
work with (not shown here, this's more than long enough as it is); so
I'll first read it in and split it up:

###

import csv

f = open(r"i:\python\nestedtest.txt", "r") # my testdata
csvreader = csv.reader(f, delimiter=' ', quotechar='"')

###

From your output I gather that maybe you the numbers as numbers, and
not as strings, so I'll convert the data while populating an
intermediate list:

###

def parselist(lst):
"""convert the list's values to floats or integers where
appropriate"""
parsed = []
for itm in lst:
try:
f = float(itm)
i = int(f)
if i == f:
parsed.append(int(i))
else:
parsed.append(f)
except ValueError:
parsed.append(itm)
return parsed

datalist = []
for line in csvreader:
datalist.append(parselist(line))
f.close() # don't need that anymore

###
and then populate a hash of
hashes, the syntax of which is

$HoH{$tokens[0]}{$tokens[1]}{$tokens[2]}{$tokens[3]}{$tokens[4]}{$tokens[5]}
= $tokens[6]

Now, if that does what I think it does (create a nested hash), then
hats off to Perl! I haven't found anything as concise built into
Python (but then I'm not a guru, maybe someone else knows a better
way?), so I rolled my own:

###

def nestdict(lst):
"""create a recursively nested dictionary from a _flat_ list"""
dct = {}
if len(lst) > 2:
dct[lst[0]] = nestdict(lst[1:])
elif len(lst) == 2:
dct[lst[0]] = lst[1]
return dct

###

which is good for ONE line of input; since I have a list of those, I
want to build up the dictionary line by line, for which I need another
function:

###

def nestextend(dct, upd):
"""recursively extend/update a nested dictionary with another one"""
try:
items = upd.items()
for key, val in items:
if key not in dct:
dct[key] = val
else:
nestextend(dct[key], upd[key])
except AttributeError:
dct.update(upd)

datadict = {}
for lst in datalist:
nestextend(datadict, nestdict(lst))

###

datadict now holds all the data from the testfile in a nested
dictionary with the various locations and species values as the keys
of the hash, which is what (I hope) you wanted.
and the abundance is the $tokens[6] value. this now gives me a
multidimensional data structure

Reading that I'm not sure I've understood anything - shouldn't you
want to use a multidimensional array for that? Anyone familiar with
Python's scientific/number crunching/array libraries should be able to
clear that up...
that i can use to loop over the keys and sort them by each as i go,
then to write out the data into a matrix as above.

I'm not sure how you arrive at your matrix output, but looping over
the dictionary shouldn't be a problem now. However, since you also
want to sort the data (by key), and dictionaries notoriously don't
support that, I've written another function:

###

def nestsort(dct):
"""convert a nested dictionary to a nested (key, value) list,
recursively sorting it by key"""
lst = []
try:
items = dct.items()
items.sort()
for key, value in items:
lst.append([key, nestsort(dct[key])])
return lst
except AttributeError:
return dct

sorteddata = nestsort(datadict)

###

So now the data from the beginning looks like:

[1, [1, [1, ["MO", [906, ["genus species 1", 1],
["genus species 2", 1],
["genus species 3", 1],
["genus species 4", 1] ]]]]]

which you probably could have had cheaper...

Now you can do something like:

###

for region, rdata in sorteddata:
print "Region", region
for location, ldata in rdata:
print " " * 2 + "Location", location
for site, sitedata in ldata:
print " " * 4 + "Site", site
for stand, stdata in sitedata:
print " " * 6 + "Stand", stand
for substrate, subdata in stdata:
print " " * 8 + "Substrate", substrate
for genus, abundance in subdata:
print " " * 10 + "Genus", genus, "Abundance", abundance

###

to test my script and your (real) data.
There's next to no error-checking and it sure'd be more
pythonic/beautiful/reusable if I'd subclass'd dict, but it works --
for my data at least.
ok. so how do i do this in python? i've tried the "perlish" way but

Once more said:
didn't get very far, however i know it must be able to be done!

I don't think there's much of anything either language can do that the
other can't, but of course some things are harder than others...
if you want to respond to this, try benmoretti at yahoo dot com dot au
as i get too much spam otherwise

<posted to the NG and forwarded to you>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Staff online

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,071
Latest member
MetabolicSolutionsKeto

Latest Threads

Top