How to store "3D" data? (data structure question)

Sebastian Bassi · Jul 20, 2005

Hello,

I have to parse a text file (originally Excel, but I converted it to CSV)
like the one below, and I am not sure how to store it (to manipulate it
later).

Here is an extract of the data:

Name,Allele,RHA280,RHA801,RHA373,RHA377,HA383
TDF1,181,,,,,
,188,,,,,
,190,,,,,
,193,*,*,,,
,None,,,*,*,*
,,,,,,
TDF2,1200,*,*,,,*
,None,,,*,*,
,,,,,,
TDF3,236,,,,,
,240,,,,,
,244,*,,*,,*
,252,*,*,,,
,None,,,,*,
,,,,,,

Should I use lists? A dictionary? Or a combination?
The final goal is to count how many stars (*) each "LINE" has (a line
is RHA280, for instance).
RHA280 has 1 star in TDF1, 1 star in TDF2, and 2 stars in TDF3.

I am lost because I analyze the data line by line (for Line in
FILE), so it is hard to count by column.

Graham Fawcett · Jul 20, 2005

Sebastian said:
Hello,

I have to parse a text file (originally Excel, but I converted it to CSV)
like the one below, and I am not sure how to store it (to manipulate it
later).

Here is an extract of the data:

[snip]

This looks a lot like 2D data (row/column), not 3D. What's the third
axis? It looks, too, that you're not really interested in storage, but
in analysis...

Since your "line" columns all have names, why not use them as keys in a
dictionary? The associated values would be lists, in which you could
keep references to matching rows, or parts of those rows (e.g. name and
allele). Take the length of each list, and you have your "number of
matches".

import csv  # let Python do the grunt work

results = {}  # set up the dict

with open('name-of-file.csv', newline='') as f:
    reader = csv.reader(f)

    headers = next(reader)  # read the first row
    line_names = headers[2:]
    for lname in line_names:  # each key is a line-name
        results[lname] = []

    row_name = ''
    for row in reader:  # iterate the data rows
        if not row:  # skip completely empty lines
            continue
        # the marker name appears only on the first row of each
        # group, so carry it forward to the rows that follow
        row_name = row[0] or row_name
        allele = row[1]
        line_values = row[2:]  # get the line values.
        # zip is your friend here. It lets you iterate
        # across your line names and corresponding values
        # in parallel.
        for lname, value in zip(line_names, line_values):
            if value == '*':
                results[lname].append((row_name, allele))

# a quick look at the results.
for lname, matches in results.items():
    print('%s %d' % (lname, len(matches)))
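
To get the per-marker breakdown asked for in the original question (RHA280 has 1 star in TDF1, 1 in TDF2 and 2 in TDF3), the same loop could count on (line, marker) pairs instead of collecting rows. This is only a minimal sketch: it assumes the extract from the first post is the whole file, and the names SAMPLE and count_stars are made up for illustration.

```python
import csv
import io
from collections import Counter

# The extract from the first post, inlined so the sketch is self-contained.
SAMPLE = """\
Name,Allele,RHA280,RHA801,RHA373,RHA377,HA383
TDF1,181,,,,,
,188,,,,,
,190,,,,,
,193,*,*,,,
,None,,,*,*,*
,,,,,,
TDF2,1200,*,*,,,*
,None,,,*,*,
,,,,,,
TDF3,236,,,,,
,240,,,,,
,244,*,,*,,*
,252,*,*,,,
,None,,,,*,
,,,,,,
"""

def count_stars(lines):
    """Return a Counter mapping (line name, marker name) to a star count."""
    reader = csv.reader(lines)
    line_names = next(reader)[2:]  # header row minus Name and Allele
    counts = Counter()
    marker = ''
    for row in reader:
        if not row:  # skip completely empty lines
            continue
        # The marker name is only written on the first row of each group,
        # so remember the last one seen.
        marker = row[0] or marker
        for lname, value in zip(line_names, row[2:]):
            if value == '*':
                counts[(lname, marker)] += 1
    return counts

counts = count_stars(io.StringIO(SAMPLE))
for marker in ('TDF1', 'TDF2', 'TDF3'):
    print('RHA280', marker, counts[('RHA280', marker)])
```

For a real file, the io.StringIO(SAMPLE) argument would be replaced by an open file object.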

Graham

Sebastian Bassi · Jul 20, 2005

This looks a lot like 2D data (row/column), not 3D. What's the third
axis? It looks, too, that you're not really interested in storage, but
in analysis...

I think of it as 3D, like this:
1st axis: [MARKER]Name, like TDF1, TDF2.
2nd axis: Allele, like 181, 188 and so on.
3rd axis: Line: RHA280, RHA801.

I can have a star in MarkerName TDF1, Allele 181 and Line RHA280.
I can have an empty (or none) in TDF1, Allele 181 and Line RHA801.

What I'd like to know is what would be a suitable structure to handle this data.
Thank you very much!

Sebastian Bassi · Jul 20, 2005

# zip is your friend here. It lets you iterate
# across your line names and corresponding values
# in parallel.

This zip function is new to me; the only zip I knew was pkzip. So I
will read about it.
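
For reference, zip pairs up items from two (or more) sequences position by position, which is how each value in a data row gets matched to its column name. A small sketch using the header and the TDF1/193 row from the extract above; in Python 3 zip returns an iterator, so it is wrapped in list() here to make the pairs visible.

```python
line_names = ['RHA280', 'RHA801', 'RHA373', 'RHA377', 'HA383']
line_values = ['*', '*', '', '', '']  # the TDF1 / 193 row, third column onward

# Pair each line name with the value in the same position.
pairs = list(zip(line_names, line_values))
print(pairs)

# Keep only the line names whose value is a star.
starred = [lname for lname, value in pairs if value == '*']
print(starred)  # ['RHA280', 'RHA801']
```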

Graham Fawcett · Jul 20, 2005

Sebastian said:
This looks a lot like 2D data (row/column), not 3D. What's the third
axis? It looks, too, that you're not really interested in storage, but
in analysis...

I think of it as 3D, like this:
1st axis: [MARKER]Name, like TDF1, TDF2.
2nd axis: Allele, like 181, 188 and so on.
3rd axis: Line: RHA280, RHA801.

I can have a star in MarkerName TDF1, Allele 181 and Line RHA280.
I can have an empty (or none) in TDF1, Allele 181 and Line RHA801.

Okay. I think what will drive your data-structure question is the way
that you intend to use the data. Conceptually, it will always be 3D, no
matter how you model it, but trying to make a "3D data structure" is
probably not what is most efficient for your application.

If 90% of your searches are of the type, 'does TDF1/181/RHA280 have a
star?' then perhaps a dict using (name,allele,line) as a key makes most
sense:

d = {('TDF1',181,'RHA280'):'*', ...}
query = ('TDF1', 181, 'RHA280')
assert query in d

Really, you don't need '*' as a value for this; just use None if you
like, since all the really useful info is in the keys of the dict.
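
A runnable sketch of that idea, using a few starred cells taken from the extract at the top of the thread. Two things here are choices made for this sketch rather than anything stated above: a set is used instead of a dict with dummy values (since only the keys matter), and the alleles are kept as strings, because the csv module returns every field as a string ('181', not 181). The has_star helper is a made-up name.

```python
# (marker name, allele, line) triples that carry a star
stars = {
    ('TDF1', '193', 'RHA280'),
    ('TDF1', '193', 'RHA801'),
    ('TDF2', '1200', 'RHA280'),
    ('TDF3', '244', 'RHA280'),
    ('TDF3', '252', 'RHA280'),
}

def has_star(name, allele, line):
    """Exact-match query: a single hash lookup."""
    return (name, allele, line) in stars

print(has_star('TDF1', '193', 'RHA280'))  # True
print(has_star('TDF1', '181', 'RHA280'))  # False

# Any other kind of question means scanning every key.
rha280_total = sum(1 for _, _, line in stars if line == 'RHA280')
print(rha280_total)  # 4
```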

If you're always querying based on line first, then something like my
earlier 'results' dict might make sense:

d = {'RHA280':[('TDF1',181), ...], ...}
for name, allele in d['RHA280']:
    if allele == 181: # or some other "query" within RHA280
        ...

You get the idea: model the data in the way that makes it most useable
to you, and/or most efficient (if this is a large data set).

But note that by picking a structure like this, you're making it easy
to do certain lookups, but possibly harder (and slower) to do ones you
hadn't thought of yet.

The general solution would be to drop it into a relational database and
use SQL queries. Multidimensional analysis is what relational DBs are
for, after all. A hand-written data structure is almost guaranteed to
be more efficient for a given task, but maybe the flexibility of a
relational db would help serve multiple needs, where a custom structure
may only be suitable for a few applications.
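
As a rough sketch of the relational approach, Python's standard sqlite3 module can hold the starred cells in an in-memory table, one row per star, with no database server involved. The table and column names below are made up for illustration, and the rows are a handful of starred cells from the extract at the top of the thread.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE star (marker TEXT, allele TEXT, line TEXT)')
conn.executemany(
    'INSERT INTO star VALUES (?, ?, ?)',
    [
        ('TDF1', '193', 'RHA280'),
        ('TDF1', '193', 'RHA801'),
        ('TDF2', '1200', 'RHA280'),
        ('TDF3', '244', 'RHA280'),
        ('TDF3', '252', 'RHA280'),
    ],
)

# Stars per marker for a single line
per_marker = conn.execute(
    'SELECT marker, COUNT(*) FROM star WHERE line = ? '
    'GROUP BY marker ORDER BY marker',
    ('RHA280',),
).fetchall()
print(per_marker)  # [('TDF1', 1), ('TDF2', 1), ('TDF3', 2)]

# Stars per line across all markers
per_line = conn.execute(
    'SELECT line, COUNT(*) FROM star GROUP BY line ORDER BY line'
).fetchall()
print(per_line)  # [('RHA280', 4), ('RHA801', 1)]

conn.close()
```

The point of the sketch is that both queries run against the same table, whereas a hand-built dict would need a different shape for each.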

If you're going to roll your own structure, just keep in mind that
dict-lookups are very fast in Python, far more efficient than, e.g.,
checking for membership in a list.

Graham

Sebastian Bassi · Jul 20, 2005

You get the idea: model the data in the way that makes it most useable
to you, and/or most efficient (if this is a large data set).

I don't think this could be called a large dataset (the whole file is
about 40 KB).
It would be overkill to load it into MySQL (or any *SQL).
I only need to parse it to reformat it.
May I send the text file to your email, along with a sample of the needed
output? It seems you understand a lot about this topic and could do
it very easily (I've been trying all day to solve it without success).

I know this is not a usual request, but it would help me a lot and
I would learn from your code (I'm still trying to understand the zip
built-in function, which seems useful).
