How to store "3D" data? (data structure question)


Sebastian Bassi

Hello,

I have to parse a text file (was excel, but I translated to CSV) like
the one below, and I am not sure how to store it (to manipulate it
later).

Here is an extract of the data:

Name,Allele,RHA280,RHA801,RHA373,RHA377,HA383
TDF1,181,,,,,
,188,,,,,
,190,,,,,
,193,*,*,,,
,None,,,*,*,*
,,,,,,
TDF2,1200,*,*,,,*
,None,,,*,*,
,,,,,,
TDF3,236,,,,,
,240,,,,,
,244,*,,*,,*
,252,*,*,,,
,None,,,,*,
,,,,,,

Should I use lists? A dictionary? Or a combination?
The final goal is to "count" how many stars (*) each "LINE" has (a line
is RHA280, for instance).
RHA280 has 1 star in TDF1, 1 star in TDF2, and 2 stars in TDF3.

I am lost because I analyze the data "line by line" (for Line in
FILE), so it is hard to count by column.
 

Graham Fawcett

Sebastian said:
Hello,

I have to parse a text file (was excel, but I translated to CSV) like
the one below, and I am not sure how to store it (to manipulate it
later).

Here is an extract of the data:
[snip]

This looks a lot like 2D data (row/column), not 3D. What's the third
axis? It looks, too, that you're not really interested in storage, but
in analysis...

Since your "line" columns all have names, why not use them as keys in a
dictionary? The associated values would be lists, in which you could
keep references to matching rows, or parts of those rows (e.g. name and
allele). Take the length of each list, and you have your "number of
matches".



import csv  # let Python do the grunt work

f = open('name-of-file.csv', newline='')
reader = csv.reader(f)

headers = next(reader)  # read the first row
line_names = headers[2:]

results = {}  # set up the dict
for lname in line_names:  # each key is a line-name
    results[lname] = []

for row in reader:  # iterate the data rows
    # note: row_name is '' on rows that continue a marker block
    row_name, allele = row[:2]
    line_values = row[2:]  # get the line values.
    # zip is your friend here. It lets you iterate
    # across your line names and corresponding values
    # in parallel.
    for lname, value in zip(line_names, line_values):
        if value == '*':
            results[lname].append((row_name, allele))

f.close()

# a quick look at the results.
for lname, matches in results.items():
    print('%s %d' % (lname, len(matches)))
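
One detail the extract raises: only the first row of each marker carries a name (TDF1, TDF2, ...), so the rows that follow come through with an empty name. Below is a minimal sketch, assuming the layout shown in the extract, that carries the last seen name forward and tallies stars per line and per marker. The names SAMPLE and count_stars are made up for illustration, and the data is inlined so the example runs without a file.

```python
import csv
import io
from collections import defaultdict

# The extract from the original post, inlined for illustration.
SAMPLE = """\
Name,Allele,RHA280,RHA801,RHA373,RHA377,HA383
TDF1,181,,,,,
,188,,,,,
,190,,,,,
,193,*,*,,,
,None,,,*,*,*
,,,,,,
TDF2,1200,*,*,,,*
,None,,,*,*,
,,,,,,
TDF3,236,,,,,
,240,,,,,
,244,*,,*,,*
,252,*,*,,,
,None,,,,*,
,,,,,,
"""


def count_stars(lines):
    """Return {line_name: {marker_name: number_of_stars}}."""
    reader = csv.reader(lines)
    line_names = next(reader)[2:]
    counts = {lname: defaultdict(int) for lname in line_names}
    marker = None
    for row in reader:
        if not row:
            continue  # skip completely blank lines
        if row[0]:
            marker = row[0]  # a new marker block starts here
        for lname, value in zip(line_names, row[2:]):
            if value == '*':
                counts[lname][marker] += 1
    return counts


counts = count_stars(io.StringIO(SAMPLE))
for marker, n in sorted(counts['RHA280'].items()):
    print(marker, n)
```

For RHA280 this prints 1 for TDF1, 1 for TDF2 and 2 for TDF3, which matches the counts given in the question.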


Graham
 

Sebastian Bassi

Graham said:
This looks a lot like 2D data (row/column), not 3D. What's the third
axis? It looks, too, that you're not really interested in storage, but
in analysis...

I think of it as 3D, like this:
1st axis: [MARKER]Name, like TDF1, TDF2.
2nd axis: Allele, like 181, 188 and so on.
3rd axis: Line: RHA280, RHA801.

I can have a star in MarkerName TDF1, Allele 181 and Line RHA280.
I can have an empty (or none) in TDF1, Allele 181 and Line RHA801.

What I'd like to know is: what would be a suitable structure to handle this data?
Thank you very much!
 

Sebastian Bassi

Graham said:
# zip is your friend here. It lets you iterate
# across your line names and corresponding values
# in parallel.

This zip function is new to me; the only zip I knew was pkzip :). So
I will read about it.
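
For anyone else meeting zip for the first time, here is a minimal sketch of what it does, using the header names and the TDF1/193 row from the extract: it pairs up items from two sequences position by position, and stops at the end of the shorter one.

```python
line_names = ['RHA280', 'RHA801', 'RHA373', 'RHA377', 'HA383']
line_values = ['*', '*', '', '', '']  # the TDF1 / 193 row

# zip yields one (name, value) pair per column.
pairs = list(zip(line_names, line_values))
starred = [lname for lname, value in pairs if value == '*']
print(starred)
```

This prints the two lines that have a star in that row, RHA280 and RHA801.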
 

Graham Fawcett

Sebastian said:
This looks a lot like 2D data (row/column), not 3D. What's the third
axis? It looks, too, that you're not really interested in storage, but
in analysis...

I think of it as 3D, like this:
1st axis: [MARKER]Name, like TDF1, TDF2.
2nd axis: Allele, like 181, 188 and so on.
3rd axis: Line: RHA280, RHA801.

I can have a star in MarkerName TDF1, Allele 181 and Line RHA280.
I can have an empty (or none) in TDF1, Allele 181 and Line RHA801.

Okay. I think what will drive your data-structure question is the way
that you intend to use the data. Conceptually, it will always be 3D, no
matter how you model it, but trying to make a "3D data structure" is
probably not what is most efficient for your application.

If 90% of your searches are of the type, 'does TDF1/181/RHA280 have a
star?' then perhaps a dict using (name,allele,line) as a key makes most
sense:

d = {('TDF1',181,'RHA280'):'*', ...}
query = ('TDF1', 181, 'RHA280')
assert query in d

Really, you don't need '*' as a value for this, just use None if you
like, since all the real useful info is in the keyspace of the dict.
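
Taking that one step further, a set of (name, allele, line) tuples is a natural fit when only the presence of a star matters. This is a sketch under the assumption that alleles are kept as strings, which is how the csv module returns them (the snippet above uses the integer 181). The names stars and has_star are illustrative, and only a few starred cells from the extract are listed.

```python
# Each starred cell becomes one tuple; entries taken from the extract.
stars = {
    ('TDF1', '193', 'RHA280'),
    ('TDF1', '193', 'RHA801'),
    ('TDF2', '1200', 'RHA280'),
    ('TDF3', '244', 'RHA280'),
    ('TDF3', '252', 'RHA280'),
}


def has_star(name, allele, line):
    """True if the given marker/allele/line cell was starred."""
    return (name, allele, line) in stars


print(has_star('TDF1', '193', 'RHA280'))
print(has_star('TDF1', '181', 'RHA280'))

# Other questions need a scan of the whole set, e.g. how many stars
# RHA280 has within TDF3:
rha280_tdf3 = sum(1 for name, _, line in stars
                  if name == 'TDF3' and line == 'RHA280')
print(rha280_tdf3)
```

The single-cell lookup is a direct membership test, while the per-marker count has to walk every entry, which is the trade-off this kind of key brings.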

If you're always querying based on line first, then something like my
earlier 'results' dict might make sense:

d = {'RHA280':[('TDF1',181), ...], ...}
for name, allele in d['RHA280']:
    if allele == 181:  # or some other "query" within RHA280
        ...

You get the idea: model the data in the way that makes it most useable
to you, and/or most efficient (if this is a large data set).

But note that by picking a structure like this, you're making it easy
to do certain lookups, but possibly harder (and slower) to do ones you
hadn't thought of yet.

The general solution would be to drop it into a relational database and
use SQL queries. Multidimensional analysis is what relational DBs are
for, after all. A hand-written data structure is almost guaranteed to
be more efficient for a given task, but maybe the flexibility of a
relational db would help serve multiple needs, where a custom structure
may only be suitable for a few applications.
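
To make the relational option concrete, here is a minimal sketch using the sqlite3 module from the standard library with an in-memory database, so no server is involved. The table and column names (star, marker, allele, line) are invented for this example, and only a few starred cells from the extract are loaded.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE star (marker TEXT, allele TEXT, line TEXT)')
conn.executemany(
    'INSERT INTO star VALUES (?, ?, ?)',
    [
        ('TDF1', '193', 'RHA280'),
        ('TDF1', '193', 'RHA801'),
        ('TDF2', '1200', 'RHA280'),
        ('TDF3', '244', 'RHA280'),
        ('TDF3', '252', 'RHA280'),
    ],
)

# Stars per marker for one line; other groupings only need a new query.
rows = conn.execute(
    'SELECT marker, COUNT(*) FROM star '
    'WHERE line = ? GROUP BY marker ORDER BY marker',
    ('RHA280',),
).fetchall()
print(rows)
conn.close()
```

Counting by marker, by allele, or by line is then a matter of changing the WHERE and GROUP BY clauses rather than rebuilding the data structure.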

If you're going to roll your own structure, just keep in mind that
dict-lookups are very fast in Python, far more efficient than, e.g.,
checking for membership in a list.
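
A rough way to see the difference is sketched below with the standard timeit module. The data is synthetic (made-up integer keys), and the actual timings depend on the machine, so none are quoted here.

```python
import timeit

n = 10000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list)

target = n - 1  # worst case for the list: the last element

# Both containers give the same answer...
print(target in as_list, target in as_dict)

# ...but the list is scanned element by element, while the dict
# hashes the key and looks it up directly.
list_time = timeit.timeit(lambda: target in as_list, number=200)
dict_time = timeit.timeit(lambda: target in as_dict, number=200)
print('list: %.6f s, dict: %.6f s' % (list_time, dict_time))
```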

Graham
 

Sebastian Bassi

Graham said:
You get the idea: model the data in the way that makes it most useable
to you, and/or most efficient (if this is a large data set).

I don't think this could be called a large dataset (the whole file is about 40 KB).
It would be overkill to load it into MySQL (or any *SQL).
I only need to parse it to reformat it.
May I send the text file and a sample of the needed output to your
email? It seems you understand a lot about this topic and could do
it very easily (I've been trying all day to solve it, without success
:( ).
I know this is not a usual request, but it would help me a lot, and
I would learn from your code (I am still trying to understand the zip
built-in function, which seems useful).
 
