How to store "3D" data? (data structure question)

Discussion in 'Python' started by Sebastian Bassi, Jul 20, 2005.

  1. Hello,

    I have to parse a text file (it was an Excel file, which I converted
    to CSV) like the one below, and I am not sure how to store it (to
    manipulate it later).

    Here is an extract of the data:

    Name,Allele,RHA280,RHA801,RHA373,RHA377,HA383
    TDF1,181,,,,,
    ,188,,,,,
    ,190,,,,,
    ,193,*,*,,,
    ,None,,,*,*,*
    ,,,,,,
    TDF2,1200,*,*,,,*
    ,None,,,*,*,
    ,,,,,,
    TDF3,236,,,,,
    ,240,,,,,
    ,244,*,,*,,*
    ,252,*,*,,,
    ,None,,,,*,
    ,,,,,,

    Should I use lists? Dictionary? Or a combination?
    The final goal is to "count" how many stars (*) each "LINE" has (a
    line is RHA280, for instance).
    RHA280 has 1 star in TDF1, 1 star in TDF2 and 2 stars in TDF3.

    I am lost because I analyze the data "line by line" (for Line in
    FILE), so it is hard to count by column.



    --
    <a href="http://www.spreadfirefox.com/?q=affiliates&id=24672&t=1">The
    web without popups or spyware: use Firefox instead of Internet
    Explorer</a>
    Sebastian Bassi, Jul 20, 2005
    #1

  2. Sebastian Bassi wrote:
    > Hello,
    >
    > I have to parse a text file (it was an Excel file, which I converted
    > to CSV) like the one below, and I am not sure how to store it (to
    > manipulate it later).
    >
    > Here is an extract of the data:
    >

    [snip]

    This looks a lot like 2D data (row/column), not 3D. What's the third
    axis? It looks, too, that you're not really interested in storage, but
    in analysis...

    Since your "line" columns all have names, why not use them as keys in a
    dictionary? The associated values would be lists, in which you could
    keep references to matching rows, or parts of those rows (e.g. name and
    allele). Take the length of each list, and you have your "number of
    matches".



    import csv  # let Python do the grunt work

    with open('name-of-file.csv', newline='') as f:
        reader = csv.reader(f)

        headers = next(reader)  # read the first row
        line_names = headers[2:]

        results = {}  # set up the dict
        for lname in line_names:  # each key is a line-name
            results[lname] = []

        marker = ''  # the Name cell is blank on continuation rows
        for row in reader:  # iterate the data rows
            row_name, allele = row[:2]
            marker = row_name or marker  # carry the last name forward
            line_values = row[2:]  # get the line values.
            # zip is your friend here. It lets you iterate
            # across your line names and corresponding values
            # in parallel.
            for lname, value in zip(line_names, line_values):
                if value == '*':
                    results[lname].append((marker, allele))

    # a quick look at the results.
    for lname, matches in results.items():
        print('%s %d' % (lname, len(matches)))
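
The same idea can be extended to give the per-marker breakdown from the original question (RHA280: 1 star in TDF1, 1 in TDF2, 2 in TDF3). What follows is only a sketch under a few assumptions: it targets Python 3, it embeds the posted sample as a string so that it runs on its own, and the names `SAMPLE` and `count_stars` are made up for illustration.

```python
import csv
import io
from collections import Counter

# Inline copy of the sample from the first post, so the sketch is self-contained.
SAMPLE = """\
Name,Allele,RHA280,RHA801,RHA373,RHA377,HA383
TDF1,181,,,,,
,188,,,,,
,190,,,,,
,193,*,*,,,
,None,,,*,*,*
,,,,,,
TDF2,1200,*,*,,,*
,None,,,*,*,
,,,,,,
TDF3,236,,,,,
,240,,,,,
,244,*,,*,,*
,252,*,*,,,
,None,,,,*,
,,,,,,
"""


def count_stars(text):
    """Return a Counter mapping (line_name, marker) to its number of stars."""
    reader = csv.reader(io.StringIO(text))
    line_names = next(reader)[2:]  # skip the Name and Allele headers
    counts = Counter()
    marker = ''
    for row in reader:
        if not row:  # tolerate completely empty lines
            continue
        marker = row[0] or marker  # blank Name cell: still the previous marker
        for lname, value in zip(line_names, row[2:]):
            if value == '*':
                counts[(lname, marker)] += 1
    return counts


counts = count_stars(SAMPLE)
for (lname, marker), n in sorted(counts.items()):
    print(lname, marker, n)
```

Keying the counter on a (line, marker) tuple keeps the counting code flat; a nested dict would work just as well if you prefer to look up one line at a time.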


    Graham
    Graham Fawcett, Jul 20, 2005
    #2

  3. On 20 Jul 2005 10:47:50 -0700, Graham Fawcett wrote:
    > This looks a lot like 2D data (row/column), not 3D. What's the third
    > axis? It looks, too, that you're not really interested in storage, but
    > in analysis...


    I think of it as 3D, like this:
    1st axis: [MARKER]Name, like TDF1, TDF2.
    2nd axis: Allele, like 181, 188 and so on.
    3rd axis: Line: RHA280, RHA801.

    I can have a star in MarkerName TDF1, Allele 181 and Line RHA280.
    I can have an empty (or none) in TDF1, Allele 181 and Line RHA801.

    What I would like to know is which structure would be suitable for
    handling this data.
    Thank you very much!

    Sebastian Bassi, Jul 20, 2005
    #3
  4. On 20 Jul 2005 10:47:50 -0700, Graham Fawcett wrote:
    > # zip is your friend here. It lets you iterate
    > # across your line names and corresponding values
    > # in parallel.


    This zip function is new to me; the only zip I knew was pkzip :). So
    I will read about it.
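
For reference, a minimal illustration of what zip does with the header names and one data row from the sample. It assumes Python 3, and the variable names are only illustrative.

```python
line_names = ['RHA280', 'RHA801', 'RHA373', 'RHA377', 'HA383']
# The ",193,*,*,,," row with its first two cells (Name, Allele) removed.
line_values = ['*', '*', '', '', '']

# zip pairs items position by position and stops at the shorter input.
pairs = list(zip(line_names, line_values))
print(pairs)

# Keep only the line names whose cell holds a star.
starred = [lname for lname, value in pairs if value == '*']
print(starred)  # ['RHA280', 'RHA801']
```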

    Sebastian Bassi, Jul 20, 2005
    #4
  5. Sebastian Bassi wrote:
    > On 20 Jul 2005 10:47:50 -0700, Graham Fawcett wrote:
    > > This looks a lot like 2D data (row/column), not 3D. What's the third
    > > axis? It looks, too, that you're not really interested in storage, but
    > > in analysis...

    >
    > I think of it as 3D, like this:
    > 1st axis: [MARKER]Name, like TDF1, TDF2.
    > 2nd axis: Allele, like 181, 188 and so on.
    > 3rd axis: Line: RHA280, RHA801.
    >
    > I can have a star in MarkerName TDF1, Allele 181 and Line RHA280.
    > I can have an empty (or none) in TDF1, Allele 181 and Line RHA801.


    Okay. I think what will drive your data-structure question is the way
    that you intend to use the data. Conceptually, it will always be 3D, no
    matter how you model it, but trying to make a "3D data structure" is
    probably not what is most efficient for your application.

    If 90% of your searches are of the type, 'does TDF1/181/RHA280 have a
    star?' then perhaps a dict using (name,allele,line) as a key makes most
    sense:

    d = {('TDF1',181,'RHA280'):'*', ...}
    query = ('TDF1', 181, 'RHA280')
    assert query in d

    Really, you don't need '*' as a value for this, just use None if you
    like, since all the real useful info is in the keyspace of the dict.
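
Since only the keys matter, a plain set of (name, allele, line) tuples is one way to write this down. The sketch below assumes Python 3 and keeps alleles as strings, because that is what csv.reader yields (including the literal 'None'); the names `starred` and `has_star` are made up for illustration.

```python
# Starred combinations taken from the TDF1 and TDF2 rows of the sample data.
starred = {
    ('TDF1', '193', 'RHA280'),
    ('TDF1', '193', 'RHA801'),
    ('TDF1', 'None', 'RHA373'),
    ('TDF1', 'None', 'RHA377'),
    ('TDF1', 'None', 'HA383'),
    ('TDF2', '1200', 'RHA280'),
    ('TDF2', '1200', 'RHA801'),
    ('TDF2', '1200', 'HA383'),
    ('TDF2', 'None', 'RHA373'),
    ('TDF2', 'None', 'RHA377'),
}


def has_star(name, allele, line):
    """Membership test on the key tuple; no stored value is needed."""
    return (name, allele, line) in starred


print(has_star('TDF1', '193', 'RHA280'))  # True
print(has_star('TDF1', '181', 'RHA280'))  # False

# Any other question means scanning the whole set, e.g. all stars for one line.
rha280 = sorted((n, a) for n, a, l in starred if l == 'RHA280')
print(rha280)  # [('TDF1', '193'), ('TDF2', '1200')]
```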

    If you're always querying based on line first, then something like my
    earlier 'results' dict might make sense:

    d = {'RHA280':[('TDF1',181), ...], ...}
    for name, allele in d['RHA280']:
        if allele == 181: # or some other "query" within RHA280
            ...

    You get the idea: model the data in the way that makes it most useable
    to you, and/or most efficient (if this is a large data set).

    But note that by picking a structure like this, you're making it easy
    to do certain lookups, but possibly harder (and slower) to do ones you
    hadn't thought of yet.

    The general solution would be to drop it into a relational database and
    use SQL queries. Multidimensional analysis is what relational DBs are
    for, after all. A hand-written data structure is almost guaranteed to
    be more efficient for a given task, but maybe the flexibility of a
    relational db would help serve multiple needs, where a custom structure
    may only be suitable for a few applications.
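
To give a feel for the relational route without installing a server, here is a sketch that uses the sqlite3 module from the Python 3 standard library with an in-memory database. The table name `stars` and its column names are assumptions made for this example, and the rows are the RHA280 and RHA801 stars from the sample data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway database, nothing written to disk
conn.execute('CREATE TABLE stars (marker TEXT, allele TEXT, line TEXT)')

# One row per star in the spreadsheet.
rows = [
    ('TDF1', '193', 'RHA280'), ('TDF1', '193', 'RHA801'),
    ('TDF2', '1200', 'RHA280'), ('TDF2', '1200', 'RHA801'),
    ('TDF3', '244', 'RHA280'), ('TDF3', '252', 'RHA280'),
    ('TDF3', '252', 'RHA801'),
]
conn.executemany('INSERT INTO stars VALUES (?, ?, ?)', rows)

# Stars per line and marker; changing the GROUP BY answers other questions.
per_marker = conn.execute(
    'SELECT line, marker, COUNT(*) FROM stars '
    'GROUP BY line, marker ORDER BY line, marker'
).fetchall()
for line, marker, n in per_marker:
    print(line, marker, n)
```

The point of the design is that a new question becomes a new query rather than a new data structure.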

    If you're going to roll your own structure, just keep in mind that
    dict-lookups are very fast in Python, far more efficient than, e.g.,
    checking for membership in a list.
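
A rough way to see the difference on your own machine is to time both membership tests. This is a sketch for Python 3 using synthetic keys rather than the real data; the absolute numbers will vary from machine to machine, so none are quoted here.

```python
import timeit

n = 20_000
# Synthetic (name, allele, line) keys, generated only for the timing comparison.
keys = [('TDF%d' % i, str(i), 'RHA280') for i in range(n)]
as_list = keys
as_dict = dict.fromkeys(keys)  # values are None; only the keys matter

probe = keys[-1]  # worst case for the list: the last element

# A list is scanned item by item; a dict hashes the key and jumps to it.
list_time = timeit.timeit(lambda: probe in as_list, number=100)
dict_time = timeit.timeit(lambda: probe in as_dict, number=100)
print('list: %.6f s  dict: %.6f s' % (list_time, dict_time))
```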

    Graham
    Graham Fawcett, Jul 20, 2005
    #5
  6. On 20 Jul 2005 11:51:56 -0700, Graham Fawcett wrote:
    > You get the idea: model the data in the way that makes it most useable
    > to you, and/or most efficient (if this is a large data set).


    I don't think this could be called a large dataset (the whole file is
    about 40 KB). It would be overkill to load it into MySQL (or any *SQL).
    I only need to parse it to reformat it.
    May I send the text file and a sample of the needed output to your
    email? It seems you understand a lot about this topic and could do it
    very easily (I've spent all day trying to solve it without success :( ).
    I know this is not a usual request, but it would help me a lot, and I
    would learn from your code (I am still trying to understand the zip
    built-in function, which seems useful).




    Sebastian Bassi, Jul 20, 2005
    #6