flattening/rolling up/aggregating a large sorted text file

Discussion in 'Python' started by adtvff@yahoo.com, Mar 21, 2007.

  1. adtvff@yahoo.com Guest

    Hi,

    Given a large ascii file (delimited or fixed width) with one ID field
    and dimensions/measures fields, sorted by dimensions, I'd like to
    "flatten" or "rollup" the file by creating new columns: one for each
    combination of dimension level, and summing up measures over all
    records for a given ID.

    If the wheel has already been invented, great, please point me in the
    right direction. If not, please share some pointers on how to think
    about this problem in order to write efficient code.

    Is a hash with dimension level combinations a good approach, with
    values reset at each new ID level?
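
    Roughly, here is the kind of thing I have in mind. This is an untested
    sketch; it assumes records for a given ID are stored contiguously, and
    flush() is just a stand-in for writing out one finished row:

    import sys

    def flush(rec_id, sums):
        # stand-in: emit one rolled-up output row for the finished ID
        print rec_id, sums

    current_id = None
    sums = {}  # maps 'color_shape' -> running total for the current ID
    for line in sys.stdin:
        rec_id, color, shape, msr1 = [f.strip() for f in line.split(',')]
        if rec_id != current_id:
            # new ID: emit the finished group and reset the hash
            if current_id is not None:
                flush(current_id, sums)
            current_id, sums = rec_id, {}
        key = '%s_%s' % (color, shape)
        sums[key] = sums.get(key, 0) + int(msr1)

    if current_id is not None:
        flush(current_id, sums)  # don't forget the last ID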

    I know MySQL, Oracle, etc. will do this, but they all have a cap on the
    number of columns allowed. SAS allows unlimited columns, but I don't own
    SAS.

    Thanks.


    ID,color,shape,msr1
    ------------------------------
    001, blue, square, 4
    001, red , circle, 5
    001, red, circle, 6


    ID, blue_circle, blue_square, red_circle, red_square
    --------------------------------------------------------------------------
    001,0,4,11,0
    002 ...
     
    adtvff@yahoo.com, Mar 21, 2007
    #1

  2. Steve Holden Guest

    adtvff@yahoo.com wrote:
    > [original question quoted in full; snipped]

    It seems a bit wrong-headed to force this problem to fit a solution
    where you define relations with a variable number of columns when the
    natural way to solve it would seem to be to sum the msr1 values for each
    unique combination of ID, color and shape. That's a pretty
    straightforward relational problem.

    So, is there some reason the result *has* to have that variable number
    of columns?
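
    For what it's worth, the vertical version is only a few lines of Python.
    A rough sketch, assuming comma-delimited records on stdin:

    import sys

    totals = {}  # (ID, color, shape) -> summed msr1
    for line in sys.stdin:
        rec_id, color, shape, msr1 = [f.strip() for f in line.split(',')]
        key = (rec_id, color, shape)
        totals[key] = totals.get(key, 0) + int(msr1)

    for (rec_id, color, shape) in sorted(totals):
        print '%s, %s, %s, %d' % (rec_id, color, shape,
                                  totals[(rec_id, color, shape)])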

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC/Ltd http://www.holdenweb.com
    Skype: holdenweb http://del.icio.us/steve.holden
    Recent Ramblings http://holdenweb.blogspot.com
     
    Steve Holden, Mar 21, 2007
    #2

  3. Shane Geiger Guest

    Apparently you want to use this data to know how many blue circles,
    blue squares, red circles and red squares there are. In other words, I
    doubt you want to output redundant data columns; you just want this
    data in a more usable format, without making multiple passes over it.

    This is a fun problem to solve because it uses two very powerful tools:
    csv.DictReader and bitwise categorization.

    Note: your initial data has three records with the same ID. I assumed
    the ID is the unique key, so I changed the data slightly.





    adtvff@yahoo.com wrote:
    > [original question quoted in full; snipped]


    --
    Shane Geiger
    IT Director
    National Council on Economic Education
    | 402-438-8958 | http://www.ncee.net

    Leading the Campaign for Economic and Financial Literacy



    """

    Apparently you want to use this data to know how many blue circles, blue squares, red circles and red squares. In other words, I doubt you want to output redundant data columns, you just want this data in a more usable format and that you don't actually need to do multiple passes over it.

    This is a fun problem to solve because it uses two very powerful tools: cvs.dictreader and bitwise categorization.

    Note: your initial data has three records with the same ID. I assumes the ID is the unique key. So I changed the data slightly.

    ------

    Given a large ascii file (delimited or fixed width) with one ID field
    and dimensions/measures fields, sorted by dimensions, I'd like to
    "flatten" or "rollup" the file by creating new columns: one for each
    combination of dimension level, and summing up measures over all
    records for a given ID.

    If the wheel has already been invented, great, please point me in the
    right direction. If not, please share some pointers on how to think
    about this problem in order to write efficient code.

    Is a hash with dimension level combinations a good approach, with
    values reset at each new ID level?

    I know mysql, Oracle etc will do this , but they all have a cap on #
    of columns allowed. SAS will allow unlimited columns, but I don't own
    SAS.

    Thanks.


    ID,color,shape,msr1
    ------------------------------
    001, blue, square, 4
    001, red , circle, 5
    001, red, circle, 6


    ID, blue_circle, blue_square, red_circle, red_square
    --------------------------------------------------------------------------
    001,0,4,11,0
    002 ...

    """

    import csv
    import os

    ## BITWISE CATEGORIZATION STUFF

    def gNextBit(val=0):
        # generator that yields 1, 2, 4, 8, ... (a fresh bit each time)
        while True:
            yield 2 ** val
            val += 1

    nb = gNextBit()

    categories = ['blue', 'red', 'square', 'circle']

    def bitwise_categorize(items):
        # give each category its own bit flag
        d = {}
        for item in items:
            d[item] = nb.next()
        return d

    categories_dict = bitwise_categorize(categories)

    # print categories_dict  # {'blue': 1, 'red': 2, 'square': 4, 'circle': 8}

    def get_properties(category_int):
        # decode a combined flag back into the category names it contains
        p_list = []
        for k, v in categories_dict.items():
            if category_int & v == v:
                p_list.append(k)
        return p_list

    def list_properties():
        # 2**len(categories) covers every possible combination of flags
        for i in range(2 ** len(categories)):
            print "Properties for something with category_int of", i, get_properties(i)

    # list_properties()


    ### EXAMPLE DATA

    header_fields = ['id', 'color', 'shape', 'msr1']

    example_data = """\
    001, blue, square, 4
    002, red , circle, 5
    003, red, circle, 6
    """

    # write out the example
    def writefile(f, data, perms=0750):
        open(f, 'w').write(data)
        os.chmod(f, perms)

    csv_file = "/Users/shanegeiger/temp.csv"
    writefile(csv_file, example_data)


    ### READING IN THE DATA AND CATEGORIZING IT WITH BITWISE CATEGORIZATION

    reader = csv.DictReader(open(csv_file), fieldnames=header_fields, delimiter=",")
    data = list(reader)

    info = {}

    for rec in data:  # for each record
        color_cat = categories_dict[rec['color'].strip()]  # the record's color bit
        shape_cat = categories_dict[rec['shape'].strip()]  # the record's shape bit
        combined_cat = color_cat | shape_cat               # OR the two bits together

        rec['categories_int'] = combined_cat
        if rec['id']:
            info[rec['id'].strip()] = rec
        else:
            print "Warning: no ID found for record:", str(rec)

    for k, v in info.items():
        print k, "-->", v
        cat_int = v['categories_int']
        print " -- The categories_int variable has value", cat_int, "which means this record has these properties:", get_properties(cat_int)

    print

    print "Just to be clear, here are the bitwise categories:"
    list_properties()
     
    Shane Geiger, Mar 21, 2007
    #3
  4. Eddie Corns Guest

    adtvff@yahoo.com writes:

    > [original question quoted in full; snipped]

    Something like:

    import sys
    from sets import Set

    ids = {}
    keys = Set()

    for line in sys.stdin:
        ID, COL, SHAPE, VAL = [s.strip() for s in line.split(',')]
        ids.setdefault(ID, {})
        key = '%s_%s' % (COL, SHAPE)
        ids[ID].setdefault(key, 0)
        ids[ID][key] += int(VAL)
        keys.add(key)

    print 'id', ',', ','.join([str(key) for key in keys])
    for id, cols in ids.items():
        print id, ',', ', '.join([str(cols.get(k, 0)) for k in keys])


    It doesn't keep all possible keys, just those that are actually used,
    and you'd want to sort() things here and there for a stable output
    order.
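
    For instance, a quick variant of the final two print statements that
    sorts both the column names and the IDs:

    columns = sorted(keys)
    print 'id,' + ','.join(columns)
    for id in sorted(ids):
        print id + ',' + ','.join([str(ids[id].get(k, 0)) for k in columns])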

    Incidentally, I don't think you could do it in SQL in this wide form at
    all, but you could do it in a more vertical fashion, e.g.:

    001, red, circle, 11
    001, blue, square, 4
    002, red, rhombus, 99

    Eddie
     
    Eddie Corns, Mar 22, 2007
    #4