Performance of list.index - how to speed up a silly algorithm?

Discussion in 'Python' started by Laszlo Nagy, Apr 29, 2010.

  1. Laszlo Nagy

    Laszlo Nagy Guest

    I have some ten thousand rows in a table. E.g.

    columns = ["color","size","weight","value"]
    rows = [
        [ "Yellow", "Big", 2, 4 ],
        [ "Blue", "Big", 3, -4 ],
        [ "Blue", "Small", 10, 55 ],
        ...
    ]

    Some columns are dimensions, others are measures. I want to convert this
    to an indexed version that looks like this:

    dimension_names = ["color","size"] # List of dimension names
    dimension_cols = [0,1] # Column indexes for dimension values
    dimension_values = { # Dimension value occurrences for each dimension
        0: ["Yellow","Blue",....],
        1: ["Big","Small",...],
    }
    measure_names = ["weight","value"] # List of measure names
    measure_cols = [2,3] # List of measure columns
    facts = [ # Facts, containing tuples of (dimension_value_indices, measure_values)
        ( (0,0) , (2,4) ),
        ( (1,0) , (3,-4) ),
        ( (1,1) , (10,55) ),
        ...
    ]
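
    For example, the third fact decodes back to the third row:

    # ( (1,1) , (10,55) ) means:
    # color = dimension_values[0][1]  -> "Blue"
    # size  = dimension_values[1][1]  -> "Small"
    # measures (weight, value) = (10, 55)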

    This is how I try to convert to the indexed version:

    #1. Iterate over all rows, and collect possible dimension values.

    cnt = 0
    for row in iterator_factory():
        cnt += 1
        for dimension_idx,col_idx in enumerate(dimension_cols):
            dimension_values[col_idx].append(row[col_idx])
            if cnt % 10000 == 0: # deduplicate periodically, so the lists stay small
                dimension_values[col_idx] = list(set(dimension_values[col_idx]))
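
    (Side note: the periodic list(set(...)) rebuild could be avoided by
    collecting into a set per dimension directly; an untested sketch of that
    variant, using the same iterator_factory:)

    # Sketch: one set per dimension gathers distinct values in a single pass.
    dimension_value_sets = dict((col_idx, set()) for col_idx in dimension_cols)
    for row in iterator_factory():
        for col_idx in dimension_cols:
            dimension_value_sets[col_idx].add(row[col_idx])
    # Fix the value order once, at the end:
    dimension_values = dict(
        (col_idx, sorted(values))
        for col_idx, values in dimension_value_sets.items()
    )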

    #2. Index facts by dimension values

    facts = []
    for row in iterator_factory():
        dv = []
        for dimension_idx,col_idx in enumerate(dimension_cols):
            dv.append( dimension_values[col_idx].index(row[col_idx]) ) # THIS IS THE PROBLEMATIC LINE!
        mv = [ row[col_idx] for col_idx in measure_cols ]
        facts.append( (tuple(dv), tuple(mv)) )

    (In reality, rows and facts are not stored in memory, because there can
    be 100 000+ facts. I did not want to bore you with the full source code.)

    And finally, here is my problem. If a dimension has many possible
    values, then the list.index() code above slows down everything. In my
    test case, the problematic dimension had about 36 000 different values,
    and the number of rows was about 60 000. Calling index() on a list of
    36 000 values, 60 000 times, is too slow.

    Test performance was 262 rows/sec. If I use dv.append(0) instead of
    "dv.append( dimension_values[col_idx].index(row[col_idx]) )", then it
    goes up to 11 520 rows/sec. If I exclude the problematic dimension and
    only leave the others (each with 1000 or fewer values), then it goes up
    to 3000 rows/sec.

    Maybe this should be implemented in C. But I believe that the algorithm
    itself must be wrong (regardless of the language). I really think that
    I'm doing something wrong: my algorithm's processing time does not seem
    to be linear in the number of rows, or even n*log(n). There should be a
    more effective way to do this. But how?
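
    (One idea I am wondering about, untested: build a value -> index dict
    once per dimension, so that each lookup is a hash lookup instead of a
    linear scan. The name value_index below is made up for the sketch:)

    # Build the lookup tables once, after dimension_values is final:
    value_index = {}
    for col_idx in dimension_cols:
        value_index[col_idx] = dict(
            (value, idx)
            for idx, value in enumerate(dimension_values[col_idx])
        )

    # Then step #2 costs one dict lookup per dimension per row:
    facts = []
    for row in iterator_factory():
        dv = [value_index[col_idx][row[col_idx]] for col_idx in dimension_cols]
        mv = [row[col_idx] for col_idx in measure_cols]
        facts.append((tuple(dv), tuple(mv)))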

    I had the idea of sorting the rows by a given dimension (see the sketch
    below). Then it would be obvious how to speed up the indexing part, at
    least for that dimension; sorting all rows would probably be faster than
    calling list.index() for each row. But it does not seem very efficient
    either. What if I have 1 million rows and 10 dimensions? Do I sort
    1 million rows on disk 10 times?
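
    (A sketch of the sorting idea for one dimension column col_idx, assuming
    for the moment that the rows fit in memory:)

    # After sorting by one dimension column, equal values are adjacent,
    # so dense indices can be assigned with a running counter:
    rows_sorted = sorted(iterator_factory(), key=lambda row: row[col_idx])
    prev = object() # sentinel that compares unequal to every real value
    idx = -1
    for row in rows_sorted:
        if row[col_idx] != prev:
            idx += 1
            prev = row[col_idx]
        # idx is now the dense index of row[col_idx] for this dimension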
    Some of you might have run into the same problem before, and can tell me
    which is the most efficient way to do this.

    Thanks,

    Laszlo
    Laszlo Nagy, Apr 29, 2010
    #1

  2. Mensanator

    Mensanator Guest

    On Apr 29, 5:21 pm, Laszlo Nagy <> wrote:
    > I have some ten thousand rows in a table. [...]
    >
    > And finally, here is my problem. If a dimension has many possible
    > values, then the list.index() code above slows down everything.
    >
    > [snip]
    >
    > Some of you might have run into the same problem before, and can tell me
    > which is the most efficient way to do this.


    Have you considered using a SQL database? Personally, I would use
    MS-Access and link it to Python via ODBC. That way, I could use
    the Access drag-and-drop design tools and either

    - copy the SQL code of working query designs to Python

    or

    - use ODBC to link to said queries rather than directly
    to the raw tables

    Of course, Python has SQLite now, I just don't know SQL that well.
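
    (Something like this, I think; a minimal sqlite3 sketch, with the table
    and column names invented for the example:)

    import sqlite3

    conn = sqlite3.connect(":memory:") # a file path works the same way
    conn.execute("CREATE TABLE facts (color TEXT, size TEXT, weight REAL, value REAL)")
    conn.executemany("INSERT INTO facts VALUES (?,?,?,?)",
        [("Yellow", "Big", 2, 4), ("Blue", "Big", 3, -4), ("Blue", "Small", 10, 55)])
    # The database computes the distinct dimension values for you:
    colors = [r[0] for r in conn.execute("SELECT DISTINCT color FROM facts")]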

    Mensanator, Apr 30, 2010
    #2
