Performance of list.index - how to speed up a silly algorithm?

Laszlo Nagy · Apr 29, 2010

I have some ten thousand rows in a table. E.g.

columns = ["color","size","weight","value"]
rows = [
[ "Yellow", "Big", 2, 4 ],
[ "Blue", "Big", 3, -4 ],
[ "Blue", "Small", 10, 55 ],
...
]

Some columns are dimensions, others are measures. I want to convert this
to an indexed version, that looks like this:

dimension_names = ["color","size"] # List of dimension names
dimension_cols = [0,1] # Column indexes for dimension values
dimension_values = { # Dimension value occurences for each dimension
0: ["Yellow","Blue",....],
1: ["Big","Small",...],
}
measure_names = ["weight","value"] # List of measure names
measure_cols = [2,3] # List of measure columns
facts = [ # Facts, containing tuples of (dimension_value_incides,
measure_values )
( (0,0) , (2,4) ),
( (1,0) , (3,-4) ),
( (1,1) , (10,55) ),
...
]

This is how I try to convert to the indexed version:

#1. Iterate over all rows, and collect possible dimension values.

cnt = 0
for row in iterator_factory():
cnt += 1
for dimension_idx,col_idx in enumerate(dimension_cols):
dimension_values[colidx].append(row[cold_idx])
if cnt%10000:
dimension_values[colidx] = list(set(dimension_values[colidx]))

#2. Index facts by dimension values

facts = []
for row in iterator_factory():
dv = []
for dimension_idx,col_idx in enumerate(dimension_cols):
dv.append( dimension_values[col_idx].index(row[col_idx]) ) #
THIS IS THE PROBLEMATIC LINE!
mv = [ row[col_idx] for col_idx in measure_cols ]
facts.append( dv,mv )

(In reality, rows and facts are not stored in memory, because there can
be 100 000+ facts. I did not want to bore you with the full source code.)

And finally, here is my problem. If a dimension has many possible
values, then the list.index() code above slows down eveything. In my
test case, the problematic dimension had about 36 000 different values,
and the number of rows was about 60 000. Calling index() on a list of 36
000 values, times 60 000, is too slow.

Test performance was 262 rows/sec. If I use dv.append(0) instead of "
dv.append( dimension_values[col_idx].index(row[col_idx]) ) " then it
goes up to 11520 rows/sec. If I exclude the problematic dimension, and
only leave the others (with 1000 or less values) then goes up to 3000
rows/sec.

Maybe this should be implemented in C. But I believe that the algorithm
itself must be wrong (regardless of the language). I really think that
I'm doing something wrong. Looks like my algorithm's processing time is
not linear to the number of rows. Not even log(n)*n. There should be a
more effective way to do this. But how?

I had the idea of sorting the rows by a given dimension. Then it would
be obvious to speed up the indexing part - for that dimension. PROBABLY
sorting all rows would be faster than calling list.index() for each row.
But... it does not seem very efficient either. What if I have 1 million
rows and 10 dimensions? Do I sort 1 million rows on the disk 10 times?
Some of you might have ran into the same problem before, and can tell me
which is the most efficient way to do this.

Thanks,

Laszlo

Mensanator · Apr 30, 2010

I have some ten thousand rows in a table. E.g.

columns = ["color","size","weight","value"]
rows = [
[ "Yellow", "Big", 2, 4 ],
[ "Blue", "Big", 3, -4 ],
[ "Blue", "Small", 10, 55 ],
...
]

Some columns are dimensions, others are measures. I want to convert this
to an indexed version, that looks like this:

dimension_names = ["color","size"] # List of dimension names
dimension_cols = [0,1] # Column indexes for dimension values
dimension_values = { # Dimension value occurences for each dimension
0: ["Yellow","Blue",....],
1: ["Big","Small",...],}

measure_names = ["weight","value"] # List of measure names
measure_cols = [2,3] # List of measure columns
facts = [ # Facts, containing tuples of (dimension_value_incides,
measure_values )
( (0,0) , (2,4) ),
( (1,0) , (3,-4) ),
( (1,1) , (10,55) ),
...
]

This is how I try to convert to the indexed version:

#1. Iterate over all rows, and collect possible dimension values.

cnt = 0
for row in iterator_factory():
cnt += 1
for dimension_idx,col_idx in enumerate(dimension_cols):
dimension_values[colidx].append(row[cold_idx])
if cnt%10000:
dimension_values[colidx] = list(set(dimension_values[colidx]))

#2. Index facts by dimension values

facts = []
for row in iterator_factory():
dv = []
for dimension_idx,col_idx in enumerate(dimension_cols):
dv.append( dimension_values[col_idx].index(row[col_idx]) ) #
THIS IS THE PROBLEMATIC LINE!
mv = [ row[col_idx] for col_idx in measure_cols ]
facts.append( dv,mv )

(In reality, rows and facts are not stored in memory, because there can
be 100 000+ facts. I did not want to bore you with the full source code.)

And finally, here is my problem. If a dimension has many possible
values, then the list.index() code above slows down eveything. In my
test case, the problematic dimension had about 36 000 different values,
and the number of rows was about 60 000. Calling index() on a list of 36
000 values, times 60 000, is too slow.

Test performance was 262 rows/sec. If I use dv.append(0) instead of "
dv.append( dimension_values[col_idx].index(row[col_idx]) ) " then it
goes up to 11520 rows/sec. If I exclude the problematic dimension, and
only leave the others (with 1000 or less values) then goes up to 3000
rows/sec.

Maybe this should be implemented in C. But I believe that the algorithm
itself must be wrong (regardless of the language). I really think that
I'm doing something wrong. Looks like my algorithm's processing time is
not linear to the number of rows. Not even log(n)*n. There should be a
more effective way to do this. But how?

I had the idea of sorting the rows by a given dimension. Then it would
be obvious to speed up the indexing part - for that dimension. PROBABLY
sorting all rows would be faster than calling list.index() for each row.
But... it does not seem very efficient either. What if I have 1 million
rows and 10 dimensions? Do I sort 1 million rows on the disk 10 times?
Some of you might have ran into the same problem before, and can tell me
which is the most efficient way to do this.

Thanks,

Have you considered using a SQL database? Personally, I would use
MS-Access and link it to Python via ODBC. That way, I could use
the Access drag-and-drop design tools and either

- copy the SQL code of working query designs to Python

or

- use the ODBC to link to said queries rather than directly
to the raw tables

Of course, Python has SQLlight now, I just don't know SQL that well.

Fastest way to store ints and floats on disk	2	Aug 7, 2008
Iterate through a list of tuples for processing	0	Sep 20, 2013
How to use ufixed when it involves multiplication a number of times?(VHDL question)	0	Aug 22, 2016
What's the best way to extract 2 values from a CSV file from each row systematically?	6	Sep 23, 2013
is there a way to speed up this tablesort code?	4	Jan 3, 2006
OT: Algorithm to speed up vector range search	3	Oct 14, 2003
Use of Python with GDAL. How to speed up ?	8	Mar 17, 2006
flattening/rolling up/aggregating a large sorted text file	3	Mar 21, 2007

Performance of list.index - how to speed up a silly algorithm?

Laszlo Nagy

Mensanator

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads