Re: Performance of list.index - how to speed up a silly algorithm?

Discussion in 'Python' started by MRAB, Apr 29, 2010.

  1. MRAB

    MRAB Guest

    Laszlo Nagy wrote:
    > I have some ten thousand rows in a table. E.g.
    >
    > columns = ["color","size","weight","value"]
    > rows = [
    > [ "Yellow", "Big", 2, 4 ],
    > [ "Blue", "Big", 3, -4 ],
    > [ "Blue", "Small", 10, 55 ],
    > ...
    > ]
    >
    > Some columns are dimensions, others are measures. I want to convert this
    > to an indexed version, that looks like this:
    >
    > dimension_names = ["color","size"] # List of dimension names
    > dimension_cols = [0,1] # Column indexes for dimension values
    > dimension_values = { # Dimension value occurences for each dimension
    > 0: ["Yellow","Blue",....],
    > 1: ["Big","Small",...],
    > }
    > measure_names = ["weight","value"] # List of measure names
    > measure_cols = [2,3] # List of measure columns
    > facts = [ # Facts, containing tuples of (dimension_value_incides,
    > measure_values )
    > ( (0,0) , (2,4) ),
    > ( (1,0) , (3,-4) ),
    > ( (1,1) , (10,55) ),
    > ...
    > ]
    >
    > This is how I try to convert to the indexed version:
    >
    > #1. Iterate over all rows, and collect possible dimension values.
    >
    > cnt = 0
    > for row in iterator_factory():
    > cnt += 1
    > for dimension_idx,col_idx in enumerate(dimension_cols):
    > dimension_values[colidx].append(row[cold_idx])
    > if cnt%10000:
    > dimension_values[colidx] = list(set(dimension_values[colidx]))
    >
    > #2. Index facts by dimension values
    >
    > facts = []
    > for row in iterator_factory():
    > dv = []
    > for dimension_idx,col_idx in enumerate(dimension_cols):
    > dv.append( dimension_values[col_idx].index(row[col_idx]) ) #
    > THIS IS THE PROBLEMATIC LINE!
    > mv = [ row[col_idx] for col_idx in measure_cols ]
    > facts.append( dv,mv )
    >
    > (In reality, rows and facts are not stored in memory, because there can
    > be 100 000+ facts. I did not want to bore you with the full source code.)
    >
    > And finally, here is my problem. If a dimension has many possible
    > values, then the list.index() code above slows down eveything. In my
    > test case, the problematic dimension had about 36 000 different values,
    > and the number of rows was about 60 000. Calling index() on a list of 36
    > 000 values, times 60 000, is too slow.
    >
    > Test performance was 262 rows/sec. If I use dv.append(0) instead of "
    > dv.append( dimension_values[col_idx].index(row[col_idx]) ) " then it
    > goes up to 11520 rows/sec. If I exclude the problematic dimension, and
    > only leave the others (with 1000 or less values) then goes up to 3000
    > rows/sec.
    >
    > Maybe this should be implemented in C. But I believe that the algorithm
    > itself must be wrong (regardless of the language). I really think that
    > I'm doing something wrong. Looks like my algorithm's processing time is
    > not linear to the number of rows. Not even log(n)*n. There should be a
    > more effective way to do this. But how?
    >
    > I had the idea of sorting the rows by a given dimension. Then it would
    > be obvious to speed up the indexing part - for that dimension. PROBABLY
    > sorting all rows would be faster than calling list.index() for each row.
    > But... it does not seem very efficient either. What if I have 1 million
    > rows and 10 dimensions? Do I sort 1 million rows on the disk 10 times?
    > Some of you might have ran into the same problem before, and can tell me
    > which is the most efficient way to do this.
    >

    The .index method does a linear search, checking on average 1/2 of the
    items in the list. That's why it's so slow.

    In order to avoid that you could build a dict of each value in
    dimension_values[col_idx] and its index in a single pass so that it
    becomes a quick lookup.
     
    MRAB, Apr 29, 2010
    #1
    1. Advertising

  2. MRAB

    Peter Otten Guest

    MRAB wrote:

    > The .index method does a linear search, checking on average 1/2 of the
    > items in the list. That's why it's so slow.
    >
    > In order to avoid that you could build a dict of each value in
    > dimension_values[col_idx] and its index in a single pass so that it
    > becomes a quick lookup.


    For example:

    from itertools import count, izip

    def iterator_factory():
    # columns = ["color", "size", "weight", "value"]
    rows = [
    ["Yellow", "Big", 2, 4],
    ["Blue", "Big", 3, -4],
    ["Blue", "Small", 10, 55],
    #...
    ]

    return iter(rows)

    class Indexer(object):
    def __init__(self):
    self.lookup = {}
    self.counter = count()
    def __call__(self, value):
    try:
    return self.lookup[value]
    except KeyError:
    result = self.lookup[value] = next(self.counter)
    return result
    def to_list(self):
    d = self.lookup
    reverse = dict(izip(d.itervalues(), d.iterkeys()))
    assert len(reverse) == len(d)
    return [reverse for i in xrange(len(reverse))]


    def pairs(rows, dimension_cols, measure_cols, indexers):
    for row in rows:
    dv = [indexer(row[col])
    for indexer, col in izip(indexers, dimension_cols)]
    mv = [row[col] for col in measure_cols]
    yield dv, mv

    def main():
    # dimension_names = ["color", "size"]
    dimension_cols = [0, 1]

    # measure_names = ["weight", "value"]
    measure_cols = [2, 3]

    indexers = [Indexer() for _ in dimension_cols]

    facts = pairs(iterator_factory(),
    dimension_cols, measure_cols, indexers)
    facts = list(facts)

    print facts
    for i, indexer in enumerate(indexers):
    print "%d: %s" % (i, indexer.to_list())

    if __name__ == "__main__":
    main()
     
    Peter Otten, Apr 30, 2010
    #2
    1. Advertising

  3. MRAB

    Laszlo Nagy Guest


    >> The .index method does a linear search, checking on average 1/2 of the
    >> items in the list. That's why it's so slow.
    >>
    >> In order to avoid that you could build a dict of each value in
    >> dimension_values[col_idx] and its index in a single pass so that it
    >> becomes a quick lookup.
    >>

    >
    > For example:
    >
    > from itertools import count, izip
    >
    >

    ....
    > def main():
    > # dimension_names = ["color", "size"]
    > dimension_cols = [0, 1]
    >
    > # measure_names = ["weight", "value"]
    > measure_cols = [2, 3]
    >
    > indexers = [Indexer() for _ in dimension_cols]
    >
    > facts = pairs(iterator_factory(),
    > dimension_cols, measure_cols, indexers)
    > facts = list(facts)
    >
    > print facts
    > for i, indexer in enumerate(indexers):
    > print "%d: %s" % (i, indexer.to_list())
    >
    > if __name__ == "__main__":
    > main()
    >

    Whew! :) Thank you for taking the time. I'm not very familiar to
    itertools yet, so I need some time to understand this *beautiful* code. :)

    L
     
    Laszlo Nagy, Apr 30, 2010
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. mk

    (silly?) speed comparisons

    mk, Jul 8, 2008, in forum: Python
    Replies:
    1
    Views:
    272
    Peter Otten
    Jul 9, 2008
  2. Laszlo Nagy
    Replies:
    1
    Views:
    672
    Mensanator
    Apr 30, 2010
  3. ngoc
    Replies:
    5
    Views:
    199
    Tad McClellan
    May 11, 2006
  4. Tomasz Chmielewski

    sorting index-15, index-9, index-110 "the human way"?

    Tomasz Chmielewski, Mar 4, 2008, in forum: Perl Misc
    Replies:
    4
    Views:
    360
    Tomasz Chmielewski
    Mar 4, 2008
  5. Onyxx
    Replies:
    1
    Views:
    99
    Dave Angel
    Jun 13, 2013
Loading...

Share This Page