For loop searching takes too long!

elsa

Hi guys,

I've got a problem with my program, in that the code just takes too
long to run. Here's what I'm doing. If anyone has any tips, they'd be
much appreciated!

So, say I have a list of lists that looks something like this (I'm
using a list of lists, rather than a list of tuples, as I need it to
be mutable):

myList = [[786,0],[45, 1],[673,1],...................[23,46]]

there are enough entries in the outer list that the sum of myList[i][0]
across all i could be as high as 10^7.

Now, what I need to do is randomly choose one myList[i], however the
distribution of my random choice needs to be proportional to the
values of myList[i][0]. So, for the list above, I'd have a much
higher chance of choosing myList[0] than myList[1].

Here is how I'm doing it at the moment:

import random

def chooseI(myList):
    mySum = 0
    choice = random.choice(range(1, sum([i[0] for i in myList]) + 1))
    for i in range(len(myList)):
        mySum += myList[i][0]
        if mySum >= choice:
            return i
            break

This works just fine if sum([i[0] for i in myList]) is less than
10,000, say. However, if it's up around 10^7, the whole thing crashes.
Is there a faster way of doing this, that doesn't involve as many
computational steps?

Thanks!

elsa
 
Peter Otten

elsa said:
[...]

myList = [[786,0],[45, 1],[673,1],...................[23,46]]

[...]

Is there a faster way of doing this, that doesn't involve as many
computational steps?


Do the first items of the inner lists change often? If not, you can put
the running sum into them, i.e.

myList = [[786, ...], [786+45, ...], [786+45+673, ...], ...]

and use bisect.bisect() to choose the item.
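
A minimal sketch of that approach (untested, and the helper names are
mine):

import bisect
import random

def build_cumulative(myList):
    # Do this once, or whenever the first items change.
    cumulative = []
    total = 0
    for item in myList:
        total += item[0]
        cumulative.append(total)
    return cumulative, total

def choose_index(cumulative, total):
    # Pick a 0-based point in [0, total); bisect() returns how many
    # running sums are <= that point, which is the index we want.
    r = random.randrange(total)
    return bisect.bisect(cumulative, r)

Building the running sums is O(n) once; each draw after that costs
O(log n) instead of a full scan.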

Peter
 
John Posner

[...]

choice = random.choice(range(1, sum([i[0] for i in myList]) + 1))

[...]

        if mySum >= choice:
            return i
            break

[...]


This way of generating *choice* might be less resource-intensive:

list_total = 0
for x in myList:
    list_total += x[0]
choice = random.randint(1, list_total)

This avoids creating another list (the list comprehension) as big as
myList, and randint() sidesteps building the huge range list as well.

BTW, I don't think you need to *break* after you've already *return*ed.

-John
 
Steven D'Aprano

Now, what I need to do is randomly choose one myList[i], however the
distribution of my random choice needs to be proportional to the values
of myList[i][0]. So, for the list above, I'd have a much higher chance
of choosing myList[0] than myList[1].

Here is how I'm doing it at the moment:

def chooseI(myList):
    mySum = 0
    choice = random.choice(range(1, sum([i[0] for i in myList]) + 1))
    for i in range(len(myList)):
        mySum += myList[i][0]
        if mySum >= choice:
            return i
            break


This isn't related to your problem, but you don't need the break after
the return -- the return will leave the loop and the function, and the
break will never be reached.

You could probably speed the code up a little by changing all the calls
to range into xrange. range actually generates a list of integers, which
is time consuming, while xrange is a lazy generator which just produces
each integer one at a time. (You can ignore this if you are using Python
3.0 or 3.1.)

Another small optimization you could use is to use a generator expression
instead of a list comprehension inside the call to sum. That should
generate the sum lazily, instead of calculating a giant list first and
then adding it up.
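
For instance, on Python 2.x the expensive line could become (just a
sketch of the substitution, untested):

choice = random.choice(xrange(1, sum(i[0] for i in myList) + 1))

and the loop header "for i in range(len(myList)):" could use xrange the
same way.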

But I'd try something like this:

# Untested.
def chooseI(myList):
    # Note: this returns the chosen inner list itself, not its index.
    total = sum(i[0] for i in myList)
    mySum = 0
    choice = random.randrange(total)  # 0 <= choice < total
    for i in myList:
        mySum += i[0]
        if mySum > choice:  # strict, since choice is zero-based
            return i
 
Dave Angel

elsa said:
[...]

myList = [[786,0],[45, 1],[673,1],...................[23,46]]

there are enough entries in the outer list that the sum of myList[i][0]
across all i could be as high as 10^7.

[...]

This works just fine if sum([i[0] for i in myList]) is less than
10,000, say. However, if it's up around 10^7, the whole thing crashes.

"the whole thing crashes" -- give us the specific crash you see,
including the stack trace. "Crash" is far too vague.

At first glance you could be running out of memory. But not with the
values you're quoting. If a typical average value for i[0] is 50,
that's only 200k elements. Anyway, you could print out the len(myList)
to see how big the list is.

There are quicker ways to do the sum, but the first thing to change is
the random.choice() stuff. Once you have a sum, you can use
random.randint() on it directly. No need to make a range list.

Another approach might be to build a new list with a running sum in it.
Then after doing the randint() on the last (highest) item, you can do a
bisect.bisect on that list. The index that returns will be your return
value.

DaveA
 
Steve Holden

Dave said:
[...]

Another approach might be to build a new list with a running sum in it.
Then after doing the randint() on the last (highest) item, you can do a
bisect.bisect on that list. The index that returns will be your return
value.

Your approach seems to validate an informal impression I had that with
N choices one ought to be able to build a binary tree where each
decision goes left or right according to whether a number drawn from a
uniform [0,1) distribution is greater than some threshold or not.

There are then arithmetical questions of exactly how to arrive at the
correct threshold values for each bifurcation. If the most probable
paths are deliberately placed at the top of the tree then the most
frequent cases will be fastest to generate, also.
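
One possible shape for such a tree, splitting each index range at the
half-weight point (my own rough sketch, untested, not a worked-out
design):

import random

class Node(object):
    def __init__(self, threshold=None, left=None, right=None, index=None):
        self.threshold = threshold  # probability of descending left
        self.left = left
        self.right = right
        self.index = index          # set only on leaves

def build_tree(weights, lo=0, hi=None):
    # Split [lo, hi) so each side carries roughly half the weight; the
    # threshold is the left side's share of this node's total weight.
    if hi is None:
        hi = len(weights)
    if hi - lo == 1:
        return Node(index=lo)
    total = float(sum(weights[lo:hi]))
    acc, mid = 0, lo
    while mid < hi - 1 and (acc + weights[mid]) * 2 <= total:
        acc += weights[mid]
        mid += 1
    if mid == lo:  # make sure both sides are non-empty
        acc, mid = weights[lo], lo + 1
    return Node(acc / total,
                build_tree(weights, lo, mid),
                build_tree(weights, mid, hi))

def choose(node):
    # Walk down, drawing one uniform [0,1) number per bifurcation.
    while node.index is None:
        node = node.left if random.random() < node.threshold else node.right
    return node.index

Splitting by weight rather than by item count keeps the heaviest items
near the root, so the likeliest choices take the fewest draws.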

regards
Steve
 
Steve Holden

Steve said:
[...]

There are then arithmetical questions of exactly how to arrive at the
correct threshold values for each bifurcation. If the most probable
paths are deliberately placed at the top of the tree then the most
frequent cases will be fastest to generate, also.

Bad form, I know, to reply to oneself, but since I was egotistical
enough to reread what I wrote when it was published, I must also
record the conjecture that the resulting alphabet, expressed as the
binary history of the bifurcations chosen for any given symbol, must
surely be a Huffman code to be most efficient.

No proof is adduced.

regards
Steve
 
Dave Angel

Steve said:
[...]

Your approach seems to validate an informal impression I had that with
N choices one ought to be able to build a binary tree where each
decision goes left or right according to whether a number drawn from a
uniform [0,1) distribution is greater than some threshold or not.

There are then arithmetical questions of exactly how to arrive at the
correct threshold values for each bifurcation. If the most probable
paths are deliberately placed at the top of the tree then the most
frequent cases will be fastest to generate, also.

If the list will remain the same, it may be worth pre-computing the
running sum into it, and then the function to get a random value just
has to do the bisect. But as you say, if the values are not evenly
distributed, a binary split (which is what bisect does) might not give
the best average time.

I doubt if it would help for small lists (Python code would be slower
than the presumably C code in bisect), but perhaps for larger lists, an
index something like this would be better:

Assuming the list had 40 items and the sum was 1000, for the sake of
example, you'd build a tree with 500 as the top node. That would
"point" not to index 20, but to whatever index had a running sum
closest to 500. Then you'd build child nodes that similarly split each
portion.
And you'd repeat it till the remaining size was fairly small, at which
point you'd use standard bisect logic.

On a different tack, someone could probably improve on the bisect()
function, for cases where the list is a sorted list of numbers.
Typically a binary search is used because a comparison function gives
either less than, equal, or greater. But if you know they are all
well-behaved numbers, you could do a formula to narrow down the choices
faster.

Let's say that at some point you know the value is between items[5] and
items[105], and the values respectively are 40 and 140. You're
searching for a value of 50. Instead of trying items[55], you do a
scale factor, and do 5 + ( 105-5 ) * (50-40) / (140-40). So you'd try
element 15 next.

All these values are plus/minus 1, but I think I've come close to an
algorithm. Kind of an analogy to Newton's method.
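
Here's a rough sketch of that idea -- essentially an interpolation
search over a list of running sums (untested; the function name and the
clamping details are my own):

def interp_search(sums, target):
    # Return the first index i with sums[i] >= target, assuming sums
    # is non-decreasing and 1 <= target <= sums[-1].
    lo, hi = 0, len(sums) - 1
    while lo < hi:
        span = sums[hi] - sums[lo]
        if span == 0:
            break  # everything in [lo, hi] is equal; lo is the answer
        # Guess a position by proportion instead of halving the range.
        guess = lo + (hi - lo) * (target - sums[lo]) // span
        guess = min(max(guess, lo), hi - 1)
        if sums[guess] < target:
            lo = guess + 1
        else:
            hi = guess
    return lo

When the running sums grow roughly evenly this homes in very quickly,
though badly skewed data can make it slower than plain bisection.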

DaveA
 
Jonathan Gardner

I've got a problem with my program, in that the code just takes too
long to run. Here's what I'm doing. If anyone has any tips, they'd be
much appreciated!

First of all, don't play with large lists. Large lists have a tendency
to grow larger over time, until they get absurdly large.

If you're dealing with a long list, work with it like you'd work with
a file. If you need random access, stick it in some form of database,
such as BDB or PostgreSQL. If you need an index, stick it in some form
of DB. Eventually, large lists end up like that anyway. Don't fool
yourself early on---prepare for the worst and solve the problems of
tomorrow today.
So, say I have a list of lists that looks something like this (I'm
using a list of lists, rather than a list of tuples, as I need it to
be mutable):

myList = [[786,0],[45, 1],[673,1],...................[23,46]]

there are enough entries in the outer list that the sum of myList[i][0]
across all i could be as high as 10^7.

Now, what I need to do is randomly choose one myList[i], however the
distribution of my random choice needs to be proportional to the
values of myList[i][0]. So, for the list above, I'd have a much
higher chance of choosing myList[0] than myList[1].


Let's do some thinking here.

Let's say you have a list of items and you need to choose one uniformly
at random, but you don't know in advance how many there are.

One algorithm that works is you start with the first item. If that's
the only item, you've chosen it.

If there's another item, you roll the dice. There's a 1/2 chance you
drop the first item and take the second.

If there's a third item, you roll the dice. There's a 1/3 chance you
drop the first or second item and take the third.

You see the pattern? Keep the nth item 1/n of the time.

In your case, each entry effectively stands for a block of identical
items, one per unit of weight. Your chance of keeping the next entry is
(weight of the next entry)/(total weight seen so far, including the
next entry). Do the math for this. It's really simple.

Let's walk through this:

[[786,0],[45, 1],[673,2],...................[23,46]]

Step 1: See [786,0]. Remember 786. Keep 0.

Step 2: See [45,1]. Remember 786+45=831. Keep 1 45/831 of the time.

Step 3: See [673,2]. Remember 831+673=1504. Keep 2 673/1504 of the
time.

Etc...

Now, the algorithm I've just described is much less memory intensive
and can deal with very long lists of numbers. However, you are calling
rand() a lot. rand() is not a cheap operation. One optimization is to
roll rand() once, and keep using that number over and over again.
Granted, for very large numbers of items, you're going to need a very
precise random number, or some very small probabilities will never be
chosen. This is left as an exercise for the reader.
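
In code, the scheme above might look like this (untested; it's weighted
reservoir sampling with a sample size of one, and the function name is
mine):

import random

def choose_weighted(myList):
    # Single pass: keep entry i with probability weight_i / total_so_far,
    # which leaves each entry chosen with probability weight_i / total.
    total = 0
    chosen = None
    for index, item in enumerate(myList):
        weight = item[0]
        total += weight
        if weight and random.randint(1, total) <= weight:
            chosen = index
    return chosen

Memory use stays flat however long the list grows, at the cost of one
call to the random number generator per entry.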
 
Steven D'Aprano

First of all, don't play with large lists. Large lists have a tendency
to grow larger over time, until they get absurdly large.

If you're dealing with a long list, work with it like you'd work with a
file. If you need random access, stick it in some form of database, such
as BDB or PostgreSQL. If you need an index, stick it in some form of DB.
Eventually, large lists end up like that anyway. Don't fool yourself
early on---prepare for the worst and solve the problems of tomorrow
today.


Talk about premature optimization. The OP probably has a few hundreds of
thousands of items, millions at most, which is trivial on a modern PC.
Your way of thinking is what has led to Firefox using a database to
manage a few thousand bookmarks and cookies, which in turn leads to
consistent data corruption problems. (I've been keeping note, and so far
my installation of Firefox 3 has had database corruption at startup one
time out of five.)
 
Jonathan Gardner

Talk about premature optimization. The OP probably has a few hundreds of
thousands of items, millions at most, which is trivial on a modern PC.
Your way of thinking is what has led to Firefox using a database to
manage a few thousand bookmarks and cookies, which in turn leads to
consistent data corruption problems. (I've been keeping note, and so far
my installation of Firefox 3 has had database corruption at startup one
time out of five.)

I think there's a difference between writing your code so that you
have the option of using a database, or a flat file, or any other type
of serial data source, and actually using a database (or whatever).

I prefer to write my code so that I don't have to re-write it 6 months
later.

Having worked with datasets that are truly enormous, whenever I see a
list that is unbounded (as this appears to be), I freak out because
unbounded can be a really, really big number. Might as well write
software that works for the truly enormous case and the really large
case and be done with it once and for all.
 
Steven D'Aprano

Having worked with datasets that are truly enormous, whenever I see a
list that is unbounded (as this appears to be), I freak out because
unbounded can be a really, really big number. Might as well write
software that works for the truly enormous case and the really large
case and be done with it once and for all.


Then write it to work on iterators, if at all possible, and you won't
care how large the data stream is. Even databases have limits!

Of course you should write your code to deal with your expected data, but
it is possible to commit overkill as well as underkill. There are trade-
offs between development effort, runtime speed, memory usage, and size of
data you can deal with, and maximizing the final one is not always the
best strategy.

Look at it this way... for most applications, you probably don't use
arbitrary precision floats even though floating point numbers can have an
unbounded number of decimal places. But in practice, you almost never
need more than ten decimal places of pi, let alone an infinite number. If the
creators of floating point libraries wrote software "that works for the
truly enormous case", the overhead would be far too high for nearly all
applications.

Choose your algorithm according to your expected need. If you under-
estimate your need, you will write slow code, and if you over-estimate
it, you'll write slow code too. Getting that balance right is why they
pay software architects the big bucks *wink*
 
