extremely slow array indexing?

Grace Fang

Hi,

I am writing code to sort the columns of a dataset according to the
sum of each column. The dataset is huge (50k rows x 300k cols), so I
need to read it line by line and do the summation incrementally to
avoid running out of memory. But it runs very slowly, and I don't know
why; part of the code is as follows. I suspect the array indexing, but
I'm not sure. Can anyone point out what needs to be modified to make
it run fast? Thanks in advance!

....
from numpy import *
....

currSum = zeros(self.componentcount)
currRow = zeros(self.componentcount)
for featureDict in self.featureDictList:
    currRow[:] = 0
    for components in self.componentdict1:
        if featureDict.has_key(components):
            col = self.componentdict1[components]
            value = featureDict[components]
            currRow[col] = value
    currSum = currSum + row
....
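Since each featureDict apparently holds only the nonzero entries of its
row, the inner loop over all of componentdict1 (300k names for every
row) can be avoided by iterating over the row's own keys and
accumulating straight into currSum. A minimal sketch, assuming
componentdict1 maps component names to column indices and featureDict
maps names to values (both inferred from the snippet above):

import numpy as np

def column_sums(featureDictList, componentdict1, componentcount):
    # Sum each column across all rows, where each row is a sparse
    # dict of {component name: value}.
    currSum = np.zeros(componentcount)
    for featureDict in featureDictList:
        # Visit only the entries present in this row, rather than
        # all 300k components for every row.
        for name, value in featureDict.items():
            col = componentdict1.get(name)
            if col is not None:
                currSum[col] += value
    return currSum

Sorting the columns by their sums is then a single np.argsort(currSum)
at the end.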
 

Will McGugan

Grace said:
Hi,

I am writing code to sort the columns of a dataset according to the
sum of each column. The dataset is huge (50k rows x 300k cols), so I
need to read it line by line and do the summation incrementally to
avoid running out of memory. But it runs very slowly, and I don't know
why; part of the code is as follows. I suspect the array indexing, but
I'm not sure. Can anyone point out what needs to be modified to make
it run fast? Thanks in advance!

Array indexing is unlikely to be the culprit. Could it not just be
slow because you are processing a lot of data? With numbers that big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan
 

Grace Fang

Hi Will, thanks for your reply. The simplified code is as follows, and
you can run it if you like. It takes 7 seconds to process 1000 rows,
which is tolerable, but I wonder why it takes so long, because I also
ran a plain for loop over the same rows without accessing the array,
and that takes only 1 second per 1000 rows. Aren't vectorized
operations supposed to run very quickly?

from numpy import *
componentcount = 300000
currSum = zeros(componentcount)
row = zeros(componentcount)  # current row
rowcount = 50000
for i in range(1, rowcount):
    row[:] = 1
    currSum = currSum + row
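One way to see where the seven seconds go is to time the two statements
of the loop body separately with the standard timeit module. A sketch
(absolute numbers will differ with machine and numpy version):

import timeit

setup = """
import numpy as np
componentcount = 300000
currSum = np.zeros(componentcount)
row = np.zeros(componentcount)
"""

# Time 1000 iterations of each statement from the loop body.
fill = timeit.timeit("row[:] = 1", setup=setup, number=1000)
add = timeit.timeit("currSum = currSum + row", setup=setup, number=1000)
print("row[:] = 1             : %.2f s per 1000 rows" % fill)
print("currSum = currSum + row: %.2f s per 1000 rows" % add)

This separates the cost of filling the row from the cost of the
allocating add.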
 

John Machin

Will said:
Array indexing is unlikely to be the culprit. Could it not just be
slow because you are processing a lot of data? With numbers that big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan

Hi Grace,
What Will McGugan said, plus:
1. Post *much* more of your code e.g. all relevant parts :)
2. Explain "featureDict" and "componentdict1"; note that you seem to
be doing more dictionary accessing than array indexing (see the sketch
after this list).
3. Tell us what "row" (not mentioned elsewhere) is in the last line of
your code snippet. Should it be "currRow"? For your sake and ours,
copy/paste your code; don't re-type it.
4. Tell us what version of Python [why are you using dict.has_key??],
what platform, how much memory.
5. Tell us what "very slow" means e.g. how many rows per second.
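To put a number on point 2: the inner loop in the original snippet does
one dictionary membership test per component, i.e. ~300k per row, even
if the row holds only a few hundred entries. A rough comparison, with
hypothetical sizes chosen to mimic the post (not the OP's actual data):

import time
import numpy as np

componentcount = 300000
componentdict1 = dict(("c%d" % i, i) for i in range(componentcount))
# A sparse row: ~300 nonzero entries out of 300k columns.
featureDict = dict(("c%d" % i, 1.0)
                   for i in range(0, componentcount, 1000))
currRow = np.zeros(componentcount)

t0 = time.time()
for components in componentdict1:         # 300k membership tests per row
    if components in featureDict:
        currRow[componentdict1[components]] = featureDict[components]
t1 = time.time()

currRow[:] = 0
for name, value in featureDict.items():   # ~300 lookups per row
    currRow[componentdict1[name]] = value
t2 = time.time()

print("scan every component: %.4f s per row" % (t1 - t0))
print("scan the row's keys : %.4f s per row" % (t2 - t1))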

HTH,
John
 

Robert Kern

John said:
Hi Grace,
What Will McGugan said, plus:
1. Post *much* more of your code e.g. all relevant parts :)

Note that Grace has also posted this to numpy-discussion and with prompting
provided the following snippet as a distillation of the key slow part:


from numpy import *

componentcount = 300000
currSum = zeros(componentcount)
row = zeros(componentcount)  # current row
rowcount = 50000
for i in range(1, rowcount):
    row[:] = 1
    currSum = currSum + row


As it is, the OP gets through 1000 rows every 7 seconds or so on their machine,
and I get about the same on mine.

Changing the last line to "currSum += row" gets a 3x speedup. Dropping
the "row[:] = 1" line, as it's really just a time-consuming no-op in
this example and probably not an accurate reflection of what's going on
in the real code, gets you another 2x speedup.
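A compact harness for checking the variants described above, in case
anyone wants to reproduce the numbers (a sketch; exact speedups will
vary with hardware and numpy version):

import time
import numpy as np

componentcount = 300000
rowcount = 1000  # enough iterations to measure

def variant(inplace, fill):
    currSum = np.zeros(componentcount)
    row = np.zeros(componentcount)
    t0 = time.time()
    for i in range(rowcount):
        if fill:
            row[:] = 1
        if inplace:
            currSum += row           # in-place add, no temporary
        else:
            currSum = currSum + row  # allocates a new 300k array each pass
    return time.time() - t0

print("copy + fill   : %.2f s" % variant(inplace=False, fill=True))
print("inplace + fill: %.2f s" % variant(inplace=True, fill=True))
print("inplace only  : %.2f s" % variant(inplace=True, fill=False))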

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
