extremely slow array indexing?

Grace Fang

Hi,

I am writing code to sort the columns of a dataset according to the
sum of each column. The dataset is huge (50k rows x 300k cols), so I
need to read it line by line and do the summation incrementally to
avoid running out of memory. But it runs very slowly, and I don't know
why; part of the code is as follows. I suspect the array indexing, but
I'm not sure. Can anyone point out what needs to be modified to make
it run fast? Thanks in advance!

....
from numpy import *
....

currSum = zeros(self.componentcount)
currRow = zeros(self.componentcount)
for featureDict in self.featureDictList:
    currRow[:] = 0
    for components in self.componentdict1:
        if featureDict.has_key(components):
            col = self.componentdict1[components]
            value = featureDict[components]
            currRow[col] = value
    currSum = currSum + row
....
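Since each featureDict apparently holds only the nonzero entries of its
row, the inner loop over all of componentdict1 (300k names for every
row) can be avoided by iterating over the row's own keys and
accumulating straight into currSum. A minimal sketch, assuming
componentdict1 maps component names to column indices and featureDict
maps names to values (both inferred from the snippet above):

import numpy as np

def column_sums(featureDictList, componentdict1, componentcount):
    # Sum each column across all rows, where each row is a sparse
    # dict of {component name: value}.
    currSum = np.zeros(componentcount)
    for featureDict in featureDictList:
        # Visit only the entries present in this row, rather than
        # all 300k components for every row.
        for name, value in featureDict.items():
            col = componentdict1.get(name)
            if col is not None:
                currSum[col] += value
    return currSum

Sorting the columns by their sums is then a single np.argsort(currSum)
at the end.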
 

Will McGugan

Grace said:
Hi,

I am writing code to sort the columns of a dataset according to the
sum of each column. The dataset is huge (50k rows x 300k cols), so I
need to read it line by line and do the summation incrementally to
avoid running out of memory. But it runs very slowly, and I don't know
why; part of the code is as follows. I suspect the array indexing, but
I'm not sure. Can anyone point out what needs to be modified to make
it run fast? Thanks in advance!

Array indexing is unlikely to be the culprit. Could it not just be
slow because you are processing a lot of data? With numbers that big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan
 

Grace Fang

Hi Will, thanks for your reply. The simplified code is as follows, and
you can run it if you like. It takes 7 seconds to process 1000 rows,
which is tolerable, but I wonder why it takes so long, because I also
ran a plain for loop over the same rows without accessing the array,
and that takes only 1 second per 1000 rows. Aren't vectorized
operations supposed to run very quickly?

from numpy import *
componentcount = 300000
currSum = zeros(componentcount)
row = zeros(componentcount)  # current row
rowcount = 50000
for i in range(1, rowcount):
    row[:] = 1
    currSum = currSum + row
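One way to see where the seven seconds go is to time the two statements
of the loop body separately with the standard timeit module. A sketch
(absolute numbers will differ with machine and numpy version):

import timeit

setup = """
import numpy as np
componentcount = 300000
currSum = np.zeros(componentcount)
row = np.zeros(componentcount)
"""

# Time 1000 iterations of each statement from the loop body.
fill = timeit.timeit("row[:] = 1", setup=setup, number=1000)
add = timeit.timeit("currSum = currSum + row", setup=setup, number=1000)
print("row[:] = 1             : %.2f s per 1000 rows" % fill)
print("currSum = currSum + row: %.2f s per 1000 rows" % add)

This separates the cost of filling the row from the cost of the
allocating add.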
 

John Machin

Will said:
Array indexing is unlikely to be the culprit. Could it not just be
slow because you are processing a lot of data? With numbers that big
I would expect to have enough time to go make a coffee, then drink it.

If you think it is slower than it could be, post more code for
optimization advice...

Will McGugan

Hi Grace,
What Will McGugan said, plus:
1. Post *much* more of your code e.g. all relevant parts :)
2. Explain "featureDict" and "componentdict1"; note that you seem to
be doing more dictionary accessing than array indexing (see the sketch
after this list).
3. Tell us what "row" (not mentioned elsewhere) is in the last line of
your code snippet. Should it be "currRow"? For your sake and ours,
copy/paste your code; don't re-type it.
4. Tell us what version of Python [why are you using dict.has_key??],
what platform, how much memory.
5. Tell us what "very slow" means e.g. how many rows per second.
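To put a number on point 2: the inner loop in the original snippet does
one dictionary membership test per component, i.e. ~300k per row, even
if the row holds only a few hundred entries. A rough comparison, with
hypothetical sizes chosen to mimic the post (not the OP's actual data):

import time
import numpy as np

componentcount = 300000
componentdict1 = dict(("c%d" % i, i) for i in range(componentcount))
# A sparse row: ~300 nonzero entries out of 300k columns.
featureDict = dict(("c%d" % i, 1.0)
                   for i in range(0, componentcount, 1000))
currRow = np.zeros(componentcount)

t0 = time.time()
for components in componentdict1:         # 300k membership tests per row
    if components in featureDict:
        currRow[componentdict1[components]] = featureDict[components]
t1 = time.time()

currRow[:] = 0
for name, value in featureDict.items():   # ~300 lookups per row
    currRow[componentdict1[name]] = value
t2 = time.time()

print("scan every component: %.4f s per row" % (t1 - t0))
print("scan the row's keys : %.4f s per row" % (t2 - t1))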

HTH,
John
 

Robert Kern

John said:
Hi Grace,
What Will McGugan said, plus:
1. Post *much* more of your code e.g. all relevant parts :)

Note that Grace has also posted this to numpy-discussion and with prompting
provided the following snippet as a distillation of the key slow part:


from numpy import *

componentcount = 300000
currSum = zeros(componentcount)
row = zeros(componentcount)  # current row
rowcount = 50000
for i in range(1, rowcount):
    row[:] = 1
    currSum = currSum + row


As it is, the OP gets through 1000 rows every 7 seconds or so on their machine,
and I get about the same on mine.

Changing the last line to "currSum += row" gets a 3x speedup. Dropping
the "row[:] = 1" line, as it's really just a time-consuming no-op in
this example and probably not an accurate reflection of what's going on
in the real code, gets you another 2x speedup.
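A compact harness for checking the variants described above, in case
anyone wants to reproduce the numbers (a sketch; exact speedups will
vary with hardware and numpy version):

import time
import numpy as np

componentcount = 300000
rowcount = 1000  # enough iterations to measure

def variant(inplace, fill):
    currSum = np.zeros(componentcount)
    row = np.zeros(componentcount)
    t0 = time.time()
    for i in range(rowcount):
        if fill:
            row[:] = 1
        if inplace:
            currSum += row           # in-place add, no temporary
        else:
            currSum = currSum + row  # allocates a new 300k array each pass
    return time.time() - t0

print("copy + fill   : %.2f s" % variant(inplace=False, fill=True))
print("inplace + fill: %.2f s" % variant(inplace=True, fill=True))
print("inplace only  : %.2f s" % variant(inplace=True, fill=False))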

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
