Kevin
Hi there,
I think my previous code (
http://groups.google.com/group/comp...6894b/952b88f8f9aef443?hl=en#952b88f8f9aef443
) needs a complete rewrite to improve its speed, so I would like to see
whether anyone has good suggestions for this kind of program.
The program preprocesses data. The input format resembles an Excel-like
spreadsheet: there are N lines of data points, and each data point has
M attributes (columns). Each attribute has K possible values.
The code is supposed to generate all the "data cubes" from the input
data. Each data cube covers X attributes (X ranges from 2 to 6 for our
needs). For each data cube, we need to keep the counts of all possible
value combinations of its attributes.
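To make the sizes concrete, here is a small Python sketch that counts how many cubes and count cells there are for a given X. It assumes every attribute has the same number of values K, as described above; the function name cube_stats and the sample values M = 20, K = 5 are made up for illustration.

```python
from math import comb

def cube_stats(m, k, x):
    """Return (number of cubes, cells per cube, total cells) for x-attribute cubes.

    Assumes all m attributes have exactly k possible values.
    """
    num_cubes = comb(m, x)        # unordered choices of x attributes out of m
    cells_per_cube = k ** x       # one counter per value combination
    return num_cubes, cells_per_cube, num_cubes * cells_per_cube

# Hypothetical sizes, only to show how fast the totals grow with X.
for x in range(2, 7):
    print(x, cube_stats(m=20, k=5, x=x))
```

The total number of counters grows quickly with X, which is why it matters whether all the cubes fit in memory at once.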
For example, for a 2-attribute data cube (X = 2), suppose one
attribute (B) has 2 possible values and the other attribute (A) has 3
possible values. Then we need to get these counts:

        A1   A2   A3
  B1     2   56   86
  B2    34    4   23
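As a minimal illustration of what one such table means, the Python sketch below counts value pairs for a single 2-attribute cube. The function name count_pairs and the four sample rows are invented for the example; they are not the data behind the table above.

```python
from collections import Counter

def count_pairs(rows, a_idx, b_idx):
    """Count how often each (B value, A value) pair occurs in rows."""
    counts = Counter()
    for row in rows:
        counts[(row[b_idx], row[a_idx])] += 1
    return counts

# Hypothetical data points: column 0 is attribute A, column 1 is attribute B.
rows = [("A1", "B1"), ("A2", "B1"), ("A2", "B2"), ("A1", "B1")]
table = count_pairs(rows, a_idx=0, b_idx=1)
```

Each key of table corresponds to one cell of the grid, for example table[("B1", "A1")] is the count for row B1, column A1.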
In my previous approach, I created a class DataCube, which holds the
data for one cube (for the example above, it holds the 6 ints). I
first generate all the possible data cubes; for 2-attribute data
cubes, there are M*(M-1)/2 of them. Then I scan the data once,
updating the counts of every data cube for each data point.
This approach is naive and just too slow (even if we assume all the
data cubes fit in memory).
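For reference, the single-scan approach can be sketched in Python roughly as follows. This is not the linked code: it is an illustration that assumes attribute values are already encoded as integers 0..K-1 and that every attribute has the same K. The names build_cubes and scan are made up, and the flat-array layout with a base-K index is just one possible choice.

```python
from itertools import combinations

def build_cubes(num_attrs, k, x):
    """One flat list of k**x counters per x-attribute combination."""
    return {combo: [0] * (k ** x)
            for combo in combinations(range(num_attrs), x)}

def scan(rows, cubes, k):
    """Single pass over rows; rows can be any iterable, e.g. a file reader,
    so the data itself never has to fit in memory."""
    for row in rows:
        for combo, counts in cubes.items():
            idx = 0
            for attr in combo:
                idx = idx * k + row[attr]   # base-k index into the flat list
            counts[idx] += 1
    return cubes

# Hypothetical input: 3 data points, M = 3 attributes, K = 3 values each.
rows = [(0, 1, 2), (0, 1, 0), (1, 1, 2)]
cubes = scan(iter(rows), build_cubes(num_attrs=3, k=3, x=2), k=3)
```

The inner work per data point is proportional to the number of cubes times X, which is where the time goes when M is large.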
So for this kind of application, how should I write the code to
maximize speed while still getting the correct (and exact) counts for
each data cube?
(Please note that N is large; the data cannot fit into memory.)
Thanks and good night!