How to struct this kind of application?

K

Kevin

Hi there,

I am thinking maybe my previous code (
http://groups.google.com/group/comp...6894b/952b88f8f9aef443?hl=en#952b88f8f9aef443
) needs some total revise to improve the speed, so let's see if anyone
has any good suggestions for this kind of program.

The program is a data preprocessing code. The data format is similar to
Excel like spread sheet: there are many lines of data points (total N),
each data point has M attributes (columns). Each attribute has a K
number of possible values.

The code is supposed to generate all the "data cubes" from the input
data. Each data cube has X attributes (X range from 2 to 6 for our
needs). We need to keep the counts for all the possible values
combinations of each data cube.

For example, for a 2 attributes data cube (X = 2), suppose one
attribute (B) has 2 possible values, and the other attribute (A) has 3
possible avlues. Then we need to get these counts:

A1, A2, A3
B1 2 56 86
B2 34 4 23


In my previous approach, I create a class DataCube, which hodes the
datas for that cube (for the above example, it will hold the 6 ints). I
first generate all the possible data cubes. For 2 attributes data
cubes, there are about total M*(M-1)/2 of them. Then I scan the data
once, trying to update the counts for each data cube.

This approach is naive but just too slow (even if we suppose all the
data cubes can fit in memory).

So for this kind of application, in order to get the correct (and
exact) counts for each data cubes, how do I write the code to max the
speed?

(Please note that N is large, the data can not fit into memory.)

Thanks and good night!
 
K

Kevin

By the way, I know there is data structure called frequent pattern
tree. Though building it can be fast, but scanning it to get the counts
we want is also very slow.
 
M

megagurka

Kevin skrev:
So for this kind of application, in order to get the correct (and
exact) counts for each data cubes, how do I write the code to max the
speed?

Don't know if I understand your problem correctly, but I'll give it a
shot. If you have enough memory storing the frequency counts in an
array is probably the fastest method. If you have X attributes the
array would require N1 * N2 * .. * NX entries, where Ni is the number
of possible values for attribute N. However, if N and X are big the
array won't fit into RAM and many entries in the array will be zero. A
better solution in this case would be to use a sparse matrix library.
Using a Map would also work, but I think it would require more memory
and have worse performance.

/JN
 
C

Chris Uppal

Kevin said:
The code is supposed to generate all the "data cubes" from the input
data. Each data cube has X attributes (X range from 2 to 6 for our
needs). We need to keep the counts for all the possible values
combinations of each data cube.

For example, for a 2 attributes data cube (X = 2), suppose one
attribute (B) has 2 possible values, and the other attribute (A) has 3
possible avlues. Then we need to get these counts:

A1, A2, A3
B1 2 56 86
B2 34 4 23

I'm still missing something here. How do the data cubes correspond to the
input ? What do you mean by "We need to keep the counts for all the possible
values combinations of each data cube" ? To put it another way, where do the
numbers 2, 56, 86, etc. come from in your example, and is that example just
/one/ of the data cubes you need to find, or is that a summary of the
information from the whole collection of data cubes ?

-- chris
 
P

Patricia Shanahan

Kevin wrote:
....
For example, for a 2 attributes data cube (X = 2), suppose one
attribute (B) has 2 possible values, and the other attribute (A) has 3
possible avlues. Then we need to get these counts:

A1, A2, A3
B1 2 56 86
B2 34 4 23 ....
(Please note that N is large, the data can not fit into memory.)

I don't know whether this will work out, but here's another way of
looking at the problem.

First sort your input data. File sort utilities are designed to work,
and work fast, with data that doesn't fit in memory.

Taking your two-dimensional example, you would sort with A as primary
key, B as secondary key.

All lines with A=A1 and B=B1 would appear together at the start of the
file, so count them and emit the (A1,B1) number. Next you find any lines
with A=A1 and B=B2. Count them...

In general, all lines that contribute to a given count are consecutive
in the file, so you only need to deal with one count, plus one line, at
a time. All the temporary file shuffling to deal with the data being too
large to fit in memory is dumped off on the sort utility.

Patricia
 
J

jan V

I am thinking maybe my previous code (
) needs some total revise to improve the speed, so let's see if anyone
has any good suggestions for this kind of program.

How much is your current RAM requirement, precisely? If it's up to 2x to 4x
your current physical RAM resource, then you may simply be able to order the
few RAM expansions, fit them, and sit back and relax while your program does
its job as is currently implemented.

If your program is part of a commercial project, then you need to weigh
whether spending a few days (or more....) on changing your code is cheaper
than the "throw more RAM at it" route..
 
R

Roedy Green

First sort your input data. File sort utilities are designed to work,
and work fast, with data that doesn't fit in memory.

What file sorts do you recommend?
 
K

Kevin

First, though ultimately we can not keep all the data cubes in memory
(for those 3, 4, 5, 6 ones, there are just too many). But we can assume
that for 2D data cubes, we can hold them in memory.

The file sorting method mentioned by Patricia seems not work well: we
are to get all the possible data cubes, not just one or two. So does it
mean that for each data cube, we have to sort the data once? Which is
quite slow I am thinking.

The array method does not work. As mentioned in my previous post
(http://groups.google.com/group/comp...6894b/952b88f8f9aef443?hl=en#952b88f8f9aef443),
it seems each time, an access to the array takes very long time (though
it is supposed to be O(1) time).
To my understanding, for example, for int[][] MyData;
a operation like:
MyData[10][24] = MyData[10][24] + i;
will have a similar speed as:
i = i + i;
However, I obseved that the first one is 4+ times slower than the
second one.
That is the reason I am thinking revise my code to improve the speed.

For the "data cubes" I mentioned, let us have this example data file:
A B C
a1 b1 c1
a1 b2 c1
a2 b3 c2
a1 b2 c2

For 2 attributes data cubes, we will have 3 of them: (A, B), (A, C),
(B, C). (Note: (A, B) = (B, A) in our case)
For data cube (A, B), my program needs to get these counts:
b1 b2 b3
a1 1 2 0
a2 0 0 1
(how many data points we have, when A=ax and B=by)

For data cube (A, C), my program needs to get these counts:
c1 c2
a1 2 1
a2 0 1

For data cube (B, C), my program needs to get these counts:
c1 c2
b1 1 0
b2 1 1
b3 1 0

And the program is required to find all these 3 data cubes with these
numbers in them. As you can imagine, for 3 attributes cubes, there
will have much more of them.

Hope this clears what the program is supposed to do.
Any suggestions?

Good day!
 
P

Patricia Shanahan

Kevin said:
First, though ultimately we can not keep all the data cubes in memory
(for those 3, 4, 5, 6 ones, there are just too many). But we can assume
that for 2D data cubes, we can hold them in memory. ....
Hope this clears what the program is supposed to do.
Any suggestions?

Good day!

Note that if you have e.g. the (A,B,C) cube, you can build the (A,B)
cube by summing along the C dimension, without looking at the original
file or any other dimensions. That suggests starting with the highest
dimension cubes, and working with the cubes, rather than the original
file, to build the lower dimensions.

As you progress, the files you have to deal with will get smaller, until
you have an in-memory problem for the 2D cubes.

Patricia
 
P

Patricia Shanahan

Kevin said:
To my understanding, for example, for int[][] MyData;
a operation like:
MyData[10][24] = MyData[10][24] + i;
will have a similar speed as:
i = i + i;
However, I obseved that the first one is 4+ times slower than the
second one.
That is the reason I am thinking revise my code to improve the speed.

The context affects the performance of array access. There are two slow
cases to consider:

1. When you access an array element cold. If you literally do
MyData[10][24] = MyData[10][24] + i; the system is going to have to do
several memory accesses and array length checks. Some of those costs can
be spread over several operations in a suitable loop.

2. If you are scanning data structures that are larger than the
processor's L1 cache, you must expect to pay cache miss overheads.
Again, if you structure the loop right one cache miss may cover several
loads and stores.

Patricia
 
P

Patricia Shanahan

Roedy said:
What file sorts do you recommend?

I've always started with whatever was the installed sort on the system,
and never had to do anything else.

Patricia
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,434
Messages
2,571,689
Members
48,796
Latest member
Greg L.

Latest Threads

Top