Fast access multi-dimensional arrays? (int type)

K

Kevin

Hi all,
I have a question: is there any way we can create a multi-dimensional
array data structure, where each grid item is an "int", with fast access
for operations like "increase the value in one grid item" or "get that
value"?

I did some tests using two methods; to my mind, neither is fast enough.
Details follow.

Assumptions:
1) the operations we want to optimize (the faster the better) are:
set/increase/get the int values in the multi-dimensional array.
2) the data is dense.
3) the total number of dimensions is known in advance.
4) the size of each dimension is also known in advance (and not large,
usually 4-10).

Method 1 I tried (using 2 dimensions as an example):
create the array like:
int[][] data = new int[sizeA][sizeB];
in this way, we can access the values like:
int v = data[1][3];
data[1][3] = data[1][3] + 1;
data[1][3] = 123;

However, I noticed this method is not fast enough.
So I am thinking maybe the data are not in a contiguous block of memory,
which makes accessing them slow.

Method 2 I tried (using 2 dimensions as an example):
int[] data = new int[sizeA * sizeB];
each time I want to access an element at (indexA, indexB), I use:
int pos = indexA * sizeB + indexB;
int v = data[pos];
data[pos] = data[pos] + 1;
data[pos] = 123;

However, I observed that the second method does NOT improve the speed of
the data access (in my program, mainly this "data[1][3] = data[1][3] + 1"
kind of access).


So I am wondering: is there any other, faster method? (Or are these 2
methods already the best, and I am just asking too much?)

Thanks a lot. :)
 
R

Roedy Green

However, I observed that the second method does NOT improve the speed of
the data access (in my program, mainly this "data[1][3] = data[1][3] + 1"
kind of access).

It would perhaps improve with an optimising compiler that got rid of the
multiplications all over the place. See
http://mindprod.com/jgloss/nativecompiler.html

The only thing left I can think of is JNI and doing your array
manipulations in C, perhaps using pointers instead of indexes.
Then after that is MASM, where you use the addressing modes of the
hardware to find things rapidly. If it comes to that and you need
some routines absolutely optimal, I am available for a fee. My BBL
Forth was twice as fast as the competition, all written in finely
tuned assembler.

Doing array ACCESSES in C will be slower still because of the overhead
of the JNI boundary.
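
If you do go the JNI route, that boundary cost means you want to cross it
once per batch of updates rather than once per element. A minimal sketch
of what the Java side might look like (the library name and native
signature here are invented for illustration, not a real API):

// Sketch only: assumes a hypothetical native library "cubes" whose C side
// implements the counting loop. The whole batch of updates crosses the
// JNI boundary in a single call.
public class NativeCube {
    static {
        System.loadLibrary("cubes"); // hypothetical library name
    }

    // data:    flattened counts, length sizeA * sizeB
    // indexes: (indexA, indexB) pairs for every update in the batch
    private static native void batchIncrement(int[] data, int[] indexes, int sizeB);

    public static void main(String[] args) {
        int sizeA = 8, sizeB = 8;
        int[] data = new int[sizeA * sizeB];
        int[] indexes = {1, 3, 1, 3, 2, 5}; // increments (1,3) twice and (2,5) once
        batchIncrement(data, indexes, sizeB);
    }
}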
 
J

John C. Bollinger

Kevin said:
I have a question: is there any way we can create a multi-dimensional
array data structure, where each grid item is an "int", with fast access
for operations like "increase the value in one grid item" or "get that
value"?

I did some tests using two methods; to my mind, neither is fast enough.
Details follow.

Assumptions:
1) the operations we want to optimize (the faster the better) are:
set/increase/get the int values in the multi-dimensional array.
2) the data is dense.
3) the total number of dimensions is known in advance.
4) the size of each dimension is also known in advance (and not large,
usually 4-10).

(5) The data are (hyper-) rectangular?
Method 1 I tried (using 2 dimensions as an example):
create the array like:
int[][] data = new int[sizeA][sizeB];
in this way, we can access the values like:
int v = data[1][3];
data[1][3] = data[1][3] + 1;
data[1][3] = 123;

This is the default way of structuring multidimensional primitive data
in Java.
However, I noticed this method is not fast enough.
So I am thinking maybe the data are not in a contiguous block of memory,
which makes accessing them slow.

Although that is a possibility, it seems unlikely for a small number of
small dimensions.
Method 2 I tried (using 2 dimensions as an example):
int[] data = new int[sizeA * sizeB];
each time I want to access an element at (indexA, indexB), I use:
int pos = indexA * sizeB + indexB;
int v = data[pos];
data[pos] = data[pos] + 1;
data[pos] = 123;

That is the principal alternative for structuring multidimensional
primitive data in Java.
However, I observed that the second method does NOT improve the speed of
the data access (in my program, mainly this "data[1][3] = data[1][3] + 1"
kind of access).


So I am wondering: is there any other, faster method? (Or are these 2
methods already the best, and I am just asking too much?)

For more-or-less generic data, I am unaware of any data structure that
is likely to be superior to those you already tried. You should
consider, however, whether the problem is really with the data
structure. A low-level alternative is your application's data access
pattern: especially in the second case, you are likely to get the best
performance if you access the data sequentially from the beginning of
the 1D backing array to the end, and reordering the array dimensions may
improve things. A higher-level alternative is that your code is simply
doing a lot of work. It may be that there is a better algorithm that
would reduce the total work to perform, or it may just be that the
calculation takes a long time because it is fundamentally complex. A
third alternative is that the array access is not part of the problem at
all -- that some other part of the code is slowing it down. Depending
on how you are benchmarking, that other part might be something as
simple and unavoidable as JVM startup time.
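
For what it's worth, a crude way to keep startup and warm-up effects out
of the measurement is to time only the inner work, after a few untimed
warm-up passes (a rough sketch; the class name, sizes, and iteration
counts are arbitrary):

// Rough benchmark sketch: run untimed passes first so HotSpot can compile
// the hot code, then time the same work with System.nanoTime().
public class GridBench {
    static void pass(int[] data, int sizeA, int sizeB) {
        for (int a = 0; a < sizeA; a++) {
            for (int b = 0; b < sizeB; b++) {
                int pos = a * sizeB + b;
                data[pos] = data[pos] + 1;
            }
        }
    }

    public static void main(String[] args) {
        int sizeA = 8, sizeB = 8;
        int[] data = new int[sizeA * sizeB];
        for (int warm = 0; warm < 100000; warm++) {
            pass(data, sizeA, sizeB);           // untimed warm-up
        }
        long start = System.nanoTime();
        for (int rep = 0; rep < 100000; rep++) {
            pass(data, sizeA, sizeB);           // timed passes
        }
        long elapsed = System.nanoTime() - start;
        System.out.println(elapsed + " ns for 100000 passes, checksum " + data[0]);
    }
}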
 
P

Patricia Shanahan

Kevin said:
Hi all,
I have a question: is there any way we can create a multi-dimensional
array data structure, where each grid item is an "int", with fast access
for operations like "increase the value in one grid item" or "get that
value"?

Does the total matrix size fit easily in cache? If you have only a
two-dimensional problem, with 4 to 10 elements in each dimension,
obviously the problem does fit.

If yes, then index arithmetic is probably the place to look for
improvements.

If no, then you need to think of your problem in terms of transfers
between memory and cache. Relationships between the order of access in
loops and the order of placement in memory matter. Generally, you want
the innermost loop to scan the dimension that is consecutive in memory.
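
For example, with the flattened layout from the earlier post, this loop
order walks the backing array from start to finish (a sketch; the names
and sizes are just illustrative):

int sizeA = 8, sizeB = 8;                 // illustrative sizes
int[] data = new int[sizeA * sizeB];
for (int indexA = 0; indexA < sizeA; indexA++) {
    // innermost loop over the rightmost subscript, so pos advances by 1
    // each iteration and memory is touched sequentially
    for (int indexB = 0; indexB < sizeB; indexB++) {
        int pos = indexA * sizeB + indexB;
        data[pos] = data[pos] + 1;
    }
}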

Patricia
 
T

Thomas Hawtin

Roedy said:
However, I observed that the second method does NOT improve the speed of
the data access (in my program, mainly this "data[1][3] = data[1][3] + 1"
kind of access).


It would perhaps improve with an optimising compiler that got rid of the
multiplications all over the place.

There isn't any multiplication there (other than, perhaps, a left shift).

The real problem, if there is one there, is that the expression uses two
arrays. The second array will need to be loaded from the first, checked
against null, have its length loaded from the header, and undergo a
second bounds check. That's two extra reads.

If a single array is used (with a row*cols+col or col*rows+row
calculation), then the single array header and data are more likely to be
cached.

As the original poster's use of the second method didn't improve
performance, this probably isn't the bottleneck. OTOH, the multiplication
may be slow, and replacing it with a left shift may help.
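
For instance, if the row length is padded up to a power of two, the index
calculation becomes a shift, at the cost of a few wasted cells per row
(a sketch; the sizes are made up):

int sizeA = 8, sizeB = 10;            // logical sizes
int shift = 4;                        // smallest power of two >= sizeB is 16 = 1 << 4
int rowLen = 1 << shift;              // padded row length
int[] data = new int[sizeA * rowLen]; // wastes (rowLen - sizeB) ints per row

int indexA = 1, indexB = 3;
int pos = (indexA << shift) + indexB; // instead of indexA * sizeB + indexB
data[pos] = data[pos] + 1;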

Tom Hawtin
 
R

Roedy Green

checked
against null,

I wonder if that is an explicit test. I have never seen the Hotspot
code. I suspect that if they were clever, that test might come out in
the wash thanks to the hardware limit checks.
 
R

Roedy Green

There isn't any multiplication there (other than, perhaps, a left shift).

I was referring to the multiplications you have in the single array
version.
 
R

Roedy Green

Generally, you want
the innermost loop to scan the dimension that is consecutive in memory.

In other words, the rightmost subscript should vary most rapidly.

This is important on three levels:

cpu cache
sram cache
swapping to disk

At each level things go better if you don't hop around.
 
T

Thomas Hawtin

Roedy said:
I wonder if that is an explicit test. I have never seen the Hotspot
code. I suspect that if they were clever, that test might come out in
the wash thanks to the hardware limit checks.

I would imagine it's very difficult to make such optimisations against
an array of values which will probably be long lived, and not local.

I came across an old Sun article the other day which stated there was
little benefit from hoisting bounds-checking outside an inner loop. But
that might have just been excuses.

Tom Hawtin
 
K

Kevin

I am thinking so too, because there will be a lot of data cubes. Each one
may fit in cache (I guess), but as I scan the data lines one by one and
try to update each data cube, this may require "cache in, cache out" very
often.

I did a test on part of the data:
with MyDataCube.updateCount() left empty, the program runs in 16 seconds.

If I use the single-array approach in MyDataCube, and the updateCount()
function does: 1) calculate the index to update, 2) update the
single-dimension array with data[pos] = data[pos] + 1; then the program
slows down to 94 seconds.

Just in case, let me rephrase what the program tries to do with an
example:

Let us have this example data file:
A B C
a1 b1 c1
a1 b2 c1
a2 b3 c2
a1 b2 c2

For 2D data cubes, we will have 3 of them: (A, B), (A, C), (B, C).
(Note: (A, B) = (B, A) in our case.)
For data cube (A, B), my program needs to get these counts:
     b1  b2  b3
a1    1   2   0
a2    0   0   1
(how many data points we have when A=ax and B=by)

For data cube (A, C), my program needs to get these counts:
     c1  c2
a1    2   1
a2    0   1

For data cube (B, C), my program needs to get these counts:
     c1  c2
b1    1   0
b2    1   1
b3    0   1

And the program is required to compute all 3 of these data cubes with
these counts in them. As you can imagine, for 3-attribute cubes, there
will be many more of them.

In my real case, there are about 16K two-attribute data cubes, and each
one takes about 1600 bytes. And there are a million data points.

My current approach is:
1) generate all the possible 2-attribute data cubes, with empty counts
for each of them. Luckily, at least for the 2-attribute case, all of them
fit in physical memory.
2) scan the data line by line. For each line, update the count in every
data cube as follows (sketched below):
for each data cube:
2a) from the values of that data line, determine the indexes we need
to update, such as [5][12] in that 2D data cube (the 5 and 12 are
determined from the values in the current data line).
2b) call thatDataCube.updateCount() with those indexes to update the
count (I tried the two methods mentioned above to represent the real
"cube": one is a nested array, one is a single-dimension array).
Of all of this, steps 1) and 2a) are pretty fast; they take 16 seconds
on a small test data set. But when step 2b) is included, the time needed
jumps sharply to 94 seconds ---- that is why, in my post above, I suspect
the array access may take a long time.
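
Roughly, the loop in step 2) looks like this (a simplified sketch of the
structure, not my exact code; the field names are just for illustration):

// Simplified sketch: each MyDataCube covers two attributes and holds a
// flattened int[] of counts.
class MyDataCube {
    final int attrA, attrB;   // which columns of a data line this cube covers
    final int sizeB;          // number of distinct values of the second attribute
    final int[] counts;       // flattened sizeA * sizeB count array

    MyDataCube(int attrA, int attrB, int sizeA, int sizeB) {
        this.attrA = attrA;
        this.attrB = attrB;
        this.sizeB = sizeB;
        this.counts = new int[sizeA * sizeB];
    }

    void updateCount(int indexA, int indexB) {
        counts[indexA * sizeB + indexB]++;        // step 2b
    }
}

// step 2: one pass over the data lines, updating every cube
// (dataLines is the parsed data, values coded as small ints;
//  cubes is the list of all ~16K 2-attribute cubes)
for (int[] line : dataLines) {
    for (MyDataCube cube : cubes) {
        cube.updateCount(line[cube.attrA], line[cube.attrB]);   // steps 2a + 2b
    }
}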

On further thought, I think the cache swapping may take most of the time:
for each data line, my code processes all 16K data cubes. Even if we
order the 16K data cubes in some way, we still have to move them in and
out of cache for each data line, right? Which slows things down a lot.
But if not this way, is there a better way?

Any ideas?

Thanks. :)
 
C

Chris Uppal

Kevin said:
Just in case, let me rephrase what the program tries to do with an
example:

Thanks for the extra information, it's much clearer what you are trying to do
now.

BTW, this sounds like data-mining, and I'm no expert on either the theory
or practice of that endeavour, so it may be that there are well-known
algorithms for doing this sort of thing that I know nothing about; take
the following with caution...

It seems that you are attempting to compute a sort of correlation matrix
between the various values of the attributes. If you don't mind, I'll call
them correlation matrixes (even though I'm misusing that term a bit, it's
roughly applicable) rather than "data cubes" since I'll want to distinguish
between the actual data, and the derived matrixes, and calling them "data
cubes" would confuse /me/ even if it didn't confuse /you/ ;-)

First off, you haven't said what you are going to do with your (huge!)
collection of correlation matrixes. On the face of it, it seems completely
infeasible to create all of them at once and then deal with them all at
the same time. I think that your only option is to create them one at a
time (or a few at a time), do whatever you need to with them, and then
move on to the next few. If you can't arrange your overall application
like that, then I seriously doubt whether it is feasible (although buying
a 64-bit box and loading it with some hundreds of GBytes of RAM might work).

Assuming that you can rearrange your application so you use the matrixes one at
a time, I think you have a range of algorithms available. In fact what you
have is a single algorithm with a variable parameter which allows you to choose
between optimising for space and optimising for time.

To start with, here are three special cases of the overall algorithm.

First, you could just load all your raw data into memory (I'm assuming
that it will fit; if not, you are in deeper trouble ;-) as a list of n-tuples
representing the data (the n-tuples might be int[] arrays, or even byte[]
arrays if the ranges of values are small enough). You can then derive each
correlation matrix from the raw data on demand. Creating each matrix is O(N*n)
where N is the number of n-tuples, and n is the number of values in each tuple.
My /guess/ is that this is the most, or even only, feasible implementation as
the number of dimensions (n) rises, but that might only be true (if it's true
at all) for values of n greater than your application actually uses.
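
In code, that first option might look something like this (a sketch; I'm
assuming the attribute values have already been coded as small ints
starting at 0):

// Option 1 sketch: keep the raw tuples in memory and build each 2-D
// "correlation" (count) matrix on demand with one linear scan.
static int[][] countsFor(java.util.List<int[]> tuples,
                         int dimA, int dimB, int sizeA, int sizeB) {
    int[][] counts = new int[sizeA][sizeB];
    for (int[] t : tuples) {
        counts[t[dimA]][t[dimB]]++;
    }
    return counts;
}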

At the other extreme you have an algorithm that computes a single "complete"
correlation matrix, from which you derive sub-matrixes by summing "columns"
(as Patricia has already described). Note that this matrix would be of fairly
high dimension and would be dense, so it might easily require more space than
the first option I mentioned. It would have the property that deriving
high-dimension slices (correlation matrixes for most of the n dimensions in the
raw data) would be quick. Deriving low-dimension slices would be slower, for
instance all the looping needed to derive a slice relating just 2 dimensions of
the raw data might take almost as long as a simple linear scan down the raw
data list.
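
A sketch of that second option (again assuming values coded from 0 and
dimension sizes known in advance): the full matrix lives in one flattened
int[], and a 2-dimensional slice falls out by summing over the other
dimensions.

// Option 2 sketch: one count cell per combination of values in all n
// dimensions, stored flattened; a 2-D slice is derived by summing out
// the remaining dimensions.
static int[] fullCounts(java.util.List<int[]> tuples, int[] sizes) {
    int total = 1;
    for (int s : sizes) total *= s;          // grows quickly with n; this is the space cost
    int[] counts = new int[total];
    for (int[] t : tuples) {
        int pos = 0;
        for (int d = 0; d < sizes.length; d++) {
            pos = pos * sizes[d] + t[d];     // mixed-radix index
        }
        counts[pos]++;
    }
    return counts;
}

static int[][] slice(int[] counts, int[] sizes, int dimA, int dimB) {
    int[][] out = new int[sizes[dimA]][sizes[dimB]];
    int[] value = new int[sizes.length];     // decoded coordinates of the current cell
    for (int pos = 0; pos < counts.length; pos++) {
        int rest = pos;
        for (int d = sizes.length - 1; d >= 0; d--) {
            value[d] = rest % sizes[d];
            rest /= sizes[d];
        }
        out[value[dimA]][value[dimB]] += counts[pos];
    }
    return out;
}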

A third option is to use something like a hashtable (one for each dimension) to
relate possible values of that dimension to the sets of tuples that have that
value in that dimension (I hope that sentence makes sense!). Finding the
"correlation" between, say, two values would then involve finding the
intersection of the two sets. If you wanted the "correlation" (as I'm misusing
the word) between tuples with value A for one dimension and B for another, you
would look up A in the hashtable corresponding to that dimension (finding one
set), and value B in another hashtable corresponding to the other dimension,
and then compute the size of the intersection of those sets. Finding the
intersection of two sets can be trivially done in O(N log M) (where N and M are
the sizes of the two sets) or -- if you pre-sort all your sets on some
arbitrary condition -- in O(N+M). Similarly finding the "correlation" between
three values would involve computing the (size of the) intersection of three
sets, and so on. (BTW, the O(<something>) here is the number of comparisons
involved, but each comparison depends on the number of dimensions of the raw
data, so I suppose I should write O((N+M) * n) for the actual time required.)
I said "something like a hashtable"; in fact, if you massage your data
appropriately you should be able to use an array for each index instead, which
is simpler and probably faster. Obviously, producing a complete correlation
matrix would involve repeating this lookup algorithm for each cell.
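
A sketch of that third option with plain arrays (row numbers are collected
in scan order, so each per-value list comes out already sorted, and the
intersection count runs in O(N+M)):

// Option 3 sketch: rowsFor(tuples, d, size)[v] is the sorted array of row
// numbers whose value in dimension d is v. The "correlation" of value a in
// dimension dA with value b in dimension dB is then
// intersectionSize(rowsA[a], rowsB[b]).
static int[][] rowsFor(java.util.List<int[]> tuples, int dim, int size) {
    int[] lengths = new int[size];
    for (int[] t : tuples) lengths[t[dim]]++;
    int[][] rows = new int[size][];
    for (int v = 0; v < size; v++) rows[v] = new int[lengths[v]];
    int[] fill = new int[size];
    int rowNo = 0;
    for (int[] t : tuples) {
        int v = t[dim];
        rows[v][fill[v]++] = rowNo++;
    }
    return rows;
}

static int intersectionSize(int[] xs, int[] ys) {
    int i = 0, j = 0, count = 0;
    while (i < xs.length && j < ys.length) {
        if (xs[i] == ys[j])     { count++; i++; j++; }
        else if (xs[i] < ys[j]) { i++; }
        else                    { j++; }
    }
    return count;
}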

That last option would produce a relatively small number of hashtables, each
with a relatively small number of entries, but where each entry might point to
a large number of tuples. You could increase lookup speed considerably, but at
the cost of a considerable increase in space usage, by "indexing" more than one
dimension. For instance, if you used 2-dimensional indexing, you would have
n*(n-1) hashtables, each of which corresponds to two dimensions of the raw data,
and maps pairs of values to the sets of tuples that have matching values in
both dimensions. That would cut down the size of the sets considerably, and so
make computing the intersections that much faster. Again, although I said
"hashtable", it would make more sense to massage the data into a form where you
could use 2-dimensional arrays.

Obviously, you can index as many dimensions as you want. If you index 0
dimensions (if you see what I mean) then this algorithm is identical to the
simple one I started with. If you index all the dimensions then the result is
almost identical to Patricia's suggestion (my second option) -- the only
difference is that since you have indexed /all/ the dimensions you will never
have to compute any set intersections, so it's better just to store the size of
the corresponding sets of tuples instead of the sets themselves; if you do that
then they are identical.

-- chris
 
