# Storage of symmetric boolean matrix

G

#### Gustavo Rondina

Hello all,

First of all, this might be somewhat off-topic for some of you, so let
me

I am writing a program that requires a huge 2D symmetric matrix of
booleans
(i.e., either 0 or 1). My first thought was to store it in a packed
way, so
the memory needed would be about half of the amount required to store
the
whole array in a traditional fashion. My packing is described as
follows.

Given this symmetric matrix:

0 1 2 3 4
+---+---+---+---+---+
0 | 1 | 0 | 0 | 1 | 1 |
+---+---+---+---+---+
1 | 0 | 0 | 1 | 1 | 0 |
+---+---+---+---+---+
2 | 0 | 1 | 1 | 1 | 1 |
+---+---+---+---+---+
3 | 1 | 1 | 1 | 1 | 1 |
+---+---+---+---+---+
4 | 1 | 0 | 1 | 1 | 0 |
+---+---+---+---+---+

I am storing only the upper triangular portion plus the diagonal
continuously
in memory, so I have:

00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The two digits numbers above the schematic correspond to the indexes
in the
original matrix. Assuming each position is one char, instead of N^2
bytes the
packed way is using only N*(N+1)/2 bytes. The trade-off is of course
having a
more costly way to access the matrix. Instead of simply doing M[J],
I need
a function like this:

#define N 10000 /* size of matrix */
...
1 unsigned char is_set(unsigned int i, unsigned int j)
2 {
3 unsigned int tmp;
4 unsigned int real_index;
5
6 if (i > j) {
7 tmp = j;
8 j = i;
9 i = tmp;
10 }
11
12 real_index = (i * N) - (i * (i - 1) / 2) + j - i;
13 return packed_array[real_index];
14 }

I came up with the formula to convert the indexes from the traditional
array
to the packed array and it seems to work.

Then I realized it is a waste to use a char to store a single
information, so
I decided to the bits of a char to do this, then instead of using N
(N-1)/2
bytes, I would need only 1/8 of this memory. Of course this leads to
bit
manipulation and all. The final result, with a few tweaks, is the
following
function:

1 unsigned char is_set(int i, int j)
2 {
3 register unsigned int idx;
4
5 if (i > j) {
6 register int temp = i;
7 i = j;
8 j = temp;
9 }
10
11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
char) 0x80);
17 }

(I reordered the expression to reduce the number of multiplications
and use
shift operations.)

This function is called a lot of times, and this is having some
noticeable
impact in the overall performance. The memory saving is really huge,
but
the performance is making me wonder if this is really the best choice.

Does anyone know how could I make it faster, or a clever way to solve
this
problem, or any suggestions on the code I posted above? Only
limitation
is that they only work properly if a char is 8 bits long, but I can
live with that.

Thanks,
Gustavo

PS. Sorry if the schematics look messed up, with a monospace font they
looked nice.

K

#### kid joe

Hello all,

First of all, this might be somewhat off-topic for some of you, so let
me

I am writing a program that requires a huge 2D symmetric matrix of
booleans
(i.e., either 0 or 1). My first thought was to store it in a packed
way, so
the memory needed would be about half of the amount required to store
the
whole array in a traditional fashion. My packing is described as
follows.

Given this symmetric matrix:

0 1 2 3 4
+---+---+---+---+---+
0 | 1 | 0 | 0 | 1 | 1 |
+---+---+---+---+---+
1 | 0 | 0 | 1 | 1 | 0 |
+---+---+---+---+---+
2 | 0 | 1 | 1 | 1 | 1 |
+---+---+---+---+---+
3 | 1 | 1 | 1 | 1 | 1 |
+---+---+---+---+---+
4 | 1 | 0 | 1 | 1 | 0 |
+---+---+---+---+---+

I am storing only the upper triangular portion plus the diagonal
continuously
in memory, so I have:

00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The two digits numbers above the schematic correspond to the indexes
in the
original matrix. Assuming each position is one char, instead of N^2
bytes the
packed way is using only N*(N+1)/2 bytes. The trade-off is of course
having a
more costly way to access the matrix. Instead of simply doing M[J],
I need
a function like this:

#define N 10000 /* size of matrix */
...
1 unsigned char is_set(unsigned int i, unsigned int j)
2 {
3 unsigned int tmp;
4 unsigned int real_index;
5
6 if (i > j) {
7 tmp = j;
8 j = i;
9 i = tmp;
10 }
11
12 real_index = (i * N) - (i * (i - 1) / 2) + j - i;
13 return packed_array[real_index];
14 }

I came up with the formula to convert the indexes from the traditional
array
to the packed array and it seems to work.

Then I realized it is a waste to use a char to store a single
information, so
I decided to the bits of a char to do this, then instead of using N
(N-1)/2
bytes, I would need only 1/8 of this memory. Of course this leads to
bit
manipulation and all. The final result, with a few tweaks, is the
following
function:

1 unsigned char is_set(int i, int j)
2 {
3 register unsigned int idx;
4
5 if (i > j) {
6 register int temp = i;
7 i = j;
8 j = temp;
9 }
10
11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
char) 0x80);
17 }

(I reordered the expression to reduce the number of multiplications
and use
shift operations.)

This function is called a lot of times, and this is having some
noticeable
impact in the overall performance. The memory saving is really huge,
but
the performance is making me wonder if this is really the best choice.

Does anyone know how could I make it faster, or a clever way to solve
this
problem, or any suggestions on the code I posted above? Only
limitation
is that they only work properly if a char is 8 bits long, but I can
live with that.

Thanks,
Gustavo

PS. Sorry if the schematics look messed up, with a monospace font they
looked nice.

Hi Gustavo,

The bitshifts and ANDs will be very fast so probably the problem is the
overhead of a function call. Try declaring your function as inline (or use
a Macro).

Cheers,
Joe

B

#### Barry Schwarz

Hello all,

First of all, this might be somewhat off-topic for some of you, so let
me

I am writing a program that requires a huge 2D symmetric matrix of
booleans
(i.e., either 0 or 1). My first thought was to store it in a packed
way, so
the memory needed would be about half of the amount required to store
the
whole array in a traditional fashion. My packing is described as
follows.

snip conventional 2-d array description
I am storing only the upper triangular portion plus the diagonal
continuously
in memory, so I have:

snip description of converted linear array
Then I realized it is a waste to use a char to store a single
information, so
I decided to the bits of a char to do this, then instead of using N
(N-1)/2
bytes, I would need only 1/8 of this memory. Of course this leads to
bit
manipulation and all. The final result, with a few tweaks, is the
following
function:

1 unsigned char is_set(int i, int j)
2 {
3 register unsigned int idx;
4
5 if (i > j) {
6 register int temp = i;
7 i = j;
8 j = temp;
9 }
10
11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
char) 0x80);

What happened to lines 12 through 15?
17 }

(I reordered the expression to reduce the number of multiplications
and use
shift operations.)
This function is called a lot of times, and this is having some
noticeable
impact in the overall performance. The memory saving is really huge,
but
the performance is making me wonder if this is really the best choice.

Welcome to the common experience of being able to optimize for time or
space but not both.
Does anyone know how could I make it faster, or a clever way to solve
this
problem, or any suggestions on the code I posted above? Only
limitation
is that they only work properly if a char is 8 bits long, but I can
live with that.

You need to look at the generated assembly code.

N<<1 is a compile-time constant and should not be computed
each time. If your compiler is not optimizing this, you might want to
compute it once and store it in a static variable.

Assuming packed_array has type unsigned char, the arithmetic
in the return statement is still performed using int and then
converted to unsigned. Is the generated code spending a lot of time
converting between the types? If so, you might consider using
bit objects instead of 8 bit ones.

In my experience, multiplication is the long tent pole in your
i * ((N << 1) - i - 1)
This is the same as
i * (N << 1) - i *(i+1)
If you increase N until it is a power of 2, the first
multiplication can be replaced by another shift.
The second multiplication depends only on i. i will always be
between 0 and N-1. You could build an N element array of int where
element y is set to y*y+y.
(i << NN) - array
with no multiplication involved. (The array reference should result
in a shift, not a multiply, but you do need to check the assembly
code.)

B

#### Ben Bacarisse

Gustavo Rondina said:
I am storing only the upper triangular portion plus the diagonal
continuously
in memory, so I have:

00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The two digits numbers above the schematic correspond to the indexes
in the
original matrix. Assuming each position is one char, instead of N^2
bytes the
packed way is using only N*(N+1)/2 bytes. The trade-off is of course
having a
more costly way to access the matrix. Instead of simply doing M[J],

Then I realized it is a waste to use a char to store a single
information, so
I decided to the bits of a char to do this, then instead of using N
(N-1)/2
bytes, I would need only 1/8 of this memory. Of course this leads to
bit
manipulation and all. The final result, with a few tweaks, is the
following
function:

1 unsigned char is_set(int i, int j)
2 {
3 register unsigned int idx;
4
5 if (i > j) {
6 register int temp = i;
7 i = j;
8 j = temp;
9 }
10
11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
char) 0x80);
17 }

(I reordered the expression to reduce the number of multiplications
and use
shift operations.)

This function is called a lot of times, and this is having some
noticeable
impact in the overall performance. The memory saving is really huge,
but
the performance is making me wonder if this is really the best
choice.

If you've coded the rest cleanly, it should be very simple to switch in
a plain array and see how much better/worse it is.

My only alternative is to use an array of pointers rather than a
packed array. You arrange for M to be a pointer to row i. This
replaces one part of the index calculation with an indirection. You
do this along with any of your other plans -- storing only one half of
the matrix and/or using bits rather than bytes.

You might also want to try this: allocate the diagonals and not the
rows. Why? Because you can make the symmetry explicit that way. For
an NxN matrix you allocate an array of 2N-1 pointers P. P points
to an array of 1 element (M[N-1]), P to two elements (M[N-2]
and M[N-1]) and so on up to P[N-1] which points to the N elements
of the main diagonal. The remaining Ps pointer to the *same* arrays
but in reverse order (P[N] = P[N-2], P[N+1] = P[N-3] and so on).

Now, make a new pointer-to-pointer variable D and set it to &P[N-1].
D is now the main diagonal and both D and D[-1] are the (same)
diagonal just above and below it. Element M[j] is now indexed as

D[i-j][min(i,j)] i.e: D[i-j][i < j ? i : j]

As with all optimisations, this may or may not pay off but I'd want to
try it out.

You can use the bit allocation trick here as well, of course.

Note that when using an array of pointers you don't have to allocate
the smaller arrays separately. You can so it all on one go and simply
point to the appropriate part.

<snip>

T

#### Thomas Matthews

Gustavo said:
Hello all,

First of all, this might be somewhat off-topic for some of you, so let
me

I am writing a program that requires a huge 2D symmetric matrix of
booleans
(i.e., either 0 or 1). My first thought was to store it in a packed
way, so
the memory needed would be about half of the amount required to store
the
whole array in a traditional fashion. My packing is described as
follows.

Given this symmetric matrix:

0 1 2 3 4
+---+---+---+---+---+
0 | 1 | 0 | 0 | 1 | 1 |
+---+---+---+---+---+
1 | 0 | 0 | 1 | 1 | 0 |
+---+---+---+---+---+
2 | 0 | 1 | 1 | 1 | 1 |
+---+---+---+---+---+
3 | 1 | 1 | 1 | 1 | 1 |
+---+---+---+---+---+
4 | 1 | 0 | 1 | 1 | 0 |
+---+---+---+---+---+

I am storing only the upper triangular portion plus the diagonal
continuously
in memory, so I have:

00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

The two digits numbers above the schematic correspond to the indexes
in the
original matrix. Assuming each position is one char, instead of N^2
bytes the
packed way is using only N*(N+1)/2 bytes. The trade-off is of course
having a
more costly way to access the matrix. Instead of simply doing M[J],
I need
a function like this:

#define N 10000 /* size of matrix */
...
1 unsigned char is_set(unsigned int i, unsigned int j)
2 {
3 unsigned int tmp;
4 unsigned int real_index;
5
6 if (i > j) {
7 tmp = j;
8 j = i;
9 i = tmp;
10 }
11
12 real_index = (i * N) - (i * (i - 1) / 2) + j - i;
13 return packed_array[real_index];
14 }

I came up with the formula to convert the indexes from the traditional
array
to the packed array and it seems to work.

Then I realized it is a waste to use a char to store a single
information, so
I decided to the bits of a char to do this, then instead of using N
(N-1)/2
bytes, I would need only 1/8 of this memory. Of course this leads to
bit
manipulation and all. The final result, with a few tweaks, is the
following
function:

1 unsigned char is_set(int i, int j)
2 {
3 register unsigned int idx;
4
5 if (i > j) {
6 register int temp = i;
7 i = j;
8 j = temp;
9 }
10
11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
char) 0x80);
17 }

(I reordered the expression to reduce the number of multiplications
and use
shift operations.)

This function is called a lot of times, and this is having some
noticeable
impact in the overall performance. The memory saving is really huge,
but
the performance is making me wonder if this is really the best choice.

Does anyone know how could I make it faster, or a clever way to solve
this
problem, or any suggestions on the code I posted above? Only
limitation
is that they only work properly if a char is 8 bits long, but I can
live with that.

Thanks,
Gustavo

PS. Sorry if the schematics look messed up, with a monospace font they
looked nice.

Gustavo,

I suggest you use an unsigned int as your base (foundation) type rather
than char. Many processors are now geared up to fetch integers at a
fast rate. Some processors actually slow down when processing chars.

For example, if a processor has a 32-bit integer, and you are using
8-bit chars, it would fetch 8 bits at a time using char and 32-bits in
one fetch using integer.

Also, you may want to ask yourself if the space saving is necessary
and actually benefits the product. Many "desktop" applications are not
memory sensitive, so more time should be spent on correctness of the
program.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library

N

#### Nobody

(I reordered the expression to reduce the number of multiplications
and use shift operations.)

The compiler ought to be able to do this for you. Certainly,
multiplication or division by 2 will be translated to a shift for any sane
compiler.
This function is called a lot of times, and this is having some
noticeable impact in the overall performance. The memory saving is
really huge, but the performance is making me wonder if this is really
the best choice.

Does anyone know how could I make it faster, or a clever way to solve this
problem, or any suggestions on the code I posted above? Only limitation
is that they only work properly if a char is 8 bits long, but I can live
with that.

If the multiplication (i*(N/2-i-1)) is the bottleneck, using a
pre-calculated array of row pointers may be faster.

Also try:

Returning "int" rather than "char".

Using an array of "int"s (or "long"s) rather than "char"s.

Returning 0/1 rather than 0/0x80, e.g.:

return (packed_array[idx>>3] >> (idx&7)) & 1;

Returning zero/non-zero, e.g.:

return (packed_array[idx>>3] & (1 << (idx&7)));

Defining the function as "static inline ..." in a header file.

Finally: storing values as bits rather than bytes will use 1/8th the
memory, but using a triangular matrix will save less than a half, so maybe
the latter isn't worth it.

G

#### Gustavo Rondina

On Sat, 27 Jun 2009 16:36:06 -0700, Barry Schwarz wrote:

[snip code from previous post]
What happened to lines 12 through 15?

Sorry, I had copied a previous version of the function with a few extra
unnecessary lines which were removed only after I inserted the line
numbers by hand -- next time 'cat -n' certainly will be used.

You need to look at the generated assembly code.

N<<1 is a compile-time constant and should not be computed
each time. If your compiler is not optimizing this, you might want to
compute it once and store it in a static variable.

I did this and it made the function a bit faster, thanks for the tip.

Assuming packed_array has type unsigned char, the arithmetic
in the return statement is still performed using int and then converted
to unsigned. Is the generated code spending a lot of time converting
between the types? If so, you might consider using unsigned int instead
8 bit ones.

Just now Thomas Matthews suggested this as well in another post, thank
you both. I am using integers now, and it is indeed a little faster.

In my experience, multiplication is the long tent pole in your
i * ((N << 1) - i - 1)
This is the same as
i * (N << 1) - i *(i+1)
If you increase N until it is a power of 2, the first
multiplication can be replaced by another shift.
The second multiplication depends only on i. i will always be
between 0 and N-1. You could build an N element array of int where
element y is set to y*y+y.

This is interesting, I will give it a try. I don't know though if I can
afford increasing N to a power of 2. Only tests will tell.

For now I have the following. The computations involving constants have
been removed from the function and the result is stored somewhere else.

Again, this assumes that a unsigned int is 32 bits long.

1 unsigned int is_set (unsigned int i, unsigned int j)
2 {
3 register unsigned int idx;
4
5 if (i > j)
6 {
7 register int temp = i;
8 i = j;
9 j = temp;
10 }
11
12 /* Original expression:
13
14 idx = i * N - ((i * (i - 1)) / 2) + (j - i)
15
16 In the code below NN == (N << 1) - 1 is a variable
17 initialized somwhere else.
18 */
19
20 idx = NN - i;
21 idx *= i;
22 idx >>= 1;
23 idx += j;
24
25 /* packed_array is a array of unsigned integers */
26 return ((packed_array[idx >> 5] << (idx & 31)) & 0x80000000);
27 }

Cheers,

G

#### Gustavo Rondina

I suggest you use an unsigned int as your base (foundation) type rather
than char. Many processors are now geared up to fetch integers at a
fast rate. Some processors actually slow down when processing chars.

I did this modification and indeed it is a bit faster (something around
10% to 15%, but I haven't run extensive benchmarks yet.)

[snip]
Also, you may want to ask yourself if the space saving is necessary and
actually benefits the product. Many "desktop" applications are not
memory sensitive, so more time should be spent on correctness of the
program.

I see. Well, this is a scientific application for simulating the dynamics
of system of molecules. The memory requirement scales with N, which is the
number of molecules/atoms/beads one wants to simulate. In the future the
plan is to port this application to run on GPUs, where memory is not as
abundant as in the regular PC, so I am trying to design the application to
use as less memory as possible so large systems can be simulated.

With the highly parallel design of GPUs the expensive computations of
calculating indexes and such might be hidden by the enormous amount of
threads one can run, but nonetheless I would like to get a fast ``serial''
(i.e., CPU) version as well.

The replies to my original question were very good and gave me some ideas
I might want to try.

Cheers,

B

#### Barry Schwarz

On Sat, 27 Jun 2009 16:36:06 -0700, Barry Schwarz wrote:
snip

This is interesting, I will give it a try. I don't know though if I can
afford increasing N to a power of 2. Only tests will tell.

Your N was defined as 10,000. If this was just an approximation as
opposed to a precise requirement, why not use 8,192 as your
approximation. Otherwise your are up to 16,384. The extra bits in
packed_array won't matter that much. Even the extra 6K elements of
the int array should not be that big a deal given how much space you
saved by packing the boolean values.
For now I have the following. The computations involving constants have
been removed from the function and the result is stored somewhere else.

Again, this assumes that a unsigned int is 32 bits long.

1 unsigned int is_set (unsigned int i, unsigned int j)
2 {
3 register unsigned int idx;
4
5 if (i > j)
6 {
7 register int temp = i;

This should be unsigned also.
8 i = j;
9 j = temp;
10 }

snip

F

#### Flash Gordon

Gustavo Rondina wrote:

I see. Well, this is a scientific application for simulating the dynamics
of system of molecules. The memory requirement scales with N, which is the
number of molecules/atoms/beads one wants to simulate. In the future the
plan is to port this application to run on GPUs, where memory is not as
abundant as in the regular PC, so I am trying to design the application to
use as less memory as possible so large systems can be simulated.

With the highly parallel design of GPUs the expensive computations of
calculating indexes and such might be hidden by the enormous amount of
threads one can run, but nonetheless I would like to get a fast ``serial''
(i.e., CPU) version as well.

The replies to my original question were very good and gave me some ideas
I might want to try.

A few points you might want to consider as you will be switching to a
different processor...

The size of unsigned in could well be different on the final target
processor.

There are differences between compilers on what they can optimise
efficiently.

There are difference between processors on what opperations are efficient.

So, all your careful benchmarking and speed optimisations (and some
space optimisations) could well be undone by switching to a different
processor.

So I would recommend keeping all of the different versions of the code
and getting hold of the real target (or a simulator for it) and re-doing
the work on the final target system!