Storage of symmetric boolean matrix

Discussion in 'C Programming' started by Gustavo Rondina, Jun 27, 2009.

  1. Hello all,

    First of all, this might be somewhat off-topic for some of you, so let
    me
    apology in advance.

    I am writing a program that requires a huge 2D symmetric matrix of
    booleans
    (i.e., either 0 or 1). My first thought was to store it in a packed
    way, so
    the memory needed would be about half of the amount required to store
    the
    whole array in a traditional fashion. My packing is described as
    follows.

    Given this symmetric matrix:

    0 1 2 3 4
    +---+---+---+---+---+
    0 | 1 | 0 | 0 | 1 | 1 |
    +---+---+---+---+---+
    1 | 0 | 0 | 1 | 1 | 0 |
    +---+---+---+---+---+
    2 | 0 | 1 | 1 | 1 | 1 |
    +---+---+---+---+---+
    3 | 1 | 1 | 1 | 1 | 1 |
    +---+---+---+---+---+
    4 | 1 | 0 | 1 | 1 | 0 |
    +---+---+---+---+---+

    I am storing only the upper triangular portion plus the diagonal
    continuously
    in memory, so I have:

    00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

    The two digits numbers above the schematic correspond to the indexes
    in the
    original matrix. Assuming each position is one char, instead of N^2
    bytes the
    packed way is using only N*(N+1)/2 bytes. The trade-off is of course
    having a
    more costly way to access the matrix. Instead of simply doing M[J],
    I need
    a function like this:

    #define N 10000 /* size of matrix */
    ...
    1 unsigned char is_set(unsigned int i, unsigned int j)
    2 {
    3 unsigned int tmp;
    4 unsigned int real_index;
    5
    6 if (i > j) {
    7 tmp = j;
    8 j = i;
    9 i = tmp;
    10 }
    11
    12 real_index = (i * N) - (i * (i - 1) / 2) + j - i;
    13 return packed_array[real_index];
    14 }

    I came up with the formula to convert the indexes from the traditional
    array
    to the packed array and it seems to work.

    Then I realized it is a waste to use a char to store a single
    information, so
    I decided to the bits of a char to do this, then instead of using N
    (N-1)/2
    bytes, I would need only 1/8 of this memory. Of course this leads to
    bit
    manipulation and all. The final result, with a few tweaks, is the
    following
    function:

    1 unsigned char is_set(int i, int j)
    2 {
    3 register unsigned int idx;
    4
    5 if (i > j) {
    6 register int temp = i;
    7 i = j;
    8 j = temp;
    9 }
    10
    11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
    16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
    char) 0x80);
    17 }

    (I reordered the expression to reduce the number of multiplications
    and use
    shift operations.)

    This function is called a lot of times, and this is having some
    noticeable
    impact in the overall performance. The memory saving is really huge,
    but
    the performance is making me wonder if this is really the best choice.

    Does anyone know how could I make it faster, or a clever way to solve
    this
    problem, or any suggestions on the code I posted above? Only
    limitation
    is that they only work properly if a char is 8 bits long, but I can
    live with that.

    Thanks,
    Gustavo

    PS. Sorry if the schematics look messed up, with a monospace font they
    looked nice.
    Gustavo Rondina, Jun 27, 2009
    #1
    1. Advertising

  2. Gustavo Rondina

    kid joe Guest

    On Sat, 27 Jun 2009 14:07:55 -0700, Gustavo Rondina wrote:
    > Hello all,
    >
    > First of all, this might be somewhat off-topic for some of you, so let
    > me
    > apology in advance.
    >
    > I am writing a program that requires a huge 2D symmetric matrix of
    > booleans
    > (i.e., either 0 or 1). My first thought was to store it in a packed
    > way, so
    > the memory needed would be about half of the amount required to store
    > the
    > whole array in a traditional fashion. My packing is described as
    > follows.
    >
    > Given this symmetric matrix:
    >
    > 0 1 2 3 4
    > +---+---+---+---+---+
    > 0 | 1 | 0 | 0 | 1 | 1 |
    > +---+---+---+---+---+
    > 1 | 0 | 0 | 1 | 1 | 0 |
    > +---+---+---+---+---+
    > 2 | 0 | 1 | 1 | 1 | 1 |
    > +---+---+---+---+---+
    > 3 | 1 | 1 | 1 | 1 | 1 |
    > +---+---+---+---+---+
    > 4 | 1 | 0 | 1 | 1 | 0 |
    > +---+---+---+---+---+
    >
    > I am storing only the upper triangular portion plus the diagonal
    > continuously
    > in memory, so I have:
    >
    > 00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
    > +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    > | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
    > +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    >
    > The two digits numbers above the schematic correspond to the indexes
    > in the
    > original matrix. Assuming each position is one char, instead of N^2
    > bytes the
    > packed way is using only N*(N+1)/2 bytes. The trade-off is of course
    > having a
    > more costly way to access the matrix. Instead of simply doing M[J],
    > I need
    > a function like this:
    >
    > #define N 10000 /* size of matrix */
    > ...
    > 1 unsigned char is_set(unsigned int i, unsigned int j)
    > 2 {
    > 3 unsigned int tmp;
    > 4 unsigned int real_index;
    > 5
    > 6 if (i > j) {
    > 7 tmp = j;
    > 8 j = i;
    > 9 i = tmp;
    > 10 }
    > 11
    > 12 real_index = (i * N) - (i * (i - 1) / 2) + j - i;
    > 13 return packed_array[real_index];
    > 14 }
    >
    > I came up with the formula to convert the indexes from the traditional
    > array
    > to the packed array and it seems to work.
    >
    > Then I realized it is a waste to use a char to store a single
    > information, so
    > I decided to the bits of a char to do this, then instead of using N
    > (N-1)/2
    > bytes, I would need only 1/8 of this memory. Of course this leads to
    > bit
    > manipulation and all. The final result, with a few tweaks, is the
    > following
    > function:
    >
    > 1 unsigned char is_set(int i, int j)
    > 2 {
    > 3 register unsigned int idx;
    > 4
    > 5 if (i > j) {
    > 6 register int temp = i;
    > 7 i = j;
    > 8 j = temp;
    > 9 }
    > 10
    > 11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
    > 16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
    > char) 0x80);
    > 17 }
    >
    > (I reordered the expression to reduce the number of multiplications
    > and use
    > shift operations.)
    >
    > This function is called a lot of times, and this is having some
    > noticeable
    > impact in the overall performance. The memory saving is really huge,
    > but
    > the performance is making me wonder if this is really the best choice.
    >
    > Does anyone know how could I make it faster, or a clever way to solve
    > this
    > problem, or any suggestions on the code I posted above? Only
    > limitation
    > is that they only work properly if a char is 8 bits long, but I can
    > live with that.
    >
    > Thanks,
    > Gustavo
    >
    > PS. Sorry if the schematics look messed up, with a monospace font they
    > looked nice.


    Hi Gustavo,

    The bitshifts and ANDs will be very fast so probably the problem is the
    overhead of a function call. Try declaring your function as inline (or use
    a Macro).

    Cheers,
    Joe

    --
    ...................... o _______________ _,
    ` Good Evening! , /\_ _| | .-'_|
    `................, _\__`[_______________| _| (_|
    ] [ \, ][ ][ (_|
    kid joe, Jun 27, 2009
    #2
    1. Advertising

  3. On Sat, 27 Jun 2009 14:07:55 -0700 (PDT), Gustavo Rondina
    <> wrote:

    >Hello all,
    >
    >First of all, this might be somewhat off-topic for some of you, so let
    >me
    >apology in advance.
    >
    >I am writing a program that requires a huge 2D symmetric matrix of
    >booleans
    >(i.e., either 0 or 1). My first thought was to store it in a packed
    >way, so
    >the memory needed would be about half of the amount required to store
    >the
    >whole array in a traditional fashion. My packing is described as
    >follows.


    snip conventional 2-d array description

    >I am storing only the upper triangular portion plus the diagonal
    >continuously
    >in memory, so I have:


    snip description of converted linear array

    >Then I realized it is a waste to use a char to store a single
    >information, so
    >I decided to the bits of a char to do this, then instead of using N
    >(N-1)/2
    >bytes, I would need only 1/8 of this memory. Of course this leads to
    >bit
    >manipulation and all. The final result, with a few tweaks, is the
    >following
    >function:
    >
    > 1 unsigned char is_set(int i, int j)
    > 2 {
    > 3 register unsigned int idx;
    > 4
    > 5 if (i > j) {
    > 6 register int temp = i;
    > 7 i = j;
    > 8 j = temp;
    > 9 }
    >10
    >11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
    >16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
    >char) 0x80);


    What happened to lines 12 through 15?

    >17 }
    >
    >(I reordered the expression to reduce the number of multiplications
    >and use
    >shift operations.)


    >
    >This function is called a lot of times, and this is having some
    >noticeable
    >impact in the overall performance. The memory saving is really huge,
    >but
    >the performance is making me wonder if this is really the best choice.
    >


    Welcome to the common experience of being able to optimize for time or
    space but not both.

    >Does anyone know how could I make it faster, or a clever way to solve
    >this
    >problem, or any suggestions on the code I posted above? Only
    >limitation
    >is that they only work properly if a char is 8 bits long, but I can
    >live with that.


    You need to look at the generated assembly code.

    N<<1 is a compile-time constant and should not be computed
    each time. If your compiler is not optimizing this, you might want to
    compute it once and store it in a static variable.

    Assuming packed_array has type unsigned char, the arithmetic
    in the return statement is still performed using int and then
    converted to unsigned. Is the generated code spending a lot of time
    converting between the types? If so, you might consider using
    unsigned int instead of unsigned char and adjusting your shifts for 32
    bit objects instead of 8 bit ones.

    In my experience, multiplication is the long tent pole in your
    expression. Consider your expression
    i * ((N << 1) - i - 1)
    This is the same as
    i * (N << 1) - i *(i+1)
    If you increase N until it is a power of 2, the first
    multiplication can be replaced by another shift.
    The second multiplication depends only on i. i will always be
    between 0 and N-1. You could build an N element array of int where
    element y is set to y*y+y.
    Your multiplication expression then becomes
    (i << NN) - array
    with no multiplication involved. (The array reference should result
    in a shift, not a multiply, but you do need to check the assembly
    code.)

    --
    Remove del for email
    Barry Schwarz, Jun 28, 2009
    #3
  4. Gustavo Rondina <> writes:
    <snip>
    > I am storing only the upper triangular portion plus the diagonal
    > continuously
    > in memory, so I have:
    >
    > 00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
    > +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    > | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
    > +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    >
    > The two digits numbers above the schematic correspond to the indexes
    > in the
    > original matrix. Assuming each position is one char, instead of N^2
    > bytes the
    > packed way is using only N*(N+1)/2 bytes. The trade-off is of course
    > having a
    > more costly way to access the matrix. Instead of simply doing M[J],

    <snip>
    > Then I realized it is a waste to use a char to store a single
    > information, so
    > I decided to the bits of a char to do this, then instead of using N
    > (N-1)/2
    > bytes, I would need only 1/8 of this memory. Of course this leads to
    > bit
    > manipulation and all. The final result, with a few tweaks, is the
    > following
    > function:
    >
    > 1 unsigned char is_set(int i, int j)
    > 2 {
    > 3 register unsigned int idx;
    > 4
    > 5 if (i > j) {
    > 6 register int temp = i;
    > 7 i = j;
    > 8 j = temp;
    > 9 }
    > 10
    > 11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
    > 16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
    > char) 0x80);
    > 17 }
    >
    > (I reordered the expression to reduce the number of multiplications
    > and use
    > shift operations.)
    >
    > This function is called a lot of times, and this is having some
    > noticeable
    > impact in the overall performance. The memory saving is really huge,
    > but
    > the performance is making me wonder if this is really the best
    > choice.


    If you've coded the rest cleanly, it should be very simple to switch in
    a plain array and see how much better/worse it is.

    My only alternative is to use an array of pointers rather than a
    packed array. You arrange for M to be a pointer to row i. This
    replaces one part of the index calculation with an indirection. You
    do this along with any of your other plans -- storing only one half of
    the matrix and/or using bits rather than bytes.

    You might also want to try this: allocate the diagonals and not the
    rows. Why? Because you can make the symmetry explicit that way. For
    an NxN matrix you allocate an array of 2N-1 pointers P. P[0] points
    to an array of 1 element (M[0][N-1]), P[1] to two elements (M[0][N-2]
    and M[1][N-1]) and so on up to P[N-1] which points to the N elements
    of the main diagonal. The remaining Ps pointer to the *same* arrays
    but in reverse order (P[N] = P[N-2], P[N+1] = P[N-3] and so on).

    Now, make a new pointer-to-pointer variable D and set it to &P[N-1].
    D[0] is now the main diagonal and both D[1] and D[-1] are the (same)
    diagonal just above and below it. Element M[j] is now indexed as

    D[i-j][min(i,j)] i.e: D[i-j][i < j ? i : j]

    As with all optimisations, this may or may not pay off but I'd want to
    try it out.

    You can use the bit allocation trick here as well, of course.

    Note that when using an array of pointers you don't have to allocate
    the smaller arrays separately. You can so it all on one go and simply
    point to the appropriate part.

    <snip>
    --
    Ben.
    Ben Bacarisse, Jun 28, 2009
    #4
  5. Gustavo Rondina wrote:
    > Hello all,
    >
    > First of all, this might be somewhat off-topic for some of you, so let
    > me
    > apology in advance.
    >
    > I am writing a program that requires a huge 2D symmetric matrix of
    > booleans
    > (i.e., either 0 or 1). My first thought was to store it in a packed
    > way, so
    > the memory needed would be about half of the amount required to store
    > the
    > whole array in a traditional fashion. My packing is described as
    > follows.
    >
    > Given this symmetric matrix:
    >
    > 0 1 2 3 4
    > +---+---+---+---+---+
    > 0 | 1 | 0 | 0 | 1 | 1 |
    > +---+---+---+---+---+
    > 1 | 0 | 0 | 1 | 1 | 0 |
    > +---+---+---+---+---+
    > 2 | 0 | 1 | 1 | 1 | 1 |
    > +---+---+---+---+---+
    > 3 | 1 | 1 | 1 | 1 | 1 |
    > +---+---+---+---+---+
    > 4 | 1 | 0 | 1 | 1 | 0 |
    > +---+---+---+---+---+
    >
    > I am storing only the upper triangular portion plus the diagonal
    > continuously
    > in memory, so I have:
    >
    > 00 01 02 03 04 11 12 13 14 22 23 24 33 34 44
    > +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    > | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
    > +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    >
    > The two digits numbers above the schematic correspond to the indexes
    > in the
    > original matrix. Assuming each position is one char, instead of N^2
    > bytes the
    > packed way is using only N*(N+1)/2 bytes. The trade-off is of course
    > having a
    > more costly way to access the matrix. Instead of simply doing M[J],
    > I need
    > a function like this:
    >
    > #define N 10000 /* size of matrix */
    > ...
    > 1 unsigned char is_set(unsigned int i, unsigned int j)
    > 2 {
    > 3 unsigned int tmp;
    > 4 unsigned int real_index;
    > 5
    > 6 if (i > j) {
    > 7 tmp = j;
    > 8 j = i;
    > 9 i = tmp;
    > 10 }
    > 11
    > 12 real_index = (i * N) - (i * (i - 1) / 2) + j - i;
    > 13 return packed_array[real_index];
    > 14 }
    >
    > I came up with the formula to convert the indexes from the traditional
    > array
    > to the packed array and it seems to work.
    >
    > Then I realized it is a waste to use a char to store a single
    > information, so
    > I decided to the bits of a char to do this, then instead of using N
    > (N-1)/2
    > bytes, I would need only 1/8 of this memory. Of course this leads to
    > bit
    > manipulation and all. The final result, with a few tweaks, is the
    > following
    > function:
    >
    > 1 unsigned char is_set(int i, int j)
    > 2 {
    > 3 register unsigned int idx;
    > 4
    > 5 if (i > j) {
    > 6 register int temp = i;
    > 7 i = j;
    > 8 j = temp;
    > 9 }
    > 10
    > 11 idx = ((i * ((N << 1) - i - 1)) >> 1) + j;
    > 16 return ((packed_array[idx >> 3] << (idx & 7)) & (unsigned
    > char) 0x80);
    > 17 }
    >
    > (I reordered the expression to reduce the number of multiplications
    > and use
    > shift operations.)
    >
    > This function is called a lot of times, and this is having some
    > noticeable
    > impact in the overall performance. The memory saving is really huge,
    > but
    > the performance is making me wonder if this is really the best choice.
    >
    > Does anyone know how could I make it faster, or a clever way to solve
    > this
    > problem, or any suggestions on the code I posted above? Only
    > limitation
    > is that they only work properly if a char is 8 bits long, but I can
    > live with that.
    >
    > Thanks,
    > Gustavo
    >
    > PS. Sorry if the schematics look messed up, with a monospace font they
    > looked nice.


    Gustavo,

    I suggest you use an unsigned int as your base (foundation) type rather
    than char. Many processors are now geared up to fetch integers at a
    fast rate. Some processors actually slow down when processing chars.

    For example, if a processor has a 32-bit integer, and you are using
    8-bit chars, it would fetch 8 bits at a time using char and 32-bits in
    one fetch using integer.

    Also, you may want to ask yourself if the space saving is necessary
    and actually benefits the product. Many "desktop" applications are not
    memory sensitive, so more time should be spent on correctness of the
    program.

    --
    Thomas Matthews

    C++ newsgroup welcome message:
    http://www.slack.net/~shiva/welcome.txt
    C++ Faq: http://www.parashift.com/c -faq-lite
    C Faq: http://www.eskimo.com/~scs/c-faq/top.html
    alt.comp.lang.learn.c-c++ faq:
    http://www.comeaucomputing.com/learn/faq/
    Other sites:
    http://www.josuttis.com -- C++ STL Library book
    http://www.sgi.com/tech/stl -- Standard Template Library
    Thomas Matthews, Jun 28, 2009
    #5
  6. Gustavo Rondina

    Nobody Guest

    On Sat, 27 Jun 2009 14:07:55 -0700, Gustavo Rondina wrote:

    > (I reordered the expression to reduce the number of multiplications
    > and use shift operations.)


    The compiler ought to be able to do this for you. Certainly,
    multiplication or division by 2 will be translated to a shift for any sane
    compiler.

    > This function is called a lot of times, and this is having some
    > noticeable impact in the overall performance. The memory saving is
    > really huge, but the performance is making me wonder if this is really
    > the best choice.
    >
    > Does anyone know how could I make it faster, or a clever way to solve this
    > problem, or any suggestions on the code I posted above? Only limitation
    > is that they only work properly if a char is 8 bits long, but I can live
    > with that.


    If the multiplication (i*(N/2-i-1)) is the bottleneck, using a
    pre-calculated array of row pointers may be faster.

    Also try:

    Returning "int" rather than "char".

    Using an array of "int"s (or "long"s) rather than "char"s.

    Returning 0/1 rather than 0/0x80, e.g.:

    return (packed_array[idx>>3] >> (idx&7)) & 1;

    Returning zero/non-zero, e.g.:

    return (packed_array[idx>>3] & (1 << (idx&7)));

    Defining the function as "static inline ..." in a header file.

    Finally: storing values as bits rather than bytes will use 1/8th the
    memory, but using a triangular matrix will save less than a half, so maybe
    the latter isn't worth it.
    Nobody, Jun 28, 2009
    #6
  7. On Sat, 27 Jun 2009 16:36:06 -0700, Barry Schwarz wrote:

    [snip code from previous post]

    > What happened to lines 12 through 15?


    Sorry, I had copied a previous version of the function with a few extra
    unnecessary lines which were removed only after I inserted the line
    numbers by hand -- next time 'cat -n' certainly will be used.


    > You need to look at the generated assembly code.
    >
    > N<<1 is a compile-time constant and should not be computed
    > each time. If your compiler is not optimizing this, you might want to
    > compute it once and store it in a static variable.


    I did this and it made the function a bit faster, thanks for the tip.


    > Assuming packed_array has type unsigned char, the arithmetic
    > in the return statement is still performed using int and then converted
    > to unsigned. Is the generated code spending a lot of time converting
    > between the types? If so, you might consider using unsigned int instead
    > of unsigned char and adjusting your shifts for 32 bit objects instead of
    > 8 bit ones.


    Just now Thomas Matthews suggested this as well in another post, thank
    you both. I am using integers now, and it is indeed a little faster.


    > In my experience, multiplication is the long tent pole in your
    > expression. Consider your expression
    > i * ((N << 1) - i - 1)
    > This is the same as
    > i * (N << 1) - i *(i+1)
    > If you increase N until it is a power of 2, the first
    > multiplication can be replaced by another shift.
    > The second multiplication depends only on i. i will always be
    > between 0 and N-1. You could build an N element array of int where
    > element y is set to y*y+y.


    This is interesting, I will give it a try. I don't know though if I can
    afford increasing N to a power of 2. Only tests will tell.

    For now I have the following. The computations involving constants have
    been removed from the function and the result is stored somewhere else.

    Again, this assumes that a unsigned int is 32 bits long.

    1 unsigned int is_set (unsigned int i, unsigned int j)
    2 {
    3 register unsigned int idx;
    4
    5 if (i > j)
    6 {
    7 register int temp = i;
    8 i = j;
    9 j = temp;
    10 }
    11
    12 /* Original expression:
    13
    14 idx = i * N - ((i * (i - 1)) / 2) + (j - i)
    15
    16 In the code below NN == (N << 1) - 1 is a variable
    17 initialized somwhere else.
    18 */
    19
    20 idx = NN - i;
    21 idx *= i;
    22 idx >>= 1;
    23 idx += j;
    24
    25 /* packed_array is a array of unsigned integers */
    26 return ((packed_array[idx >> 5] << (idx & 31)) & 0x80000000);
    27 }

    Again, thanks for your insights.

    Cheers,

    --
    Gustavo
    Gustavo Rondina, Jun 28, 2009
    #7
  8. On Sun, 28 Jun 2009 00:18:18 -0700, Thomas Matthews wrote:

    > I suggest you use an unsigned int as your base (foundation) type rather
    > than char. Many processors are now geared up to fetch integers at a
    > fast rate. Some processors actually slow down when processing chars.


    I did this modification and indeed it is a bit faster (something around
    10% to 15%, but I haven't run extensive benchmarks yet.)

    [snip]

    > Also, you may want to ask yourself if the space saving is necessary and
    > actually benefits the product. Many "desktop" applications are not
    > memory sensitive, so more time should be spent on correctness of the
    > program.


    I see. Well, this is a scientific application for simulating the dynamics
    of system of molecules. The memory requirement scales with N, which is the
    number of molecules/atoms/beads one wants to simulate. In the future the
    plan is to port this application to run on GPUs, where memory is not as
    abundant as in the regular PC, so I am trying to design the application to
    use as less memory as possible so large systems can be simulated.

    With the highly parallel design of GPUs the expensive computations of
    calculating indexes and such might be hidden by the enormous amount of
    threads one can run, but nonetheless I would like to get a fast ``serial''
    (i.e., CPU) version as well.

    The replies to my original question were very good and gave me some ideas
    I might want to try.

    Cheers,

    --
    Gustavo
    Gustavo Rondina, Jun 28, 2009
    #8
  9. On Sun, 28 Jun 2009 08:49:37 +0000 (UTC), Gustavo Rondina
    <> wrote:

    >On Sat, 27 Jun 2009 16:36:06 -0700, Barry Schwarz wrote:


    snip

    >> In my experience, multiplication is the long tent pole in your
    >> expression. Consider your expression
    >> i * ((N << 1) - i - 1)
    >> This is the same as
    >> i * (N << 1) - i *(i+1)
    >> If you increase N until it is a power of 2, the first
    >> multiplication can be replaced by another shift.
    >> The second multiplication depends only on i. i will always be
    >> between 0 and N-1. You could build an N element array of int where
    >> element y is set to y*y+y.

    >
    >This is interesting, I will give it a try. I don't know though if I can
    >afford increasing N to a power of 2. Only tests will tell.


    Your N was defined as 10,000. If this was just an approximation as
    opposed to a precise requirement, why not use 8,192 as your
    approximation. Otherwise your are up to 16,384. The extra bits in
    packed_array won't matter that much. Even the extra 6K elements of
    the int array should not be that big a deal given how much space you
    saved by packing the boolean values.

    >
    >For now I have the following. The computations involving constants have
    >been removed from the function and the result is stored somewhere else.
    >
    >Again, this assumes that a unsigned int is 32 bits long.
    >
    > 1 unsigned int is_set (unsigned int i, unsigned int j)
    > 2 {
    > 3 register unsigned int idx;
    > 4
    > 5 if (i > j)
    > 6 {
    > 7 register int temp = i;


    This should be unsigned also.

    > 8 i = j;
    > 9 j = temp;
    > 10 }


    snip

    --
    Remove del for email
    Barry Schwarz, Jun 28, 2009
    #9
  10. Gustavo Rondina

    Flash Gordon Guest

    Gustavo Rondina wrote:

    <snip>

    > I see. Well, this is a scientific application for simulating the dynamics
    > of system of molecules. The memory requirement scales with N, which is the
    > number of molecules/atoms/beads one wants to simulate. In the future the
    > plan is to port this application to run on GPUs, where memory is not as
    > abundant as in the regular PC, so I am trying to design the application to
    > use as less memory as possible so large systems can be simulated.
    >
    > With the highly parallel design of GPUs the expensive computations of
    > calculating indexes and such might be hidden by the enormous amount of
    > threads one can run, but nonetheless I would like to get a fast ``serial''
    > (i.e., CPU) version as well.
    >
    > The replies to my original question were very good and gave me some ideas
    > I might want to try.


    A few points you might want to consider as you will be switching to a
    different processor...

    The size of unsigned in could well be different on the final target
    processor.

    There are differences between compilers on what they can optimise
    efficiently.

    There are difference between processors on what opperations are efficient.

    So, all your careful benchmarking and speed optimisations (and some
    space optimisations) could well be undone by switching to a different
    processor.

    So I would recommend keeping all of the different versions of the code
    and getting hold of the real target (or a simulator for it) and re-doing
    the work on the final target system!
    --
    Flash Gordon
    Flash Gordon, Jun 29, 2009
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. lvcargnini

    Matrix composed by two matrix

    lvcargnini, Jul 4, 2006, in forum: VHDL
    Replies:
    3
    Views:
    2,671
    Jonathan Bromley
    Jul 5, 2006
  2. sarathy
    Replies:
    2
    Views:
    660
    sarathy
    Jul 17, 2006
  3. J Leonard
    Replies:
    4
    Views:
    12,672
    Mark Space
    Jan 19, 2008
  4. Metre Meter
    Replies:
    7
    Views:
    370
    Metre Meter
    Aug 6, 2010
  5. Replies:
    32
    Views:
    589
    Arne Vajhøj
    Feb 10, 2013
Loading...

Share This Page