question about macro?

M

MBALOVER

Hi all,

Actually I want my code to be


for (i=0;i<N;i++)
{
array2[i*30]=array1;
array2[i*30+1]=array1;
array2[i*30+2]=array1;
array2[i*30+3]=array1;
..................
..................
..................
array2[i*30+28]=array1;
array2[i*30+29]=array1;
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?
(the listing out such as the above example will be very bad in the
case I need, say, array2[i*1000+0]=array1; to ...
array2[i*1000+999]=array1;)

And please note that for some reasons of my device, I do not want to
use another loop such as

for (i=0;i<N;i++)
for (j=0;j<30;j++)
array2[i*30+j]=array1;


If I write a macro as follows:

#define myMacro (array1, array2) \
for ( j=0;j<30;j++) \
array2[i*30+j]=array1; \



Does it help? I doubt it because it will be the same as the case with
two loops that I want to avoid.

Thanks a lot.
 
I

Ian Collins

MBALOVER said:
Hi all,

Actually I want my code to be


for (i=0;i<N;i++)
{
array2[i*30]=array1;
array2[i*30+1]=array1;
array2[i*30+2]=array1;
array2[i*30+3]=array1;
..................
..................
..................
array2[i*30+28]=array1;
array2[i*30+29]=array1;
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?


Forget macros, they won't help here.

How about memset?

(untested)

memset( array2+i*30, array1, 30*sizeof(array2[0]) );
 
J

John Bode

Hi all,

Actually I want my code to be

for (i=0;i<N;i++)
{
    array2[i*30]=array1;
    array2[i*30+1]=array1;
    array2[i*30+2]=array1;
    array2[i*30+3]=array1;
.................
.................
.................
    array2[i*30+28]=array1;
    array2[i*30+29]=array1;

}

If there anyway to use macros to do it instead for listing out
manually in the code like above?
(the listing out such as the above example will be very bad in the
case I need, say,  array2[i*1000+0]=array1;  to ...
array2[i*1000+999]=array1;)

And please note that for some reasons of my device, I do not want to
use another loop such as

for (i=0;i<N;i++)
for (j=0;j<30;j++)
array2[i*30+j]=array1;

If I write a macro as follows:

#define myMacro (array1, array2)          \
for ( j=0;j<30;j++)                                  \
     array2[i*30+j]=array1;                    \

Does it help? I doubt it because it will be the same as the case with
two loops that I want to avoid.

Thanks a lot.


I'm curious as to why an inner loop would be that bad.

Frankly, you're stuck; if you absolutely cannot use an inner loop,
then you'll have to write out each initialization manually. You could
write another program to automate the process for you, such as:

printf("for (i = 0; i < N; i++)\n{\n");
for (j = 0; j < K; j++) /** where K is the number of items */
{
printf(" array[i*30+%d] = array;\n", j);
}
printf("}\n");

and then paste that output into your code.

Another alternative would be to look into a variation of Duff's device
(see http://en.wikipedia.org/wiki/Duff's_device), where you use an
inner loop but partially unroll it.

for (i = 0; i < N; i++)
{
size_t j = (K + 7)/8;
switch(j % 8)
{
case 0 : do {
case 7 : array[i*30+j+7] = array;
case 6 : array[i*30+j+6] = array;
case 5 : array[i*30+j+5] = array;
case 4 : array[i*30+j+4] = array;
case 3 : array[i*30+j+3] = array;
case 2 : array[i*30+j+2] = array;
case 1 : array[i*30+j+1] = array;
} while (--j > 0);
}
}
 
K

Keith Thompson

MBALOVER said:
Actually I want my code to be


for (i=0;i<N;i++)
{
array2[i*30]=array1;
array2[i*30+1]=array1;
array2[i*30+2]=array1;
array2[i*30+3]=array1;
.................
.................
.................
array2[i*30+28]=array1;
array2[i*30+29]=array1;
}


It certainly looks like a job for a nested for loop (but see below).
If there anyway to use macros to do it instead for listing out
manually in the code like above?
(the listing out such as the above example will be very bad in the
case I need, say, array2[i*1000+0]=array1; to ...
array2[i*1000+999]=array1;)


There's not really any way for the preprocessor, given a number N, to
generate N statements.
And please note that for some reasons of my device, I do not want to
use another loop such as

for (i=0;i<N;i++)
for (j=0;j<30;j++)
array2[i*30+j]=array1;


So you insist on having, say, 30 distinct assignment statements; an
equivalent loop that iterates 30 times and performs exactly the same
assignments isn't good enough.

It might help if we know *how* your device imposes this requirement.

If you just write the nested for loop, how exactly does that not work?
If I write a macro as follows:

#define myMacro (array1, array2) \
for ( j=0;j<30;j++) \
array2[i*30+j]=array1; \



Does it help? I doubt it because it will be the same as the case with
two loops that I want to avoid.


Right, the generated code will almost certainly be the same whether
the loop is written out or generated by a macro.

What you're doing is loop unrolling. It's an optimization technique
that can result in faster but larger code. (In some cases, the
code can be slower just because it's larger, due to cache effects.)
Doing loop unrolling at the source level would usually be considered
premature optimization; a good optimizing compiler should be able
to do this for you.

But if you really need to do this, you might consider writing another
tool that generates C code from an input specification. (m4 is a
popular preprocessing tool; I don't know whether it's powerful enough
for this, but it's worth looking into.)
 
K

Keith Thompson

Ian Collins said:
MBALOVER said:
Actually I want my code to be


for (i=0;i<N;i++)
{
array2[i*30]=array1;
array2[i*30+1]=array1;
array2[i*30+2]=array1;
array2[i*30+3]=array1;
..................
..................
..................
array2[i*30+28]=array1;
array2[i*30+29]=array1;
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?


Forget macros, they won't help here.

How about memset?

(untested)

memset( array2+i*30, array1, 30*sizeof(array2[0]) );


No, memset sets each byte of the target to the same value. Unless
array1 and array2 are byte arrays, it won't help here.

Something similar to memset that works on elements bigger than bytes
would do the trick. But the obvious way to implement such a function
is to use a for loop, which the OP wants to avoid for some unknown
reason.
 
I

Ian Collins

Keith said:
Ian Collins said:
MBALOVER said:
Actually I want my code to be


for (i=0;i<N;i++)
{
array2[i*30]=array1;
array2[i*30+1]=array1;
array2[i*30+2]=array1;
array2[i*30+3]=array1;
..................
..................
..................
array2[i*30+28]=array1;
array2[i*30+29]=array1;
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?

Forget macros, they won't help here.

How about memset?

(untested)

memset( array2+i*30, array1, 30*sizeof(array2[0]) );


No, memset sets each byte of the target to the same value. Unless
array1 and array2 are byte arrays, it won't help here.


Bugger, I forgot about that... I agree with your other posting.
 
M

MBALOVER

Thanks a lot for responses.

Let me explain why I do not want to use a inner loop.

for (i=0;i<N;i++)
for (j=0;j<M;j++)
array2[i*M+j]=array1;

for each for iteration of a for loop, we need a comparison and one
addition. In my system, each comparison takes 3 cycles and one
addition takes 1 cycle.
So each iteration takes 4 cycles.

If I use a inner loop, for each i, I need 4*M additional cycles. In
my case M , N both are large numbers and the system's time budget is
limited.

That is why I do not want to use this way.






MBALOVER said:
Actually I want my code to be
for (i=0;i<N;i++)
{
    array2[i*30]=array1;
    array2[i*30+1]=array1;
    array2[i*30+2]=array1;
    array2[i*30+3]=array1;
.................
.................
.................
    array2[i*30+28]=array1;
    array2[i*30+29]=array1;
}


It certainly looks like a job for a nested for loop (but see below).
If there anyway to use macros to do it instead for listing out
manually in the code like above?
(the listing out such as the above example will be very bad in the
case I need, say,  array2[i*1000+0]=array1;  to ...
array2[i*1000+999]=array1;)


There's not really any way for the preprocessor, given a number N, to
generate N statements.
And please note that for some reasons of my device, I do not want to
use another loop such as
for (i=0;i<N;i++)
for (j=0;j<30;j++)
array2[i*30+j]=array1;


So you insist on having, say, 30 distinct assignment statements; an
equivalent loop that iterates 30 times and performs exactly the same
assignments isn't good enough.

It might help if we know *how* your device imposes this requirement.

If you just write the nested for loop, how exactly does that not work?
If I write a macro as follows:
#define myMacro (array1, array2)          \
for ( j=0;j<30;j++)                                  \
     array2[i*30+j]=array1;                    \

Does it help? I doubt it because it will be the same as the case with
two loops that I want to avoid.

Right, the generated code will almost certainly be the same whether
the loop is written out or generated by a macro.

What you're doing is loop unrolling.  It's an optimization technique
that can result in faster but larger code.  (In some cases, the
code can be slower just because it's larger, due to cache effects.)
Doing loop unrolling at the source level would usually be considered
premature optimization; a good optimizing compiler should be able
to do this for you.

But if you really need to do this, you might consider writing another
tool that generates C code from an input specification.  (m4 is a
popular preprocessing tool; I don't know whether it's powerful enough
for this, but it's worth looking into.)

--
Keith Thompson (The_Other_Keith) (e-mail address removed)  <http://www.ghoti.net/~kst>
Nokia
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
 
I

Ian Collins

MBALOVER wrote:

[please don't top post]
Thanks a lot for responses.

Let me explain why I do not want to use a inner loop.

for (i=0;i<N;i++)
for (j=0;j<M;j++)
array2[i*M+j]=array1;

for each for iteration of a for loop, we need a comparison and one
addition. In my system, each comparison takes 3 cycles and one
addition takes 1 cycle.
So each iteration takes 4 cycles.

If I use a inner loop, for each i, I need 4*M additional cycles. In
my case M , N both are large numbers and the system's time budget is
limited.

That is why I do not want to use this way.


Have you looked at the generated code with full optimisation on your
compiler? All those I've used will perform some degree of loop
unrolling, so you will only suffer an extra

(4*M)/L

where L is the number of iterations inlined by the compiler.
 
B

bartc

MBALOVER said:
Thanks a lot for responses.

Let me explain why I do not want to use a inner loop.

for (i=0;i<N;i++)
for (j=0;j<M;j++)
array2[i*M+j]=array1;

for each for iteration of a for loop, we need a comparison and one
addition. In my system, each comparison takes 3 cycles and one
addition takes 1 cycle.
So each iteration takes 4 cycles.

If I use a inner loop, for each i, I need 4*M additional cycles. In
my case M , N both are large numbers and the system's time budget is
limited.

That is why I do not want to use this way.


How about a partially unrolled inner loop?

So in the the inner loop it does 10 assignments at a time, and the loop
executes M/10 times (with any odd assignments added at the end). Then the
overhead might only be 4*M/10 cycles, and you only ever have to write out
in, full, 10 assignments (or however many you want to do in one go).
 
E

Eric Sosman

Hi all,

Actually I want my code to be


for (i=0;i<N;i++)
{
array2[i*30]=array1;
array2[i*30+1]=array1;
array2[i*30+2]=array1;
array2[i*30+3]=array1;
.................
.................
.................
array2[i*30+28]=array1;
array2[i*30+29]=array1;
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?


Use a second loop nested inside the first.
(the listing out such as the above example will be very bad in the
case I need, say, array2[i*1000+0]=array1; to ...
array2[i*1000+999]=array1;)

And please note that for some reasons of my device, I do not want to
use another loop such as

for (i=0;i<N;i++)
for (j=0;j<30;j++)
array2[i*30+j]=array1;


Use a second loop nested inside-- oh, wait, sorry. It's
the right solution, but if you don't want to use it ... What
are these mysterious "reasons" of yours? Tell us about them,
and maybe somebody will have a better idea than writing line
after line after repetitive line of code.
If I write a macro as follows:

#define myMacro (array1, array2) \
for ( j=0;j<30;j++) \
array2[i*30+j]=array1; \

Does it help? I doubt it because it will be the same as the case with
two loops that I want to avoid.


It does the right thing, which is to use a second loop-- oh,
wait, we've been through that already.

The preprocessor has no iterative constructs.
 
I

ImpalerCore

Hi all,

Actually I want my code to be

for (i=0;i<N;i++)
{
    array2[i*30]=array1;
    array2[i*30+1]=array1;
    array2[i*30+2]=array1;
    array2[i*30+3]=array1;
.................
.................
.................
    array2[i*30+28]=array1;
    array2[i*30+29]=array1;

}

If there anyway to use macros to do it instead for listing out
manually in the code like above?
(the listing out such as the above example will be very bad in the
case I need, say,  array2[i*1000+0]=array1;  to ...
array2[i*1000+999]=array1;)

And please note that for some reasons of my device, I do not want to
use another loop such as

for (i=0;i<N;i++)
for (j=0;j<30;j++)
array2[i*30+j]=array1;

If I write a macro as follows:

#define myMacro (array1, array2)          \
for ( j=0;j<30;j++)                                  \
     array2[i*30+j]=array1;                    \

Does it help? I doubt it because it will be the same as the case with
two loops that I want to avoid.

Thanks a lot.


Just a question. Is the performance gain worth the maintenance and
code size (if embedded) costs that you're adding? I haven't
encountered a scenario where doing something like this is worth the
imposed maintenance cost, so I'm curious what the scenario is.
 
E

Eric Sosman

Thanks a lot for responses.

Let me explain why I do not want to use a inner loop.

for (i=0;i<N;i++)
for (j=0;j<M;j++)
array2[i*M+j]=array1;

for each for iteration of a for loop, we need a comparison and one
addition.


Maybe, but optimizing compilers are already pretty good
at unrolling loops -- especially loops whose iteration count
are compile-time constants.
addition. In my system, each comparison takes 3 cycles and one
addition takes 1 cycle.
So each iteration takes 4 cycles.

It was reasonable to count cycles as recently as, oh, twenty
years ago. Since then, CPU's have become far more intricate and
cycle counts have become extremely difficult to compute. Unless
you're using an unusually simple processor (single-issue, little
or no pipelining, at most one level of memory cache, maybe only
a one- or two-slot store buffer, ...), these cycle counts are
only loosely related to elapsed time.

Your example showed thirty assignments, and you indicated you
might need as many as a thousand. How many cycles will you spend
in wait states with the CPU twiddling its thumbs while RAM ever so
slowly retrieves a fresh batch of instructions?
If I use a inner loop, for each i, I need 4*M additional cycles. In
my case M , N both are large numbers and the system's time budget is
limited.

That is why I do not want to use this way.

Write it both ways (John Bode's suggestion of a "helper"
program will save some time), and MEASURE the difference. Cycle
counting is nearly impossible unless you happen to know that most
of the data is already in CPU registers; a couple of cache misses
(even just in L1) can disrupt the whole calculation. Measure!
 
I

Ian Collins

Ian said:
MBALOVER wrote:

[please don't top post]
Thanks a lot for responses.

Let me explain why I do not want to use a inner loop.

for (i=0;i<N;i++)
for (j=0;j<M;j++)
array2[i*M+j]=array1;

for each for iteration of a for loop, we need a comparison and one
addition. In my system, each comparison takes 3 cycles and one
addition takes 1 cycle.
So each iteration takes 4 cycles.

If I use a inner loop, for each i, I need 4*M additional cycles. In
my case M , N both are large numbers and the system's time budget is
limited.

That is why I do not want to use this way.


Have you looked at the generated code with full optimisation on your
compiler? All those I've used will perform some degree of loop
unrolling, so you will only suffer an extra

(4*M)/L

where L is the number of iterations inlined by the compiler.


A wacky alternative if you have a C++ compiler for your target is to use
template meta-programming to generate the code.
 
B

Ben Bacarisse

Ian Collins said:
MBALOVER said:
Hi all,

Actually I want my code to be


for (i=0;i<N;i++)
{
array2[i*30]=array1;
array2[i*30+1]=array1;
array2[i*30+2]=array1;
array2[i*30+3]=array1;
..................
..................
..................
array2[i*30+28]=array1;
array2[i*30+29]=array1;
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?


Forget macros, they won't help here.

How about memset?

(untested)

memset( array2+i*30, array1, 30*sizeof(array2[0]) );


As Keith has pointed out memset won't do. memcpy, on the other hand,
can do it in 5 calls:

array2[i+30] = array1;
memcpy(array2+i*30 + 1, array2[i+30], sizeof array2);
memcpy(array2+i*30 + 2, array2[i+30], 2 * sizeof array2);
memcpy(array2+i*30 + 4, array2[i+30], 4 * sizeof array2);
memcpy(array2+i*30 + 8, array2[i+30], 8 * sizeof array2);
memcpy(array2+i*30 + 16, array2[i+30], 14 * sizeof array2);

Note the final multiplier (if I've go it right). The initial
assignment can be skipped if array2 and array1 have the same type (by
copying from array1 rather than array2) but it is likely that it is
faster to do the first few by assignment anyway before switching to
memcpy. The only way to be sure is to measure. I'd also measure the
plain loop because as so many people have pointed out, it is often
hard to predict the costs of loops these days.

The 1000 element copy takes 10 calls. Of course even these calls can
be put in a loop that where the overhead is likely to be small
compared to the work done. The result would be for more maintainable
than huge pile of assignments.
 
S

Seebs

Let me explain why I do not want to use a inner loop.
Okay.

for each for iteration of a for loop, we need a comparison and one
addition. In my system, each comparison takes 3 cycles and one
addition takes 1 cycle.

Stop right there.

Write it the simple, clear, and correct way.

Now test: IS IT ACTUALLY TOO SLOW?

If the answer is "no", you're done.

DO NOT TRY TO OPTIMIZE WITHOUT TESTING FIRST. EVER.

You may actually find out that the for loop is faster. Why? Any of
several reasons. Optimizers are pretty good about inner loops; they
get a lot of practice. Furthermore, it is not uncommon for a processor
to be such that an inner loop will be faster than a fully-unrolled
loop, because the fully-unrolled loop breaks the cache. You think
branches are slow, wait until you see the cost of waiting for a single
byte of memory that wasn't in cache...

Seriously, though, you should NOT be trying to optimize stuff like this
at your level of experience. When you're regularly advising the experts
in the group on performance, then you could possibly benefit from talking
about cycle counts. At your level of experience, you're just wasting
your time.

-s
 
I

Ike Naar

Hi all,

Actually I want my code to be


for (i=0;i<N;i++)
{
array2[i*30]=array1;
array2[i*30+1]=array1;
array2[i*30+2]=array1;
array2[i*30+3]=array1;
.................
.................
.................
array2[i*30+28]=array1;
array2[i*30+29]=array1;
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?
(the listing out such as the above example will be very bad in the
case I need, say, array2[i*1000+0]=array1; to ...
array2[i*1000+999]=array1;)

And please note that for some reasons of my device, I do not want to
use another loop such as

for (i=0;i<N;i++)
for (j=0;j<30;j++)
array2[i*30+j]=array1;


You could to this:

for (j = 0; j < 30 * N; j++)
{
array2[j] = array1[j/30];
}
 
F

Flash Gordon

Eric said:
Thanks a lot for responses.

Let me explain why I do not want to use a inner loop.

for (i=0;i<N;i++)
for (j=0;j<M;j++)
array2[i*M+j]=array1;

for each for iteration of a for loop, we need a comparison and one
addition.


Maybe, but optimizing compilers are already pretty good
at unrolling loops -- especially loops whose iteration count
are compile-time constants.


Yes, but possibly not enough. The OP needs to check this.

Personally, if the OP needs to do the full unroll, I would be inclined
to write a small C program to run on the ost system to generate the code
to be compiled for the target.
It was reasonable to count cycles as recently as, oh, twenty
years ago. Since then, CPU's have become far more intricate and
cycle counts have become extremely difficult to compute. Unless
you're using an unusually simple processor (single-issue, little
or no pipelining, at most one level of memory cache, maybe only
a one- or two-slot store buffer, ...), these cycle counts are
only loosely related to elapsed time.

It's not *that* long ago that I was programming an embedded processor
that had a trick three stage pipeline (all instructions when through all
three stages even if they only did something in one of them), no
out-of-order execution and no cache. So excluding wait states for memory
all instructions took in three cycles. However, in practice, due to the
pipelining, all instructions effectively took one cycle except in the
presence of a branch. The processor, and it's derivatives, are still
available today.
Your example showed thirty assignments, and you indicated you
might need as many as a thousand. How many cycles will you spend
in wait states with the CPU twiddling its thumbs while RAM ever so
slowly retrieves a fresh batch of instructions?

On the embedded system I was processing the memory and clock speeds
where carefully selected so there were no wait states.
Write it both ways (John Bode's suggestion of a "helper"
program will save some time), and MEASURE the difference. Cycle
counting is nearly impossible unless you happen to know that most
of the data is already in CPU registers; a couple of cache misses
(even just in L1) can disrupt the whole calculation. Measure!

I agree with measure, although *if* it is a simple enough processor (and
they are still out there) you *can* count.
 
S

Squeamizh

Eric said:
Thanks a lot for responses.
Let me explain why I do not want to use a inner loop.
for (i=0;i<N;i++)
for (j=0;j<M;j++)
array2[i*M+j]=array1;
for each for iteration of a for loop, we need a comparison and one
addition.

    Maybe, but optimizing compilers are already pretty good
at unrolling loops -- especially loops whose iteration count
are compile-time constants.

Yes, but possibly not enough. The OP needs to check this.

Personally, if the OP needs to do the full unroll, I would be inclined
to write a small C program to run on the ost system to generate the code
to be compiled for the target.
    It was reasonable to count cycles as recently as, oh, twenty
years ago.  Since then, CPU's have become far more intricate and
cycle counts have become extremely difficult to compute.  Unless
you're using an unusually simple processor (single-issue, little
or no pipelining, at most one level of memory cache, maybe only
a one- or two-slot store buffer, ...), these cycle counts are
only loosely related to elapsed time.

It's not *that* long ago that I was programming an embedded processor
that had a trick three stage pipeline (all instructions when through all
three stages even if they only did something in one of them), no
out-of-order execution and no cache. So excluding wait states for memory
all instructions took in three cycles. However, in practice, due to the
pipelining, all instructions effectively took one cycle except in the
presence of a branch. The processor, and it's derivatives, are still
available today.


What processor? Just curious...
 
P

Phil Carmody

Ben Bacarisse said:
Ian Collins said:
MBALOVER said:
Hi all,

Actually I want my code to be

for (i=0;i<N;i++)
{
array2[i*30]=array1;
..................
array2[i*30+29]=array1;
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?


Forget macros, they won't help here.

How about memset?


On many installations, you'll find that's a macro! ;-)
(untested)

memset( array2+i*30, array1, 30*sizeof(array2[0]) );


As Keith has pointed out memset won't do. memcpy, on the other hand,
can do it in 5 calls:

array2[i+30] = array1;
memcpy(array2+i*30 + 1, array2[i+30], sizeof array2);
memcpy(array2+i*30 + 2, array2[i+30], 2 * sizeof array2);
memcpy(array2+i*30 + 4, array2[i+30], 4 * sizeof array2);
memcpy(array2+i*30 + 8, array2[i+30], 8 * sizeof array2);
memcpy(array2+i*30 + 16, array2[i+30], 14 * sizeof array2);

Note the final multiplier (if I've go it right).


14's right. The +30's should be *30's.

I timed this against a naive loop for int, long long, and
double types. Conclusions were inconclusive. For double,
the naive loop was faster when compiled for plain i386, but
slower when 'optimised' for the actual athlon-xp architecture.
For int and long long, the memcpys were faster than the loop.
The memcpys were in fact simply unrolled completely.

Given that there was a case where the memcpys were measurably
slower, it's not worth blindly going for the clever technique
until you've both found that the naive loop is slowing you down
and measured the two to make sure that the optimisation is in
fact an improvement. Writing a clever version for speed without
measuring whether it's faster or not is dumb. And frequently
done, alas.

Phil
 
B

Ben Bacarisse

Phil Carmody said:
Ben Bacarisse said:
Ian Collins said:
MBALOVER wrote:
Hi all,

Actually I want my code to be

for (i=0;i<N;i++)
{
array2[i*30]=array1;
..................
array2[i*30+29]=array1;
}

If there anyway to use macros to do it instead for listing out
manually in the code like above?

Forget macros, they won't help here.

How about memset?


On many installations, you'll find that's a macro! ;-)
(untested)

memset( array2+i*30, array1, 30*sizeof(array2[0]) );


As Keith has pointed out memset won't do. memcpy, on the other hand,
can do it in 5 calls:

array2[i+30] = array1;
memcpy(array2+i*30 + 1, array2[i+30], sizeof array2);
memcpy(array2+i*30 + 2, array2[i+30], 2 * sizeof array2);
memcpy(array2+i*30 + 4, array2[i+30], 4 * sizeof array2);
memcpy(array2+i*30 + 8, array2[i+30], 8 * sizeof array2);
memcpy(array2+i*30 + 16, array2[i+30], 14 * sizeof array2);

Note the final multiplier (if I've go it right).


14's right. The +30's should be *30's.


Of course. Thanks.
I timed this against a naive loop for int, long long, and
double types. Conclusions were inconclusive. For double,
the naive loop was faster when compiled for plain i386, but
slower when 'optimised' for the actual athlon-xp architecture.
For int and long long, the memcpys were faster than the loop.
The memcpys were in fact simply unrolled completely.

Was this for the *30 example or the latter *1000 example?
Given that there was a case where the memcpys were measurably
slower, it's not worth blindly going for the clever technique
until you've both found that the naive loop is slowing you down
and measured the two to make sure that the optimisation is in
fact an improvement. Writing a clever version for speed without
measuring whether it's faster or not is dumb. And frequently
done, alas.

The measuring is important but you can't find out that it is not
always faster without writing it! Having written it, I'd be tempted
to keep it around in some sort of conditional compilation so that it
can be switched in when/if measurements show that it is better on some
architecture/implementation.

One does ave to weight the maintenance cost, so there must be at least
the possibility of a benefit.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,901
Latest member
Noble71S45

Latest Threads

Top