For loop equivalent with the preprocessor

N

Nudge

I have an array, and an unrolled loop which looks like this:

do_something(A[0]);
do_something(A[1]);
....
do_something(A[255]);

I thought: why should I type so much? I should write a macro.

So I was looking to write something along the lines of:

#define write_256(array) #for(i,0,255,do_something(foo);

But I coudn't find a way to do it...

Is there a way to write a macro that the C pre-processor will expand
to k instructions, and be able to reference the iteration number?

I thought of using recursive macro calls along the lines of:

#define write_more(n,array) \
#if (n>0) do_something(foo[256-n]); \
write_more(n-1,array)

but it seems implementers don't like recursive macro calls :)

Any insight?
 
J

Joona I Palaste

Nudge said:
I have an array, and an unrolled loop which looks like this:
do_something(A[0]);
do_something(A[1]);
...
do_something(A[255]);

I thought: why should I type so much? I should write a macro.

I think: Why are you bothering with this? Re-roll your loop into a
genuine for loop, set your compiler's optimisation to "pretty darned
high", and it might unroll the loop for you when generating code.
Remember the rules about optimisation at source code level:
(1) Don't do it.
(2) (For experts only!) Don't do it yet.
 
N

Nudge

Joona said:
Nudge said:
I have an array, and an unrolled loop which looks like this:
do_something(A[0]);
do_something(A[1]);
...
do_something(A[255]);


I thought: why should I type so much? I should write a macro.


I think: Why are you bothering with this? Re-roll your loop into a
genuine for loop, set your compiler's optimisation to "pretty darned
high", and it might unroll the loop for you when generating code.
Remember the rules about optimisation at source code level:
(1) Don't do it.
(2) (For experts only!) Don't do it yet.

ARGH!

I knew I should have said that I was smarter than my compiler :)

I am not writing any serious code, only fooling around.

Is there a way to beat the preprocessor into submission?
 
E

Emmanuel Delahaye

In 'comp.lang.c' said:
I have an array, and an unrolled loop which looks like this:

do_something(A[0]);
do_something(A[1]);
...
do_something(A[255]);

I thought: why should I type so much? I should write a macro.

So I was looking to write something along the lines of:

#define write_256(array) #for(i,0,255,do_something(foo);

But I coudn't find a way to do it...


There is no portable way do do this directly, but there is a trick anyway (I
got it from c.l.c, BTW):

Define the list of constant values:

/* values.itm */
ITEM(0)
ITEM(1)
ITEM(2)
ITEM(3)
....
ITEM(254)
ITEM(255)

This file can be automatically created with a trivial C program.

Now, in your application, you do the following:

/* myapp.c */
<...>

#define ITEM(a) \
do_something(A[a]);

#include "values.itm"
#undef ITEM

That's all folks.

I use massively this trick to define repetitive code. It's very powerful and
makes solid code when mastered.
 
S

Scott Fluhrer

Nudge said:
Joona said:
Nudge said:
I have an array, and an unrolled loop which looks like this:
do_something(A[0]);
do_something(A[1]);
...
do_something(A[255]);


I thought: why should I type so much? I should write a macro.


I think: Why are you bothering with this? Re-roll your loop into a
genuine for loop, set your compiler's optimisation to "pretty darned
high", and it might unroll the loop for you when generating code.
Remember the rules about optimisation at source code level:
(1) Don't do it.
(2) (For experts only!) Don't do it yet.

ARGH!

I knew I should have said that I was smarter than my compiler :)

I am not writing any serious code, only fooling around.

Is there a way to beat the preprocessor into submission?
If you absolutely insist...

#define X0(n) do_something(A[n]);
#define X1(n) X0(n) X0(n+1)
#define X2(n) X1(n) X1(n+2)
#define X3(n) X2(n) X2(n+4)
#define X4(n) X3(n) X3(n+8)
#define X5(n) X4(n) X4(n+16)
#define X6(n) X5(n) X5(n+32)
#define X7(n) X6(n) X6(n+64)
#define X8(n) X7(n) X7(n+128)

X8(0)

If you ask me, Joona's suggestion is considerably more reasonable.
 
P

Pushker Pradhan

I've never done this kind of stuff, but isn't there the option -funroll-loop
(gcc)?
I've seen lot of software use this option? Joona also suggested something
similar I think?

--
Pushkar Pradhan
Emmanuel Delahaye said:
In 'comp.lang.c' said:
I have an array, and an unrolled loop which looks like this:

do_something(A[0]);
do_something(A[1]);
...
do_something(A[255]);

I thought: why should I type so much? I should write a macro.

So I was looking to write something along the lines of:

#define write_256(array) #for(i,0,255,do_something(foo);

But I coudn't find a way to do it...


There is no portable way do do this directly, but there is a trick anyway (I
got it from c.l.c, BTW):

Define the list of constant values:

/* values.itm */
ITEM(0)
ITEM(1)
ITEM(2)
ITEM(3)
...
ITEM(254)
ITEM(255)

This file can be automatically created with a trivial C program.

Now, in your application, you do the following:

/* myapp.c */
<...>

#define ITEM(a) \
do_something(A[a]);

#include "values.itm"
#undef ITEM

That's all folks.

I use massively this trick to define repetitive code. It's very powerful and
makes solid code when mastered.

--
-ed- (e-mail address removed) [remove YOURBRA before answering me]
The C-language FAQ: http://www.eskimo.com/~scs/C-faq/top.html
<blank line>
FAQ de f.c.l.c : http://www.isty-info.uvsq.fr/~rumeau/fclc/
 
B

bd

Pushker said:
I've never done this kind of stuff, but isn't there the option
-funroll-loop (gcc)?
I've seen lot of software use this option? Joona also suggested something
similar I think?

In gcc, -funroll-loop causes loops to be unrolled into, well, non-loops.
Example:
int counter = 0;
int i;
for(i = 0; i < 5; i++)
counter++;

Becomes:
int counter = 0;
int i;
counter++;
counter++;
counter++;
counter++;
counter++;
i = 5;

Of course, this is all done in assembler, so the result's probably different
somewhat. But that's the idea behind unrolling loops - to eliminate the
looping overhead.
 
G

Glen Herrmannsfeldt

Nudge said:
I have an array, and an unrolled loop which looks like this:

do_something(A[0]);
do_something(A[1]);
...
do_something(A[255]);

I thought: why should I type so much? I should write a macro.

So I was looking to write something along the lines of:

#define write_256(array) #for(i,0,255,do_something(foo);

But I coudn't find a way to do it...

Is there a way to write a macro that the C pre-processor will expand
to k instructions, and be able to reference the iteration number?


The PL/I preprocessor can do this.

%DO I=0 TO 255;
do_something(A);
%END
%DEACTIVATE I

(The last is the equivalent of #undef, except that you can %ACTIVATE it and
get the previous value back again. The % indicates preprocessor
statements.)

I have wondered about the decrease in preprocessor power as computers get
faster and computer memory gets larger. Consider the progression from PL/I
to C to Java in terms of preprocessor power.

There is, of course, no rule against using preprocessors for other than the
intended language.

-- glen
 
E

Emmanuel Delahaye

Pushker Pradhan said:
I've never done this kind of stuff, but isn't there the option
-funroll-loop (gcc)?
I've seen lot of software use this option? Joona also suggested
something similar I think?

If your compiler has the option, it's fine, but what if it has not? You have
not always the choice of your compiler. It can belong to a customer's
production process where nothing is allowed to change, or belong to some
peculiar platform... who knows... Better to stay portable. Always (when
possible).
 
E

Emmanuel Delahaye

Glen Herrmannsfeldt said:
The PL/I preprocessor can do this.

%DO I=0 TO 255;
do_something(A);
%END
%DEACTIVATE I

(The last is the equivalent of #undef, except that you can %ACTIVATE it and
get the previous value back again. The % indicates preprocessor
statements.)

I have wondered about the decrease in preprocessor power as computers get
faster and computer memory gets larger. Consider the progression from PL/I
to C to Java in terms of preprocessor power.


I recall that the Intel macro assembler for 8051 I used in the 90's was very
powerfull, and did support iterative constructs like that.
 
G

Glen Herrmannsfeldt

Emmanuel Delahaye said:
Glen Herrmannsfeldt said:
The PL/I preprocessor can do this.

%DO I=0 TO 255;
do_something(A);
%END
%DEACTIVATE I

(The last is the equivalent of #undef, except that you can %ACTIVATE it and
get the previous value back again. The % indicates preprocessor
statements.)

I have wondered about the decrease in preprocessor power as computers get
faster and computer memory gets larger. Consider the progression from PL/I
to C to Java in terms of preprocessor power.


I recall that the Intel macro assembler for 8051 I used in the 90's was very
powerfull, and did support iterative constructs like that.


The OS/360 assemblers supported a variety of constructs, including if and
goto. The while loop could be, and often was, written using macros. I used
to know people using the OS/360 compilers to compile for most microcomputers
by writing a macro for each possible opcode, and then post-processing the
output files.

I never used the 8051, though.

-- glen
 
N

Nudge

Scott said:
Joona I Palaste wrote:

Nudge <[email protected]> scribbled the following:


I have an array, and an unrolled loop which looks like this:


do_something(A[0]);
do_something(A[1]);
...
do_something(A[255]);


I thought: why should I type so much? I should write a macro.


I think: Why are you bothering with this? Re-roll your loop into a
genuine for loop, set your compiler's optimisation to "pretty darned
high", and it might unroll the loop for you when generating code.
Remember the rules about optimisation at source code level:
(1) Don't do it.
(2) (For experts only!) Don't do it yet.

ARGH!

I knew I should have said that I was smarter than my compiler :)

I am not writing any serious code, only fooling around.

Is there a way to beat the preprocessor into submission?

If you absolutely insist...

#define X0(n) do_something(A[n]);
#define X1(n) X0(n) X0(n+1)
#define X2(n) X1(n) X1(n+2)
#define X3(n) X2(n) X2(n+4)
#define X4(n) X3(n) X3(n+8)
#define X5(n) X4(n) X4(n+16)
#define X6(n) X5(n) X5(n+32)
#define X7(n) X6(n) X6(n+64)
#define X8(n) X7(n) X7(n+128)

X8(0)

If you ask me, Joona's suggestion is considerably more reasonable.

What about SHA-256's inner loop, where unrolling 8 times and
symbolically renaming the 8 variables for every iteration allows one
to write only 4 assignments?

I don't think there are many optimizing compilers out there which
are THAT smart...

Anyway, thanks for the log2(n) solution. I find it fairly elegant.
 
S

Scott Fluhrer

Nudge said:
Scott said:
Joona I Palaste wrote:


Nudge <[email protected]> scribbled the following:


I have an array, and an unrolled loop which looks like this:


do_something(A[0]);
do_something(A[1]);
...
do_something(A[255]);


I thought: why should I type so much? I should write a macro.


I think: Why are you bothering with this? Re-roll your loop into a
genuine for loop, set your compiler's optimisation to "pretty darned
high", and it might unroll the loop for you when generating code.
Remember the rules about optimisation at source code level:
(1) Don't do it.
(2) (For experts only!) Don't do it yet.

ARGH!

I knew I should have said that I was smarter than my compiler :)

I am not writing any serious code, only fooling around.

Is there a way to beat the preprocessor into submission?

If you absolutely insist...

#define X0(n) do_something(A[n]);
#define X1(n) X0(n) X0(n+1)
#define X2(n) X1(n) X1(n+2)
#define X3(n) X2(n) X2(n+4)
#define X4(n) X3(n) X3(n+8)
#define X5(n) X4(n) X4(n+16)
#define X6(n) X5(n) X5(n+32)
#define X7(n) X6(n) X6(n+64)
#define X8(n) X7(n) X7(n+128)

X8(0)

If you ask me, Joona's suggestion is considerably more reasonable.

What about SHA-256's inner loop, where unrolling 8 times and
symbolically renaming the 8 variables for every iteration allows one
to write only 4 assignments?

I don't think there are many optimizing compilers out there which
are THAT smart...

Anyway, thanks for the log2(n) solution. I find it fairly elegant.
Actually, elegant it ain't. Some obvious problems:

- A rather annoying amount of changes are required if you change the 256 to,
say, 300
- It's fairly easy to have a typo somewhere in all the #define's. It'll
compile just fine, but perhaps miss do_something(A[153]);
- It is certainly less readable than an explicit for loop

In my opinion, just stick with the explicit for loop

--
poncho
 
N

Nudge

What about SHA-256's inner loop, where unrolling 8 times and
symbolically renaming the 8 variables for every iteration allows one
to write only 4 assignments?

I don't think there are many optimizing compilers out there which
are THAT smart...

Anyway, thanks for the log2(n) solution. I find it fairly elegant.

Actually, elegant it ain't. Some obvious problems:

- A rather annoying amount of changes are required if you change the 256 to,
say, 300
- It's fairly easy to have a typo somewhere in all the #define's. It'll
compile just fine, but perhaps miss do_something(A[153]);
- It is certainly less readable than an explicit for loop

In my opinion, just stick with the explicit for loop

Forgive me if I insist :)

What about SHA-256's inner loop? (You're a cryptographer, you know
what I'm talking about.)

With no unrolling, one must turn a-h into an array (which puts
additional pressure on the compiler) and use the iteration number as
a shifting index, modulo 8.

With 8 bodies of the loop, the modulo operations can go away.

I doubt whether even a few optimizing compilers are smart enough to
do that (I will test with gcc).
 
N

Nudge

Anyway, thanks for the log2(n) solution. I find it fairly

Actually, elegant it ain't. Some obvious problems:

- A rather annoying amount of changes are required if you change
the 256 to, say, 300

- It's fairly easy to have a typo somewhere in all the #define's.
It'll compile just fine, but perhaps miss do_something(A[153]);

- It is certainly less readable than an explicit for loop

In my opinion, just stick with the explicit for loop

Forgive me if I insist.

What about SHA-256's inner loop? (You're a cryptographer, you know
what I'm talking about.)

I want to write it like this:

T1 = SIGMA1(e) + Ch(e,f,g) + W[t] + K[t] + h;
T2 = SIGMA0(a) + Maj(a,b,c);
d += T1;
h = T1 + T2;

In the next iteration, I perform symbolic renaming, i.e.
h is now a
a is now b
b is now c
c is now d
d is now e
e is now f
f is now g
g is now h

If I unroll 8 times, symbolic renaming comes for free.
 
S

Scott Fluhrer

Nudge said:
Anyway, thanks for the log2(n) solution. I find it fairly
elegant.

Actually, elegant it ain't. Some obvious problems:

- A rather annoying amount of changes are required if you change
the 256 to, say, 300

- It's fairly easy to have a typo somewhere in all the #define's.
It'll compile just fine, but perhaps miss do_something(A[153]);

- It is certainly less readable than an explicit for loop

In my opinion, just stick with the explicit for loop

Forgive me if I insist.

What about SHA-256's inner loop? (You're a cryptographer, you know
what I'm talking about.)
If you're asking whether unrolling a loop is ever justified, well, on
occasion it is. You just have to remember the costs:

- It can be less obvious what's going on, making maintenance harder.
- If you unroll too much, the loop might not fit in cache. This leads to
slower performance.
I want to write it like this:

T1 = SIGMA1(e) + Ch(e,f,g) + W[t] + K[t] + h;
T2 = SIGMA0(a) + Maj(a,b,c);
d += T1;
h = T1 + T2;

In the next iteration, I perform symbolic renaming, i.e.
h is now a
a is now b
b is now c
c is now d
d is now e
e is now f
f is now g
g is now h

If I unroll 8 times, symbolic renaming comes for free.
On the other hand, it's nontrivial to use my hack to do the renaming for
you. I'd suggest that if you unroll it 8 times, you explicitly list the 8
invocations.
 
E

Eric Sosman

Scott said:
Nudge said:
Anyway, thanks for the log2(n) solution. I find it fairly
elegant.

Actually, elegant it ain't. Some obvious problems:

- A rather annoying amount of changes are required if you change
the 256 to, say, 300

- It's fairly easy to have a typo somewhere in all the #define's.
It'll compile just fine, but perhaps miss do_something(A[153]);

- It is certainly less readable than an explicit for loop

In my opinion, just stick with the explicit for loop

Forgive me if I insist.

What about SHA-256's inner loop? (You're a cryptographer, you know
what I'm talking about.)
If you're asking whether unrolling a loop is ever justified, well, on
occasion it is. You just have to remember the costs:

- It can be less obvious what's going on, making maintenance harder.
- If you unroll too much, the loop might not fit in cache. This leads to
slower performance.
I want to write it like this:

T1 = SIGMA1(e) + Ch(e,f,g) + W[t] + K[t] + h;
T2 = SIGMA0(a) + Maj(a,b,c);
d += T1;
h = T1 + T2;

In the next iteration, I perform symbolic renaming, i.e.
h is now a
a is now b
b is now c
c is now d
d is now e
e is now f
f is now g
g is now h

If I unroll 8 times, symbolic renaming comes for free.
On the other hand, it's nontrivial to use my hack to do the renaming for
you. I'd suggest that if you unroll it 8 times, you explicitly list the 8
invocations.

For the case in point, you could write something like

#define STEP(a,b,c,d,e,f,g,h) \
T1 = SIGMA1(e) + Ch(e,f,g) + W[t] + K[t] + h; \
T2 = SIGMA0(a) + Maj(a,b,c); \
d += T1; \
h = T1 + T2
...
while (whatever) {
STEP(a,b,c,d,e,f,g,h);
STEP(b,c,d,e,f,g,h,a);
...
STEP(h,a,b,c,d,e,f,g);
}

(Hmmm: If speed is the objective, couldn't the `W[t] + K[t]'
piece be replaced by a precomputed `WplusK[t]'?)
 
N

Nudge

What about SHA-256's inner loop? (You're a cryptographer, you know
If you're asking whether unrolling a loop is ever justified, well, on
occasion it is. You just have to remember the costs:

- It can be less obvious what's going on, making maintenance harder.
- If you unroll too much, the loop might not fit in cache. This leads to
slower performance.

Point taken.
I want to write it like this:

T1 = SIGMA1(e) + Ch(e,f,g) + W[t] + K[t] + h;
T2 = SIGMA0(a) + Maj(a,b,c);
d += T1;
h = T1 + T2;

In the next iteration, I perform symbolic renaming, i.e.
h is now a
a is now b
b is now c
c is now d
d is now e
e is now f
f is now g
g is now h

If I unroll 8 times, symbolic renaming comes for free.

On the other hand, it's nontrivial to use my hack to do the renaming for
you. I'd suggest that if you unroll it 8 times, you explicitly list the 8
invocations.

Well, if I change a-h to V[8] then
a is V[(64-t)%8]
b is V[(65-t)%8]
...
h is V[(71-t)%8]

All the indices can be computed at compile time. A good compiler
should be able to scalarize my array back to a-h.

Yes, 8 is not too bad, but a complete unroll requires 64
(approximately 6 KB of code, it will fit inside the L1 I$ of my
Athlon with room to spare).
 
N

Nudge

For the case in point, you could write something like
#define STEP(a,b,c,d,e,f,g,h) \
T1 = SIGMA1(e) + Ch(e,f,g) + W[t] + K[t] + h; \
T2 = SIGMA0(a) + Maj(a,b,c); \
d += T1; \
h = T1 + T2
...
while (whatever) {
STEP(a,b,c,d,e,f,g,h);
STEP(b,c,d,e,f,g,h,a);
...
STEP(h,a,b,c,d,e,f,g);
}

(Hmmm: If speed is the objective, couldn't the `W[t] + K[t]'
piece be replaced by a precomputed `WplusK[t]'?)

This is precisely what I am doing. Thanks for the hint :)
 
S

Scott Fluhrer

Nudge said:
If you're asking whether unrolling a loop is ever justified, well, on
occasion it is. You just have to remember the costs:

- It can be less obvious what's going on, making maintenance harder.
- If you unroll too much, the loop might not fit in cache. This leads to
slower performance.

Point taken.
I want to write it like this:

T1 = SIGMA1(e) + Ch(e,f,g) + W[t] + K[t] + h;
T2 = SIGMA0(a) + Maj(a,b,c);
d += T1;
h = T1 + T2;

In the next iteration, I perform symbolic renaming, i.e.
h is now a
a is now b
b is now c
c is now d
d is now e
e is now f
f is now g
g is now h

If I unroll 8 times, symbolic renaming comes for free.

On the other hand, it's nontrivial to use my hack to do the renaming for
you. I'd suggest that if you unroll it 8 times, you explicitly list the 8
invocations.

Well, if I change a-h to V[8] then
a is V[(64-t)%8]
b is V[(65-t)%8]
...
h is V[(71-t)%8]

All the indices can be computed at compile time. A good compiler
should be able to scalarize my array back to a-h.

Yes, 8 is not too bad, but a complete unroll requires 64
(approximately 6 KB of code, it will fit inside the L1 I$ of my
Athlon with room to spare).
Yes, it'll fit, but it'll push most everything else out. That means that
once you finish the SHA-256, you'll start generating lots of cache misses
:-(
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top