number of machine instructions - for algorithm measurement


ben

hello,

I'm trying to answer a "prove an upper bound on the number of machine
instructions required to process M connections on N objects" exercise
from an algorithms book. I don't understand a very basic detail of how
I should go about answering it. Here's a simplified example question
and example programme to hopefully get at what I'd like to know:



Prove an upper bound on the number of machine instructions required to
process N objects using the below programme. You may
assume, for example, that any C assignment statement always requires
less than c instructions, for some fixed constant c.

/* a very simple and silly example "algorithm" */
#include <stdio.h>
#define N 10000

int main(void)
{
    int n = 0;
    int x[3] = { 1, 4, 9 };
    long unsigned total = 0;

    while (n < N) {
        total += x[n % 3];
        n++;
    }

    printf("%lu\n", total);
    return 0;
}



I know the "upper bound" aspect of the question isn't particularly
meaningful with this programme: once the value of N has been decided,
there's nothing variable about the number of times the loop is run --
the upper and lower bounds are the same, one value fixed by N. But I
wanted to leave the question as unchanged as possible.

To answer that exercise, is it a case of assigning a certain number of
machine instructions to each and every C statement, or operation, in
the loop, adding those up, and then multiplying that by the number of
times the loop runs?

It's the "You may assume, for example, that any C assignment statement
always requires less than c instructions, for some fixed constant c."
part I'm unsure about.

How exactly would you go about answering the above exercise (ignoring
the silliness of it)?

thanks, ben.
 

Tom St Denis

ben said:

Not about CLC...

What do you do in big-Oh with constants?

e.g. the inner loop takes O(35n) [35 opcodes per pass, not 35
statements though]; that's O(n) time. Then the outer loop is
O(n log n), so it's the product, O(n^2 log n) time.

... so the question is really moot.
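To make the product rule concrete, here's a toy loop nest (the bounds
and variables are invented purely for illustration, assuming some int
n is already declared):

int i, j, k;
long count = 0;

for (i = 0; i < n; i++)           /* n passes                 */
    for (j = n; j > 1; j /= 2)    /* times log n passes each  */
        for (k = 0; k < n; k++)   /* times an O(n) body       */
            count++;              /* => O(n^2 log n) overall  */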

If you want really precise timings you're gonna have to sample.
Compilers vary too much, and they have too many options, to enumerate
all possible results.

Tom
 

ben

Tom said:
e.g. inner loop takes O(35n) [35 opcodes per pass, not 35 statements
though]

Yes, that's the part I'm asking about and hoping to find out
about. My question is about this part of the exercise question:

"You may assume, for example, that any C assignment statement always
requires less than c instructions, for some fixed constant c."

How would you carry out that part with the example loop -- just one
pass of the loop is all that's needed.

The higher-level aspect of the exercise question that you're talking
about, Tom, I understand -- that's why I gave a very simple bit of
code, labelled it silly, and stated "once the value of N has been
decided, there's nothing variable about the number of times the loop
is run -- the upper and lower bounds are the same, one value fixed by
N. But I wanted to leave the question as unchanged as possible" --
there's no need for the variable-run aspect to get at what I'm asking
about.

So basically, how do you apply "You may assume, for example, that any C
assignment statement always requires less than c instructions, for some
fixed constant c." to

while (n < N) {
    total += x[n % 3];
    n++;
}

?

thanks, ben.
 

Michael Mair

ben said:
Tom said:
e.g. the inner loop takes O(35n) [35 opcodes per pass, not 35
statements though]

Yes, that's the part I'm asking about and hoping to find out
about. My question is about this part of the exercise question:

"You may assume, for example, that any C assignment statement always
requires less than c instructions, for some fixed constant c."

How would you carry out that part with the example loop -- just one
pass of the loop is all that's needed.

I am not sure how this is meant -- plainly just looking for assignments
seems stupid. "Operation" is more meaningful.

[snip]
So basically, how do you apply "You may assume, for example, that any C
assignment statement always requires less than c instructions, for some
fixed constant c." to

while (n < N) {
    total += x[n % 3];
    n++;
}

Assignments: 2 per pass -> 2*c*N. Complete nonsense.

Operations:
n % 3 -> temp1
x + temp1 -> temp2
total + *temp2 -> total
or:
*temp2 -> temp3
total + temp3 -> total

n + 1 -> n

and of course
n < N -> stay in the loop
else -> leave the loop
or an intermediate assignment of n < N -> temp0 and a test afterwards.

That would be [5..7]*c*N.
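
Written out against the loop itself, the same count looks like this
(the per-line tallies are my reading of the C; a real compiler may
emit more or fewer instructions per operation):

while (n < N) {            /* compare + conditional branch: 1-2 ops  */
    total += x[n % 3];     /* %, index arithmetic, load, add, store:
                              3-4 ops                                */
    n++;                   /* increment: 1 op                        */
}
/* 5 to 7 operations per pass, each taking fewer than c machine
   instructions, so the whole loop needs fewer than 7*c*N
   instructions -- an upper bound of the form the exercise asks for. */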

IMO, the exercise should be stated in a clearer way.


Cheers
Michael
 

ben

Michael Mair said:
ben said:
Tom said:
e.g. the inner loop takes O(35n) [35 opcodes per pass, not 35
statements though]

Yes, that's the part I'm asking about and hoping to find out
about. My question is about this part of the exercise question:

"You may assume, for example, that any C assignment statement always
requires less than c instructions, for some fixed constant c."

How would you carry out that part with the example loop -- just one
pass of the loop is all that's needed.

I am not sure how this is meant -- plainly just looking for assignments
seems stupid. "Operation" is more meaningful.

[snip]
So basically, how do you apply "You may assume, for example, that any C
assignment statement always requires less than c instructions, for some
fixed constant c." to

while (n < N) {
    total += x[n % 3];
    n++;
}

Assignments: 2 per pass -> 2*c*N. Complete nonsense.

Operations:
n % 3 -> temp1
x + temp1 -> temp2
total + *temp2 -> total
or:
*temp2 -> temp3
total + temp3 -> total

n + 1 -> n

and of course
n < N -> stay in the loop
else -> leave the loop
or an intermediate assignment of n < N -> temp0 and a test afterwards.

That would be [5..7]*c*N.

Right, great.
IMO, the exercise should be stated in a clearer way.

Great. It was unclear to me also, partly for the 'assignment' reason,
which I also thought was odd, and partly from not understanding exactly
how it should be broken down. So I wasn't sure if it was unclear to me
through my lack of knowledge or due to the exercise being badly
written -- a bit of both, as it turns out.

thanks very much for the reply, Michael,

ben.
 

Mark F. Haigh

ben wrote:

Prove an upper bound on the number of machine instructions required to
process N objects using the below programme. You may
assume, for example, that any C assignment statement always requires
less than c instructions, for some fixed constant c.

/* a very simple and silly example "algorithm" */
#include <stdio.h>
#define N 10000

int main(void)
{
    int n = 0;
    int x[3] = { 1, 4, 9 };
    long unsigned total = 0;

    while (n < N) {
        total += x[n % 3];
        n++;
    }

    printf("%lu\n", total);
    return 0;
}

In my opinion, this is a waste of time. Instruction count no longer
has much of a bearing on how much time a program takes to execute. To
use your example program, with N set to 0x4FFFFFFFu:


[mark@icepick ~]$ gcc -save-temps -Wall -O3 -o bar bar.c && time ./bar && wc -l bar.s
1968526669

real    0m16.683s
user    0m16.682s
sys     0m0.002s
44 bar.s    <- lines of assembly


[mark@icepick ~]$ gcc -save-temps -Wall -O3 -march=pentium4 -o bar bar.c && time ./bar && wc -l bar.s
1968526669

real    0m4.313s
user    0m4.312s
sys     0m0.001s
78 bar.s    <- lines of assembly


As you can see, changing a single compiler flag roughly doubled the
assembly output but resulted in a 4x speed improvement. I picked the
quickest of 3 trials each, for the record.

It may be an interesting academic exercise, but its real-world
applicability is questionable. Probably not the answer you wanted, but
an answer nonetheless.


Mark F. Haigh
(e-mail address removed)
 

ben

Mark F. Haigh said:
As you can see, changing a single compiler flag roughly doubled the
assembly output but resulted in a 4x speed improvement. I picked the
quickest of 3 trials each, for the record.

It may be an interesting academic exercise, but its real-world
applicability is questionable. Probably not the answer you wanted, but
an answer nonetheless.

Mark,

Well, it sounds like a good point. If you're right, then it's a very
good answer and one I'd definitely want. The exercise is just from a
book; I don't have to do it as stated. If the way the exercise details
isn't so useful and there's a much more useful way, then I'd prefer to
learn and do it in a way that's useful. I do intend to use this stuff
in the real world -- that's what I'm learning it for.

I'm just wondering, though: if you're comparing operation counts of
code compiled with the same settings, and the different programmes are
written in a similar fashion (that is, for example, not comparing one
programme that unrolls a loop against another that doesn't -- code
written in the same style), then is the number of operations a
reasonable way to compare? Basically, use the same base settings and
style. (Although "code written in the same style" is probably pretty
ambiguous.)

Further thought: different operations take different amounts of time,
don't they? I think so, and that would make operation counting less
useful. For instance, I think accessing a word of memory that isn't in
the CPU's cache can take several clock cycles, whereas if the data is
in cache the access might take just one clock cycle. Correct?

So basically you're saying timing is the way to measure algorithms?

thanks, ben
 

Mark F. Haigh

ben said:
Well, it sounds like a good point. If you're right, then it's a very
good answer and one I'd definitely want. The exercise is just from a
book; I don't have to do it as stated. If the way the exercise details
isn't so useful and there's a much more useful way, then I'd prefer to
learn and do it in a way that's useful. I do intend to use this stuff
in the real world -- that's what I'm learning it for.

Good decision. Can't remember who said, "In theory, theory and
practice are the same; in practice, they're different."
I'm just wondering, though: if you're comparing operation counts of
code compiled with the same settings, and the different programmes are
written in a similar fashion (that is, for example, not comparing one
programme that unrolls a loop against another that doesn't -- code
written in the same style), then is the number of operations a
reasonable way to compare? Basically, use the same base settings and
style. (Although "code written in the same style" is probably pretty
ambiguous.)

These days, it would be unwise to manually unroll a loop without
timings to back it up. Compilers typically understand the associated
branching costs (whether cheap or expensive, the cost of mispredicted
branches, etc.) and pipeline scheduling better than most people do.
Further thought: different operations take different amounts of time,
don't they? I think so, and that would make operation counting less
useful.

Yes, although exactly how different the numbers can be depends on the
processor architecture. Some instructions can be executed in parallel.
Some instructions may need to be internally broken down into smaller
micro-ops, each of which is individually executed.
For instance, I think accessing a word of memory that isn't in the
CPU's cache can take several clock cycles, whereas if the data is in
cache the access might take just one clock cycle. Correct?

Worse, much of the time, unless you're dealing with older or embedded
processors. It can take hundreds or even thousands of cycles on newer
fast machines. Typically it's not one cycle from cache, either, and it
will depend on which cache it's coming from.

For the sake of illustration, let's revisit a previous topic -- loop
unrolling. Let's say you manually unrolled a bunch of loops all over
your code. Are you sure that this new bulk of code does not force some
other critical part of code out of cache, resulting in several hundred
cycles of wastage every time it has to be re-fetched? Now the several
cycles you saved look like nothing. Think of the compiler as your
friend that helps you find the sweet spot amongst many different
competing factors.
So basically you're saying timing is the way to measure algorithms?

Absolutely. First, optimize algorithmic complexity. Second, implement
it cleanly. Third, play with your compiler settings/hints and profile
it. Fourth and last, which is rarely needed, micro-optimize.

The fastest linked list in the world will get creamed by a slow hash or
tree if it's used inappropriately. Compilers do a better job of
optimizing if the code is implemented cleanly. Profiling shows you
where the bulk of _time_ is spent, so you don't go chasing false leads
and trying to optimize things that will give you a 0.00001% increase.
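
If you want a rough first number before reaching for a real profiler,
standard C's clock() is enough for a crude harness. A minimal sketch,
where work() is just a placeholder for whatever you're measuring:

#include <stdio.h>
#include <time.h>

/* Stand-in for the code under test -- substitute your own. */
static unsigned long work(unsigned long n)
{
    unsigned long total = 0;
    unsigned long i;

    for (i = 0; i < n; i++)
        total += i % 3;
    return total;
}

int main(void)
{
    clock_t start = clock();
    unsigned long result = work(100000000UL);
    clock_t end = clock();

    /* clock() reports elapsed CPU time in ticks; dividing the
       difference by CLOCKS_PER_SEC converts it to seconds. */
    printf("result=%lu, cpu time=%.3f s\n",
           result, (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}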


Mark F. Haigh
(e-mail address removed)
 

ben

Mark,

OK, thanks for the info.

ben.



 
