Add My Idea to the C++ Compiler

B

Bryan Parkoff

You may notice that switch (...) is much faster than function that can
gain a big improved performance because it only use JMP instruction however
function is required to use CALL, PUSH, and POP instruction that can be
slower.
I created three functions in an array. Array is called
pArray_Func[xx](). pArray_Func[xx]() contains three functions. "xx" can be
used to point each function.
Let say, I have over 1,000 functions. I would put "inline" to each
functions. It does not matter that each function may contain more than 50
lines. It is really BIG that C/C++ compiler tells "inline" is used for
short like 3-5 lines ONLY. I have decided to tell C/C++ compiler by forcing
"inline" there because I don't care if my code is getting bigger and bigger.
I allow 50 lines to be copied to each functions. CALL, PUSH, and POP
instructions can always be removed automatically by the C/C++ compiler
before JMP instruction can always be placed each function.
It is very neat according to my design that I wanted. If I care to keep
small code rather than big code, "inline" should not be used.
Right now, there is a big problem. I can't place "inline" into
pArray_Func[xx](). I can see why. JMP can show "JMP [EAX+EDX*4] that will
work fine. C/C++ compiler can only place "CALL [EAX+EDX*4] into
pArrayFunc[xx](). It is not designed to replace from "CALL [EAX+EDX*4]" to
"JMP [EAX+EDX*4] if "inline" is there. C/C++ compiler will always ignore
"inline".
Another option is to place 1,000 functions inside switch (...). Look
like below.

switch (xx)
{
case 1:
Func1();
case 2:
Func2();
case 3:
Func3();
}

C/C++ compiler can see this code above before it can place "inline" to
each functions inside switch (...). It is very annoying me because I have
to read switch (...) from line to line by scrolling up/down. It is very
long for me to read. It is not HARD to FIND BUG (off my point).
I would want to see only 3 or 4 lines there instead of over 20,000 lines
that contains switch (...). pArrayFunc() may be very useful. We should
tell the C/C++ authors to add keyword like __USE_JMP_INLINE instead of
__inline, _inline, or inline.
Newest version of C/C++ compiler like Intel Compiler can accept
__USE_JMP_INLINE. "__USE_JMP_INLINE" will be added to pArrayFunc[xx](), but
"inline" will always be same for 1,000 functions. After you compile your
C++ source code, it will recognize "__USE_JMP_INLINE", it will rename from
"CALL [EAX+EDX*4]" to "JMP [EAX+EDX*4]" so each functions' PUSH and POP
instructions will be removed. It will look like real JMP Table inside
pArrayFunc[xx](). It would be perfect for easy reading.
Please state your opinion what you think about my idea to modify C/C++
compiler's source code. ***Do not tell that there is no __USE_JMP_INLINE to
all C/C++ compiler. Please ask the authors to add __USE_JMP_INLINE keyword.
It would be perfect and neat C++ coding.***

I have put pArrayFunc[xx]() inside C++ class example below. It gives
you an idea however "__USE_JMP_INLINE" does not EXIST, but should be added
near the future.

#include <stdio.h>



class Test

{

public:

inline void Get_a() { printf("A: %d\n", _a); }

inline void Get_b() { printf("B: %d\n", _b); }

inline void Get_c() { printf("C: %d\n", _c); }



inline void Set_a(unsigned int a) { _a = a; }

inline void Set_b(unsigned int b) { _b = b; }

inline void Set_c(unsigned int c) { _c = c; }



typedef void (Test::*Action)();

static const Action Menu[3];



inline void Call();



private:

unsigned int _a;

unsigned int _b;

unsigned int _c;

};



const Test::Action Test::Menu[3] =

{

&Test::Get_a,

&Test::Get_b,

&Test::Get_c

};



inline void Test::Call()

{

for (int a = 0; a <= 2; a++)

(this->*Menu[a])();

}



void main(void)

{

Test test;

test.Set_a(1);

test.Set_b(2);

test.Set_c(3);

test.Call();



return;

}
 
V

Victor Bazarov

Bryan Parkoff said:
You may notice that switch (...) is much faster than function that can
gain a big improved performance because it only use JMP instruction however
function is required to use CALL, PUSH, and POP instruction that can be
slower.

"Can be" is key here.
I created three functions in an array. Array is called
pArray_Func[xx](). pArray_Func[xx]() contains three functions. "xx" can be
used to point each function.
Let say, I have over 1,000 functions. I would put "inline" to each
functions. It does not matter that each function may contain more than 50
lines. It is really BIG that C/C++ compiler tells "inline" is used for
short like 3-5 lines ONLY. I have decided to tell C/C++ compiler by forcing
"inline" there because I don't care if my code is getting bigger and
bigger.

And compiler can simply tell you to go kiss yourself and completely
ignore your 'inline' keywords. Didn't you know that it's not mandatory?
I allow 50 lines to be copied to each functions. CALL, PUSH, and POP
instructions can always be removed automatically by the C/C++ compiler
before JMP instruction can always be placed each function.
It is very neat according to my design that I wanted. If I care to keep
small code rather than big code, "inline" should not be used.
Right now, there is a big problem. I can't place "inline" into
pArray_Func[xx](). I can see why. JMP can show "JMP [EAX+EDX*4] that will
work fine. C/C++ compiler can only place "CALL [EAX+EDX*4] into
pArrayFunc[xx](). It is not designed to replace from "CALL [EAX+EDX*4]" to
"JMP [EAX+EDX*4] if "inline" is there. C/C++ compiler will always ignore
"inline".

Again, 'inline' is but a suggestion to the compiler. The compiler
does NOT have to follow it.
Another option is to place 1,000 functions inside switch (...). Look
like below.

switch (xx)
{
case 1:
Func1();
case 2:
Func2();
case 3:
Func3();
}

C/C++ compiler can see this code above before it can place "inline" to
each functions inside switch (...). It is very annoying me because I have
to read switch (...) from line to line by scrolling up/down. It is very
long for me to read. It is not HARD to FIND BUG (off my point).

Well, it's totally up to you not to do that. You can group your 'xx'
values and design another layer of functions, within which you would
have another level of indirection:

if (xx < 100)
callOneOfLessThan100Funcs(xx);
else if (xx < 200)
callOneOfLessThan200Funcs(xx);
...
I would want to see only 3 or 4 lines there instead of over 20,000 lines
that contains switch (...). pArrayFunc() may be very useful. We should
tell the C/C++ authors to add keyword like __USE_JMP_INLINE instead of
__inline, _inline, or inline.

Please do not mix _us_ up into your scheme. _You_ need this, _you_
seem to have thought this through, so _you_ should go to comp.std.c++
and suggest adding something to the language. _They_ will listen and
answer.
Newest version of C/C++ compiler like Intel Compiler can accept
__USE_JMP_INLINE. "__USE_JMP_INLINE" will be added to pArrayFunc[xx](), but
"inline" will always be same for 1,000 functions. After you compile your
C++ source code, it will recognize "__USE_JMP_INLINE", it will rename from
"CALL [EAX+EDX*4]" to "JMP [EAX+EDX*4]" so each functions' PUSH and POP
instructions will be removed. It will look like real JMP Table inside
pArrayFunc[xx](). It would be perfect for easy reading.

"Easy reading"?
Please state your opinion what you think about my idea to modify C/C++
compiler's source code. ***Do not tell that there is no __USE_JMP_INLINE to
all C/C++ compiler. Please ask the authors to add __USE_JMP_INLINE keyword.
It would be perfect and neat C++ coding.***

Please go to comp.std.c++ and do it yourself.
I have put pArrayFunc[xx]() inside C++ class example below. It gives
you an idea however "__USE_JMP_INLINE" does not EXIST, but should be added
near the future.

#include <stdio.h>



class Test

{

public:

inline void Get_a() { printf("A: %d\n", _a); }

inline void Get_b() { printf("B: %d\n", _b); }

inline void Get_c() { printf("C: %d\n", _c); }



inline void Set_a(unsigned int a) { _a = a; }

inline void Set_b(unsigned int b) { _b = b; }

inline void Set_c(unsigned int c) { _c = c; }



typedef void (Test::*Action)();

static const Action Menu[3];



inline void Call();



private:

unsigned int _a;

unsigned int _b;

unsigned int _c;

};



const Test::Action Test::Menu[3] =

{

&Test::Get_a,

&Test::Get_b,

&Test::Get_c

};



inline void Test::Call()

{

for (int a = 0; a <= 2; a++)

(this->*Menu[a])();

}



void main(void)

You better learn to use

int main()

otherwise people in comp.std.c++ may not even take you seriously.
 
J

John Carson

Compilers can ignore inline instructions. This is because making a function
inline can actually slow down the code. I don't know all of the details, but
having a large compiled program can slow things down because of the
additional memory requirements (cache sizes and the like are relevant here,
not just total system memory).
 
B

Bryan Parkoff

Victor,

I have posted a copy of my article to comp.std.c++. Please forgive me
that I DO KNOW that most C/C++ Compiler IGNORE "inline" unless I CAN FORCE
them to OBEY my desire.
And compiler can simply tell you to go kiss yourself and completely
ignore your 'inline' keywords. Didn't you know that it's not mandatory?

Again, 'inline' is but a suggestion to the compiler. The compiler
does NOT have to follow it.
I use Intel C/C++ Compiler 7.1. In fact, Intel C/C++ Compiler DOES NOT
ignore "inline" unless I tell Intel C/C++ Compiler to obey my desire unlike
other generic C/C++ Compiler such as Microsoft, GCC, G++, or other Unix
compilers.
You may be surprised that Intel C/C++ Compiler can give me the choice if
I want to use "inline" or not. It has switches like "keep inline" or
"ignore inline" however most generic C/C++ Compilers don't have switches so
they always ignore "inline".
I hope that you understand my point.

Bryan Parkoff
 
B

Bryan Parkoff

John,

You are absolutely correct, but I don't think that C/C++ Compiler can be
slow by compiling SMALL source code that contains ONE copy of functions
however machine language will grow bigger and bigger by placing duplicated
inline functions there. Taking too much memory is not important because we
have over 1GB memory or more. I can assure that machine language will have
approximately 1 MB size, but it does not hurt the memory. It can gain big
performance. C/C++ Compiler may take approximately 100MB to 300MB by
compiling, but machine language can only take 1 MB if it is already in
memory.
Sometimes, source code is too big that C/C++ compiler can take couple
minutes or hours by compiling all of them. It is normal. Does it make
sense?
 
V

Victor Bazarov

Bryan Parkoff said:
I hope that you understand my point.

I can't claim full understanding of it, I am afraid.

You seem to desire to have more control over how the compiler does
what it does. Well, nobody can tell you what to desire. However,
it is impossible to create a language (or a compiler) that would
satisfy everybody. Not to burst your bubble, but be prepared that
your idea may not be met with overwhelming enthusiasm.

Good luck!
 
B

Bob Hairgrove

On Sun, 28 Dec 2003 05:46:58 GMT, "Bryan Parkoff"

[some things snipped...]
... I DO KNOW that most C/C++ Compiler IGNORE "inline" unless I CAN FORCE
them to OBEY my desire.

This reminds me of lines in some old war films that go something like
this:

"Yess, ve haf vays of making you talk..."

Hope I never bump into you in a dark alley!
 
J

John Carson

Bryan Parkoff said:
John,

You are absolutely correct, but I don't think that C/C++ Compiler
can be slow by compiling SMALL source code that contains ONE copy of
functions however machine language will grow bigger and bigger by
placing duplicated inline functions there. Taking too much memory is
not important because we have over 1GB memory or more. I can assure
that machine language will have approximately 1 MB size, but it does
not hurt the memory. It can gain big performance. C/C++ Compiler
may take approximately 100MB to 300MB by compiling, but machine
language can only take 1 MB if it is already in memory.
Sometimes, source code is too big that C/C++ compiler can take
couple minutes or hours by compiling all of them. It is normal.
Does it make sense?


I lack the expertise to comment with any confidence on this topic, but for
what it is worth:

My comments concerned runtime performance of an application, not compile
time. And, as I said in my original post, total available memory is not the
only relevant consideration. Cache memory is also relevant and is much
smaller.
 
T

Thomas Matthews

Inlining functions removes the execution overhead (what you
call PUSH and POP). However, this may not speed up the
execution of a program. All depends on how you optimize
and how your processor works. Remember to profile your
program and optimize the section of code where most of
the execution takes place.

As others have stated, the "inline" keyword is only a
suggestion to the compiler. If the compiler believes
your suggestion is worthy, it may comply; it doesn't
have to listen to it.

Many processors have instruction caches or buffers.
They have a coponent that fetches instructions until
a caution instruction is found (an instuction that
may force the cache to be cleared, such as a jump
or branch). The idea is that the cache is faster
to retrieve instructions out of and that the processor
can take advantage of parallelism: one component
fetches instructions into the cache while another
component decodes the instructions. The inlining
concept is to help with this issue by reducing the
number of branches.

My understanding of your post is that you want a
language feature which forces the compiler to inline
functions so that you can speed up the execution of
your program. A suggestion that has merit. However,
inlining does not necessarily speed up all programs.

Some processors have auxilary or helper coprocessors.
Some examples are floating point processors, graphic
engines and DMA controllers. Placing code specific
to a processor into a function may allow that function
to be executed by a coprocessor while the main processor
is performing other functions. Some processors have
string instructions. So inlining a search function
may actually be slower than using the library function
which takes advantage of the processor's string
instructions.

My suggestion is to profile your code to find out
where most of the execution time is spent. Either
redesign it, eliminate it or replace with assembly
code that can take advantage of processor features.

Also, inlining all functions prevents the use of a
table of function pointers (a.k.a. jump table).
Many data driven applications (programs) take advantage
of tables of function pointers (example: menus).
One benefit is that the table can be expanded without
changing the table driver (thus no retesting of that
code is required).

BTW, most of the cost of software development is in
the manpower of creating it. There is also the issue
of "time to market": if the market window is missed,
the program may have lost its value.

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library
 
D

Dan W.

going from

void func1();
void func2();
void func3();

to

voiid funcN(int n)
switch(n)
{
case 1:
....//inline code
break;
case 2:
....//inline code
break;
default:
....//inline code
}

makes for slower code, if anything.
Each time you need to call any of the functions you still need to call
the generic function containing the switch.

Besides, a lot of programmers don't seem to be aware of the most
important issue of all when it comes to performance: slowness of
memory. And its direct consequence: the issue of cache utilization
efficiency. If you start inlining everything, your code grows, and
pretty soon it is overwhelming the CPU's cache, pushing other useful
code snippets out of it, and in the case of Intel (not AMD) CPU's,
pushing *data* out of cache to fit the overgrown code. Not to speak
of branch-prediction cache pollution that results from inlining
branchy code all over the place. These days, if you want speed,
you're better off optimizing for size... ;-) No joke.
 
D

Dan W.

Besides, if you take a look at how a modern, superscalar processor
works, you'll realize how much it can do to get around the push and
pop's.

Speculative execution, and parallel execution.

When the CPU gets to a function call, it doesn't wait for itself to
finish pushing stuff into the stack, it jumps right into the function
code and starts executing any instructions there that might not depend
on arguments passed via the stack.

Once it meets a return instruction, it pulls the return address NOT
from the stack, but from the "function return cache", which is
single-cycle, and starts (speculatively) executing any instructions
there that don't appear to depend on the function's return, while all
the stack popping is being done *in parallel".

Cheers!
 
B

Bryan Parkoff

Dan,

Wow...If it is the truth, Call Table should be faster than JMP Table for
small code size. If I have to tell that JMP Table should be faster than
Call Table for speed code, it may be perfect for emulation project. Please
take a look at my measures below.

loop 1,000,000
10,000,000 100,000,000 times
1) JMP Table contains 256 JMPs. 1.59s 24.41s
6m 28.53s
2) JMP Table contains 256 Functions. 1.31s 21.07s
5m 28.53s
3) Call Table contains 256 Functions. 0.59s 9.99s
2m 31.12s
4) C++ Class contains 256 Functions. 1.81s 28.90s
7m 46.08s

It proves to be fast on number #3 for small code size, but I have no
idea how fast on number #1 because JMP Table has all 256 inlined functions.
JMP Table on number #2 has 256 NON-inlined functions. Please let me know
what you think.
If Intel claims that AMD is to be loser for cache gain losses, don't
hurt AMD's feeling!!
 
D

Dan W.

Dan,

Wow...If it is the truth, Call Table should be faster than JMP Table for
small code size. If I have to tell that JMP Table should be faster than
Call Table for speed code, it may be perfect for emulation project. Please
take a look at my measures below.

loop 1,000,000
10,000,000 100,000,000 times
1) JMP Table contains 256 JMPs. 1.59s 24.41s
6m 28.53s
2) JMP Table contains 256 Functions. 1.31s 21.07s
5m 28.53s
3) Call Table contains 256 Functions. 0.59s 9.99s
2m 31.12s
4) C++ Class contains 256 Functions. 1.81s 28.90s
7m 46.08s

It proves to be fast on number #3 for small code size, but I have no
idea how fast on number #1 because JMP Table has all 256 inlined functions.
JMP Table on number #2 has 256 NON-inlined functions. Please let me know
what you think.
If Intel claims that AMD is to be loser for cache gain losses, don't
hurt AMD's feeling!!

I'm not sure that your tests prove anything, anyhow. Not your problem
alone; but the problem of benchmarking in general:

The context of the test affects the result.

If you're only calling the functions from one place, the jump table
approach has the advantage in terms of code-size: The reason being
that if a function is inlined *only once* it actually occupies less
space than the the non-inlined version. If you were to randomize
those 256 functions such that, say, on average, each is repeated 16
times (just to go for the square root of 256), but such that in the
call table version, you have, on average, 16 calls to the same code
instance of the function, this would make your code size a lot smaller
and make more efficient cache utilization.

And this would have a much more noticeable advantage in performance if
there are currently many other applications and processes going, where
competition for cache space gets tougher.

Just as gossip, memory latency (DDR) runs at about 100 to 1000
micro-instructions. Where 15 years ago, the paradigm was "What can we
store in memory so we don't have to re-calculate it?", today's
paradigm goes "What can we re-calculate from data we likely have in
cache or registers, so that we don't have to access it from memory?"
;-)

Cheers!
 
D

Dan W.

And there's so many other issues..

Take, for instance, the branch prediction cache:

I'm not sure if I remember correctly, I think the later Athlons have
about 2048 branch prediction entries. So, say that you were to put
ten consecutive loops in each of your subroutines, where each of the
loops executes 10 times. With inlinining you'd get a total of 2560
branches that need to be predicted to avoid stalls and pipeline
flushes, exceeding the available entries, and therefore incurring
stalls and flushes; whereas with function calls you might stay
within the available entries and get successful predictions 80% to 90%
of the time. This would make a huge difference in performance.

Or take the case of memory pre-fetching: If you access memory in a
linear, ascending address fashion, at a constant stride, AMD cpu's,
and probably Intel's also, will instruct the memory controller to
start loading from memory upcoming cache lines (64-byte chunks for
AMD) so that they are already in cache when needed. If you go
downwards in memory, or randomly, you don't get this valuable service,
and your only recourse is to use the prefetch instruction, in
assembler... :(

Also, each of the chips in a memory DIMM has four local cache pages,
if my memory serves me well, such that access time tends to be less
when repeatedly accessing memory within a (1K? or so) neighborhood,
even if not within a raised cache line.

These are some of the many reasons why benchmarks are often so removed
from real world performance.

Cheers!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,074
Latest member
StanleyFra

Latest Threads

Top