New wikibook about efficient C++


Juha Nieminen

Carlo said:
I just completed writing an online book about developing efficient
software using the C++ language.

The text contains mostly irrelevant and sometimes erroneous
suggestions. Most of the suggested micro-optimizations will usually not
make your program any faster at all and thus are completely irrelevant.
Some examples:

"Don't check that a pointer is non-null before calling delete on it."

While it's true that you don't have to check for null before deleting,
speedwise that's mostly irrelevant. In most systems with most compilers
the 'delete' itself is a rather heavy operation, and the extra clock
cycles added by the conditional will not make the program relevantly
slower. It might become more relevant if you have a super-fast
specialized memory allocator where a 'delete' takes next to no time and
humongous numbers of objects are deleted in a short time.
However, in normal situations it's irrelevant.
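
For concreteness, the two forms being compared are simply (a minimal
sketch; 'p' is just a placeholder pointer):

// Redundant guard: delete on a null pointer is already a no-op.
if (p != nullptr)
    delete p;

// Equivalent and simpler; any speed difference is lost in the cost
// of the delete itself.
delete p;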

"Declare const every member function that does not change the state of
the object on which it is applied."

Mostly irrelevant speedwise.

"Instead of writing a for loop over an STL container, use an STL
algorithm with a function-object or a lambda expression"

Why is that any faster than the for loop? (In fact, it might even be
slower if, for whatever reason, the compiler is unable to inline the
lambda function.)
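
To make the comparison concrete, here is roughly what the two forms look
like (a minimal sketch; the container and the operation are arbitrary):

#include <algorithm>
#include <cstddef>
#include <vector>

// Plain for loop: small code, one branch per element.
void scale_loop(std::vector<double>& v, double factor)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] *= factor;
}

// The same operation written as an STL algorithm with a lambda.
// Whether this is any faster depends largely on whether the compiler
// can inline the lambda.
void scale_algorithm(std::vector<double>& v, double factor)
{
    std::for_each(v.begin(), v.end(), [factor](double& x) { x *= factor; });
}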

"Though, every time a function containing substantial code is inlined,
the machine code is duplicated, and therefore the total size of the
program is increased, causing a general slowing down."

Mostly not true. There is no necessary correlation between code size
and speed. In fact, sometimes a longer piece of code may perform faster
than a shorter one (for example loop unrolling performed by the compiler
sometimes produces faster code, even in modern processors).
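
As a rough illustration of the loop-unrolling point (a hand-written
sketch of the kind of transformation an optimizing compiler may apply on
its own; the function names are made up):

#include <cstddef>

// Straightforward loop: small code, one loop-control branch per element.
double sum_simple(const double* a, std::size_t n)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Unrolled by four: noticeably larger code, yet often faster because it
// performs fewer loop-control branches and exposes more parallelism.
// (Note: the summation order differs slightly, which matters for
// floating point.)
double sum_unrolled(const double* a, std::size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)      // leftover elements
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}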

"Among non-tiny functions, only the performance critical ones are to be
declared inline at the optimization stage."

The main reason to decide whether to declare a larger function
'inline' or not is not about speed. The compiler has heuristics for this
and will not inline the function if it estimates that it would be
counter-productive. The main reason to use or avoid 'inline' has more to
do with the quality of the source code.

"In addition, every virtual member function occupies some more space"

Irrelevant, unless you are developing for an embedded system with a
*very* small amount of memory.

"Do not null a pointer after having called the delete operator on it, if
you are sure that this pointer will no more be used."

Irrelevant. The 'delete' itself will usually be so slow that the
additional assignment won't change anything.

"Garbage collection, that is automatic reclamation of unreferenced
memory, provides the ease to forget about memory deallocation, and
prevents memory leaks. Such feature is not provided by the standard
library, but is provided by non-standard libraries. Though, such memory
management technique causes a performance worse than explicit
deallocation (that is when the delete operator is explicitly called)."

This is simply not true. In fact, GC can be made faster than explicit
deallocation, at least compared to the default memory allocator used by
most C++ compilers.

"To perform input/output operations, instead of calling the standard
library functions, call directly the operating system primitives."

Dubious advice. The general (and portable) advice for fast I/O is to
use fread() and fwrite() for large blocks of data (the C++ equivalents
may actually be equally fast, if called rarely). If very small amounts
of data (such as characters) need to be read or written individually,
use the corresponding C I/O functions.

In places where I/O speed is irrelevant, this advice is
counter-productive.
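
For reference, the "large blocks" variant looks roughly like this (a
minimal sketch; the file name and block size are arbitrary):

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::FILE* in = std::fopen("input.dat", "rb");   // hypothetical input file
    if (!in)
        return 1;

    std::vector<char> block(64 * 1024);               // read 64 KB at a time
    std::size_t total = 0;
    std::size_t n;
    while ((n = std::fread(block.data(), 1, block.size(), in)) > 0)
        total += n;                                   // process the block here

    std::fclose(in);
    std::printf("%zu bytes read\n", total);
    return 0;
}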

"Look-up table"

This was relevant in the early 90's. Nowadays it's less evident. With
modern CPUs sometimes using a lookup table instead of a seemingly "slow"
function might actually be slower, depending on a ton of factors.

"Instead of doing case-insensitive comparisons between a strings,
transform all the letters to uppercase (or to lowercase), and then do
case-sensitive comparisons."

Yeah, because converting the string does not take time?
 

Carlo Milanesi

Juha Nieminen wrote:
The text contains mostly irrelevant and sometimes erroneous
suggestions. Most of the suggested micro-optimizations will usually not
make your program any faster at all and thus are completely irrelevant.
Some examples:

You sound too harsh! The book contains 98 pieces of advice, and your
critiques concern only 11 of them. Are you sure that the others are
mostly completely irrelevant to a non-expert programmer?
"Don't check that a pointer is non-null before calling delete on it."

While it's true that you don't have to check for null before deleting,
speedwise that's mostly irrelevant. In most systems with most compilers
the 'delete' itself is a rather heavy operation, and the extra clock
cycles added by the conditional will not make the program relevantly
slower. It might become more relevant if you have a super-fast
specialized memory allocator where a 'delete' takes next to no time and
humongous numbers of objects are deleted in a short time.
However, in normal situations it's irrelevant.

I agree that in normal situations it's almost irrelevant, but it is
nevertheless a useless operation that I have seen done by some programmers.
"Declare const every member function that does not change the state of
the object on which it is applied."

Mostly irrelevant speedwise.

Actually, I never found this advice useful, but I was told that some
compilers could exploit the constness to optimize the code.
I am going to remove this advice.
"Instead of writing a for loop over an STL container, use an STL
algorithm with a function-object or a lambda expression"

Why is that any faster than the for loop? (In fact, it might even be
slower if, for whatever reason, the compiler is unable to inline the
lambda function.)

In the book "C++ Coding Standards" it is written:
"algorithms are also often more efficient than naked loops".
It is explained that they avoid minor inefficiencies introduced by
non-expert programmers, that they exploit the inside knowledge of the
standard containers, and some of them implement sophisticated algorithms
that the average programmer does not know or does not have time to
implement.
Do you think it is better to remove this advice altogether, or to
change it?
"Though, every time a function containing substantial code is inlined,
the machine code is duplicated, and therefore the total size of the
program is increased, causing a general slowing down."

Mostly not true. There is no necessary correlation between code size
and speed. In fact, sometimes a longer piece of code may perform faster
than a shorter one (for example loop unrolling performed by the compiler
sometimes produces faster code, even in modern processors).

The correlation comes from the code cache size. If you inline almost all
functions, you get code bloat, i.e. the code no longer fits in the code
cache. Even compilers do not completely unroll a loop of 1000 iterations.
What guideline do you suggest for the initial coding (optimizations are
considered later)?
"Among non-tiny functions, only the performance critical ones are to be
declared inline at the optimization stage."

The main reason to decide whether to declare a larger function
'inline' or not is not about speed. The compiler has heuristics for this
and will not inline the function if it estimates that it would be
counter-productive. The main reason to use or avoid 'inline' has more to
do with the quality of the source code.

Before that, it is written that if the compiler can decide which
functions to inline, there is no need to declare them "inline".
This guideline applies to compilers that need explicit inlining.
"In addition, every virtual member function occupies some more space"

Irrelevant, unless you are developing for an embedded system with a
*very* small amount of memory.

OK, I am going to remove this.
"Do not null a pointer after having called the delete operator on it, if
you are sure that this pointer will no more be used."

Irrelevant. The 'delete' itself will usually be so slow that the
additional assignment won't change anything.

Analogous to the check-before-delete.
"Garbage collection, that is automatic reclamation of unreferenced
memory, provides the ease to forget about memory deallocation, and
prevents memory leaks. Such feature is not provided by the standard
library, but is provided by non-standard libraries. Though, such memory
management technique causes a performance worse than explicit
deallocation (that is when the delete operator is explicitly called)."

This is simply not true. In fact, GC can be made faster than explicit
deallocation, at least compared to the default memory allocator used by
most C++ compilers.

Then why isn't everyone using it, and why isn't every guru recommending it?
I have never measured GC performance. Are there any research papers
around about its performance in C++ projects?
"To perform input/output operations, instead of calling the standard
library functions, call directly the operating system primitives."

Dubious advice. The general (and portable) advice for fast I/O is to
use fread() and fwrite() for large blocks of data (the C++ equivalents
may actually be equally fast, if called rarely). If very small amounts
of data (such as characters) need to be read or written individually,
use the corresponding C I/O functions.

In places where I/O speed is irrelevant, this advice is
counter-productive.

This is a "bottleneck" optimization, as everyone in chapters 4 and 5.
Anyway, I will add a guideline about big buffers and aother one about
keeping files open.
"Look-up table"

This was relevant in the early 90's. Nowadays it's less evident. With
modern CPUs sometimes using a lookup table instead of a seemingly "slow"
function might actually be slower, depending on a ton of factors.

This is a "possible" optimization, as everyone in chapters 4 and 5.
If the cost of the computation is bigger than the cost to retrieve the
pre-computed result, then the look-up table is faster.
Some functions take a lot to be computed, even with modern CPUs.
Even in the example of "sqrt", that is quite fast, the look-up table
routine, if inlined, is more than twice as fast.
With a function like pow(x, 1./3), it is 13 times as fast on my computer.
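
The kind of table being described looks roughly like this (a minimal
sketch; the input range, table size, and nearest-sample lookup are
assumptions, not the wikibook's actual code):

#include <cmath>
#include <cstddef>
#include <vector>

// Precomputed cube roots for x in [0, max_x], sampled at fixed steps.
class CubeRootTable
{
public:
    explicit CubeRootTable(double max_x = 100.0, std::size_t samples = 4096)
        : max_x_(max_x), table_(samples + 1)
    {
        for (std::size_t i = 0; i <= samples; ++i)
            table_[i] = std::pow(max_x * i / samples, 1.0 / 3.0);
    }

    // Nearest-sample lookup: trades accuracy for speed.
    // Assumes 0 <= x <= max_x.
    double operator()(double x) const
    {
        std::size_t i = static_cast<std::size_t>(
            x / max_x_ * (table_.size() - 1) + 0.5);
        return table_[i];
    }

private:
    double max_x_;
    std::vector<double> table_;
};

Whether this beats calling pow directly depends on the table size (it
must stay in cache), on the accuracy you can afford, and on the
surrounding code.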
"Instead of doing case-insensitive comparisons between a strings,
transform all the letters to uppercase (or to lowercase), and then do
case-sensitive comparisons."

Yeah, because converting the string does not take time?

I meant the following.
- When loading a collection, convert the case of all the strings.
- When searching the collection for a string, convert the case of that
string before searching.
This makes the loading slower, but for a large enough collection it
makes the search faster.
Many databases are actually case-insensitive. Why?
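
A minimal sketch of that scheme (the upper-casing here is ASCII-only,
just to keep the example short):

#include <algorithm>
#include <cctype>
#include <set>
#include <string>

std::string to_upper_ascii(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}

class NameIndex
{
public:
    // Pay the case conversion once per string, at load time.
    void insert(const std::string& name)
    { names_.insert(to_upper_ascii(name)); }

    // Each lookup converts the query once; every comparison done inside
    // the set is then a plain case-sensitive comparison.
    bool contains(const std::string& name) const
    { return names_.count(to_upper_ascii(name)) != 0; }

private:
    std::set<std::string> names_;
};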

Thank you for your comments.
Do you have any guidelines to suggest for the inclusion in the book?
 

Kai-Uwe Bux

Carlo said:
Juha Nieminen wrote:
Carlo Milanesi wrote: [snip]
"Instead of writing a for loop over an STL container, use an STL
algorithm with a function-object or a lambda expression"

Why is that any faster than the for loop? (In fact, it might even be
slower if, for whatever reason, the compiler is unable to inline the
lambda function.)

In the book "C++ Coding Standards" it is written:
"algorithms are also often more efficient than naked loops".
It is explained that they avoid minor inefficiencies introduced by
non-expert programmers, that they exploit the inside knowledge of the
standard containers, and some of them implement sophisticated algorithms
that the average programmer does not know or does not have time to
implement.
Do you think it is better to remove this advice altogether, or to
change it?

In principle, algorithms could make use of special knowledge about
implementation details of containers such as deque and create faster code
that way. Also, such specializations could be provided for stream and
streambuf iterators. I think Dietmar Kuehl had some code in that direction.
However, it is far from clear that STL implementations in widespread use
have such optimizations built in.

As for the wiki, I would leave the item but add a word of caution. After
all, if you are stuck with a compiler that does a poor job at optimizing
away the abstraction overhead of functors, it could lead to worse
performance; but if you have a library that uses special trickery inside,
it could boost performance. It's one of the many cases where measurement is
paramount and awareness of issues is what is required of the programmer.


Best

Kai-Uwe Bux
 

Noah Roberts

Juha said:
"Instead of writing a for loop over an STL container, use an STL
algorithm with a function-object or a lambda expression"

Why is that any faster than the for loop? (In fact, it might even be
slower if, for whatever reason, the compiler is unable to inline the
lambda function.)

Probably based on the fact that the algorithm can take advantage of
implementation specific knowledge whereas your for loop can't, or at
least shouldn't.

I don't know that any implementation does this though.
 

Bo Persson

Carlo said:
Juha Nieminen wrote:


I agree that in normal situations it's almost irrelevant, but it is
nevertheless a useless operation that I have seen done by some
programmers.

Some of us, and some of the code, have been around since before the
standard was set. Portable code once had to have the checks.

Nowadays, compilers are smarter and at least one optimizes for this
case by removing its own check if it is not needed:

if (_MyWords != nullptr)
0041E0CC mov esi,dword ptr [ebx+14h]
0041E0CF test esi,esi
0041E0D1 je xxx::~xxx+9Ch (41E0ECh)
delete _MyWords;
0041E0D3 mov eax,dword ptr [esi+4]
0041E0D6 test eax,eax
0041E0D8 je xxx::~xxx+93h (41E0E3h)
0041E0DA push eax
0041E0DB call operator delete (4201CEh)
0041E0E0 add esp,4
0041E0E3 push esi
0041E0E4 call operator delete (4201CEh)
0041E0E9 add esp,4


// if (_MyWords != nullptr)
delete _MyWords;
0041E0CC mov esi,dword ptr [ebx+14h]
0041E0CF test esi,esi
0041E0D1 je xxx::~xxx+9Ch (41E0ECh)
0041E0D3 mov eax,dword ptr [esi+4]
0041E0D6 test eax,eax
0041E0D8 je xxx::~xxx+93h (41E0E3h)
0041E0DA push eax
0041E0DB call operator delete (4201CEh)
0041E0E0 add esp,4
0041E0E3 push esi
0041E0E4 call operator delete (4201CEh)
0041E0E9 add esp,4

OK, I am going to remove this.

The rule here is of course that if you need a function to be virtual,
you just have to make it virtual. If you don't, you don't. :)

Analogous to the check-before-delete.

Also for the compiler. If the nulled pointer isn't actually used, the
compiler is likely to optimize away the assignment anyway.

A better rule is to use delete just before the pointer goes out of
scope. Then there is no problem.

Even better is to use a smart pointer or a container that manages
everything for you.
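
For example (a minimal sketch with the standard facilities; 'Widget' is
a placeholder type):

#include <memory>
#include <vector>

struct Widget { int value = 0; };

void demo()
{
    std::unique_ptr<Widget> w(new Widget);  // released automatically when w goes out of scope
    std::vector<Widget> many(100);          // the container owns and destroys its elements
    // ... use w and many ...
}   // no delete, no null check, no nulling afterwards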

Then why isn't everyone using it, and why isn't every guru
recommending it? I have never measured GC performance. Are there
any research papers around about its performance in C++ projects?

As usual, it depends.

Some "gurus" actually do use GC when there is an advantage. This one,
for example:

http://www.hpl.hp.com/personal/Hans_Boehm/gc/



Bo Persson
 

Bo Persson

Kai-Uwe Bux said:
Carlo said:
Juha Nieminen wrote:
Carlo Milanesi wrote: [snip]
"Instead of writing a for loop over an STL container, use an STL
algorithm with a function-object or a lambda expression"

Why is that any faster than the for loop? (In fact, it might
even be slower if, for whatever reason, the compiler is unable to
inline the lambda function.)

In the book "C++ Coding Standards" it is written:
"algorithms are also often more efficient than naked loops".
It is explained that they avoid minor inefficiencies introduced by
non-expert programmers, that they exploit the inside knowledge of
the standard containers, and some of them implement sophisticated
algorithms that the average programmer does not know or does not
have time to implement.
Do you think it is better to remove this advice altogether, or to
change it?

In principle, algorithms could make use of special knowledge about
implementation details of containers such as deque and create faster
code that way. Also, such specializations could be provided for
stream and streambuf iterators. I think Dietmar Kuehl had some code
in that direction. However, it is far from clear that STL
implementations in widespread use have such optimizations built in.

It seems like the compilers are now smart enough to do most of this
work on their own.

Benchmarking a vector against a deque shows little difference in
traversal speed. Some of this is a cache effect win for the contiguous
vector, leaving very little to gain for an improved deque iterator.

As for the wiki, I would leave the item but add a word of caution.
After all, if you are stuck with a compiler that does a poor job at
optimizing away the abstraction overhead of functors, it could lead
to worse performance; but if you have a library that uses special
trickery inside, it could boost performance. It's one of the many
cases where measurement is paramount and awareness of issues is
what is required of the programmer.

Yes, optimizing for weak compilers is very tricky. Getting another
compiler might be a better idea, but perhaps not possible.

Perhaps this kind of advice should be tagged with compiler version?



Bo Persson
 

peter koch

Juha Nieminen wrote:



You sound too harsh! The book contains 98 pieces of advice, and your
critiques concern only 11 of them. Are you sure that the others are
mostly completely irrelevant to a non-expert programmer?

I do not find that your recommendations are too bad. While I mostly
agree with Juha, much of the advice you give is good even if it is not
related to (program) performance. The first advice, of not checking a
pointer for null before deleting it, for example, is very good advice
if you wish to increase programmer performance ;-)

The worst advice I saw (and I only read Juha's post) was to avoid the
standard library for I/O. And yet, today it is unfortunately quite
relevant should your program have its bottleneck in formatted I/O.

/Peter
 

coal

Juha Nieminen wrote:



You sound too harsh! The book contains 98 pieces of advice, and your
critiques concern only 11 of them. Are you sure that the others are
mostly completely irrelevant to a non-expert programmer?



I agree that in normal situations it's almost irrelevant, but it is
nevertheless a useless operation that I have seen done by some programmers.



Actually, I never found this advice useful, but I was told that some
compilers could exploit the constness to optimize the code.
I am going to remove this advice.



In the book "C++ Coding Standards" it is written:
"algorithms are also often more efficient than naked loops".
It is explained that they avoid minor inefficiencies introduced by
non-expert programmers, that they exploit the inside knowledge of the
standard containers, and some of them implement sophisticated algorithms
that the average programmer does not know or does not have time to
implement.


I favor the "naked loops" in an automated context. If there are
inefficiencies in this code
http://webEbenezer.net/comp/Msgs.hh

I'm interested in knowing what they are.

Brian Wood
Ebenezer Enterprises
www.webEbenezer.net

"A wise man is strong; yea, a man of knowledge increaseth strength."
Proverbs 24:5
 

James Kanze

Juha Nieminen wrote:

[...]
Then why isn't everyone using it, and why isn't every guru
recommending it?

Many do (Stroustrup, for example). And the gurus that
recommend against it don't do so on performance grounds. The
one thing I think all gurus agree on is that performancewise, it
all depends. There are programs where garbage collection will
speed the program up, and there are programs which will run
slower with it. And that in all cases, it depends on the actual
implementation of the manual management or the garbage
collection.
 

Carlo Milanesi

peter koch wrote:
The first advice of not checking a
pointer for null before deleting it, for example, is very good advice
if you wish to increase programmer performance ;-)

I have already removed it, as the book is only about program performance.
The worst advice I saw (and I only read Juha's post) was to avoid the
standard library for I/O. And yet, today it is unfortunately quite
relevant should your program have its bottleneck in formatted I/O.

In fact, there is no substantial difference between the performance of
"fread/fwrite" and that of the OS API. But I found a noticeable
difference when using the "fstream" family of classes.

For example, the following code:
ifs.seekg(0);
ifs.read(buf, sizeof buf);
int rc = ifs.gcount();

turns out to be much slower than the following:
rewind(fp);
int rc = fread(buf, 1, sizeof buf, fp);

at least when "buf" if smaller then 1MB.
So I am going to change the advice in "Use stdio instead of fstream for
binary I/O".
 

Carlo Milanesi

Bo Persson wrote:
As usual, it depends.

Some "gurus" actually do use GC when there is an advantage. This one,
for example:

http://www.hpl.hp.com/personal/Hans_Boehm/gc/

Boehm is a GC evangelist and so his opinions are biased :)
Anyway, I understood that GC is competitive when you are allocating
small objects, i.e. with an average size of less than 64 bytes, and that
it is better only if the average allocated object size is even smaller,
but worse with larger allocations.
The wikibook recommends minimizing allocations in bottlenecks by using
"reserve" or object pools. That tends to increase the average object size.
In addition, the wikibook does not say "never use GC", but "only if you
can prove its expediency for the specific case".
I'll improve the rationale for that a bit.
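
As an illustration of the "reserve" part of that advice (a minimal
sketch; the element type and count are arbitrary):

#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> load_names(std::size_t expected_count)
{
    std::vector<std::string> names;
    names.reserve(expected_count);   // one up-front allocation for the buffer,
                                     // instead of repeated reallocations while growing
    for (std::size_t i = 0; i < expected_count; ++i)
        names.push_back("name_" + std::to_string(i));
    return names;
}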
 

Bo Persson

Carlo said:
Bo Persson wrote:

Boehm is a GC evangelist and so his opinions are biased :)

You mean that he knows what he is talking about? :)
Anyway, I understood that GC is competitive when you are allocating
small objects, i.e. with an average size of less than 64 bytes, and
that it is better only if the average allocated object size is even
smaller, but worse with larger allocations.
The wikibook recommends minimizing allocations in bottlenecks by
using "reserve" or object pools. That tends to increase the average
object size. In addition, the wikibook does not say "never use GC",
but "only if you can prove its expediency for the specific case".
I'll improve the rationale for that a bit.

Sounds good.


Bo Persson
 
