Windows/Linux optimization problem

Renato · Feb 10, 2006

Hi all,
I have a strange optimization problem. I have written a small program,
basically a matrix-vector multiplication at its core, that needs to
run as fast as possible.

The relevant code snippet is:

for (e = start_e; e < end_e; e++)
    for (s = start_s; s < end_s;)
        *e += (*r++) * (*s++);

where all variables are float pointers, 'r' is the matrix, 's' the
vector and 'e' the result vector. I call the operation on the last
line a MAC (multiply-accumulate), a common measure of performance on
DSPs. The 'r' matrix is very large (about 15 MB) and does not fit into
the cache.

The program does a thousand iterations, each consisting of some setup
and of the matrix mult. above, and prints out the speed.

On Linux, using an Intel Xeon 2.6 GHz (512 KB cache) I get the
following result:

Done in 0.42 seconds (2398.55 iterations/sec) (487.00 Mmac/sec)

The above result was with the optimizing Intel compiler v9.0, which
auto-vectorizes loops using SSE. The non-SSE version was only about 20%
slower.

On Windows, using my Athlon 2700+, I get this:

Done in 1.81 seconds (551.61 iterations/sec) (112.00 Mmac/sec)

I then learned that my non-professional copy of Visual C++ does not
optimize binaries (!), so I downloaded the Microsoft Visual C++
Toolkit 2003, which claims to include the same optimizing compiler
as the professional version of Microsoft Visual C++. The
result is even worse:

Done in 1.85 seconds (540.53 iterations/sec) (109.75 Mmac/sec)

The Windows version was compiled with this command line:

cl /O2 test2.c

Adding flags for SSE instructions did not help.

Does anyone have a clue what I'm doing wrong? The numbers are very
repeatable. Using a smaller 'r' matrix pushed the speed on the Linux
Xeon up to 1.5 Gmac/sec (!), while the Windows version on the Athlon never
went over 250 Mmac/sec.

Thanks for the answers

Alfio

Al Balmer · Feb 10, 2006

Renato said:
Hi all,
I have a strange optimization problem. I have written a small program,

An interesting problem, but way off topic here, where we discuss the
standard C language, not specific implementations, and not
optimization. Look for a Microsoft newsgroup.

Renato · Feb 10, 2006

Al Balmer said:
An interesting problem, but way off topic here, where we discuss the
standard C language, not specific implementations, and not
optimization. Look for a Microsoft newsgroup.

Sorry, I didn't realize that it was off topic. I'll post it somewhere
else.

Alfio

Malcolm · Feb 10, 2006

Renato said:
The relevant code snippet is:

for (e = start_e; e < end_e; e++)
    for (s = start_s; s < end_s;)
        *e += (*r++) * (*s++);

where all variables are float pointers, 'r' is the matrix, 's' the
vector and 'e' the result vector. I call the operation on the last
line a MAC (multiply-accumulate), a common measure of performance on
DSPs. The 'r' matrix is very large (about 15 MB) and does not fit into
the cache.

[ Windows worse than Linux ]

You may want to look at the assembly code produced (if you have no tools,
make a minimal program and then hand-disassemble the binary).

This will tell you whether it is the greedy operating system or the bad
compiler causing your problems on the Windows machine.

Tim Prince · Feb 12, 2006

Renato said:
Hi all,

The relevant code snippet is:

for (e = start_e; e < end_e; e++)
    for (s = start_s; s < end_s;)
        *e += (*r++) * (*s++);

More than this is relevant, including how the pointers are declared.

Renato said:
The Windows version was compiled with this command line:

cl /O2 test2.c

Adding flags for SSE instructions did not help.

Does anyone have a clue what I'm doing wrong? The numbers are very
repeatable. Using a smaller 'r' matrix pushed the speed on the Linux
Xeon up to 1.5 Gmac/sec (!), while the Windows version on the Athlon never
went over 250 Mmac/sec.

Most Windows compilers do not default to requiring programs to comply
with the C standard on typed aliasing. In fact, they don't even act as
C compilers by default. Thus, they may assume possible side effects
which prevent scalar reduction (registerization of the sum
accumulation), or unpredictable changes in the pointer values which
prevent those from being registerized. You could make it easier on the
compiler by declaring a local scalar for the accumulator, and
unambiguously moving the assignment to *e to the outer loop.

Using pointers to make counted for loops has pitfalls. Purists might
replace the < condition with !=, but that introduces ambiguities in how
to treat the situation where the loop might wrap around in the address
space. So, it's possible that one compiler might choose not to optimize
for such reasons.
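
A minimal sketch of the local-accumulator suggestion, for illustration only. It assumes the matrix is stored row by row and that each row has as many elements as the vector. The function names (matvec_ptr, matvec_acc, matvec_selfcheck) and the rows/cols parameters are hypothetical and do not come from the original program. Whether this actually changes the code a given compiler generates would still have to be checked by timing it or reading the assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Form from the original post: accumulates through *e, so the compiler
   may have to assume the store to *e can alias r or s, and reload or
   store on every MAC. */
static void matvec_ptr(float *start_e, float *end_e,
                       const float *r,
                       const float *start_s, const float *end_s)
{
    float *e;
    const float *s;
    for (e = start_e; e < end_e; e++)
        for (s = start_s; s < end_s;)
            *e += (*r++) * (*s++);
}

/* Suggested form: local scalar accumulator, one store to e[i] per row,
   and counted index loops instead of pointer comparisons.
   rows and cols are assumed names for the matrix dimensions. */
static void matvec_acc(float *e, const float *r, const float *s,
                       size_t rows, size_t cols)
{
    size_t i, j;
    for (i = 0; i < rows; i++) {
        float sum = e[i];                /* keep the += semantics */
        const float *row = r + i * cols; /* start of row i */
        for (j = 0; j < cols; j++)
            sum += row[j] * s[j];
        e[i] = sum;                      /* single store, in the outer loop */
    }
}

/* Small self-check: both forms agree on a 2x3 example. */
static void matvec_selfcheck(void)
{
    const float r[6] = {1, 2, 3, 4, 5, 6};
    const float s[3] = {1, 0, -1};
    float a[2] = {0, 10};
    float b[2] = {0, 10};
    matvec_ptr(a, a + 2, r, s, s + 3);
    matvec_acc(b, r, s, 2, 3);
    assert(a[0] == b[0] && a[1] == b[1]);
    assert(b[0] == -2.0f && b[1] == 8.0f);
}
```

The accumulator starts from e[i] rather than zero so that the result matches the original loop, which adds into whatever *e already holds.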
