ANSI C problem on P4 under Linux & Windows

VNG

I have an ANSI C program that was compiled with MSVC++ 6.0 (SP6) under Windows and
with GNU gcc under Linux, and run on P3, P4 and AMD machines.

It runs fine on the P3 and the AMD under both Windows and Linux, but on the P4 it has
problems. Under Windows, a 3GHz P4 runs it at half the speed of an 800MHz P3, and under
Linux it not only runs slower (the AMD is 40 times faster) but also
produces wrong numerical results.

Any suggestions as to what the problem could be?

How can we fix the P4's speed under MSVC++ 6.0 (SP6)?
How can we fix the P4's speed and numerical results under Linux?

Here are some more details about the compilation:
GNU:
CFLAGS=-O6 -fexpensive-optimizations -ffast-math -fno-strength-reduce
-funroll-loops -fomit-frame-pointer -Wno-long-long -Wno-unused


Basically, one of the most intensive loops (which we suspect, but aren't sure, is
causing the problem) looks like this:

static long loop_order;

void functionname ()
{
    register float *iPtr, *itPtr, *iPtr1, *cPtr, acc;
    register long j;
    :
    {
        register float c1, c2;
        j = loop_order;
        while (j--)
        {
            acc = *itPtr-- * c1;
            acc += *itPtr-- * c2;
            acc += *itPtr++ * c3;
            *cPtr++ += *iPtr1++ * acc;
        }
    }
    :
}

We have tried removing the "register" keyword and redeclaring "j" as
volatile; neither made any difference.


Thanks,
-- VNG
 

SM Ryan

# {
# register float c1, c2;
# j = loop_order;
# while (j--)
# {
# acc = *itPtr-- * c1;
# acc += *itPtr-- * c2;
# acc += *itPtr++ * c3;
# *cPtr++ += *iPtr1++ * acc;
# }
# }

Is there some reason to keep loading itPtr[-1] and itPtr[-2]
inside the loop instead of outside?
 

Profetas

which OS do you have in your P3?

newer OS/compiler may not use the register to store your vars which will
be slower
 

Jens.Toerring

Profetas said:
which OS do you have in your P3?

Did you even read the post? The OP states all of that at the start of his
article.

newer OS/compiler may not use the register to store your vars which will
be slower

That's simply BS. First of all, 'register' was never more than a
hint to the compiler that a variable will be used a lot and that
it might be a good idea to store it in a register. But the compiler
was always free to disregard this hint. Moreover, newer compilers
are usually quite good at figuring out such things, so you usually
don't need the 'register' keyword anymore because the compiler will
automatically pick the most suitable variables to keep in
registers. And, finally, this hasn't got anything at all to do with
the OS.

Regards, Jens
 

Jens.Toerring

No idea about the speed issues, and that's rather off-topic here,
because it's about the behavior of certain compilers in combination
with certain processors, none of which has much to do with C. As for
the wrong results with gcc, have another look at the info
pages concerning the -ffast-math option:
This option should never be turned on by any `-O' option since it
can result in incorrect output for programs which depend on an
exact implementation of IEEE or ISO rules/specifications for math
functions.

Perhaps it has got something to do with this...
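
A minimal sketch of why that warning matters, assuming the usual IEEE single-precision float: -ffast-math allows gcc to reassociate floating-point operations, and float addition is not associative. The helper names below are made up for illustration, and nothing here shows that reassociation is what actually goes wrong in the OP's program; it only shows that the same three numbers can sum to different results depending on the order.

```c
#include <assert.h>
#include <stdio.h>

/* Sum three floats left to right: (a + b) + c.
 * The volatile temporaries force each partial sum to be rounded to float,
 * even on targets that keep intermediates in wider registers. */
float sum_left(float a, float b, float c)
{
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* Sum the same floats right to left: a + (b + c). */
float sum_right(float a, float b, float c)
{
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}

/* Print both orderings for one ill-conditioned triple. */
void show_reassociation(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    printf("(a + b) + c = %g\n", (double)sum_left(a, b, c));
    printf("a + (b + c) = %g\n", (double)sum_right(a, b, c));
}
```

Calling show_reassociation() prints two different sums for the same inputs, because -1.0e8f + 1.0f rounds back to -1.0e8f in single precision. A compiler that is allowed to pick either ordering can therefore change the output.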

In your place I would probably start by throwing out all those
options and testing carefully which of them really make a difference
- some of them could even result in a slow-down when used with the
wrong processor type. And your code is so obfuscated (and
not the code you're actually using, by the way) that a compiler might have
problems figuring out how to optimize it. Try to rewrite it in an
understandable form and you might have a much better chance of getting
it optimized. If you then find it's too slow you can still try to
micro-optimize (but expect the effect to differ between compilers
and processors).
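
One way the loop could be written in a more readable form is sketched below. This is only a guess at the intent: the posted fragment never defines c3 or initialises its pointers, so here they are passed in as parameters, and the names kernel_pointer, kernel_indexed, out, gain and in_end are invented for the example. The first function keeps the posted pointer arithmetic; the second does the same work with plain indexing.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-walking form, following the posted loop line by line.
 * itPtr must point at the last of at least n + 2 input samples. */
void kernel_pointer(float *cPtr, const float *iPtr1, const float *itPtr,
                    float c1, float c2, float c3, long n)
{
    long j = n;

    while (j--) {
        float acc;

        acc  = *itPtr-- * c1;
        acc += *itPtr-- * c2;
        acc += *itPtr++ * c3;   /* net effect: itPtr moves back by one */
        *cPtr++ += *iPtr1++ * acc;
    }
}

/* Indexed form of the same computation: each output k combines three
 * neighbouring input samples, walking backwards from in_end. */
void kernel_indexed(float *out, const float *gain, const float *in_end,
                    float c1, float c2, float c3, long n)
{
    long k;

    for (k = 0; k < n; k++) {
        const float *p = in_end - k;
        float acc;

        acc  = p[0]  * c1;
        acc += p[-1] * c2;
        acc += p[-2] * c3;
        out[k] += gain[k] * acc;
    }
}
```

The indexed version makes it visible that the three loads move with every iteration, which also tells the compiler more directly what the access pattern is.
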
Basically one of the most intensive loops (that we suspect in but aren't sure if
it causes the problem) looks like this:

Profiling your code would probably be better than just guessing...

static long loop_order;
void functionname ()
{
register float *iPtr, *itPtr, *iPtr1, *cPtr, acc;

iPtr is declared here but never used in the code shown.

register long j;
:

What's that colon good for?

{

Why wrap this in another block?

register float c1, c2;

Where do c1 and c2 ever get assigned values?

j = loop_order;
while (j--)
{
acc = *itPtr-- * c1;

itPtr has never been assigned a value.

acc += *itPtr-- * c2;
acc += *itPtr++ * c3;

c3 is never defined anywhere.

*cPtr++ += *iPtr1++ * acc;

cPtr and iPtr1 were never assigned values either.

Now, what the hell is all that supposed to do?

Regards, Jens
 

CBFalconer

.... snip about systems - OT ...
Basically one of the most intensive loops (that we suspect in but
aren't sure if it causes the problem) looks like this:

static long loop_order;

void functionname ()
{
register float *iPtr, *itPtr, *iPtr1, *cPtr, acc;
register long j;
:
{
register float c1, c2;
j = loop_order;
while (j--)
{
acc = *itPtr-- * c1;
acc += *itPtr-- * c2;
acc += *itPtr++ * c3;
*cPtr++ += *iPtr1++ * acc;
}
}
:
}

We have tried to eliminate the use of the word "register" and
redefined "j" as volatile, no change.

What are those isolated colons doing? The register keyword seems
pointless, as does the volatile. Initializing the various
pointers might help. Same for the cNs. c3 seems to be undefined.
The time for multiplication can vary greatly with the operands.

As ever, first measure. It should not be any great effort to do
some profiling runs.
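
If no profiler is at hand, a crude first measurement can be made with the standard clock() function. The sketch below is an assumption-laden stand-in, not the OP's code: the kernel is an indexed guess at the posted loop, the coefficient and input values are arbitrary, and time_kernel is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Indexed stand-in for the posted loop; c3 is a parameter because the
 * post never defines it. in_end points at the last input sample. */
static void kernel(float *out, const float *gain, const float *in_end,
                   float c1, float c2, float c3, long n)
{
    long k;

    for (k = 0; k < n; k++) {
        const float *p = in_end - k;

        out[k] += gain[k] * (p[0] * c1 + p[-1] * c2 + p[-2] * c3);
    }
}

/* Run `repeats` passes over n outputs and return the CPU time used, in
 * seconds, or -1.0 if allocation fails. *checksum receives one output
 * value so the compiler cannot discard the work as unused. */
double time_kernel(long n, int repeats, float *checksum)
{
    float *in   = malloc((size_t)(n + 2) * sizeof *in);
    float *gain = malloc((size_t)n * sizeof *gain);
    float *out  = calloc((size_t)n, sizeof *out);
    double seconds = -1.0;
    long i;
    int r;

    assert(n > 0 && repeats > 0);
    if (in && gain && out) {
        clock_t start, stop;

        for (i = 0; i < n + 2; i++)
            in[i] = 0.25f * (float)(i % 7 + 1);   /* arbitrary test data */
        for (i = 0; i < n; i++)
            gain[i] = 1.0f;

        start = clock();
        for (r = 0; r < repeats; r++)
            kernel(out, gain, in + n + 1, 0.5f, 0.25f, 0.125f, n);
        stop = clock();

        seconds = (double)(stop - start) / CLOCKS_PER_SEC;
        *checksum = out[0];
    }
    free(in);
    free(gain);
    free(out);
    return seconds;
}
```

Running the same call, for example time_kernel(65536L, 1000, &cs), on each machine and with each set of compiler options gives numbers that can be compared directly. clock() has coarse resolution on some systems, so the repeat count should be large enough for the run to take at least a second or so.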
 

Christian Bau

P4s dislike accessing data at certain distances from each other. If the
distance between the various pointer variables is a multiple of a large
power of two (for example 64 KB) then you might be in trouble.
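
A quick way to test for this is to compare the addresses the loop actually uses. The sketch below is only an illustration: it takes the 64 KB figure from the example above as the stride, checks for an exact multiple, and the names ALIAS_STRIDE and same_alias_slot are invented. It relies on uintptr_t from C99's <stdint.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stride to test against: 64 KB, the example distance given above. */
#define ALIAS_STRIDE ((uintptr_t)65536)

/* Return 1 if a and b are distinct addresses whose distance is an exact
 * multiple of ALIAS_STRIDE, and 0 otherwise. */
int same_alias_slot(const void *a, const void *b)
{
    uintptr_t ua = (uintptr_t)a;
    uintptr_t ub = (uintptr_t)b;
    uintptr_t d  = ua > ub ? ua - ub : ub - ua;

    return d != 0 && d % ALIAS_STRIDE == 0;
}
```

If same_alias_slot(itPtr, cPtr) or a similar pair reports 1 at the start of the loop, padding one of the buffers by a cache line or so changes the distance, which makes it easy to see whether the slowdown goes away.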
 
