Is C faster than Fortran?


Pierre Asselin

I've been thinking of learning Fortran as a number-crunching kind of
language for my Physics degree... but then, looking around the
internet, people are saying that the libraries/algorithms once used
for number crunching are now slowly being converted to C. So do you
think I should stick with C, since I know C already, or should I
proceed with learning Fortran?
Any advice??

I think you can quickly learn enough Fortran at a superficial level
to do some number crunching. Just be good and use "implicit none".

Since you already know C, I would say: write your first project in C
but learn enough Fortran to translate it. Make sure the answers
are the same, then race the two and post the results here :)

Seriously. Fortran compilers can optimize more aggressively than
C because the language semantics are different. C99 plugs this
gap (mostly) with the "restrict" qualifier, but I don't know how
that plays out in practice and I would love to see data.
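For illustration, here is a minimal C sketch of what "restrict" promises (the function and names are made up for this example, not from any post in the thread):

```c
#include <stddef.h>

/* Without restrict, the compiler must assume x and y may overlap and
   reload through both pointers on every iteration.  With restrict the
   programmer promises no aliasing, so the loop can be vectorized and
   scheduled more aggressively -- the guarantee that Fortran dummy
   array arguments carry by default. */
void axpy(size_t n, double a,
          const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Whether a given compiler actually exploits the qualifier is exactly the open question above.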
 

Nelu

I am not a US resident so I need a Canadian VISA on my passport
anyway... and it's up from $75 to $100 right now, I believe. But let
me know if you happen to visit the Stowe - Waterbury area.

... and stop with the politics. Everyone is a liberal in VT and I'm
growing tired of it :)) (Oh, it's also OT).

Sorry for the double post, I blame it on Google :)
 

Kenneth Brody

Nelu wrote:
[... snip unsnipped quote ...]
Sorry for the double post, I blame it on Google :)

Well, triple, now that you posted it again with the apology. But
who's counting? :)

--
+-------------------------+--------------------+-----------------------+
| Kenneth J. Brody        | www.hvcomputer.com | #include              |
| kenbrody/at\spamcop.net | www.fptech.com     | <std_disclaimer.h>    |
+-------------------------+--------------------+-----------------------+
Don't e-mail me at: <mailto:[email protected]>
 

user923005

I think you can quickly learn enough Fortran at a superficial level
to do some number crunching.  Just be good and use "implicit none".

Since you already know C, I would say: write your first project in C
but learn enough Fortran to translate it.  Make sure the answers
are the same, then race the two and post the results here :)

Seriously.  Fortran compilers can optimize more aggressively than
C because the language semantics are different.  C99 plugs this
gap (mostly) with the "restrict" qualifier, but I don't know how
that plays out in practice and I would love to see data.

Using:
http://gcc.gnu.org/ml/fortran/2005-10/msg00329/TEST_FPU.f90
Via:
dcorbit@DCORBIT64 /f/tmp
$ g95 -O3 -Wall test_fpu.f90
In file test_fpu.f90:83

91 FORMAT (A,I4,2('/',I2.2))
1
Warning (110): Label 91 at (1) defined but not used
In file test_fpu.f90:2293

INTEGER :: i , info , j , l , ncola , nrowa , nrowb
1
Warning (112): Variable 'ncola' at (1) is set but never used
test_fpu.f90: In function 'dtrmv_':
test_fpu.f90:3611: warning: 'kx' may be used uninitialized in this
function

dcorbit@DCORBIT64 /f/tmp
$ ./a
Benchmark running, hopefully as only ACTIVE task
Test1 - Gauss 2000 (101x101) inverts 15.1 sec Err= 0.000000000000002
Test2 - Crout 2000 (101x101) inverts 7.4 sec Err= 0.000000000000000
Test3 - Crout 2 (1001x1001) inverts 13.8 sec Err= 0.000000000000005
Test4 - Lapack 2 (1001x1001) inverts 10.0 sec Err= 0.000000000000417
total = 46.3 sec


dcorbit@DCORBIT64 /f/tmp
$ g95 --version
G95 (GCC 4.1.2 (g95 0.91!) Mar 21 2007)
Copyright (C) 2002-2005 Free Software Foundation, Inc.

G95 comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of G95
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING


Using:
/*
* --------------------------------------------------------------
* TEST_FPU A number-crunching benchmark using matrix inversion.
* Implemented by: David Frank (e-mail address removed)
* Gauss routine by: Tim Prince (e-mail address removed)
* Crout routine by: James Van Buskirk (e-mail address removed)
* F90->C source by: Sergey N. Voronkov (e-mail address removed)
* Pointer exchange version by: Dieter Buerssner (e-mail address removed)
* --------------------------------------------------------------
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>

/*
* Compiling with NI = 1001 (default) generates pool(51,51,1000) = 20mb.
* If test system pages due to insufficient memory (disk i/o activity
* occurs), abort run and compile with NI = 200, benchmark will adjust
* time x 5.
*/

#define NI 1001
#define NN 51
#define RK8 double


/* below are additional C routines supplied by translator */

void memflt()
{
fputs("Memory allocation error\n", stderr);
exit(EXIT_FAILURE);
}


void alloc_arrays(RK8 ** p[NI], RK8 *** a, RK8 *** b)
{
int i,
j;

for (i = 0; i < NI; i++) {
if ((p[i] = (RK8 **) malloc(NN * sizeof(RK8 *))) == NULL)
memflt();
for (j = 0; j < NN; j++)
if ((p[i][j] = (RK8 *) malloc(NN * sizeof(RK8))) == NULL)
memflt();
}
if ((*a = (RK8 **) malloc(NN * sizeof(RK8 *))) == NULL ||
(*b = (RK8 **) malloc(NN * sizeof(RK8 *))) == NULL)
memflt();
for (i = 0; i < NN; i++)
if (((*a)[i] = (RK8 *) malloc(NN * sizeof(RK8))) == NULL ||
((*b)[i] = (RK8 *) malloc(NN * sizeof(RK8))) == NULL)
memflt();
}


void random_number(RK8 ** pool[NI])
{
int i,
j,
k;

for (i = 0; i < NI; i++)
for (j = 0; j < NN; j++)
for (k = 0; k < NN; k++)
pool[i][j][k] = (RK8) (rand()) / RAND_MAX;
}


RK8
timesec()
{
return (RK8) (clock()) / CLOCKS_PER_SEC;
}


/* prototype the invert functions that follow exec source */
void Gauss(RK8 ** a, RK8 ** b, int n);
void Crout(RK8 ** a, RK8 ** b, int n);
int rgaussi(RK8 ** a, RK8 ** b, int n);

int main()
{

RK8 **pool[NI]; /* pool of matrices to invert */
RK8 **a,
**ai; /* working matrices use < 256k */
RK8 avg_err,
total_time,
time1;
int i,
j,
n;

char *revision = "01/10/98"; /* Gauss speedup mod */
char invert_id[3][8] = { "Gauss", "Crout", "Dieter" };

struct tm *ptm;
time_t crtime;
FILE *fp;

/* Begin by allocating matrix arrays */
alloc_arrays(pool, &a, &ai);

puts("Benchmark running, hopefully as only ACTIVE task");

if ((fp = fopen("test_fpc.dat", "w")) == NULL) {
fprintf(stderr, "Can't open output file!\n");
return EXIT_FAILURE;
}
crtime = time(NULL);
ptm = gmtime(&crtime);

fprintf(fp, "Date run = %2d/%2d/%2d\n",
ptm->tm_mon + 1, ptm->tm_mday, ptm->tm_year);

fputs("Pls supply info below, send to (e-mail address removed)\n"
"Tester name/ net address = \n"
"CPU mfg/id/Mhz = \n"
"MEM/CACHE = \n"
"O.S. / COMPILER = \n"
"Additional comments = \n\n\n\n\n", fp);

fprintf(fp, "Results for %s revision using TEST_FPU.C \n",
revision);

srand(time(NULL)); /* set seed to random number based on time */
random_number(pool); /* fill pool with random data ( 0. -> 1. ) */

for (n = 0; n < 3; n++) { /* for Gauss,Crout algorithms */
time1 = timesec(); /* start benchmark n time */

for (i = 0; i < NI; i++) {
/* get next matrix to invert */
for (j = 0; j < NN; j++)
memcpy(a[j], pool[i][j], sizeof(RK8) * NN);

switch (n) {
case 0:
Gauss(a, ai, NN); /* invert a -> ai ; destructs a */
Gauss(ai, a, NN); /* invert ai -> a */
break;
case 1:
Crout(a, ai, NN); /* invert a -> ai ; nondestructs a */
Crout(ai, a, NN); /* invert ai -> a */
break;
case 2:
rgaussi(a, ai, NN); /* invert a -> ai ; nondestructs a */
rgaussi(ai, a, NN); /* invert ai -> a */
break;
}
}

total_time = timesec() - time1; /* = elapsed time sec. */

/* check accuracy last matrix invert. */
avg_err = 0;
for (i = 0; i < NN; i++)
for (j = 0; j < NN; j++)
avg_err += fabs(a[i][j] - pool[NI - 1][i][j]);

if (NI == 200)
fprintf(fp, "\n%s 5 x 200 x 2 inverts = %6.1f sec.\n",
invert_id[n], 5 * total_time);
else
fprintf(fp, "\n%s 1000 x 2 inverts = %6.1f sec.\n",
invert_id[n], total_time);

fputs("Accuracy of 2 computed numbers\n", fp);
fprintf(fp, "Original = %18.15f %18.15f\n",
pool[NI - 1][0][0], pool[NI - 1][NN - 1][NN - 1]);
fprintf(fp, "Computed = %18.15f %18.15f\n",
a[0][0], a[NN - 1][NN - 1]);
fprintf(fp, "Avg Err. = %18.15f\n", avg_err / (NN * NN));

} /* for Gauss,Crout algorithms */

puts("Results written to: TEST_FPC.DAT");

return EXIT_SUCCESS;
}


/*
* --------------------------------------
* Invert matrix a -> b by Gauss method
* --------------------------------------
*/
void Gauss(RK8 ** a, RK8 ** b, int n)
{
RK8 d,
temp = 0,
c;
int i,
j,
k,
m,
nn,
*ipvt;

if ((ipvt = (int *) malloc(n * sizeof(int))) == NULL)
memflt();

nn = n;
for (i = 0; i < nn; i++)
ipvt[i] = i;

for (k = 0; k < nn; k++) {
temp = 0.;
m = k;
for (i = k; i < nn; i++) {
d = a[i][k];
if (fabs(d) > temp) {
temp = fabs(d);
m = i;
}
}
if (m != k) {
j = ipvt[k];
ipvt[k] = ipvt[m];
ipvt[m] = j;
for (j = 0; j < nn; j++) {
temp = a[j][k];
a[j][k] = a[j][m];
a[j][m] = temp;
}
}
d = 1 / a[k][k];
for (j = 0; j < k; j++) {
c = a[j][k] * d;
for (i = 0; i < nn; i++)
a[j][i] -= a[k][i] * c;
a[j][k] = c;
}
for (j = k + 1; j < nn; j++) {
c = a[j][k] * d;
for (i = 0; i < nn; i++)
a[j][i] -= a[k][i] * c;
a[j][k] = c;
}
for (i = 0; i < nn; i++)
a[k][i] = -a[k][i] * d;
a[k][k] = d;
}

for (i = 0; i < nn; i++)
memcpy(b[ipvt[i]], a[i], sizeof(RK8) * nn);

free(ipvt);
}


/*
* --------------------------------------
* Invert matrix a -> b by Crout method
* --------------------------------------
*/
void Crout(RK8 ** a, RK8 ** b, int n)
{
int i,
j; /* Current row & column */
int maxlok; /* Location of maximum pivot */
int *index; /* Partial pivot record */
RK8 *temp = 0,
the_max;
RK8 tmp,
*ptr;
RK8 *matr = 0;
int k,
ind,
ind2;

if ((index = (int *) malloc(n * sizeof(int))) == NULL ||
(temp = (RK8 *) malloc(n * sizeof(RK8))) == NULL ||
(matr = (RK8 *) malloc(n * n * sizeof(RK8))) == NULL)
memflt();

/* Initialize everything */

for (i = 0; i < n; i++)
index[i] = i;

/* Shuffle matrix */
for (j = 0; j < n; j++) {
for (i = 0; i < j; i++)
b[i][j] = a[i][j];
for (i = j; i < n; i++)
b[i][j] = a[i - j][n - j - 1];
}

/* LU decomposition; reciprocals of diagonal elements in L matrix */
for (j = 0; j < n; j++) {
/* Get current column of L matrix */
for (i = j; i < n; i++) {
tmp = 0;
ind = n - i - 1;
for (k = 0; k < j; k++)
tmp += b[ind][ind + k] * b[j][k];
b[ind][ind + j] -= tmp;
}
maxlok = 0;
the_max = fabs(b[0][j]);
for (i = 1; i < n - j; i++)
if (fabs(b[i][j + i]) >= the_max) {
the_max = fabs(b[i][j + i]);
maxlok = i;
}
/* det = det*b(j+maxlok-1,maxlok) */
b[maxlok][j + maxlok] = 1 / b[maxlok][j + maxlok];

/* Swap biggest element to current pivot position */
if (maxlok + 1 != n - j) {
ind = n - maxlok - 1;
ind2 = index[j];
index[j] = index[ind];
index[ind] = ind2;
for (k = n - maxlok; k < n; k++) {
tmp = b[k][j];
b[k][j] = b[k][ind];
b[k][ind] = tmp;
}
memcpy(temp, &(b[maxlok][maxlok]), sizeof(RK8) * (n - maxlok));
ptr = &(b[n - j - 1][n - j - 1]);
memmove(&(b[maxlok][maxlok]), ptr, sizeof(RK8) * (j + 1));
for (k = j + 1; k < n - maxlok; k++)
b[maxlok][maxlok + k] = b[k][j];
memcpy(ptr, temp, (j + 1) * sizeof(RK8));
for (k = j + 1; k < n - maxlok; k++)
b[k][j] = temp[k];
}
/* Get current row of U matrix */
ind = n - j - 1;
for (i = j + 1; i < n; i++) {
tmp = 0.;
for (k = 0; k < j; k++)
tmp += b[i][k] * b[ind][ind + k];
b[i][j] = b[ind][n - 1] * (b[i][j] - tmp);
}
} /* END DO LU_outer */

/* Invert L matrix */
for (j = 0; j < n - 1; j++) {
temp[0] = b[n - j - 1][n - 1];
for (i = j + 1; i < n; i++) {
ind = n - i - 1;
tmp = 0.;
for (k = 0; k < i - j; k++)
tmp += temp[k] * b[ind][ind + j + k];
b[ind][ind + j] = -tmp * b[ind][n - 1];
temp[i - j] = b[ind][ind + j];
}
}

/* Reshuffle matrix */
for (i = 0; i < (n + 1) / 3; i++) {
memcpy(temp, &(b[i][2 * (i + 1) - 1]), sizeof(RK8) * (n + 2 - 3 * (i + 1)));
for (j = 2 * (i + 1) - 1; j < n - i; j++)
b[i][j] = b[n - j - 1][n - j + i - 1];
ind = n - i - 1;
for (j = i; j < n + 1 - 2 * (i + 1); j++)
b[j][i + j] = b[n - i - j - 1][ind];
for (k = 0; k < n + 2 - 3 * (i + 1); k++)
b[i + 1 + k][ind] = temp[k];
}

/* Invert U matrix */
for (i = 0; i < n - 1; i++) {
for (j = i + 1; j < n; j++) {
tmp = 0.;
for (k = 0; k < j - i - 1; k++)
tmp += temp[k] * b[j][i + 1 + k];
b[i][j] = -b[i][j] - tmp;
temp[j - i - 1] = b[i][j];
}
}

/* Multiply inverses in reverse order */
for (i = 0; i < n - 1; i++) {
for (k = 0; k < n - i - 1; k++)
temp[k] = b[i][i + 1 + k];
for (j = 0; j <= i; j++) {
tmp = 0.;
for (k = 0; k < n - i - 1; k++)
tmp += temp[k] * b[j][i + 1 + k];
b[j][i] += tmp;
}
for (j = i + 1; j < n; j++) {
tmp = 0.;
for (k = j; k < n; k++)
tmp += temp[k - i - 1] * b[j][k];
b[j][i] = tmp;
}
}

/* Straighten out the columns of the result */
for (i = 0; i < n; i++)
memcpy(matr + n * i, b[i], sizeof(RK8) * n);
for (i = 0; i < n; i++)
memcpy(b[index[i]], matr + n * i, sizeof(RK8) * n);

free(index);
free(temp);
free(matr);
}

/*
** This routine is due to (e-mail address removed) (Dieter Buerssner)
** Destroys a, return 0: success, 1: zero pivot, 2: out of mem.
*/
int rgaussi(RK8 ** a, RK8 ** b, int n)
{
int i,
j,
k,
maxj,
t;
RK8 maxel,
pivinv,
tmaxel,
aji;
RK8 *tp,
*ai,
*aj;
/* C99: int ipiv[n]; */
int *ipiv = malloc(n * sizeof *ipiv);

if (ipiv == NULL)
return 2;
for (i = 0; i < n; i++)
ipiv[i] = i;
for (i = 0; i < n; i++) {
maxj = -1;
maxel = 0.0;
/* find pivot element */
for (j = i; j < n; j++)
if ((tmaxel = fabs(a[j][i])) > maxel) {
maxj = j;
maxel = tmaxel;
}
if (maxj < 0) {
free(ipiv);
return 1;
}
/* exchange rows */
if (maxj != i) {
/* just exchange pointers for a */
tp = a[maxj];
a[maxj] = a[i];
a[i] = tp;
t = ipiv[maxj];
ipiv[maxj] = ipiv[i];
ipiv[i] = t;
}
ai = a[i];
pivinv = 1.0 / ai[i];
ai[i] = 1.0;
for (k = 0; k < n; k++)
ai[k] *= pivinv;
for (j = 0; j < n; j++) {
if (j != i) {
aj = a[j];
aji = aj[i];
aj[i] = 0.0;
for (k = 0; k < n; k++)
aj[k] -= aji * ai[k];
}
}
}
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
b[ipiv[j]][i] = a[j][i];
free(ipiv);
return 0;
}

via:
CL /Ox /Ob2 /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /MD /Zp16 /fp:fast /Fo"Release\\" /Fd"Release\vc80.pdb" /W4 /nologo /c /Wp64 /Zi /Gr /TP /wd4996 /errorReport:prompt

Date run = 3/27/107
Pls supply info below, send to (e-mail address removed)
Tester name/ net address = {Dann Corbit/[email protected]}
CPU mfg/id/Mhz =
CPU Identification utility v1.11 (c) 1997-2005 Jan Steunebrink

──────────────────────────────────────────────────────────────────────────────
CPU Vendor and Model: AMD Athlon 64 2800+-3700+
Internal CPU speed : 2199.4 MHz (using internal Time Stamp Counter)
Clock Multiplier : Available only in Real Mode!
CPU-ID Vendor string: AuthenticAMD
CPU-ID Name string : AMD Athlon(tm) 64 Processor 3400+
CPU-ID Signature : 0F4A
│││└─ Stepping or sub-model no.
││└─ Model: Indicates CPU Model and 486 L1 cache mode
│└─ Family: 4=486, Am5x86, Cx5x86
│           5=Pentium, Nx586, Cx6x86, K5/K6, C6, mP6
│           6=PentiumPro/II/III, CxMII/III, Athlon, C3
│           F=Pentium4, Athlon64
└─ Type: 0=Standard, 1=Overdrive, 2=2nd Dual Pentium
Current CPU mode : Protected
Internal (L1) cache : Enabled in Write-Back mode
MEM/CACHE = 2GB physical RAM
O.S. / COMPILER = Windows 2003 / Microsoft (R) 32-bit C/C++ Optimizing
Compiler Version 14.00.50727.42 for 80x86
Additional comments =




Results for 01/10/98 revision using TEST_FPU.C

Gauss 1000 x 2 inverts = 0.4 sec.
Accuracy of 2 computed numbers
Original = 0.753715628528703 0.364574114200262
Computed = 0.753715628528701 0.364574114200264
Avg Err. = 0.000000000000001

Crout 1000 x 2 inverts = 0.7 sec.
Accuracy of 2 computed numbers
Original = 0.753715628528703 0.364574114200262
Computed = 0.753715628528701 0.364574114200266
Avg Err. = 0.000000000000003

Dieter 1000 x 2 inverts = 0.4 sec.
Accuracy of 2 computed numbers
Original = 0.753715628528703 0.364574114200262
Computed = 0.753715628528703 0.364574114200264
Avg Err. = 0.000000000000002

We need a Fortran Guru to show me what I am doing wrong, because there
is no feasible explanation for this other than I do not know how to
get performance out of my Fortran compiler.
 

CBFalconer

Al said:
.... snip ...
Do it soon. Bush's inane passport moves are going to virtually
cut off international travel unless somebody rapidly gains sanity.
[OT]
Is there some reason you aren't eligible for a passport? IMO, this
should have been done years ago. BTW, Bush is also lobbying to add
22 more countries to the visa waiver program, which certainly won't
hurt international travel.
[OT]

It's just another means of increasing the bureaucracy size and
raising taxes. No practical benefits to anyone or anything. Many
detriments.

During WWII we derided the practice of carrying papers in
dictatorial realms.
 

glen herrmannsfeldt

user923005 said:
On Mar 27, 8:41 am, (e-mail address removed) (Pierre Asselin) wrote: (snip)
(snip)

We need a Fortran Guru to show me what I am doing wrong, because there
is no feasible explanation for this other than I do not know how to
get performance out of my Fortran compiler.

That is what they say, and maybe on the average it is true.

There are too many variables to make it true in general.
First, the programs should be as similar as possible, and also
the data. In the (snipped) programs, the C version uses arrays of
pointers to arrays of pointers to arrays of pointers, which is not
likely true for Fortran, and might make a difference either way.

The programs are doing matrix inversion of random numbers, which
are generated using system dependent generators. There may be some
data dependence to the timing of matrix inversion, especially
if the hardware timing is different for different data.

Optimization is probably more sensitive to the number of registers
available than to the ability to rearrange expressions.
Do both Fortran and C programs have the array subscripts in the
optimal order? Most compilers won't change that.
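As an illustration of the subscript-order point, here is a hypothetical C sketch (names invented for this example). C is row-major, the opposite of Fortran, so a naive transliteration of Fortran loops can stride through memory and lose the cache:

```c
#define N 512

/* C stores a[i][j] row-major: the rightmost subscript should vary
   fastest in the inner loop.  sum_rowwise walks memory contiguously;
   sum_colwise touches one element per N-element stride, which is the
   access pattern a literal translation of Fortran loops produces. */
double sum_rowwise(double (*a)[N], int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];          /* contiguous */
    return s;
}

double sum_colwise(double (*a)[N], int n)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < n; i++)
            s += a[i][j];          /* strided */
    return s;
}
```

Both return the same sum; only the memory traffic differs, and for matrices that spill out of cache the difference can dominate the benchmark.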

-- glen
 

highegg

We need a Fortran Guru to show me what I am doing wrong, because there
is no feasible explanation for this other than I do not know how to
get performance out of my Fortran compiler.

Perhaps. I see you specified a whole bunch of options for the MSVC++
compiler (out of those I only recall that the /G options are for code
generation, /O are presumably optimizers), while you just specify -O3
for g95. Additional options for g95 (as well as other compilers) may
include -ffast-math (equivalent to /fp:fast for VC++?),
-fomit-frame-pointer, -march=athlon (or something similar), and
-funroll-loops.
g95 has almost all optimizations turned off by default.
Even if you do so, do not expect miracles. Two things need to be
noted:
the Fortran code may not be written with speed in mind. Note that
array arithmetic has to create a temporary array if necessary. It is
up to the compiler to determine whether or not it is necessary.
Especially tough might be tricks like this (snipped from TEST_FPU.f90)
b((/m,k/),:) = b((/k,m/),:)

although the swapping here can be done with one temp variable, this is
probably going to create a 2 x size(b,2) temporary array, copy the
right-hand side to it, then copy it to the left-hand side. A more
speed-friendly coding could look like this:

elemental subroutine swap(a,b)
real,intent(inout)::a,b
real:: tmp
tmp = a; a = b; b = tmp
end subroutine
...
call swap(b(m,:),b(k,:))

or the good old DO loop.
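For comparison, the single-temporary row swap that the elemental subroutine (or the DO loop) boils down to looks like this in C (an illustrative sketch, not code from the benchmark):

```c
/* The Fortran expression b((/m,k/),:) = b((/k,m/),:) may force the
   compiler to build a 2 x size(b,2) temporary array.  This hand-written
   loop performs the same row swap with one scalar temporary per
   element, touching each row exactly once. */
void swap_rows(int n, double *row_m, double *row_k)
{
    for (int j = 0; j < n; j++) {
        double tmp = row_m[j];
        row_m[j] = row_k[j];
        row_k[j] = tmp;
    }
}
```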

Moreover, it is important to note that g95 isn't the top optimizing
compiler out there; to be honest, it may well be the bottom. It was
not created with aggressive optimizations in mind - and optimizing
complicated array expressions in particular is a tough task. I think
that gfortran (the newer versions) can perform better, but a fairer
comparison would be with a commercial Windows compiler, like Intel
Visual Fortran.
Still, it is hard to tell whether a C code is equivalent to some
Fortran code. I did not review your C code closely enough to see how
close it gets.

regards,
Jaroslav (not remotely a Fortran guru)
 

user923005

Perhaps. I see you specified a whole bunch of options for the MSVC++
compiler (out of those I only recall that the /G options are for code
generation, /O are presumably optimizers), while you just specify -O3
for g95. Additional options for g95 (as well as other compilers) may
include -ffast-math (equivalent to /fp:fast for VC++?),
-fomit-frame-pointer, -march=athlon (or something similar), and
-funroll-loops.
g95 has almost all optimizations turned off by default.
Even if you do so, do not expect miracles. Two things need to be
noted:

I have an AMD machine, so I did like you suggest:
dcorbit@DCORBIT64 /f/tmp
$ g95 -O3 -Wall -ffast-math -fomit-frame-pointer -march=athlon -funroll-loops test_fpu.f90
In file test_fpu.f90:83

91 FORMAT (A,I4,2('/',I2.2))
1
Warning (110): Label 91 at (1) defined but not used
In file test_fpu.f90:2293

INTEGER :: i , info , j , l , ncola , nrowa , nrowb
1
Warning (112): Variable 'ncola' at (1) is set but never used
test_fpu.f90: In function 'dtrmv_':
test_fpu.f90:3611: warning: 'kx' may be used uninitialized in this
function

dcorbit@DCORBIT64 /f/tmp
$ ./a
Benchmark running, hopefully as only ACTIVE task
Test1 - Gauss 2000 (101x101) inverts 12.7 sec Err= 0.000000000000003
Test2 - Crout 2000 (101x101) inverts 7.0 sec Err= 0.000000000000001
Test3 - Crout 2 (1001x1001) inverts 11.6 sec Err= 0.000000000000002
Test4 - Lapack 2 (1001x1001) inverts 8.0 sec Err= 0.000000000000643
total = 39.3 sec

the Fortran code may not be written with speed in mind. Note that
array arithmetic has to create a temporary array if necessary. It is
up to the compiler to determine whether or not it is necessary.
Especially tough might be tricks like this (snipped from TEST_FPU.f90)
b((/m,k/),:) = b((/k,m/),:)

although the swapping here can be done with one temp variable, this is
probably going to create a 2 x size(b,2) temporary array, copy the
right-hand side to it, then copy it to the left-hand side. A more
speed-friendly coding could look like this:

elemental subroutine swap(a,b)
real,intent(inout)::a,b
real:: tmp
tmp = a; a = b; b = tmp
end subroutine
...
call swap(b(m,:),b(k,:))

or the good old DO loop.

Moreover, it is important to note that g95 isn't the top optimizing
compiler out there; to be honest, it may well be the bottom. It was
not created with aggressive optimizations in mind - and especially
optimizing of complicated array expressions is a tough task. I think
that gfortran (the newer versions) can perform better, but I would
consider more fair a comparison with a commercial Windows compiler,
like Intel Visual Fortran. Don't have that one.

gfortran barfed:
dcorbit@DCORBIT64 /f/tmp
$ gfortran -O3 -Wall test_fpu.f90
In file test_fpu.f90:83

91 FORMAT (A,I4,2('/',I2.2))
1
Warning: Label 91 at (1) defined but not used
test_fpu.f90: In function 'gauss':
test_fpu.f90:106: internal compiler error: in gfc_conv_ss_descriptor,
at fortran/trans-array.c:1235
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:http://gcc.gnu.org/bugs.html> for instructions.

dcorbit@DCORBIT64 /f/tmp
$ gfortran --version
GNU Fortran 95 (GCC 4.0.1 20050608 (prerelease))
Copyright (C) 2005 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

dcorbit@DCORBIT64 /f/tmp
 

Richard Bos

Nelu said:
I work 3 miles away from Ben & Jerry's original factory. They'll have
their annual Free Cone Day in April or May. Care to taste some
Strawberry Cheesecake Icecream? You can also visit Canada, it's two
hours away :)).

I prefer real C and real ice cream. Luckily I have a real Italian
gelatero living in town, and a real C Standard printed by Wiley on my
desk.

Richard
 

Michael Prager

highegg said:
Moreover, it is important to note that g95 isn't the top optimizing
compiler out there; to be honest, it may well be the bottom.

That's been true for me -- g95 execution time is twice that of
Lahey LF95 on my codes, and Lahey is not the fastest
Windows Fortran compiler.

(Need I say, I still consider g95 a wonderful tool and
outstanding accomplishment.)
 

Pierre Asselin

In comp.lang.c user923005 said:
Using:
http://gcc.gnu.org/ml/fortran/2005-10/msg00329/TEST_FPU.f90
Via:
dcorbit@DCORBIT64 /f/tmp
$ g95 -O3 -Wall test_fpu.f90
dcorbit@DCORBIT64 /f/tmp
$ ./a
Benchmark running, hopefully as only ACTIVE task
Test1 - Gauss 2000 (101x101) inverts 15.1 sec Err= 0.000000000000002
Test2 - Crout 2000 (101x101) inverts 7.4 sec Err= 0.000000000000000
Test3 - Crout 2 (1001x1001) inverts 13.8 sec Err= 0.000000000000005
Test4 - Lapack 2 (1001x1001) inverts 10.0 sec Err= 0.000000000000417
total = 46.3 sec

These numbers are quite reasonable. At O(n^3) operations per matrix
inversion, they correspond to 136, 278, 145 and 201 megaflops. I
didn't do an accurate flop count, just 2000*101^3 ops for the first
two and 2*1001^3 for the last two.
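That back-of-envelope arithmetic can be restated as a few lines of C (a hypothetical helper, just reproducing the estimate above):

```c
/* Rough megaflop rate: ~n^3 floating-point ops per n x n inversion,
   done `inverts` times in `seconds` of wall time.  mflops(2000, 101,
   15.1), mflops(2000, 101, 7.4), mflops(2, 1001, 13.8) and
   mflops(2, 1001, 10.0) give roughly 136, 278, 145 and 201. */
static double mflops(double inverts, double n, double seconds)
{
    return inverts * n * n * n / seconds / 1e6;
}
```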

Using:
[ similar C program omitted, except: ]
[ #define NI 1001 ] --number of repeats
[ #define NN 51 ] --matrix size
via:
CL /Ox /Ob2 /Oi /Ot /Oy /GT /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /MD /Zp16 /fp:fast /Fo"Release\\" /Fd"Release\vc80.pdb" /W4 /nologo /c /Wp64 /Zi /Gr /TP /wd4996 /errorReport:prompt
Results for 01/10/98 revision using TEST_FPU.C
Gauss 1000 x 2 inverts = 0.4 sec.
Accuracy of 2 computed numbers
Original = 0.753715628528703 0.364574114200262
Computed = 0.753715628528701 0.364574114200264
Avg Err. = 0.000000000000001
Crout 1000 x 2 inverts = 0.7 sec.
Accuracy of 2 computed numbers
Original = 0.753715628528703 0.364574114200262
Computed = 0.753715628528701 0.364574114200266
Avg Err. = 0.000000000000003
Dieter 1000 x 2 inverts = 0.4 sec.
Accuracy of 2 computed numbers
Original = 0.753715628528703 0.364574114200262
Computed = 0.753715628528703 0.364574114200264
Avg Err. = 0.000000000000002
We need a Fortran Guru to show me what I am doing wrong, because there
is no feasible explanation for this other than I do not know how to
get performance out of my Fortran compiler.

From a cursory inspection of the C, these are smaller systems: 2002
inversions of 51x51 matrices compared to the 2000 101x101 and 2
1001x1001 of the Fortran example. The numbers correspond to ~332,
190 and 332 megaflops, but the accuracy may not be so good because
of the short times.

On the Fortran code, I get
gfortran -o fortran -O3 -funroll-loops TEST_FPU.f90
./fortran
Benchmark running, hopefully as only ACTIVE task
Test1 - Gauss 2000 (101x101) inverts 6.6 sec Err= 0.000000000000003
Test2 - Crout 2000 (101x101) inverts 7.1 sec Err= 0.000000000000000
Test3 - Crout 2 (1001x1001) inverts 22.9 sec Err= 0.000000000000001
Test4 - Lapack 2 (1001x1001) inverts 20.8 sec Err= 0.000000000000273
total = 57.3 sec

Approx 312, 290, 87.6 and 96.4 megaflops (on a little laptop, must
be spilling out of cache).

I translated Gauss() to C by hand but not the other two (especially
not Lapack() !). I'll append my code below. I basically transliterated
the algorithm, using C99 VLAs instead of arrays of pointers. I
also transposed all the matrices so the memory access patterns
would be similar in C and in Fortran; this is OK mathematically,
since matrix inversion commutes with transposition. I added a unit
test, not shown, to check that multiplying a matrix by the alleged
inverse returned by Gauss() yields the identity matrix.
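A sketch of what such an identity check might look like (the actual test was not posted; this is only a guess at its shape, with invented names):

```c
#include <math.h>
#include <stdbool.h>

/* Multiply a by its alleged inverse ainv and verify the product is
   the identity matrix to within tol in every entry. */
bool is_inverse(int n, double a[n][n], double ainv[n][n], double tol)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i][k] * ainv[k][j];
            /* expect 1 on the diagonal, 0 elsewhere */
            if (fabs(s - (i == j ? 1.0 : 0.0)) > tol)
                return false;
        }
    return true;
}
```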

I get:

cc -o C -std=c99 -O3 -funroll-loops -Wall TEST_FPU.c
./C
Test1 - Gauss 2000 (101x101) inverts 6.415025 sec Err=0.000000
--rest not implemented--

so comparable to Fortran. Maybe a smidge faster, in fact.

Here is the C version of Gauss(), followed by the original
Fortran. Oh, and the "restrict" keyword doesn't make one
bit of difference.

**********************************************************************

static inline int isamax(int n, double v[n])
{
int i, imax;
double tmp, max;
for(i= 1, imax= 0, max= fabs(v[0]); i<n; i++) {
tmp= fabs(v[i]);
if(tmp>max) { max= tmp; imax= i; }
}
return imax;
}

static void Gauss(int n, double a[n][n])
{
double (* restrict b)[n]= malloc(n*sizeof(b[0]));
double * restrict temp= malloc(n*sizeof(temp[0]));
int * restrict piv= malloc(n*sizeof(piv[0]));
double c, d;
int i, j, k, m, jmax;

memcpy(b, a, n*sizeof(b[0]));
for(i= 0; i<n; i++) piv[i]= i;

for(k= 0; k<n; k++) {
/* largest element in b[k][k:n] */
jmax= k+isamax(n-k, &b[k][k]);

if(jmax != k) {
/* swap columns k and jmax */
m= piv[k]; piv[k]= piv[jmax]; piv[jmax]= m;
for(i= 0; i<n; i++) {
c= b[i][k];
b[i][k]= b[i][jmax];
b[i][jmax]= c;
}
}

d= 1./b[k][k];

memcpy(temp, b[k], sizeof(b[k]));
for(i= 0; i<n; i++) {
c= b[i][k]*d;
for(j= 0; j<n; j++) {
b[i][j]-= temp[j]*c;
}
b[i][k]= c;
}
for(j= 0; j<n; j++) b[k][j]= -d*temp[j];
b[k][k]= d;
}

for(i= 0; i<n; i++) for(j= 0; j<n; j++) {
a[piv[i]][j]= b[i][j];
}

free(piv); free(temp); free(b);
}

**********************************************************************

MODULE kinds
INTEGER, PARAMETER :: RK8 = SELECTED_REAL_KIND(15, 300)
END MODULE kinds

SUBROUTINE Gauss (a,n) ! Invert matrix by Gauss method
! --------------------------------------------------------------------
USE kinds
IMPLICIT NONE

INTEGER :: n
REAL(RK8) :: a(n,n)

! - - - Local Variables - - -
REAL(RK8) :: b(n,n), c, d, temp(n)
INTEGER :: i, j, k, m, imax(1), ipvt(n)
! - - - - - - - - - - - - - -
b = a
ipvt = (/ (i, i = 1, n) /)

DO k = 1,n
imax = MAXLOC(ABS(b(k:n,k)))
m = k-1+imax(1)

IF (m /= k) THEN
ipvt( (/m,k/) ) = ipvt( (/k,m/) )
b((/m,k/),:) = b((/k,m/),:)
END IF
d = 1/b(k,k)

temp = b(:,k)
DO j = 1, n
c = b(k,j)*d
b(:,j) = b(:,j)-temp*c
b(k,j) = c
END DO
b(:,k) = temp*(-d)
b(k,k) = d
END DO
a(:,ipvt) = b

END SUBROUTINE Gauss

**********************************************************************
 

Arjen Markus

I used the above program with two different compilers on one Linux
machine and got the following results:

% ifort -o test_fpu test_fpu.f90
% test_fpu
Benchmark running, hopefully as only ACTIVE task
Test1 - Gauss 2000 (101x101) inverts 3.1 sec Err= 0.000000000000002
Test2 - Crout 2000 (101x101) inverts 4.9 sec Err= 0.000000000000003
Test3 - Crout 2 (1001x1001) inverts 9.6 sec Err= 0.000000000000073
Test4 - Lapack 2 (1001x1001) inverts 6.6 sec Err= 0.000000000000537
total = 24.2 sec

and:
% g95 -o test_fpu test_fpu.f90
Benchmark running, hopefully as only ACTIVE task
Test1 - Gauss 2000 (101x101) inverts 56.5 sec Err= 0.000000000000002
Test2 - Crout 2000 (101x101) inverts 21.1 sec Err= 0.000000000000001
Test3 - Crout 2 (1001x1001) inverts 13.1 sec Err= 0.000000000000001
Test4 - Lapack 2 (1001x1001) inverts 34.2 sec Err= 0.000000000000329
total =124.9 sec

That is more than a factor of 5 in the total time!

When I tried the same with optimisation for speed:

ifort: -O2 -ipo -static (derived from -fast, conservative on
optimisation level)
Benchmark running, hopefully as only ACTIVE task
Test1 - Gauss 2000 (101x101) inverts 3.1 sec Err= 0.000000000000030
Test2 - Crout 2000 (101x101) inverts 4.7 sec Err= 0.000000000000069
Test3 - Crout 2 (1001x1001) inverts 9.9 sec Err= 0.000000000000381
Test4 - Lapack 2 (1001x1001) inverts 16.6 sec Err= 0.000000000002275
total = 34.4 sec

(Slower, surprisingly, a second run showed 24.9 total)

g95: -O3 (I did not want to examine all the details of the
optimisation)
Benchmark running, hopefully as only ACTIVE task
Test1 - Gauss 2000 (101x101) inverts 36.8 sec Err= 0.000000000000002
Test2 - Crout 2000 (101x101) inverts 19.6 sec Err= 0.000000000000000
Test3 - Crout 2 (1001x1001) inverts 23.2 sec Err= 0.000000000000009
Test4 - Lapack 2 (1001x1001) inverts 13.1 sec Err= 0.000000000000822
total = 92.7 sec

In this case we gain about 25% speed - not bad.

So, quite apart from the question whether C is faster than Fortran
or not, there is a very wide spread in what compilers are able to do!

That simply adds another factor to consider.

Regards,

Arjen
 

Arjen Markus

I used the above program with two different compilers on one Linux
machine and got the following results:

% ifort -o test_fpu test_fpu.f90

I tried this on a Windows XP machine too, but the Intel Fortran
compiler version terminated in case 3 with a stack overflow.
I did specify a stack of 16 million bytes (via the /F option),
but that did not work. Should I use an even larger stack?

Regards,

Arjen
 

user923005

I tried this on a Windows XP machine too, but the Intel Fortran
compiler version terminated in case 3 with a stack overflow.
I did specify a stack of 16 million bytes (via the /F option),
but that did not work. Should I use an even larger stack?

If I recall correctly, the arrays are about 20 MB.
I am kind of surprised that they are formulated as automatic
variables.
In the C version, they are allocated using malloc(), so they do not
exhaust automatic storage.
Hence the C version does not need excessive stack requirements.
 
H

highegg

If I recall correctly, the arrays are about 20 MB.
I am kind of surprised that they are formulated as automatic
variables.

That's very common with current Fortran compilers. Automatic arrays
need to be allocated on subroutine entry and deallocated on subroutine
exit, thus they qualify for stack allocation, which is faster than
heap allocation. Compiler-generated temporaries usually also live on
the stack. Allocatable arrays normally go on heap. Some compilers also
offer options to specify the largest array that can be stack-
allocated.
In the C version, they are allocated using malloc(), so they do not
exhaust automatic storage.
Hence the C version does not need excessive stack requirements.
Neither would the Fortran version, if allocatable arrays were used
instead. However, it is a good general idea to increase stack for
Fortran programs - some compilers do that automatically.
 
P

Pierre Asselin

In comp.lang.fortran Arjen Markus said:
I used the above program with two different compilers on one Linux
machine and got the following results:
[ifort] total = 24.2 sec
[g95] total =124.9 sec
[ifort -O2 -ipo -static] total = 34.4 sec
(Slower, surprisingly, a second run showed 24.9 total)
[g95 -O3] total = 92.7 sec
So, quite apart from the question whether C is faster or not than
Fortran
there is a very wide spread in what compilers are able to do!
That simply adds another factor to consider.

Indeed. Good showing for Intel. I want to try ifort when
I have a chance.

For the piece that I translated I was surprised to see gcc edge
gfortran, since they use the same back-end. I suspect gfortran
had trouble optimizing the array operations; when I wrote the C
I had no choice but to expand them into loops. I want to try the
same thing on the Fortran side -- but it will take a while before
I get to it.

I don't have ifc so I won't be able to do an Intel-Intel
comparison.
 
C

CBFalconer

Pierre said:
In comp.lang.fortran Arjen Markus said:
I used the above program with two different compilers on one Linux
machine and got the following results:
[ifort] total = 24.2 sec
[g95] total =124.9 sec
[ifort -O2 -ipo -static] total = 34.4 sec
(Slower, surprisingly, a second run showed 24.9 total)
[g95 -O3] total = 92.7 sec
So, quite apart from the question whether C is faster or not than
Fortran
there is a very wide spread in what compilers are able to do!
That simply adds another factor to consider.

Indeed. Good showing for Intel. I want to try ifort when
I have a chance.

For the piece that I translated I was surprised to see gcc edge
gfortran, since they use the same back-end. I suspect gfortran
had trouble optimizing the array operations; when I wrote the C
I had no choice but to expand them into loops. I want to try the
same thing on the Fortran side --but it will take a while before
I get to it.

I don't have ifc so I won't be able to do an Intel-Intel
comparison.

I suspect the difference in array addressing and its influence on
the memory cache might explain most of the differences.
 
I

Ian Collins

Pierre said:
I used the above program with two different compilers on one Linux
machine and got the following results:

[ifort] total = 24.2 sec
[g95] total =124.9 sec
[ifort -O2 -ipo -static] total = 34.4 sec
(Slower, surprisingly, a second run showed 24.9 total)
[g95 -O3] total = 92.7 sec

So, quite apart from the question whether C is faster or not than
Fortran
there is a very wide spread in what compilers are able to do!

That simply adds another factor to consider.


Indeed. Good showing for Intel. I want to try ifort when
I have a chance.

You might like to try the latest Sun Studio compiler on Linux. On an
AMD64 Solaris box, the performance is significantly faster than f95.
 
T

Tobias Burnus

In comp.lang.fortran Arjen Markus said:
I used the above program with two different compilers on one Linux
machine and got the following results:
[ifort] total = 24.2 sec
[g95] total =124.9 sec

This comparison does not make sense: ifort uses -O2 optimization by
default, while GCC (gfortran, g95, g77) defaults to -O0.
[ifort -O2 -ipo -static] total = 34.4 sec
(Slower, surprisingly, a second run showed 24.9 total)
[g95 -O3] total = 92.7 sec

For comparison (without thinking too much about the options; on x86-64
Linux):

gfortran [4.3] -march=opteron -ffast-math -funroll-loops -ftree-vectorize -msse3 -O3 TEST_FPU.f90
total = 19.2 sec

ifort [9.1] -O3 -xW -ipo -static TEST_FPU.f90
total = 17.7 sec

sunf95 [8.3] -w4 -fast -xarch=amd64a -xipo=0 TEST_FPU.f90
total = 15.5 sec

NAG f95 [5.1/gcc 4.1.2] -O4 -ieee=full -Bstatic -march=opteron -ffast-math -funroll-loops -ftree-vectorize -msse3 TEST_FPU.f90
total = 20.2 sec

g95 [0.91-Feb23/gcc 4.0.3] -march=opteron -ffast-math -funroll-loops -ftree-vectorize -msse3 -O3 TEST_FPU.f90
total = 31.7 sec

Tobias
 
