Michael Bader
Hi,
I'm currently working on matrix multiplication code (matrix times
matrix), and have come across some interesting/confusing results
concerning the running time of the (apparently) same algorithm
when implemented in C or C++. I noticed that the C implementation
is faster by a factor of 2.5 compared to an identical(?)
C++ implementation.
The basic algorithm in C is:
/* in function main() */
idxC = 0;
for (i = 0; i < dimension; i++)
    for (k = 0; k < dimension; k++) {
        idxA = i*dimension;
        endA = idxA + dimension;
        idxB = k;
        while (idxA < endA) {
            Celem[idxC] += Aelem[idxA] * Belem[idxB];
            idxA++;
            idxB += dimension;
        }
        idxC++;
    }
where matrices A, B, and C are implemented as 1-dimensional
arrays with line-by-line numbering (i.e. Fortran-style).
For example, idxA indexes the current element A(i,j) of matrix A.
After idxA++, it refers to element A(i,j+1).
Similarly, idxA += dimension moves it to element A(i+1,j).
Thus, the code computes the matrix product C = A*B while avoiding
explicit computation of matrix indices.
The corresponding algorithm in C++ is:
template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C.elem[idxC] += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}
Here, A, B, and C are objects of a matrix class that has a
member elem (of type double*).
For matrices of the same size (729x729), the C implementation takes
around 10 seconds on my machine (Pentium III, gcc version 3.2,
optimization level -O3 and -funroll-loops).
The C++ implementation takes approximately 25 seconds on the same
machine using g++ (same version, same optimization flags).
Can anybody comment on the reasons for this huge loss of
performance?
Using a static function with the double arrays as parameters, i.e.
template<class MatElem>
void FortranMatrix<MatElem>::stamult(const MatElem* A,
                                     const MatElem* B,
                                     MatElem* C, int dimension) {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C[idxC] += A[idxA] * B[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}
Now, this is almost identical to the C algorithm, but the code still
takes around 20 seconds. I was prepared to see some performance loss
when using C++ instead of C. However, I find it confusing to lose a
factor of 2 in code that doesn't come anywhere close to using the
object-oriented features that might slow it down.
I didn't try it without templates, but that can't make much of a
difference, can it?
I'd appreciate any comment on this, and thank you very much for
taking the time to read this rather long article.
Michael