Performance issue: matrix multiplication in C and C++

Michael Bader

Hi,

I'm currently working on a matrix multiplication code (matrix times
matrix), and have come across some interesting/confusing results
concerning the running time of the (apparently) same algorithm
when implemented in C or C++. I noticed that the C implementation
is faster by a factor of 2.5 compared to an identical(?)
C++ implementation.

The basic algorithm in C is:

/* in function main() */
idxC = 0;
for (i = 0; i < dimension; i++)
    for (k = 0; k < dimension; k++) {
        idxA = i*dimension;
        endA = idxA + dimension;
        idxB = k;
        while (idxA < endA) {
            Celem[idxC] += Aelem[idxA] * Belem[idxB];
            idxA++;
            idxB += dimension;
        }
        idxC++;
    }

where matrices A, B, and C are implemented using a 1-dimensional
array and line-by-line numbering (i.e. Fortran-style).
For example, idxA refers to the current element A(i,j) of matrix A.
After idxA++, the index refers to element A(i,j+1).
Similarly, idxA += dimension moves the index to element A(i+1,j).
Thus, the code computes the matrix product C=A*B, while trying to
avoid explicit computation of matrix indices.
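The index arithmetic above can be sanity-checked with a tiny sketch
(flatIndex is a hypothetical helper, not part of the posted code):

```cpp
#include <cassert>  // only needed for the checks below

// Hypothetical helper, not from the posted code: with the flat,
// line-by-line storage described above, element A(i,j) of a
// dim x dim matrix lives at index i*dim + j.
inline int flatIndex(int i, int j, int dim) {
    return i * dim + j;
}
```

Then idx + 1 lands on A(i,j+1) and idx + dimension on A(i+1,j), matching
the idxA++ and idxA += dimension steps in the loops.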

The respective algorithm in C++ is:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C.elem[idxC] += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Here, A, B, and C are objects of a matrix class that has a
member elem (of type double*).

For the same size of matrices (729x729), the C implementation takes
around 10 seconds on my machine (Pentium III, gcc 3.2, with
-O3 and -funroll-loops).
The C++-implementation takes approximately 25 seconds on the same
machine using g++ (same version, same optimization parameters).

Is anybody out there able to give a comment on the reasons for
this huge loss of performance?

Using a static function, and the double-arrays as parameters, i.e.

template<class MatElem>
void FortranMatrix<MatElem>::stamult(const MatElem* A,
                                     const MatElem* B,
                                     MatElem* C, int dimension) {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C[idxC] += A[idxA] * B[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Now, this is almost identical to the C algorithm, but the code still
takes around 20 seconds. I was actually prepared to see certain
performance losses when using C++ instead of C. However, I find it
confusing to lose a factor of 2 in code that doesn't even come close
to using object-oriented features that might slow it down.
I didn't try it without templates, but that can't make too
much of a difference, can it?

I'd appreciate any comment on this, and I would like to thank you
very much for taking the time to read this rather longish article :)

Michael
 
Gianni Mariani

Michael Bader wrote:
....
The respective algorithm in C++ is:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C.elem[idxC] += elem[idxA] * B.elem[idxB];

C.elem[idxC]

You can probably remove this from the loop.

The compiler probably can't optimize the fetch of the "elem" array, so
you'll need to do it yourself.

Try something like:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            MatElem v = 0;
            while (idxA < endA) {
                v += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            C.elem[idxC] = v;
            idxC++;
        }
}

-- still, you can also fetch the addresses this->elem, B.elem, and C.elem
and place them in temporaries (although the compiler can probably do that
for this->elem and B.elem, since *this and B are const, but not for C).
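A free-function sketch of that idea (multHoisted is a hypothetical name; it
assumes the member arrays have already been fetched into plain pointers, and
it also folds in the local accumulator from above):

```cpp
#include <cassert>  // only for the usage checks

// Hypothetical sketch: the member arrays are assumed to have been fetched
// once into plain pointers (a = this->elem, b = B.elem, c = C.elem), and
// the inner product is accumulated in a local instead of through c[].
template<class MatElem>
void multHoisted(const MatElem* a, const MatElem* b, MatElem* c,
                 int dimension) {
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i * dimension;      // start of row i of A
            int endA = idxA + dimension;
            int idxB = k;                  // top of column k of B
            MatElem v = 0;                 // running sum stays in a register
            while (idxA < endA) {
                v += a[idxA] * b[idxB];
                idxA++;
                idxB += dimension;
            }
            c[idxC++] = v;                 // single store per element
        }
}
```

Because a, b, and c are plain parameters, the compiler no longer has to
re-fetch this->elem or B.elem through the objects on every access.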


Which version of gcc are you using?

Try posting compilable code. The last example really does look like it
should be at least as fast as the C code. Without examining the actual
generated assembly code, it's hard to know why it's doing what it's doing.

G
 
Andre Kostur

Here, A, B, and C, are objects of some matrix class that have a
member elem (of type double*).

Show, don't tell. Where is the actual class declaration of
FortranMatrix<MatElem>? What _is_ MatElem? Show its declaration too.
 
Claudio Puviani

Michael Bader said:
For the same size of matrices (729x729), the C implementation takes
around 10 seconds on my machine (Pentium III, gcc, version 3.2,
optimization level -O3 and -funroll-loops).
The C++-implementation takes approximately 25 seconds on the same
machine using g++ (same version, same optimization parameters).

Is anybody out there able to give a comment on the reasons for
this huge loss of performance?


The most likely reason is that g++ doesn't generate intermediate code that's as
good as gcc's (in fact, g++ 3.x generates bigger and slower code than g++ 2.95).
This is fairly common. The worst difference I've seen was with Watcom's C and
C++ compilers in the early '90s, where some code ran 10 times or more slower
when compiled with their C++ compiler than with their C compiler. This isn't a
difference between the C and C++ languages, but between the two compiler
implementations. C++ compilers just don't have as long a history as C compilers,
and in the time that C++ compiler writers have had to spend adding and
supporting the myriad features that C doesn't have, C compiler writers have
been able to invest in better code generation. C++ will eventually catch up
(today's C++ compilers compare favorably with the C compilers of 10 years ago).
In the meantime, you can certainly write performance-critical code in C and
link it into your C++ application.

Claudio Puviani
 
NKOBAYE027

Gianni Mariani said:
Try something like:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            MatElem v = 0;
            while (idxA < endA) {
                v += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            C.elem[idxC] = v;
            idxC++;
        }
}
I would also move

    int idxA = i*dimension;
    int endA = idxA + dimension;

to the outer loop; there's no need to redeclare and initialise them each
time through the loop over k. Keep what depends on the loop variable within
the loop, and move everything else out.
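A free-function sketch of that hoisting (multRowHoisted is a hypothetical
name), with one caveat: idxA itself is consumed by the inner while loop, so
only the i-dependent bounds can actually leave the k-loop; idxA still has to
be reset for every k:

```cpp
#include <cassert>  // only for the usage checks

// Hypothetical sketch of the hoisting suggested above. The i-dependent
// row bounds are computed once per row; idxA must still be reset for each
// column k because the inner loop increments it.
template<class MatElem>
void multRowHoisted(const MatElem* A, const MatElem* B, MatElem* C,
                    int dimension) {
    int idxC = 0;
    for (int i = 0; i < dimension; i++) {
        const int rowStart = i * dimension;       // invariant over k
        const int rowEnd   = rowStart + dimension;
        for (int k = 0; k < dimension; k++) {
            int idxA = rowStart;                  // reset for each column k
            int idxB = k;
            MatElem v = 0;
            while (idxA < rowEnd) {
                v += A[idxA] * B[idxB];
                idxA++;
                idxB += dimension;
            }
            C[idxC++] = v;
        }
    }
}
```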
 
Jason Leasure

Michael said:
Hi,

I'm currently working on a matrix multiplication code (matrix times
matrix), and have come along some interesting/confusing results
concerning the running time of the (apparently) same algorithm,
when implemented in C or C++. I noticed that the implementation
in C is faster by a factor of 2.5 compared to a identical(?)
C++-implementation.

The basic algorithm in C is:

Somewhere up here you declare idxA, endA, idxB, etc. ONCE:
/* in function main() */
idxC = 0;
for (i = 0; i < dimension; i++)
    for (k = 0; k < dimension; k++) {
        idxA = i*dimension;
        endA = idxA + dimension;
        idxB = k;
        while (idxA < endA) {
            Celem[idxC] += Aelem[idxA] * Belem[idxB];
            idxA++;
            idxB += dimension;
        }
        idxC++;
    }

The respective algorithm in C++ is:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
Here you declare each of these variables (and assign to them) dimension-
squared times. I'd guess doing this amounts to moving the stack
pointer around a lot more than you have to.

Move your declarations outside the loop, but keep the assignments
where you need them.


guessingly yours,
Jason
 
Michael Bader

Andre said:
Show, don't tell. Where is the actual class declaration of
FortranMatrix<MatElem>? What _is_ MatElem? Show its declaration too.

No problem, I just felt that the posting was already lengthy enough
without it ...

template<class MatElem> class FortranMatrix {

public:

    // Constructors
    FortranMatrix(int dim);
    FortranMatrix(const FortranMatrix<MatElem>& C);

    // element access
    // --> direct access by memory index
    inline MatElem& operator[] (int index) const;
    inline MatElem& operator[] (int index);

    // --> access via line index and column index
    inline MatElem& operator() (int line, int col) const;
    inline MatElem& operator() (int line, int col);

    inline void mult(const FortranMatrix<MatElem>& B,
                     FortranMatrix<MatElem>& C) const;

    /* ... additional methods skipped ... */

public:

    // Members
    MatElem* elem;     // matrix elements
    int dimension;     // number of lines (also number of columns)

};

MatElem is instantiated as double.
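For what it's worth, the unposted accessors presumably map onto the flat
array like this (FlatMatrix is a hypothetical stand-in using std::vector,
not the actual class from the thread):

```cpp
#include <cassert>
#include <vector>

// Hypothetical mini version of the thread's FortranMatrix -- a guess at
// the unposted accessor implementations, not the original code. It shows
// how operator() presumably maps a (line, col) pair onto the flat,
// line-by-line array.
template<class MatElem>
class FlatMatrix {
public:
    explicit FlatMatrix(int dim) : dimension(dim), elem(dim * dim, 0) {}

    // direct access by memory index
    MatElem& operator[] (int index) { return elem[index]; }

    // access via line index and column index
    MatElem& operator() (int line, int col) {
        return elem[line * dimension + col];
    }

    int dimension;                 // number of lines (= number of columns)
    std::vector<MatElem> elem;     // matrix elements, line by line
};
```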

Regards

Michael
 
Michael Bader

Jason said:
here you declare each of these variables (and assign to them) dimension
squared times.. I'd guess doing this amounts to moving the stack
pointer around a lot more than you have to.

I gave that a try, but the performance gain is almost negligible;
the optimizer can probably figure that out on its own.

Thanks anyway

Michael
 
Michael Bader

Claudio said:
The most likely reason is that g++ doesn't generate intermediate code
that's as good as gcc (in fact, g++ 3.x generates bigger and slower code
than g++ 2.95).

Hi,

thanks for the suggestion, I'll definitely try to do the same
comparison using an older version of gcc/g++.
I also wanted to use an Intel compiler, but I need to ask
someone who owns a licence for that ...

Thanks

Michael
 
Michael Bader

Gianni said:
Try posting compilable code. The last example really does look like it
should be at least as fast as the C code. Without examining the actual
generated assembly code, it's hard to know why it's doing what it's doing.

I've put the code for download on
http://www5.in.tum.de/~bader/mmult/
(directory, listing should be allowed).

You need to set the flag __TEMPLATES_IN_HEADERS for compiling
(for g++, use option -D__TEMPLATES_IN_HEADERS).

If you can use a different compiler/architecture, the results could
also be interesting.

Thanks in advance

Michael
 
Michael Bader

Gianni said:
C.elem[idxC]

You can probably remove this from the loop.

The compiler probably can't optimize the fetch of the "elem" array, so
you'll need to do it yourself.

Well, basically, that was the reason why I tried the "static" version
of the algorithm (the third code example in my previous posting).
There, A, B, and C are no longer objects, but plain double arrays.
It actually leads to a performance gain (20 sec instead of 25 sec),
but there's still a factor of 2 compared to the C program.

Regards

Michael
 
Michael Bader

Gianni said:
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            MatElem v = 0;
            while (idxA < endA) {
                v += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            C.elem[idxC] = v;
            idxC++;
        }
}

Sorry, I nearly missed your suggestion in that code.
You actually got the solution there: the modification reduces
the running time to approx. 10 sec in C++.

So using

    MatElem v = 0;
    while (idxA < endA) {
        v += elem[idxA] * B.elem[idxB];
        idxA++;
        idxB += dimension;
    }
    C.elem[idxC] += v;

instead of

    while (idxA < endA) {
        C.elem[idxC] += elem[idxA] * B.elem[idxB];
        idxA++;
        idxB += dimension;
    }

reduces the execution time by a factor of 2 (when using g++).
However, when using gcc (i.e. C), the same modification doesn't
make any difference, so it clearly seems to be an optimization
problem.
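One plausible explanation for why the accumulator matters in C++ but not in
the C build (an educated guess, not verified against g++ 3.2's assembly):
when the sum is accumulated through C.elem[idxC], the compiler must assume
the store could alias elem[] or B.elem[], and therefore re-loads and
re-stores through memory on every iteration; a local keeps the running sum
in a register. Distilled into the inner loop (dotStride is a hypothetical
helper, not from the thread's code):

```cpp
#include <cassert>  // only for the usage checks

// Hypothetical distillation of the thread's inner loop: the dot product
// of a contiguous row of A with a strided column of B, accumulated in a
// local so no memory write (and hence no possible aliasing) occurs per
// iteration.
template<class MatElem>
MatElem dotStride(const MatElem* a, const MatElem* b, int n, int stride) {
    MatElem v = 0;
    for (int j = 0; j < n; j++)
        v += a[j] * b[j * stride];
    return v;
}
```

With C99's restrict (or g++'s __restrict__) on the destination pointer, the
original in-place version could in principle be optimized the same way,
though that is compiler-dependent.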

Thanks for your help

Michael
 
