# Performance issue: matrix multiplication in C and C++

Discussion in 'C++' started by Michael Bader, Mar 2, 2004.

Hi,

I'm currently working on a matrix multiplication code (matrix times
matrix), and have come across some interesting/confusing results
concerning the running time of the (apparently) same algorithm
when implemented in C and in C++. I noticed that the C implementation
is faster by a factor of 2.5 than an identical(?) C++ implementation.

The basic algorithm in C is:

/* in function main() */
idxC = 0;
for (i = 0; i < dimension; i++)
    for (k = 0; k < dimension; k++) {
        idxA = i*dimension;
        endA = idxA + dimension;
        idxB = k;
        while (idxA < endA) {
            Celem[idxC] += Aelem[idxA] * Belem[idxB];
            idxA++;
            idxB += dimension;
        }
        idxC++;
    }

where matrices A, B, and C are implemented as 1-dimensional
arrays with line-by-line numbering (i.e. Fortran-style).
For example, idxA points to the current element A(i,j) of matrix A.
After idxA++, the index will point to element A(i,j+1).
Similarly, idxA += dimension will move the index to element A(i+1,j).
Thus, the code computes the matrix product C=A*B while trying to
avoid explicit computation of matrix indices.

The respective algorithm in C++ is:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C.elem[idxC] += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Here, A, B, and C are objects of some matrix class that has a
member elem (of type double*).

For the same size of matrices (729x729), the C implementation takes
around 10 seconds on my machine (Pentium III, gcc, version 3.2,
optimization level -O3 and -funroll-loops).
The C++-implementation takes approximately 25 seconds on the same
machine using g++ (same version, same optimization parameters).

Is anybody out there able to give a comment on the reasons for
this huge loss of performance?

Using a static function, and the double-arrays as parameters, i.e.

template<class MatElem>
void FortranMatrix<MatElem>::stamult(const MatElem* A,
                                     const MatElem* B,
                                     MatElem* C, int dimension) {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C[idxC] += A[idxA] * B[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Now, this is almost identical to the C algorithm, but the code still
takes around 20 seconds. I was actually prepared to see certain
performance losses when using C++ instead of C. However, I find it
confusing to lose a factor of 2 in code that doesn't come close to
using object-oriented features that might slow it down.
I didn't try it without templates, but that can't make too
much of a difference, can it?

I'd appreciate any comment on this, and I would like to thank you
very much for taking the time to read this rather longish article.

Michael

2. ### Gianni Mariani (Guest)

....
>
> The respective algorithm in C++ is:
>
> template<class MatElem>
> void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
> FortranMatrix<MatElem>& C) const {
> /* ... */
> int idxC = 0;
> for(int i=0; i< dimension; i++)
> for(int k=0; k< dimension; k++) {
> int idxA = i*dimension;
> int endA = idxA + dimension;
> int idxB = k;
> while ( idxA < endA ) {
> C.elem[idxC] += elem[idxA] * B.elem[idxB];

C.elem[idxC]

You can probably remove this from the loop.

The compiler probably can't optimize the fetch of the "elem" array, so
you'll need to do it yourself.

Try something like:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            MatElem v = 0;
            while (idxA < endA) {
                v += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            C.elem[idxC] = v;
            idxC++;
        }
}

Still, you can also fetch the addresses this->elem, B.elem, and C.elem
and place them in temporaries (although the compiler can probably do that
for this->elem and B.elem, since *this and B are const, but not for C).

> idxA++;
> idxB += dimension;
> };
> idxC++;
> };
> }
>
> [snip]
>

Which version of gcc are you using?

>
> I'd appreciate any comment on this, and I would like to thank you
> very much for taking the time to read this rather longish article

Try posting compilable code. The last example really does look like it
should be at least as fast as the C code. Without examining the actual
generated assembly code, it's hard to know why it's doing what it's doing.

G

Gianni Mariani, Mar 2, 2004

3. ### Andre Kostur (Guest)

> [snip]
>
> Here, A, B, and C, are objects of some matrix class that have a
> member elem (of type double*).

Show, don't tell. Where is the actual class declaration of FortranMatrix
<MatElem>? What _is_ MatElem? Show its declaration too.

Andre Kostur, Mar 2, 2004
4. ### Claudio Puviani (Guest)

> [snip]
>
> For the same size of matrices (729x729), the C implementation takes
> around 10 seconds on my machine (Pentium III, gcc, version 3.2,
> optimization level -O3 and -funroll-loops).
> The C++-implementation takes approximately 25 seconds on the same
> machine using g++ (same version, same optimization parameters).
>
> Is anybody out there able to give a comment on the reasons for
> this huge loss of performance?
>
> [snip]

The most likely reason is that g++ doesn't generate intermediate code that's as
good as gcc's (in fact, g++ 3.x generates bigger and slower code than g++ 2.95).
This is fairly common. The worst difference I've seen was with Watcom's C and
C++ compilers in the early 90's, where some code ran 10 times or more slower when
compiled with their C++ compiler than with their C compiler. This isn't a
difference between the C and C++ languages, but between the two compiler
implementations. C++ compilers just don't have as long a history as C compilers,
and in the time that C++ compiler writers have had to spend adding and
supporting the myriad features that C doesn't have, the C compiler writers have
been able to invest in better code generation. C++ will eventually catch up
(today's C++ compilers compare favorably with C compilers 10 years ago). In the
meantime, you can certainly write performance-critical code in C and link it
into your C++ program.

Claudio Puviani

Claudio Puviani, Mar 2, 2004
5. ### NKOBAYE027 (Guest)

"Gianni Mariani" <> wrote in message
news:c22kkr\$...
> [snip]

I would also move

> int idxA = i*dimension;
> int endA = idxA + dimension;

to the outer loop: there's no need to redeclare and initialise them each
time through the loop over k. Keep what depends on the loop variable within
the loop and hoist everything else out.

NKOBAYE027, Mar 2, 2004
6. ### Jason Leasure (Guest)

> [snip]

somewhere up here you declare idxA, endA, idxB, etc. ONCE

> /* in function main() */
> idxC = 0;
> for(i=0; i< dimension; i++)
> for(k=0; k< dimension; k++) {
> idxA = i*dimension;
> endA = idxA + dimension;
> idxB = k;
> while ( idxA < endA ) {
> Celem[idxC] += Aelem[idxA] * Belem[idxB];
> idxA++;
> idxB += dimension;
> };
> idxC++;
> };
>
> [snip]
>
> The respective algorithm in C++ is:
>
> template<class MatElem>
> void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
> FortranMatrix<MatElem>& C) const {
> /* ... */
> int idxC = 0;
> for(int i=0; i< dimension; i++)
> for(int k=0; k< dimension; k++) {

> int idxA = i*dimension;
> int endA = idxA + dimension;
> int idxB = k;

here you declare each of these variables (and assign to them) dimension
squared times. I'd guess doing this amounts to moving the stack
pointer around a lot more than you have to.

Move your declarations outside the loop, but keep the assignments
where you need them.

> [snip]

guessingly yours,
Jason

Jason Leasure, Mar 2, 2004

Andre Kostur wrote:
> Show, don't tell. Where is the actual class declaration of FortranMatrix
> <MatElem>? What _is_ MatElem? Show its declaration too.

No problem, I just felt that the posting was already lengthy enough
without it ...

template<class MatElem> class FortranMatrix {

public:

    // Constructors
    FortranMatrix(int dim);
    FortranMatrix(const FortranMatrix<MatElem>& C);

    // element access
    // --> direct access by memory index
    inline const MatElem& operator[] (int index) const;
    inline MatElem& operator[] (int index);

    // --> access via line index and column index
    inline const MatElem& operator() (int line, int col) const;
    inline MatElem& operator() (int line, int col);

    inline void mult(const FortranMatrix<MatElem>& B,
                     FortranMatrix<MatElem>& C) const;

    /* ... additional methods skipped ... */

public:

    // Members
    MatElem* elem;   // matrix elements
    int dimension;   // number of lines (also number of columns)

};

MatElem is instantiated with double.

Regards

Michael

Jason Leasure wrote:
>> int idxC = 0;
>> for(int i=0; i< dimension; i++)
>> for(int k=0; k< dimension; k++) {

>
>> int idxA = i*dimension;
>> int endA = idxA + dimension;
>> int idxB = k;

> here you declare each of these variables (and assign to them) dimension
> squared times. I'd guess doing this amounts to moving the stack
> pointer around a lot more than you have to.

I gave that a try, but the performance gain is almost negligible.
Probably the optimizer can figure that out on its own.

Thanks anyway

Michael

Claudio Puviani wrote:
> The most likely reason is that g++ doesn't generate intermediate code
> that's as good as gcc (in fact, g++ 3.x generates bigger and slower code
> than g++ 2.95).

Hi,

thanks for the suggestion, I'll definitely try to do the same
comparison using an older version of gcc/g++.
I also wanted to use an Intel compiler, but I need to ask
someone who owns a licence for that ...

Thanks

Michael

Gianni Mariani wrote:
> Try posting compilable code. The last example really does look like it
> should be at least as fast as the C code. Without examining the actual
> generated assembly code, it's hard to know why it's doing what it's doing.

(directory listing should be allowed).

You need to set the flag __TEMPLATES_IN_HEADERS for compiling.

If you can use a different compiler/architecture, the results could
also be interesting.

Michael

Gianni Mariani wrote:
>
> C.elem[idxC]
>
> You can probably remove this from the loop.
>
> The compiler probably can't optimize the fetch of the "elem" array, so
> you'll need to do it yourself.

Well, basically, that was the reason why I tried the "static" version
of the algorithm (the third code example in my previous posting).
There, A, B, and C are no longer objects but plain double arrays,
yet there's still a factor of 2 compared to the C program.

Regards

Michael

Gianni Mariani wrote:
> /* ... */
> int idxC = 0;
> for(int i=0; i< dimension; i++)
> for(int k=0; k< dimension; k++) {
> int idxA = i*dimension;
> int endA = idxA + dimension;
> int idxB = k;
> MatElem v = 0;
> while ( idxA < endA ) {
> v += elem[idxA] * B.elem[idxB];
> idxA++;
> idxB += dimension;
> };
> C.elem[idxC] = v;
> idxC++;
> };
> }

Sorry, I nearly missed your suggestion in that code.
You actually got the solution: the modification reduces
the running time to approx. 10 sec in C++.

So using

    MatElem v = 0;
    while (idxA < endA) {
        v += elem[idxA] * B.elem[idxB];
        idxA++;
        idxB += dimension;
    }
    C.elem[idxC] = v;

instead of

    while (idxA < endA) {
        C.elem[idxC] += elem[idxA] * B.elem[idxB];
        idxA++;
        idxB += dimension;
    }

reduces the execution time by a factor of 2 (when using g++).
However, when using gcc (i.e. C), the same modification doesn't
make any difference, so it clearly seems to be an optimization
problem.