Performance issue: matrix multiplication in C and C++

Michael Bader

Hi,

I'm currently working on a matrix multiplication code (matrix times
matrix), and have come across some interesting/confusing results
concerning the running time of the (apparently) same algorithm
when implemented in C or C++. I noticed that the C implementation
is faster by a factor of 2.5 compared to an identical(?)
C++ implementation.

The basic algorithm in C is:

/* in function main() */
idxC = 0;
for (i = 0; i < dimension; i++)
    for (k = 0; k < dimension; k++) {
        idxA = i*dimension;
        endA = idxA + dimension;
        idxB = k;
        while (idxA < endA) {
            Celem[idxC] += Aelem[idxA] * Belem[idxB];
            idxA++;
            idxB += dimension;
        }
        idxC++;
    }

where matrices A, B, and C are implemented using a 1-dimensional
array and line-by-line numbering (i.e. Fortran-style).
For example, idxA refers to the current element A(i,j) of matrix A.
After idxA++, the index refers to element A(i,j+1).
Similarly, idxA += dimension moves the index to element A(i+1,j).
Thus, the code computes the matrix product C=A*B, while trying to
avoid explicit computation of matrix indices.
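The index arithmetic above can be sanity-checked with a tiny sketch
(flatIndex is a hypothetical helper, not part of the posted code):

```cpp
#include <cassert>  // only needed for the checks below

// Hypothetical helper, not from the posted code: with the flat,
// line-by-line storage described above, element A(i,j) of a
// dim x dim matrix lives at index i*dim + j.
inline int flatIndex(int i, int j, int dim) {
    return i * dim + j;
}
```

Then idx + 1 lands on A(i,j+1) and idx + dimension on A(i+1,j), matching
the idxA++ and idxA += dimension steps in the loops.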

The respective algorithm in C++ is:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C.elem[idxC] += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Here, A, B, and C are objects of a matrix class that has a
member elem (of type double*).

For the same size of matrices (729x729), the C implementation takes
around 10 seconds on my machine (Pentium III, gcc 3.2, with
-O3 and -funroll-loops).
The C++-implementation takes approximately 25 seconds on the same
machine using g++ (same version, same optimization parameters).

Is anybody out there able to give a comment on the reasons for
this huge loss of performance?

Using a static function, and the double-arrays as parameters, i.e.

template<class MatElem>
void FortranMatrix<MatElem>::stamult(const MatElem* A,
                                     const MatElem* B,
                                     MatElem* C, int dimension) {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C[idxC] += A[idxA] * B[idxB];
                idxA++;
                idxB += dimension;
            }
            idxC++;
        }
}

Now, this is almost identical to the C algorithm, but the code still
takes around 20 seconds. I was actually prepared to see certain
performance losses when using C++ instead of C. However, I find it
confusing to lose a factor of 2 in code that doesn't even come close
to using object-oriented features that might slow it down.
I didn't try it without templates, but that can't make too
much of a difference, can it?

I'd appreciate any comment on this, and I would like to thank you
very much for taking the time to read this rather longish article :)

Michael
 
Gianni Mariani

Michael Bader wrote:
....
The respective algorithm in C++ is:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            while (idxA < endA) {
                C.elem[idxC] += elem[idxA] * B.elem[idxB];

C.elem[idxC]

You can probably remove this from the loop.

The compiler probably can't optimize the fetch of the "elem" array, so
you'll need to do it yourself.

Try something like:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            MatElem v = 0;
            while (idxA < endA) {
                v += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            C.elem[idxC] = v;
            idxC++;
        }
}

-- still, you can also fetch the addresses this->elem, B.elem, and C.elem
and place them in temporaries (although the compiler can probably do that
for this->elem and B.elem, since *this and B are const, but not for C).
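A free-function sketch of that idea (multHoisted is a hypothetical name; it
assumes the member arrays have already been fetched into plain pointers, and
it also folds in the local accumulator from above):

```cpp
#include <cassert>  // only for the usage checks

// Hypothetical sketch: the member arrays are assumed to have been fetched
// once into plain pointers (a = this->elem, b = B.elem, c = C.elem), and
// the inner product is accumulated in a local instead of through c[].
template<class MatElem>
void multHoisted(const MatElem* a, const MatElem* b, MatElem* c,
                 int dimension) {
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i * dimension;      // start of row i of A
            int endA = idxA + dimension;
            int idxB = k;                  // top of column k of B
            MatElem v = 0;                 // running sum stays in a register
            while (idxA < endA) {
                v += a[idxA] * b[idxB];
                idxA++;
                idxB += dimension;
            }
            c[idxC++] = v;                 // single store per element
        }
}
```

Because a, b, and c are plain parameters, the compiler no longer has to
re-fetch this->elem or B.elem through the objects on every access.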


Which version of gcc are you using?

Try posting compilable code. The last example really does look like it
should be at least as fast as the C code. Without examining the actual
generated assembly code, it's hard to know why it's doing what it's doing.

G
 
Andre Kostur

Here, A, B, and C, are objects of some matrix class that have a
member elem (of type double*).

Show, don't tell. Where is the actual class declaration of
FortranMatrix<MatElem>? What _is_ MatElem? Show its declaration too.
 
Claudio Puviani

Michael Bader said:
For the same size of matrices (729x729), the C implementation takes
around 10 seconds on my machine (Pentium III, gcc, version 3.2,
optimization level -O3 and -funroll-loops).
The C++-implementation takes approximately 25 seconds on the same
machine using g++ (same version, same optimization parameters).

Is anybody out there able to give a comment on the reasons for
this huge loss of performance?


The most likely reason is that g++ doesn't generate intermediate code that's as
good as gcc's (in fact, g++ 3.x generates bigger and slower code than g++ 2.95).
This is fairly common. The worst difference I've seen was with Watcom's C and
C++ compilers in the early '90s, where some code ran 10 times or more slower
when compiled with their C++ compiler than with their C compiler. This isn't a
difference between the C and C++ languages, but between the two compiler
implementations. C++ compilers just don't have as long a history as C compilers,
and in the time that C++ compiler writers have had to spend adding and
supporting the myriad features that C doesn't have, C compiler writers have
been able to invest in better code generation. C++ will eventually catch up
(today's C++ compilers compare favorably with the C compilers of 10 years ago).
In the meantime, you can certainly write performance-critical code in C and
link it into your C++ application.

Claudio Puviani
 
NKOBAYE027

Gianni Mariani said:
Try something like:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            MatElem v = 0;
            while (idxA < endA) {
                v += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            C.elem[idxC] = v;
            idxC++;
        }
}
I would also move

    int idxA = i*dimension;
    int endA = idxA + dimension;

to the outer loop; there's no need to redeclare and initialise them each
time through the loop over k. Keep what depends on the loop variable within
the loop, and move everything else out.
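A free-function sketch of that hoisting (multRowHoisted is a hypothetical
name), with one caveat: idxA itself is consumed by the inner while loop, so
only the i-dependent bounds can actually leave the k-loop; idxA still has to
be reset for every k:

```cpp
#include <cassert>  // only for the usage checks

// Hypothetical sketch of the hoisting suggested above. The i-dependent
// row bounds are computed once per row; idxA must still be reset for each
// column k because the inner loop increments it.
template<class MatElem>
void multRowHoisted(const MatElem* A, const MatElem* B, MatElem* C,
                    int dimension) {
    int idxC = 0;
    for (int i = 0; i < dimension; i++) {
        const int rowStart = i * dimension;       // invariant over k
        const int rowEnd   = rowStart + dimension;
        for (int k = 0; k < dimension; k++) {
            int idxA = rowStart;                  // reset for each column k
            int idxB = k;
            MatElem v = 0;
            while (idxA < rowEnd) {
                v += A[idxA] * B[idxB];
                idxA++;
                idxB += dimension;
            }
            C[idxC++] = v;
        }
    }
}
```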
 
Jason Leasure

Michael said:
Hi,

I'm currently working on a matrix multiplication code (matrix times
matrix), and have come along some interesting/confusing results
concerning the running time of the (apparently) same algorithm,
when implemented in C or C++. I noticed that the implementation
in C is faster by a factor of 2.5 compared to a identical(?)
C++-implementation.

The basic algorithm in C is:

Somewhere up here you declare idxA, endA, idxB, etc. ONCE:
/* in function main() */
idxC = 0;
for (i = 0; i < dimension; i++)
    for (k = 0; k < dimension; k++) {
        idxA = i*dimension;
        endA = idxA + dimension;
        idxB = k;
        while (idxA < endA) {
            Celem[idxC] += Aelem[idxA] * Belem[idxB];
            idxA++;
            idxB += dimension;
        }
        idxC++;
    }

The respective algorithm in C++ is:

template<class MatElem>
void FortranMatrix<MatElem>::mult(const FortranMatrix<MatElem>& B,
                                  FortranMatrix<MatElem>& C) const {
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
Here you declare each of these variables (and assign to them) dimension-
squared times. I'd guess doing this amounts to moving the stack
pointer around a lot more than you have to.

Move your declarations outside the loop, but keep the assignments
where you need them.


guessingly yours,
Jason
 
Michael Bader

Andre said:
Show, don't tell. Where is the actual class declaration of
FortranMatrix<MatElem>? What _is_ MatElem? Show its declaration too.

No problem, I just felt that the posting was already lengthy enough
without it ...

template<class MatElem> class FortranMatrix {

public:

    // Constructors
    FortranMatrix(int dim);
    FortranMatrix(const FortranMatrix<MatElem>& C);

    // element access
    // --> direct access by memory index
    inline MatElem& operator[] (int index) const;
    inline MatElem& operator[] (int index);

    // --> access via line index and column index
    inline MatElem& operator() (int line, int col) const;
    inline MatElem& operator() (int line, int col);

    inline void mult(const FortranMatrix<MatElem>& B,
                     FortranMatrix<MatElem>& C) const;

    /* ... additional methods skipped ... */

public:

    // Members
    MatElem* elem;     // matrix elements
    int dimension;     // number of lines (also number of columns)

};

MatElem is instantiated as double.
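For what it's worth, the unposted accessors presumably map onto the flat
array like this (FlatMatrix is a hypothetical stand-in using std::vector,
not the actual class from the thread):

```cpp
#include <cassert>
#include <vector>

// Hypothetical mini version of the thread's FortranMatrix -- a guess at
// the unposted accessor implementations, not the original code. It shows
// how operator() presumably maps a (line, col) pair onto the flat,
// line-by-line array.
template<class MatElem>
class FlatMatrix {
public:
    explicit FlatMatrix(int dim) : dimension(dim), elem(dim * dim, 0) {}

    // direct access by memory index
    MatElem& operator[] (int index) { return elem[index]; }

    // access via line index and column index
    MatElem& operator() (int line, int col) {
        return elem[line * dimension + col];
    }

    int dimension;                 // number of lines (= number of columns)
    std::vector<MatElem> elem;     // matrix elements, line by line
};
```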

Regards

Michael
 
Michael Bader

Jason said:
here you declare each of these variables (and assign to them) dimension
squared times.. I'd guess doing this amounts to moving the stack
pointer around a lot more than you have to.

I gave that a try, but the performance gain is almost negligible;
the optimizer can probably figure that out on its own.

Thanks anyway

Michael
 
Michael Bader

Claudio said:
The most likely reason is that g++ doesn't generate intermediate code
that's as good as gcc (in fact, g++ 3.x generates bigger and slower code
than g++ 2.95).

Hi,

thanks for the suggestion, I'll definitely try to do the same
comparison using an older version of gcc/g++.
I also wanted to use an Intel compiler, but I need to ask
someone who owns a licence for that ...

Thanks

Michael
 
Michael Bader

Gianni said:
Try posting compilable code. The last example really does look like it
should be at least as fast as the C code. Without examining the actual
generated assembly code, it's hard to know why it's doing what it's doing.

I've put the code for download on
http://www5.in.tum.de/~bader/mmult/
(directory, listing should be allowed).

You need to set the flag __TEMPLATES_IN_HEADERS for compiling
(for g++, use option -D__TEMPLATES_IN_HEADERS).

If you can use a different compiler/architecture, the results could
also be interesting.

Thanks in advance

Michael
 
Michael Bader

Gianni said:
C.elem[idxC]

You can probably remove this from the loop.

The compiler probably can't optimize the fetch of the "elem" array, so
you'll need to do it yourself.

Well, basically, that was the reason why I tried the "static" version
of the algorithm (the third code example in my previous posting).
There, A, B, and C are no longer objects, but plain double arrays.
It actually leads to a performance gain (20 sec instead of 25 sec),
but there's still a factor of 2 compared to the C program.

Regards

Michael
 
Michael Bader

Gianni said:
    /* ... */
    int idxC = 0;
    for (int i = 0; i < dimension; i++)
        for (int k = 0; k < dimension; k++) {
            int idxA = i*dimension;
            int endA = idxA + dimension;
            int idxB = k;
            MatElem v = 0;
            while (idxA < endA) {
                v += elem[idxA] * B.elem[idxB];
                idxA++;
                idxB += dimension;
            }
            C.elem[idxC] = v;
            idxC++;
        }
}

Sorry, I nearly missed your suggestion in that code.
You actually got the solution there: the modification reduces
the running time to approx. 10 sec in C++.

So using

    MatElem v = 0;
    while (idxA < endA) {
        v += elem[idxA] * B.elem[idxB];
        idxA++;
        idxB += dimension;
    }
    C.elem[idxC] += v;

instead of

    while (idxA < endA) {
        C.elem[idxC] += elem[idxA] * B.elem[idxB];
        idxA++;
        idxB += dimension;
    }

reduces the execution time by a factor of 2 (when using g++).
However, when using gcc (i.e. C), the same modification doesn't
make any difference, so it clearly seems to be an optimization
problem.
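One plausible explanation for why the accumulator matters in C++ but not in
the C build (an educated guess, not verified against g++ 3.2's assembly):
when the sum is accumulated through C.elem[idxC], the compiler must assume
the store could alias elem[] or B.elem[], and therefore re-loads and
re-stores through memory on every iteration; a local keeps the running sum
in a register. Distilled into the inner loop (dotStride is a hypothetical
helper, not from the thread's code):

```cpp
#include <cassert>  // only for the usage checks

// Hypothetical distillation of the thread's inner loop: the dot product
// of a contiguous row of A with a strided column of B, accumulated in a
// local so no memory write (and hence no possible aliasing) occurs per
// iteration.
template<class MatElem>
MatElem dotStride(const MatElem* a, const MatElem* b, int n, int stride) {
    MatElem v = 0;
    for (int j = 0; j < n; j++)
        v += a[j] * b[j * stride];
    return v;
}
```

With C99's restrict (or g++'s __restrict__) on the destination pointer, the
original in-place version could in principle be optimized the same way,
though that is compiler-dependent.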

Thanks for your help

Michael
 
